Building a specialized comparable corpus: PARSA

Alayiaboozar, Elham; Hojjatpanah, Aliasghar

doi:10.22051/jlr.2023.44928.2348

Document Type: Research

Authors

¹ Assistant Professor, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran

² Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran

Abstract

Corpora are categorized as monolingual, bilingual, or multilingual according to the languages of their constituent texts. A comparable corpus is a bilingual or multilingual corpus that contains similar texts from the same subject areas; in other words, it is a collection of documents in two or more languages that cover similar topics. Comparable corpora can be composed of general texts, which support research in discourse analysis, pragmatics, genre analysis, and sociolinguistics. Examples include collections of encyclopedia entries or literary texts from a particular period. However, the most common comparable corpora, and those with the widest audience, cover specialized fields and contain a high density of technical vocabulary and terms. Such a corpus is called a specialized comparable corpus. In this study, a specialized comparable corpus was built from the Persian and English abstracts of theses and dissertations registered in IranDoc. The corpus is named PARSA.
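
As an illustration only, the idea of a specialized comparable corpus built from paired abstracts can be sketched in Python as follows. This is not the authors' pipeline: the `AbstractPair` schema, its `record_id` and `field` attributes, and both helper functions are hypothetical names introduced here, and whitespace tokenization is a deliberately crude stand-in for real Persian and English preprocessing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractPair:
    """One thesis record with its abstract in both languages (hypothetical schema)."""
    record_id: str  # hypothetical identifier, not an IranDoc field name
    field: str      # hypothetical subject-area label
    persian: str
    english: str


def group_by_field(pairs):
    """Group abstract pairs by subject area, skipping records that lack either language."""
    corpus = defaultdict(list)
    for pair in pairs:
        if pair.persian.strip() and pair.english.strip():
            corpus[pair.field].append(pair)
    return dict(corpus)


def token_counts(corpus):
    """Return whitespace token counts per field as (persian, english) tuples."""
    return {
        field: (
            sum(len(p.persian.split()) for p in pairs),
            sum(len(p.english.split()) for p in pairs),
        )
        for field, pairs in corpus.items()
    }
```

Grouping by subject area mirrors the definition above: the two language sides need not be translations of each other, only texts on the same topics, so comparing per-field token counts gives a rough check that both sides are of similar size.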

ZABANPAZHUHI (Journal of Language Research)

Volume 16, Issue 52 - Serial Number 52
October 2024
Pages 219-246
