Document Type: Research

Authors

1 Iranian Research Institute for Information Science and Technology (IranDoc)

2 Iranian Research Institute for Information Science and Technology (IranDoc)

Abstract

The advent of computer technologies and the generation of very large amounts of text in different languages have provided enormous resources for researchers interested in building corpora. A comparable corpus is a bilingual or multilingual collection of similar texts from the same domain, written in different languages or in different varieties of one language. Although such corpora can be used in comparative linguistics, machine translation, and automatic cross-lingual information retrieval systems, researchers have long faced a shortage of comparable corpora. In the present study, a specialized comparable corpus (PARSA) was constructed from the registered Persian and English abstracts of theses and dissertations available at the Iranian Research Institute for Information Science and Technology (IranDoc). The corpus contains more than 89 million Persian words and 79 million English words. Its content is not general: it consists of highly specialized texts in major subject areas such as social sciences, humanities and arts, and engineering and related fields, which makes it valuable for language processing tasks that require specialized texts. To construct the corpus, after the sampling process, the Persian data were preprocessed (normalized and tokenized) and tagged with part-of-speech (POS) labels, and the Persian tags were then checked manually. The English data were tagged automatically. The corpus is well suited to data mining and machine translation research, as well as to linguistic studies that require specialized texts.

Keywords