Every year, researchers across scientific fields publish their results as technical reports or as articles in proceedings and journals. Search engines and digital libraries index these collections to make research publications searchable, but they usually retrieve articles by matching query keywords rather than by the articles’ subjects. Accurate classification of scientific articles can therefore improve the quality of users’ searches for scientific documents in databases. The primary purpose of this paper is to provide a classification model that determines the scope of scientific articles. To this end, we propose a model that uses contextualized knowledge of Persian articles, enriched through distributional semantics. Identifying the specific field of each document and defining its domain with this enriched knowledge improves the accuracy of scientific article classification. Specifically, we enrich a contextualized embedding model, either ParsBERT or XLM-RoBERTa, with latent topics and use the resulting representation to train a multilayer perceptron. In the experiments, the ParsBERT-NMF-1HT model achieved an F-measure of 72.37% (macro) and 75.21% (micro), a statistically significant difference from the baseline (p<0.05).
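The enrichment step described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation. It assumes scikit-learn, a toy English corpus in place of Persian articles, and a random placeholder in place of real ParsBERT or XLM-RoBERTa vectors so that it runs offline. The function names, the number of topics, and the MLP size are all hypothetical. Encoding the dominant NMF topic as a one-hot vector is only one possible reading of the "1HT" suffix and is labeled as an assumption.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data standing in for article texts and their field labels.
docs = [
    "neural network training gradient descent",
    "deep neural network image classification",
    "gradient descent optimizer for network training",
    "protein folding and gene expression",
    "gene sequencing and protein structure",
    "cell biology gene expression analysis",
]
labels = np.array([0, 0, 0, 1, 1, 1])


def topic_features(texts, n_topics=2, one_hot=False):
    """Document-topic weights from NMF over TF-IDF counts.

    With one_hot=True, each document is reduced to a one-hot vector
    marking its dominant topic (an assumed encoding, not confirmed
    by the source).
    """
    tfidf = TfidfVectorizer().fit_transform(texts)
    nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500)
    weights = nmf.fit_transform(tfidf)
    if one_hot:
        return np.eye(n_topics)[weights.argmax(axis=1)]
    return weights


def placeholder_embeddings(texts, dim=16, seed=0):
    """Stand-in for contextual document vectors (e.g. from ParsBERT).

    Random values are used here only to keep the sketch offline.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), dim))


embeddings = placeholder_embeddings(docs)
topics = topic_features(docs, one_hot=True)

# Enrichment: concatenate each embedding with its topic vector.
enriched = np.hstack([embeddings, topics])

# Train a small multilayer perceptron on the enriched representation.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(enriched, labels)
pred = clf.predict(enriched)

# Macro averages per-class F1 equally; micro pools all decisions.
macro_f1 = f1_score(labels, pred, average="macro")
micro_f1 = f1_score(labels, pred, average="micro")
print(enriched.shape, macro_f1, micro_f1)
```

In a real setting the placeholder function would be replaced by vectors from a pretrained encoder, and the scores would be computed on held-out articles rather than on the training set.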
Moosaviyan, M., & Ghayoomi, M. (2024). Contextualized Text Representation Using Latent Topics for Classifying Scientific Papers. ZABANPAZHUHI (Journal of Language Research), 15(49), 31-60. doi: 10.22051/jlr.2023.44640.2331