Contextualized Text Representation Using Latent Topics for Classifying Scientific Papers

Document Type: Research

Authors

1 Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran

2 Faculty of Linguistics, Institute for Humanities and Cultural Studies, Tehran, Iran

Abstract

Every year, researchers in various scientific fields publish their results as technical reports or as articles in proceedings and journals. Search engines and digital libraries use these collections to retrieve research publications, but they usually match articles against query keywords rather than against the articles' subjects. Accurate classification of scientific articles can therefore improve the quality of users' searches for scientific documents in databases. The primary purpose of this paper is to provide a classification model that determines the subject area of scientific articles. To this end, we propose a model that enriches the contextualized representation of Persian articles through distributional semantics: identifying the specific field of each document and characterizing its domain with the enriched knowledge improves classification accuracy. Concretely, we enriched contextualized embedding models, either ParsBERT or XLM-RoBERTa, with latent topics to train a multilayer perceptron classifier. In the experiments, the ParsBERT-NMF-1HT configuration achieved an overall F-measure of 72.37% (macro) and 75.21% (micro), a statistically significant improvement over the baseline (p < 0.05).
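The enrichment described above, concatenating a document's contextualized embedding with its latent-topic distribution before feeding a multilayer perceptron, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus and labels are invented, the 768-dimensional vectors are random stand-ins for ParsBERT/XLM-RoBERTa sentence embeddings, and the topic count and layer sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy corpus standing in for Persian article abstracts (hypothetical data).
docs = ["neural networks for text classification",
        "topic models discover latent themes in documents",
        "transformers encode contextual word meaning",
        "matrix factorization yields interpretable topics"] * 10
labels = rng.integers(0, 2, size=len(docs))  # arbitrary binary subject labels

# 1) Latent topics: NMF over a TF-IDF matrix, one topic vector per document.
tfidf = TfidfVectorizer().fit_transform(docs)
topics = NMF(n_components=3, init="nndsvda", random_state=0).fit_transform(tfidf)

# 2) Contextualized embeddings: random 768-d stand-ins for ParsBERT outputs.
emb = rng.normal(size=(len(docs), 768))

# 3) Enriched representation: embedding concatenated with topic distribution.
X = np.hstack([emb, topics])

# 4) Multilayer perceptron classifier trained on the enriched vectors.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300,
                    random_state=0).fit(X, labels)
print(X.shape)  # (40, 771): 768 embedding dims + 3 topic dims
```

In the paper's actual pipeline, the random stand-in vectors would be replaced by embeddings produced by a pretrained ParsBERT or XLM-RoBERTa model over the Persian article text.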

References

  1. Bijankhan, M., Sheikhzadegan, J., & Samareh, M. R. Y. (1994). FARSDAT - The Speech Database of Farsi Spoken Language. Proceedings of the 5th International Conference on Speech Science and Technology, 2, 826–831. https://www.researchgate.net/publication/292798168_The_speech_database_of_Farsi_spoken_language
  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022. https://dl.acm.org/doi/10.5555/944919.944937
  3. Borko, H. (1968). Information science: What is it? American Documentation, 19(1), 3–5. https://doi.org/10.1002/asi.5090190103
  4. Chen, Y., Zhang, H., Liu, R., Ye, Z., & Lin, J. (2019). Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowledge-Based Systems, 163, 1–13. https://doi.org/10.1016/j.knosys.2018.08.011
  5. Chowdhury, S., & Schoen, M. P. (2020). Research paper classification using supervised machine learning techniques. 2020 Intermountain Engineering, Technology and Computing (IETC), 1–6. https://doi.org/10.1109/IETC47856.2020.9249211
  6. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440–8451. https://doi.org/10.18653/v1/2020.acl-main.747
  7. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423
  8. EmamiAzadi, T., & AlmasGanj, F. (2006). Topic classification of Persian texts based on the improved probabilistic latent semantic analysis. The 12th Conference of Iran’s Computer Society, Tehran. https://civilica.com/doc/44669/
  9. Farahani, M., Gharachorloo, M., Farahani, M., & Manthouri, M. (2021). Parsbert: Transformer-based model for Persian language understanding. Neural Processing Letters, 53(6), 3831–3847. https://doi.org/10.1007/s11063-021-10528-4
  10. Févotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9), 2421–2456. https://doi.org/10.1162/NECO_a_00168
  11. Ghayoomi, M., & Mousavian, M. (2022). Application of the neural network-based machine learning method to classify scientific articles. Iranian Journal of Information Processing & Management, 37(4), 1217-1244. https://doi.org/10.35050/JIPM010.2022.008
  12. Harris, Z. S. (1954). Distributional structure. Word, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520
  13. Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2019). Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications, 78(11), 15169–15211. https://doi.org/10.1007/s11042-018-6894-4
  14. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall: Upper Saddle River, New Jersey.
  15. Karami, A., Gangopadhyay, A., Zhou, B., & Kharrazi, H. (2018). Fuzzy approach topic discovery in health and medical corpora. International Journal of Fuzzy Systems, 20(4), 1334–1345. https://doi.org/10.1007/s40815-017-0327-9
  16. Kim, S.-W., & Gil, J.-M. (2019). Research paper classification systems based on TF-IDF and LDA schemes. Human-Centric Computing and Information Sciences, 9(1), 1–21. https://doi.org/10.1186/s13673-019-0192-7
  17. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized BERT pretraining approach. ArXiv Preprint ArXiv:1907.11692. https://arxiv.org/abs/1907.11692
  18. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297. https://projecteuclid.org/Proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/proceedings-of-the-fifth-berkeley-symposium-on-mathematical-statistics-and/toc/bsmsp/1200512974
  19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Proceedings of the 26th International Conference on Neural Information Processing Systems (pp. 3111–3119). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
  20. Mustafa, G., Usman, M., Yu, L., Sulaiman, M., & Shahid, A. (2021). Multi-label classification of research articles using Word2Vec and identification of similarity threshold. Scientific Reports, 11(1), 1–20. https://doi.org/10.1038/s41598-021-01460-7
  21. Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (2000). Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2), 217–235. https://doi.org/10.1006/jcss.2000.1711
  22. Rabiei, M., HosseiniMotlagh, S. M., & MinaeiBidgoli, B. (2019). Using One-Class SVM for Scientific Documents Classification (Case study: Iranian Environmental Theses). Iranian Journal of Information Processing and Management, 34(3), 1211–1234. https://doi.org/10.35050/JIPM010.2019.036
  23. Rivest, M., Vignola-Gagné, E., & Archambault, É. (2021). Article-level classification of scientific publications: A comparison of deep learning, direct citation and bibliographic coupling. PloS One, 16(5), e0251493. https://doi.org/10.1371/journal.pone.0251493
  24. Salton, G. (1971). The SMART Retrieval System — Experiments in Automatic Document Processing. Prentice-Hall, Inc.
  25. Shokouhian, M., Asemi, A., Shabani, A., & Cheshmesohrabi, M. (2020). Presenting a Thematic Model of Health Scientific Productions Using Text-Mining Methods. Iranian Journal of Information Processing and Management, 35(2), 553–574. https://doi.org/10.35050/JIPM010.2020.061
  26. Teymoorpoor, B., Sepehri, M.-M., & Pezeshk, L. (2009). A new method for topic classification of scientific texts (case study on the articles of the nanotechnology of Iranian specialists). Policy of Science and Technology, 2(2), 1–15. https://doi.org/20.1001.1.20080840.1388.2.2.2.7
  27. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf