Contextualized Text Representation Using Latent Topics for Classifying Scientific Papers

Document Type: Research

Authors

1 Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran

2 Faculty of Linguistics, Institute for Humanities and Cultural Studies, Tehran, Iran

Abstract

Every year, researchers in various scientific fields publish their results as technical reports or as articles in proceedings and journals. Search engines and digital libraries use these collections to retrieve research publications, but they usually match articles against query keywords rather than against the articles' subjects. Accurate classification of scientific articles can therefore improve the quality of users' searches for scientific documents in databases. The primary purpose of this paper is to provide a classification model that determines the subject area of scientific articles. To this end, we propose a model that enriches the contextualized representation of Persian articles through distributional semantics: identifying the specific field of each document and characterizing its domain with the enriched knowledge improves classification accuracy. Concretely, we enriched contextualized embedding models, either ParsBERT or XLM-RoBERTa, with latent topics to train a multilayer perceptron classifier. In the experiments, the ParsBERT-NMF-1HT configuration achieved an overall F-measure of 72.37% (macro) and 75.21% (micro), a statistically significant improvement over the baseline (p < 0.05).
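The enrichment described above, concatenating a document's contextualized embedding with its latent-topic distribution before feeding a multilayer perceptron, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus and labels are invented, the 768-dimensional vectors are random stand-ins for ParsBERT/XLM-RoBERTa sentence embeddings, and the topic count and layer sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy corpus standing in for Persian article abstracts (hypothetical data).
docs = ["neural networks for text classification",
        "topic models discover latent themes in documents",
        "transformers encode contextual word meaning",
        "matrix factorization yields interpretable topics"] * 10
labels = rng.integers(0, 2, size=len(docs))  # arbitrary binary subject labels

# 1) Latent topics: NMF over a TF-IDF matrix, one topic vector per document.
tfidf = TfidfVectorizer().fit_transform(docs)
topics = NMF(n_components=3, init="nndsvda", random_state=0).fit_transform(tfidf)

# 2) Contextualized embeddings: random 768-d stand-ins for ParsBERT outputs.
emb = rng.normal(size=(len(docs), 768))

# 3) Enriched representation: embedding concatenated with topic distribution.
X = np.hstack([emb, topics])

# 4) Multilayer perceptron classifier trained on the enriched vectors.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300,
                    random_state=0).fit(X, labels)
print(X.shape)  # (40, 771): 768 embedding dims + 3 topic dims
```

In the paper's actual pipeline, the random stand-in vectors would be replaced by embeddings produced by a pretrained ParsBERT or XLM-RoBERTa model over the Persian article text.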

References

  1. Bijankhan, M., Sheikhzadegan, J., & Samareh, M. R. Y. (1994). FARSDAT - The Speech Database of Farsi Spoken Language. Proceedings of the 5th International Conference on Speech Science and Technology, 2, 826–831. https://www.researchgate.net/publication/292798168_The_speech_database_of_Farsi_spoken_language
  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022. https://dl.acm.org/doi/10.5555/944919.944937
  3. Borko, H. (1968). Information science: What is it? American Documentation, 19(1), 3–5. https://doi.org/10.1002/asi.5090190103
  4. Chen, Y., Zhang, H., Liu, R., Ye, Z., & Lin, J. (2019). Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowledge-Based Systems, 163, 1–13. https://doi.org/10.1016/j.knosys.2018.08.011
  5. Chowdhury, S., & Schoen, M. P. (2020). Research paper classification using supervised machine learning techniques. 2020 Intermountain Engineering, Technology and Computing (IETC), 1–6. https://doi.org/10.1109/IETC47856.2020.9249211
  6. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440–8451. https://doi.org/10.18653/v1/2020.acl-main.747
  7. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423
  8. EmamiAzadi, T., & AlmasGanj, F. (2006). Topic classification of Persian texts based on the improved probabilistic latent semantic analysis. The 12th Conference of Iran’s Computer Society, Tehran. https://civilica.com/doc/44669/
  9. Farahani, M., Gharachorloo, M., Farahani, M., & Manthouri, M. (2021). Parsbert: Transformer-based model for Persian language understanding. Neural Processing Letters, 53(6), 3831–3847. https://doi.org/10.1007/s11063-021-10528-4
  10. Févotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9), 2421–2456. https://doi.org/10.1162/NECO_a_00168
  11. Ghayoomi, M., & Mousavian, M. (2022). Application of the neural network-based machine learning method to classify scientific articles. Iranian Journal of Information Processing & Management, 37(4), 1217-1244. https://doi.org/10.35050/JIPM010.2022.008
  12. Harris, Z. S. (1954). Distributional structure. Word, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520
  13. Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2019). Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications, 78(11), 15169–15211. https://doi.org/10.1007/s11042-018-6894-4
  14. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall: Upper Saddle River, New Jersey.
  15. Karami, A., Gangopadhyay, A., Zhou, B., & Kharrazi, H. (2018). Fuzzy approach topic discovery in health and medical corpora. International Journal of Fuzzy Systems, 20(4), 1334–1345. https://doi.org/10.1007/s40815-017-0327-9
  16. Kim, S.-W., & Gil, J.-M. (2019). Research paper classification systems based on TF-IDF and LDA schemes. Human-Centric Computing and Information Sciences, 9(1), 1–21. https://doi.org/10.1186/s13673-019-0192-7
  17. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized BERT pretraining approach. ArXiv Preprint ArXiv:1907.11692. https://arxiv.org/abs/1907.11692
  18. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297. https://projecteuclid.org/Proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/proceedings-of-the-fifth-berkeley-symposium-on-mathematical-statistics-and/toc/bsmsp/1200512974
  19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Proceedings of the 26th International Conference on Neural Information Processing Systems (pp. 3111–3119). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
  20. Mustafa, G., Usman, M., Yu, L., Sulaiman, M., & Shahid, A. (2021). Multi-label classification of research articles using Word2Vec and identification of similarity threshold. Scientific Reports, 11(1), 1–20. https://doi.org/10.1038/s41598-021-01460-7
  21. Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (2000). Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2), 217–235. https://doi.org/10.1006/jcss.2000.1711
  22. Rabiei, M., HosseiniMotlagh, S. M., & MinaeiBidgoli, B. (2019). Using One-Class SVM for Scientific Documents Classification (Case study: Iranian Environmental Theses). Iranian Journal of Information Processing and Management, 34(3), 1211–1234. https://doi.org/10.35050/JIPM010.2019.036
  23. Rivest, M., Vignola-Gagné, E., & Archambault, É. (2021). Article-level classification of scientific publications: A comparison of deep learning, direct citation and bibliographic coupling. PloS One, 16(5), e0251493. https://doi.org/10.1371/journal.pone.0251493
  24. Salton, G. (1971). The SMART Retrieval System — Experiments in Automatic Document Processing. Prentice-Hall, Inc.
  25. Shokouhian, M., Asemi, A., Shabani, A., & Cheshmesohrabi, M. (2020). Presenting a Thematic Model of Health Scientific Productions Using Text-Mining Methods. Iranian Journal of Information Processing and Management, 35(2), 553–574. https://doi.org/10.35050/JIPM010.2020.061
  26. Teymoorpoor, B., Sepehri, M.-M., & Pezeshk, L. (2009). A new method for topic classification of scientific texts (case study on the articles of the nanotechnology of Iranian specialists). Policy of Science and Technology, 2(2), 1–15. https://doi.org/20.1001.1.20080840.1388.2.2.2.7
  27. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf