Article Type: Research Article

Authors

1 Department of Computer Engineering, Faculty of Engineering, Amirkabir University of Technology, Tehran, Iran

2 Faculty of Linguistics, Institute for Humanities and Cultural Studies, Tehran, Iran

Abstract

Every year, researchers in various scientific fields publish their findings as technical reports or as articles in conference proceedings or journals. Search engines and digital libraries collect this type of data to make research publications searchable and accessible, yet related articles are usually retrieved on the basis of the query keywords rather than the subjects of the articles. Accurate classification of scientific articles can therefore improve the quality of users' searches when they look for a scientific document in databases. The main goal of this paper is to present a classification model for determining the subject of scientific articles. To this end, we propose a model that exploits the enriched contextual knowledge of Persian articles based on distributional semantics. Accordingly, identifying the specific field of each document and determining its domain through prominent enriched knowledge increases the accuracy of scientific article classification. To achieve this goal, we enrich contextualized embedding models, namely ParsBERT or XLM-RoBERTa, with the latent topics of the articles in order to train a multilayer perceptron model. According to the experimental findings, the overall performance of ParsBERT-NMF-1HT was 72.37% (macro) and 75.21% (micro) in terms of F-measure, and the difference in performance between this model and the baseline model was statistically significant (p < 0.05).
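A minimal sketch of this kind of enrichment is given below, assuming a ParsBERT checkpoint name, a TF-IDF representation, an NMF topic model with a placeholder number of topics, and an illustrative `docs` corpus; none of these choices are taken from the paper itself.

```python
# Sketch: enriching contextual embeddings with latent topics (illustrative only).
# Assumptions: `docs` is a list of article texts; the checkpoint name and the
# number of topics are placeholder choices, not the authors' settings.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

MODEL_NAME = "HooshvareLab/bert-base-parsbert-uncased"  # assumed ParsBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def cls_embedding(text: str) -> np.ndarray:
    """Return the [CLS] vector of the last encoder layer as a document embedding."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = encoder(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze(0).numpy()

docs = ["..."]  # placeholder corpus of article texts

# Latent topics via non-negative matrix factorization over TF-IDF weights.
tfidf = TfidfVectorizer(max_features=5000)
doc_term = tfidf.fit_transform(docs)
nmf = NMF(n_components=10, random_state=0)   # assumed number of latent topics
doc_topic = nmf.fit_transform(doc_term)      # document-topic matrix

# Enriched representation: contextual embedding concatenated with topic weights.
contextual = np.vstack([cls_embedding(d) for d in docs])
features = np.hstack([contextual, doc_topic])
```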

Keywords

Subjects

Article Title [English]

Contextualized Text Representation Using Latent Topics for Classifying Scientific Papers

Authors [English]

  • Maryam Moosaviyan 1
  • Masood Ghayoomi 2

1 Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran

2 Faculty of Linguistics, Institute for Humanities and Cultural Studies, Tehran, Iran

Abstract [English]

Annually, researchers in various scientific fields publish their research results as technical reports or as articles in proceedings or journals. Search engines and digital libraries collect this type of data to make research publications searchable and accessible, but related articles are usually retrieved based on the query keywords rather than the articles' subjects. Consequently, accurate classification of scientific articles can improve the quality of users' searches when they seek a scientific document in databases. The primary purpose of this paper is to provide a classification model to determine the subject of scientific articles. To this end, we propose a model that uses the enriched contextualized knowledge of Persian articles obtained through distributional semantics. Accordingly, identifying the specific field of each document and defining its domain by prominent enriched knowledge enhances the accuracy of scientific article classification. To reach this goal, we enrich contextualized embedding models, either ParsBERT or XLM-RoBERTa, with the latent topics to train a multilayer perceptron model. According to the experimental results, the overall performance of ParsBERT-NMF-1HT was 72.37% (macro) and 75.21% (micro) in terms of F-measure, and the difference in performance relative to the baseline model was statistically significant (p < 0.05).
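As a rough illustration only, the sketch below trains a multilayer perceptron on such enriched features and reports macro and micro F-measure; the `features` and `labels` arrays, the single hidden layer size, and the train/test split are assumptions for demonstration, not the authors' actual experimental setup.

```python
# Sketch: training an MLP on the enriched features and scoring with macro/micro F1.
# Assumptions: `features` is the enriched matrix from the previous sketch and
# `labels` holds the subject classes; all hyperparameters are placeholders.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

clf = MLPClassifier(
    hidden_layer_sizes=(256,),  # a single hidden layer; size is an assumption
    max_iter=300,
    random_state=0,
)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("macro F1:", f1_score(y_test, pred, average="macro"))
print("micro F1:", f1_score(y_test, pred, average="micro"))
```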

Keywords [English]

  • Article Content Analysis
  • Contextualized Representation
  • Distributional Semantics
  • Neural Network
  • Scientific Article Classification
  • Topic Modeling