Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

RiahiNia, Nosrat; Shadanpour, Farzaneh; Borna, Keyvan; Montazer, Gholam Ali

Volume 9, Issue 3 (10-2022) Human Information Interaction 2022, 9(3): 1-22

20.1001.1.24237418.1401.9.3.4.4

RiahiNia N, Shadanpour F, Borna K, Montazer G A. Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation. Human Information Interaction 2022; 9 (3)
URL: http://hii.khu.ac.ir/article-1-3069-en.html

Nosrat RiahiNia, Farzaneh Shadanpour, Keyvan Borna, Gholam Ali Montazer

Kharazmi University

Abstract:

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian science e-books using LDA topic modeling, evaluates the similarity of the extracted keywords to the golden standard, and examines users' views of the model's keywords.
Methodology: This is mixed-method text-mining research in which LDA topic modeling is used to extract keywords from the tables of contents of scientific e-books. The approach was evaluated in two ways: by computing cosine similarity and by qualitative user evaluation.
Findings: Tables of contents are medium-length texts with a trimmed mean of 260.02 words, about 20% of which are stop-words. The cosine similarity between the golden standard keywords and the output keywords is 0.0932, which is very low. Users fully agreed that the keywords extracted with the LDA topic model represent the subject field of the whole corpus. In describing the subject of each individual document, however, the golden standard keywords were the most successful, followed by the keywords extracted with the LDA topic model from sub-domains of the corpus, and then by the keywords extracted from the whole corpus.
Conclusion: Keywords extracted with the LDA topic model can be used on unspecified and unknown collections to reveal the hidden thematic content of the collection as a whole, but not to relate each topic accurately to each document when the collection is large and thematically heterogeneous. In collections confined to one subject field, such as mathematics or physics, where the vocabulary is less diverse and more uniform, the keywords obtained are more coherent and relevant, although their relevance to each document still needs to be checked. In formal subject analysis of individual documents, this approach can serve as a keyword suggestion system for indexing and subject analysis staff.
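
To make the similarity measure concrete, the following is a minimal sketch of cosine similarity between two keyword lists. The abstract does not say how the keyword vectors were built, so treating each list as a term-count vector is an assumption, as are the function name `cosine_similarity` and the English placeholder keywords; none of this is the authors' implementation or data.

```python
import math
from collections import Counter


def cosine_similarity(keywords_a, keywords_b):
    """Cosine similarity of two keyword lists, each treated as a term-count vector.

    Assumption: keywords are compared as exact, case-insensitive strings.
    """
    vec_a = Counter(k.strip().lower() for k in keywords_a)
    vec_b = Counter(k.strip().lower() for k in keywords_b)

    # Dot product over the terms the two lists share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())

    # Euclidean norms of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty list has no direction, so similarity is defined as 0 here.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical keyword lists, for illustration only.
golden = ["algebra", "geometry", "calculus", "topology"]
extracted = ["geometry", "function", "equation", "theorem"]

score = cosine_similarity(golden, extracted)
print(round(score, 4))  # one shared term out of four on each side: 0.25
```

Under this representation a score near 0, such as the reported 0.0932, indicates that the two keyword sets share very few terms, while a score of 1 would mean identical term distributions.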

Keywords: Keyword extraction, Topic modeling, Latent Dirichlet Allocation (LDA), Similarity evaluation, Users' evaluation.


Type of Study: Research | Subject: Special



Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
