Contributions to text mining from an unsupervised perspective: new approaches and methods
Keywords:
text mining, personalized information retrieval, text summarization, key phrase extraction, graph-based text analysis
Abstract
Introduction: The volume of textual information is growing exponentially. Embedded in these texts is a wealth of relevant information and knowledge of extraordinary value to human activity, economically, politically, intellectually, academically, socially and otherwise. Therefore, the identification and extraction of this relevant information, as well as the discovery of this valuable knowledge, is of paramount interest in many fields. In this context, innovative Text Mining solutions are increasingly necessary.
Objectives: To improve the efficiency of information retrieval through personalization. To conceive and develop new, more efficient solutions for automatic text summarization. To conceive and develop a solution for analyzing texts and discovering knowledge in them based on their representation as graphs.
Methods: A set of solutions for personalized information retrieval is proposed. Methods are also proposed for the automatic generation of extractive summaries from multiple documents and for the extraction of relevant sentences from texts. A model for computational text analysis based on knowledge graphs was developed.
Results: The proposed contributions were evaluated and validated experimentally, and several of them were also validated through practical application in various scenarios. The contributions and benefits of the solutions were demonstrated in the processing of different types of texts, including news, scientific articles, and emails, among others. Significant impacts were achieved in the analysis of information in the field of security and internal order.
Conclusions: The developed solutions represent promising contributions to the state of the art of text mining with an unsupervised approach, with emphasis on the treatment of semantics and the use of knowledge graphs.
License
Copyright (c) 2025 Alfredo Javier Simón-Cuevas, Yamel Pérez Guadarrama, Wenny Hojas Mazo, José Ángel Olivas Varela, Francisco Pascual Romero Chicharro, Manuel Barreiro Guerrero, Eduardo Javier Valladares-Valdés, Manuel de La Iglesia Campos, Jesús Serrano Guerrero, Oleyda del Camino Valle

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The journal Anales de la Academia de Ciencias de Cuba protects copyright and operates under a Creative Commons Attribution-NonCommercial 4.0 License. By publishing in it, authors allow others to copy, reproduce, distribute, and publicly communicate their work and to generate derivative works, as long as the original author is cited and acknowledged. They do not, however, allow the original work to be used for commercial or profit-making purposes.
The authors authorize the publication of their writings, retaining their authorship rights, and assign and transfer to the journal all the rights protected by the intellectual property laws in force in Cuba, which entail editing in order to disseminate the work.
Authors may establish additional agreements for the non-exclusive distribution of the version of the work published in the journal (for example, placing it in an institutional repository or publishing it in a book), with recognition of having been first published in this journal.
To learn more, see https://creativecommons.org
