SemSim^p: A Parametric Method for Evaluating the Semantic Similarity of Digital Resources

Antonio De Nicola; Anna Formica; Ida Mele; Francesco Taglino

doi:10.3844/jcssp.2024.841.849

Abstract

SemSim^p is a parametric method for evaluating the semantic similarity of digital resources that is based on the notion of information content. It exploits a weighted reference ontology of concepts and requires each resource to be semantically annotated with a set of concepts from the ontology. The weights of the concepts can be calculated either from the available annotations (annotation frequency) or from the structure of the ontology alone. SemSim^p was evaluated against six representative semantic similarity methods proposed in the literature. Experiments were run on a large real-world dataset based on the Association for Computing Machinery (ACM) digital library and included both a statistical analysis and an expert judgment assessment. The main result is that the SemSim^p annotation frequency configuration, when combined with the geometric average normalization factor, outperforms the other methods.
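The abstract does not give the formulas, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes a hypothetical toy taxonomy (`PARENT`) and toy annotations (`ANNOTATED`), assumes that an annotation-frequency weight is the share of annotation occurrences falling on a concept or its descendants, assumes a Lin-style information-content similarity between concepts, and assumes that resources are compared by the best one-to-one pairing of their concepts, divided by the geometric average of the two annotation set sizes. All function names are illustrative.

```python
import math
from itertools import permutations

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "Thing": None,
    "Software": "Thing",
    "Database": "Software",
    "Compiler": "Software",
    "Hardware": "Thing",
}

# Hypothetical annotations: resource -> set of ontology concepts.
ANNOTATED = {
    "r1": {"Database", "Compiler"},
    "r2": {"Database"},
    "r3": {"Hardware"},
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def frequency_weights(annotated):
    """Assumed weighting: share of annotation occurrences on a concept or its descendants."""
    counts = {c: 0 for c in PARENT}
    total = 0
    for concepts in annotated.values():
        for c in concepts:
            total += 1
            for a in ancestors(c):
                counts[a] += 1
    return {c: counts[c] / total for c in PARENT}


def information_content(weights, concept):
    """IC(c) = -log w(c); the root (w = 1) carries no information.

    Only valid for concepts with a non-zero weight.
    """
    return -math.log(weights[concept])


def concept_similarity(weights, a, b):
    """Lin-style similarity: information shared via the closest common ancestor."""
    anc_a = ancestors(a)
    common = next(c for c in ancestors(b) if c in anc_a)
    denom = information_content(weights, a) + information_content(weights, b)
    if denom == 0:
        return 1.0  # both concepts are the root
    return 2 * information_content(weights, common) / denom


def resource_similarity(weights, xs, ys):
    """Best one-to-one concept pairing, normalized by the geometric average of the set sizes."""
    small, large = sorted((list(xs), list(ys)), key=len)
    if not small:
        return 0.0
    # Brute force over pairings; fine for toy sets, too slow for large ones.
    best = max(
        sum(concept_similarity(weights, s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / math.sqrt(len(small) * len(large))


weights = frequency_weights(ANNOTATED)
for pair in (("r1", "r2"), ("r2", "r3")):
    score = resource_similarity(weights, ANNOTATED[pair[0]], ANNOTATED[pair[1]])
    print(pair, round(score, 3))
```

The brute-force pairing keeps the sketch dependency-free; for annotation sets of realistic size, an assignment solver would be the more practical choice.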

