Exploring Optimizations to HeteSim for Computing Relatedness in Heterogeneous Information Networks
Abstract
Heterogeneous information networks, or knowledge graphs, are valuable tools for analyzing insights from the vast number of papers published in the biomedical sciences. Informally, a heterogeneous information network is a directed graph in which each node corresponds to a biomedical concept and each directed edge encodes a relationship between concepts. SemNet is a heterogeneous information network with approximately 300,000 nodes and 20,000,000 edges built from the abstracts of papers in the PubMed database. This work builds on the SemNet codebase with the goal of reducing algorithm runtime. We focus specifically on the similarity score HeteSim, which has been successfully used in multiple biomedical heterogeneous information networks. However, HeteSim can require prohibitive runtime. Performance improvements to HeteSim computation algorithms based on data structure changes and randomized approximation algorithms will be presented. Both theoretical bounds on algorithm performance and empirical timing results will be included.