KRLSMDA: Identifying Human miRNA–Disease Association Based on Similarity and Kronecker Regularized Least Squares Method

Corresponding Author: Weimin Gao, School of Computer and Information Science, Hunan Institute of Technology, Hengyang, 421002, China. Email: gwmhy@163.com

Abstract: A growing number of studies suggest that miRNAs (microRNAs) are associated with human diseases and are relevant to drug design and discovery. However, the molecular mechanisms by which miRNAs contribute to the development of human diseases are not yet fully understood, and predicting miRNA-disease associations helps to elucidate them. Wet-lab experiments for discovering miRNA-disease associations are time-consuming and costly. Several computational methods have been proposed for predicting miRNA-disease associations, but their prediction performance still needs to be improved. In this study, we propose a new computational model (KRLSMDA) based on similarity and the Kronecker Regularized Least Squares algorithm. In KRLSMDA, the miRNA functional similarity, the miRNA sequence similarity and the Gaussian Interaction Profile (GIP) kernel similarity are integrated into a comprehensive miRNA similarity. We then compute the disease semantic similarity, the disease functional similarity and the GIP kernel similarity from the disease semantic information, the disease functional information and the known miRNA-disease associations, respectively, and integrate them into a comprehensive disease similarity. Finally, the Kronecker regularized least squares algorithm is used to predict hidden miRNA-disease associations. The experimental results show that KRLSMDA achieves average Area Under the Curve (AUC) values of 0.9181±0.032 and 0.9267±0.022 in 5-fold Cross-Validation (5CV) and 10-fold Cross-Validation (10CV), respectively, outperforming four competing models. We expect KRLSMDA to be a useful complement to biomedical research.


Introduction
MicroRNAs (miRNAs) are short, single-stranded non-coding RNAs (~22 nt) that regulate gene expression by base-pairing with the 3' Untranslated Region (UTR) of their target messenger RNAs (mRNAs) (Llave et al., 2002; Eulalio et al., 2008). miRNAs have been found to be involved in a series of key life processes, including cell growth, differentiation, death and apoptosis (Zhu et al., 2016; Fernando et al., 2012; Lize et al., 2010). A growing body of evidence shows that miRNAs are directly and closely associated with human diseases, especially complex diseases such as cancer (Akhtar et al., 2016; Calin and Croce, 2006). miRNAs can regulate the expression of disease genes (Ambros, 2004), and their abnormality, dysregulation and dysfunction may cause human diseases (Ha and Kim, 2014). For example, miR-15a and miR-16-1 are down-regulated in chronic lymphocytic leukemia (Lu et al., 2005), prostate cancer (Bonci et al., 2008) and pituitary adenoma (Bottoni et al., 2005); they are induced by the tumor suppressor p53 (He et al., 2007; Chang et al., 2007), and their transcriptional activity can regulate the p53 response. Furthermore, the expression of miR-29b-1, miR-29a, miR-29b-2 and miR-29c is down-regulated in lung cancer (Yanaihara et al., 2006), breast cancer (Wang et al., 2008), cholangiocarcinoma (Mott et al., 2007), lymphoma (Zhao et al., 2010) and liver cancer (Xiong et al., 2010; Agsalda-Garcia et al., 2020) and can be negatively regulated by the oncoprotein Myc (Cimmino et al., 2005). miR-377-3p is up-regulated in patients with multiple sclerosis (Khorasgani et al., 2019). Therefore, identifying hidden miRNA-disease associations helps to probe the molecular mechanisms of miRNAs in the development of human diseases and to design appropriate and effective treatments (Xu et al., 2013). However, predicting hidden miRNA-disease associations by biological experiments is time-consuming and expensive.
Computational methods can effectively complement experiments in predicting hidden miRNA-disease associations. Their successful application is based on the biological assumption that similar miRNAs are more likely to be associated with similar diseases and vice versa. However, these methods have some disadvantages. On the one hand, their prediction performance needs to be further improved. On the other hand, some methods rely only on local information (Li et al., 2019), which can lead to a high false-positive rate.
In our work, a new computational model (KRLSMDA) is proposed to predict miRNA-disease associations based on the regularized least squares algorithm with a Kronecker product kernel. Based on the miRNA functional information, the miRNA sequence information and the known miRNA-disease associations, KRLSMDA computes the miRNA functional similarity, the miRNA sequence similarity and the Gaussian Interaction Profile (GIP) kernel similarity, and combines them into a comprehensive miRNA similarity matrix by the linear weighted method. Then two disease semantic similarities (DSMesh and DSDO) are computed using directed acyclic graph representations of the MeSH database and the Disease Ontology database. We further compute the disease functional similarity and the GIP kernel similarity based on the disease functional information and the known miRNA-disease associations, respectively. To obtain a comprehensive disease similarity matrix, we integrate the disease functional similarity, the GIP kernel similarity and the two disease semantic similarities (DSMesh and DSDO). The kernel over miRNA-disease pairs is then computed as the Kronecker product of the miRNA kernel and the disease kernel. Finally, the Kronecker product kernel-based regularized least squares algorithm is applied to predict the association scores of miRNA-disease pairs.

Materials
In our study, the known miRNA-disease associations are obtained from the HMDD V1.0 database (Jiang et al., 2009). We sort and standardize the downloaded data and obtain 1395 curated miRNA-disease associations between 271 miRNAs and 137 diseases. Let $Y$ be the adjacency matrix with $n_m$ rows and $n_d$ columns, where $y_{ij} = 1$ if there is a known association between miRNA $m_i$ and disease $d_j$ and $y_{ij} = 0$ otherwise. Of the 271 × 137 = 37127 miRNA-disease pairs, the benchmark dataset therefore consists of 1395 curated miRNA-disease associations and 35732 unknown miRNA-disease pairs. This benchmark dataset is represented as follows:

$$\Omega = \Omega^{+} \cup \Omega^{-}$$

where $\Omega^{+}$, $\cup$ and $\Omega^{-}$ are the 1395 known miRNA-disease associations, the union of the two sets and the 35732 unknown miRNA-disease pairs, respectively.

The Similarity of miRNAs
For miRNAs, we compute three similarities: the miRNA GIP kernel similarity, the miRNA sequence similarity and the miRNA functional similarity. The GIP kernel has been widely used to compute the similarity between biological entities in biological networks (Van Laarhoven et al., 2011; Zhu et al., 2021). Following the biological assumption that similar miRNAs tend to be associated with similar diseases, the GIP kernel similarity between miRNA $m_i$ and miRNA $m_j$ is calculated from the known miRNA-disease associations:

$$\mathrm{GIP}_m(m_i, m_j) = \exp\left(-\gamma_m \left\| y_{m_i} - y_{m_j} \right\|^2\right)$$

$$\gamma_m = \gamma'_m \Big/ \left( \frac{1}{n_m} \sum_{i=1}^{n_m} \left\| y_{m_i} \right\|^2 \right)$$

where $y_{m_i}$ and $y_{m_j}$ denote the interaction profiles of miRNA $m_i$ and miRNA $m_j$ (the corresponding rows of $Y$), respectively, and $\gamma_m$ is the kernel bandwidth obtained by normalizing the original bandwidth $\gamma'_m$.
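The GIP kernel computation described above can be illustrated with a minimal NumPy sketch. The function name `gip_kernel`, the default original bandwidth of 1.0 and the toy adjacency matrix are assumptions made for illustration only; they are not taken from the KRLSMDA implementation.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    `gamma_prime` is the original bandwidth; 1.0 is an assumed default.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    # Normalize the bandwidth by the mean squared norm of the profiles.
    gamma = gamma_prime / mean_sq_norm
    # Pairwise squared Euclidean distances between profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    return np.exp(-gamma * sq_dists)

# Toy adjacency matrix (hypothetical): 3 miRNAs (rows) x 3 diseases (columns).
Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

K_mirna = gip_kernel(Y)      # miRNA profiles are the rows of Y
K_disease = gip_kernel(Y.T)  # disease profiles are the columns of Y
```

Passing the transposed matrix reuses the same function for diseases, since a disease's interaction profile is the corresponding column of the adjacency matrix.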
The miRNA nucleotide sequences are downloaded from the miRBase database (Kozomara and Griffiths-Jones, 2014). Based on these sequences, we use the EMBOSS Needle tool (McWilliam et al., 2013) to compute the miRNA sequence similarity MSseq(mi, mj) between miRNA mi and miRNA mj.
The miRNA functional information is downloaded from MISIM, and we use the MISIM method (Wang et al., 2010) to compute the miRNA functional similarity from it.
As shown above, three miRNA similarity matrices are computed. To minimize the effect of adjusting too many parameters on the performance of our model, the final miRNA similarity is computed based on three miRNA similarity matrices by the linear weighted method:
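As an illustration of a linear weighted combination of similarity matrices, the following minimal sketch assumes equal weights of 1/3. These weights, the function name `combine_similarities` and the toy matrices are hypothetical and are not the values used in KRLSMDA.

```python
import numpy as np

def combine_similarities(matrices, weights):
    """Weighted sum of same-shaped similarity matrices; weights must sum to 1."""
    matrices = [np.asarray(m, dtype=float) for m in matrices]
    weights = np.asarray(weights, dtype=float)
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per matrix")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * m for w, m in zip(weights, matrices))

# Toy 2 x 2 similarity matrices (hypothetical values).
ms_fun = np.array([[1.0, 0.2], [0.2, 1.0]])  # functional similarity
ms_seq = np.array([[1.0, 0.5], [0.5, 1.0]])  # sequence similarity
ms_gip = np.array([[1.0, 0.8], [0.8, 1.0]])  # GIP kernel similarity

# Equal weights are an assumption made for this sketch only.
ms_combined = combine_similarities([ms_fun, ms_seq, ms_gip], [1 / 3, 1 / 3, 1 / 3])
```

Because the weights sum to 1, the combined matrix keeps a unit diagonal and stays symmetric whenever the inputs do.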

The Similarity of Diseases
For diseases, we compute the disease functional similarity and the disease semantic similarity. Based on the assumption that similar diseases tend to be associated with functionally similar genes (Cheng et al., 2014), we use the gene functional similarity (Lee et al., 2011) to compute the disease functional similarity. Let $DSFun(d_i, d_j)$ denote the disease functional similarity between disease $d_i$ and disease $d_j$; then $DSFun$ can be computed as:

$$DSFun(d_i, d_j) = \frac{\sum_{p=1}^{|G_i|} F_{G_j}(g_{ip}) + \sum_{k=1}^{|G_j|} F_{G_i}(g_{jk})}{|G_i| + |G_j|}$$

in which $G_i$ and $G_j$ are the sets of genes related to disease $d_i$ and disease $d_j$, respectively;

$$F_{G_j}(g_{ip}) = \max_{1 \le k \le |G_j|} GF(g_{ip}, g_{jk})$$

is the functional similarity between gene $g_{ip}$ and gene set $G_j$; and $GF(g_{ip}, g_{jk})$ is the gene functional similarity between gene $g_{ip}$ and gene $g_{jk}$, which is derived from $LLS(g_{ip}, g_{jk})$, the log-likelihood score between gene $g_{ip}$ and gene $g_{jk}$.
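The MeSH database (Schaefer et al., 2013) and the Disease Ontology database (Kibbe et al., 2015) are used to compute two disease semantic similarities: DSMesh and DSDO. For the first semantic similarity, we use a Directed Acyclic Graph (DAG) to describe the MeSH database. For disease $d_i$, $T(d_i)$ is the set containing $d_i$ and all of its ancestors, and $E(d_i)$ is the corresponding edge set. The semantic contribution of disease $d_t$ to disease $d_i$ in the DAG is defined as:

$$C_{d_i}(d_t) = \begin{cases} 1, & d_t = d_i \\ \max\left\{ \Delta \cdot C_{d_i}(d_t') \mid d_t' \in \text{children of } d_t \right\}, & d_t \neq d_i \end{cases}$$

The disease semantic value $CV(d_i)$ is computed as:

$$CV(d_i) = \sum_{d_t \in T(d_i)} C_{d_i}(d_t)$$

The disease semantic similarity DSMesh is defined as:

$$DSMesh(d_i, d_j) = \frac{\sum_{d_t \in T(d_i) \cap T(d_j)} \left( C_{d_i}(d_t) + C_{d_j}(d_t) \right)}{CV(d_i) + CV(d_j)}$$

where $C_{d_i}(d_t)$ is the semantic contribution of disease $d_t$ to disease $d_i$ in the DAG and $\Delta$ is the semantic contribution decay factor.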
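The DAG-based semantic similarity can be sketched as follows: each step up the DAG attenuates a term's contribution by a decay factor. The decay factor of 0.5, the function names and the toy DAG with its term names are all assumptions chosen for illustration; they are not taken from MeSH or from the KRLSMDA implementation.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of `disease` and each of its ancestors to `disease`.

    `parents` maps a term to its direct parents; `delta` (assumed 0.5,
    must lie strictly between 0 and 1) is the per-edge decay factor.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                candidate = delta * contrib[node]
                # Keep the largest contribution over all paths to this ancestor.
                if candidate > contrib.get(parent, 0.0):
                    contrib[parent] = candidate
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared semantic contribution divided by the two semantic values."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Toy DAG (hypothetical term names): child -> list of direct parents.
toy_parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}

sim_siblings = semantic_similarity("A1", "A2", toy_parents)
sim_distant = semantic_similarity("A1", "B", toy_parents)
sim_self = semantic_similarity("A1", "A1", toy_parents)
```

In this toy DAG the two siblings under the same parent score higher than terms that share only the root, which is the behavior the semantic value is meant to capture.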
For the second semantic similarity, we also use a DAG to describe the Disease Ontology database. The disease semantic similarity DSDO is defined in terms of the following quantities: $|G_i|$ and $|G_j|$ are the numbers of genes related to disease $d_i$ and disease $d_j$, respectively, and $|G_c|$ is the number of genes in $G_c$, the gene set associated with $D_c$, which denotes the nearest common ancestor of disease $d_i$ and disease $d_j$ in the DAG of the Disease Ontology.
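Based on the known miRNA-disease associations, the GIP kernel similarity between disease $d_i$ and disease $d_j$ is calculated as:

$$\mathrm{GIP}_d(d_i, d_j) = \exp\left(-\gamma_d \left\| y_{d_i} - y_{d_j} \right\|^2\right)$$

$$\gamma_d = \gamma'_d \Big/ \left( \frac{1}{n_d} \sum_{j=1}^{n_d} \left\| y_{d_j} \right\|^2 \right)$$

where $y_{d_i}$ and $y_{d_j}$ denote the interaction profiles of disease $d_i$ and disease $d_j$ (the corresponding columns of $Y$), respectively, and $\gamma_d$ is the kernel bandwidth obtained by normalizing the original bandwidth $\gamma'_d$.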

KRLSMDA for Predicting miRNA-Disease Associations
As a machine learning-based computational method, the Kronecker regularized least squares algorithm has been widely applied in many fields. Motivated by these successful applications, we propose a computational method (KRLSMDA) that uses this algorithm to predict hidden miRNA-disease associations. In KRLSMDA, the predicted miRNA-disease association matrix is defined as:

$$\mathrm{vec}(\hat{Y}^{T}) = K \left( K + \lambda I \right)^{-1} \mathrm{vec}(Y^{T})$$

where $\hat{Y}$ contains the predicted association scores of miRNAs and diseases, $Y$ is the known adjacency matrix, vec is the vectorization operator that stacks the columns of a matrix, $K$ is the kernel matrix over miRNA-disease pairs, $I$ is the identity matrix and $\lambda$ is the regularization parameter. In our study, we set $\lambda$ to 1.
The kernel $K$ is computed as the Kronecker product:

$$K = K_m \otimes K_d$$

where $K_m$ is the miRNA similarity matrix and $K_d$ is the disease similarity matrix. To obtain the predictive matrix, we use eigenvalue decomposition to compute the inverse of the $n_m n_d \times n_m n_d$ matrix $K + \lambda I$ efficiently. The kernel $K$ can be written as:

$$K = K_m \otimes K_d = V \Lambda V^{T}, \quad V = V_m \otimes V_d, \quad \Lambda = \Lambda_m \otimes \Lambda_d$$

in which $V_m$ and $V_d$ are the unitary matrices of eigenvectors of the miRNA similarity matrix $K_m$ and the disease similarity matrix $K_d$, respectively, $\otimes$ denotes the Kronecker product, $\Lambda_m$ is the diagonal eigenvalue matrix of $K_m$ and $\Lambda_d$ is the diagonal eigenvalue matrix of $K_d$. The eigenvalue decompositions are computed as:

$$K_m = V_m \Lambda_m V_m^{T}, \quad K_d = V_d \Lambda_d V_d^{T}$$

So, we can compute the predictive miRNA-disease association matrix $\hat{Y}$ as below:

$$\hat{Y} = V_m Z^{T} V_d^{T}, \quad \mathrm{vec}(Z) = \left( \Lambda_m \otimes \Lambda_d \right) \left( \Lambda_m \otimes \Lambda_d + \lambda I \right)^{-1} \mathrm{vec}\left( V_d^{T} Y^{T} V_m \right)$$
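A minimal NumPy sketch of Kronecker regularized least squares via eigenvalue decomposition is given here. It never forms the full pairwise kernel: it decomposes the two small similarity matrices and rescales the projected labels. The function name `kron_rls` and the random toy kernels are assumptions for illustration; the regularization value of 1 follows the setting stated in the text.

```python
import numpy as np

def kron_rls(K_m, K_d, Y, lam=1.0):
    """Predicted score matrix (n_m x n_d) from symmetric PSD kernels K_m, K_d.

    Equivalent to K (K + lam*I)^{-1} applied to the vectorized labels with
    K = kron(K_m, K_d), but computed from the two small eigendecompositions.
    """
    evals_m, V_m = np.linalg.eigh(K_m)
    evals_d, V_d = np.linalg.eigh(K_d)
    # Eigenvalues of the Kronecker product, arranged on an n_m x n_d grid.
    pair_evals = np.outer(evals_m, evals_d)
    shrink = pair_evals / (pair_evals + lam)
    # Project the labels onto the eigenbases, rescale, and project back.
    projected = V_m.T @ Y @ V_d
    return V_m @ (shrink * projected) @ V_d.T

# Toy data (hypothetical): 4 miRNAs, 3 diseases, random PSD kernels.
rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))
K_m = A @ A.T
K_d = B @ B.T
Y = (rng.random((4, 3)) < 0.4).astype(float)

Y_hat = kron_rls(K_m, K_d, Y, lam=1.0)
```

The cost is dominated by two eigendecompositions of sizes n_m and n_d rather than by inverting a matrix of size n_m·n_d, which is what makes the approach practical for hundreds of miRNAs and diseases.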

Performance Evaluation
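For 5CV, the known miRNA-disease associations $\Omega^{+}$ are randomly divided into 5 mutually exclusive subsets as below:

$$\Omega^{+} = \Omega_1^{+} \cup \Omega_2^{+} \cup \Omega_3^{+} \cup \Omega_4^{+} \cup \Omega_5^{+}$$

with:

$$\Omega_i^{+} \cap \Omega_j^{+} = \emptyset \quad (i \neq j)$$

in which $\cup$, $\cap$ and $\emptyset$ are the symbols of union, intersection and the empty set, respectively. Each subset (e.g., $\Omega_1^{+}$) in turn acts as the test set and the remaining four subsets form the training set; $\Omega_5^{+}$ denotes the fifth subset. 5CV is performed 10 times, and the average of the predicted results is taken as the final result.
Similar to the 5CV method, for 10CV the known associations $\Omega^{+}$ are randomly divided into 10 mutually exclusive subsets of equal size:

$$\Omega^{+} = \Omega_1^{+} \cup \Omega_2^{+} \cup \dots \cup \Omega_{10}^{+}$$

with:

$$\Omega_i^{+} \cap \Omega_j^{+} = \emptyset \quad (i \neq j)$$

Each subset (e.g., $\Omega_1^{+}$) in turn acts as the test set and the remaining nine subsets form the training set. 10CV is also performed 10 times, and the average of the predicted results is taken as the final result.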
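The fold construction can be sketched as follows, assuming that test associations are hidden by setting their entries to 0 in the training matrix. That masking step, the function names and the toy matrix are assumptions made for illustration rather than details confirmed by the text.

```python
import numpy as np

def split_known_associations(Y, n_folds, seed=0):
    """Randomly partition the known (value 1) entries of Y into disjoint folds."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(np.asarray(Y) == 1)  # array of (miRNA, disease) index pairs
    rng.shuffle(known)                       # shuffles the pairs (rows) in place
    return np.array_split(known, n_folds)    # fold sizes differ by at most one

def training_matrix(Y, test_pairs):
    """Copy of Y in which the test associations are masked as unknown (0)."""
    Y_train = np.array(Y, dtype=float, copy=True)
    Y_train[test_pairs[:, 0], test_pairs[:, 1]] = 0.0
    return Y_train

# Toy adjacency matrix (hypothetical) with 7 known associations.
Y_toy = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 1, 1]])

folds = split_known_associations(Y_toy, n_folds=5, seed=0)
Y_train_first = training_matrix(Y_toy, folds[0])
```

Repeating the split with different seeds and averaging the resulting AUC values corresponds to running the cross-validation several times.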
In 10CV, we also compared KRLSMDA with four other prediction models. As we can see from Fig. 2

Parameter Analysis
To analyze the robustness of KRLSMDA, we quantify the effects of different values of three parameters ($\lambda$, $\gamma'_m$ and $\gamma'_d$) on the prediction performance of KRLSMDA in 5CV and 10CV, respectively. Parameter $\lambda$ is the regularization parameter in KRLSMDA. The original bandwidths $\gamma'_m$ and $\gamma'_d$ regulate the normalized kernel bandwidths.
As shown in Fig. 4, the AUC of KRLSMDA in 10CV increases from 0.566±0.015 to 0.9267±0.022 as $\lambda$ increases from 0.2 to 1.0. Over this range, KRLSMDA performs best when $\lambda$ is 1.0, so $\lambda$ is set to 1.0 in this study.
It is obvious for KRLSMDA to make a better performance when γ is equal to 2 1 .

Conclusion
The KRLSMDA method is proposed to predict miRNA-disease associations based on the Kronecker Regularized Least Squares algorithm. In KRLSMDA, we compute the miRNA functional similarity and the miRNA sequence similarity from the miRNA functional information and the miRNA sequence information. Then the disease semantic similarity and the disease functional similarity are computed from the disease semantic and functional information, respectively. Finally, we apply the Kronecker product kernel-based regularized least squares algorithm to predict hidden miRNA-disease associations. The experimental results show that KRLSMDA is effective in predicting potential miRNA-disease associations.
However, our method has limitations that should be addressed in future work: (i) additional relevant biological information should be integrated; (ii) deep learning methods could be used to enhance the prediction performance of KRLSMDA.