Predicting Microbe-Drug Association based on Similarity and Semi-Supervised Learning

Corresponding Author: Jun Wang, School of Computer and Information Science, Hunan Institute of Technology, Hengyang 421008, China. Email: wj6@hnit.edu.cn

Abstract: Increasing clinical evidence has shown that microbial communities play important roles in human health and disease. Predicting hidden microbe-drug associations helps in understanding the mechanisms of microbe-drug association in clinical treatment, drug discovery, drug combination and drug repositioning. Several computational methods have been proposed to predict associations between microbes and drugs, but their prediction performance needs to be improved. In this study, a new computational model (LRLSMDA) is proposed for identifying Microbe-Drug Associations based on the Laplacian Regularized Least Squares algorithm. LRLSMDA integrates the chemical structure similarity of drugs and known microbe-drug associations. The microbe Gaussian Interaction Profile (GIP) kernel similarity is computed from known microbe-drug associations. The drug GIP kernel similarity and the drug chemical structure similarity are computed from known microbe-drug associations and drug chemical structures, respectively, and are integrated into a more comprehensive drug similarity matrix by the linear weighted method. Finally, the Laplacian regularized least squares algorithm is applied to predict hidden microbe-drug associations. LRLSMDA achieves average Area Under the Curve (AUC) values of 0.8983±0.0019, 0.9043±0.0015 and 0.9095 in 5-fold Cross-Validation (5CV), 10-fold Cross-Validation (10CV) and Leave One Out Cross-Validation (LOOCV), respectively. These experimental results show that LRLSMDA outperforms the three compared models.


Introduction
As an important part of the human microbiome, microbes mainly comprise bacteria, archaea, viruses and fungi. Bacteria and viruses are known to cause hundreds of human diseases (Geoghegan et al., 2016). In particular, emerging and epidemic-prone diseases such as Coronavirus Disease 2019 (COVID-19), Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) directly threaten human health and have become public health concerns.
Some researchers think that these diseases can result from the absence of beneficial functions or from the introduction of maladaptive functions by invading microbes (Turnbaugh et al., 2007; Methé et al., 2012; Young, 2017). It is also believed that restoring absent beneficial functions or eliminating harmful microbial activities helps in the treatment of certain diseases (Young, 2017; Huttenhower et al., 2012).
Since its introduction into clinical use in the 1940s, penicillin has been used to restore absent beneficial functions or eliminate harmful microbial activities. Antibiotics have already saved millions of people from disease and death. However, with the abuse of antibiotics, many bacteria are developing antibiotic resistance, which greatly reduces the efficacy of antibiotics and limits their range of application. Over 70% of bacteria are resistant to at least one common antibiotic. At the same time, new antibiotics are rarely developed, and only two have been discovered in the past 30 years (Pew Charitable Trusts, 2015). The United Kingdom government predicts that, without the discovery of new potential antibiotics, 10 million people will die from antibiotic-resistant infections worldwide every year by 2050 (O'Neill, 2014). Drug resistance is therefore a serious threat to public health.
To deal with the problem of drug resistance, scholars have proposed two approaches: drug combination and drug repositioning. On the one hand, drug combination research is one medical approach to combating antibiotic resistance (Zimmermann et al., 2007). The first exploration of antimicrobial agents in tuberculosis was performed using combination drugs (Marshall et al., 1948). Combination drug therapy is widely used in the treatment of HIV infection and in cancer chemotherapy (Vandamme et al., 1998). On the other hand, drug repositioning (Chen et al., 2015) finds novel therapeutic effects of old drugs. For both combinatorial drug treatment and drug repositioning, identifying novel associations between drugs and microbes is the first step (Chen et al., 2016).
Several studies show that microbes play critical roles in many important biological processes, including increased toxicity of digoxin (Aarnoudse et al., 2008; Haiser et al., 2014), reduced clearance of morphine and a higher morphine AUC, which induces virulence in some strains of Pseudomonas aeruginosa, a 221% increase in simvastatin AUC for homozygotes (Ong et al., 2012; Voora et al., 2009; Ramsey et al., 2014), altered activity of warfarin (Violi et al., 2016) and increased toxicity of irinotecan (Guthrie et al., 2017). Identifying associations between microbes and drugs helps to explain why some patients respond well to certain drugs while others suffer severe side effects. However, to date, only a few microbe-drug associations have been identified (Sun et al., 2018).
In this study, a new model (LRLSMDA) is proposed to identify Microbe-Drug Associations based on the Laplacian Regularized Least Squares algorithm. In LRLSMDA, we compute the microbe Gaussian Interaction Profile (GIP) kernel similarity from known microbe-drug associations to construct the microbe similarity matrix. An integrated drug similarity matrix is then constructed as follows. First, the chemical structure similarity of drugs is calculated from the Canonical SMILES of drugs downloaded from DrugBank. Second, the drug GIP kernel similarity is calculated from known microbe-drug associations. Last, the integrated drug similarity matrix is constructed as the average of the drug GIP kernel similarity and the drug chemical structure similarity. Based on the microbe similarity matrix, the integrated drug similarity matrix and the microbe-drug association matrix, the Laplacian regularized least squares algorithm is applied to identify hidden microbe-drug associations.
To confirm the prediction ability of LRLSMDA, we compare it with three other models (HGBI, NBI and SNMF). The results of the 5CV, 10CV and LOOCV computational experiments show that LRLSMDA is consistently superior to these three models and is effective in identifying hidden microbe-drug associations.

Materials
The dataset of human microbe-drug associations is downloaded from the Microbe-Drug Association Database (MDAD) (Sun et al., 2018). After sorting and preprocessing the downloaded data, we obtain 1152 known microbe-drug associations between 142 microbes and 627 drugs. Let M = {m1, m2, m3,…, mnm} denote the set of nm microbes and D = {d1, d2, d3,…, dnd} the set of nd drugs. The adjacency matrix Y of microbe-drug associations has nd rows and nm columns. If there is a known association between drug di and microbe mj, the value of yij is 1; otherwise it is 0. The benchmark dataset therefore consists of 1152 known microbe-drug associations and 87882 unknown microbe-drug associations (627 × 142 − 1152 = 87882). This benchmark dataset is represented as follows:

$$\Omega = \Omega^{+} \cup \Omega^{-}$$

in which $\Omega^{+}$ is the set of 1152 known microbe-drug associations, $\cup$ is the union of sets and $\Omega^{-}$ is the set of 87882 unknown microbe-drug associations.
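The construction of the adjacency matrix can be sketched in Python as follows. This is a minimal illustration, assuming rows index drugs and columns index microbes (as the nd × nm shape suggests); the function name `build_adjacency` and the toy identifiers are hypothetical and are not taken from MDAD.

```python
import numpy as np

def build_adjacency(pairs, drugs, microbes):
    """Return the nd x nm 0/1 matrix Y with Y[i, j] = 1 for each known pair."""
    drug_idx = {d: i for i, d in enumerate(drugs)}
    microbe_idx = {m: j for j, m in enumerate(microbes)}
    Y = np.zeros((len(drugs), len(microbes)), dtype=int)
    for drug, microbe in pairs:
        Y[drug_idx[drug], microbe_idx[microbe]] = 1
    return Y

# Toy data (hypothetical identifiers, for illustration only).
drugs = ["d1", "d2", "d3"]
microbes = ["m1", "m2"]
known = [("d1", "m1"), ("d3", "m2"), ("d2", "m1")]

Y = build_adjacency(known, drugs, microbes)
n_known = int(Y.sum())          # size of the positive set
n_unknown = Y.size - n_known    # size of the unknown set
```

With the sizes reported above, the same bookkeeping gives 627 × 142 − 1152 = 87882 unknown pairs.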

Construct the GIP Kernel Similarity Matrix of Microbes
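The GIP kernel has been successfully applied in many fields (Van Laarhoven et al., 2011; Zhu et al., 2020; Luo et al., 2018). Under the assumption that similar microbes tend to be related to similar drugs, the microbe GIP kernel similarity KMGIP(mi,mj) can be computed, in the standard form of Van Laarhoven et al. (2011), as:

$$S_m(m_i, m_j) = KM_{GIP}(m_i, m_j) = \exp\left(-\gamma_m \left\| IP(m_i) - IP(m_j) \right\|^{2}\right)$$

$$\gamma_m = \gamma'_m \Big/ \left( \frac{1}{n_m} \sum_{i=1}^{n_m} \left\| IP(m_i) \right\|^{2} \right)$$

where $S_m$ is the microbe similarity matrix, $IP(m_i)$ is the interaction profile of microbe $m_i$ (the column of Y corresponding to $m_i$) and $\gamma_m$ is the kernel bandwidth, obtained by normalizing the bandwidth parameter $\gamma'_m$ by the average squared norm of the interaction profiles.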
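A minimal Python sketch of the GIP kernel is given below. It assumes the standard form exp(−γ‖·‖²) of Van Laarhoven et al. (2011), with the bandwidth normalized by the mean squared norm of the interaction profiles, and it assumes Y is stored as drugs by microbes. The names `gip_kernel` and `gamma_prime` are illustrative only.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel between the rows of `profiles` (one interaction profile per row).

    Assumes at least one profile is non-zero, otherwise the bandwidth is undefined.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth normalized by the mean squared norm of the profiles.
    gamma = gamma_prime / sq_norms.mean()
    # Pairwise squared Euclidean distances: ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)

# Toy nd x nm adjacency matrix (3 drugs, 2 microbes).
Y = np.array([[1, 0],
              [1, 0],
              [0, 1]])

S_m = gip_kernel(Y.T)     # microbe profiles are the columns of Y
K_d_gip = gip_kernel(Y)   # drug profiles are the rows of Y
```

The same function serves both entity types: only the orientation of Y changes.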

Construct the Similarity Matrix of Drugs
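For drugs, we compute the drug GIP kernel similarity and the drug chemical structure similarity. Following the same calculation method as for the microbe GIP kernel similarity (Zhu et al., 2021), the drug GIP kernel similarity KDGIP(di,dj) between drug di and drug dj is computed as:

$$KD_{GIP}(d_i, d_j) = \exp\left(-\gamma_d \left\| IP(d_i) - IP(d_j) \right\|^{2}\right)$$

$$\gamma_d = \gamma'_d \Big/ \left( \frac{1}{n_d} \sum_{i=1}^{n_d} \left\| IP(d_i) \right\|^{2} \right)$$

where $IP(d_i)$ is the interaction profile of drug $d_i$ (the row of Y corresponding to $d_i$) and $\gamma_d$ is the kernel bandwidth. Previous research offers several ways to compute drug similarity. In our study, we introduce the drug chemical structure similarity into LRLSMDA.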
The drug chemical structure similarity can be computed with the Chemistry Development Kit (CDK) (Steinbeck et al., 2006) from the chemical structures of drugs in the Canonical Simplified Molecular Input Line Entry Specification (SMILES) format (Weininger, 1988). The Canonical SMILES of drugs can be downloaded from DrugBank (Wishart et al., 2018). We compute binary fingerprints of all drugs with the CDK. The Tanimoto score (Tanimoto, 1958) of the binary fingerprints of two drugs is used to measure their chemical structure similarity DSchem(di,dj).
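The Tanimoto score of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The Python sketch below illustrates this on hand-written fingerprints; it does not call the toolkit used in the paper, and the function name `tanimoto` and the bit positions are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto score of two binary fingerprints given as sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

# Hypothetical fingerprints: positions of the bits that are set.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}

score = tanimoto(fp1, fp2)  # 2 shared bits out of 5 distinct bits
```

Representing a fingerprint as a set of bit positions keeps the example short; a bit vector with bitwise AND and OR gives the same score.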
As shown above, two drug similarity matrices are computed. We combine the two drug similarity matrices KDGIP(di,dj) and DSchem(di,dj) into a more comprehensive drug similarity matrix Sd by the linear weighted method, taking their average:

$$S_d(d_i, d_j) = \frac{KD_{GIP}(d_i, d_j) + DS_{chem}(d_i, d_j)}{2}$$
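The combination step can be sketched in Python as a weighted sum of the two matrices. The default weight of 0.5 reproduces the plain average described in the Introduction; the function name `integrate_drug_similarity`, the `weight` parameter and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def integrate_drug_similarity(k_gip, s_chem, weight=0.5):
    """Linear weighted combination of two drug similarity matrices."""
    k_gip = np.asarray(k_gip, dtype=float)
    s_chem = np.asarray(s_chem, dtype=float)
    if k_gip.shape != s_chem.shape:
        raise ValueError("similarity matrices must have the same shape")
    return weight * k_gip + (1.0 - weight) * s_chem

# Toy 2 x 2 similarity matrices (hypothetical values).
k_gip = [[1.0, 0.2], [0.2, 1.0]]
s_chem = [[1.0, 0.6], [0.6, 1.0]]

S_d = integrate_drug_similarity(k_gip, s_chem)  # equal weights, i.e. the average
```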

LRLSMDA for Predicting Microbe-Drug Associations
The Laplacian Regularized Least Squares (LRLS) algorithm has been successfully applied to identify associations between biological entities. In this study, we present a new model (LRLSMDA) to identify microbe-drug associations via the Laplacian Regularized Least Squares algorithm. LRLSMDA is implemented based on the drug chemical structure similarity, the drug GIP kernel similarity and the microbe GIP kernel similarity.
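Based on the microbe GIP kernel similarity matrix and the comprehensive drug similarity matrix above, two diagonal matrices Dm and Dd can be expressed as follows:

$$D_m(i,i) = \sum_{j=1}^{n_m} S_m(i,j), \qquad D_d(i,i) = \sum_{j=1}^{n_d} S_d(i,j)$$

Then we use these two diagonal matrices to obtain two normalized Laplacian matrices Lm and Ld by the Laplacian operation, respectively:

$$L_m = D_m^{-1/2}\,(D_m - S_m)\,D_m^{-1/2}, \qquad L_d = D_d^{-1/2}\,(D_d - S_d)\,D_d^{-1/2}$$

Based on the LRLS algorithm, $F_M^{*}$ and $F_D^{*}$ are obtained by minimizing the following cost functions, respectively:

$$F_M^{*} = \arg\min_{F_M} \left[ \left\| Y^{T} - F_M \right\|_F^{2} + \beta_m \, \mathrm{tr}\left( F_M^{T} L_m F_M \right) \right]$$

$$F_D^{*} = \arg\min_{F_D} \left[ \left\| Y - F_D \right\|_F^{2} + \beta_d \, \mathrm{tr}\left( F_D^{T} L_d F_D \right) \right]$$

in which tr(·) and $\|\cdot\|_F$ are the matrix trace and the Frobenius norm (Xia et al., 2010), respectively. The trade-off parameters βm and βd are set to 1. The two prediction matrices $F_M^{*}$ and $F_D^{*}$ can be computed as

$$F_M^{*} = S_m \left( S_m + \beta_m L_m S_m \right)^{-1} Y^{T}, \qquad F_D^{*} = S_d \left( S_d + \beta_d L_d S_d \right)^{-1} Y$$

Finally, $F_M^{*}$ and $F_D^{*}$ are combined into a single prediction matrix with a linear mean method as follows:

$$F^{*} = \frac{\left( F_M^{*} \right)^{T} + F_D^{*}}{2}$$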
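The prediction step can be sketched in Python as below. This is a minimal illustration, assuming the closed-form LapRLS solution F* = S (S + βLS)⁻¹ Y of Xia et al. (2010), a drugs-by-microbes matrix Y, and similarity matrices for which S + βLS is invertible. The function names and toy matrices are hypothetical.

```python
import numpy as np

def normalized_laplacian(S):
    """Return D^{-1/2} (D - S) D^{-1/2}, where D holds the row sums of S."""
    deg = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return (np.diag(deg) - S) * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def laprls(S, L, targets, beta=1.0):
    """Closed form F* = S (S + beta L S)^{-1} targets, using a linear solve."""
    return S @ np.linalg.solve(S + beta * L @ S, targets)

def lrls_predict(S_m, S_d, Y, beta_m=1.0, beta_d=1.0):
    """Average the microbe-space and drug-space predictions (Y is nd x nm)."""
    F_m = laprls(S_m, normalized_laplacian(S_m), Y.T, beta_m)  # nm x nd
    F_d = laprls(S_d, normalized_laplacian(S_d), Y, beta_d)    # nd x nm
    return (F_m.T + F_d) / 2.0

# Toy inputs (hypothetical values): 3 drugs, 2 microbes.
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S_m = np.array([[1.0, 0.5], [0.5, 1.0]])
S_d = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])

F = lrls_predict(S_m, S_d, Y)  # nd x nm matrix of association scores
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical choice made in this sketch; it yields the same matrix as the inverse-based expression when the system is well conditioned.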

Performance Evaluation
The prediction performance of LRLSMDA is systematically evaluated in a cross validation framework. In k-fold cross validation, the set $\Omega^{+}$ of 1152 known microbe-drug associations is divided into k exclusive subsets:

$$\Omega^{+} = \Omega^{+}_{1} \cup \Omega^{+}_{2} \cup \dots \cup \Omega^{+}_{k}$$

with:

$$\Omega^{+}_{i} \cap \Omega^{+}_{j} = \emptyset \quad (i \neq j)$$

in which $\cup$ is the symbol of union, $\cap$ is the symbol of intersection, $\emptyset$ is the symbol of the empty set and $\Omega^{+}_{1}$ is the first exclusive subset. Each subset (e.g., $\Omega^{+}_{1}$) in turn acts as the test samples, with the remaining subsets as the training samples. Moreover, all the unknown microbe-drug associations are considered as the candidate associations. The k-fold cross validation is performed 100 times, and the average of the predictive results is taken as the final result.
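The partition of the known associations into k exclusive subsets can be sketched in Python as follows. This is an illustrative sketch using only the standard library; the function name `k_fold_partition`, the seed handling and the toy pairs are assumptions, and the model refit inside the loop is left as a comment.

```python
import random

def k_fold_partition(known_pairs, k, seed=0):
    """Shuffle the known pairs and split them into k disjoint, near-equal folds."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)
    return [pairs[i::k] for i in range(k)]

# 12 toy (drug, microbe) index pairs standing in for the known associations.
known = [(d, m) for d in range(4) for m in range(3)]
folds = k_fold_partition(known, k=5)

train_sizes = []
for test_fold in folds:
    held_out = set(test_fold)
    train = [p for p in known if p not in held_out]
    train_sizes.append(len(train))
    # Here one would set the held-out entries of Y to 0, refit the model on the
    # training pairs, and score the held-out pairs against all unknown pairs.
```

Repeating the procedure with different seeds and averaging the results corresponds to the 100 repetitions described above.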
In LOOCV, each known association is selected in turn as the test sample, with the remaining known associations as the training samples. Moreover, all the unknown microbe-drug associations are selected as the candidate associations.
Each test microbe-drug association is ranked relative to the candidate associations according to its predicted score. If its rank is above a given threshold, the test sample is considered to be correctly predicted.
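The ranking step can be sketched in Python as below. The sketch ranks a test sample among the candidate scores (rank 1 being the highest score) and computes an AUC as the fraction of test-candidate pairs in which the test sample scores higher, which is equivalent to sweeping the rank threshold. The function names and scores are hypothetical, and the tie handling (0.5 per tie) is an assumption of this sketch.

```python
def rank_among_candidates(test_score, candidate_scores):
    """1-based rank of the test sample; rank 1 means the highest score."""
    return 1 + sum(1 for s in candidate_scores if s > test_score)

def auc_from_scores(test_scores, candidate_scores):
    """Fraction of (test, candidate) pairs won by the test sample; ties count 0.5."""
    wins = 0.0
    for t in test_scores:
        for c in candidate_scores:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5
    return wins / (len(test_scores) * len(candidate_scores))

# Hypothetical predicted scores.
candidates = [0.9, 0.4, 0.3, 0.1]
tests = [0.5, 0.2]

rank = rank_among_candidates(0.5, candidates)
is_hit = rank <= 2   # correctly predicted at a rank threshold of 2
auc = auc_from_scores(tests, candidates)
```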

Comparison with other Models
To evaluate the predictive performance of LRLSMDA, we compare it with three other models, namely NBI (Cheng et al., 2012), HGBI (Wang et al., 2013) and SNMF (Wang et al., 2017). NBI is a network-based method to infer new interactions between drugs and targets. HGBI is a heterogeneous graph inference-based method that also infers hidden interactions between drugs and targets. As a matrix factorization-based method, SNMF can predict microbe-drug associations.
In LOOCV, we also compare LRLSMDA with the three other models. As shown in Fig. 3, LRLSMDA achieves an AUC value of 0.9095, while NBI, HGBI and SNMF achieve 0.7199, 0.8873 and 0.7622 in LOOCV, respectively. It is clear that the AUC value of LRLSMDA is 0.8983±0.0019 in 5CV when γ is equal to 2 1 . Figure 5 shows an increasing AUC trend for LRLSMDA, from 0.886±0.0014 to 0.9043±0.0015 in 10CV, when γ increases from 2 2 to 2 1 . LRLSMDA clearly performs best when the value of the parameter γ is 2 1 .
As Figs. 4 to 6 show, LRLSMDA performs best in 5CV, 10CV and LOOCV when γ = 2 1 .

Conclusion
Increasing evidence has shown that microbes play important roles in human health and disease. Identifying hidden microbe-drug associations helps in understanding the mechanisms of microbe-drug association in clinical treatment, drug discovery, drug combination and drug repositioning. In our study, LRLSMDA is proposed to predict human microbe-drug associations. In LRLSMDA, the microbe GIP kernel similarity, the comprehensive drug similarity and the known microbe-drug associations are combined to compute the association score between microbes and drugs. LRLSMDA achieves Area Under the Curve (AUC) values of 0.8983±0.0019, 0.9043±0.0015 and 0.9095 in 5CV, 10CV and LOOCV, respectively, which is a better performance than that of the three other models.