Bayesian Exploration of Multilocus Interactions on the Genome-Wide Scale

: Problem statement: Recent technological and scientific advances propelled the field of Genome-Wide Association Study (GWAS), which promises to be instrumental in linking many common complex diseases to their genetic origin. While so far such large-scale surveys have been moderately successful in identifying disease related genetic variants, much of disease heritability is still not accounted for by the discovered loci. There is an urgent need for advanced statistical methods for efficient automatic detection of complicated multilocus interactions on significant scales. Approach: Novel statistical methods based on Bayesian data analysis ideas, specifically Bayesian modeling, Bayesian variable partitioning, graphical and network models are promising to aid in search for missing disease heritability and shed light on complex biological processes involved in disease development. First crucial difference setting these methods apart from all the mainstream previous approaches (hypothesis testing methods) is their joint disease mapping capability via the simultaneous fitting of a statistical model for the whole case-control data set. Additionally, such Bayesian methods allow for the construction of complicated data models and quantitative incorporation of diverse prior information into the final statistical model. Results: The use of Bayesian techniques has already yielded new insights into the details of epistatic interactions across the genome associated with various important diseases. Conclusion/Recommendations: Bayesian approaches provide a way to detect and understand complicated multilocus interactions that already started to elucidate important disease pathways. As the field of GWAS matures, Bayesian strategies can surely aid in converting such multiple surveys into useful biomedical information.


INTRODUCTION
The promise of personalized medicine and genomics: Improved disease prevention and diagnosis as well as novel routes to therapies are the main motivations for extensive studies aimed at finding disease related genes and variants. Particularly, genetic tests capable of showing individual's risks to develop certain diseases would help to tailor preventive and therapeutic treatments to every single patient in order to achieve best possible results (Hall, 2010;McCarthy et al., 2008). While there are already a few companies offering 'consumer genomics' services to provide estimated disease risks via characterization of known genetic risk factors (Donnelly, 2008;Carmichael, 2010), currently this kind of information on genetic markers can give only a limited help in common illness propensity risk assessment (Donnelly, 2008;Hall, 2010). Even though a plethora of resources has been directed in this direction in the past dozen years, the genetic basis of common human diseases has not been identified for the most part (WTCCC, 2007). Recent emergence of successful strategies for the genome-wide association studies was supposed to provide the necessary tools for deciphering genetic causes of complex human illnesses like type 1 and 2 diabetes (Todd et al., 2007), rheumatoid arthritis and bipolar disorder (Hirschhorn and Daly, 2005;WTCCC, 2007).

An emergence and development of GWAS:
An examination of an immense number of genetic markers across the whole genome for multiple individuals with the goal of identifying variants-disease associations is known as Genome-Wide Association Study (GWAS). Novel scientific and technological advances (Metzker, 2010;Branton et al., 2008;Schaffer, 2012) made GWAS fully capable of unlocking the basis of complex diseases. Particularly, development of the International HapMap resource (IHMC, 2005) that simplified design and analysis of association studies, emergence of dense genotyping chips (Metzker, 2010;Svoboda, 2010) and assembly of large and characterized clinical samples (WTCCC, 2007) should be singled out as important factors in GWAS recent successful progress. While many disease loci have been identified in such surveys (WTCCC, 2007;Johnson and O'Donnell, 2009), discovered variants explain only a small proportion of the observed familial aggregation (McCarthy et al., 2008;Altshuler and Daly, 2007). This is known as a 'missing heritability problem' (Gibson, 2012). Currently there are three alternative mainstream ideas for the genetic architecture of complex diseases: the infinitesimal model, the rare allele model and the broad sense heritability model (Gibson, 2012). Thus, the most urgent contemporary debate that needs to be solved is regarding the architecture of complex human traits. While, 'common variant' hypothesis has come under a lot of criticism lately (Hall, 2010;Gibson, 2012), it is now necessary to dig deeper and choose which one of the alternative proposed architectures is closer to reality in order to help develop future studies efficiently (Gibson, 2012;Hall, 2010;Donnelly, 2008).
Beyond single-locus analysis: Despite striking success in the 20th century in pinpointing genes responsible for mendelian diseases, genetic origins of common complex diseases are, in fact, non-mendelian in nature (Zhang and Liu, 2007;Jiang et al., 2011). Particularly, gene-gene interactions are involved in many complex biological processes like metabolism, signal transduction and gene regulations and, thus, genetic variants in multiple loci may contribute to the disease formation together (Moore, 2003;Chen et al., 2011a). For example, breast cancer and type 2 diabetes have been linked to multi-SNP interactions (Chen et al., 2011a;Ritchie et al., 2001;Wiltshire et al., 2006). While most current bioinformatics approaches focus on detecting single-SNP associations, advanced statistical methods are necessary for multi-SNP association mapping because single-variant methods not only loose power when interactions exist but are, in fact, helpless in detecting rare mutations (Zhang, 2012). Also, the number of possible interactions is so vast that it is computationally unrealistic to search though all possible interactions in the genome for a large scale case-control study (Zhang et al., 2011a;Cordell, 2009).
Additional challenge for disease origin discovery comes from the statistical correlation between nearby variants known as linkage disequilibrium or LD (Zhang et al., 2011a;Kozyryev and Zhang, 2012). LD patterns have many important applications in genetics and biology (Wall and Pritchard, 2003) and arise due to shared ancestry for contemporary chromosomes (IHMC, 2005). Due to LD patterns, it is likely that there will be a lot of redundant positive signals in dense studies (Zhang, 2012). Later on we address in detail how Bayesian strategies can address the burning problems in genetics while dealing with epistasis and linkage disequilibrium.
Statistical approaches for GWAS: Currently, most of the approaches to disease association mapping employ the standard 'frequentist' attitude to the evaluation of significance (McCarthy et al., 2008). Particularly, such algorithms use hypothesis testing procedures to deal with one variant at a time (Zhang, 2012). The accepted threshold for the p-value is ~ 5×10 −8 (Risch and Merikangas, 1996;Hoggart et al., 2008;McCarthy et al., 2008). However, failures of such 'frequentist' methods to account for the power of a study and the number of likely true positives (McCarthy et al., 2008) combined with the increased likelihood to report a multitude of redundant associations (Zhang, 2012) sparked a wide interest in the Bayesian procedures. In this review we survey the challenges facing statistical geneticists while analyzing the GWAS data and outline how recently emerged Bayesian methods can help with the process. In addition to outlining the main differences between various proposed approaches, we highlight limitations and advantages of each method and describe future prospects in the field and how Bayesian approaches can aid in answering outstanding questions in biomedicine.

MATERIALS AND METHODS
Previously, we mentioned multiple complicated interactions that have to be considered while developing statistical models for understanding of the multilocus interactions. In Fig. 1 we summarize all the relevant interactions present in the GWAS in the graph. The ultimate goal is to be able to accurately understand all the shown couplings in large-scale case-control studies while also comprehending the biological processes that lead to disease development. Thus, while statistical understanding is important, developing methods that can point in the direction of the appropriate biological processes taking place is the next ultimate goal.

Overview of Bayesian data analysis:
Statistical conclusions about an unknown parameter θ (or unobserved data y unobs ) in the Bayesian approach to parameter estimation are described utilizing probability statements which are conditional on the observed data y: p(θ|y) and p(y unobs |y). Additionally, implicit conditioning is performed on the values of any covariates (Gelman et al., 2004). SNPs are shown as circles with color indicating their disease connection: 'green' SNPs are not associated with the phenotype of interest, 'blue' are marginally associated, 'brown' are influencing disease formation either through epistasis or they are in Linkage Disequilibrium (LD) with such variants and 'orange' ones can lead to disease formation through gene-environment interactions. LD between different variants is depicted as lines without arrows, while gene-gene and gene-environment couplings are represented by lines ending with arrows at both ends. In the paper we review which of the interactions in the graph can be efficiently discovered using the novel Bayesian approaches.
The concept of conditioning on the observed data is what separates Bayesian statistics from other inference approaches which estimate unknown parameter over the distribution of the possible data values while conditioning on the true, yet unknown parameter value (Gelman et al., 2004;Rice, 2006). At the heart of all the Bayesian approaches for detection of gene-gene interactions lies the concept of Bayesian inference and, specifically, Bayesian model selection. The goal is to determine the posterior distribution of all parameters in the problem (disease association, epistatic interactions and block structures), given the common variants data for the case-control study while incorporating prior beliefs about parameter values. The conditional probability of all parameters given the observed data is proportional to the product of the likelihood function of the data and prior distribution on the parameters (Rice, 2006): For all large data sets encountered in GWAS P(data) cannot be explicitly calculated (Zhang and Liu, 2007) and, therefore, P(parameters|data) can be known only up to the proportionality constant as shown in Eq. 1. However, advanced computational techniques (iterative sampling methods) can be used to determine posterior distribution of parameters (Liu, 2008;Rice, 2006). The main task is to make appropriate choices of statistical models to describe P (data|parameters) and also to choose appropriate prior distributions on the values of parameters: P(parameters).

Overview of Bayesian variable partition:
Instead of testing each SNP set in a stepwise manner (Marchini et al., 2005;Liu et al., 2011), Bayesian approaches fit a single statistical model to all of the data simultaneously (Zhang and Liu, 2007;Zhang et al., 2011a; allowing for increased robustness when compared to hypothesis testing methods (McCarthy et al., 2008;Zhang, 2012). Another advantage of Bayesian approach to the problem is the ability to quantify all the uncertainties and information and to incorporate previous knowledge about each specific SNP marker into the statistical model through the priors (Zhang and Liu, 2007;Rice, 2006). In the Bayesian model selection framework, we are interested in figuring out which of the set of models {M} is the most likely one given the observed data (X). In analogous way to Eq. 1, we can find the posterior probability for a particular model M i given data, by replacing parameters with M i : Thus, through comparison of P(M i |X) and P(M j |X) it can be determined using Eq. 2 whether model M i or M j is more likely (Rice, 2006). Now let's consider how this conceptual framework is applied in practice to the extraction of multilocus interactions in GWAS.
Epistasis analysis in genome-wide data sets: While statistical methods like BGTA (Zheng et al., 2006), MARS (Cook et al., 2004) and CPM (Nelson et al., 2001) are capable of detecting epistatic associations, the Bayesian Epistasis Association Mapping (BEAM) algorithm (Zhang and Liu, 2007) was the first practical approach capable of handling genome-wide casecontrol data sets. BEAM algorithm gives for each SNP marker posterior probabilities for disease association and epistatic interaction with other markers given the case-control genotype SNP data. The core of the Bayesian marker partition model used can be briefly summarized as follows.
BEAM can detect both interacting and noninteracting disease loci among a large number of variants. It is an application of Bayesian model selection procedure. Particularly, all the markers are split into three non-overlapping groups: (1) markers not associated with the disease, (2) marginally diseaseassociated variants and (3) those with interaction associated disease effect. Thus, using the priors on the marker memberships and Markov Chain Monte Carlo (MCMC) methods, posterior probabilities for group memberships are determined. Specifically, by interrogating each SNP marker conditionally on the current status of others via MCMC method the algorithm produces posterior probabilities (Zhang and Liu, 2007). Particularly, the genotype counts are modeled by the multinomial distribution with the frequency parameters described by the Dirichlet prior. In order to determine the posterior probability of each marker's group membership (represented by I) the Metropolis-Hastings (MH) algorithm (Liu, 2008) is used to sample from P(I|D,H) as given in Eq Where: D = The patient data set (with disease) H = The control data set (healthy) and then D0 D1 and D2 = Correspondingly partitions of the patient data set into three categories described above The assumption is that case genotypes at the disease associated markers will have different distributions when compared to control genotypes. Furthermore, the likelihood model assumes independence between markers in control group.
While BEAM algorithm was one of the first few to be able to handle GWAS data, it suffered from an assumption that SNPs dependence structure could be described by the Markov chain (Zhang et al., 2011a;Zhang and Liu, 2007). In fact, SNP markers are highly correlated within haplotype blocks which are separated by the recombination events (IHMC, 2005;Reich et al., 2001). Therefore, despite its successful approach, BEAM model is not able to capture the block-like human genome structure.
Incorporating block-type genome structure: A new Bayesian model that infers LD-blocks and chooses SNP markers in the blocks that are disease associated, therefore successfully incorporating diplotype blocks in the human genome into the Bayesian approach proposed by Zhang and Liu (2007) is known as BEAM2 algorithm developed by Zhang et al. (2011a). The statistical Bayesian model for the LD-block structure is summarized in Zhang et al. (2011a) and Kozyryev and Zhang (2012). The main assumption is that diplotypes of individuals come from a multinomial distribution with frequency parameters described by the Dirichlet prior and that genotype combinations of SNPs in different LD blocks are mutually independent, which is a good approximation to reality (Zhang et al., 2011a). Therefore, the compact expression for the marginal probability of the data for a specific block is given by Eq. 4: where, a block of SNPs considered is (s,…,b-1); Γ is the gamma function, a is the vector of Dirichlet parameters and n i refers to the number of counts for a specific diplotype. For joint inference of the diplotype blocks and disease association status we use the joint statistical model for the observed genotype data in cases and controls, the marker membership and block partition variable as in Eq. 5: Finally, in order to determine the posteriors P(B|D,H) and P(I|D,H) the model uses a combination of MH algorithm and Gibbs sampler (Liu, 2008;Zhang et al., 2011a).

Detailed
interaction partition structure determination: While successful in inferring epistatic interactions in GWAS, both BEAM and BEAM2 had a disadvantage of using saturated models which limited the ability of the algorithms to accurately determine the structure of the epistatic interactions among different disease related markers. However, recent studies showed that such interaction details arising due to encoding of the complicated regulatory mechanisms might play an important role in the disease formation (Zhang et al., 2011b;Yang et al., 2009;WTCCC, 2007). In order to be able to carefully explore the etiopathogenesis and genetic mechanisms of diseases, Zhang et al. (2011b) proposed the Recursive Bayesian Partition (RBP) algorithm. The RBP approach attempts to search for conditional independence and independence groups among interacting markers. RBP first recursively infers all the marginally independent interaction groups (no interaction between groups) and then infers the conditional independence within each group using chain-dependence model. RBP therefore successfully recursively determines dependence structure among interacting variants in the GWAS setting. Figure 2 shows an example of the possible outcomes of the RBP algorithm applied to GWAS data when determining the epistatic interactions independence structure.

Bayesian graph models and networks:
Here we describe BEAM3 algorithm developed by Zhang (2012) and how it improves on BEAM and BEAM2 models and what genetic problems and questions it can help to address. Through the use of Bayesian graphical method, BEAM3 detects flexible interaction structures instead of using saturated models (like BEAM and BEAM2 do), therefore, highly reducing the multi-SNP model complexity. Moreover, because only the disease association graphs are constructed, BEAM3 provides for higher computational efficiently in the GWAS settings (Zhang, 2012).
In detail, Zhang (2012) allowed for higher-order couplings via saturated interactions within cliques (nonoverlapping partition of SNPs) and pairwise interactions between them. It can be shown (Zhang, 2012) that the joint probability of all SNPs X, parameters, including disease graph and association status (G, I) and disease status indicator (Y) is given by:  Zhang et al. (2011b). In this simple example, five SNPs (numbered SNP1 though SNP5) were assumed to be associated with the phenotype in question. The independence groups within the set of those SNPs are singled out using circles/ovals and different colors. There is a strong conditional independence in the group of 'red' SNPs {2, 3, 4} while 'blue' SNP1 and SNP5 are independent of the other three disease-associated variants where, G = (C,∆) is an undirected disease graph constructed on disease associated SNPs (X 1 ) and including partition of SNPs into cliques (C) and interaction between cliques (∆); probability function of X 1 set under the phenotype association hypothesis is described by P A . Therefore, as can be seen from Eq. 6, only a few disease-associated SNPs are modeled (in set X 1 ) and hence a significant portion of computational time is saved due to avoiding explicit modeling of complicated dependence structures of all SNPs which could be millions (Zhang, 2012;Zhang and Liu, 2011;Jiang et al., 2011). Additionally, through the choice of a proper baseline probability function P 0 (X 1 ), the model automatically accounts for the complex LD effects among dense SNPs employing graphs. Thus, a significant number of repetitive false interactions are avoided reducing computational burden (Zhang, 2012). In a different direction, Bayesian methodology has been also applied to data-mining and machine learning approaches to improve detection of gene-gene interactions in GWAS. Chen et al. (2011a) proposed to use a Bayesian classification tree model for identification of multilocus interactions in the largescale data sets. The terminal nodes in the graph represent the partition of the feature space with each subject eventually being assigned to only one such node. Each terminal node is marked with the proportion of the case individuals assigned there, with green-colored nodes containing mainly controls and red-colored ones containing mainly cases. As an example, from this simple tree model we can conclude that subjects with SNP1 value of {1} and SNP2 value of {0,1} are very likely to have a disease in question, therefore it is possible that interaction between this common variants leads to the formation of the disease. The goal of the method is the determination of the posterior distribution for the binary tree given the observed data in case-control study Specifically, this kind of machine-learning approach produces tree-structure models where each nonterminal node determines the splitting rule based upon the predictor variables like SNP genotypes and edges between nodes correspond to different possible values for the variable in the top parent node. In summary, a path along such a tree till the terminal node represents a specific combination of predictor variables along the path, in such sense, accommodating for the multilocus interactions (Cordell, 2009;Chen et al., 2011a). For example, Fig. 3 shows an example of such a tree model. There are various ways for searching through tree space in such recursive partitioning approaches including greedy algorithms (Hastie et al., 2009), random forests approach (Breiman, 2001;Cordell, 2009) and MCMC (Chipman et al., 1998;Denison et al., 1998). Bayesian variable partition and Bayesian classification trees are, in fact, conceptually very similar in that prior is assigned to all the tree models with the purpose of controlling the tree size (Chen et al., 2011a). One main advantage of this approach is in a possible enhancement of finding probability for epistatic interactions with weak marginal effects due to ensuring the variable splitting through the prior specification (Chen et al., 2011a). Moreover, due to the adaptivity of the MCMC algorithm, such Bayesian tree models detect higher-order interactions by performing thorough searches near trees with the interacting variables determined in previous iterations (Chen et al., 2011a). However, one of the major drawbacks of such approaches is that they do not test for interactions directly, but instead allow for them, while testing for associations in the data (Cordell, 2009).

Even though practical Bayesian approaches for
GWAS multilocus interactions analysis have emerged relatively recently, such methods have already helped to make important advances in determination of disease etiology. Table 1 succinctly summaries all the Bayesian methods described above as well as their success in determination of the previously known disease loci and, more importantly, in the discovery of new multilocus interactions responsible for complex diseases. For example, Zhang et al. (2012) discovered 319 high-order interactions across the genome that can potentially explain the missing genetic component of the Rheumatoid Arthritis (RA) susceptibility. Moreover, their findings indicate that nervous system, in addition to autoimmune one, potentially performs a crucial role in RA development. This is an example of the statistical study in which disease underlying biological processes can be extracted from determined statistical associations.  (Zhang and Liu, 2007) Epistasis detection AMD GWA data set a More powerful than previous approaches BEAM2 (Zhang et al., 2011a) Epistasis/LD-block detection WTCCC T1D b Many previous loci+new two-way associations RBP (Zhang et al., 2011b) Detailed independence dbMHC c T1D data set Confirmed previously known saturated structure of epistasis interactions BEAM3 (Zhang, 2012) Bayesian graph model for WTCCC IBD d data set All previous IBD loci+2 new+2 interchr. f epistasis/LD interactions Bayesian Classification Tree Classification tree model/ Crohn's disease data Possible epistasis identified (Chen et al., 2011a) recursive partitioning Haplotype Block Differences Separate LD-block determination WTCCC T1D and RA e Detected differences around previously (Kozyryev and Zhang, 2012) for cases and controls chr6 data sets known loci + near new positions BEAM+BEAM2 High-order epistatic interactions WTCCC RA data set 319 high-order interactions found  study a Age-related macular degeneration genome-wide association data set with 116,204 SNPs for 96 cases and 50 controls. b Type 1 Diabetes (T1D) data generated by the Welcome Trust Case-Control Consortium; c This data contained resequenced haplotypes of exons for DRB1 and DQB1 genes in the MHC region . d Inflammatory bowel disease (IBD) data set from the Welcome Trust Case-Control Consortium with 2,005 IBD patients and 3,004 combined controls (Zhang, 2012). e Rheumatoid Arthritis (RA) and Type 1 Diabetes (T1D) data generated by the Welcome Trust Case-Control Consortium; f Interchromosome For sure, many more studies will follow in the near future that apply Bayesian methods either to existing GWAS data or to new large scale studies that will be produced soon (Hayden, 2012).

DISCUSSION
Certain issues need to be considered when using Bayesian approaches described above. For example, a combination of genotyping errors, disease heterogeneities and population substructures could have adverse effect on the statistical results of the methods (Zhang and Liu, 2007). Currently, the major problem with GWAS approaches is that the determined disease associated genetic regions explain only a small part of the disease heritability (Donnelly, 2008;WTCCC, 2007). However, it is possible that with the improved statistical methods outlined above the situation will soon change after the detailed understanding of the interactions involved emerges. Additionally, the main criticism of the GWAS based on the SNPs analysis, is that it is hard to understand the causal biology taking place in the disease formation (Hall, 2010); however, with the development of the recent Bayesian models that provide the detailed structure of the multilocus interactions (Zhang et al., 2011b; detailed etiopathogenesis of many diseases may soon be elucidated. Improvements to the Bayesian approaches mentioned in this article can include incorporation of environmental factors and population structures as covariates in the statistical model (Zhang et al., 2011b;Lobach et al., 2010). Another possible improvement is to impute untyped SNPs and missing genotypes from the reference panel (Zhang et al., 2011b;Zhang, 2011;Marchini et al., 2007). Moreover, utilization of sophisticated methods for incorporation of prior biological knowledge (like pathway topology) can increase the probability of making discoveries in association studies (Chen et al., 2011b).
While the main focus of this review article was on statistical methods to determine disease-related interactions among genetic variants, it is important to keep in mind the relationship between determined mathematical coupling and its biochemical underpinnings. Particularly, a common view is that disease development at large is prompted by biomolecular or protein-protein interactions at the molecular level (Cordell, 2009;Gibson, 2012;Jiang et al., 2011). While studying specifics of multilocus interactions has potential to convey the details of biological and biochemical disease pathways, the biological interpretation of the determined single-and multi-variant effects is the current burning issue in genetics (Cordell, 2009). The crust of the problem is the necessity to infer biological interaction from a statistical one and the straightforwardness of this process is highly debated by geneticists and epidemiologists (Cordell, 2009;Greenland, 2009). It has been even suggested that functional epistasis might not be detectable in the current GWAS as statistical interactions (Greenland, 2009;Vanderweele, 2009). Soon we will be able to solve this debate using actual results from the ongoing studies. One possible solution is to merge together datadriven and hypothesis-motivated approaches (Jiang et al., 2011;Hall, 2010).

CONCLUSION
In conclusion, Bayesian approaches are filling an important previously empty niche in bioinformatics and genomics research and the future of this scientific area looks extremely exciting and, for sure, will promptly bring a multitude of important surprises.