A Bayesian Method for Disentangling Dependent Structure of Epistatic Interaction

Problem statement: We propose a Bayesian method (RBP) to recursively i nfer the independence structure of epistatic interactions in case-control study. Approach: Based on the results of BEAM2, RBP can powerfully detect the marginal and c onditional independence within interacting SNPs even in the complicated interaction cases. Results: We did extensive simulations to test RBP and compare it with stepwise logistic regression. Simul ation results show that this approach is more power ful than stepwise logistic regression in detecting in m arginal independence and conditional independence a s well as more complicated dependence structure. We t hen applied BEAM2 and RBP on dbMHC Type 1 Diabetes (T1D) data and we found in MHC region, gen es DRB1 and DQB1 are associated with T1D with saturated interaction structure which is consi stent with the current knowledge of haplotype effec t of these two genes on T1D. Conclusion: RBP is a powerful method to infer detailed depende nce structures in epistatic interactions.


INTRODUCTION
Recently the methodology of Genome-Wide Association (GWA) has been greatly improved (WTCCC, 2007a;Zhang and Liu, 2007;Zhang et al., 2011;Yang et al., 2009).A Bayesian method (BEAM, Zhang and Liu, 2007) equipped with Monte Carlo algorithms has been shown able to powerfully detect high-order interactions in genome-wide association studies.This method uses Markov chain Monte Carlo (MCMC) to 'interrogate' each marker conditional on the current status of other markers iteratively.But one drawback of BEAM is the assumption that SNPs are independent to each other, thus BEAM is not able to capture the block-wise structure of human genome.Zhang et al. (2011) extended BEAM model to BEAM2, incorporating LD blocks into the original Bayesian partition model.This BEAM2 is able to simultaneously infer haplotype-blocks and select SNPs within blocks that are associated with the disease, either individually or through epistatic interactions with others across the genome.Using WTCCC1 type 1 diabetes data, BEAM2 identified the most previous reported associated SNPs and also landscaped the two-way interactions in MHC region.
Under the concept of common (complex) diseases, genetic variants typically show very little effect independently with low penetrance, but they may interact with each other in complex ways, i.e., they have complicated interacting structure, which is probably because of the sophisticated regulatory mechanisms encoded in the human genome (WTCCC, 2007a;Chambers and Hastie, 1992;Yang et al., 2009;WTCCC, 2007b;Zhang et al., 2011;Ding et al., 2005a;2005b).Both BEAM2 and BEAM2 use a saturated model to model the interaction group, thus neither of them can infer the detail interacting structure.However, knowing the detail structure is very important for investigating the etiopathogenesis and genetic mechanisms of the complex disease.
In this study, we propose a Bayesian method called Recursive Bayesian Partition (RBP) to recursively infer the marginal and conditional independence among interacting genetic markers.Given the associated markers inferred by BEAM2 or BEAM2, RBP first uses Independence Partition Model (IPM) to recursively infer all the marginally independent interaction groups, such that there is no interaction between/across different groups.That is, only within each group, there are some interactions.Then, RBP uses Chain-Dependence Model (CDM) to recursively infer the conditional independence within each interaction group (Zhang et al., 2010).
Using simulations we showed our method is more powerful than stepwise logistic regression using AIC or BIC in both marginal independence and conditional independence detections.In the complicated interaction simulations, our method is much more powerful than stepwise logistic regression.We applied RBP to the dbMHC type 1 Diabetes (T1D) data and we found genes DRB1 and DQB1 are associated with T1D with saturated interaction structure which is consistent with the current knowledge of haplotype effect of these two genes on T1D (Steenkiste et al., 2007).

MATERIALS AND METHODS
Recursive Bayesian Partition (RBP) we propose a Bayesian Partition model to search for independence groups and conditional independence among interacting SNPs.The whole procedure is done in two steps: first, we use Independence Partition Model (IPM) to partition all the interacting SNPs into several independence groups, i.e., there is no interactions across groups; thenwe use Chain-Dependence Model (CDM) to search for conditional independence within each groups.
Suppose there are N d sequences in the case group and N u sequences in the control group.Each sequence is p-SNP long and each SNP position i can take L i possible categories more generally, we can view each position as a random variable.Thus, our data consists of observations on each individual status (or response) variable Y, i.e., 0 for control and 1 for case and its p "explanatory" variables X 1 ,…, X p .

Independence Partition Model (IPM):
Our Bayesian Independence variable partition model seeks to partition the p variables into two groups: A and B. We say that the joint distribution for X G is a Independence Partition model if the index set G can be partitioned into disjoint subsets A and B, such that X A and X B are mutually independent, i.e., P(X G ) = P (X A )P(X B ) .We let Π denote the partitioning, i.e., indicating which indices in G belong to which subset.
We let D denote n iid observations (n = N d in cases and n = N u in controls) on (X G ) = (X A , X B ). Suppose X A takes on A N possible values, follows the distribution Multinom (Θ A ), where where, we define Bayesian recursive model selection (Zhang et al., 2010): In order to determine to include an independence model for control, we define a model indicator I C , which is equal to 1 if the variables 1, , P {X X } ⋯ in controls have the same group membership as in cases and 0 if the variables in controls are all independent of each other, in which case we have: where, p (X j |l c = 0,y = 0) is multinomial with probability vector Θ j , Multinom( Θ j ) and Θ j follows a Dirichlet distribution a priori with parameter a j = (a j1 ,a j2 ,…,a jGj ).Integrating out the multinomial parameters, we obtain that Eq. 3: Here the operation |a| sums over all elements in a.
We assume an equal probability prior for I C .Then, we can write the joint distribution as (D here includes both case data and control data) Eq. 4: In which p (D|Π,l c = 1,Y = 0), p (D|Π,l c = 0,Y = 1) and p (D|Π,l c = 1,Y = 1)are equal to expression (1) times (2) and p (D|Π,l c = 0,Y = 1) is equal to (3) .We use a MCMC algorithm to simulate from (4) so as to estimate the posterior distribution of I C and Π.After partitioning all the variables into A or B, we reapply the method to A and B superlatively.The procedure is applied recursively until only single-variable nodes are available and thus all the variables are grouped in several independent disjoint subsets.

Chain-Dependence Model (CDM):
After identifying all the independence groups, we use Chain-dependence Model to discover the conditional independence within each group.In the above IPM model, variables in A or B are not imposed any simplifying dependence structure, which in statistical term means that a "fully saturated" model was used.However, in practice often a much more desirable and simpler model, which takes advantage of conditional independence relationships among the variables, can fit the data well.In complex disease scenario, SNP1 could interact with SNP2, but does not directly interact with SNP3.SNP2 interacts with SNP3.Thus all the three SNPs are in the same independent group (i.e., IPM cannot separate them), but conditional on SNP2, SNP1 is conditionally independent of SNP3.Therefore, detecting conditional independence within each group can tell more detail about the relationship within each independence group.
We call it a chain-dependence model for a group of variables X G if the index set G can be partitioned into 3 subgroups A, B and C such that X A and X C are independent given X B , denoted as X A → X B → X C .Only set C is allowed to be empty, in which case this model degenerates to the saturated model.Under the chain-dependence model, we can decompose the joint distribution of XG as: p (X G ) = p (X A ) p (X A ) p (X B | X A ) p (X C |X B ).We let Π denote the set partitioning.
Suppose X A takes on N A possible values, follows the distribution Multinom (Θ A ), where, . The prior for Θ A is assumed to be the

Dirichlet distribution with pseudo-counts vector
denoted as Dir (β A ).Here we Furthermore, suppose X B takes on N B possible values, we assume that P (X B |X A ) is Multinom (Θ B|XA ), where, . We let the prior distribution for (Θ B|XA ) be Dir ( ) in the study.Lastly, suppose X C take on N C possible values.Then, P (X C |X B ) = Multinom (Θc|X B ) and . Suppose our data D consists of n iid observations on X G = (X A , X B , X C ).We can summarize the data into counts corresponding to X A , X B | X A and | X C X B and decompose the probability of the data conditional on the set partitioning Π as P where A k n is the number observations whose X A takes the kth categorical value.Integrating out the multinomial parameters, we obtain that Eq. 5: where, we define Similarly, we obtain that Eq. 6: where, recording the number of observations in which X A = I and X B = j Thus, . Finally, we get Eq.7: Thus, the probability of the observed data conditional on the set partitioning Π is the product of (5-7) Eq. 8: Two-locus interaction models with no marginal effect p1 = p (A) = 1 -q1 and p2 = p (B) = 1-q2 Thus, if we assign a prior on Π, such as uniform, we obtain its posterior distribution by combining the prior with (8).We then design an MCMC algorithm to sample from this posterior distribution.
Like IPM, we first partition variables into three disjoint subsets which follow the above chain-dependence model and then recursively apply this partition until the data do not allow us to discern more model details.

Simulations from HapMap data:
We did three sets of simulations to evaluate the powers of IPM, CDM and RBP and compared them with stepwise logistic regressions by AIC and BIC.First, we simulated two independent interaction groups Fig. 7a, conditional independent group Fig. 7b for IPM and CDM respectively.And then we simulated complicated interactions which contains 9 independent groups: 6 of them are composed of two-locus interactions Fig. 7a following three different disease models in Table 1-3 of them are composed of three-loci conditional independent interactions Fig. 7b with pairwise interactions (like interaction between D1 and D2, interaction between D2 and D3 in Fig. 5b following three different disease models in Table 1.For details see the following sections.
Three different disease interaction models: We simulated case control data according to 3 disease models given in Table 1.There are 2 disease loci involved in each model.In model 1, the disease risk is present only when both disease loci contain some mutations and the disease risk increases as the number of mutations increases.In model 2, the disease risk is again present only when both disease loci contain some mutations, but the risk is a constant corresponding to a threshold model.In model 3, there is no marginal effect but only interactions for two loci.

Marginal independent groups simulation for IPM:
For marginal independent groups A and B, (Group A and Group B are marginally independent in both populations (Y =1case and Y = 0 control)), the odds can be written as: There are two SNPs in group A and B following Model 1, 2 or 3 Table 1.
Conditional independent group simulation for CDM: For conditional independence, A and C are conditionally independent given B, the odds ratio for three groups A, B and C, is .
There is one SNP in group A, B and C, the interaction model of AB and BC follows Model1, 2 or 3 Table 1.

Complicated interaction simulation for RBP:
Finally we simulated a complicated interaction data which contains multiple marginal independent groups and within some groups there are conditional independence interactions.The logarithm of odds ratio is the summation of all the marginal independence groups and conditional independence groups.In our simulation there are totally 21 SNPs in 6 two-locus marginal independence groups and 3 three-locus conditional independence groups.

Simulations with block structures and interactions:
To simulate case control data, we first randomly select a region in the human genome that contains 1000 tagged SNPs in Illumina HapMap 300k tagSNPs.We then use HAPGEN (Marchini et al., 2007) to generate a pool of 10000 individuals and their genotypes within the region for tagged SNPs using HapMap European individuals (parents only).Four SNPs are then randomly selected as disease loci for marginal independence simulations and three SNPs are selected for conditional independence simulations.We choose the disease allele frequency in the population to be 0.05, 0.10, 0.20 and 0.50, respectively and we specify the marginal effect size (log odds ratio-1) of each disease loci to be 0.5 for Model 1 and 2 (θ = 0.5 in Table 1) and for Model 3 there is always no marginal effect, so we set interaction effect (Yang et al., 2009) parameter θ = 0.5.Given these parameters, we calculate the joint allele frequencies of both disease loci in cases and controls using the same method used in BEAM (Zhang and Liu, 2007) and BEAM2 (Zhang et al., 2011).We then randomly sample 2000 cases and 2000 controls according to the disease allele frequencies from the pool of 10000 individuals without replacement.
In the same way, we simulated a complicated interaction case-control data, where 21 SNPs are selected for 9 independent groups.Totally 6 groups follow two-locus interaction models (2 groups for each of Model 1, 2 and 3 in Table 1).3 groups arecomposed of three-locus conditional independent interactions (one group for each of Model 1, 2 and 3 in Table 1).

RESULTS AND DISCUSSION
Results of simulations from HapMap: We simulated case control data from HapMap data (see Methods) under three disease models and compared the power of IPM, CDM and RBP with stepwise logistic regression (Chambers and Hastie, 1992) using AIC and BIC.We first used BEAM2 (Zhang et al., 2011) to search for associated SNPs.We ranked SNPs in each dataset according to their association posterior probabilities, then calculate the power of each program from 50 datasets.The power is defined as the fraction of disease related SNPs ranked among the top ranked SNPs.A SNP is regarded as disease related if it is within 5 SNPs on either side of a true disease locus.The results are shown in Fig. 1 for marginal independence and Fig. 2 for conditional independence simulations.
Consistent with the simulation results from Zhang et al. (2011) , we observed that BEAM2 can identify most of the associated SNPs with good power for Model 1 and 2, but the performance for Model 3 is worse than the other two due to the fact there is no marginal effect in Model 3 at all.
Results for IPM and CDM compared with stepwise logistic regression: Then we focus on the associated SNPs (four in marginal independence simulations and three in conditional independence simulations) and compare the powers of IPM and CDM with stepwise logistic regression for correctly identifying interacting structures (marginal independence groups and conditional independence).For marginal independence simulations (see Methods), Fig. 3 show the powers of IPM and stepwise logistic regression (general model, i.e., genotypes are treated as categorical variables and additive model, i.e., genotypes (AA, Aa, aa) are treated as 0, 1, 2 numerical variables) using AIC and BIC.For conditional independence simulations, Fig. 4 shows the powers of CDM and stepwise logistic regression (general and additive models).Here IPM and CDM only report the model with highest posterior probability and stepwise logistic regression starts from the model with all the main effect terms, greedily adding or deleting one term at each step, until AIC or BIC stop dropping.The final model is reported by stepwise logistic regression.Then the power is calculated as the fraction of correctly inferred the interacting structures (models) among all the 50 simulated datasetsunder each parameter settings.From Fig. 3 and 4, it is clear that IPM or CDM outperform stepwise logistic regression in most parameter settings.And BIC is more powerful than AIC.

Results for RBP compared with stepwise logistic regression:
Then we simulated a complicated interaction model to access the power of RBP (see Methods).In this simulation, there are totally 21 disease-associated SNPs in 6 two-locus marginal independence groups (2 for each of the models in Table 1) and 3 three-locus conditional independence groups (one for each of the models in Table 1.RBP first uses IPM to infer the independence groups and then uses CDM to infer conditional independence within each group.Figure 5 shows the power boxplot of RBP for 50 simulation datasets.The power is the fraction of SNPs whoseinteractions are correctly inferred (i.e., the SNPs are inferred in the correct independent group with correct neighbors and relationships).It is clear that as interaction parameter (θ) increases, the power substantially increase and the minor allele frequencies (f) make little difference except that when f = 0.5, there is a big drop in power.We also tried stepwise logistic regression on these simulations, but due to complicated interactions, neither AIC or BIC (general model or additive model) can have positive power., genotypes (AA, Aa, aa) are treated as 0, 1, 2 numerical variables) using AIC and BIC.Theta is the interaction parameter θ in Table 1.MAF is minor allele frequency Fig. 4: For conditional independence simulations, Figure 4 shows the powers of CDM and stepwise logistic regression (general and additive models) using AIC or BIC.Theta is the interaction parameter θ in Table 1.MAF is minor allele frequency When reapplying IPM to the rest 26 SNPs, no independence was found (posterior probability >0.9).Then we applied CDM to the 26-SNP group, again no conditional independence was found (posterior probability >0.9) Validation of RBP with dbMHC T1D data: We used T1D data from dbMHC (http://www.ncbi.nlm.nih.gov/gv/mhc/) to validate our RBP method.The data contained resequenced haplotypes of exons of two MHC genes DRB1 and DQB1.Since it is well-known that in HLA DQ-DR region SNPs form associated haplotypes rather than interactions, RBP should find most associated SNPs in this region cluster in one big marginal independent group and no conditional independence within the group.We first applied BEAM2 to search for associated SNPs with posterior probabilities >0.5. Figure 6a shows the posterior probabilities and number SNPs with posterior probabilities >0.5.Then we applied RBP to detect marginal independence and conditional independence recursively.Figure 6b shows the procedure of RBP.First we applied IPM to all of the 28 associated SNPs.Only the 25th SNP comes out independently of all the others.
number observations whose X A takes the kth categorical value.Integrating out the multinomial parameters, we obtain that Eq. 1:

Fig. 1 :
Fig. 1: Power curves of BEAM2 on marginal independence simulations.Under each setting, the power is calculated as the proportion of disease-associated SNPs in 50 datasets identified within 5 SNPs from the top m SNPs ranked by posterior probability (m ranges from 1-100).Each data set contains 1,000 SNPs in 1000 cases and 1000 controls.The disease allele frequency in the population is 0.05 (pink), 0.10 (blue), 0.20 (green) and 0.50 (red), respectively

Fig. 5 :
Fig. 5: Shows the power boxplot of RBP for 50 complicated interaction simulation datasets under different parameter settings