The Analysis of High Dimensional Contingency Table with Interactions

Corresponding Author: Keorapetse Sediakgotla, Department of Statistics, University of Botswana, Botswana. Email: sediakgo@mopipi.ub.bw

Abstract: This paper focuses on the analysis of categorical data in a 2 × c × K contingency table. The theoretical framework of a 2 × c design is extended to 2 × c × K, with provision for testing interactions among subsets of either the lower or the upper columns of the designated table. The chi-square tests developed for the total interaction and for the partitions are shown to be significant, and their degrees of freedom are shown to be additive.


Introduction
The development of methods for analyzing categorical data began about four decades ago, after Bartlett (1935) commented on the lack of contributions on contingency tables of high dimension. Most data in the social sciences, and survey data in particular, come in categorized form, and researchers often use chi-square tests to test for association. Adamu (1969) observed that the common approach among researchers in the social sciences is to calculate an overall chi-square for each contingency table and, on the basis of the critical value for this test statistic, decide whether to accept or reject the hypothesis of association between the variables that form the basis of classification of the table. The decision based on the computed statistic may be simple for a 2 × c contingency table, but the challenges researchers face in interpreting multi-way contingency tables are enormous, especially when interactions among some of the variables are considered.

A number of authors have presented procedures for the analysis of interaction in multi-dimensional contingency tables (Ostle, 1954; Lewis, 1962). Lancaster (1960) considered a canonical form, or rather a class of canonical forms, for three-dimensional probability distributions subject to a rather mild restriction of fixed margins, and developed suitable tests of independence, which led to a consideration of the partition of χ² in the analysis of complex contingency tables. Ama (1992) developed a one-degree-of-freedom chi-square test for interaction in an r × c two-way contingency table, in a manner similar to Tukey's one-degree-of-freedom F-test for interaction in a two-factor experiment. The same author extended the test to an r × c × K three-way contingency table and, after re-parameterization of the interaction, developed a 1 d.f. chi-square test for the three-way interaction (Ama, 1994). Bishop et al. (2007) have shown that we can collapse over one or more classifications only if those classifications are independent of at least one of the remaining classifications. Seligman (n.d.) cautions that collapsing tables without due justification can lead to incorrect results. For instance, in that paper, if the sex classification is viewed as unimportant and the male and female categories are collapsed (summed), the resulting 2 × 2 table leads to rejection of the null hypothesis of independence of classifications in the collapsed table. Goodman (1964) proposed a definition of the r-th order interactions in an m-dimensional (d₁ × d₂ × ... × d_m) contingency table (r = 0, 1, 2, …, m−1) and presented methods for testing the hypothesis that any specified subset of these interactions is equal to zero. In addition, the author presented simple methods for obtaining simultaneous confidence intervals for these interactions or for any specified subset of them.

However, to the best of our knowledge, none of these authors gave a breakdown of the overall chi-square test for interaction in a multidimensional contingency table (especially when this test is significant) that accommodates the various interaction scenarios, such as the cases where one margin of an r × c × K contingency table is fixed or the total frequency is fixed.
According to Lewis (1962), although several important papers have appeared on this subject, the treatment of these contingency tables is widely neglected in standard textbooks. The paper by Adamu (1969) addressed this challenge but stopped short of an extension to the higher dimensional contingency tables usually encountered in problems in Education, Medicine, Science and the Social Sciences, particularly where the response variable is dichotomous while the other variables are multi-level. This paper extends the derivations of Adamu (1969) to higher dimensional tables and presents an alternative statistic and test to the overall chi-square test statistic for interaction in the case of a 2 × c × K contingency table. The paper is arranged in four sections. Section two, immediately following this introduction, contains the derivation of the generalized chi-square statistic for the 2 × c × K contingency table and the partial test statistics. Section three applies the methods to a real-life problem, while section four contains the results and discussion.

Derivation of the 2 × c × K Chi-Square Test Statistic
In a general three-dimensional r × c × K contingency table, r is the number of rows, c is the number of columns and K is the number of layers (r ≥ 2, c ≥ 2 and K ≥ 2). Let $f_{ijk}$ denote the observed frequency in the cell of the i-th row, j-th column and k-th layer. Similarly, let $P_{ijk}$ denote the probability of an observation belonging to the (ijk)-th cell. The marginal totals over the row, column, layer, column × layer, row × layer and row × column are given respectively as:

$$f_{.jk} = \sum_{i} f_{ijk}, \quad f_{i.k} = \sum_{j} f_{ijk}, \quad f_{ij.} = \sum_{k} f_{ijk}, \quad f_{i..} = \sum_{j}\sum_{k} f_{ijk}, \quad f_{.j.} = \sum_{i}\sum_{k} f_{ijk}, \quad f_{..k} = \sum_{i}\sum_{j} f_{ijk}.$$

The 'dot' notation indicates summation over the subscripts it replaces. A similar notation can be written for the cell probabilities, $P_{ijk}$, with $P_{...} = 1$. When the row classification is dichotomous, the general r × c × K contingency table becomes the 2 × c × K contingency table shown in Table 1.
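The dot notation can be made concrete with a short sketch. The 2 × 3 × 2 table of counts below is invented purely for illustration (it is not data from this paper), and the variable names are our own choice.

```python
# Hypothetical 2 x 3 x 2 table of counts f[i][j][k] (row i, column j, layer k).
# The numbers are invented for illustration only.
f = [
    [[10, 12], [8, 15], [6, 9]],
    [[20, 18], [22, 25], [14, 11]],
]
R, C, K = len(f), len(f[0]), len(f[0][0])

# One-way margins: sum over the two dotted subscripts.
f_i = [sum(f[i][j][k] for j in range(C) for k in range(K)) for i in range(R)]  # f_i..
f_j = [sum(f[i][j][k] for i in range(R) for k in range(K)) for j in range(C)]  # f_.j.
f_k = [sum(f[i][j][k] for i in range(R) for j in range(C)) for k in range(K)]  # f_..k

# Two-way margins: sum over the single dotted subscript.
f_jk = [[sum(f[i][j][k] for i in range(R)) for k in range(K)] for j in range(C)]  # f_.jk
f_ik = [[sum(f[i][j][k] for j in range(C)) for k in range(K)] for i in range(R)]  # f_i.k
f_ij = [[sum(f[i][j][k] for k in range(K)) for j in range(C)] for i in range(R)]  # f_ij.

N = sum(f_i)  # grand total f_...
print(f_i, f_j, f_k, N)
```

Each one-way margin sums to the same grand total, which is a quick check that the table has been indexed consistently.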
One interest in the analysis of contingency tables is usually to test whether there is mutual independence or association between the ways of classification, that is, to test the null hypothesis $H_0: P_{ijk} = P_{i..} P_{.j.} P_{..k}$ against the alternative hypothesis $H_1: P_{ijk} \neq P_{i..} P_{.j.} P_{..k}$. The test rejects the null hypothesis in favour of the alternative hypothesis if the overall chi-square test statistic in (1) is greater than $\chi^2_{(r-1)(c-1)(K-1)}(\alpha)$ at the α level of significance. Adamu (1969) presented a general formula for the overall chi-square in (1) when testing for the proportion of successes in a 2 × c contingency table, without consideration of the upper columns indexed by k. In this study we consider a 2 × c contingency table as assumed in Adamu (1969) and then provide an alternative overall chi-square test statistic, in which $e_{ij}$ is the expected frequency in the (ij)-th cell, $p_{ij}$ is the probability of an observation belonging to the (ij)-th cell, and $p$ is the estimated total probability. The generalized form of Equation 1 for a 2 × c contingency table then follows: adopting the quantities in Equation 2, and assuming the summation is a linear operator in which i and j can be used without restriction, we obtain the right-hand side of expression (3).
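As an illustration of the mutual-independence hypothesis stated above, the following is a minimal sketch of the textbook Pearson statistic for a three-way table, with expected counts estimated from the one-way margins. It is not a reconstruction of the paper's Equation 1. The function name, the invented counts, and the degrees of freedom it returns (the standard count rcK − r − c − K + 2 for this hypothesis when all three one-way margins are estimated) are assumptions of the sketch.

```python
def mutual_independence_chi2(f):
    """Pearson chi-square for H0: P_ijk = P_i.. * P_.j. * P_..k.

    f is a nested list of counts indexed as f[i][j][k]; every one-way
    margin is assumed to be non-zero. Returns (statistic, degrees_of_freedom).
    """
    R, C, K = len(f), len(f[0]), len(f[0][0])
    cells = [(i, j, k) for i in range(R) for j in range(C) for k in range(K)]

    # One-way marginal totals f_i.., f_.j., f_..k and the grand total N.
    f_i, f_j, f_k = [0] * R, [0] * C, [0] * K
    for i, j, k in cells:
        f_i[i] += f[i][j][k]
        f_j[j] += f[i][j][k]
        f_k[k] += f[i][j][k]
    N = sum(f_i)

    # Expected count under H0: N * (f_i../N) * (f_.j./N) * (f_..k/N).
    chi2 = 0.0
    for i, j, k in cells:
        e = f_i[i] * f_j[j] * f_k[k] / N ** 2
        chi2 += (f[i][j][k] - e) ** 2 / e

    # Standard degrees of freedom for this hypothesis (assumption of the sketch).
    df = R * C * K - R - C - K + 2
    return chi2, df


# Hypothetical 2 x 3 x 2 counts, invented for illustration only.
table = [
    [[10, 12], [8, 15], [6, 9]],
    [[20, 18], [22, 25], [14, 11]],
]
stat, df = mutual_independence_chi2(table)
print(f"chi-square = {stat:.3f} on {df} degrees of freedom")
```

The statistic would then be compared with the upper α point of the chi-square distribution on the stated degrees of freedom; a table whose counts factor exactly into row, column and layer effects gives a statistic of zero.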
We now consider the test statistic for interactions in a 2 × c × K contingency table. Let $f_{ijk}$ denote the observed frequency in the i-th row, j-th lower column and k-th upper column (layer) of the 2 × c × K three-dimensional contingency table, where i = 1, 2; j = 1, 2, …, c and k = 1, 2, …, K. We denote by x, y and z the three variables, with values in natural order.
The general test statistic for interaction in a 2 × c × K contingency table follows the notation of Lewis (1962), to which the reader is referred for details. Assuming the summations operate on i, j and k without restriction on ordering, we obtain the general test statistic of Equation 5.

In this study we interpret the χ² test in the context of a significant difference between proportions. Many practical situations require knowing whether the proportion in one group made up of the first X (lower columns), say $X_1$ to $X_\tau$, differs from that in the remaining lower columns $X_{\tau+1}$ to $X_c$. Similarly, we may wish to compare the first Z (upper columns), say $Z_1$ to $Z_t$, with the remaining $Z_{t+1}$ to $Z_K$. To accommodate these scenarios, we break down the general expression in Equation 5 as follows:

• To test the variation among the first subdivision of, say, lower columns or upper columns, a test statistic is derived from the overall chi-square.
• To test the variation among the last subdivision of, say, lower columns or upper columns, a test statistic is derived from the overall chi-square.
• To test for interaction, a test statistic is derived from the overall chi-square.

It is important to note that partitioning the variables in the lower and upper columns is convenient and arithmetically logical if the variables are in their natural homogeneous order, which makes it easy to compare between and within groups.
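The idea of breaking an overall chi-square into additive pieces can be illustrated in the simpler 2 × c case. The sketch below uses the classical decomposition for comparing proportions across columns, with the pooled proportion in the denominator of every component. It is offered only as an analogy under that assumption and is not claimed to be the paper's expressions for the 2 × c × K table. The function names, the split point `tau` and the counts are hypothetical.

```python
def weighted_ss(a, n):
    """Return sum_j n_j * (a_j / n_j - p_bar)^2 with p_bar = sum(a) / sum(n)."""
    p_bar = sum(a) / sum(n)
    return sum(nj * (aj / nj - p_bar) ** 2 for aj, nj in zip(a, n))


def partition_2xc(a, n, tau):
    """Split the 2 x c chi-square for proportions at column tau.

    a[j] is the number of 'successes' and n[j] the column total in column j.
    Returns a dict mapping component name to (chi-square value, df).
    """
    c = len(a)
    p = sum(a) / sum(n)          # pooled proportion over all columns
    scale = p * (1 - p)          # common denominator for every component

    total = weighted_ss(a, n) / scale                  # all c columns
    first = weighted_ss(a[:tau], n[:tau]) / scale      # within first tau columns
    last = weighted_ss(a[tau:], n[tau:]) / scale       # within remaining columns
    # Between groups: the two pooled groups compared as a 2 x 2 table.
    between = weighted_ss(
        [sum(a[:tau]), sum(a[tau:])],
        [sum(n[:tau]), sum(n[tau:])],
    ) / scale

    return {
        "total": (total, c - 1),
        "first": (first, tau - 1),
        "last": (last, c - tau - 1),
        "between": (between, 1),
    }


# Hypothetical counts for five ordered columns, invented for illustration only.
successes = [12, 15, 9, 30, 28]
totals = [50, 60, 40, 70, 65]
for name, (value, df) in partition_2xc(successes, totals, tau=3).items():
    print(f"{name:8s} chi-square = {value:7.3f}  df = {df}")
```

Because every component is scaled by the same pooled variance p(1 − p), the three partial values sum exactly to the total, and their degrees of freedom, (τ − 1) + (c − τ − 1) + 1, add up to c − 1.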

Data Analysis, Results and Discussion
The data below, taken from Woodward (2013), give the number of children who have caries and those who do not (a control group), classified by age of child (in months) and age of mother (in years).

Conclusion
This paper has given a breakdown of the overall chi-square that can be used to test for interactions in multidimensional contingency tables. The chi-square values for the partitions and interactions, as well as their degrees of freedom, should be additive.