GAUSSIAN COPULA MARGINAL REGRESSION FOR MODELING EXTREME DATA WITH APPLICATION

Regression is commonly used to determine the relationship between a response variable and predictor variables, with the parameters estimated by Ordinary Least Squares (OLS). This method assumes that the residuals are normally distributed with mean 0 and variance σ². However, the normality assumption is often violated by extreme observations, which are common in climate data. Modeling rice harvested area with rainfall as the predictor involves such extreme observations, so another approach is needed to handle them. The method used here is Gaussian Copula Marginal Regression (GCMR), a copula-based regression. As a case study, the method is applied to model the rice harvested area of rice production centers in East Java, Indonesia, covering the districts of Banyuwangi, Lamongan, Bojonegoro, Ngawi and Jember. The copula is chosen because it is not strict about distributional assumptions, especially normality, and because it can describe dependency at extreme points clearly. The performance of GCMR is compared with that of OLS and Generalized Linear Models (GLM). Identification of the dependency structure between the Rice Harvest per period (RH) and monthly rainfall shows a dependency in all study areas, and the copula type found to be significant is mostly the Gumbel copula. Comparison of model goodness of fit shows that GCMR is the appropriate method for modeling RH1 in the five districts and RH2 in Jember district, since it gives the lowest AICc. Judging from the distribution of the response variables, GCMR is suitable for modeling response variables that are not normally distributed and tend to be strongly skewed.


INTRODUCTION
The methods used to determine the pattern of the relationship between variables are correlation analysis and regression analysis. The most frequently used correlation measure is the Pearson correlation, and the most common parameter estimation method in regression analysis is Ordinary Least Squares (OLS). Both methods work well only if the data satisfy the assumption of normality. This assumption is often violated when the data contain extreme observations.
A later development that can be used when the response is not normally distributed is the Generalized Linear Model (GLM). This method requires that the relationship with the predictor variables be linear and that the distribution of the response variable be a member of the exponential family. Frequently used distributions are the binomial, Poisson, negative binomial, normal, gamma, inverse Gaussian and lognormal (McCullagh and Nelder, 1989).
If the form of the regression curve relating the predictor variables to the response is not known, a nonparametric approach is used. Embrechts et al. (2002) suggest using a copula approach to address violations of the normality assumption. A copula is a statistical method that describes the relationship between variables without being strict about distributional assumptions, in particular normality. A further advantage of the copula is that it can describe dependencies at extreme points clearly. In recent years, copulas have been widely used to model dependence structures in risk management (Villarini et al., 2008; Embrechts et al., 2001), climatology and meteorology (Vreac et al., 2005; Schölzel and Friederichs, 2008; De Michele and Salvadori, 2003) and other areas.

JMSS
The results of these studies showed that the copula method performs better when the normality assumption is violated. However, previous research has been limited to correlation and has not extended to causal relationships. A method that can be used to model causality in the presence of extreme events is Gaussian Copula Marginal Regression (GCMR), which has been used by Masarotto and Varin (2012).
Natural phenomena such as climate are often erratic over time, which leads to climate extremes (Bekti, 2009). Extreme weather events are among the problems that are most difficult to address in the agricultural sector. Currently, erratic rainfall patterns are causing a significant drop in national rice production. Climate does not affect rice production directly, but through the rice harvested area. Therefore, forecasts of the rice harvested area are needed as part of efforts to support food security.
Several studies of rice production involving climate indicators have been carried out in recent years. Regression models of the anomaly of harvested area per period on a weighted rainfall index, estimated with OLS, produce a small R² because of outlying observations (Miconnet et al., 2005; Sutikno et al., 2010). Other studies use El-Nino Southern Oscillation (ENSO) indicator variables analyzed by simple regression (Bekti, 2009). The results of these methods cannot be interpreted properly because the data do not meet the normality assumption owing to the extremes. Therefore, it is necessary to model the effect of rainfall on the rice harvested area using Gaussian Copula Marginal Regression. This method is expected to model properly the rice harvested area affected by extreme climate in the rice production centers of East Java, namely Lamongan, Bojonegoro, Ngawi, Jember and Banyuwangi.

MATERIALS AND METHODS
The data used in this study are secondary data obtained from the Central Statistical Agency and the Board of Meteorology, Climatology and Geophysics. The Central Statistical Agency data comprise the rice harvested area per subround, namely period I (January-April), period II (May-August) and period III (September-December), used as the response variables. The Board of Meteorology, Climatology and Geophysics data comprise monthly rainfall (January-December), used as the predictor variables and spanning 1990 to 2010. The data were collected from the rice production centers in East Java, namely Banyuwangi, Lamongan, Bojonegoro, Ngawi and Jember, as shown in Fig. 1.
Stages of the data analysis are described as follows:
• Create a scatterplot of variables X and Y to identify linear and nonlinear patterns between the two variables
• Calculate the correlation between variables X and Y with the Pearson correlation and Kendall's tau
Here i = 1, 2, 3, ..., n (n is the number of observations) and p = 1, 2, 3 (p is the period). RF1, RF2, RF3 and RF4 denote the rainfall in the first to fourth months of each period. The data are analyzed with the R software.

Copula Family
The two most popular copula families are the Archimedean copulas and the elliptical copulas. The elliptical family consists of the Normal copula and the t-copula, while the Archimedean family consists of the Clayton, Gumbel and Frank copulas.

Gaussian Copula
The Gaussian copula, or Normal copula, is obtained by transforming the random variables to the standard normal distribution. The Gaussian copula function is given in Equation 3. If the Normal copula is used with a multivariate normal distribution, it assumes a linear relationship (Schölzel and Friederichs, 2008).
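For reference, the Gaussian copula is usually written in the following standard textbook form. The notation is an assumption made here for illustration and may differ from that of Equation 3: Φ denotes the standard normal cumulative distribution function, Φ_Σ the m-variate normal distribution function with correlation matrix Σ, and u_1, ..., u_m the uniform margins.

```latex
C_{\Sigma}(u_1, \dots, u_m)
  = \Phi_{\Sigma}\!\left( \Phi^{-1}(u_1), \dots, \Phi^{-1}(u_m) \right),
  \qquad u_1, \dots, u_m \in [0, 1]
```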

Archimedean Copula
The members of the Archimedean family differ in their tail dependence: the Clayton copula has lower tail dependence, the Frank copula has no tail dependence and the Gumbel copula has upper tail dependence (Fig. 2). The generator of each copula is shown in Table 1.
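The tail behaviour described above can be quantified with the usual tail dependence coefficients. The sketch below uses the standard closed forms for the bivariate Clayton, Gumbel and Frank copulas; the function name `tail_dependence` is hypothetical and is not taken from the paper or from any R package.

```python
def tail_dependence(family, theta):
    """Return (lower, upper) tail dependence coefficients for a bivariate copula.

    Standard closed forms (assumed parameter ranges: Clayton theta > 0,
    Gumbel theta >= 1, Frank theta != 0).
    """
    if family == "clayton":
        if theta <= 0:
            raise ValueError("Clayton copula requires theta > 0 here")
        # Lower tail only: lambda_L = 2^(-1/theta)
        return (2.0 ** (-1.0 / theta), 0.0)
    if family == "gumbel":
        if theta < 1:
            raise ValueError("Gumbel copula requires theta >= 1")
        # Upper tail only: lambda_U = 2 - 2^(1/theta)
        return (0.0, 2.0 - 2.0 ** (1.0 / theta))
    if family == "frank":
        # Frank copula has no tail dependence in either tail
        return (0.0, 0.0)
    raise ValueError("unknown family: " + family)


if __name__ == "__main__":
    for fam, th in [("clayton", 2.0), ("gumbel", 2.0), ("frank", 5.0)]:
        print(fam, th, tail_dependence(fam, th))
```

A Gumbel copula with θ = 1 corresponds to independence, so its upper tail coefficient is zero; larger θ gives stronger dependence between joint high values.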

Transformation of Random Variables
The marginal distributions of the random variables X and Y, which are unknown, are given in Equation 4. The data can be transformed to the uniform domain [0, 1] by forming the rank plot of X and Y, in which each rank is divided by n + 1 (Equation 5), for i = 1, 2, ..., m. Based on this transformation, the copula can be written as in Equation 6 (Genest and Nešlehová, 2010).
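The rank transformation to the uniform domain can be sketched as follows. This is a minimal illustration that assumes ranks are scaled by n + 1 and that tied values receive their average rank; the function name `pseudo_observations` and the rainfall values are hypothetical.

```python
def pseudo_observations(values):
    """Map observations to (0, 1) by rank / (n + 1); ties get the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the block of tied values starting at sorted position pos
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return [r / (n + 1) for r in ranks]


if __name__ == "__main__":
    rainfall = [120.0, 310.0, 45.0, 310.0]  # made-up values in mm
    print(pseudo_observations(rainfall))
```

Dividing by n + 1 rather than n keeps every transformed value strictly inside (0, 1), which avoids evaluating a copula density on the boundary.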

Parameter Estimate
Parameter estimation for Archimedean copulas can be carried out using the Kendall's tau approach, as given in Equation 7. The Kendall's tau expressions for the Clayton, Frank and Gumbel copulas are shown in Table 2.
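As an illustration of the Kendall's tau approach, the sketch below computes the sample Kendall's tau and inverts the standard relations τ = θ/(θ + 2) for the Clayton copula and τ = 1 − 1/θ for the Gumbel copula. The function names are hypothetical, and these closed forms are assumed to match the entries of Table 2. The Frank copula is omitted because its relation involves a Debye function and has to be inverted numerically.

```python
def kendall_tau(x, y):
    """Sample Kendall's tau (tau-a): (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                score += 1   # concordant pair
            elif prod < 0:
                score -= 1   # discordant pair
    return score / (n * (n - 1) / 2)


def clayton_theta(tau):
    """Invert tau = theta / (theta + 2); None if theta would not be positive."""
    if tau >= 1.0:
        raise ValueError("tau must be below 1")
    theta = 2.0 * tau / (1.0 - tau)
    return theta if theta > 0 else None


def gumbel_theta(tau):
    """Invert tau = 1 - 1/theta; None if theta < 1 (Gumbel is then undefined)."""
    if tau >= 1.0:
        raise ValueError("tau must be below 1")
    theta = 1.0 / (1.0 - tau)
    return theta if theta >= 1.0 else None


if __name__ == "__main__":
    x = [1, 2, 3, 4]   # made-up data
    y = [1, 3, 2, 4]
    tau = kendall_tau(x, y)
    print(tau, clayton_theta(tau), gumbel_theta(tau))
```

A negative Kendall's tau gives θ < 1 for the Gumbel copula, in which case the function returns `None` because the Gumbel copula is not defined for that value.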

Ordinary Least Square
One estimation procedure for linear regression models is ordinary least squares. The idea of this method is to estimate the regression coefficients (β) by minimizing the sum of squared errors; the resulting estimator for β is given by Draper and Smith (1981).
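For reference, the least squares criterion and its minimizer in standard matrix notation are shown below. The symbols are assumptions made for illustration: y is the n × 1 response vector, X the n × p design matrix (assumed to have full column rank) and β the p × 1 coefficient vector.

```latex
S(\boldsymbol{\beta})
  = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{T}
    (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}),
\qquad
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}} S(\boldsymbol{\beta})
  = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}
```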

Generalized Linear Model (GLM)
The GLM extends the classical linear model to response variables that are not normally distributed. It has three components: a random component, a systematic component and a link function (McCullagh and Nelder, 1989; Agresti, 2007).

Gaussian Copula Marginal Regression
The general form of the Gaussian copula marginal regression model is given in Equation 10, where g(.) is a suitable function of the predictors and the errors and λ is a parameter vector. Among the many possible choices of g(.), the model selected here is given in Equations 11-13, in which F(.) denotes the cumulative distribution function of Y_i given x_i and Φ(.) the standard normal cumulative distribution function. When the model uses the Weibull distribution, µ_i = exp(x_i^T β) and λ = β (Masarotto and Varin, 2012).
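The construction can be sketched numerically for a Weibull margin. The example assumes the formulation Y_i = F^{-1}(Φ(ε_i); λ) of Masarotto and Varin (2012), treats µ_i = exp(x_i^T β) as the Weibull mean and takes the shape parameter as given; the function name and all numeric values are hypothetical, and this is not the authors' fitted model.

```python
import math
from statistics import NormalDist


def gcmr_weibull_response(x, beta, eps, shape):
    """Map a standard normal error eps to a Weibull response with mean exp(x'beta).

    Assumptions (illustrative): log link for the mean and a known shape parameter.
    """
    mu = math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
    # Weibull mean = scale * Gamma(1 + 1/shape), so solve for the scale
    scale = mu / math.gamma(1.0 + 1.0 / shape)
    # Probability integral transform of the normal error
    u = NormalDist().cdf(eps)
    # Weibull quantile function F^{-1}(u) = scale * (-ln(1 - u))^(1/shape)
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


if __name__ == "__main__":
    x = [1.0, 2.0]        # intercept and one made-up predictor value
    beta = [0.5, 0.25]    # made-up coefficients
    for eps in (-1.0, 0.0, 1.0):
        print(eps, gcmr_weibull_response(x, beta, eps, shape=1.5))
```

Because both Φ and the Weibull quantile function are increasing, larger errors map to larger responses, while the dependence between observations is carried entirely by the correlation of the normal errors.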

Akaike Information Criteria Corrected
One of the most frequently used information criteria is the AIC:

$$\mathrm{AIC} = -2 \ln L(k) + 2k$$

where L(k) is the likelihood function and k is the number of parameters. If the sample is relatively small (n/k < 40), the criterion used is the corrected AIC, AIC_C (Hu, 2007):

$$\mathrm{AIC}_C = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}$$
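The correction can be computed directly from the maximized log-likelihood. The sketch below is a minimal illustration; the function names and the numeric inputs are hypothetical and are not results from this study.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion from the maximized log-likelihood."""
    return -2.0 * log_likelihood + 2.0 * k


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(log_likelihood, k) + 2.0 * k * (k + 1) / (n - k - 1)


if __name__ == "__main__":
    # Made-up values: log-likelihood -100, 5 parameters, 21 observations
    print(aic(-100.0, 5), aicc(-100.0, 5, 21))
```

With n = 21 and k = 5 the ratio n/k is well below 40, so the correction term 2k(k + 1)/(n − k − 1) = 4 is not negligible relative to differences between competing models.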

Data Exploration and Patterns Identification of Relationships Between the Response Variable and the Predictor
In the following discussion we present in detail the findings for one regency, namely Banyuwangi. The results for the other areas are presented in summary form.
The relationship between rainfall and rice harvested area in Banyuwangi does not show a clear pattern, although several adjacent points indicate a relationship between the two, as presented in Fig. 3. In addition, the Pearson correlation and Kendall's tau cannot explain the relationship well because the two tests give different results. The correlation results indicate that most of the rice harvested area variables do not have a close relationship with rainfall (Table 3). The unclear relationship between the two variables is suspected to be caused by extreme observations (outliers).
In addition, the Anderson-Darling normality test shows that most of the data do not follow a normal distribution, namely RH1, RH3, April rainfall and rainfall from June to December. Only RH2 and rainfall in January, February, March and April follow a normal distribution. Therefore, the dependencies are analyzed further using the copula approach.
Table 4 presents the copula parameter estimates obtained with the Kendall's tau approach. For several variables one copula (Gumbel) cannot be estimated because the resulting value of θ is less than 1. The copula type is selected as the one that yields the largest log-likelihood. The dependencies between two variables can be seen visually in the copula rank plots in Fig. 4, which show dependencies between rice harvested area and rainfall. For instance, the relationship between RH3 and October rainfall shows lower tail dependence, which is characteristic of the Clayton copula: there are extreme points in the lower region, and when October rainfall falls, the rice harvested area also decreases. An example of upper tail dependence is the relationship between January rainfall and RH1 (following the Gumbel copula): the higher the rainfall in January, the larger RH1. The relationship between May rainfall and RH2 follows the Normal copula, and the relationship between the two variables is linear.
Using the same procedure, we identify the dependency structure between rice harvested area and rainfall in the other four rice production districts in East Java, as shown in Table 5.
Identification of the dependency structure between the rice harvest per period and monthly rainfall shows a dependency in all study areas. The copula type found to be significant is mostly the Gumbel copula. This illustrates that the rice harvested area is highly dependent on rainfall, especially in the third period.

Selection of the Best Model
Among the three methods, GCMR gives the lowest AIC_C for modeling RH1 in all five regencies. Modeling RH2 and RH3 in Banyuwangi and Ngawi, as well as RH2 in Lamongan, is more appropriately done with OLS because it gives the lowest AIC_C. The GLM method is more appropriate for modeling RH2 in Bojonegoro and RH3 in Bojonegoro, Jember and Lamongan, as presented in Table 6.
In Banyuwangi, if January rainfall is (x + 1) mm, RH1 tends to be 1.0005 times its value when rainfall is x mm. If May rainfall rises by 1 mm, RH2 tends to increase by 30.09 ha, and if June rainfall rises by 1 mm, the rice harvested area tends to increase by 11.51 ha. It can therefore be concluded that the rainfall with the greatest effect on RH2 in Banyuwangi is May rainfall, because it produces the largest change in RH2.
Similarly, the RH3 model for Banyuwangi indicates that a 1 mm increase in September rainfall tends to reduce the rice harvested area by 9.029 ha. Meanwhile, a 1 mm increase in October rainfall tends to increase the rice harvested area by 36.33 ha, a 1 mm increase in November rainfall by 16.486 ha and a 1 mm increase in December rainfall by 13.179 ha. The regression models for each district are presented in Table 7.
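The multiplicative reading of the January rainfall effect on RH1 is what a log link for the mean would produce, as in the Weibull specification µ_i = exp(x_i^T β). The short derivation below assumes such a model with the other predictors held fixed; the symbols β_0 and β_1 are introduced here for illustration, and the implied coefficient is back-calculated from the reported factor rather than read from Table 7.

```latex
\frac{\mu(x + 1)}{\mu(x)}
  = \frac{\exp\{\beta_0 + \beta_1 (x + 1)\}}{\exp\{\beta_0 + \beta_1 x\}}
  = \exp(\beta_1),
\qquad
\exp(\beta_1) = 1.0005
  \;\Longrightarrow\;
\beta_1 = \ln(1.0005) \approx 0.0005
```

In contrast, the additive effects reported in hectares for RH2 and RH3 correspond to coefficients on the original scale of the response.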

CONCLUSION
Identification of the dependence structure between rice harvested area per subround and monthly rainfall shows a dependency in all study areas. The copula type found to be significant is mostly the Gumbel copula, which is characterized by upper tail dependence. This illustrates that the rice harvested area is highly dependent on rainfall, especially in subround III.
Comparing the three methods (OLS, GLM and GCMR) for modeling rice harvested area shows that GCMR is better for modeling response variables that are not normally distributed and tend to be strongly skewed. GCMR also performs better than GLM in dealing with non-normal response variables.

ACKNOWLEDGEMENT
The authors thank Higher Education (DIKTI) for supporting this research through the National Strategic Research Grant (STRANAS).