Density Power Downweighting and Robust Inference: Some New Strategies

Preserving the robustness of the procedure has become almost a default requirement for statistical data analysis. Since efficiency at the model and robustness under misspecification of the model are often in conflict, it is important to choose inference procedures that provide the best compromise between these two concepts. Some minimum Bregman divergence estimators and related tests of hypothesis seem to do well in this respect, with the procedures based on the density power divergence providing the existing standard. In this paper we propose a new family of Bregman divergences which is a superfamily encompassing the density power divergence. This paper describes the inference procedures resulting from this new family of divergences, and makes a strong case for the utility of this divergence family in statistical inference.


Introduction
In statistical modeling, parameter estimation is an inevitable and formidable task. Accurate estimation of the model facilitates the characterization and the subsequent understanding of the mechanism that generates the observed data. Statistical distances can be useful tools for the estimation of the model parameters.
Statistical distances can be naturally applied to the case of parametric statistical inference. The most important idea in parametric minimum distance inference is the quantification of the degree of closeness between the sample data and parametric model as a function of an unknown set of parameters through a suitable distance-like measure. Thus the estimate of the parameter is obtained by minimizing this "distance" over the parameter space.
It is worthwhile to mention here that the class of distances which we will consider are not mathematical metrics in the strict sense of the term. They may not be symmetric in their arguments and may not satisfy the triangle inequality. The only properties that we require of these measures are that they should be nonnegative, and should equal zero if and only if the arguments are identically equal. However, we will, somewhat loosely, continue to call them distances, or "statistical distances". In a practical sense, the word "divergence" is a good descriptor of these measures. We will, in fact, use the "minimum distance" and the "minimum divergence" terminologies interchangeably.
Density-based divergences form a special class of statistical distances. Several minimum distance estimators in this family have high model efficiency. In particular, the maximum likelihood estimator (MLE) also belongs to the class of density-based minimum distance estimators, being the minimizer of the likelihood disparity (Lindsay, 1994), which is a version of the Kullback-Leibler divergence. But one of the major drawbacks of the MLE is that it is notoriously nonrobust and even a small proportion of outlying observations can lead to meaningless inference. In fact it is the failure of the classical methods like maximum likelihood to deal with outliers and mild deviations from the model which had led to the emergence of the field of robustness; see, for example, [HR09], [HRRS86], [MMYSB19] and [BSP11]. However, some of the other members of the class of minimum distance estimators have been observed to do much better in the sense of combining strong robustness with high model efficiency. See, for example, [Csi63], [AS66], [Lin94], [Par05] and [BSP11] for a description of the φ-divergence class of minimum distance measures.
A more modern class of minimum distance estimators is based on the family of Bregman divergences. The Bregman divergence (Bregman, 1967) is a distance-like measure between points and has been used in mathematics and information theory for some time. When the points are represented by probability distributions, the corresponding Bregman divergence is a statistical distance. See, for example, [JB90], [Csi91], [BMDG05] and [SV12] for some examples of statistical and related applications of the Bregman divergence. The principal representatives of Bregman divergence estimators in the current statistical literature are the minimum density power divergence estimators (MDPDEs), based on the density power divergence (DPD) class of [BHHJ98]. Over the last two decades, this class of divergences has provided a popular and frequently used method to balance the trade-off between robustness and efficiency in parameter estimation, hypothesis testing, and related inference. The minimum divergence estimators based on the DPD have been shown to provide a high degree of stability under model misspecification, often with minimal loss in model efficiency. Our primary purpose in this paper is to refine the minimum distance procedure based on the DPD, so as to achieve an even better compromise between efficiency and robustness.
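The Bregman divergence between the density $g$ and the model density $f_\theta$ is given by
$$D_B(g, f_\theta) = \int \left\{ B(g(x)) - B(f_\theta(x)) - \left(g(x) - f_\theta(x)\right) B'(f_\theta(x)) \right\} dx, \qquad (1)$$
where the index function $B(\cdot)$ is strictly convex and $B'(\cdot)$ represents its first derivative with respect to its argument. In practice, where $f_\theta$ is the pdf of the parametric family and $g$ is the true density, the minimization of the above divergence over the parameter space $\Theta$ will generate the corresponding minimum distance functional, which can lead to meaningful inference depending on the form of the function $B(\cdot)$. The DPD, defined later in this section, is a special case of the Bregman divergence for $B(y) = y^{1+\alpha}/\alpha$, $\alpha \ge 0$. When the model is differentiable, the general estimating equation under the divergence in Eq. 1 is
$$\int \nabla f_\theta(x)\, B''(f_\theta(x)) \left(g(x) - f_\theta(x)\right) dx = 0, \qquad (2)$$
or equivalently
$$\int u_\theta(x)\, B''(f_\theta(x))\, f_\theta(x)\, g(x)\, dx - \int u_\theta(x)\, B''(f_\theta(x))\, f_\theta^2(x)\, dx = 0, \qquad (3)$$
where $u_\theta(x) = \nabla \log f_\theta(x)$ is the score function of the model $f_\theta(x)$, $\nabla$ represents the derivative with respect to $\theta$, and $B''(\cdot)$ represents the second derivative of $B(\cdot)$ with respect to its argument. Since $G$ is unknown, we construct an empirical version of the divergence in Eq. 1, or of the estimating equation given in Eq. 3, by replacing $G$ (the true data generating distribution) by its empirical counterpart $G_n$. This leads to a class of unbiased (under the model) estimating equations
$$\frac{1}{n}\sum_{i=1}^{n} u_\theta(X_i)\, B''(f_\theta(X_i))\, f_\theta(X_i) - \int u_\theta(x)\, B''(f_\theta(x))\, f_\theta^2(x)\, dx = 0. \qquad (4)$$
The root of Eq. 4 is defined to be the minimum Bregman divergence estimator (MBDE).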
Here the robustness of the corresponding minimum distance estimator may be at least partially understood by observing the effect of the downweighting function $B''(f_\theta(x)) f_\theta(x)$ on $u_\theta(x)$ for less probable values of $x$ under $f_\theta$. For the DPD, this weight becomes $(\alpha + 1) f_\theta^{\alpha}(x)$. In this paper we attempt to find a refinement of the DPD downweighting scheme, and, by reconstruction, a corresponding divergence. We will show that the corresponding minimum distance procedure provides a better compromise between robustness and efficiency in many cases compared to the minimum density power divergence estimator (MDPDE).

The Density Power Divergence
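As mentioned earlier, the density power divergence (DPD) is obtained by substituting $B(y) = y^{1+\alpha}/\alpha$ in Eq. 1. The general form of this divergence, as a function of a nonnegative tuning parameter $\alpha$, is
$$d_\alpha(g, f_\theta) = \int \left\{ f_\theta^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) g f_\theta^{\alpha} + \frac{1}{\alpha}\, g^{1+\alpha} \right\}. \qquad (5)$$
For simplicity we have dropped the dummy variable in the above equation. One can define the minimum DPD functional $T_\alpha(G)$ at $G$ through the relation $d_\alpha(g, f_{T_\alpha(G)}) = \min_{\theta \in \Theta} d_\alpha(g, f_\theta)$.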
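As a minimal sketch of how an MDPDE can be computed in the simplest case, consider the $N(\mu, 1)$ location model. There the integral term of the estimating equation vanishes by symmetry, so the MDPD estimating equation reduces to $\sum_i (X_i - \mu) f_\mu^{\alpha}(X_i) = 0$, which can be solved by an iteratively reweighted mean. The function names, the median starting value and the toy sample below are assumptions made for illustration only.

```python
import math
import statistics

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def mdpde_location(data, alpha, tol=1e-10, max_iter=500):
    """Hypothetical helper: solve sum_i (x_i - mu) * f_mu(x_i)**alpha = 0
    by fixed-point iteration (a reweighted mean)."""
    mu = statistics.median(data)  # robust starting value (assumption of this sketch)
    for _ in range(max_iter):
        weights = [normal_pdf(x, mu) ** alpha for x in data]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Toy sample: eight observations near 0 and two gross outliers.
sample = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.1, -0.7, 0.5, 6.0, 7.5]
print("sample mean (alpha = 0):", statistics.fmean(sample))
print("MDPDE with alpha = 0.5 :", mdpde_location(sample, alpha=0.5))
```

With $\alpha = 0$ all weights equal one and the iteration returns the sample mean, which is the MLE; a positive $\alpha$ shrinks the weights of the two outlying points towards zero.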

A New Divergence
Our key philosophy for constructing new divergences and estimation strategies involves manipulating the downweighting factor $B''(f_\theta(X_i)) f_\theta(X_i)$ in Eq. 4. Here we are going to develop a stronger downweighting effect compared to the MDPD estimating equation. Our exploration will generate an estimation scheme with two tuning parameters and we will explore the possibility of coming up with specific candidates which might beat the MDPDEs both in terms of efficiency and robustness.

Choosing the B Function
The downweighting effect on the score $u_\theta(x)$ applied by the MDPD estimating equation is $f_\theta^{\alpha}(x)$. As we want to impose a stronger downweighting in relation to this, we wish to choose the $B$ function (or rather, the $B''$ function) so that as $x \to 0^+$, $x B''(x)$ converges to zero faster than $x^{\alpha}$ for $\alpha > 0$ fixed. (Note, from Eq. 4, that the downweighting term for $u_\theta(x)$ in the general Bregman divergence is $f_\theta B''(f_\theta)$.) In particular, we will assume the following conditions on $B''$.
(P2) $x B''(x)$ is an increasing function of $x$ on $(0, \infty)$. Thus the less likely observations will be downweighted more.
$\lim_{x \to 0^+} x B''(x) / x^{\beta} = 0$, i.e., the Bregman formulation attaches weights to the score function which go to zero at a rate faster than the corresponding weights in the MDPD estimating equation.
To prove that such a choice of $B(\cdot)$ satisfying (P1)-(P4) can help us generate divergences which have the desired properties and provide superior inference compared to the DPD, let us first demonstrate the general asymptotic properties of the minimum Bregman divergence estimators. For ease of representation, we refer to the divergence generated by a $B(\cdot)$ function satisfying (P1) to (P4) as the φ-DPD.

General Asymptotic Properties of the MBDE
We need some regularity assumptions to prove the asymptotic properties of the general MBDE, which we list below.
(A1) The pdfs f θ of X have common support, so that the set X = {x|f θ (x) > 0} is independent of θ. The distribution G is also supported on X , on which the corresponding density g is greater than zero.
(A2) There is an open subset $\omega$ of the parameter space $\Theta$, containing the best fitting parameter $\theta_g$ (where $D_B(g, f_{\theta_g}) = \inf_{\theta \in \Theta} D_B(g, f_\theta)$), such that for almost all $x \in \mathcal{X}$, and all $\theta \in \omega$, the density $f_\theta(x)$ is three times differentiable with respect to $\theta$ and the third partial derivatives are continuous with respect to $\theta$. (The best fitting parameter $\theta_g$ depends on the index function $B(\cdot)$ also, but we suppress that notation for brevity.)
is bounded by some universal constant then the following hold. (a) The usual DPD defined in Eq. 5 is a special limiting case of φ-DPD. (b) If g = f θ for some θ ∈ Θ and if for the DPD there exists α, β such that the asymptotic relative efficiency (ARE) of the estimator under tuning parameter β is greater than that of the estimator under tuning parameter α, then there exists γ such that φ-DPD with tuning parameter (β, γ) generates an estimator with higher ARE than the MDPDE with tuning parameter α.
Our choice for $B''(\cdot)$ in the LφDPD case is
$$B''(x) = \frac{1}{\gamma}\, x^{\beta} \log\left(1 + \frac{\gamma}{x}\right), \qquad 0 < \beta \le 1, \quad 0 < \gamma \le 1, \quad x > 0.$$
The corresponding $B$ function may be expressed in the integral form as
$$B(x) = \frac{1}{\gamma} \int_0^x \int_0^t s^{\beta} \log\left(1 + \frac{\gamma}{s}\right) ds\, dt. \qquad (16)$$
Obviously other choices are possible, but we have found the LφDPD to be a very useful divergence for our purpose, and for the rest of the paper all our illustrations will be in relation to the LφDPD. We will refer to the corresponding minimum distance estimator as the MLφDE.
θ (y) log(1 + γ/f θ (y)) is finite, which is indeed the case for most parametric models, suggesting the observed robustness of the MLφDE under those parametric models.
In Figure 1 it is clearly seen that the tuning parameter β has a significant impact on the robustness of the estimator and the influence functions redescend faster for larger values of β. On the other hand, for fixed β the influence functions are somewhat closer for different γ as seen in Figure 2. It suggests that γ has a less pronounced impact on robustness than β, although the graphs in Figure 2 indicate that larger γ lead to relatively stronger downweighting.
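The effect of the two tuning parameters on the downweighting can be checked numerically. The sketch below compares the DPD factor $x^{\beta}$ (constants dropped) with the LφDPD factor $x B''(x)$, under the assumption that $B''(x) = \gamma^{-1} x^{\beta} \log(1 + \gamma/x)$; the function names are hypothetical.

```python
import math

def dpd_weight(x, beta):
    """DPD downweighting factor x**beta (multiplicative constant dropped)."""
    return x ** beta

def lphi_weight(x, beta, gamma):
    """Assumed LphiDPD factor x * B''(x) = x**(beta + 1) * log(1 + gamma / x) / gamma."""
    # log1p keeps the computation accurate when gamma / x is tiny
    return x ** (beta + 1.0) * math.log1p(gamma / x) / gamma

beta, gamma = 0.5, 0.5
for x in (0.4, 0.1, 1e-2, 1e-4, 1e-6):
    ratio = lphi_weight(x, beta, gamma) / dpd_weight(x, beta)
    print(f"x = {x:8.1e}   DPD = {dpd_weight(x, beta):.3e}   "
          f"LphiDPD = {lphi_weight(x, beta, gamma):.3e}   ratio = {ratio:.3e}")

# As gamma shrinks, the LphiDPD factor approaches the DPD factor x**beta.
print(lphi_weight(0.3, beta, 1e-8), dpd_weight(0.3, beta))
```

The ratio of the two factors is $x \log(1 + \gamma/x)/\gamma$, which tends to zero as $x \to 0^+$, so improbable observations are downweighted more strongly than under the DPD with the same exponent.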

The Breakdown Point under the Location Model
Consider the contamination model $H_{\epsilon,n} = (1 - \epsilon) G + \epsilon K_n$, where $\{K_n\}$ is a sequence of contaminating distributions. Let $h_{\epsilon,n}$, $g$ and $k_n$ be the corresponding densities. We say that there is breakdown in the minimum LφDPD functional for $\epsilon$ level contamination if there exists a sequence $K_n$ such that $|T_{\beta,\gamma}(H_{\epsilon,n}) - T_{\beta,\gamma}(G)| \to \infty$ as $n \to \infty$. We write below $\theta_n = T_{\beta,\gamma}(H_{\epsilon,n})$ and assume that the true distribution belongs to the model family, i.e., $g = f_{\theta_g}$. We make the following assumptions.
(BP1) $\int \min\{f_\theta(x), k_n(x)\}\, dx \to 0$ as $n \to \infty$ uniformly for $|\theta| \le c$ for any fixed $c$, i.e., the contamination distribution is asymptotically singular to the true distribution and to specified models within the parametric family.

Description and Results
Here we have performed a simulation study to analyze the performance of the LφDPD and the associated minimum distance estimators under the N (µ, 1) model at a given level of contamination. In the following study data are generated from two normal mixtures, 0.9N (0, 1) + 0.1N (5, 1) and 0.8N (0, 1) + 0.2N (5, 1), where N (0, 1) represents the target distribution and the second component is the contamination. The sample size is 50. The empirical MSE for the location model has been calculated by replicating the process 1000 times, evaluating the estimate for each replication and taking average squared error loss against the target value, i.e., µ = 0. In Table 1 the theoretical asymptotic relative efficiency of minimum LφDPD estimator and MDPDE is shown for different values of (β, γ) while in Table 2 and Table 3 the simulated mean square errors are presented under contaminated normal data under two different contamination levels.
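A simulation of this type can be sketched as follows. The sketch assumes the LφDPD weight $f B''(f) = \gamma^{-1} f^{\beta+1} \log(1 + \gamma/f)$ and uses the fact that, for the $N(\mu, 1)$ model, the integral term of the estimating equation vanishes by symmetry, so the estimate is a fixed point of a weighted mean. Function names, the median starting value, the seed and the reduced number of replications (200 rather than 1000) are choices made for this illustration.

```python
import math
import random
import statistics

SQRT_2PI = math.sqrt(2.0 * math.pi)

def lphi_weight(f, beta, gamma):
    """Assumed downweighting factor f * B''(f) = f**(beta+1) * log(1 + gamma/f) / gamma."""
    if f < 1e-300:  # guard against underflow far in the tails
        return 0.0
    return f ** (beta + 1.0) * math.log1p(gamma / f) / gamma

def mlphide_location(data, beta, gamma, tol=1e-8, max_iter=200):
    """Hypothetical helper: fixed-point solution of sum_i (x_i - mu) * w_i(mu) = 0."""
    mu = statistics.median(data)  # starting value (assumption of this sketch)
    for _ in range(max_iter):
        w = [lphi_weight(math.exp(-0.5 * (x - mu) ** 2) / SQRT_2PI, beta, gamma)
             for x in data]
        new_mu = sum(wi * x for wi, x in zip(w, data)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def simulate_mse(eps, beta, gamma, n=50, reps=200, seed=1):
    """Empirical MSE (target mu = 0) of the sample mean and of the robust estimate
    under the mixture (1 - eps) N(0, 1) + eps N(5, 1)."""
    rng = random.Random(seed)
    se_mean = se_robust = 0.0
    for _ in range(reps):
        data = [rng.gauss(5.0, 1.0) if rng.random() < eps else rng.gauss(0.0, 1.0)
                for _ in range(n)]
        se_mean += statistics.fmean(data) ** 2
        se_robust += mlphide_location(data, beta, gamma) ** 2
    return se_mean / reps, se_robust / reps

mse_mean, mse_robust = simulate_mse(eps=0.1, beta=0.5, gamma=0.5)
print("MSE of sample mean     :", round(mse_mean, 4))
print("MSE of robust estimate :", round(mse_robust, 4))
```

The sample mean absorbs the full bias of the contaminating component, while the weighted fixed point essentially ignores the observations near 5.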

The LφDPD versus the DPD
We briefly note our observations as may be evident from Tables 1 and 2. The asymptotic efficiencies of the minimum divergence estimators decrease with increasing β and increasing γ. Note that given an α ∈ (0, 1), it may be possible to choose β ∈ (0, α) and γ ∈ (0, 1) so that, in relation to our numerical study, MLφDE_{β,γ} beats MDPDE_α both in terms of asymptotic model efficiency and mean square error under contamination.
[Tables 1 to 3: rows indexed by β, columns by γ = 0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
For the MDPDEs in Tables 1 and 2 (as also in Tables 1 and 3), there exists a better MLφDE, both in terms of asymptotic model efficiency and obtained mean square error under contamination. In most of these cases there are several (β, γ) combinations which provide the domination over a given MDPDE. Tables 2 and 3 also show that the robust minimum distance estimators hold out well against the outliers at both 10 and 20 percent contamination. Simulation results not presented here indicate that the same holds for higher levels of contamination smaller than 1/2, a consequence of the high breakdown point of the method under location models.

Algorithm for Finding the Optimal (β, γ)
The LφDPD can generate many different kinds of estimators, starting from the most efficient estimator to highly robust estimators. For example, in the limit γ → 0 and β → 0, one gets the likelihood disparity, which is minimized by the classical maximum likelihood estimator.
On the other hand, relatively larger values of β and γ lead to estimators with extremely high outlier stability. In a given situation, therefore, it is imperative that one is able to choose the most suitable tuning parameters for that particular case. Here we consider a data driven algorithm for selecting the "optimal" tuning parameters (β, γ) which would provide the best compromise for the given situation. For this purpose we modify an approach of Warwick (2002), pp. 78-82, and minimize an empirical version of the asymptotic summed mean square error. The optimization technique is a two-stage process. Suppose that the data are generated by a contaminated version of a model distribution, and let $\theta_0$ be the parameter for the model component. Although the data are generated by a contaminated version, the parameter $\theta_0$ of the model component is our target parameter. The spirit of such a set up is described in Warwick and Jones (2005). Let $\theta_{\beta,\gamma} = T_{\beta,\gamma}(G)$ be the corresponding minimum distance functional and let $\hat\theta_{\beta,\gamma}$ be the solution of the unbiased estimating equation of the LφDPD with tuning parameter (β, γ) based on the data. The summed mean square error of the minimum LφDPD estimator has the asymptotic formula
$$E\left[(\hat\theta_{\beta,\gamma} - \theta^*)^T (\hat\theta_{\beta,\gamma} - \theta^*)\right] = (\theta_{\beta,\gamma} - \theta^*)^T (\theta_{\beta,\gamma} - \theta^*) + \frac{1}{n}\, \mathrm{tr}\left\{J^{-1} K J^{-1}\right\}.$$
Here $\theta^*$ is the pilot estimator playing the role of $\theta_0$ and $\mathrm{tr}\{\cdot\}$ represents the trace of a matrix. The asymptotic covariance matrix of $\sqrt{n}\, \hat\theta_{\beta,\gamma}$ is $J^{-1} K J^{-1}$, which gives the estimated asymptotic summed mean square error of the MLφDE. For the multiparameter case, the covariance term is a matrix, so its trace is used to provide a global measure of the summed mean square error for minimization; this is the expression to be minimized when there are two parameters to be estimated (say (µ, σ) for the N (µ, σ) model). The optimal value of (β, γ) is the minimizer of Eq. 19 under certain conditions. One important note is that in the first stage of minimization our pilot estimate for $\theta^*$ is taken to be a good robust estimate based on the data, as suggested in [War02].
The empirical summed mean square error is then obtained by evaluating the expressions in Eq. 18 or Eq. 19 after substituting $\hat\theta_{\beta,\gamma}$ for $\theta_{\beta,\gamma}$ and the empirical distribution $G_n$ in place of the true unknown distribution $G$. Let us denote this empirical summed mean square error by AMSE in the following.
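The idea of minimizing an empirical AMSE over (β, γ) can be illustrated with a simplified one-stage grid search for the $N(\mu, 1)$ location model. This is a sketch under stated assumptions, not the two-stage procedure of the paper: the weight $f B''(f) = \gamma^{-1} f^{\beta+1}\log(1+\gamma/f)$ is assumed, the sample median stands in for the pilot estimate, the variance term $K/J^2$ is evaluated at the model by the trapezoidal rule with $J = \int z^2 w(\phi)\phi\,dz$ and $K = \int z^2 w(\phi)^2 \phi\,dz$, and all names and the grid are hypothetical.

```python
import math
import random
import statistics

SQRT_2PI = math.sqrt(2.0 * math.pi)

def phi(z):
    return math.exp(-0.5 * z * z) / SQRT_2PI

def lphi_weight(f, beta, gamma):
    """Assumed downweighting factor f * B''(f)."""
    if f < 1e-300:
        return 0.0
    return f ** (beta + 1.0) * math.log1p(gamma / f) / gamma

def asymptotic_variance(beta, gamma, half_width=10.0, steps=4000):
    """K / J**2 at the N(mu, 1) model, by the trapezoidal rule."""
    h = 2.0 * half_width / steps
    J = K = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        f = phi(z)
        w = lphi_weight(f, beta, gamma)
        c = 0.5 if i in (0, steps) else 1.0
        J += c * z * z * w * f * h
        K += c * z * z * w * w * f * h
    return K / (J * J)

def mlphide_location(data, beta, gamma, tol=1e-8, max_iter=200):
    mu = statistics.median(data)
    for _ in range(max_iter):
        w = [lphi_weight(phi(x - mu), beta, gamma) for x in data]
        new_mu = sum(wi * x for wi, x in zip(w, data)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def select_tuning(data, betas, gammas):
    """Return (AMSE, beta, gamma, estimate) minimizing bias**2 + variance / n."""
    pilot = statistics.median(data)  # stand-in pilot estimate (assumption)
    n = len(data)
    best = None
    for b in betas:
        for g in gammas:
            est = mlphide_location(data, b, g)
            amse = (est - pilot) ** 2 + asymptotic_variance(b, g) / n
            if best is None or amse < best[0]:
                best = (amse, b, g, est)
    return best

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(45)] + [rng.gauss(5.0, 1.0) for _ in range(5)]
betas, gammas = (0.1, 0.3, 0.5, 0.7), (0.1, 0.5, 1.0)
best_amse, best_beta, best_gamma, best_est = select_tuning(data, betas, gammas)
print(best_beta, best_gamma, round(best_est, 4), round(best_amse, 5))
```

As γ → 0 the assumed weight reduces to $f^{\beta}$, and the computed variance approaches the known DPD value $(1+\beta)^3/(1+2\beta)^{3/2}$ for the normal mean, which gives a convenient check of the numerical integration.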

Algorithm:
Given a dataset X n×1 we perform the following steps to obtain the estimate of θ.
1. Apply the method suggested in [War02] to get an optimal α for MDPDE. Suppose this value is α w . This step is the 1st stage of optimization by assuming an initial pilot estimate of θ * .

Real Data Examples
Here we take some real data sets and use our algorithm to find the optimal tuning parameters to be used in estimating the parameters of the model. We worked with two data sets, Newcomb's light speed data and Short's parallax of the sun data, under normality assumptions. We have used the minimum L 2 distance estimates as our pilot estimates of (µ, σ).

Newcomb's Data (Speed of Light)
This example involves Newcomb's light speed data (Stigler, 1977, Table 5). The data size is n = 66. Under the normal model, the MLE of the mean and standard deviation for these data are found to be equal to 26.212 and 10.664, respectively. We employ our algorithm for tuning parameter selection and Table 4 reports the optimal tuning parameters for DPD and LφDPD, as well as the parameter estimates at these optimal values. The estimators are extremely close, but the estimated asymptotic summed mean square error, for whatever it is worth, is lower in the case of the MLφDE.
Figure 3: Normal density fits for Newcomb's data

Short's Data (Parallax of the Sun)
This example involves Short's data for the determination of the parallax of the sun, the angle subtended by the earth's radius as if viewed and measured from the surface of the sun. From this angle and available knowledge of the physical dimensions of the earth, the mean distance from earth to the sun can be easily determined. The raw observations are presented in Table 4 of Stigler (1977). The data size is n = 53. Under the normal model, the MLE of the mean and standard deviation for these data are found to be equal to 8.378 and 0.846 respectively. We perform all the steps of the aforesaid tuning parameter selection algorithm, and the results of the analysis are now listed in Table 5. Again, the empirical asymptotic MSE for the MLφDE is slightly better than that of the MDPDE.
From Figure 3 and Figure 4, it is evident that the normal fits coming from the MDPDE

The MLφDE for Independent Non-homogeneous Observations
Here we generalize the above concept to the case of independent but not identically distributed observations. [GB13] explains the methodology for this problem in the case of the DPD, but here we will extend it to the case of the LφDPD. Let us assume that the observed data $Y_1, \ldots, Y_n$ are independent but for each $i$, $Y_i \sim g_i$, where the densities $g_1, \ldots, g_n$ may not be the same. We want to model $g_i$ by the family $\mathcal{F}_{i,\theta} = \{f_i(\cdot; \theta) \mid \theta \in \Theta\}$ for all $i = 1, 2, \ldots, n$. We want to estimate $\theta$ by minimizing the LφDPD between the data and the model. However, the model density may not be the same for each $Y_i$, and hence we need to calculate the divergence between data and model separately for each data point. For this purpose, we minimize the average divergence
$$\frac{1}{n} \sum_{i=1}^{n} d(\hat g_i, f_i(\cdot; \theta))$$
with respect to $\theta \in \Theta$, where $d(\hat g_i, f_i(\cdot; \theta))$ denotes the LφDPD between the density estimate corresponding to the $i$-th data point and the associated model density. In the presence of only one data point $Y_i$ from density $g_i$, the best possible density estimate of $g_i$ is the (degenerate) density which puts the entire mass on $Y_i$, so that we have
$$d(\hat g_i, f_i(\cdot; \theta)) = \int \left\{ B'(f_i(y; \theta)) f_i(y; \theta) - B(f_i(y; \theta)) \right\} dy - B'(f_i(Y_i; \theta)) + K,$$
where $K$ is a constant independent of $\theta$, the parameter of interest. Thus, for the purpose of estimation it suffices to minimize the objective function
$$H_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} V_i(Y_i; \theta), \qquad (21)$$
where
$$V_i(Y_i; \theta) = \int \left\{ B'(f_i(y; \theta)) f_i(y; \theta) - B(f_i(y; \theta)) \right\} dy - B'(f_i(Y_i; \theta)).$$
Differentiating the above with respect to $\theta$ we get the estimating equation of the minimum LφDPD estimator for non-homogeneous observations as
$$\frac{1}{n} \sum_{i=1}^{n} \left[ u_i(Y_i; \theta)\, B''(f_i(Y_i; \theta))\, f_i(Y_i; \theta) - \int u_i(y; \theta)\, B''(f_i(y; \theta))\, f_i^2(y; \theta)\, dy \right] = 0, \qquad (23)$$
where $u_i(\cdot)$ is the score function for $f_i(\cdot)$.

Asymptotic Properties
We will now derive the asymptotic distribution of the minimum LφDPD estimator $\hat\theta_n$ defined by the relation $H_n(\hat\theta_n) = \min_{\theta \in \Theta} H_n(\theta)$, provided such a minimum exists. Let us first present the necessary set up and conditions. Let the parametric model $\mathcal{F}_{i,\theta}$ be as defined above. We also assume that there exists a best fitting parameter of $\theta$ which is independent of the index $i$ of the different densities. Let us denote it by $\theta_g$. The assumptions hold if all the true densities $g_i$ belong to the model family so that $g_i = f_i(\cdot; \theta)$ for some common $\theta$, and in that case the best fitting parameter is nothing but the true parameter $\theta$. Next, recall that the MLφDE $\hat\theta_n$ is obtained as a solution of the estimating Eq. 23. This equation is satisfied by the minimizer of $H_n(\theta)$ in Eq. 21. Similarly, we also define, for $i = 1, 2, \cdots$, Note, at the best fitting parameter $\theta_g$, we must have We also define, for each $i = 1, 2, \cdots$, the $p \times p$ matrix $J^{(i)}$ whose $(k, l)$-th entry is given by where $\nabla_{kl}$ represents the partial derivative with respect to the indicated components of $\theta$. We further define the quantities A simple calculation shows that, and where $\xi_i = \frac{1}{\gamma} \int \left[ \int_0^{f_i(y; \theta)} s^{\beta} \log\left(1 + \frac{\gamma}{s}\right) ds \right] g_i(y; \theta)\, dy$.
We will make the following assumptions to establish the asymptotic properties of the MLφDE: (G1) The support X = {y|f i (y; θ) > 0} is independent of i and θ for all i; the true distributions G i are also supported on X for all i.
(G2) There is an open subset $\omega$ of the parameter space $\Theta$, containing the best fitting parameter $\theta_g$, such that for almost all $y \in \mathcal{X}$, all $\theta \in \omega$, and all $i = 1, 2, \cdots$, the density $f_i(y; \theta)$ is thrice differentiable with respect to $\theta$ and the third partial derivatives are continuous with respect to $\theta$.
(G3) For each $i = 1, 2, \cdots$, the three integrals $\int f_i(y; \theta) \left[ \int_0^{f_i(y; \theta)} s^{\beta} \log\left(1 + \frac{\gamma}{s}\right) ds \right] dy$, $\int \int_0^{f_i(y; \theta)} \int_0^{t} s^{\beta} \log\left(1 + \frac{\gamma}{s}\right) ds\, dt\, dy$, and $\int \left[ \int_0^{f_i(y; \theta)} s^{\beta} \log\left(1 + \frac{\gamma}{s}\right) ds \right] g_i(y)\, dy$ can be differentiated thrice with respect to $\theta$, and the derivatives can be taken under the integral sign (the first indefinite integral).
(ii) The asymptotic distribution of $\Omega_n^{-1/2} \Psi_n \left[ \sqrt{n} (\hat\theta_n - \theta_g) \right]$ is $p$-dimensional normal with (vector) mean 0 and covariance matrix $I_p$, the $p$-dimensional identity matrix.
Note that, putting $f_i = f$ for all $i$, we get back the corresponding asymptotic properties of the minimum LφDPD estimator for the i.i.d. case. If $f_i = f$, $i = 1, 2, \cdots$, we get $J^{(i)} = J$ for all $i$; thus $\Psi_n = J$ and $\Omega_n = K$. Here $J$ and $K$ are as defined previously. In this case assumptions (G1)-(G5) are exactly the same as the assumptions (A1)-(A5), while assumptions (G6) and (G7) are automatically satisfied by the dominated convergence theorem. Thus the result, which establishes the consistency and asymptotic normality of the minimum LφDPD estimator $\hat\theta$ with $n^{1/2} (\hat\theta - \theta_g)$ having the asymptotic covariance matrix $\Psi_n^{-1} \Omega_n \Psi_n^{-1} = J^{-1} K J^{-1}$, emerges as a special case of Theorem 4.

Normal Linear Regression
A natural situation where the theory proposed above would be immediately applicable is the case of linear regression. We consider the linear regression model
$$y_i = x_i^T \beta + \epsilon_i, \qquad i = 1, \ldots, n,$$
where the errors $\epsilon_i$ are i.i.d. normal variables with mean zero and variance $\sigma^2$, $x_i^T = (x_{i1}, \cdots, x_{ip})$ is the vector of the independent variables corresponding to the $i$-th observation and $\beta = (\beta_1, \cdots, \beta_p)^T$ represents the regression coefficients. We will assume that the $x_i$'s are fixed. Then $y_i \sim N(x_i^T \beta, \sigma^2)$, and hence the $y_i$'s are independent but not identically distributed. Thus the $y_i$'s satisfy our independent but non-homogeneous set-up and hence the MLφDE of the parameter $\theta = (\beta^T, \sigma^2)^T$ can be obtained by minimizing the expression in Eq. 21 with $f_i \equiv N(x_i^T \beta, \sigma^2)$.
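For the regression coefficients, the integral term of the estimating equation vanishes by the symmetry of the normal density, so with the scale held fixed the coefficient equations become weighted normal equations that can be solved by iterative reweighting. The sketch below does this for a simple linear regression. It is an illustration under several assumptions: the weight $f B''(f) = \gamma^{-1} f^{\beta+1} \log(1 + \gamma/f)$, a known and fixed σ (the full MLφDE also estimates σ), an ordinary least squares starting value, and hypothetical function names and synthetic data.

```python
import math
import random

def normal_pdf(r, sigma):
    """Density of N(0, sigma**2) at residual r."""
    return math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def lphi_weight(f, beta, gamma):
    """Assumed downweighting factor f * B''(f)."""
    if f < 1e-300:
        return 0.0
    return f ** (beta + 1.0) * math.log1p(gamma / f) / gamma

def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return yb - b * xb, b

def mlphide_line(x, y, sigma, beta, gamma, tol=1e-8, max_iter=200):
    """Hypothetical helper: iteratively reweighted fit with sigma held fixed."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # OLS start (assumption of this sketch)
    for _ in range(max_iter):
        w = [lphi_weight(normal_pdf(yi - a - b * xi, sigma), beta, gamma)
             for xi, yi in zip(x, y)]
        a_new, b_new = weighted_line(x, y, w)
        if abs(a_new - a) + abs(b_new - b) < tol:
            return a_new, b_new
        a, b = a_new, b_new
    return a, b

# Synthetic data: y = 1 + 2x + N(0, 0.5**2) noise, with four vertical outliers.
rng = random.Random(0)
x = [i / 10.0 for i in range(40)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
for idx in (5, 15, 25, 35):
    y[idx] += 8.0

print("OLS fit    :", weighted_line(x, y, [1.0] * len(x)))
print("robust fit :", mlphide_line(x, y, sigma=0.5, beta=0.5, gamma=0.5))
```

With high-leverage outliers an ordinary least squares start may lead the iteration to a poor root, which is one reason to prefer a robust starting value in practice.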

Real Data Examples in Regression
We now consider some real data examples to illustrate the above technique in linear regression.

Hertzsprung-Russell Data
This example involves a robust regression on the Hertzsprung-Russell data. These data, associated with the Hertzsprung-Russell diagram of the star cluster CYG OB1 containing 47 stars in the direction of Cygnus, have been analyzed previously by several authors including [RL87].
We fit the simple linear regression model $y = \eta_0 + \eta_1 x + \epsilon$ under homoscedastic normal errors. Here the independent variable (x) is the logarithm of the temperature of the stars, and the dependent variable (y) is the logarithm of the light intensity of the stars. The initial regression parameter values are the least median of squares (LMS) estimates. The initial scale estimate is the scaled median absolute deviation (MAD) of the LMS residuals. We perform the previously mentioned steps of optimal tuning parameter selection and obtain the estimates for the regression coefficients, which are given in Table 6. The regression lines for LS regression, LMS regression and minimum LφDPD regression are given in Figure 5. The robust performance of the MLφDE is self evident.

Salinity Data
This example involves the Salinity data (Table 5, Chapter 3, Rousseeuw and Leroy, 1987). These data were originally presented by [RC80]. The measurements of the salt concentration of the water and the river discharge taken in North Carolina's Pamlico Sound were recorded as the data. These data represent a multiple linear model with salinity as the dependent variable (y), and salinity lagged by two weeks ($x_1$), the number of biweekly periods elapsed
We fit the multiple linear regression model $y = \eta_0 + \eta_1 x_1 + \eta_2 x_2 + \eta_3 x_3 + \epsilon$ under homoscedastic normal errors. The initial regression parameter values are the least median of squares (LMS) estimates. The initial scale estimate is the scaled median absolute deviation (MAD) of the LMS residuals.
The optimal parameters obtained through our algorithm for optimal parameter selection are presented in Table 7. The residual plots for LS regression, LMS regression and minimum LφDPD regression are given in the Figure 6. Like the LMS method (and unlike the LS method) the MLφDE gives a nice outlier resistant fit.

Hypothesis Testing using LφDPD
Table 6: Regression estimates for Hertzsprung-Russell data
Category                    MLφDE
Tuning Parameter (β, γ)     (1, 0.9)
Estimate of η0              −8.5557324
Estimate of η1              3.0590795
Estimate of σ               0.4266284

Table 7: Regression estimates for Salinity data
Category                    MLφDE
Tuning Parameter (β, γ)     (1, 0.9)
Estimate of η0              57.16780461
Estimate of η1              0.06010002
Estimate of η2              −0.01301208
Estimate of η3              −2.08372562
Estimate of σ               0.56157558

Figure 6: Residual plots of the fitted regression models for Salinity data using LS, LMS and minimum LφDPD estimation

Now we develop tests of parametric hypotheses based on the LφDPD. The most common problem is that of testing a simple null hypothesis for a parametric family of densities $\{f_\theta : \theta \in \Theta \subset \mathbb{R}^p\}$ under the one sample case. Here we test
$$H_0: \theta = \theta_0 \quad \text{against} \quad H_1: \theta \ne \theta_0 \qquad (35)$$
when a random sample $X_1, X_2, \ldots, X_n$ is available from the population of interest. We propose a test statistic $T_{\beta,\gamma}(\hat\theta, \theta_0)$, with $\hat\theta = \hat\theta_{\beta,\gamma}$ being the MLφDE of $\theta$ and $B(\cdot)$ as defined in Eq. 16. We shall find the asymptotic distribution of $T$ under $H_0$ and reject the null hypothesis for large values of $T$. We assume the following regularity conditions on the parametric family of distributions.
(B1) The support of the distribution function $F_\theta$, i.e. the set $\mathcal{X} = \{x \mid f_\theta(x) > 0\}$, is independent of $\theta$.
(B2) There is an open subset $\omega$ of the parameter space $\Theta$, containing the true parameter value $\theta_0$, such that for almost all $x \in \mathcal{X}$, and all $\theta \in \omega$, the density $f_\theta(x)$ is three times differentiable with respect to $\theta$ and the third partial derivatives are continuous with respect to $\theta$.
(B3) The integrals $\int B''(f_\theta(x)) f_\theta^2(x)\, dx$ can be differentiated with respect to $\theta$, and the derivatives can be taken under the integral sign.
(B4) The p × p matrix J(θ) defined by is positive definite where E θ represents the expectation under the density f θ .
(B5) There exist functions $M_{jkl}(x)$ with finite expectation, $j, k, l = 1, \ldots, p$, such that Then we have the following theorem.
We can extend this theorem, and hence the testing result, to the general two sample problem of testing $H_0: \theta_1 = \theta_2$ against $H_1: \theta_1 \ne \theta_2$, where there is a random sample of size $n$ from population 1 with parameter $\theta_1$ and one of size $m$ from population 2 with parameter $\theta_2$. Let $\hat\theta_1$ and $\hat\theta_2$ be the MLφDEs of the parameter in populations 1 and 2, respectively. Then under the regularity conditions (B1)-(B5) on the model, we have the following results.

Equivalence with the Score Test
A score test, developed in the same spirit under the same set up as in Theorem 5, also has the same asymptotic null distribution.
Theorem 7. The score test statistic using the LφDPD for testing the simple null in Eq. 35 can be given by and $A(\theta_0)$ is as described in Theorem 5. Under the null hypothesis, the asymptotic distribution of this statistic is the same as that of $T_{\beta,\gamma}(\hat\theta, \theta_0)$.

Divergence Difference test statistic
We assume that we have a parametric model $\mathcal{F}$ of densities and that $X_1, \ldots, X_n$ are i.i.d. from the true distribution $G$ with the same support as the distributions in $\mathcal{F}$. Consider the null hypothesis
$$H_0: \theta \in \Theta_0 \quad \text{against} \quad H_1: \theta \in \Theta \setminus \Theta_0, \qquad (37)$$
where $\Theta_0$ is a proper subset of $\Theta$. The likelihood ratio test (LRT) is one of the most common tests that may be employed in this situation. Define
$$\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid X_1, \ldots, X_n)}{\sup_{\theta \in \Theta} L(\theta \mid X_1, \ldots, X_n)},$$
where $L(\theta \mid X_1, \ldots, X_n)$ is the likelihood of $\theta$ given the data. The test statistic in this case is $-2 \log \lambda$. Assume that the distribution function $G$ is discrete. In particular let its support be $\mathcal{X} = \{0, 1, 2, \ldots\}$, which is also the common support of the family $\mathcal{F}$. Then the test statistic can be expressed in terms of the observed relative frequencies $\nu_n$ as
$$-2 \log \lambda = 2n \left[ LD(\nu_n, f_{\hat\theta_0}) - LD(\nu_n, f_{\hat\theta}) \right], \qquad (38)$$
where $LD(\cdot, \cdot)$ stands for the likelihood disparity. Here $\hat\theta$ and $\hat\theta_0$ stand for the unrestricted maximum likelihood estimator and the maximum likelihood estimator under the null hypothesis, respectively. Eq. 38 gives a motivation to construct a new test statistic based on the LφDPD.
As an analog of the likelihood ratio test, we consider the divergence difference test (DDT) based on the LφDPD to test the hypothesis given in Eq. 37. Note that the test statistic in Eq. 38 can be viewed as the difference between the minimized value of the likelihood disparity under the null and the unrestricted minimum of the likelihood disparity. In the same spirit one may define the test statistic
$$DDT_{\beta,\gamma}(\nu_n) = 2n \left[ d_{\beta,\gamma}(\nu_n, f_{\hat\theta_0}) - d_{\beta,\gamma}(\nu_n, f_{\hat\theta}) \right], \qquad (39)$$
where $\hat\theta_0$ and $\hat\theta$ are the MLφDE under the null hypothesis and the unrestricted MLφDE, respectively. Also note that
$$d_{\beta,\gamma}(\nu_n, f_\theta) = \sum_{x \in \mathcal{X}} \left\{ B(\nu_n(x)) - B(f_\theta(x)) - \left(\nu_n(x) - f_\theta(x)\right) B'(f_\theta(x)) \right\},$$
where $B(\cdot)$ is defined as in Eq. 16. We will show that under certain regularity conditions the asymptotic distribution of the test statistic $DDT_{\beta,\gamma}(\nu_n)$ coincides with the distribution of a linear combination of independent chi-squared random variables. Suppose that $\Theta_0$ is defined by a set of $r \le p$ restrictions on $\Theta$ of the form $R_i(\theta) = 0$, $1 \le i \le r$. We assume that the parameter space under $H_0$ can be described through a parameter $\xi = (\xi_1, \ldots, \xi_{p-r})$, with $p - r$ independent components, i.e., $H_0$ specifies that there exists a function $b: \mathbb{R}^{p-r} \to \mathbb{R}^p$ such that $\theta = b(\xi)$. The function $b$ is assumed to have a continuous derivative $\dot b(\xi)$ of order $p \times (p - r)$ with rank $p - r$. Then the constrained estimator is $\hat\theta_0 = b(\hat\xi)$, where $\hat\xi$ is the MLφDE under the $\xi$ formulation of the model. Let $G = F_\theta$ be the true distribution which belongs to the family $\mathcal{F}$ with parameter $\theta$. Under $H_0$, let $\xi$ be the true value of the reduced parameter, so that $\theta = b(\xi)$. When the null hypothesis is true, under standard regularity conditions it can be easily shown that $\hat\xi$ and $\hat\theta_0$ are consistent for $\xi$ and $\theta$ respectively; in the associated expansion, $Z_n(b(\xi))$ is $AN(0, K_B(b(\xi)))$. Here $J_B(\cdot)$ and $K_B(\cdot)$ are defined as in Theorem 1. Now we will lay out some appropriate regularity conditions under which we will derive the asymptotic distribution of $DDT_{\beta,\gamma}(\nu_n)$ under the null hypothesis.
Remark. In the above theorems, the null distribution of the test statistic turns out to be the same as that of a linear combination of independent chi-squared random variables. In general, critical values under this distribution are not available in closed form, and computations involving it are numerically demanding, which makes the test harder to carry out in practice. This motivates us to explore another test statistic that leads to a simpler null distribution.
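As a rough illustration of what such a computation involves, the quantiles of a linear combination $\sum_i c_i Z_i^2$ of independent $\chi^2_1$ variables can be approximated by simulation. The sketch below is a generic Monte Carlo workaround and not a procedure proposed in this paper. The function name, the number of draws and the weights are all hypothetical; in an actual application the weights $c_i$ would have to be estimated from the data.

```python
import random


def mixture_chi2_quantile(weights, level=0.95, draws=100_000, seed=0):
    """Monte Carlo quantile of sum_i weights[i] * Z_i**2 with Z_i iid N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sims = sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(draws)
    )
    # Empirical quantile: the order statistic at the requested level
    index = min(draws - 1, int(level * draws))
    return sims[index]


# Hypothetical weights, for illustration only
approx_critical_value = mixture_chi2_quantile([1.2, 0.8], level=0.95)
print(approx_critical_value)
```

With a single unit weight the routine should approximately recover the $\chi^2_1$ quantile, which gives a simple sanity check on the simulation size.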

Wald Type Test
Assume a hypothesis testing setup similar to that in Eq. 37. Suppose that the null space $\Theta_0 \subseteq \Theta \subseteq \mathbb{R}^p$ is defined by a set of $r \le p$ restrictions on $\Theta$ of the form $R_i(\theta) = 0$, $1 \le i \le r$. Let $G = F_\theta$ be the true distribution, which belongs to the family $\mathcal{F}$ with parameter $\theta$, and let $\hat\theta$ be the MLφDE of the true parameter $\theta$. Define $R(\theta) = (R_1(\theta), \ldots, R_r(\theta))^T$ and the $r \times p$ matrix $D(\theta) = \left( \partial R_i(\theta) / \partial \theta_j \right)$. In the spirit of the original Wald test statistic, we can construct the test statistic

$$W(\hat\theta) = n \, R(\hat\theta)^T \left[ D(\hat\theta) \, \Sigma(\hat\theta) \, D(\hat\theta)^T \right]^{-1} R(\hat\theta),$$

where $\Sigma(\theta) = J_B(\theta)^{-1} K_B(\theta) J_B(\theta)^{-1}$ for the $B(\cdot)$ function described in Eq. 16. Under standard regularity conditions, the asymptotic distribution of $W(\hat\theta)$ is $\chi^2_r$ under the null hypothesis. The proof follows from a simple application of the delta method to the quantity $R(\hat\theta)$, together with the fact that, under the null hypothesis, $\sqrt{n}(\hat\theta - \theta)$ is $AN(0, \Sigma(\theta))$. The main benefit of this test statistic is that its asymptotic null distribution is simpler, so numerical computations based on it are easy to perform. For example, the critical values of the test are simply quantiles of the $\chi^2_r$ distribution.
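To make the computation concrete, the sketch below evaluates a Wald-type statistic for a single restriction ($r = 1$), assuming the usual quadratic form $W = n R(\hat\theta)^2 / \{D(\hat\theta) \Sigma(\hat\theta) D(\hat\theta)^T\}$. It does not compute $J_B$ or $K_B$. The estimate, the plug-in covariance matrix and the restriction are hypothetical placeholders, and the function names are not taken from the paper.

```python
from statistics import NormalDist


def wald_statistic_scalar(n, r_value, grad, cov):
    """W = n * R^2 / (D Sigma D^T) for a single restriction (r = 1).

    r_value : R(theta_hat), the restriction evaluated at the estimate
    grad    : D(theta_hat), the gradient of R as a list of length p
    cov     : plug-in estimate of Sigma(theta_hat) as a p x p nested list
    """
    p = len(grad)
    quad = sum(grad[i] * cov[i][j] * grad[j] for i in range(p) for j in range(p))
    return n * r_value ** 2 / quad


def chi2_1_quantile(level):
    """Quantile of chi-square(1), using that it is the square of a N(0, 1)."""
    return NormalDist().inv_cdf((1.0 + level) / 2.0) ** 2


# Hypothetical illustration: theta = (mu, sigma), restriction R(theta) = mu - mu0
n = 50
theta_hat = (0.4, 1.1)                # placeholder estimate
mu0 = 0.0
cov_hat = [[1.3, 0.0], [0.0, 0.7]]    # placeholder for Sigma(theta_hat)
grad = [1.0, 0.0]                     # dR/dmu = 1, dR/dsigma = 0

w = wald_statistic_scalar(n, theta_hat[0] - mu0, grad, cov_hat)
crit = chi2_1_quantile(0.95)
reject = w > crit
print(w, crit, reject)
```

For $r > 1$ the scalar division would be replaced by solving a linear system with the $r \times r$ matrix $D \Sigma D^T$, and the reference distribution would be $\chi^2_r$.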

Real Data Example
Researchers needed to evaluate the effectiveness of an insecticide (dieldrin) in killing Anopheles farauti mosquitoes. The theory was that resistance to dieldrin was due to a single dominant gene, and that in an appropriately selected sample of the mosquitoes there should be 50% susceptibility to the insecticide. The hypothesis is

$$H_0 : p = \tfrac{1}{2} \quad \text{versus} \quad H_1 : p \ne \tfrac{1}{2},$$

where $p$ is the probability of susceptibility. The results of such an experiment are given in [Osb79]. The sample contains 465 mosquitoes, of which 264 died on being exposed to the insecticide. We can perform this test with the test statistic $\mathrm{DDT}_{\beta,\gamma}(\nu_n)$ in Eq. 39. Here $\beta$ and $\gamma$ are chosen to be 0.3 and 0.05, respectively. The support of the distribution is $\mathcal{X} = \{0, 1\}$, where 1 stands for the death of a mosquito, so that $\nu_n(1) = 264/465$. The null hypothesis is rejected if the value of the test statistic is large. In this case the asymptotic null distribution of the test statistic turns out to be $0.774\,\chi^2_1$. For the observed data the value of the test statistic is approximately 6.62. The 95% quantile of this scaled chi-squared distribution is 2.97. Since 6.62 exceeds 2.97, the null hypothesis is rejected at the 5% level of significance.
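The critical value quoted above can be checked with a few lines of code. The sketch below takes the scale factor 0.774 and the observed statistic 6.62 as given in the text and does not recompute them. It relies only on the fact that a $\chi^2_1$ variable is the square of a standard normal, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

scale = 0.774      # scale factor of the asymptotic null distribution (from the text)
observed = 6.62    # reported value of the DDT statistic (from the text)
level = 0.95

std_normal = NormalDist()

# chi-square(1) quantile via the standard normal: q = z_{(1 + level) / 2} ** 2
chi2_quantile = std_normal.inv_cdf((1.0 + level) / 2.0) ** 2
critical_value = scale * chi2_quantile

# Asymptotic p-value: P(scale * chi2_1 > observed) = 2 * (1 - Phi(sqrt(observed / scale)))
p_value = 2.0 * (1.0 - std_normal.cdf(math.sqrt(observed / scale)))

print(round(critical_value, 2), observed > critical_value, p_value < 0.05)
```

The computed critical value rounds to 2.97, and the asymptotic p-value is well below 0.05, in agreement with the rejection reported above.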

Summary
In this paper, we have developed a large class of density-based divergences which includes the density power divergence family as a special limiting case. We have discussed the key philosophy behind the construction of the new family, namely a stronger downweighting effect. In applications, the family gives the data analyst a larger number of divergences to choose from for inference. We have established several asymptotic and distributional properties of the proposed estimator. We have also shown that a judicious choice of the tuning parameters leads to highly robust and efficient estimators which can often dominate the MDPDE. Though one of the parameters has a smaller effect on robustness, we have shown that both play an important role in the context of finite sample efficiency. We have also presented a possible data-driven algorithm for obtaining the "optimal" estimator for a given data set. Finally, we have considered several hypothesis testing strategies for parametric models which may serve as robust alternatives to the classical likelihood ratio test and other likelihood-based tests.
Remark. Like the MDPDE, the procedures described in this paper avoid the nonparametric density estimation and associated complications specific to classical minimum distance estimation. Another approach of this type can be found in [TB11].
Remark. In creating the test statistics for parametric hypothesis testing using the LφDPD, we have restricted ourselves to the case where the same set of tuning parameters has been used for estimation as well as for the construction of the subsequent divergences. In practice, one could allow them to vary; see, for example, [BMMP13]. In the present context, while this is possible, we do not explore this issue, as we feel that there are enough tuning parameters involved already, and there are no demonstrated results indicating that such differential choices will necessarily produce improved tests.
Remark. In this paper, most of our illustrations have been with respect to continuous models. Theoretically, however, there is nothing preventing the successful use of these procedures in discrete models; all the necessary theory works out satisfactorily in this case.

Proof of Theorems
The proofs of Theorems 2, 5, 6 and 7 are omitted, as they follow along the lines of the existing proofs in [BSP11], [GB13] and [GBP15].
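… under $k_n$ and $f_{\theta_n}$, the set $A_n$ converges to a set of zero probability as $n \to \infty$. Thus, on $A_n$, $d(h_{\epsilon,n}, f_{\theta_n}) \to d((1-\epsilon)g, 0)$ as $n \to \infty$, and so by the dominated convergence theorem (DCT), $\left| \int_{A_n} d(h_{\epsilon,n}, f_{\theta_n}) - \int_{A_n} d((1-\epsilon)g, 0) \right| \to 0$.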
We will have a contradiction to our breakdown assumption if we can show that there exists a constant value $\theta^*$ in the parameter space such that, for the same sequence $k_n$, $\limsup_{n \to \infty} D(h_{\epsilon,n}, f_{\theta^*}) < a_1(\epsilon)$, as then the sequence $\{\theta_n\}$ above could not minimize $D(h_{\epsilon,n}, f_\theta)$ for every $n$. We will now show that the above inequality holds for all $\epsilon < 1/2$ under the model when we choose $\theta^* = \theta_g$. For any fixed $\theta$, let $B_n = \{x : k_n(x) > \max\{g(x), f_\theta(x)\}\}$. Since $g$ belongs to the model $\mathcal{F}$, from (BP1) we get $\int_{B_n} g \to 0$, $\int_{B_n} f_\theta \to 0$ and $\int_{B_n^c} k_n \to 0$ as $n \to \infty$. Thus, under $k_n$, the set $B_n^c$ converges to a set of zero probability, while under $g$ and $f_\theta$, the set $B_n$ converges to a set of zero probability. Thus, on $B_n$, $d(h_{\epsilon,n}, f_\theta) \to d(\epsilon k_n, 0) = B(\epsilon k_n)$ as $n \to \infty$. … (1) $f, (1-\epsilon)$ decreases as $\epsilon \uparrow 1$. So $a_1(\epsilon)$ decreases as $\epsilon \uparrow 1$. Similarly, it can be shown that $a_3(\epsilon)$ is an increasing function of $\epsilon$. But $a_1(1/2) = a_3(1/2)$; thus asymptotically there is no breakdown, and $\limsup_{n \to \infty} |T_{\beta,\gamma}(H_{\epsilon,n})| < \infty$ for $\epsilon < 1/2$. Hence the theorem follows.