A Qualitative Analysis of Independent Component Analysis Based Algorithms for the Removal of Artifacts from Electroencephalography Signals

: Problem statement: Independent Component Analysis (ICA) based algorithms applied in the context to remove the artifacts from the EEG signals are evaluated with appropriate metrics and it compares and contrasts the performance of the different methods for such applications. The primary goal is to gain some insight into relative performance of the various methods. Approach: CA is a statistical and computational technique for revealing hidden factors that underlie sets of random signals. In the ICA model the data samples are assumed to be linear mixture of some unknown latent variables and the mixing system is also unknown. The latent variables are assumed to have a nongaussian distribution. These variables are the independent components of the observed data which can be found, up to some degree of accuracy, using different algorithms based on ICA techniques. Results: The algorithms based on ICA with different approaches to be considered are JADE, Fast ICA, infomax and extended infomax and these performances are measured in terms of Entropy, PSNR and Speed. The simulation results show that the performance of each algorithm is to be preferred over another in terms of speed and reliability. A framework for accommodating four ICA algorithms is developed and hence selects the best algorithm for the specific type of data. Conclusion: ICA plays a vital role in removing of artifacts in EEG signals .It maintains the similarity in their patterns when subject is performing the mental task. The traditional methods applied for remove artifacts can only compromise between eliminating artifacts and protecting useful signals so that the result is not very satisfying. ICA method can protect the useful signals as well as obviously weaken even entirely remove the artifacts in multichannel EEG signals, this characteristic of ICA is the key to get stable EEG patterns which can be used for mental task classification.


INTRODUCTION
Electroencephalography (EEG) is a medical imaging technique that reads the scalp electrical activity generated by the brain structures (Binjadhnan and Ahmad, 2010), Electrical impulses generated by nerve firings in the brain diffuse through the head and can be measured by electrodes placed on the scalp. The EEG gives a coarse view of neural activity and has been used to non-invasively study cognitive processes and the physiology of the brain. The greatest advantage of EEG when compared with other medical imaging techniques is its speed.
When the EEG signals measured by electrodes placed on the scalp and are always under the influences of artifacts (Rizon, 2010). Those artifacts include: line noise from the power line, eye blinks, eye movements, heartbeat, breathing and other muscle activities. Because of these artifacts contained in EEG, the pattern detection in EEG produced from normal mental states is a very difficult problem.

Independent component analysis:
ICA is an extension of PCA but it is more powerful that PCA in the field of signal analysis. In mid 90's several new ICA algorithms (Furukawa, 2010;Hillyard and Galambos, 1970) were introduced with impressive demonstrations on problems like separating different speech signals from a mixed signal. The applications of ICA include but not limited to the fields of biomedical, telecommunications, audio and video signal processing feature extraction, data mining and functional time series analysis. Generally, ICA technique can be regarded as a technique to separate signals from a mixture. There are several ICA algorithms in use. Some of these algorithms are Fast ICA, JADE and First Order Blind Identification (FOBI), Maximum Likelihood and Infomax, algorithms based on Kernel methods and algorithms using time structure like Second Order Blind Identification (SOBI).

ICA model:
A simple mathematical representation of ICA model (Ali et al., 2010) And n represents the Additive White Gaussian Noise (AWGN). It is assumed that there are at least as many observations as sources i.e., M≥N A is called the mixing matrix. The estimation of the matrix S with knowledge of X is the linear source separation problem. This is schematically shown in Fig. 1 A is the mixing matrix and B is the unmixing matrix. The following assumptions ensure that the ICA model estimates the independent components meaningfully. Actually the first assumption is the only true requirement which ICA demands. The other assumptions ensure that the estimated independent components are unique.
The latent variables (or independent components) are statistically independent and the mixing is linear. There is no more than one Gaussian signal among. The latent variables and the latent variables have cumulative density function not much different from a logistic sigmoid.
The number of observed signals, m, is greater than or equal to the number of latent variables, n (i.e., m≥n). If n > m, we come to a special category of Independent Component Analysis called ICA with over-complete bases .In such a case the mixed signals do not have enough information to separate the independent components. There have been attempts to solve this particular problem but no rigorous proofs exist as of yet. If m>n then there is redundancy in the mixed signals. The ICA model works ideally when n = m.
The mixing matrix is of full column rank, which means that the rows of the mixing matrix are linearly independent. If the mixing matrix is not of full rank then the mixed signals will be linear multiples of one another.
The propagation delay of the mixing medium is negligible. Before applying an ICA algorithm on the data, it is usually very useful to do some pre-processing (Ahmad and Ken, 2010).
Some pre-processing techniques that make the problem of ICA estimation simpler and better conditioned are given.
ICA algorithms-algebraic ICA algorithm: An algebraic solution to ICA is proposed by Taro Yamaguchi. This is a non-iterative algorithm but becomes extremely complex to compute when the number of sources goes more than two. For two sources separation it works very fast (Cichoki and Vorobyov, 2000). Two observed signals x 1 and x 2 are given by linear mixture of two independent original signals S 1 and S 2 as in Eq. 4: where, α and β are unknown mixing rates. The algebraic solution to α and β are given by Eq. 5 and 6: 3 2 10 11 3 9 3 8 2 3 10 1 11 2 6 2 8 3 9 1 7 3 5 3 7 1 6 3 2 4 3 4 1 5 where, C 1 , C 2 , C 3 , C 4 , C 5 , C 6 , C 7 , C 8 , C 9 , C 10 and C 11 are as shown in Eq. 7: where, E [ ] denotes the expectation operation. α and β are obtained by solving the Eq. 5-7 with the Ferrari method. Excluding the solutions having non-zero imaginary parts and negative sizes the proper solution is selected (Hyvarinen, 1999). Original independent signals are computed from Eq. 6 by solving value of α and β.
Fast ICA: The advantage of using the gradient method to maximize negentropy is that the inputs z(t) can be used in the algorithm at once, thus enabling fast adaptation in non-stationary environment (Cichoki and Vorobyov, 2000). However convergence is slow and depends on a good choice of learning rate γ.
To make this method efficient, a fast-fixed point algorithm is devised, also called Fast ICA for Negentropy. To understand this algorithm it should be noted that at a stable point of the gradient algorithm, the gradient must be pointing towards w or in other words it must be a scalar multiple of w. That means that adding the gradient of negentropy in w would not change its direction at the stable points and hence convergence can be obtained. The t Eq. 8 can be written as: The coefficient γ is omitted because it would be eliminated by the normalization anyways. Iteration in Eq. 8 does not however have as good of a convergence as the one using kurtosis. The reason behind this is the non-polynomial moments (G's) do not have same nice algebraic properties as cumulates like kurtosis. Hence to get a better convergent algorithm the iteration in Eq. 9 has to be modified. This modification can simply be done by adding some multiple of w to the both sides of the iterant term in Eq. 9 and then changing the value of multiple to find a better convergence speed as in Eq. 9: Adding a multiple of w to the both sides of Eq. 10 would not change the direction of the vector and after normalization in the next step w will be constrained to the unit sphere again. A suitable value of α and thus the Fast ICA algorithm can be found using Newton's method for solving equations (Berg and Scherg, 1991) Newton's method can briefly explained as follows:To find a maxima or minima of any function with respect to some variable, first the function is expanded using Taylor's series and the terms above the quadratic terms are dropped to keep it manageable (since higher order terms don't contribute a lot in the total value of the function (Hyvarinen and Oja, 2000). Let E (not expectation) be a cost or error function which has to be minimized around vector w (n) having m elements and n being the number of iteration. The change in the cost function can be written as in Eq. 10: where, g(n) = m×1 gradient vector of cost function evaluated at w(n) and, H(n) is an m×m 2nd order derivative matrix of the cost function E(w(n)) evaluated at w(n), called Hessian Matrix. Hessian Matrix H is given by Eq. 11: 2 H E(W(n)) = ∆ where, g(n) = m×1 gradient vector of cost function evaluated at w(n) and, H(n) is an mxm 2nd order derivative matrix of the cost function E(w(n)) evaluated at w(n), called Hessian Matrix. Hessian Matrix H is given by Eq. 11: . .
Differentiating ∆E (W (n)) w.r.t ∆ w (n) to find out the minimal value of ∆E gives the condition as in Eq. 12: Expression in Eq. 12 is the Newton's method for updating the vector w to move towards the minimization of the cost function. The advantage of Newton's method is fast convergence but as it can be seen that it is computationally more intensive since one has to calculate inverse of Hessian matrix at each step.
In order to avoid the cost and time consuming calculation of the inverse of Hessian matrix in the Newton's method, an approximation of this method is developed that avoids the use of matrix inversion without sacrificing its essence to employ ICA algorithm (Comon, 1994). The approximation of Newton's method calls for the use of Lagrangian rule for constrained optimization. Lagrangian rule for constrained optimization can be briefly described as follows.Assume a cost function E(w) (E is not expectation) which is suppose to be minimized or maximized under some constraint H i (w) = 0, where i = 1,2,3, …., k. One can write the Lagrangian function based on the given information as in Eq. 13: where, λ 1 ,λ 2 ,….λ k are called Lagrangian multipliers. The minimum (maximum) point of Eq. 14 where its gradient is zero with respect to both w and all of theλ i gives the solution to the original constrained problem, i.e., minimization of E(w) under some constraint H i (w) = 0. The gradient of L(W,λ 1 ,λ 2 ,….,λ k ) with respect toλ i gives the i th constraint function H i (w), so putting all these to zero will give the original constraint condition. When gradient of L(W,λ 1 ,λ 2 ,….,λ k ) is taken with respect to w and equate it to zero, one will get the Eq. 14: Hence the minimization problem has been changed into two sets of equations that are much easier to solve. A possible way to solve these two sets, one given by the constraints, the other by Eq. 15, is some appropriate iteration method like Newton iteration. From Eq. 16 the optima of E{G(w T z)} for constraint ||w|| 2 = 1 can be evaluated as in Eq. 15: where, H(w) = ||w|| 2 -1 = 0 is the only constraint to find out the extrema of E{G(w T z)}. To solve Eq. 16 one can use Newton's method to find the optima with respect to w. Let F=E{zG(W T z)}βw, the derivative of F, i.e., the second derivative of Lagrangian function can be evaluated as in Eq. 16: Thus the Newton iteration from Eq. 13 can be written as in Eq. 17: Left hand side of Eq. 20 is nothing but a new variable to which right hand side value will be assigned, hence the Fast ICA algorithm based on negentropy will become as in Eq. 21: Following is a brief summary of Fast ICA algorithm based on negentropy for finding one maximally non-Gaussian direction, i.e., estimating one independent component.

JADE:
JADE is an algorithm that uses significant eigenpairs of the cumulant tensor F(M) to find out the estimated values of independent components. In this algorithm the tensor eigenvalue decomposition is considered as more of a preprocessing step (Laubach et al., 1999). Eigenvalue decomposition can be viewed as diagonalization. The idea is to diagonalize F(M) for any M using the matrix W. In other words, WF(M)W T is diagonal. This is because the matrix F consists of a linear combination of terms of the form w i w i , T assuming that the ICA model holds. Hence the goal is to take a set of significant eigenmatrices, M i and try to make the matrices WF (M)W T as diagonal as possible. They might not be made exactly diagonal since the model doesn't hold exactly because of some sampling errors. Let Q = WF (M)W T , then maximization of the sum of the square of diagonal elements of the Eq. 22: Two of the main steps in the JADE algorithm are to find the significant eigenpairs of the cumulant tensor {λ r, M r |1 ≤ r ≤ n} and to jointly diagonalize the JADE criterion J JADE (W). These two steps that lead to the JADE algorithm are discussed next.

Infomax method:
The Infomax method has been proposed for performing linear ICA based on a principle of maximum information preservation (hence its name). However, it can also be seen as a maximum likelihood method, or as a method based on the minimization of mutual information. Infomax uses a network whose structure is depicted in Fig. 2 (the figure shows the case of two components; extension to a larger number of components is straightforward). W is a linear block, yielding the Eq. 23: This block thus performs just a product by a square matrix (we shall designate both the block and the matrix by the same letter since this will cause no confusion). After optimization, the components of Y are expected to be as independent from one another as possible. Blocks φ i am auxiliary, being used only during the optimization phase (Lee et al., 2000). Each of them implements a nonlinear function (that we shall also designate byφ i ). These functions must be increasing, with values in [0; 1]. The optimization of W is made by maximizing the output entropy, H (Z).

Extended infomax method:
The algorithm of Bell and Sejnowski which uses a sigmoidal activation function is specifically suited to separate signals with super-Gaussiandistribution (Alfaouri and Daqrouq, 2008). The proposed extension of Infomax ICA that is able to separate with sub and as well as super Gaussian distribution. This preserves the ICA architecture of Infomax algorithm, but it uses a learning rule derived by Girolami and Fyfe. It determines the sign changes (positive to negative and vice versa) required by the algorithm to handle both sub and super Gaussian distributions.

RESULTS AND DISCUSSION
Data were recorded for 10 sec during each task and each task was repeated five times per session (Verobyov and Cichocki, 2002;Vigon et al., 2000). Subjects attended two sessions recorded on separate weeks, resulting in a total of ten trials for each task. With a 250 Hz sampling rate, each 10 sec trial produces 2,500 samples per channel.

Results of fast ICA algorithm: Parameters:
Nonlinearity: log (cosh(y)) No. of iterations: 100 Max. Weight change: 10e-300 Figure 4 shows the independent components obtained using the Fast ICA algorithm from the EEG data mixed with EOG which is shown in Fig. 3.

Results of JADE algorithm:
No adjustable parameters. Execution time in seconds: Trail1: 0.99, Trail 2: 1.01. Figure 5 shows the independent components obtained using the JADE algorithm for the EEG data of Fig. 3.

Results of infomax:
Parameters: learning rate = 0.1 Max. Change in weight = 1e-3  Figure 6 shows the independent components obtained using the Infomax algorithm for the EEG data of Fig. 3.

Results of Extended Informax:
Parameters: Min. Weight-change: 1e-3 Number of iterations=512 Signs: -1: subgaussian 1: supergaussian Initial weight= Identity matrix   Figure 7 shows the independent components obtained using the Extended Infomax algorithm for the EEG data of Fig. 3.  Performance comparisons of algorithms: Computation time: The computation time i.e., the time taken by algorithm to separate the EEG signal is measured (Teplan, 2002). For comparison of algorithms mental multiplication task EEG (Kumar et al., 2008) which is measured for five trails is used. Entropy of original mental multiplication EEG signals of five trails is given in Table 1. Table 2 shows the comparative analysis between original EEG signal and ICA based algorithms. Table 3 shows computational time comparision in second.

CONCLUSION
ICA plays a vital role in removing of artifacts in EEG signals .It maintains the similarity in their patterns when subject is performing the mental task. BCI systems using EEG as control signal suffers from the artifact problem. The traditional methods applied for remove artifacts can only compromise between eliminating artifacts and protecting useful signals so that the result is not very satisfying. However, ICA method can protect the useful signals as well as obviously weaken even entirely remove the artifacts in multi-channel EEG signals, this characteristic of ICA is the key to get stable EEG patterns which can be used for mental task classification.