A Recursive Application of a Support Vector Machine for Protein Spot Detection in 2-Dimensional Gel Electrophoresis

Two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) analysis remains the core 
of proteomic technology because it is currently the most powerful method to analyze large collections 
of proteins. Advances in electrophoresis equipment are making this technique more accessible, but 
effective computer-assisted protein spot detection remains a very labor-intensive endeavor. Protein 
spot analysis is still time-consuming, requires human intervention, and is in need of further 
development. This study explores a technique of recursively applying a Support Vector Machine 
(SVM) to identify protein spots. An SVM is a powerful learner capable of optimizing differences 
between classes. In this context the different classes correspond to the presence/absence of a protein. 
Different experiments are conducted to assess these differences in class formation in the context of a 
normal image and a highly saturated image.


INTRODUCTION
At present, most proteome analysis projects begin with the separation of proteins by two-dimensional electrophoresis. Two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) is a widely used method for separating a large number of proteins from complex protein mixtures and for revealing differential patterns of protein expression. These protein mixtures can come from a variety of sources, such as a complex tumor tissue sample taken from a cancer patient or a homogeneous cell tissue culture sample. An established method for studying differential protein expression is to compare 2-D PAGE images from different samples. Examples of model systems include normal versus cancerous tissue and stimulated versus non-stimulated cell tissue culture samples. This type of study relies on methods that can compare images from at least two different gels. Due to the high variation between gels, detection and quantification of protein differences can be problematic.
Background: The inevitable inventory of genes that will be produced by the Human Genome Project heralds the start of a new era: the Age of Proteomics [1]. Although DNA is the blueprint for life, it is the set of proteins actually expressed at any moment in time that determines the function of a particular cell. Proteomics is the study of the complete protein complement of the cell, tissue, or organism at any one time. One of the main techniques used in this growing field is two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) [2]. Patrick O'Farrell [3] first described 2-D PAGE technology more than 25 years ago. However, recent advances have made the technique more popular by improving what was often a harrowing and time-consuming experience. Fragile polyacrylamide tube gels containing drift-prone carrier ampholytes have been replaced by Immobilized pH Gradient (IPG) strips that allow for simultaneous "in-gel" rehydration and sample application. They offer mechanical stability due to their plastic backing and provide an increased range of pH values. Most importantly, they improve reproducibility between gels and among different laboratories [4]. The first dimension of the 2-D PAGE technique resolves proteins by Isoelectric Focusing (IEF). All proteins have a net charge (the sum of the charges of the amino acid side chains). The pH at which the net charge of the protein equals zero is the isoelectric point (pI). IEF separates (focuses) proteins on the basis of their charge or pI by electrophoresis across a polyacrylamide gel containing a pH gradient. Even proteins that differ from each other by only one amino acid residue can be separated in this manner. Precast, dehydrated, immobilized pH gradient gel strips are widely used for the IEF step and are available in a variety of pH ranges that can be overlapped for even higher resolution. 
Because the voltage necessary for IEF is generally high (up to 3,500 V), the gels are typically run horizontally on a flatbed system complete with a cooling plate. After completion of IEF, the proteins are resolved in the second dimension via SDS-polyacrylamide gel electrophoresis (SDS-PAGE). SDS not only denatures proteins, but also applies a uniform negative charge to the surface of the proteins, allowing separation based on relative molecular weight. After the IPG strips are equilibrated, they are affixed to the top of a vertical SDS-polyacrylamide gel and electrophoresed. Finally, the separated proteins are visualized. Proteins labeled with isotopes are detected by autoradiography; other commonly used methods of detection are silver staining, Coomassie brilliant blue staining and fluorography. New stains and staining techniques now permit proteins present in low abundance to be detected in one step. Matrix-Assisted Laser Desorption Ionization-Time of Flight (MALDI-ToF) Mass Spectrometry (MS) allows the analysis and identification of very small amounts of protein isolated from the gel. These advances have combined to make 2-D PAGE a more attractive option for the analysis of complex protein mixtures. 2-D PAGE has long been recognized as a powerful tool for the analysis of biological samples. Several thousand different proteins can be expressed at one time in a particular tissue sample or organism, and 2-D PAGE makes it possible to resolve these proteins. Using 2-D PAGE, scientists can compare biological samples and identify those proteins whose expression levels have changed. With this method, it is possible to distinguish functionally distinct proteins encoded by the same gene, such as mRNA splice variants and proteins bearing post-translational modifications (e.g., methylation, glycosylation and phosphorylation). 
When combined with a means to quantitatively analyze these patterns, 2-D PAGE provides the ability to analyze complex patterns of gene regulation both temporally and spatially (i.e., subcellular enrichment). The resultant two-dimensional array of spots, each of which corresponds to a single protein species, can then be analyzed with specialized computer software. Computerized imaging and image analysis are now used to analyze the array of spots on the gel. While most standard imaging systems (storage phosphor screen or fluoro imagers, flatbed document scanners, or even old-fashioned densitometers) will do the job of collecting images of 2-D PAGE gels, specialized software has been developed for analyzing the complex patterns of spots. Sophisticated imaging and image analysis strategies are required for conclusive interpretation of 2-D PAGE maps in order to identify pertinent differences in protein expression during regulation of the transcription of discrete sets of genes. A single protein may be present as several spots on a 2-D PAGE gel due to a chemical separation of protein subunits that would normally be associated within the cell. Forms of a protein modified through post-translational processes, which change the charge of that protein, may result in even more spots representing the same protein. Many proteins do not run as a globular mass and sometimes appear as smears on the gel. There is also a major problem of physical overlapping of different proteins within the same area on the gel. Therefore, the enormous potential of 2-D PAGE is severely restricted by the difficulty of the image analysis of each of the individual proteins within the protein spots. A few investigators, most notably Celis et al. 
[5], have made remarkable inroads in the identification of 2-D PAGE-resolved proteins and the development of 2-D PAGE maps and databases, but there is much continued debate about the sustainability of 2-D PAGE as a platform for protein expression mapping. This debate is fueled by its technical demands and by limitations that mostly stem from the analysis of the data. In computer-assisted Proteomic research, the comparison of protein separation profiles involves several heuristic steps, ranging from protein spot detection to the matching of unknown spots. The development of more sophisticated Mass Spectrometry (MS)-based methods to characterize 2-D PAGE-resolved protein spots and proteins from other sources has led to an increase in the efforts now being made to exploit 2-D PAGE as a protein expression mapping tool [6][7][8][9][10]. Many investigators have tried different methods to address the technical demands and limitations of the data analysis. Several current analysis programs rely on recognizing the geometric relationship between the gel profiles, which is modeled on the basis of a given set of known corresponding spots, so-called landmarks. The locations of unknown spots are predicted using the optimized model, and efficient protein spot matching is achieved through this image-warping step. This approach is known to be incapable of modeling all the complex distortions inherent in electrophoretic data, even when polynomial functions together with least squares optimization are used. Some investigators, such as Salmi [11], have tried to satisfy the need for more flexible gel distortion correction by using a hierarchical grid transformation method with stochastic optimization, which provides an adaptive multi-resolution model between the gels to achieve automatic warping of gel images. This method seems to achieve some success in correcting the spot-matching image, but the percentage of spots recognized is still too low. 
The shift in research interest in recent years from Genomics to Proteomics has increased both the potential of and the demand for research on protein activity analysis. That shift was caused by the fact that most diseases manifest themselves at the level of protein activity and by the realization that genetic information seems to be insufficient to predict and distinguish healthy from diseased tissues [12]. The focus in Proteomics is on analysis of the proteins expressed by a genome, which involves three major tasks: (i) protein spot detection, (ii) protein gel image matching, and (iii) spot quantitation. There are many software packages supporting these three steps; most of them require user guidance and allow for comparison of only two gel images, but an efficient Proteomics study of a disease may require a large number of sample gels to be analyzed and compared. When investigators compared two well-known computer analysis programs [13] on these three fundamental steps of 2-D PAGE image analysis, they found each of the programs deficient in one or more of the steps. In spot detection, even the best program only recognized 89% of the real spots and relatively few of the extraneous spots. When the two programs were compared using non-geometrical distortions in gel matching, both required user intervention. During spot quantitation tests, both programs did relatively well at ratios less than 1/6, but there was a large difference between the two programs at larger ratios. One major goal of research in this domain is to minimize human intervention during the entire process, so as to reduce the subjectivity of the human expert and to increase throughput. Other important goals include matching and aligning several 2-D PAGE gels at a time and efficiently handling merged spots and complex regions in a 2-D PAGE gel image. 
Two sub-tasks that are universal in the gel matching process are local matching and global matching. Several of the analysis packages have attempted to build these capabilities into their software. In local matching, a number of local spots compose a pattern P in a source gel S; the goal is to compute all the local patterns P' in the target gel T that resemble both the geometric pattern and the shape of P. Global matching, in contrast, computes the overall list of spot pairs that correspond to each other in the two gels. Usually, local matching produces the landmark spots to be used for global matching. Takahashi et al. [14][15][16][17] developed a fully automated set of algorithms for processing 2-D PAGE. The algorithms are based on the Restriction Landmark Genomic Scanning (RLGS) method. The implemented computer system is called DNAinsight. DNAinsight treats the comparison of RLGS profiles as a pattern matching problem, using Delaunay Nets (DN) and Relative Neighborhood Graph (RNG) algorithms. The algorithms are fully automated and the company states that no human intervention is required. One drawback of this approach is that it produces a number of false positives (ill-recognized spots) and false negatives (unrecognized spots), which subsequently require a time-consuming spot revision process by a human expert. To try to overcome this drawback, Takahashi and his colleagues developed a new approach that uses Gaussian modeling of landmark spots. Another algorithm, called Master Spot Pattern (MSP), has also been applied to make it easier to distinguish the spot patterns of diseased and non-diseased tissues. The differential protein expression analysis method DIGE [18] has been used to try to improve the reproducibility and reliability of quantification between samples. DIGE includes a standard sample in each gel, thereby improving the accuracy of protein quantification between samples. 
Another approach [19][20][21] uses a spot detection algorithm based on a watershed transformation applied to the gradient image. The analysis can interpret twin spots/streaks and so-called complex regions using a Linear Programming (LP) formulation. For matching, the algorithm performs what its authors call global-via-local matching. Local matching is done in step one to generate a number of landmark spots, which are used in step two for global matching. This approach is implemented in CAROL and has been deployed over the Internet for use by remote researchers. CAROL consists of two components: a core functionality component residing on a server machine and a graphical user interface accessed via the Internet.

MATERIALS AND METHODS
Kaczmarek et al. [12] describe an approach for automatic matching of two 2-D PAGE images using a feature-based matching technique and Fuzzy Alignment (FA). FA allows automatic matching of images with different numbers of features and with unknown correspondence. The approach has been tested on real and simulated data.

Support Vector Machines:
The Support Vector Machine (SVM) [22,23] is one of the most powerful classification methods and enjoys considerable empirical and theoretical support. SVMs have proven successful in many applications, such as object recognition [24], face detection [25] and text categorization [26]. Given a data set consisting of two types of points, positive and negative, an SVM attempts to compute a separating hyperplane (a decision surface) with maximum margin between the points of the two sets. This phase constitutes training, and the computed hyperplane is subsequently used to classify new, unseen points in the testing phase. However, the task could merely be to separate the points of the given (training) data set. When various lines may be chosen as decision surfaces, the SVM method selects the middle element from the "widest" set of parallel lines. The best decision surface is determined by only a small set of (training) points called the support vectors. This case of finding the optimal separating hyperplane with maximum margin can be generalized to non-separable cases by introducing the concept of a "soft margin" and by constructing nonlinear classifiers via kernel functions. The kernel functions map the training data into a higher dimensional feature space, and the SVM constructs an optimal separating hyperplane in that space, which corresponds to a nonlinear classifier in the input space. The kernel function can be linear, polynomial, or Gaussian (RBF). With kernel functions and a high dimensional space, computing the hyperplane requires quadratic programming, which is computationally intensive and non-trivial to implement. The Kernel Adatron (KA) algorithms [27,28] provide an alternative to this extensive computation by offering methods that find solutions rapidly, with a fast rate of convergence to an optimal solution. 
While the classic Adatron algorithm [27] was designed for linear classifiers, the KA adapts it to the (high dimensional) feature space of SVMs [28]. The KA algorithm maximizes the dual Lagrangian by gradient ascent. Finally, some of the advantages of using a support vector machine in this application include the ability to separate classes using normals at the boundary lines and the ability to transform a curved space into a linear space.
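The Kernel Adatron update described above can be illustrated with a minimal sketch. This is not the implementation used in the study: the function names, the learning rate `eta`, the kernel width `sigma`, the optional soft-margin bound `c`, and the bias-free decision function are all assumptions chosen for illustration. Each multiplier is moved by gradient ascent on the dual objective and clipped so that it stays non-negative.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def train_kernel_adatron(points, labels, sigma=1.0, eta=0.1, epochs=1000, c=None):
    """Return one multiplier (alpha) per training point.

    labels must be +1 or -1. eta, sigma and c are illustrative
    parameters (assumptions), not values taken from the paper.
    """
    n = len(points)
    # Precompute the kernel (Gram) matrix once.
    gram = [[rbf_kernel(points[i], points[j], sigma) for j in range(n)]
            for i in range(n)]
    alphas = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            # Current output of the bias-free classifier at point i.
            z = sum(alphas[j] * labels[j] * gram[i][j] for j in range(n))
            # Gradient ascent step on the dual; the margin y_i * z should reach 1.
            alpha = alphas[i] + eta * (1.0 - labels[i] * z)
            alpha = max(alpha, 0.0)      # multipliers must stay non-negative
            if c is not None:
                alpha = min(alpha, c)    # optional soft-margin upper bound
            alphas[i] = alpha
    return alphas


def predict(x, points, labels, alphas, sigma=1.0):
    """Classify x as +1 or -1 using the trained multipliers (no bias term)."""
    score = sum(a * y * rbf_kernel(x, p, sigma)
                for a, y, p in zip(alphas, labels, points))
    return 1 if score >= 0 else -1
```

Points whose multiplier ends at zero play no role in the decision function; the remaining points act as the support vectors.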

Protein Spots
The Proposed Method: Two-dimensional gel analysis remains the core of proteomic technology because it is currently the most powerful method to analyze large collections of proteins. Advances in electrophoresis equipment are making the technique more and more accessible to scientists previously intimidated by lengthy procedures and disappointing outcomes, but effective computer-assisted image analysis of gels is still time-consuming and in need of further development. The research outlined here demonstrates a data-mining, machine learning process that improves the processing and automation of image analysis of 2-D PAGE gels. The approach uses a Kernel Adatron Support Vector Machine to analyze a 2-D PAGE gel. Analyzing these images poses many challenges, including:
* Proteins, when stained, may not be visible to the human eye, either because of their low concentration or because the stain is not specific for that type of protein.
* Highly saturated areas make it hard to identify proteins.
* Poor image quality. A protein may smear across a 2-D PAGE gel. This can be due to the chemical makeup of the protein, i.e., post-translational modification.
A traditional automated approach to analyzing a 2-D PAGE examines the whole image for detecting protein spots. However, this approach must contend with inherent image complexity in terms of pixel density diversity and the pixel topology.
An alternative approach is to preprocess the image by dynamically partitioning the image, assessing each subsection and then merging the corresponding pieces.
Partitioning the Image: Initial analysis involves extracting pixel coordinates and their respective intensities. These intensities serve as a basis for identifying local maxima. In the event that a local maximum spans a group of neighboring pixels, a centroid pixel is determined by using the average of the minimum/maximum values with respect to the local minima points. These centroids act as corner anchors for extracting sub-images from the main image. Thus any rectangle may contain up to four proteins, one at each corner. There will be no proteins at the edges; otherwise, it would be necessary to partition the rectangle into smaller rectangles. Rectangles with zero proteins may be ignored. Figure 2 illustrates these ideas. We conduct several experiments applying the Kernel Adatron to these partitioned rectangles.
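The partitioning step can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the authors' code: the image is taken to be a list of rows of intensities, a local maximum is a plateau of equal-valued, 8-connected pixels with no brighter neighbor, and the centroid is assumed to be the midpoint of the plateau's bounding box (one interpretation of "the average of the minimum/maximum values"). The names `find_spot_centroids`, `extract_subimage` and `threshold` are hypothetical.

```python
def find_spot_centroids(image, threshold=0):
    """Return one (row, col) centroid per local-maximum plateau.

    image is a list of rows of pixel intensities. Plateaus whose
    intensity is not above threshold are ignored (an assumption that
    keeps a flat background from counting as a spot).
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []

    def neighbors(r, c):
        # 8-connected neighborhood, clipped to the image bounds.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                    yield r + dr, c + dc

    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            # Flood-fill the plateau of equal-intensity pixels containing (r, c).
            value = image[r][c]
            plateau, stack, is_max = [], [(r, c)], True
            seen[r][c] = True
            while stack:
                pr, pc = stack.pop()
                plateau.append((pr, pc))
                for nr, nc in neighbors(pr, pc):
                    if image[nr][nc] > value:
                        is_max = False  # a brighter neighbor exists
                    elif image[nr][nc] == value and not seen[nr][nc]:
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            if is_max and value > threshold:
                rs = [p[0] for p in plateau]
                cs = [p[1] for p in plateau]
                # Midpoint of the plateau's bounding box (assumed centroid rule).
                centroids.append(((min(rs) + max(rs)) // 2,
                                  (min(cs) + max(cs)) // 2))
    return centroids


def extract_subimage(image, corner_a, corner_b):
    """Crop the rectangle whose opposite corners are two centroids."""
    top, bottom = sorted((corner_a[0], corner_b[0]))
    left, right = sorted((corner_a[1], corner_b[1]))
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

How centroids are paired into rectangles is not specified in the text, so the sketch leaves that choice to the caller and only shows the crop for one pair of corners.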

Experiment 1: Proof of Concept:
All experiments seek to divide a rectangle into two classes as input values for the Support Vector Machine. Since an image contains at least two proteins, the coordinates of the local maxima for each protein will be identified as the first class. The question remains of how to identify the second class.
Since proteins occur at the corners of an image, the first experiment uses the center of the rectangle as the second class. The idea is that a single point will rapidly converge to a solution. The parameters for this experiment are set to 1000 epochs with dither set to 0.1.
Although the Support Vector Machine trained very fast, it did a poor job of recognizing the proteins, as indicated in Fig. 3. Upon inspection of the results, it is evident that more points are needed for the second class. The next experiment further populates the second class by adding the midpoints of each of the edges. The intent is to provide tighter boundaries around the points from the first class. Once again, this experiment trains for 1000 epochs with a dither rate set to 0.1. The second experiment produced a better image than the first experiment. In this case, the proteins appear more clearly. However, as Fig. 4 illustrates, the boundaries need further definition. The next experiment seeks to maximize the number of points in the second (non-protein) class. This is accomplished by inverting the original image from Fig. 2 and extracting the maximum values. Figure 5 shows the inverted image. This technique produces 14,210 points for the second class. Figure 6 shows the results from the third experiment. The Support Vector Machine is able to identify all three proteins with crisp boundaries. A question arises regarding the extent of the boundary. The experiment strives to discriminate points into a two-class system, so there is a tradeoff between granularity and simplicity.
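The three ways of populating the second (non-protein) class can be sketched as below. This is only an illustration under assumptions: coordinates are (row, col) pairs within a rectangle of the given height and width, the image is a list of rows of intensities, `max_value=255` assumes 8-bit pixels, and "extracting the maximum values" is read as keeping every pixel that attains the peak of the inverted image. All function names are hypothetical.

```python
def center_point(height, width):
    """First variant: the rectangle's center is the only second-class point."""
    return [((height - 1) / 2.0, (width - 1) / 2.0)]


def center_and_edge_midpoints(height, width):
    """Second variant: the center plus the midpoint of each of the four edges."""
    bottom, right = float(height - 1), float(width - 1)
    mid_r, mid_c = bottom / 2.0, right / 2.0
    return [
        (mid_r, mid_c),   # center
        (0.0, mid_c),     # top edge midpoint
        (bottom, mid_c),  # bottom edge midpoint
        (mid_r, 0.0),     # left edge midpoint
        (mid_r, right),   # right edge midpoint
    ]


def inverted_image_maxima(image, max_value=255):
    """Third variant: invert the image and keep the pixels at its peak value.

    max_value=255 assumes 8-bit intensities; after inversion, the
    least intense pixels of the original become the maxima.
    """
    inverted = [[max_value - v for v in row] for row in image]
    peak = max(max(row) for row in inverted)
    return [(r, c)
            for r, row in enumerate(inverted)
            for c, v in enumerate(row)
            if v == peak]
```

The first two variants yield one and five points respectively, while the third grows with the size of the background, which is consistent with the much larger second class reported for the third experiment.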
Experiment 2: Highly Saturated Region:
Identifying protein spots in highly saturated regions, as illustrated in Fig. 7, is a major challenge in the field of Bioinformatics. The next experiment extracts the circled region from Fig. 7 for analysis, as depicted in Fig. 8. Visual inspection reveals that Fig. 8 Fig. 10, diminishes the differences between the images. One argument is that the Support Vector Machine-based image accentuates the contrast. Raising the contrast in the original image, as displayed in Fig. 11, shows that the results from the Support Vector Machine are reasonably close in appearance.

DISCUSSION
One challenge faced by Bioinformaticists is the lack of standards for identifying protein spots. Different protein spot detection programs (and lab technicians) frequently disagree on what constitutes a protein and on the extent, in terms of surface area, of the protein. Figure 12, a sample screenshot from an unnamed commercial product, illustrates this dilemma. The software package assumes contiguous boundaries between proteins. This assumption tends to skew the actual size of the proteins. The two-class approach of the Support Vector Machine seeks to transition quickly from where a protein ends to "non-protein" terrain. A second challenge is that it is difficult to establish a crisp border when defining a protein's perimeter. For example, Figure 13 illustrates this situation for the two proteins in the middle of the image. One solution is to apply Bezier curves to each of the protein boundaries. In contrast to the example in Fig. 13, the Support Vector Machine produces very smooth curves with little jaggedness. The Bioinformatic challenges, in terms of lack of standards, make it difficult to statistically assess any methodology. What seems to be important is having a defined, repeatable and consistent process. This approach offers all of these features.

CONCLUSION
A Support Vector Machine is able to identify protein spots and mimic their shapes with smooth boundaries. The approach performs very well when the original image is highly contrasted, thus producing an abundant set of points for both classes.