Features Extraction Based on Linear Regression Technique

Problem statement: Matching complex objects is one of the most difficult tasks in the pattern recognition field. The problem is made difficult by the seemingly infinite variety of shapes and classes in use. The difficulties relate to absolute shape measurement, given the impossibility of directly mapping shapes, as such, into a feature space. Approach: In this study, an object was modeled using the distances of its boundary pixels. The invariants were derived from the distance of each boundary pixel to the object's central point. By performing linear regression on each set of sorted distances, a unique set of numerical features was produced from the coefficients of the fitted linear function. This set of numerical values is then proposed as the object's features. Results: The experiments show that the coefficients of the linear function fitted to each object's boundary-distance plot give better recognition than those of a polynomial function of degree greater than one. Conclusion/Recommendations: More than 200 trademark images were tested and a recognition accuracy of almost 90% was achieved.


INTRODUCTION
In this study, the trademark has been chosen as a case study to show that invariant properties can be derived from the set of distances of its boundary pixels. A search of the literature found very few previous attempts to solve the problems of recognizing or matching trademarks. These problems are made difficult by the seemingly infinite variety of shapes and classes in use. The difficulties relate to absolute shape measurement, given the impossibility of directly mapping shapes, as such, into a feature space (Cortelazo et al., 1994).
The method of Cortelazo et al. (1994) builds a hierarchy of contour regions and records the inclusion relationships between contours in a tree structure. Matching is then performed by checking the similarity of trademarks whose trees are of the same type. This method performs correctly when the trademarks are relatively simple (as in their sample test), because the main criterion for creating a contour tree is based on the holes in an object. The more holes there are, that is, the more complex the logo is, the more ambiguous the contour tree becomes, and the resulting tree structure can be misleading. Hu (1962) presented a set of invariant moments derived from the second- and third-order moments. The invariant moment set is used in many applications, such as character recognition systems (Hu, 1962) and 3-D aircraft identification systems (Dudani et al., 1977). Somaie and Ipson (1995), Eakins et al. (2002) and Sheng and Shen (1994) presented a full-face identification system using 2-D isodensity moments. Geometrical and template matching techniques are common approaches to the recognition task, and the correlation function has often been used to compute the matching ratio (Gonzalez and Woods, 2002). Baboulaz and Dragotti (2009) demonstrate through simulations that the sampling model which enables the use of finite rate of innovation principles is well suited to modelling the acquisition of images by a camera. Simulations of image registration and image super-resolution of artificially sampled images are first presented, analyzed and compared to traditional techniques. The approach of Zhu and Doermann (2009) is segmentation free and layout independent, and it addresses logo retrieval in an unconstrained setting of 2-D feature point matching. Finally, they quantitatively evaluate the effectiveness of their approach using large collections of real-world complex document images.
In this study, the trademark image was modeled using the distances of its boundary pixels. The invariants were derived from the distance of each boundary pixel to the object's central point. By performing polynomial regression on each set of sorted distances, a unique set of numerical features was produced from the coefficients of the fitted polynomial function.

Representation model:
The representation model proposed in this work is based on the distances of the boundary pixels, as they carry useful information about the object's boundaries that can be used for object recognition. The distance of interest is that between a boundary pixel's coordinates and the object's central point (Fig. 1).
Basically, the boundary pixels are produced by running an edge detection algorithm over the entire image plane. Next, the central point of the object is determined. For each boundary pixel, the distance from its coordinates to the central point is measured.
The entire procedure is summarized as follows:
• Let the trademark symbol be represented as a binary image, a matrix of order N × N, obtained with a practical threshold value
• Perform the edge detection algorithm over the entire image plane
• Compute the object's central point, C(x, y)

The invariant properties, i.e., invariance to rotation, scaling and translation, are produced by putting the boundary pixel distances into one ordered set.
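The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the study: the edge detection step is replaced by a simple 4-neighbour boundary test, the central point C(x, y) is assumed to be the mean of the boundary pixel coordinates, and the function names boundary_pixels and sorted_boundary_distances are hypothetical.

```python
import numpy as np

def boundary_pixels(binary):
    """Return (row, col) coordinates of foreground pixels that touch the background.

    Assumption: this 4-neighbour test stands in for the edge detection
    algorithm, which the text does not specify.
    """
    img = np.asarray(binary, dtype=bool)
    padded = np.pad(img, 1, constant_values=False)
    # A pixel is interior when its four neighbours are all foreground.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(img & ~interior)

def sorted_boundary_distances(binary):
    """Distances from every boundary pixel to the central point, in ascending order."""
    pts = boundary_pixels(binary).astype(float)
    centre = pts.mean(axis=0)  # assumed definition of C(x, y)
    dists = np.linalg.norm(pts - centre, axis=1)
    return np.sort(dists)

# Usage: a filled 5 x 5 square inside a 9 x 9 binary image.
image = np.zeros((9, 9), dtype=bool)
image[2:7, 2:7] = True
distances = sorted_boundary_distances(image)
```

Sorting removes the dependence on where the boundary traversal starts, and measuring from the central point removes the dependence on the object's position in the image.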

MATERIALS AND METHODS
Feature selection and recognition scheme: As a first step in deriving a mathematical equation that describes the behavior of each set of sorted distances, the collected sample data are plotted on a scatter diagram. A scatter diagram gives a visual picture of the kind of relationship involved and suggests the type of equation that will best fit the data. The usual way to construct a scatter diagram is to put the dependent variable Y (pixel count) on the vertical axis and the independent variable X (pixel distance) on the horizontal axis; thus a two-dimensional plane is formed by X and Y. Two ways of representing the data set are discussed here: a linear equation and a polynomial equation. The performance of the two is compared later.
Linear regression: There is, of course, no limit to the number of straight lines that could be drawn on any scatter diagram. Many of these lines do not fit the data and must be disregarded, while others may appear to fit the points very well. Only one line is needed, however; the primary objective is to select the line that best fits the data. The criterion used here to select the best-fitting line is the most common one, least squares. The least squares method is outlined as follows.
Let there be n ordered pairs of pixel distances, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and let the fitted line be $y = a + bx$. For the line to fit the data best, the sum of squared deviations

$$\sum_{i=1}^{n} (y_i - a - b x_i)^2$$

must be a minimum. Any line that minimizes this quantity is called the least squares line. Since b in Equation (12) determines the slope of the line, the behavior of each trademark, i.e., the trademark feature, can best be described by this constant value. The experiment has shown that each trademark has a different slope.
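As a minimal sketch of the least squares fit described above, the following Python function computes the intercept and slope of a line y = a + bx in closed form. The function name least_squares_line, the symbol a for the intercept and the sample data are assumptions made for illustration; only the slope b is named in the text.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (a, b) minimizing sum((y_i - a - b * x_i) ** 2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    sxx = np.sum((x - x_mean) ** 2)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # Slope: covariance of x and y divided by the variance of x.
    b = np.sum((x - x_mean) * (y - y_mean)) / sxx
    # Intercept: the least squares line passes through the point of means.
    a = y_mean - b * x_mean
    return a, b

# Usage on hypothetical data lying exactly on y = 1 + 2x.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For a quick cross-check, np.polyfit(x, y, 1) returns the same slope and intercept (in that order), and a higher degree argument gives the polynomial fit that the study compares against.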
In the recognition stage, the unknown pattern $x_u$ is first passed through the preprocessing stage, where a sequence of pixel distances is produced. Secondly, the Euclidean distance $d_k$ between the features of $x_u$ and those of each reference trademark k is computed, and the trademark with the minimum distance $d_k$ is taken as the recognized one.
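The minimum-distance rule can be sketched as follows. This is an illustrative sketch, not the study's code: the function name recognize and the numeric feature values are hypothetical, and each trademark is assumed to be described either by a single slope or by a short vector of regression coefficients.

```python
import numpy as np

def recognize(unknown_features, reference_features):
    """Return (k, d_k): index of the nearest reference and its Euclidean distance."""
    u = np.atleast_1d(np.asarray(unknown_features, dtype=float))
    # One row per reference trademark, one column per feature.
    refs = np.asarray(reference_features, dtype=float).reshape(
        len(reference_features), -1
    )
    d = np.linalg.norm(refs - u, axis=1)
    k = int(np.argmin(d))
    return k, float(d[k])

# Usage with hypothetical slope values for three reference trademarks.
reference_slopes = [0.10, 0.35, 0.80]
k, d_k = recognize(0.33, reference_slopes)
```

With a single slope per trademark the Euclidean distance reduces to an absolute difference; the same function also accepts coefficient vectors, as would be needed for the polynomial variant.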

RESULTS
Using the technique described above, 224 trademarks were digitized. The digitized image size was set to 256×256 pixels, using scanning resolutions of 75, 100, 150 and 200 dpi. Figure 2 shows some of the thresholded digitized samples. Out of the 224 trademarks, 100 were used as references and the rest were used for testing. For every trademark, the distance of each boundary pixel to the central point was calculated and put in sequential order. Next, the 124 test trademarks (some of which are shown in Fig. 2), with different scaling, translation and rotation, were put through the same process and matched against the references using both regression methods. The results are discussed in the next section. Figure 3 shows the plot of each set of sorted distances for 10 trademarks. The plots show a different slope for each trademark. Figure 4 shows the linear regression for a selected plot from Fig. 3.

DISCUSSION
Each plot of sorted boundary pixel distances gives a different tangent value, i.e., a different coefficient of the linear regression function. The plot of sorted boundary pixel distances remains the same for the same object under various orientations, scalings and translations. This indicates that the coefficient of the linear function can be used as a unique feature for that particular object, because it captures the general behavior of each trademark. In this way, linear regression provides good matching results, with a recognition accuracy of almost 90%.
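The invariance claim can be checked on a small synthetic shape. The sketch below is an assumption-laden illustration rather than the experiment reported here: the L-shaped test image, the 4-neighbour boundary test and the use of the mean boundary coordinate as the central point are all hypothetical choices. It covers a 90-degree rotation and a translation only; scale invariance would need a normalization step that the text does not spell out, so it is not shown.

```python
import numpy as np

def sorted_distances(binary):
    """Sorted distances from boundary pixels to their mean position (assumed centre)."""
    img = np.pad(np.asarray(binary, dtype=bool), 1)
    core = img[1:-1, 1:-1]
    # Interior pixels have all four neighbours in the foreground.
    interior = img[:-2, 1:-1] & img[2:, 1:-1] & img[1:-1, :-2] & img[1:-1, 2:]
    pts = np.argwhere(core & ~interior).astype(float)
    return np.sort(np.linalg.norm(pts - pts.mean(axis=0), axis=1))

# Hypothetical L-shaped object in a 32 x 32 binary image.
shape = np.zeros((32, 32), dtype=bool)
shape[8:20, 8:12] = True   # vertical bar
shape[16:20, 8:24] = True  # horizontal bar

rotated = np.rot90(shape)                            # 90-degree rotation
shifted = np.roll(shape, shift=(5, 4), axis=(0, 1))  # translation, no wrap-around

same_after_rotation = np.allclose(sorted_distances(shape), sorted_distances(rotated))
same_after_shift = np.allclose(sorted_distances(shape), sorted_distances(shifted))
```

Because the sorted distance sequence is unchanged, any regression coefficient fitted to it is unchanged as well.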

CONCLUSION
The trademark matching task is a very interesting challenge in the area of pattern recognition, as it involves not only complex images but also highly diversified patterns. Because of this complexity, very few researchers have attempted to solve it, and not many reports on the subject have appeared over the past three decades. In this study, the author has described a simpler, more approachable method based on mathematical reasoning. For a subject such as this, a numerical solution is seen to be more viable than other conventional methods, such as contour-based descriptions and feature-based descriptions based on corners and T-junctions.