Steel Plates Faults Diagnosis with Data Mining Models

Problem statement: Over the last two decades, Fault Diagnosis (FD) has gained major importance in enhancing manufacturing quality and reducing the cost of product testing. A quick and correct FD system helps to avoid product quality problems and facilitates preventive maintenance. FD may be considered a pattern recognition problem, and methods for improving the accuracy and efficiency of pattern recognition have been gaining increasing attention. Many recently developed computational tools and algorithms could be used. Approach: This study evaluates the performance of three popular and effective data mining models in diagnosing seven commonly occurring faults of steel plates, namely Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps and Other_Faults. The models are the C5.0 decision tree (C5.0 DT) with boosting, the Multilayer Perceptron Neural Network (MLPNN) with pruning and Logistic Regression (LR) with forward stepwise selection. The steel plates faults dataset investigated in this study is taken from the University of California at Irvine (UCI) machine learning repository. Results: Given a training set of such patterns, each model learned how to classify a new case in the domain. The diagnostic performance of the models is presented using statistical accuracy, specificity and sensitivity. The C5.0 decision tree with boosting achieved a remarkable performance, with 97.25 and 98.09% accuracy on the training and test subsets respectively, and outperformed the other two models. Conclusion: Experimental results showed that data mining algorithms in general, and decision trees in particular, have a great impact on the problem of steel plates fault diagnosis.


INTRODUCTION
A fault may be defined as an unacceptable deviation of at least one characteristic property or attribute of a system from its acceptable, usual performance. Fault diagnosis is therefore the determination of the kind, size, location and time of detection of a fault. The main purpose of any fault diagnosis system is to determine the location and occurrence time of possible faults on the basis of accessible data and knowledge about the performance of the diagnosed processes. Manual fault diagnosis is the traditional approach, in which an expert with an electronic meter obtains information about the relevant operational equipment, checks the maintenance manual and then diagnoses the probable causes of a particular fault. Intelligent fault diagnosis techniques, however, can provide quick and correct systems that help to avoid product quality problems and facilitate preventive maintenance. These intelligent systems use different artificial intelligence and data mining models, and they should be simple and efficient. Decision trees, support vector machines, fuzzy logic algorithms, neural networks and statistical algorithms are alternative approaches commonly employed in industry to detect the occurrence of failures or faults (Seng-Yi and Chang, 2011). The numbers of fault diagnosis papers published in the ScienceDirect database in 2010, 2011 and 2012 (in press) are 512, 732 and 39 respectively (search date 31-10-2011). Fault diagnosis problems are challenging and attractive applications for experts and researchers. Recent review articles can be found in (Faiz and Ojaghi, 2009; Venkatasubramanian et al., 2003a; 2003b; Zhang and Jiang, 2008; Ma and Jiang, 2011; Maurya et al., 2007).
This study evaluates the performance of three popular and effective data mining models in diagnosing seven commonly occurring faults of steel plates, namely Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps and Other_Faults. The models are the C5.0 decision tree (C5.0 DT) with boosting, the Multilayer Perceptron Neural Network (MLPNN) with pruning and Logistic Regression (LR) with forward stepwise selection. DTs focus on conveying the relationships among the rules that express the results. They have an expressive design, allow for non-linear relations between independent attributes and their outcomes and isolate outliers. The C5.0 DT model is a recently developed DT algorithm; it includes discretization of numerical attributes using information-theory-based functions, boosting, pre- and post-pruning and other state-of-the-art options for building a DT model. Logistic Regression (LR), also known as nominal regression, is a statistical technique for classifying records based on the values of input attributes. It is similar to linear regression but takes a categorical target field instead of a numeric one. LR works by building a set of equations that relate the input attribute values to the probabilities associated with each of the output attribute categories. Once the model is generated, it can be used to estimate probabilities for new data. For each record, a probability of membership is computed for each possible output category. The target category with the highest probability is assigned as the predicted output value for that record (Maalouf, 2011). A neural network, sometimes called a multilayer perceptron, is basically a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. The processing units are arranged in layers.
There are typically three parts in a neural network: an input layer, with units representing the input fields; one or more hidden layers; and an output layer, with a unit or units representing the output field(s). The units are connected with varying connection strengths or weights.

Steel plates faults dataset: The Steel Plates Faults Data Set used in this study comes from the UCI Machine Learning Repository (Frank and Asuncion, 2010). It classifies steel plate faults into 7 different types: Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps and Other_Faults. The goal was to train machine learning models for automatic pattern recognition. The dataset includes 1941 instances, each labeled with a fault type. Table 1 shows the class distribution and the list of attributes. The detailed information and the whole dataset can be accessed from http://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults. The dataset was donated by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. It was first used in July 2010 (Buscema et al., 2010). Each instance of the dataset has 27 independent variables and one fault type.
Literature review: Artificial Intelligence (AI) tools have been introduced to enhance the accuracy of fault identification. Leger et al. (1998) examined the feasibility of applying cumulative summation control charts and artificial neural networks together for fault diagnosis. Simani and Fantuzzi (2000) proposed a two-stage fault diagnosis method to test a neural network model of a power plant. Lo et al. (2002) addressed the problem of fault diagnosis by integrating genetic algorithms and qualitative bond graphs. Hung and Wang (2004) presented a novel cerebellar model articulation controller neural network method for the fault diagnosis of power transformers. Dong et al. (2008) combined rough sets and a fuzzy wavelet neural network to diagnose faults of power transformers, concluding that the diagnosis accuracy may be limited by the number of hidden layers and the correlated training parameters of neural networks. Lau et al. (2010) presented an adaptive neuro-fuzzy inference system for online fault diagnosis of a gas-phase polypropylene production process. Eslamloueyan (2011) proposed a hierarchical artificial neural network for isolating the faults of the Tennessee-Eastman process, which proved efficient.
Classification models: Decision tree: DT models are powerful classification algorithms that are becoming increasingly popular with the growth of data mining applications (Nisbet et al., 2009). As the name implies, this model recursively separates data samples into branches to construct a tree structure for the purpose of improving the classification accuracy. Each tree node is either a leaf node or a decision node. All decision nodes have splits, which test the values of some functions of the data attributes. Each branch from a decision node corresponds to a different outcome of the test. Each leaf node has a class label attached to it. The general algorithm to build a DT is as follows:
• Start with the entire training subset and an empty tree
• If all training samples at the current node n have the same class label c, then the node becomes a leaf node with label c
• Otherwise, select the splitting attribute x that is the most important in separating the training samples into different classes. This attribute x becomes a decision node
• A branch is created for each individual value of x and the samples are partitioned accordingly
• The process is repeated recursively until a specified stopping criterion is met
Different DT models use different splitting algorithms that maximize the purity of the resulting classes of data samples. Popular DT models include ID3, C4.5 (Quinlan, 1986; 1993), CART (Breiman, 1984), QUEST (Loh and Shih, 1997), CHAID (Berry and Linoff, 1997) and C5.0. Common splitting algorithms include entropy-based information gain (used in ID3, C4.5 and C5.0), the Gini index (used in CART) and the Chi-squared test (used in CHAID). This study uses the C5.0 DT algorithm, which is an improved version of the C4.5 and ID3 algorithms. It is a commercial product designed by RuleQuest Research Pty Ltd to analyze huge datasets and is implemented in the SPSS Clementine data mining workbench.
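The general tree-building steps listed above can be sketched in a few lines of Python. This is a minimal ID3-style illustration, not the C5.0 implementation: it assumes categorical attributes stored in dictionaries, uses weighted entropy as the split measure, and stops when a node is pure or no attributes remain. The attribute names and values in the usage note are hypothetical.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or an (attribute, {value: subtree}) decision node."""
    if len(set(labels)) == 1:            # all samples share one class: leaf node
        return labels[0]
    if not attributes:                   # stopping criterion: nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in set(row[attr] for row in rows):   # one branch per observed value
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (attr, branches)


def classify(tree, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree
```

For instance, with four toy samples whose hypothetical attribute "texture" separates Bumps from Stains perfectly while "size" does not, build_tree places "texture" at the root and classify follows the matching branch to the leaf label.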
C5.0 uses information gain as a measure of purity, which is based on the notion of entropy as in Eq. 1 and 2. Suppose the training subset $X$ consists of $n$ samples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in R^p$ is the vector of independent attributes of sample $i$ and $y_i \in Y = \{c_1, c_2, \ldots, c_k\}$ is its predefined class. Then the entropy, entropy(X), of the set $X$ relative to this k-wise classification is defined as:

$$\mathrm{entropy}(X) = -\sum_{i=1}^{k} p_i \log_2 p_i \tag{1}$$

where $p_i$ is the proportion of $X$ belonging to class $c_i$. Information gain, gain(X, A), is simply the expected reduction in entropy caused by partitioning the set of samples $X$ based on an attribute $A$:

$$\mathrm{gain}(X, A) = \mathrm{entropy}(X) - \sum_{v \in \mathrm{values}(A)} \frac{|X_v|}{|X|}\, \mathrm{entropy}(X_v) \tag{2}$$

where values(A) is the set of all possible values of attribute $A$ and $X_v$ is the subset of $X$ for which attribute $A$ has the value $v$, i.e., $X_v = \{x \in X \mid A(x) = v\}$.
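As a small worked illustration of entropy and information gain (the numbers are hypothetical and not taken from the steel plates dataset), suppose $X$ holds 8 samples, 4 of class $c_1$ and 4 of class $c_2$, and a binary attribute $A$ splits $X$ into $X_a$ (3 of $c_1$, 1 of $c_2$) and $X_b$ (1 of $c_1$, 3 of $c_2$). Using entropy as $-\sum_i p_i \log_2 p_i$ and gain as the entropy of $X$ minus the size-weighted entropies of the subsets:

$$\mathrm{entropy}(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

$$\mathrm{entropy}(X_a) = \mathrm{entropy}(X_b) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811$$

$$\mathrm{gain}(X, A) = 1 - \left(\tfrac{4}{8}\cdot 0.811 + \tfrac{4}{8}\cdot 0.811\right) \approx 0.189$$

An attribute that separated the two classes perfectly would give a gain of 1 bit, so $A$ is only weakly informative in this example.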
Boosting, winnowing and pruning are three methods used in the C5.0 tree construction; they aim to build a tree of the right size (Berry and Linoff, 1997). They increase the generalization and reduce the overfitting of the DT model. Boosting is a method for combining classifiers; it works by building multiple models in a sequence. The first model is built in the usual way. The second model is then built in such a way that it focuses on the samples that were misclassified by the first model. The third model is built to focus on the second model's errors and so on (Nisbet et al., 2009). When a new sample is to be classified, each model votes for its predicted class and the votes are counted to determine the final class. The winnowing method investigates the usefulness of the predictive attributes before the model is built (Lau et al., 2010). This ability to pick and choose among the predictive attributes is an important advantage of tree-based modeling techniques. The winnowing method preselects a subset of the attributes that will be used to construct the tree. Attributes that are irrelevant are excluded from the tree-building process.
In the case of the current steel plates dataset, only 13 attributes have been selected to build the tree. Pruning is the last method used to increase the performance of the C5.0 DT model here. It consists of two steps: pre-pruning and post-pruning (Eslamloueyan, 2011). The pre-pruning step allows only nodes with at least a minimum number of samples (node size). The post-pruning step reduces the tree size based on the estimated classification errors.
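The boosting idea described above (each new model concentrates on the samples the previous one got wrong, and the models then vote) can be sketched as follows. This is an AdaBoost-style illustration under simplifying assumptions, not the procedure used inside C5.0: it handles two classes coded as +1 and -1, a single numeric attribute and one-split "stump" models, and the votes are weighted by each stump's accuracy.

```python
import math


def train_stump(xs, ys, weights):
    """Find the threshold/polarity stump with the lowest weighted error on 1-D data."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def stump_predict(x, threshold, polarity):
    return polarity if x >= threshold else -polarity


def boost(xs, ys, trials=10):
    """Each trial reweights the samples so the next stump focuses on earlier mistakes."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(trials):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, then renormalise.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On the toy sequence xs = [1, 2, 3, 4, 5, 6] with labels [1, 1, -1, -1, 1, 1], no single stump can classify every sample, whereas boost(xs, ys, trials=3) combines three stumps that together do.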

Multilayer perceptron neural network: Artificial Neural Networks (ANNs) are normally known as biologically motivated and highly sophisticated analytical techniques. They are capable of modelling extremely complex non-linear functions. Formally defined, ANNs are analytic techniques modelled after the processes of learning in the cognitive system and the neurological functions of the brain, and they are capable of predicting new patterns (on specific attributes) from other patterns (on the same or other attributes) after executing a process of so-called learning from existing data (Haykin, 2009). The Multilayer Perceptron Neural Network (MLPNN) with back-propagation is the most popular ANN architecture. MLPNN is known to be a powerful function approximator for prediction and classification problems. Its structure is organized into layers of neurons: input, hidden and output layers. There is at least one hidden layer, where the actual computations of the network are processed. Each neuron $j$ in the hidden layer sums its input attributes $x_i$ after multiplying them by the respective connection weights $w_{ij}$ and computes its output $y_j$ by applying an Activation Function (AF) to this sum. The AF may be a simple threshold function, or a sigmoidal, hyperbolic tangent or radial basis function (Eq. 3):

$$y_j = f\left(\sum_i w_{ij} x_i\right) \tag{3}$$

where $f$ is the activation function. Back-Propagation (BP) is a common training technique for MLPNN. BP works by presenting each input pattern to the network, where the estimated output is computed by performing weighted sums and transfer functions. The sum of squared differences between the desired and estimated values of the output neurons, $E$, is defined as:

$$E = \sum_j \left(y_{dj} - y_j\right)^2 \tag{4}$$

where $y_{dj}$ is the desired value of output neuron $j$ and $y_j$ is the estimated output of that neuron. Each weight $w_{ij}$ in Eq. 3 is adjusted to reduce the error $E$ of Eq. 4 as quickly as possible.
BP applies a weight correction to reduce the difference between the network's estimated outputs and the desired ones; i.e., the neural network can learn and can thus reduce future errors (Quinlan, 1993; Berry and Linoff, 1997). Figure 1 shows the architecture of the three-layer perceptron neural network used for the diagnosis of steel plates faults. Although BP is standard, simple to implement and generally works well, it converges slowly and can get stuck in local minima (Haykin, 1994). Another drawback of MLPNN models is that they require the initialization and adjustment of many individual parameters to optimize their performance. In this study, the network is trained using the pruning approach to find the optimal network structure (Chaudhuri and Bhattacharya, 2000; Thimm et al., 1996). Training starts with a large network and removes (prunes) the weakest neurons in the hidden and input layers as it proceeds.
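The forward computation of each neuron and the back-propagation weight correction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Clementine implementation: one hidden layer, sigmoid activations everywhere, the error taken as the plain sum of squared differences, a fixed learning rate (the hypothetical parameter rate) and updates for a single training pattern.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Apply y_j = f(sum_i w_ij * x_i) at the hidden layer and again at the output layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output


def squared_error(desired, output):
    """E = sum_j (y_dj - y_j)^2 over the output neurons."""
    return sum((d - y) ** 2 for d, y in zip(desired, output))


def backprop_step(x, desired, w_hidden, w_out, rate=0.5):
    """One gradient-descent update of all weights (in place) for one training pattern."""
    hidden, output = forward(x, w_hidden, w_out)
    # dE/dz at each output neuron for the squared error with a sigmoid activation.
    delta_out = [-2 * (d - y) * y * (1 - y) for d, y in zip(desired, output)]
    # Propagate the output deltas back through w_out to the hidden neurons.
    delta_hidden = [h * (1 - h) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
                    for j, h in enumerate(hidden)]
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= rate * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= rate * delta_hidden[j] * x[i]


random.seed(0)
x = [0.2, 0.7, 1.0]                       # toy inputs; the last one acts as a bias
desired = [1.0, 0.0]
w_hidden = [[random.uniform(-0.5, 0.5) for _ in x] for _ in range(4)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
error_before = squared_error(desired, forward(x, w_hidden, w_out)[1])
for _ in range(200):
    backprop_step(x, desired, w_hidden, w_out)
error_after = squared_error(desired, forward(x, w_hidden, w_out)[1])
```

Repeating the update on the same pattern drives error_after below error_before; a real training run would cycle over all training patterns and, as noted above, may still settle in a local minimum.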
Logistic regression: Logistic Regression (LR) is a generalization of linear regression (Maalouf, 2011). It is a nonlinear regression technique for predicting a dichotomous (binary) class attribute in terms of the predictive ones. For example, in a credit-scoring application the class attribute represents the status of the consumer (creditworthy, y = 1, or not creditworthy, y = 0). The algorithm does not predict the class attribute directly but predicts the odds of its occurrence. The expected probability of a positive outcome P(y=1) for the class attribute is modeled as follows:

$$P(y=1) = \frac{1}{1 + e^{-\left(B_0 + \sum_{i=1}^{n} B_i x_i\right)}} \tag{5}$$

where $x_i$, $i = 1, \ldots, n$, are the predictive attributes with real values, $B_i$ are the corresponding regression coefficients and $B_0$ is a constant, all of which contribute to the probability.
While LR is a very powerful modeling tool, it assumes that the class attribute (the log odds, not the event itself) is linear in the coefficients of the predictive attributes (Bewick et al., 2005). Eq. 5 reduces to a linear regression model for the logarithm of the odds of a positive outcome, i.e.:

$$\ln\left(\frac{P(y=1)}{1 - P(y=1)}\right) = B_0 + \sum_{i=1}^{n} B_i x_i \tag{6}$$

However, the right inputs must be chosen, together with their functional relationship to the class attribute.
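The relationship between the logistic probability and its linear log-odds form can be checked numerically. The sketch below is illustrative only; the coefficient and attribute values are hypothetical.

```python
import math


def logistic_probability(x, coefficients, intercept):
    """P(y=1) = 1 / (1 + exp(-(B0 + sum_i B_i * x_i)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """ln(p / (1 - p)), the quantity that is linear in the coefficients."""
    return math.log(p / (1.0 - p))


# Hypothetical model with two predictive attributes.
p = logistic_probability([2.0, -1.0], coefficients=[0.8, 0.5], intercept=-0.3)
linear_part = -0.3 + 0.8 * 2.0 + 0.5 * (-1.0)   # equals log_odds(p) up to rounding
```

Taking the log odds of the predicted probability recovers the linear combination of the attributes, which is why the coefficients can be read as changes in log odds per unit change of an attribute.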

RESULTS
The classification performance of each model is evaluated using three statistical measures: classification accuracy, sensitivity and specificity, which are defined as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{8}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{9}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively. Sensitivity in Eq. 8 measures the proportion of actual positives that are correctly identified as such, while specificity in Eq. 9 measures the proportion of negatives that are correctly identified. Finally, accuracy in Eq. 7 is the proportion of true results, either true positive or true negative, in a population. It measures the degree of veracity of a diagnostic test on a condition. Fig. 2 shows the component nodes of the proposed stream. This stream is implemented in the SPSS Clementine data mining workbench on an Intel Core 2 Duo CPU at 2.1 GHz. Clementine uses a client/server architecture to distribute requests for resource-intensive operations to powerful server software, resulting in faster performance on larger datasets. The software offers many modeling techniques, such as prediction, classification, segmentation and association detection algorithms.
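For a multi-class problem such as this one, the three measures can be computed per fault type by treating one class as positive and all others as negative. The sketch below assumes a confusion matrix whose rows are actual classes and whose columns are predicted classes; the 3-class matrix is hypothetical and is not data from this study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from the four outcome counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def one_vs_rest_counts(matrix, k):
    """Treat class k as positive and every other class as negative.

    Assumes matrix[i][j] counts samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
matrix = [
    [50, 2, 3],
    [4, 40, 6],
    [1, 5, 89],
]
class0 = binary_metrics(*one_vs_rest_counts(matrix, 0))
overall_accuracy = sum(matrix[i][i] for i in range(3)) / sum(map(sum, matrix))
```

Here class 0 has 50 true positives, 5 false negatives, 5 false positives and 140 true negatives, while the overall accuracy is the share of samples on the diagonal.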
The Faults dataset node is connected directly to an Excel file that contains the source data. The dataset includes 1941 instances, each labeled with one of the 7 fault types: Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps and Other_Faults. Each instance of the dataset has 27 independent variables and one fault type. The dataset is explored for incorrect, inconsistent or missing data. The predictive attributes are of mixed numeric types, flag or range, while the target class is of nominal type. The Type node specifies the field metadata and properties that are important for modeling and other work in Clementine. These properties include specifying a usage type, setting options for handling missing values and setting the role of an attribute for modeling purposes (input or output). The first 27 attributes in Table 1 are defined as input (predictive) attributes and the fault type is defined as the target class. The Partition node is used to generate a partition field that splits the dataset into separate subsets for training and testing the models in the ratio 70:30% respectively. The training subset is used to estimate the model parameters, while the test subset is used to independently assess the individual model. These models are then applied to the entire dataset and to any new data. The Logistic node is trained using the forward method, in which the model is built step by step. With this method, the initial model is the simplest one, containing only the constant, and terms can be added to the model as in Eq. 6. At each step, attributes not yet in the model are tested based on how much they would improve the model, and the best of those attributes is added to the model. When no more terms can be added, or the best candidate term does not produce a large enough improvement in the model, the final model is generated. Two minutes and twenty-four seconds are required to build this model for the steel plates faults dataset.
The LR model achieved classification accuracies of 73.26% on the training samples and 72.59% on the test samples, the lowest among the three models. The C5.0 node is a boosted decision tree model with the C5.0 algorithm, which is trained using the pruning and winnowing methods to increase the model accuracy. The number of trials of the boosting algorithm is 10, the minimum number of samples per node is set to 2 and the system uses equal misclassification costs. High speed is a notable feature of the C5.0 DT model; it clearly uses a special technique, although this has not been described in the open literature. The classification accuracies without boosting are 90.57% for the training subset and 90.57% for the test subset, while the accuracies with boosting are 97.25 and 98.09% respectively. Boosting with 10 trials thus raises the accuracy of the tree for both training and test samples. The time required to build a single C5.0 tree is below one second, while the boosted tree with ten trials requires 11 seconds. This model is the best one among the probability estimation classifiers. The MLPNN classifier node is trained using the pruning method (Thimm et al., 1996). It begins with a large network and removes the weakest neurons in the hidden and input layers as training proceeds. The stopping criterion is set based on time; the network is given two minutes for training. Using the steel plates faults dataset, the MLPNN with the pruning method achieved 74.79 and 79.14% classification accuracies for the training and test datasets. The resulting structure consists of four layers: one input layer, two hidden layers and one output layer, with 6, 20, 15 and 7 neurons respectively. Filter, Analysis and Evaluation nodes are used to select and rename the classifier outputs in order to compute the statistical performance measures and to graph the evaluation charts. The steel plates faults dataset is divided for training and testing the models in the ratio 70:30% respectively.
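The 70:30 partition and the forward stepwise procedure described in this section can be sketched as follows. This is a simplified illustration, not the behaviour of the Clementine nodes: the split is an exact shuffled cut with a fixed seed, and the function score is a hypothetical stand-in for whatever statistical criterion measures how much a candidate attribute improves the model.

```python
import random


def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and test subsets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def forward_select(candidates, score, min_improvement=0.01):
    """Greedy forward selection: start empty, add the best attribute at each step,
    and stop when the best remaining candidate improves the score too little."""
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining:
        gains = [(score(selected + [c]) - best_score, c) for c in remaining]
        gain, choice = max(gains)
        if gain < min_improvement:
            break
        selected.append(choice)
        remaining.remove(choice)
        best_score += gain
    return selected


# 1941 record indices stand in for the dataset instances.
train, test = partition(list(range(1941)))

# Hypothetical additive score: each attribute contributes a fixed improvement.
value = {"a": 0.30, "b": 0.10, "c": 0.001}
chosen = forward_select(["a", "b", "c"], lambda attrs: sum(value[v] for v in attrs))
```

With 1941 instances the cut yields 1359 training and 582 test records, and the toy selection keeps attributes "a" and "b" while rejecting "c" because its improvement falls below the threshold.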
Table 2: Confusion matrices of the models (partial; the beginning of the table, including the model name and the label of the first surviving row, is missing; column order is assumed to follow the row order)

                  Pastry  Z_Scratch  K_Scatch  Stains  Dirtiness  Bumps  Other_Faults
(label missing)        0          0         0       0         16      4             3
Bumps                  3          4         1       0          4    156            42
Other_Faults          12         14         2       0          2     44           275

LOG
Pastry                53         11         0       3          4      4             7
Z_Scratch              0         85         1       0          1      2             5
K_Scatch               1          1       173       7          1      4             5
Stains                 0          0         0      46          0      1             0
Dirtiness              1          0         0       0         19      2             1
Bumps                 13         26         1      12          8    112            38
Other_Faults          38         40         6      43         24     58           140

The training set is used to estimate the model parameters, while the test set is used to independently assess the individual model. These models are then applied to the entire dataset and to any new data. The time required to build each model with the dataset varies, ranging from a few seconds up to two minutes for the neural network. In the C5.0 DT model, boosting can significantly improve the accuracy of the model, but it also requires longer training. It works by building multiple models in a sequence. Cases are classified by applying the whole set of models to them and using a voting procedure to combine the separate predictions into one overall prediction. The predictions of all models are compared to the original classes to identify the numbers of true positives, true negatives, false positives and false negatives. These values have been used to construct the confusion matrices shown in Table 2. The values of the statistical measures (sensitivity, specificity and total classification accuracy) of the three models were computed and are presented in Tables 3 and 4. Sensitivity and specificity approximate the probability of the positive and negative labels being true. They assess the usefulness of the algorithm on a single model. Using the results shown in Table 2, it can be seen that the C5.0 DT model achieved the best sensitivity, specificity and classification accuracy on the test samples.
The evaluation charts show that the decision tree with the C5.0 learning algorithm is the best model for both the training and test subsets. The neural network model is the second best classifier and logistic regression is the worst one. Sensitivity analysis helps to gain some insight into the predictive attributes used in the present classification problem. The analysis provides information about the relative importance of the predictive (input) attributes in predicting the output attribute(s). The basic idea is that the inputs to the classifier are perturbed slightly and the corresponding change in the output is reported as a percentage change in the output (Principe et al., 2000). The first input is varied between its mean plus and minus a user-defined number of standard deviations, while all other inputs are fixed at their respective means. The classifier output is computed and recorded as the percent change above and below the mean of that output channel. This process is repeated for every input attribute. The sensitivity analysis performed for this study is presented in graphical format in Fig. 4.
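The perturbation procedure behind the sensitivity analysis can be sketched as follows. This is a minimal illustration under stated assumptions: model is any hypothetical callable that maps a list of input values to a single numeric output, the baseline output at the input means is non-zero, and n_std is the user-defined number of standard deviations.

```python
import statistics


def sensitivity_analysis(model, data, n_std=1.0):
    """For each input, report the percent change in the model output when that input
    moves to mean + n_std*std and mean - n_std*std while the others stay at their means."""
    columns = list(zip(*data))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    baseline = model(means)               # output with every input at its mean
    result = []
    for i in range(len(means)):
        changes = []
        for sign in (+1, -1):
            probe = list(means)
            probe[i] = means[i] + sign * n_std * stds[i]
            changes.append(100.0 * (model(probe) - baseline) / baseline)
        result.append(tuple(changes))
    return result


# Toy model: the output depends on the first input only.
toy_model = lambda row: 2 * row[0] + 0 * row[1] + 10
toy_data = [[1, 5], [2, 7], [3, 9]]
report = sensitivity_analysis(toy_model, toy_data)
```

In the toy case the first input changes the output by about 14.3% in each direction and the second input by 0%, which is the kind of ranking a sensitivity chart summarises across attributes.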

DISCUSSION
Experimental results have demonstrated that advanced data mining techniques can be used to develop models that possess a high degree of diagnostic accuracy. However, several issues involved in data collection and data mining warrant further discussion. The amount and quality of the data are key components of the diagnostic accuracy. The measuring process may contain many features that create problems for data mining techniques. The datasets may consist of a large volume of heterogeneous data fields, which usually complicates the use of data mining techniques. The main criticism of data mining techniques is that they do not follow all of the requirements of classical statistics. They use training and testing sets drawn from the same dataset. From the viewpoint of classical statistics, it can be argued that the testing set used in this case is not truly independent and that, for that reason, the results may be biased.
Despite these criticisms, data mining can be an important tool in the fault diagnosis problem. By identifying patterns within large amounts of data, data mining can, and should, be used to gain more insight into the faults and to generate knowledge that can potentially lead to further research in many areas of manufacturing. The high degree of diagnostic accuracy of the models evaluated here is just one example of the value of data mining.

CONCLUSION
The objective of this study was to demonstrate the application of classification techniques to the problem of steel plates fault diagnosis and to describe the strengths and weaknesses of the methods described. Three different classification models have been evaluated. These models come from different families, namely the Multi-Layer Perceptron Neural Network (MLPNN), the C5.0 Decision Tree (DT) and Logistic Regression (LR), and they are optimized using different methods. The pruning method was used to find the optimal structure of the MLPNN model. The C5.0 DT was built using the boosting algorithm with 10 trials. Finally, the LR model was constructed using the forward stepwise method to build the system gradually. The performance of these models was investigated using known sets of steel plates fault features obtained from the University of California at Irvine (UCI) machine learning repository. Experimental results have shown that the C5.0 decision tree gives better performance than the other models. Furthermore, the boosting algorithm enhances the performance of the C5.0 DT algorithm. Although data mining techniques are capable of extracting patterns and relationships hidden deep in large datasets, their results are useless without the cooperation and feedback of experts and professionals. The patterns found via data mining techniques should be evaluated by professionals who have years of experience in predicting steel plates faults.