Performance Analysis of Data Mining Tools Cumulating with a Proposed Data Mining Middleware

Data mining has become increasingly popular for revealing important knowledge in an organization's databases, and this has led to the emergence of a variety of data mining tools that support decision making. The present study describes a test bed used to investigate five major data mining tools, namely IBM Intelligent Miner, SPSS Clementine, SAS Enterprise Miner, Oracle Data Miner and Microsoft Business Intelligence Development Studio. The study focuses on the performance of these tools. The results provide a review of the tools and lead to a proposed data mining middleware that adopts their strengths.


INTRODUCTION
In today's information age, staying competitive in the market requires a powerful analytic solution that helps extract useful information from the large amount of data collected and stored in an organization's databases or repositories. This has led to the emergence of Knowledge Discovery in Databases (KDD), which is responsible for transforming low-level data into high-level knowledge for decision support. According to Fayyad et al. [12], "Knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [13]. The knowledge discovery process consists of an iterative sequence of steps, and data mining is one of them. Data mining is the application of algorithms for extracting patterns from data [13]. These extracted patterns provide useful knowledge to decision makers. As such, there has been an increasing demand for data mining tools that help organizations uncover knowledge to guide decision making. Each tool offers a wide range of functionalities to users. The functionalities provided are important in a data mining tool; however, the performance of the tool is equally necessary. Therefore, in this research, we selected five widely used data mining tools and studied their performance.

MATERIALS AND METHODS
Motivation: Successful implementation of a data mining effort requires a careful assessment of the various tools and algorithms available. Frequent pattern mining forms a core component in mining associations, correlations, sequential patterns, partial periodicity, and so forth, which are of great potential value in applications. Many methods have been proposed and developed for efficient frequent pattern mining in various kinds of databases, including transaction databases, time-series databases, and so forth. Although the tools offer a wide range of features, their performance should not be overlooked. In today's information age, companies no longer deal with megabytes of data but with gigabytes and terabytes. According to the KDNuggets polls for 2006, the median size of the largest database used for data mining was between 1.1 and 10 gigabytes, and 12% of respondents reported mining terabyte-size databases [27]. A data mining tool should therefore scale so that massive data sets can be analyzed. As data sets increase in size, data mining tools become less and less efficient. Scalability is thus an important feature of a data mining tool, which motivates us to study the performance of these tools.

Analysis of data mining tools:
In the present study, we identified five major data mining tools that we perceive as market leaders. We study the performance of each tool based on six attributes that we introduce. Each attribute is associated with metrics that we regard as important measures of that attribute. A known limitation of our study is that it covers only a subset of the data mining tools currently available. However, the selected tools are either market leaders or major data mining tools and are deemed adequate to represent the current state of data mining technology.
Selected data mining tools: There are numerous data mining tools available in the market. According to the META Group's META spectrum report for data mining, a few tools are cited as leaders in the data mining market, namely SPSS Clementine, Oracle Data Mining, and SAS Enterprise Miner. We therefore limit the scope of our study to these market leaders. In addition, we include IBM Intelligent Miner and Microsoft Business Intelligence Development Studio, which we consider major data mining tools. The selected data mining tools are shown in Table 1.

Data mining performance measures:
We identified several recommended metrics that we regard as significant for determining the performance attributes of the five selected data mining tools: IBM Intelligent Miner, SPSS Clementine, SAS Institute Enterprise Miner, Oracle Data Miner, and Microsoft Business Intelligence Development Studio. These metrics are measured on the application tier and are therefore also referred to as application tier metrics. Table 2 shows the attributes measured and the associated metrics as included in [31]. Specific thresholds are used as a baseline for measuring the performance attributes of the five tools. From Table 2 we selected a subset of metrics to measure the performance of these tools; Table 3 lists them together with their descriptions. These metrics were chosen because our focus is on increasing data mining performance by maximizing memory usage and reducing I/O.

Analysis strategy: Test bed:
To facilitate an analytical comparison of different frequent pattern mining methods based on the performance metrics listed in Table 3, an open test bed was constructed. It is used to study the performance of the data mining tools on the Adventure Works database [25] with the different types of algorithms that the tools support, as depicted in Fig. 1.
The test bed includes a synthetic data generator, which is used to generate large sets of synthetic data in various kinds of distributions. The data are mined by various data mining methods, using the different types of algorithms supported by IBM Intelligent Miner [22], SPSS Clementine [26], SAS Institute Enterprise Miner [24], Oracle Data Miner [23] and Microsoft Business Intelligence Development Studio [25]. We test the performance of the tools on several types of data mining algorithms, namely classification, regression, segmentation, association and sequential analysis algorithms. Configuration and selection of the data mining algorithms are done during Test Administration. The data mining process is carried out by the selected algorithms, and the result is stored either in the Results Database or in a file, depending on the nature of the data mining tool.
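As an illustration of what such a generator might look like, the following is a minimal sketch and not the generator used in the test bed. The function name, the two distribution names, the basket-size range and the fixed seed are all assumptions introduced here for the example.

```python
import random

def generate_transactions(n_rows, n_items, distribution="uniform", seed=0):
    """Return n_rows synthetic transactions, each a sorted list of item ids.

    Hypothetical helper: names, distributions and basket sizes are
    illustrative assumptions, not details taken from the study.
    """
    rng = random.Random(seed)  # fixed seed so every tool sees identical data
    items = list(range(n_items))
    if distribution == "uniform":
        weights = [1.0] * n_items
    elif distribution == "skewed":
        # Zipf-like weights: item k is drawn with weight 1 / (k + 1)
        weights = [1.0 / (k + 1) for k in items]
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    transactions = []
    for _ in range(n_rows):
        basket_size = rng.randint(1, 5)
        # Duplicates drawn by choices() collapse in the set, so a basket
        # holds between 1 and basket_size distinct items.
        basket = set(rng.choices(items, weights=weights, k=basket_size))
        transactions.append(sorted(basket))
    return transactions

uniform_data = generate_transactions(1000, 50, "uniform", seed=1)
skewed_data = generate_transactions(1000, 50, "skewed", seed=1)
```

Seeding the generator is one way to make sure each tool is given the same input, which matters when results are compared across tools.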
Adventure Works Cycles, the fictitious company on which the Adventure Works database is based [25], is a large multinational manufacturing company. The company manufactures metal and composite bicycles and sells them to North American, European and Asian commercial markets. Throughout the analysis in the test bed, we used Windows Performance Monitor to monitor each data mining tool against the performance measures defined in Table 3.
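Counter samples collected in this way can be summarized per counter once they are exported to a text log. The sketch below assumes a CSV export with a timestamp in the first column and one column per counter; the column names and the sample values are made up for illustration and are not measurements from this study.

```python
import csv
import io

def average_counters(csv_text):
    """Average each counter column of a Performance Monitor style CSV export."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sums = [0.0] * len(header)
    counts = [0] * len(header)
    for row in reader:
        for i, cell in enumerate(row[1:], start=1):  # column 0 is the timestamp
            cell = cell.strip()
            if not cell:
                continue  # skip samples the monitor could not collect
            sums[i] += float(cell)
            counts[i] += 1
    return {header[i]: sums[i] / counts[i]
            for i in range(1, len(header)) if counts[i]}

# Illustrative log only: three samples, one of them with a missing value.
SAMPLE = r"""Time,PhysicalDisk\% Disk Time,Memory\Available Bytes
10:00:01,30,104857600
10:00:02,50,
10:00:03,40,104857600
"""

averages = average_counters(SAMPLE)
```

Missing samples are skipped rather than counted as zero, so a counter that could not be read at one instant does not pull its average down.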

Analysis strategy:
Test environment: Each tool may behave differently. To standardize testing, we propose the fixed test environment shown in Tables 4 and 5. Table 6 and Fig. 2-6 show the test results for each data mining tool mentioned above. We measured the performance of five types of algorithms, namely classification, regression, segmentation, association, and sequential analysis algorithms, based on a subset of the metrics in Table 3. During testing, we encountered several exceptions that caused data mining computations to terminate unexpectedly. Such exceptions are denoted by an "E" in the table.
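The "E" convention can be pictured as a test loop that runs every tool and algorithm combination and records a marker when a run terminates unexpectedly. This is a minimal sketch under that assumption; `run_matrix`, `fake_run` and the tool names are hypothetical and do not describe how the tools were actually driven.

```python
def run_matrix(tools, algorithms, run):
    """Run every (tool, algorithm) pair; record "E" where the run raises."""
    results = {}
    for tool in tools:
        for algorithm in algorithms:
            try:
                results[(tool, algorithm)] = run(tool, algorithm)
            except Exception:
                # The computation terminated unexpectedly: mark the cell.
                results[(tool, algorithm)] = "E"
    return results

def fake_run(tool, algorithm):
    # Hypothetical stand-in for launching a real mining job.
    if (tool, algorithm) == ("Tool B", "association"):
        raise RuntimeError("mining terminated unexpectedly")
    return 42  # placeholder metric value, not a measured result

results = run_matrix(["Tool A", "Tool B"],
                     ["classification", "association"],
                     fake_run)
```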

RESULTS AND DISCUSSION
Test results comparison: Memory access times are measured in nanoseconds, whereas disk access times are measured in milliseconds. The difference is a factor of a million, which means that disk access is about a million times slower than memory access. As such, we believe that data mining should be performed in memory. Based on our study, tools such as Microsoft Business Intelligence Development Studio and IBM Intelligent Miner consume a large amount of both memory and disk on the application tier. These tools distribute part of the data mining load to the application tier. Such a strategy shifts computing cycles from the backend systems to the application systems. However, it might lead to disk and memory problems on the application tier. With reference to Table 6 and the given charts, there may be slight variations in the results collected for the metrics across different algorithms, such as sequence analysis, association and segmentation algorithms. A possible explanation is that some tools perform better on particular algorithms and therefore show better performance attributes on them. For example, Microsoft Business Intelligence Development Studio (90% physical disk\disk time and 80% logical disk\disk time) consumes a large amount of disk on classification algorithms compared to SPSS Clementine (30% physical disk\disk time and 30% logical disk\disk time), but only a small amount of disk on segmentation algorithms (10% physical disk\disk time and 10% logical disk\disk time) compared to SPSS Clementine (40% physical disk\disk time and 50% logical disk\disk time). Such a result is likely caused by each vendor's implementation strategy for the algorithms. For example, SPSS might be more efficient than Microsoft in implementing classification algorithms.
On the other hand, Microsoft might be more efficient than IBM in implementing segmentation algorithms. In short, the implementation strategy of the algorithms indirectly affects the performance of data mining tools. A recommended solution is to allocate, where possible, the major share of data mining computations to the memory level, minimizing disk activity and maximizing memory activity.
Our study also suggests that data mining at the memory level leads to better performance. For example, IBM Intelligent Miner consumes only 15% of physical disk\disk time and 100 MB of memory\available bytes. This indicates a trade-off between memory and disk: the more time is spent at the memory level, the less time is spent on disk activity (also referred to as disk I/O). Disk I/O is often a major bottleneck to data mining performance.
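The effect of the million-fold gap between memory and disk access can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes round figures of 1 nanosecond per memory access and 1 millisecond per disk access; these are order-of-magnitude assumptions for illustration, not measured values.

```python
MEMORY_ACCESS_NS = 1          # assumed: about 1 nanosecond per memory access
DISK_ACCESS_NS = 1_000_000    # assumed: about 1 millisecond per disk access

def estimated_access_time_ns(memory_accesses, disk_accesses):
    """Estimated total access time in nanoseconds for a mix of accesses."""
    return (memory_accesses * MEMORY_ACCESS_NS
            + disk_accesses * DISK_ACCESS_NS)

# One million accesses served entirely from memory: 1,000,000 ns (1 ms).
all_in_memory = estimated_access_time_ns(1_000_000, 0)

# The same workload with just 1% of accesses going to disk:
# 990,000 ns + 10,000,000,000 ns, which is roughly 10 seconds.
one_percent_on_disk = estimated_access_time_ns(990_000, 10_000)
```

Under these assumptions, moving only 1% of the accesses to disk makes the workload about ten thousand times slower, which is why even a small amount of disk I/O can dominate the running time.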
Proposed architecture of server-based data mining middleware: The test results show that the strategies used by the data mining tools have both strengths and weaknesses. The proposed data mining middleware adopts the strengths of the dominant data mining tools and couples them with its own added features.
With reference to the test results in Table 6 and Fig. 2-6, we noticed an inverse relationship between memory and disk activity: if the memory activity is high, the disk activity is low, and vice versa. This relationship is consistent with computations at the memory level being faster than computations at the disk level. It leads to the proposed data mining middleware, namely the Java-Based Data Mining Middleware (also referred to as JDMM). JDMM will use memory extensively for data mining computations, as we believe that memory handles computations more efficiently than disk, which requires input/output. The objective is to minimize disk usage and to maximize memory usage. Algorithms are plugged into JDMM, so the number of algorithms it supports is scalable. If a specific algorithm is identified as a potential bottleneck, we can easily remove its plugin. Alternatively, JDMM can tackle such inefficiency through its data mining repository, which is memory-based: we transfer the computations of the inefficient algorithms to the memory repository and let the repository handle the performance issue.
In addition, the proposed JDMM will be able to handle multiple data sources. JDMM is a server-centric middleware. It is platform-, data source- and data mining technique-independent, and it is accessible from front-, back- and web-office environments. The high-level architecture of JDMM is depicted in Fig. 7. JDMM will allow users to mine data from multiple data sources, ranging from relational databases and object-oriented databases to flat files and so forth. This is possible through the Inbound thread, where different adapters can be plugged into JDMM. These adapters allow users to connect to different data sources. The Java-based Data Miner (JDM) will be the core engine that performs data mining operations on the incoming uninterpreted data from the data sources.
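The plugin and caching ideas described above can be sketched in a few lines. The example is written in Python for brevity, although JDMM itself is proposed as Java-based, and it is an illustration only: the class names, method names, the sample adapter and the sample algorithm are all hypothetical and are not part of the JDMM design.

```python
class MemoryRepository:
    """In-memory cache standing in for the memory-based repository."""

    def __init__(self):
        self._cache = {}
        self.source_reads = 0  # how often a data source was actually read

    def load(self, name, adapter):
        if name not in self._cache:
            self._cache[name] = list(adapter())  # single read from the source
            self.source_reads += 1
        return self._cache[name]


class Middleware:
    """Hypothetical JDMM-like core with pluggable adapters and algorithms."""

    def __init__(self):
        self.adapters = {}
        self.algorithms = {}
        self.repository = MemoryRepository()

    def register_adapter(self, name, adapter):
        self.adapters[name] = adapter

    def register_algorithm(self, name, algorithm):
        self.algorithms[name] = algorithm

    def remove_algorithm(self, name):
        # Unplug an algorithm, for example one identified as a bottleneck.
        self.algorithms.pop(name, None)

    def mine(self, source, algorithm):
        rows = self.repository.load(source, self.adapters[source])
        return self.algorithms[algorithm](rows)


def flat_file_adapter():
    # Stand-in for a flat-file source; a real adapter would read from disk.
    return [["bread", "milk"], ["bread", "butter"], ["milk"]]


def item_counts(rows):
    # Toy "algorithm": count how often each item occurs.
    counts = {}
    for row in rows:
        for item in row:
            counts[item] = counts.get(item, 0) + 1
    return counts


jdmm = Middleware()
jdmm.register_adapter("sales", flat_file_adapter)
jdmm.register_algorithm("item_counts", item_counts)
first = jdmm.mine("sales", "item_counts")
second = jdmm.mine("sales", "item_counts")  # served from the memory cache
```

In this sketch the second call does not touch the data source again, which mirrors the stated aim of keeping repeated mining work in memory rather than on disk.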
The Data Mining Repository is used to persist mined objects. The sole objective of the memory repository is to reduce I/O while mining the data sources. The majority of the data mining processes within JDMM are performed on cached data from the memory repository. Lastly, the results are delivered to users through the Outbound thread, where the mined data are formatted as PDF, XLS, XML, text or HTML files, or in any proprietary format incorporated into JDMM.

CONCLUSION
We believe that, in the near future, most data mining products will use memory extensively during data mining. With 64-bit computing, we believe that memory limitation is no longer an issue. In addition, the continual reduction in the cost of memory will benefit 64-bit computing. 64-bit computing has gradually moved down to the desktop personal computer, which means that, in the near future, data mining products will no longer need a powerful dedicated server for small to medium-size data mining computations. Apart from 64-bit computing, future research on data mining products will need to be more comprehensive, covering attributes such as maintainability, adaptability, scalability, reliability and portability.