CLASSIFICATION MODEL FOR HOTSPOT OCCURRENCES USING SPATIAL DECISION TREE ALGORITHM

Developing a predictive model for forest fire occurrence is an important activity in a fire prevention program. The model describes characteristics of areas where fires occur based on past fire data. It is essential as an early warning system for preventing forest fires, so that major damage caused by fires can be avoided. This study describes the application of a data mining technique, namely the decision tree, to forest fire data. We improved the ID3 decision tree algorithm so that it can be used on spatial data to develop a classification model for hotspot occurrence. The ID3 algorithm, which was originally designed for non-spatial datasets, has been improved to construct a spatial decision tree from a spatial dataset containing discrete features (points, lines and polygons). Just as the ID3 algorithm uses information gain for attribute selection, the proposed algorithm uses spatial information gain to choose the best splitting layer from a set of explanatory layers. A new formula for spatial information gain is proposed using spatial measures for point, line and polygon features. The proposed algorithm has been applied to the forest fire dataset for Rokan Hilir district in Riau Province, Indonesia. The dataset contains physical, socio-economic and weather data as well as hotspot and non-hotspot occurrences as target objects. The result is a spatial decision tree with 276 leaves, with the distance from target objects to the nearest river as the first test layer, and an accuracy on the training set of 87.69%. The empirical result demonstrates that the proposed algorithm can be used to join two spatial objects in constructing a spatial decision tree from a spatial dataset. The algorithm yields a predictive model for hotspot occurrence from the real forest fire dataset with high accuracy on the training set.


INTRODUCTION
Forest fires in various parts of Sumatera and Kalimantan, Indonesia occur every year, especially in the dry season. This phenomenon causes many negative effects in various aspects of life such as the natural environment, the economy and health. Forest fire is considered a regional and global disaster because its effects influence not only people in local areas but also those who live in neighbouring countries such as Malaysia and Singapore. Fire prevention has an important role in minimizing the damage due to forest fires. An early warning system, as one of the activities in fire prevention, needs to be developed in order to assess forest fire risks.
Many studies have been conducted on developing wildfire risk models by integrating Geographical Information Systems (GISs) and remote sensing. Spatial operations in GISs have been widely used to analyse risk factors of fires and their relationships as well as to produce fire risk maps. The GIS-based method of Complete Mapping Analysis (CMA) is applied in (Boonyanuphap et al., 2001) to create the wildfire risk model for the area of Sasamba in East Kalimantan, Indonesia. A model of forest fire hazard in East Kalimantan, Indonesia using the remote sensing technique integrated with the GIS has been constructed in (Darmawan et al., 2001). In other studies, GISs and remote sensing are used to analyse forest fire data to create forest fire risk models for some regions in Indonesia (Hadi, 2006; Danan, 2008). The methods Complete Mapping Analysis (CMA) in (Hadi, 2006) and Multi-criteria Analysis (MCA) in (Danan, 2008) are also applied to analyse the fire risk factors and the relations between the factors. A GIS-based peat swamp forest fire hazard model that integrates the Analytical Hierarchy Process (AHP) and GIS analysis was developed for the region of Pekan in Pahang, Malaysia (Setiawan, 2007). The model is based on five parameters: fuel type, road proximity, elevation, slope and aspect, which influence the occurrence and spread of forest fires. A GIS is utilized in fire hazard modeling and mapping of fire hazard rating in the peat swamp forest of Penor/Kuantan District of Pahang, Malaysia (Sheriza, 2007). The data used in this work include fuel types, roads and canals.
Studies about forest fire risks may involve huge spatial datasets including physical, climate and socio-economic data. The data are stored in spatial databases in GISs. Spatial databases contain a large number of spatial features and their relationships for further manipulation and analysis to help users in the decision making process. Criteria evaluation and weighting methods, such as Complete Mapping Analysis (CMA) and Multi-criteria Analysis (MCA), are mostly applied to evaluate small problems containing few criteria. This situation has led to the increasing application of data mining techniques to extract interesting spatial patterns from large spatial data. Data mining tasks including association rule mining, classification and prediction, as well as cluster analysis have been successfully employed to analyse spatial data related to forest fires (Tay et al., 2003; Stojanova et al., 2007; Yu and Bian, 2007; Kalli and Ramakrishna, 2008; Hu et al., 2009).
Extracting interesting and useful patterns from spatial datasets is more difficult than extracting them from traditional numeric and categorical data because spatial data types are complex (Shekhar et al., 2004). Furthermore, extracting patterns from spatial datasets involves spatial relationships and spatial autocorrelation (Shekhar et al., 2004). Classical data mining methods do not support locations of objects or relationships between objects that implicitly exist in a spatial dataset (Zeitouni, 2000). Therefore these methods cannot be used to discover knowledge from spatial datasets. The locations of objects determine the relations of the objects to their neighbours. According to Koperski et al. (1998), there are three types of relation that relate an object to its neighbour, i.e., topological relations, metric relations and direction relations. In order to handle spatial data and the relations to neighbours that implicitly exist in the spatial data, new data mining methods need to be developed. One of the spatial data mining tasks that has been addressed in many studies is spatial classification. Spatial classification methods broaden non-spatial classification methods by involving attributes of neighbouring objects and their spatial relations in addition to attributes of the objects to be classified (Koperski et al., 1998; Ester et al., 1997). In a spatial classification task, we want to extract rules that split a spatial dataset consisting of classified objects into a number of classes based on non-spatial and spatial properties, as well as spatial relations of the classified objects to other objects (Koperski et al., 1998; Ester et al., 2000; Zeitouni and Chelghoum, 2001).
In this study, we extend the ID3 decision tree algorithm (Quinlan, 1986) for non-spatial datasets so that the algorithm can be applied to a spatial dataset containing point, line and polygon features. The proposed algorithm uses information gain for spatial data, namely spatial information gain, to choose a layer as the splitting layer. Instead of using the number of tuples in a partition, spatial information gain is calculated using spatial measures. Our study adopts the formula for spatial information gain as described in (Rinzivillo and Turini, 2004). The spatial measure formula is extended to the geometry types of points, lines and polygons rather than only polygons as in (Rinzivillo and Turini, 2004). The proposed algorithm was applied to the forest fire dataset containing physical, socio-economic and weather data as well as hotspot occurrences for Rokan Hilir district in Riau Province, Indonesia.

Dataset and Study Area
The dataset for modeling hotspot occurrence contains the spread and coordinates of hotspots in 2008 together with physical, socio-economic and weather data. The study area is Rokan Hilir district in Riau Province, Indonesia (Fig. 1). Hotspot and physical data were obtained from the Ministry of Environment, Indonesia and the National Land Agency (BPN) of Riau Province, respectively. Socio-economic data were collected from Statistics Indonesia (BPS). Weather data for 2008, including precipitation (mm/day), screen temperature (K), 10 m wind speed (m/s) and surface height (m), were gathered from the Meteorological, Climatological and Geophysical Agency (BMKG), Indonesia. There are two categories of data: spatial and non-spatial. Non-spatial data are the socio-economic data for villages in Rokan Hilir, which are represented in the DBF format. These data include population density, inhabitants' income source and number of schools. For mining with the spatial decision tree algorithm, the non-spatial data were converted to spatial data in the shp format by joining them to the shp files for the administrative borders of villages and subdistricts in Rokan Hilir. Spatial data include physical data (roads, rivers, land cover and city centers) and weather data (precipitation, screen temperature, 10 m wind speed and surface height). We assigned the spatial reference system UTM 47N and datum WGS84 to all data. Preprocessing was conducted on the hotspot occurrences to generate target objects, as well as on the physical and socio-economic data. Data preprocessing is important for improving the quality of the data, and thereby the accuracy of the resulting model and the efficiency of the data mining process. Moreover, we performed spatial interpolation using the Cokriging method for the weather data, which are originally represented in the NetCDF format.
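The conversion of non-spatial village records to spatial features can be pictured as an attribute join on a shared village identifier. The following is a minimal sketch under stated assumptions, not the procedure actually used in the study: the identifiers, field names and coordinates are hypothetical, and plain Python dictionaries stand in for the DBF table and the shp border file.

```python
# Hypothetical village borders: feature id -> polygon ring (stand-in for the shp file).
village_polygons = {
    "V01": [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)],
    "V02": [(4.0, 0.0), (8.0, 0.0), (8.0, 3.0), (4.0, 3.0)],
}

# Hypothetical socio-economic rows (stand-in for the DBF table).
socio_economic_rows = [
    {"village_id": "V01", "income_source": "plantation", "n_schools": 3},
    {"village_id": "V02", "income_source": "forestry", "n_schools": 1},
    {"village_id": "V99", "income_source": "trading", "n_schools": 2},  # no matching border
]


def attach_attributes(polygons, rows, key="village_id"):
    """Return one feature per polygon that has a matching attribute row."""
    by_key = {row[key]: row for row in rows}
    features = []
    for feature_id, ring in polygons.items():
        attrs = by_key.get(feature_id)
        if attrs is None:
            continue  # a border without socio-economic data is skipped in this sketch
        features.append({
            "id": feature_id,
            "geometry": ring,
            "attributes": {k: v for k, v in attrs.items() if k != key},
        })
    return features


features = attach_attributes(village_polygons, socio_economic_rows)
```

Rows without a matching border (here `V99`) are dropped, so each resulting polygon feature carries exactly one set of socio-economic attributes.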

Spatial Relationships
Determining spatial relationships between two features is a major function of a Geographical Information System (GIS). Spatial relationships include topological relationships (Egenhofer and Franzosa, 1991), such as overlap, touch and intersect, as well as metric relationships such as distance. For example, two different polygon features may overlap, touch or intersect each other. Spatial relationships make spatial data mining algorithms differ from non-spatial data mining algorithms. Spatial relationships are implemented by an extension of the well-known join indices (Valduriez, 1987). The result of a join index between two relations is a new relation consisting of pairs of indices, each referencing a tuple of one of the relations. The pairs of indices refer to objects that meet the join criterion. The Spatial Join Index (SJI) structure, an extension of the join indices (Valduriez, 1987) in the relational database framework, is introduced in (Zeitouni et al., 2000). Join indices can be handled in the same way as other tables and manipulated using the standardized SQL query language (Zeitouni et al., 2000). In addition to two columns of object identifiers, an SJI has a third column that contains the spatial relationship between two layers.
Our study adopts the concept of the SJI as in (Zeitouni et al., 2000) to store relations between two different layers in a spatial database. Rather than a spatial relationship that may be either a numerical or a Boolean value, the third column of the SJI in our study holds quantitative values: the spatial measures of the features that result from spatial relationships between two layers.
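As an illustration of this three-column structure, the sketch below builds a small join index whose third column is the distance between two point features. It is a simplified stand-in rather than the SJI implementation of Zeitouni et al.: the layer contents, the function name `spatial_join_index` and the optional `max_distance` join criterion are assumptions made for the example.

```python
import math

# Hypothetical point layers: feature id -> (x, y) coordinates in metres.
hotspots = {1: (0.0, 0.0), 2: (3.0, 4.0)}
city_centres = {10: (0.0, 5.0), 11: (6.0, 8.0)}


def spatial_join_index(layer_a, layer_b, max_distance=None):
    """Build rows (id_a, id_b, distance); the third column is the spatial measure."""
    rows = []
    for id_a, (xa, ya) in layer_a.items():
        for id_b, (xb, yb) in layer_b.items():
            d = math.hypot(xa - xb, ya - yb)
            # Keep the pair only if it meets the (optional) join criterion.
            if max_distance is None or d <= max_distance:
                rows.append((id_a, id_b, d))
    return rows


sji = spatial_join_index(hotspots, city_centres)
sji_near = spatial_join_index(hotspots, city_centres, max_distance=5.0)
```

Because the result is an ordinary list of rows, it can be filtered, grouped or stored like any other table, which mirrors the point that join indices are handled in the same way as other relations.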
We consider the input of the algorithm to be a spatial dataset given as a set of layers $L$. A spatial relationship between two layers yields quantitative values, such as the distance between two points or the intersection area of two polygons. We refer to these values as spatial measures, as in (Rinzivillo and Turini, 2004); they are used in calculating spatial information gain in the proposed algorithm. For the case of a topological relation, the spatial measure of a feature is defined as follows. Let $L_i$ and $L_j$ be layers in the set of layers $L$, with $L_i \neq L_j$. For each feature $r_i$ in $R = \mathrm{SpatRel}(L_i, L_j)$, the spatial measure of $r_i$, denoted by $\mathrm{SpatMes}(r_i)$, is defined as:

$$
\mathrm{SpatMes}(r_i) =
\begin{cases}
\text{Area of } r_i, & \text{if } \langle L_i \text{ in } L_j \rangle \text{ or } \langle L_i \text{ overlap } L_j \rangle \text{ holds and all features in } L_i \text{ and } L_j \text{ are polygons} \\
\text{Count of } r_i, & \text{if } \langle L_i \text{ in } L_j \rangle \text{ holds, all features in } L_i \text{ are points and all features in } L_j \text{ are polygons}
\end{cases}
$$

For the case of a metric relation, we define a distance function from $p$ to $q$ as $\mathrm{dist}(p, q)$, the distance from a point (or line) $p$ in $L_i$ to a point (or line) $q$ in $L_j$.
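The spatial measure of $R$ is denoted by $\mathrm{SpatMes}(R)$ and defined as Equation 1:

$$\mathrm{SpatMes}(R) = f\big(\mathrm{SpatMes}(r_1), \mathrm{SpatMes}(r_2), \ldots, \mathrm{SpatMes}(r_n)\big) \tag{1}$$

for $r_i$ in $R$, $i = 1, 2, \ldots, n$, where $n$ is the number of features in $R$ and $f$ is an aggregate function that may be sum, min, max or average.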
A spatial relationship applied to $L_i$ and $L_j$ in $L$ results in a new layer $R$. We define a Spatial Join Relation (SJR) for all features $p$ in $L_i$ and $q$ in $L_j$ as follows (Equation 2):

$$\mathrm{SJR} = \{(p, \mathrm{SpatMes}(r), q) \mid r \text{ is a feature in } R \text{ associated with } p \text{ and } q\} \tag{2}$$
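To make Equations 1 and 2 concrete, the sketch below stores an SJR as a list of triples (p, SpatMes(r), q) and applies an aggregate function f to the measures. The feature identifiers, the distances and the helper names `spat_mes` and `aggregate_per_target` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical SJR rows (p, SpatMes(r), q): target point p, distance in metres,
# explanatory feature q.
sjr_distance = [
    ("h1", 120.0, "river_a"),
    ("h1", 45.5, "river_b"),
    ("h2", 300.0, "river_a"),
    ("h2", 310.0, "river_b"),
]

# The aggregate functions f named in the text.
AGGREGATES = {
    "sum": sum,
    "min": min,
    "max": max,
    "average": lambda values: sum(values) / len(values),
}


def spat_mes(measures, f="sum"):
    """Equation 1: aggregate the measures of the features r_1..r_n with f."""
    return AGGREGATES[f](list(measures))


def aggregate_per_target(sjr, f):
    """Group SJR rows by target feature p and aggregate each group's measures."""
    groups = defaultdict(list)
    for p, measure, _q in sjr:
        groups[p].append(measure)
    return {p: spat_mes(values, f) for p, values in groups.items()}


nearest_river = aggregate_per_target(sjr_distance, "min")
```

With f = min, each target object keeps only the distance to its closest explanatory feature; with f = sum over unit measures, the same function yields a count.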

An Extended ID3 Algorithm for Spatial Data
A spatial dataset is composed of a set of layers in which all features in a layer have the same geometry type.
This study considers only discrete features including points, lines and polygons. There are two groups of layers: explanatory layers and one target layer (or reference layer), to which spatial relationships are applied to construct a set of tuples. The target layer has several attributes, including a target attribute that stores the target classes. Each explanatory layer also has several attributes. One of these is a predictive attribute that classifies tuples in the dataset into target classes. In this study the target attribute and the predictive attributes are categorical. Features (polygons, lines or points) in the target layer are related to features in the explanatory layers to create a set of tuples in which each value in a tuple corresponds to a value from one of these layers. Two distinct layers are associated using a spatial relationship to produce a new layer. A relation between two layers produces a spatial measure (Equation 1) for the new layer. Spatial measures are used in the formula for spatial information gain.
Building a spatial decision tree follows the basic learning process in the algorithm ID3 (Quinlan, 1986).
ID3 calculates information gain to determine the best splitting attribute for the dataset. In the spatial decision tree algorithm, we define spatial information gain to select the explanatory layer L that best splits the spatial dataset according to the values of the predictive attribute in the layer L. For this purpose, as in (Rinzivillo and Turini, 2004), we apply the spatial measure (Equation 1) to the formula.
Let a dataset $D$ be a training set of class-labelled tuples. In the non-spatial decision tree algorithm, the probability that an arbitrary tuple in $D$ belongs to class $C_i$ is estimated by $|C_{i,D}|/|D|$, where $|D|$ is the number of tuples in $D$ and $|C_{i,D}|$ is the number of tuples of class $C_i$ in $D$ (Han and Kamber, 2001). In this study, a dataset contains several layers including a target layer which stores class labels. All objects in the target layer are represented as points. The number of tuples in the dataset is the same as the number of objects in the target layer because each tuple is created by relating features in the target layer to features in explanatory layers. One feature in the target layer is associated with exactly one tuple in the dataset. For simplicity we use the number of objects in the target layer instead of the number of tuples in the spatial dataset in the formula for spatial entropy (Equation 3). Furthermore, in a non-spatial dataset, target classes are discrete-valued and unordered (categorical) and explanatory attributes are categorical or numerical. In a spatial dataset, features in layers are represented by a particular geometric type (polygons, lines or points) that has quantitative measurements such as area and distance. Therefore we calculate spatial measures of layers (Equation 1) to replace the number of tuples used in a non-spatial dataset.

Entropy
Let a target attribute $C$ in a target layer $S$ have $l$ distinct classes (i.e., $c_1, c_2, \ldots, c_l$). The entropy for $S$ represents the expected information needed to determine the class of tuples in the dataset and is defined as Equation 3:

$$H(S) = -\sum_{i=1}^{l} \frac{\mathrm{SpatMes}(S_{c_i})}{\mathrm{SpatMes}(S)} \log_2 \frac{\mathrm{SpatMes}(S_{c_i})}{\mathrm{SpatMes}(S)} \tag{3}$$

where $S_{c_i}$ is the set of features in $S$ that belong to class $c_i$ and $\mathrm{SpatMes}(S)$ represents the spatial measure of layer $S$ as defined in Equation 1.
Let an explanatory attribute $V$ in an explanatory (non-target) layer $L$ have $q$ distinct values (i.e., $v_1, v_2, \ldots, v_q$). We partition the objects in the target layer $S$ according to the layer $L$, which gives a set of layers $L(v_j, S)$, one for each possible value $v_j$ in $L$. In our study, we assume that the layer $L$ covers all areas in the layer $S$. The expected entropy value for splitting is given by Equation 4:

$$H(S \mid L) = \sum_{j=1}^{q} \frac{\mathrm{SpatMes}(L(v_j, S))}{\mathrm{SpatMes}(S)} \, H(L(v_j, S)) \tag{4}$$

$H(S \mid L)$ represents the amount of information needed (after the partitioning) in order to arrive at an exact classification.

Spatial Information Gain
The spatial information gain for the layer $L$ is given by Equation 5:

$$\mathrm{Gain}(L) = H(S) - H(S \mid L) \tag{5}$$

$\mathrm{Gain}(L)$ denotes how much information would be gained by branching on the layer $L$. The layer $L$ with the highest information gain, $\mathrm{Gain}(L)$, is chosen as the splitting layer at a node $N$. Objects in the dataset are partitioned according to the layer $L$ to give the "best classification", such that $H(S \mid L)$ is minimum.

Spatial Decision Tree Algorithm
Figure 2 shows our proposed algorithm to generate a Spatial Decision Tree (SDT) as discussed in our previous work (Sitanggang et al., 2011). The inputs of the algorithm are divided into two groups: (1) a set of layers containing some explanatory layers and one target layer that holds the class labels for tuples in the dataset and (2) Spatial Join Relations (SJRs) storing spatial measures for features resulting from spatial relations between two layers. The algorithm generates a tree by selecting the best layer to separate the dataset into smaller partitions that are as pure as possible, meaning that all tuples in a partition belong to the same class.

[Figure 2: the spatial decision tree algorithm. Only part of the figure text is recoverable. Input: (a) a spatial dataset $D$, which is a set of training tuples and their associated class labels, constructed from a set of layers $P$ using spatial relations; (b) a target layer $S \in P$ with a target attribute $C$; (c) a non-empty set of explanatory layers $\mathbf{L} \subseteq P$, where each $L \in \mathbf{L}$ has a predictive attribute $V$, and $P = S \cup \mathbf{L}$; (d) the Spatial Join Relation on the set of layers $P$, $\mathrm{SJR}(P)$, as defined in Equation 2. Last recoverable step: attach node $N_i$ to $N$ and label the edge with a selected value of the predictive attribute $V$ in $L^*$.]
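The entropy, expected entropy and gain computations, together with the recursive choice of a splitting layer, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: each target object is a dictionary that already holds its class, the value of each explanatory layer obtained from the spatial join, and a per-object `measure` (1 for a point counted under the relation "in"); all names and data are hypothetical.

```python
import math
from collections import defaultdict


def spat_mes(rows):
    """Spatial measure of a set of target objects (sum of per-object measures)."""
    return sum(r["measure"] for r in rows)


def class_measures(rows):
    """Spatial measure of each class within a set of target objects."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["class"]] += r["measure"]
    return totals


def entropy(rows):
    """Spatial entropy H(S): class shares are ratios of spatial measures."""
    total = spat_mes(rows)
    shares = [m / total for m in class_measures(rows).values() if m > 0]
    return -sum(p * math.log2(p) for p in shares)


def partition(rows, layer):
    """Split target objects by the value of the predictive attribute of a layer."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[layer]].append(r)
    return parts


def expected_entropy(rows, layer):
    """Expected entropy H(S|L) after partitioning by a layer."""
    total = spat_mes(rows)
    return sum(spat_mes(part) / total * entropy(part)
               for part in partition(rows, layer).values())


def gain(rows, layer):
    """Spatial information gain Gain(L) = H(S) - H(S|L)."""
    return entropy(rows) - expected_entropy(rows, layer)


def build_tree(rows, layers):
    """ID3-style recursion: split on the layer with the highest gain."""
    totals = class_measures(rows)
    if len(totals) == 1 or not layers:
        return max(totals, key=totals.get)  # leaf: the dominant class
    best = max(layers, key=lambda layer: gain(rows, layer))
    rest = [layer for layer in layers if layer != best]
    return {best: {value: build_tree(part, rest)
                   for value, part in partition(rows, best).items()}}


# Hypothetical target objects after the spatial join.
rows = [
    {"class": "hotspot", "river_dist": "near", "land_cover": "plantation", "measure": 1},
    {"class": "hotspot", "river_dist": "near", "land_cover": "forest", "measure": 1},
    {"class": "non-hotspot", "river_dist": "far", "land_cover": "plantation", "measure": 1},
    {"class": "non-hotspot", "river_dist": "far", "land_cover": "forest", "measure": 1},
]
tree = build_tree(rows, ["land_cover", "river_dist"])
```

In this toy dataset the `river_dist` layer separates the two classes completely, so it has the highest gain and becomes the root, while `land_cover` carries no information about the class.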

RESULTS
The proposed algorithm has been applied to the real active fires dataset for the Rokan Hilir District in Riau Province, Indonesia, with a total area of 8,881.59 km². The dataset contains ten explanatory layers and one target layer.
The target layer consists of active fires (hotspots) as true alarm data and non-hotspots as false alarm data. Table 1 summarises the number of features in the dataset for each layer. The spatial relationship In is applied to the target layer and the explanatory layers represented as polygons (land_cover, income_source, population, school, precipitation, screen_temp and wind_speed) to produce the spatial measure Count. Additionally, the spatial relationship Distance is calculated to relate the target layer to the river, road and city centre layers. We use the aggregate function Min to determine the distance from target objects to the nearest river, road and city centre. The aggregate function Sum is applied to extract all objects in the target layer which are located inside polygon features.
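The two kinds of measure used here, the minimum distance to a line feature and the count of target points inside a polygon, can be sketched with elementary geometry. The example below is an illustration only: it assumes planar coordinates (such as a projected UTM system), uses made-up geometries and helper names, and does not treat points lying exactly on a polygon boundary specially.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_nearest_line(p, lines):
    """Aggregate Min over all segments of all polylines (e.g. rivers)."""
    return min(point_segment_distance(p, a, b)
               for line in lines for a, b in zip(line, line[1:]))


def point_in_polygon(p, ring):
    """Ray casting with the even-odd rule."""
    x, y = p
    inside = False
    n = len(ring)
    for i in range(n):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical target points, one river polyline and two land cover polygons.
targets = {"t1": (1.0, 1.0), "t2": (5.0, 2.0), "t3": (2.0, 2.5)}
rivers = [[(0.0, 4.0), (10.0, 4.0)]]
land_cover = {
    "plantation": [(0.0, 0.0), (3.0, 0.0), (3.0, 3.0), (0.0, 3.0)],
    "forest": [(3.0, 0.0), (8.0, 0.0), (8.0, 3.0), (3.0, 3.0)],
}

nearest_river = {tid: distance_to_nearest_line(p, rivers) for tid, p in targets.items()}
counts = {name: sum(point_in_polygon(p, ring) for p in targets.values())
          for name, ring in land_cover.items()}
```

In practice a GIS or spatial database would provide these operations; the sketch only shows what the Min and Count measures represent for point targets.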
The decision tree generated by the proposed spatial decision tree algorithm contains 276 leaves and has an accuracy of 87.69%. The first test attribute of the tree is the distance from target objects to the nearest river. Below are some rules generated from the tree:

DISCUSSION
The proposed algorithm performs well on the spatial dataset containing polygon, line and point features, classifying 744 target objects (hotspots and non-hotspots) with an accuracy of 87.69%. Based on the experiment on the real forest fire data, the proposed algorithm can be used to join two spatial objects in constructing a spatial decision tree from a spatial dataset. From the tree we can generate 276 rules in which the first test condition of each rule is the distance from a target object to the nearest river.
The spatial dataset may contain noise and outliers that may lead to overfitting. We will implement a tree pruning method for the resulting tree to overcome this problem, so that the tree becomes simpler and has higher accuracy. Because a real testing dataset was not available, the accuracy was calculated on the training dataset. Therefore, in future work we will apply the tree to another area as a testing spatial dataset to study the performance of the proposed algorithm on a new area.

CONCLUSION
This study presents an extended ID3 algorithm to create a classifier, namely a spatial decision tree, from spatial data. For mining purposes, a spatial dataset is organized as a set of layers grouped into two categories, i.e., explanatory layers and a target layer. All layers are represented as discrete features (polygons, lines and points). The algorithm calculates spatial information gain as the extension of information gain in the non-spatial ID3 algorithm. The spatial measure resulting from spatial relationships, which may be either topological or metric (distance), is used in the formula for spatial information gain instead of the number of tuples used in non-spatial information gain. The algorithm selects the explanatory layer with the highest spatial information gain as the best splitting layer. This layer separates the dataset into smaller partitions that are as pure as possible, such that all tuples in a partition belong to the same class. The extended ID3 algorithm has been applied to the real spatial dataset on forest fires consisting of ten explanatory layers and a target layer. The empirical result shows that the algorithm can be used to join two spatial objects in constructing spatial decision trees. The result is a spatial decision tree consisting of the root (distance from target objects to the nearest river) and 276 leaves. The accuracy of the tree on the training set is 87.69%.

Imas Sukaesih Sitanggang et al. / Journal of Computer Science 9 (2): 244-251, 2013

ACKNOWLEDGMENT
The researchers would like to thank the Indonesia Directorate General of Higher Education (IDGHE), Ministry of National Education, Indonesia for supporting the PhD Scholarship (Contract No. 1724.2/04.4/2008) and the Southeast Asian Regional Center for Graduate Study and Research in Agriculture (SEARCA) for partially supporting the research.