A DATA WAREHOUSE DESIGN FOR THE DETECTION OF FRAUD IN THE SUPPLY CHAIN BY USING THE BENFORD’S LAW

Large data volumes and the inability to analyse them enable fraudulent activities to go unnoticed in supply chain management processes such as procurement, warehouse management and inventory management. This fraud increases the cost of supply chain management, and a fraud detection mechanism is necessary to reduce the risk of fraud in this business area. This study was carried out to develop a data warehouse design that supports forensic analytics by using Benford’s law to detect fraud. The approach relies on a generic and re-usable stored procedure for data analytics. The data warehouse was tested with two datasets collected from an operational supply chain database, drawn from the inventory management and warranty claims processes. The results showed that the supply chain data analysed obey Benford’s law and that parameterized stored procedures with Dynamic SQL provide an excellent tool to analyse supply chain data for possible fraud. The implication is that Benford’s law can be used to detect fraud in the supply chain with the help of parameterized stored procedures and a data warehouse, which can ease the workload of the fraud analyst in the supply chain function. Although the research only used data from the inventory management and warranty claims processes, the proposed stored procedures can be extended to any process in the supply chain, making the results generalizable to the supply chain management process.


INTRODUCTION
The amount of data produced each day increases steadily and forces companies either to surrender or to rise to the challenges and opportunities that this trend brings. Large data volumes and the inability to analyse or even process them enable fraudulent tendencies to go unnoticed, even when new tracking devices or automatic scans are introduced to prevent those tendencies, since these also add to the amount of data. Business areas like supply chains become more complex as the world gets smaller: trading and shipment are done across the globe, and the number of involved parties, systems and transactions is rising. Design and implementation methodologies for systems that process this data must keep pace, and systems must offer new concepts to handle not only the volume of data but also its velocity and variety. Generic approaches that let analysts apply algorithms quickly without re-implementing them, so that they can concentrate on defining the analytical dataset and interpreting the results, are necessary for effective and successful data warehouse design and implementation. Solutions that address these challenges are more or less nonexistent: data warehousing mostly focuses on reporting issues and a stable data basis, while forensic disciplines discuss algorithms and assume that "perfect" data is loaded, stored and ready for use. The goal of the project is to develop a data warehouse design to support analytics, reporting, data mining and forensic analytics by using Benford's law within the area of supply chain. This is to detect fraud or to point out data anomalies that are worth further examination by subject matter experts.

Supply Chain Fraud and Data Mining
Risk consulting company Kroll indicates that "the same information technologies that help shippers manage global supply chains could make them more vulnerable to supply-chain fraud" (Hoffman, 2008). The detailed fraud report of 2008 points out that the increasing demand for natural resources, in particular, often forces companies to act fast: exploring new sites, setting them up for production and finding suppliers and logistics companies to provide energy, staff and the like. As these sites are often in remote areas, the number of available suppliers is quite limited, and this might lead responsible managers into situations where "flexibility" and the need to "make things happen" cause illegal or at least questionable actions. Dependency on one supplier, conflict of interest or bribery may be some of those actions (Kroll Advisory, 2010). Hesketh (2010), however, points out that data within the supply chain is not uniform, mostly not administered properly and can be highly inaccurate or incomplete, which makes the whole process of moving goods non-transparent to both buyer and seller. Not describing the shipped goods creates a whole list of risks for all parties involved, for example to the carrier if the weight is not properly declared, which seems to be common practice in the container ship industry (which is hereby the only sector of that industry) and "is due to many shippers deliberately under-declaring containers' weights in order to minimize import taxes calculated on cargo weight." This lack of information and data collection leads to a lack of knowledge, reduces the options for improvement and even enables criminals or shady operators to take advantage of these deficits, causing damage of 20 billion USD annually.
The demand for data analytics in fraud prevention is high. Mayne (2010) gives the example of enterprises that use forensic methods to predict supply chain fluctuations, often driven by the need to improve their business, follow compliance rules or address concerns regarding data breaches. Data analytics has a history of successful application in fraud detection: Massa and Valverde (2014) implemented a fraud detection system based on anomaly detection data analysis techniques to predict computer intrusion attacks in e-commerce web applications, and Mishra et al. (2013) used data mining in the area of credit card transactional fraud for the Indian market. Forensics has increasingly become the responsibility of IT departments, as they administer the data, possibly on multiple servers spread across the world and with virtual storage ("cloud solutions") involved, which makes forensic analysis a difficult task.
According to Hu (2009), the computer forensics process mainly consists of obtaining, identifying and analysing evidence. The author points out that one problem with forensic evidence is that the collection itself might alter the original state of the evidence (such as the last-accessed timestamp of a file). This seems highly relevant when it comes to intrusion detection and other issues related to network or system breaches, but for legacy supply chain management systems, tax compliance and accounting in general, most transactions must still be preserved on paper or at least in the form in which they were transmitted.
Regarding the security of e-procurement solutions, Stephens and Valverde (2013) note that security violations continue to rise in number as well as frequency. The more information technology becomes the basis for driving business value, the more important and critical transaction data becomes.
Data mining is defined by Nirkhi et al. (2012) as "part of the interdisciplinary field of knowledge discovery in databases", which made its first steps in the 1980s, grew rapidly within the next decade and made its appearance within the area of digital forensics to gain insights from large volumes of data. Hand et al. (2001), cited by Nirkhi et al. (2012), describe the formal methodology of data mining by the following steps:
• Determine type and structure of the datasets
• Decide how best to quantify the data
• Choose proper algorithmic processes for a scoring function
• Decide what data management principles are needed to implement the algorithms efficiently
Additionally, Nirkhi et al. (2012) point out how important visualization is for digital investigators, as it enables them to determine vital information "rapidly and efficiently".
In the literature, the identification of data mining algorithms or methods for digital forensics in general and fraud detection in particular was approached either in general terms, often as mere descriptions of the need to do so, or in very detailed mathematical articles demonstrating the statistical correctness of a method or algorithm. Neither extreme was included in the research for this project.

Benford's Law Theory
Benford's law may be categorized as a descriptive data mining method, as it discriminates data, but also as predictive, as it identifies characteristics of datasets that may help to predict future schemas.
There is a large amount of research on Benford's law: Durtschi et al. (2004) counted more than 150 articles over a period of 50 years, approaching it from many aspects, from mathematical integrity to its use in economics. As most of the mathematical theory is beyond the focus of this project, only articles discussing Benford's law in general or in relation to fraud detection were considered.
The expected frequencies of digits in lists of numbers were described by Benford (1938) and have been verified, but also controversially discussed, in numerous papers to this day. The method is in widespread use: Geyer and Marti (2012) applied it to volcanological data and indicate that their datasets follow Benford's law and that their "analysis shows that excessive data round-off, data errors, or anomalies may be detected when comparing the data with Benford's law expected frequencies".
On the other end of the discipline scale, Nigrini and Mittermaier (1997) introduce Benford's law as a method for accounting auditing, where analytical procedures are required, and successfully apply the law to a sample of data from an oil company. However, Özera and Babacanb (2013) found that the off-balance sheet account disclosures of a Turkish bank do not comply with Benford's law but admit that if "a data set does not comply with Benford Distribution is enough to be suspicious about that data set". Nigrini (1999) listed multiple applications for Benford's law and digital analysis, some of which give a hint as to where the analysis might be helpful within the supply chain process:
• Estimations in the general ledger
• The relative size of inventory unit prices among locations
• Duplicate payments
• Processing inefficiencies due to high quantity/low dollar transactions
• Customer refunds
Benford found that numbers with low first digits occur more often after he observed that, in a book of logarithmic tables, the pages with low first digits (1, 2) were more worn than the ones for high first digits (8, 9). He then derived the expected frequencies of the digits:
$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$
where d is a number in {1, 2, ..., 9} and P is the probability (Nigrini, 2012). The first-two digit test is regarded by Nigrini (2012) as a more focussed test than the first digit test and "is there to detect abnormal duplications of digits and possible biases in the data". The expected frequencies for the first two digits take the same form:
$$P(d_1 d_2) = \log_{10}\left(1 + \frac{1}{d_1 d_2}\right)$$
where d_1 d_2 is a number in {10, 11, ..., 99}.
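As an illustration of the expected frequencies discussed above, the following minimal sketch computes them from the logarithmic form in which Benford's law is commonly stated, P(d) = log10(1 + 1/d). The function and variable names are hypothetical and are not taken from the study.

```python
import math


def benford_first_digit(d):
    """Expected probability that the first significant digit equals d (1..9)."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be in 1..9")
    return math.log10(1 + 1 / d)


def benford_first_two_digits(dd):
    """Expected probability that the first two significant digits equal dd (10..99)."""
    if not 10 <= dd <= 99:
        raise ValueError("first two digits must be in 10..99")
    return math.log10(1 + 1 / dd)


# Hypothetical names: dictionaries holding the expected frequencies.
first_digit_probs = {d: benford_first_digit(d) for d in range(1, 10)}
first_two_probs = {dd: benford_first_two_digits(dd) for dd in range(10, 100)}

for d, p in first_digit_probs.items():
    print(d, round(p, 3))
```

Both sets of probabilities sum to one, and the first-two digit probabilities for 10 to 19 add up to the first digit probability for 1, which is a quick consistency check on any lookup table built this way.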

MATERIALS AND METHODS
The design approach followed for the data warehouse is described below:
• Discover the entities required for the data model of the data warehouse by examining the typical processes and events in supply chain management
• Interview two forensic consultants at Teradata Germany with experience in supply chain in order to identify a set of rules that will be used to identify attributes for the different entities of the data model. This part of the research focuses on the warranty claims and inventory management processes. Although only two processes were selected for this part of the study, the same procedure can be followed for all the different processes of the supply chain
• Design a stored procedure that will dynamically create views for Benford analysis
• Identify suitable datasets to test the data warehouse and the stored procedure
• Generate a set of views for interpretation and visualization of the data

RESULTS
Based on the analysis of Hesketh (2010), which provides an excellent overview of the processes, events and involved parties within supply chain management, the following subject areas were identified as the main source of entities for our data warehouse design: damaged part returns, the warranty process and terms. The major entities of every subject area were identified and an entity relationship data model for the data warehouse was created (Fig. 1).
Two forensic consultants at Teradata Germany with experience in supply chain were interviewed in order to identify a set of rules that can be used to detect fraud in warranty claims and inventory management, which is the main scope of this research. These rules will be used to identify attributes for the different entities of the data model for the data warehouse. The rules identified are summarized below:
Inventory: Unpacking a container that was delivered to a production plant implies transactions to the inventory of the plant. Boxes within the container might be damaged along with the contained goods. By comparing the content with the purchase order and bill of lading, deviations might become obvious. Claims against the consignor, packaging facility or supplier might be created on this basis. Fraud can be concealed by taking a small number of parts from a number of boxes and then blaming the supplier for shortages, or by declaring items as damaged and then blaming the consignor or packaging facility.
Returns and Warranty Claims: Claims may occur at certain stages of the supply chain or affect it end-to-end. For example, a car manufacturer uses spark plugs from a supplier to build its cars. The car is shipped to the dealer, who sells it to a customer. Three months later the customer complains to the dealer that the car does not start. Inspection shows that the spark plugs are malfunctioning, so the dealer exchanges the plugs and opens a warranty claim with the manufacturer of the car. The car manufacturer then raises another warranty claim with the manufacturer of the plugs. Not all failed parts are sent back to the original supplier for inspection (owing to the costs and quantities involved), but payments are granted either on the basis of a subset that was inspected or on goodwill.
Receiving goods: Goods might be stolen or damaged on a frequent basis, maybe depending on the numbers of parts delivered or price of the parts.

Stored Procedure Design
A stored procedure was designed to dynamically create four types of views, based on the input parameters handed over with the procedure call. The design is depicted in Fig. 2. The stored procedure's IN parameters are described in Table 2.
The stored procedure creates four views. The VV_BENFORD_STAT view contains statistical information about the dataset chosen. The VV_BENFORD_DET view provides detailed information for the Benford analysis and helps to determine the first and first two digits of a set of numbers. The VV_BENFORD_FIRST_RES view counts how often each first digit appeared in the dataset, while the VV_BENFORD_TWO_RES view counts how often each combination of first two digits appeared in the dataset. The result views provide a Z score that measures how statistically different the actual result is from the expected result and therefore helps to answer the question of how confident one can be about the observed relationship.
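The digit extraction and the Z score described above can be sketched as follows. This is an illustrative sketch only: the paper does not give the Z score formula, so the version below assumes the Z statistic with continuity correction that is commonly used with Benford tests, and all function names are hypothetical.

```python
import math
from collections import Counter


def leading_digits(value, n=1):
    """Return the first n significant digits of value as an int, or None for zero."""
    if value == 0:
        return None
    # Scientific notation avoids float artefacts such as 0.3 / 0.1 == 2.9999...
    mantissa = format(abs(value), ".15e").split("e")[0]
    return int(mantissa.replace(".", "")[:n])


def z_score(actual_prop, expected_prop, n):
    """Z statistic with continuity correction (assumed form, not from the paper)."""
    diff = abs(actual_prop - expected_prop)
    correction = 1 / (2 * n)
    if correction < diff:  # apply the correction only when it is smaller than the gap
        diff -= correction
    return diff / math.sqrt(expected_prop * (1 - expected_prop) / n)


def first_digit_test(values):
    """Return {digit: (count, actual_prop, expected_prop, z)} for digits 1..9."""
    digits = [d for d in (leading_digits(v) for v in values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no non-zero values to analyse")
    counts = Counter(digits)
    result = {}
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        actual = counts.get(d, 0) / n
        result[d] = (counts.get(d, 0), actual, expected, z_score(actual, expected, n))
    return result


# Made-up sample amounts, for illustration only.
sample = [120.5, 19.9, 0.0, 310.0, 1450.0, 27.3, 0.18, 95.0]
for digit, (count, actual, expected, z) in first_digit_test(sample).items():
    print(digit, count, round(actual, 3), round(expected, 3), round(z, 2))
```

Under this assumed formula, a Z score above 1.96 would indicate a difference that is significant at the 5% level, which is one possible way to decide which digits deserve a closer look.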

Data Collection
Two datasets were collected from projects developed by Teradata Germany. Both datasets are car manufacturer supply chain/quality data. As the data was not allowed to leave the company premises, all testing was done on-site and on anonymised production data. Therefore the developed objects were deployed to a sandbox on the production system and executed.
The first dataset (Fig. 3) reflects two years of warranty claims from dealers towards the car manufacturer regarding defective parts and consequential costs.
The second dataset (Fig. 4) is much smaller and reflects one year of inventory data, filtered only for transactions that have a transaction type = 'LOST', which might include stolen, damaged or lost items.

Table 2. IN parameters of the stored procedure
[...]: The name of the database that contains the table to be analysed
table_nm: The name of the table
key_col: The name of the key column of that table
value_col: The column with the data to be analysed
result_db_nm: The database where the views will be created

Benford Implementation
The Benford probabilities were calculated and stored in a lookup table with SQL statements, one at a time, for the numbers from 1 to 9 and then from 10 to 99, so that the first digit test as well as the first-two digit test can be performed. If the second, third or fourth digit test also needs to be performed, the pattern used can be extended easily. As described in section 3.1, the stored procedure is intended to create four views, which is where the Dynamic SQL is used. For the first view (statistical information about the chosen dataset), the call for generating the view within the stored procedure is depicted in Fig. 5.
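A lookup table of this kind might be populated as in the sketch below. It uses an in-memory SQLite database purely for illustration instead of the system used in the study, and it fills the table in a loop rather than one statement at a time. The table name T_BENFORD_LOOKUP and its column names are hypothetical. The last query shows one way the pattern could be extended: second digit probabilities are derived by summing the first-two digit rows over the leading digit.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical lookup table: digit_cnt = 1 for first digit rows, 2 for first-two digit rows.
conn.execute(
    "CREATE TABLE T_BENFORD_LOOKUP ("
    " digit_cnt INTEGER NOT NULL,"
    " digits INTEGER NOT NULL,"
    " expected_prob REAL NOT NULL,"
    " PRIMARY KEY (digit_cnt, digits))"
)


def benford_rows(digit_cnt):
    """Rows for all leading digit groups of the given length (1..9, 10..99, ...)."""
    low, high = 10 ** (digit_cnt - 1), 10 ** digit_cnt
    return [(digit_cnt, k, math.log10(1 + 1 / k)) for k in range(low, high)]


for digit_cnt in (1, 2):
    conn.executemany(
        "INSERT INTO T_BENFORD_LOOKUP VALUES (?, ?, ?)", benford_rows(digit_cnt)
    )
conn.commit()

# Second digit probabilities derived from the first-two digit rows.
second_digit_probs = dict(
    conn.execute(
        "SELECT digits % 10 AS second_digit, SUM(expected_prob) "
        "FROM T_BENFORD_LOOKUP WHERE digit_cnt = 2 "
        "GROUP BY digits % 10 ORDER BY second_digit"
    )
)
row_count = conn.execute("SELECT COUNT(*) FROM T_BENFORD_LOOKUP").fetchone()[0]
print(row_count, {d: round(p, 4) for d, p in second_digit_probs.items()})
```

Keeping the digit count as part of the key lets one table serve several tests, so a view only has to filter on digit_cnt when joining actual counts to expected probabilities.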
The stored procedure is called handing over the IN-Parameters as depicted in Fig. 6.
The IN-Parameters are replaced upon execution within the SYSEXECSQL call, creating a valid SQL statement which is executed directly and results in the creation of the view as depicted in Fig. 7.
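The parameter substitution described above can be illustrated with a small sketch. It is not the stored procedure from the study: it uses Python with an in-memory SQLite database, omits the two database name parameters because SQLite has no comparable qualifier here, and the view definition (row count, minimum, maximum, average) is only an assumption about what the statistical view might contain. The columns claim_id and claim_amt are hypothetical.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    """Accept only plain identifiers, since they are spliced into the SQL text."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


# Assumed content of the statistics view; the placeholders mirror the IN parameters.
STAT_VIEW_TEMPLATE = (
    "CREATE VIEW VV_BENFORD_STAT AS "
    "SELECT COUNT({key_col}) AS row_cnt, MIN({value_col}) AS min_val, "
    "MAX({value_col}) AS max_val, AVG({value_col}) AS avg_val "
    "FROM {table_nm} WHERE {value_col} <> 0"
)


def create_stat_view(conn, table_nm, key_col, value_col):
    """Build the CREATE VIEW statement from the parameters and execute it."""
    sql = STAT_VIEW_TEMPLATE.format(
        table_nm=_ident(table_nm),
        key_col=_ident(key_col),
        value_col=_ident(value_col),
    )
    conn.execute("DROP VIEW IF EXISTS VV_BENFORD_STAT")
    conn.execute(sql)
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_SAMPLE_SET (claim_id INTEGER PRIMARY KEY, claim_amt REAL)")
conn.executemany(
    "INSERT INTO T_SAMPLE_SET VALUES (?, ?)",
    [(1, 120.5), (2, 19.9), (3, 0.0), (4, 310.0)],
)
generated_sql = create_stat_view(conn, "T_SAMPLE_SET", "claim_id", "claim_amt")
stat_row = conn.execute("SELECT row_cnt, min_val, max_val FROM VV_BENFORD_STAT").fetchone()
print(generated_sql)
print(stat_row)
```

Because table and column names cannot be bound as ordinary query parameters, the sketch validates them against a strict pattern before they are inserted into the statement text, a precaution that is advisable for any Dynamic SQL that accepts object names from a caller.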
Once the stored procedure is successfully executed with a first set of data (in T_SAMPLE_SET), all used and created database objects are shown below in Fig. 8.
Third or fourth digit tests might be implemented following the same pattern.

Visualization
Tableau (www.tableausoftware.com) was used for visualisation. The views provided by the stored procedure were the basis for reporting, and via the VV_BENFORD_DET view any connection to related information could be achieved. Following these guidelines, three visualisation sets were created and used to display the three datasets available. The visualisation in Fig. 9 depicts dataset 1. Benford's values for first digits are in orange, the result set values in blue. As the pattern of the actual values aligns with the Benford values, this dataset obeys Benford's law; however, some values deviate marginally, such as those for the digits 1, 2 and 3. The visualisation in Fig. 10 depicts dataset 2. Benford's values for first digits are in red, the result set values in green. As the pattern of the actual values aligns with the Benford values, this dataset obeys Benford's law closely, even more so than dataset 1. For a more detailed look into large datasets, two options were chosen for dataset 2 to find peculiarities. This can be regarded as the next step after identifying peculiarities in Fig. 12.
Figure 11. Dataset 2, first two digit test (Benford in red, actual values in blue)
Examples of the groupings applied next:
• A grouping based on vendors
• Only vendors having a deviation from Benford's law > 0.15 for at least one digit
While reviewing the result, it became apparent that vendors with low numbers of claims had high deviations from Benford's law. Therefore a filter was applied so that only vendors with more than 20 claims in the period were considered. Figure 13 shows the result of applying those two filters for one particular vendor (#9206).
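The grouping and the two filters described above can be sketched as follows. The sketch assumes that "deviation" means the absolute difference between a vendor's observed first digit proportion and the Benford proportion, which the text does not state explicitly. The function names and the claim data are hypothetical and made up for illustration.

```python
import math
from collections import Counter, defaultdict


def first_digit(amount):
    """First significant digit of amount, or None for zero."""
    if amount == 0:
        return None
    return int(format(abs(amount), ".15e")[0])


def flag_vendors(claims, min_claims=20, max_deviation=0.15):
    """Return {vendor: (digit, deviation)} for vendors that pass both filters."""
    digits_by_vendor = defaultdict(list)
    for vendor, amount in claims:
        digit = first_digit(amount)
        if digit is not None:
            digits_by_vendor[vendor].append(digit)

    flagged = {}
    for vendor, digits in digits_by_vendor.items():
        n = len(digits)
        if n <= min_claims:  # filter 1: only vendors with more than min_claims claims
            continue
        counts = Counter(digits)
        deviations = {
            d: abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(1, 10)
        }
        worst = max(deviations, key=deviations.get)
        if deviations[worst] > max_deviation:  # filter 2: at least one digit deviates
            flagged[vendor] = (worst, deviations[worst])
    return flagged


# Made-up claims: one vendor dominated by amounts starting with 2,
# one vendor with too few claims, and one vendor close to Benford's law.
claims = [("9206", 250.0)] * 20 + [("9206", 130.0)] * 10
claims += [("A", 900.0)] * 5
benford_like = [1] * 9 + [2] * 5 + [3] * 4 + [4] * 3 + [5, 6, 7, 8] * 2 + [9]
claims += [("B", d * 100.0) for d in benford_like]

flagged = flag_vendors(claims)
print(flagged)
```

In this made-up example only the first vendor is flagged, for digit 2, while the vendor with five claims is excluded by the claim count filter even though all of its amounts start with the same digit.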

DISCUSSION
For Benford's first digit test, both datasets show no outstanding peculiarities and follow the pattern of Benford's law. According to Nigrini (1996), if the complete dataset follows Benford's law, then subsets (not grouped by digits, of course) should also comply.
In Fig. 11, the first two digit test is depicted with Tableau. Here, a slightly different chart type was chosen, as the number of digits was too high to assign them to the x-axis and still get a chart to fit on the screen without scrolling.
Overall, the dataset also obeys Benford's law in the first two digit test, as the pattern matches Benford's numbers for most of the first two digits. The visualisation is much more granular and more details need to be reviewed to find peculiarities.
The analyst would use this chart to identify deviations and "drill down" into some of them to review the source of the deviation. In Fig. 12, three of the values that deviate from Benford's value are highlighted.
The advantage of a drill-down option or of further groupings and filters is obvious when looking at the many deviations that might be worth investigating further, as these include not only the highlighted values but many more.
The chart in Fig. 13 shows that the first digit 2 in particular occurred very frequently in this vendor's claim amounts, much more often than other digits, resulting in a high deviation from Benford's law for this digit. This observation makes an analysis at the lowest level necessary, that is, of the actual records underneath. A drill-down to the detail data for dataset 1 lists all the amounts claimed by vendor #9206 in Fig. 13. Further research identified that these claims were part of a product recall campaign, in which a huge number of faulty parts had to be replaced; the replacement was ordered by the manufacturer but carried out by the vendors. No actual case of fraud was identified, but a case quite out of the ordinary was detected without prior knowledge of it.

CONCLUSION
By applying the method to two datasets for the business area, the following aspects surfaced and can be projected to all data within the supply chain:
• The chosen approach (applying the method to full sets of data, in-database style) is applicable, performs well in terms of query response time and can be applied repeatedly to either newer datasets or other datasets without change
• Even if data is available, it might not be eligible for applying Benford's law if the dataset is too small or is of such poor quality that interpretation would only generate effort but no proper results
• Both datasets followed Benford's law for the first digit and, with some deviations, also for the first two digits. As the datasets differ in size, origin and type of data, this is a very interesting observation and gives room for the assumption that other data from the area of supply chain might also be worth testing against Benford's law
The results of the research were consistent with Tödter (2009), who states that "Benford's law is a potentially useful instrument to discover fraud and manipulation", as the implementation and its application to the documented datasets showed general obedience to Benford's law (first and first-two digits) and enabled the researcher to detect anomalies, if only in the form of peculiarities for further analysis. Tödter (2009) also states that "in every single case there may be other plausible reasons for deviations from Benford's law, such as insufficient variability of the underlying data, rounding effects or other irregularities". This was consistent with the results of this study, as it was observed that the more the data was filtered and the smaller and less diverse the dataset was, the bigger the deviation from Benford's law was.
Supply chain data, like any other data environment, requires detailed business know-how for interpretation and to prevent misinterpretations. Additionally, mathematical and statistical skills are vital for choosing the most promising dataset and the appropriate method to test it with.
One limitation of this study is that it focused mainly on inventory management and warranty claims. Future research in this area should explore other supply chain processes, such as procurement and warehouse management, in order to test the generalizability of the data warehouse design. Future research should also focus on the development of a generic approach using parameterized stored procedures with Dynamic SQL that could make it possible to operationalize data mining methods and implement other fraud detection algorithms, such as neural networks, frequent pattern analysis or classification methods, in order to ease the workload of the analyst and check data on a frequent basis rather than only once.

ACKNOWLEDGMENT
Special thanks to Dr. Michael Benz from Teradata for his support and expertise in the area of forensic analysis and data mining.

REFERENCES
Benford, F., 1938. The law of anomalous numbers. Proc.