1/11/13

The Phases of the KDD Process


A KDD process consists of several tasks. The actual mining, that is, the application of a data mining algorithm to a dataset, is only one of these steps. Following the CRISP-DM (Cross-Industry Standard Process for Data Mining) model [9,31], we distinguish the following tasks:

1. Business Understanding
The very first step of a KDD project should be a close look at the project from the business point of view. The goal of this phase is to gain a deeper understanding of the project objectives and other relevant circumstances strictly from the business perspective. Finally, the insights from this initial phase are turned into a data mining problem definition.

2. Data Understanding
Based on the results of the business understanding phase, the second step is to become familiar with the available data. The goal is to understand the attributes and their values and to uncover any hidden semantics in the data. Furthermore, at this stage one should determine what exactly the available data offers, that is, whether it has the potential to answer our mining questions, and, if possible, select promising subsets of the data.
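
As a minimal sketch of what getting familiar with the attributes and their values might look like in practice, the following Python snippet summarizes each attribute of a small dataset. The function name `profile_attributes`, the sample `customers` records, and the choice of summary statistics are all illustrative assumptions, not part of CRISP-DM itself.

```python
from collections import Counter

def profile_attributes(records):
    """Summarize each attribute: missing count, distinct values, frequent values.

    Hypothetical helper for illustration; records are plain dicts and a value
    of None (or an absent key) is treated as missing.
    """
    attributes = sorted({name for record in records for name in record})
    profile = {}
    for name in attributes:
        values = [record.get(name) for record in records]
        present = [value for value in values if value is not None]
        profile[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(3),
        }
    return profile

# Invented sample data, used only to demonstrate the helper.
customers = [
    {"id": 1, "region": "north", "segment": "A"},
    {"id": 2, "region": "south", "segment": None},
    {"id": 3, "region": "north"},
]
summary = profile_attributes(customers)
```

A summary like this can hint at which attributes are too sparse to be useful and which subsets of the data look promising for the mining questions at hand.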

3. Data Preparation
The next step is to construct the dataset on which the mining algorithm is to be run. This phase covers both syntactic aspects (format transformations required by the employed mining algorithm) and semantic aspects such as table, record, and attribute selection. Last but not least, this phase also includes deriving new attributes that make explicit higher-level information only implicitly contained in the raw data (e.g. deriving “day of the week” from “date”).
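
The “day of the week” derivation mentioned above can be sketched in a few lines of Python. This is only one possible approach: the function name `add_day_of_week`, the field names, the sample `sales` records, and the assumption that dates are stored as ISO strings (YYYY-MM-DD) are all hypothetical.

```python
from datetime import date

# Fixed English names, indexed by date.weekday() (Monday is 0).
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def add_day_of_week(records, date_field="date"):
    """Return copies of the records with a derived 'day_of_week' attribute.

    Assumes the date field holds an ISO-formatted string (YYYY-MM-DD).
    """
    enriched = []
    for record in records:
        parsed = date.fromisoformat(record[date_field])
        enriched.append({**record, "day_of_week": DAY_NAMES[parsed.weekday()]})
    return enriched

# Invented sample data, used only to demonstrate the derivation.
sales = [
    {"date": "2013-01-11", "amount": 120.0},
    {"date": "2013-01-13", "amount": 80.5},
]
prepared = add_day_of_week(sales)
```

The original records are left untouched, so the raw data remains available if a different derivation turns out to be more useful later.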

4. Modeling (or Mining)
In the modeling phase the actual data mining takes place. Based on the identified business goals and the assessment of the available data, an appropriate mining algorithm is chosen and run on the prepared data.

5. Evaluation
Evaluating the results of the mining run mainly covers three aspects. First, it is necessary to check whether everything went right from the technical point of view. Was the mining algorithm able to read and interpret the prepared dataset correctly? Was all designated information actually given to the algorithm? Second, one needs to investigate whether the mining results are sound from the mining method's point of view. Some methods directly support this decision by computing certain significance measures, whereas others leave this aspect entirely to the analyst's experience. Third, a key objective of the evaluation phase is to determine whether all important business issues have been considered adequately.
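
The first, technical aspect lends itself to a simple automated check. The sketch below, under the assumption that the prepared dataset is a list of dicts, reports which designated attributes are absent or empty and in which records. The function name `check_prepared_dataset` and the sample data are hypothetical; a real project would adapt the check to its own data format.

```python
def check_prepared_dataset(records, designated_attributes):
    """Map each designated attribute to the record indexes where it is missing.

    Attributes that are present in every record do not appear in the result,
    so an empty dict means the check passed.
    """
    problems = {}
    for name in designated_attributes:
        missing_rows = [
            index for index, record in enumerate(records)
            if record.get(name) is None
        ]
        if missing_rows:
            problems[name] = missing_rows
    return problems

# Invented sample data: the second record lacks the derived attribute.
prepared_rows = [
    {"amount": 120.0, "day_of_week": "Friday"},
    {"amount": 80.5},
]
issues = check_prepared_dataset(prepared_rows, ["amount", "day_of_week"])
```

A check of this kind only covers the technical aspect; whether the results are sound for the chosen method and adequate for the business questions still requires the analyst's judgment.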

6. Deployment
After mining the data and assessing the data mining results, one needs to transfer the results back into the business environment. This can be rather straightforward, such as preparing the results in the form of a report that is understandable by business people (who typically are not data mining experts). At the other extreme, it can be quite complex, such as implementing a repeatable data mining process across the enterprise.
