The Phases of the KDD Process
A KDD process consists of several tasks. The actual mining, that is,
the application of a data mining algorithm to a dataset, is only one
of these steps. Following the CRISP-DM model [9,31], we distinguish
the following tasks:
1. Business Understanding
The very first step of a KDD project should be a close look at the
problem from the business point of view. The goal of this phase is to
gain a deeper understanding of the project objectives and the
surrounding circumstances strictly from the business perspective.
Finally, the insights from this initial phase are turned into a data
mining problem definition.
2. Data Understanding
Based on the results of the business analysis, the second step is to
become familiar with the available data. The goal is to understand
the attributes and their values and to uncover any hidden semantics
in the data. Furthermore, at this stage one should determine what
exactly the available data offers, that is, whether it has the
potential to answer the mining questions, and, if possible, select
promising subsets of the data.
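
As an illustration of getting familiar with attributes and their values, the following minimal sketch profiles a small set of records. It assumes the data is available as a list of Python dictionaries; the function name `profile_attributes` and the sample records are hypothetical and serve only as an example.

```python
from collections import Counter


def profile_attributes(records):
    """Summarize each attribute: missing count, distinct values, top values."""
    profile = {}
    # Collect every attribute name that occurs in at least one record.
    attributes = sorted({name for record in records for name in record})
    for name in attributes:
        values = [record.get(name) for record in records]
        # An absent key and an explicit None are both treated as missing.
        present = [value for value in values if value is not None]
        profile[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(3),
        }
    return profile


# Hypothetical sample data, for illustration only.
sample = [
    {"customer": "A", "region": "north", "age": 34},
    {"customer": "B", "region": "north", "age": None},
    {"customer": "C", "region": "south"},
]
summary = profile_attributes(sample)
for name, stats in summary.items():
    print(name, stats)
```

A profile like this can hint at whether an attribute is usable at all: an attribute that is mostly missing or has a single distinct value is unlikely to help answer the mining questions.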
3. Data Preparation
The next step is to construct the dataset on which the mining
algorithm is to be run. This phase covers both syntactic aspects
(format transformations required by the employed mining algorithm)
and semantic aspects such as table, record and attribute selection.
Last but not least, this phase also includes deriving new attributes
that make explicit higher-level information only implicitly contained
in the raw data (e.g. deriving “day of the week” from “date”).
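
The derivation of “day of the week” from “date” could look like the following minimal sketch. It assumes that dates are stored as ISO 8601 strings (YYYY-MM-DD); the field names `date` and `day_of_week`, the function name, and the sample records are hypothetical.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def add_day_of_week(records, date_field="date", new_field="day_of_week"):
    """Return copies of the records extended by a derived weekday attribute."""
    prepared = []
    for record in records:
        enriched = dict(record)  # copy, so the raw data stays untouched
        parsed = date.fromisoformat(record[date_field])
        enriched[new_field] = WEEKDAY_NAMES[parsed.weekday()]
        prepared.append(enriched)
    return prepared


# Hypothetical raw data, for illustration only.
raw_records = [
    {"date": "2024-01-01", "amount": 19.99},
    {"date": "2024-01-06", "amount": 5.00},
]
prepared_records = add_day_of_week(raw_records)
print(prepared_records[0]["day_of_week"])  # Monday
```

The derived attribute is written to a copy of each record, so the raw data remains available for other preparation steps.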
4. Modeling (or Mining)
In the modeling phase the actual data mining takes place. Based on
the identified business goals and the assessment of the available
data, an appropriate mining algorithm is chosen and run on the
prepared data.
5. Evaluation
Evaluating the results of the mining run mainly covers three aspects.
First, it is necessary to check whether everything went right from
the technical point of view. Was the mining algorithm able to read
and interpret the prepared dataset correctly? Was all designated
information actually passed to the algorithm? Second, one needs to
investigate whether the mining results are sound from the point of
view of the mining method. Some methods directly support this
decision by computing certain significance measures, whereas others
leave this aspect completely to the analyst's experience. Third, a
key objective of the evaluation phase is to determine whether all
important business issues have been considered adequately.
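
The first, technical aspect can be partly automated. The following minimal sketch checks whether all designated attributes actually carry a value in the prepared records. It is only one possible sanity check, not a prescribed part of the evaluation phase; the function name, the attribute names and the sample records are hypothetical.

```python
def check_designated_attributes(records, designated):
    """Return, per designated attribute, the number of records lacking a value."""
    gaps = {}
    for name in designated:
        # An absent key and an explicit None are both counted as missing.
        missing = sum(1 for record in records if record.get(name) is None)
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical prepared data; the second record lacks the derived attribute.
prepared = [
    {"date": "2024-01-01", "day_of_week": "Monday", "amount": 19.99},
    {"date": "2024-01-06", "amount": 5.00},
]
gaps = check_designated_attributes(prepared, ["date", "day_of_week", "amount"])
print(gaps)  # {'day_of_week': 1}
```

An empty result means every designated attribute was present in every record. It does not show that the algorithm interpreted the values correctly, which still has to be verified separately.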
6. Deployment
After mining the data and assessing the data mining results, one
needs to transfer the results back into the business environment.
This can be rather straightforward, such as preparing the results in
the form of a report that is understandable by business people (who
typically are not data mining experts). At the other extreme, it can
be quite complex, such as implementing a repeatable data mining
process across the enterprise.