Monday, March 30, 2009

Decision Trees (part I)

Let's talk about our beloved Decision Trees (DTs). I have a very good friend who always suggests using a decision tree for every data mining task. After all, we use DTs every day in real life.

What is a decision tree? Nothing more than a long sequence of "if .. then else .. if then else .. if then .. else ..." ;-) This definition does not do justice to the power of DTs, though. The sequence of choices is not pre-defined: it is the result of a machine learning process on the training dataset. In a DT, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The whole binary structure of the DT is learnt automatically.
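For instance, a hand-written classifier in the "chain of ifs" spirit might look like the sketch below; the feature names and thresholds are invented for illustration, and unlike a learnt DT they are fixed in advance rather than discovered from data.

```cpp
#include <string>

// A hand-coded "decision tree": every internal test is an if,
// every return is a leaf. A learnt DT builds this chain of
// tests automatically from the training data.
std::string classify(double income, bool owns_house, int age) {
    if (income > 50000.0) {
        if (owns_house) return "low risk";
        else            return "medium risk";
    } else {
        if (age < 25)   return "high risk";
        else            return "medium risk";
    }
}
```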

The training dataset is seen as a matrix, where each row is a record. For each feature, and for each value that feature takes in some row, the dataset is split into two parts A and B. An auxiliary score function (such as entropy or Gini impurity) is used to compute the information gain of the split (A, B). The feature (and value) yielding the best information gain is chosen as the split feature [e.g. if (... feature_chosen ...) then ... else ...]. A node N is created containing the splitting feature, and the left and right children of N are built recursively by applying the same learning process to the subsets A and B, respectively.
DTs can work with both numerical and categorical features (e.g. strings).
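To make the scoring step concrete, here is a minimal C++ sketch of entropy-based information gain computed over class labels; the function names and the labels-as-strings representation are assumptions made for illustration, not code taken from the post.

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Entropy of a set of class labels: -sum_i p_i * log2(p_i).
double entropy(const std::vector<std::string>& labels) {
    if (labels.empty()) return 0.0;
    std::map<std::string, int> counts;
    for (const auto& label : labels) ++counts[label];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / labels.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting 'all' into the two subsets A and B:
// entropy before the split minus the weighted entropy after it.
double information_gain(const std::vector<std::string>& all,
                        const std::vector<std::string>& a,
                        const std::vector<std::string>& b) {
    double weight_a = static_cast<double>(a.size()) / all.size();
    double weight_b = static_cast<double>(b.size()) / all.size();
    return entropy(all) - (weight_a * entropy(a) + weight_b * entropy(b));
}
```

Gini impurity can be plugged in the same way by replacing entropy with 1 - sum_i p_i^2.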

Here is the C++ code for building the DT (note that features are represented with boost::variant).
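As a rough idea of what such a representation might look like, the following is a minimal sketch assuming features are stored in a boost::variant<double, std::string>; the Record, Node, and goes_left names are invented for illustration, and this is not the post's original listing.

```cpp
#include <boost/variant.hpp>
#include <memory>
#include <string>
#include <vector>

// A feature is either numerical or categorical (e.g. a string).
typedef boost::variant<double, std::string> Feature;

// A record: one row of the training matrix plus its class label.
struct Record {
    std::vector<Feature> features;
    std::string label;
};

// A node of the learnt binary tree.
struct Node {
    int feature_index;            // column used by the split test
    Feature split_value;          // threshold (numerical) or category (categorical)
    std::string leaf_label;       // classification stored in the leaves
    std::unique_ptr<Node> left;   // subtree for records passing the test
    std::unique_ptr<Node> right;  // subtree for records failing the test
};

// The test applied at an internal node: ">= threshold" for numbers,
// "== category" for strings (assumes a column always holds one type).
bool goes_left(const Record& r, int feature_index, const Feature& split_value) {
    const Feature& f = r.features[feature_index];
    if (const double* threshold = boost::get<double>(&split_value))
        return boost::get<double>(f) >= *threshold;
    return boost::get<std::string>(f) == boost::get<std::string>(split_value);
}
```

The recursive build then repeatedly picks the (feature_index, split_value) pair with the best information gain, partitions the records with goes_left, and recurses on the two subsets.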

In a future post, we will discuss pruning DTs and machine learning classification using DTs.
