Decision Tree

A decision tree is a non-parametric, supervised learning method.

There are three types of nodes:

  • Root node

  • Inner node

  • Leaf node

To build a tree, we can use a top-down method: from the root to the leaves, at each node we select the feature whose split yields the largest information gain.
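A minimal sketch of building such a tree with scikit-learn (the library, dataset, and parameters here are illustrative assumptions, not part of the notes above); `criterion="entropy"` makes the splits use information gain:

```python
# Minimal sketch: fitting a decision tree with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects each split by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```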

Information gain:

  • Entropy: $H(T) = I_E(p_1, p_2, \ldots, p_J) = -\sum_{i=1}^{J} p_i \log_2 p_i$

  • Information gain: the difference in entropy before and after splitting a node

$IG(T,a) = H(T) - \sum_a p(a)\, H(T \mid a) = -\sum_{i=1}^{J} p_i \log_2 p_i - \sum_a p(a) \left( -\sum_{i=1}^{J} p(i \mid a) \log_2 p(i \mid a) \right)$
  • Example: see the sketch below, which computes entropy and information gain on a toy dataset.
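A small sketch of these formulas in plain Python (the toy `play`/`outlook` data below is made up purely for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    # H(T) = -sum_i p_i * log2(p_i) over the class distribution of `labels`
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # IG(T, a) = H(T) - sum_a p(a) * H(T | a)
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy example (made-up values): how much does "outlook" tell us about "play"?
play    = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "sunny"]
print(information_gain(play, outlook))
```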

Pruning:

One of the main issues with decision trees is overfitting. We can use pruning to reduce overfitting.

There are two methods:

  • Pre-pruning: early stopping during tree construction, e.g. requiring a minimum weight or minimum number of samples in a child/leaf node (see the sketch after this list)

  • Post-pruning: the basic idea is to remove nodes that do not provide additional information after the tree has been grown. There are two main techniques:

    • bottom-up method;

    • top-down method;
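A minimal pre-pruning sketch, assuming scikit-learn (the post-pruning methods are covered next): hyperparameters such as `max_depth` and `min_samples_leaf` stop splitting early instead of growing the full tree:

```python
# Pre-pruning sketch (illustrative): stop splitting early via hyperparameters.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, min_samples_split=10)
clf.fit(X, y)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```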

Bottom-up method: Reduced error pruning

  • Starting at the leaves, replace each node with a leaf predicting its most popular (majority) class (see the sketch after this list)

    • If prediction accuracy on a validation set is not affected --> keep the change

    • If accuracy degrades --> revert the change
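A rough sketch of reduced error pruning; the `Node` class and helper functions below are assumptions made up for illustration, and a held-out validation set decides whether each pruned subtree is kept:

```python
class Node:
    # Hypothetical node: a leaf has `label` set; an internal node has a splitting
    # `feature`, `children` keyed by feature value, and `majority` (its most
    # popular class on the training data).
    def __init__(self, label=None, feature=None, children=None, majority=None):
        self.label = label
        self.feature = feature
        self.children = children or {}
        self.majority = majority

def predict(node, x):
    if node.label is not None:
        return node.label
    child = node.children.get(x[node.feature])
    return predict(child, x) if child is not None else node.majority

def accuracy(tree, X_val, y_val):
    return sum(predict(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(node, tree, X_val, y_val):
    # Bottom-up: prune the children first, then try replacing `node` with a leaf
    # holding its majority class; keep the change only if validation accuracy
    # does not drop, otherwise revert.
    if node.label is not None:
        return
    for child in node.children.values():
        reduced_error_prune(child, tree, X_val, y_val)
    before = accuracy(tree, X_val, y_val)
    saved = (node.feature, node.children)
    node.label, node.feature, node.children = node.majority, None, {}
    if accuracy(tree, X_val, y_val) < before:
        node.label = None
        node.feature, node.children = saved
```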

Top-down method: Cost complexity pruning (a code sketch follows this list)

  • Generate a series of trees $T_0, \ldots, T_m$, where $T_0$ is the initial tree and $T_m$ is the root alone.

  • At step $i$, create $T_i$ by removing a subtree from $T_{i-1}$.

    • Replace it with a leaf node whose value is chosen as in the tree-building algorithm (e.g. the majority class)

    • The subtree to remove is chosen as follows:

      • Define the error rate of tree T over dataset S as err(T, S)

      • Enumerate all subtrees of $T_{i-1}$ and select the subtree $t$ that minimizes $\frac{err(prune(T,t),\,S) - err(T,\,S)}{|leaves(T)| - |leaves(prune(T,t))|}$

      • prune(T, t) denotes the tree obtained by pruning the subtree t from the tree T
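scikit-learn ships a closely related minimal cost-complexity pruning; a sketch of picking the pruning strength alpha on held-out data (the dataset and split here are illustrative assumptions, not from the notes above):

```python
# Sketch: cost-complexity pruning via scikit-learn's ccp_alpha (illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)

# Each candidate alpha corresponds to pruning away one more "weakest" subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_score, best_alpha = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
        .score(X_test, y_test), a)
    for a in path.ccp_alphas
)
print("best accuracy %.3f at ccp_alpha=%.4f" % (best_score, best_alpha))
```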

Pros & Cons

  • Pros:

    • Simple to understand and interpret

    • Handles both categorical and numerical data

    • Handles missing values relatively easily

    • Handles uncorrelated features easily

  • Cons:

    • Prone to overfitting

    • Neglects relationships and correlations between features (each split considers one feature at a time)

    • Different node-splitting criteria lead to different biases:

      • Information gain is biased towards features with many distinct values (used in ID3)

      • Information gain ratio corrects for this but is biased towards features with fewer distinct values (used in C4.5)
