GBDT

GBDT stands for Gradient Boosting Decision Tree.

1. Basic idea

The idea is similar to an AdaBoost tree: it trains a series of weak learners sequentially, optimizing each learner so that the ensemble achieves the best performance.

What makes GBDT special is that, except for the first step, it uses the negative gradients of the loss at each point under the current model as the target values for the next learner. The reason is that the gradient tells the model in which direction each prediction should be corrected. A minimal sketch of this loop is shown below.
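
To make the loop concrete, here is a minimal sketch (not the full algorithm: it assumes squared-error loss, so the negative gradients are simply residuals, and it uses a hard-coded stump as the weak learner):

```python
import numpy as np

# Toy data: y is roughly 2x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.05, size=x.shape)

# Step 0: start from a constant model (the mean minimizes squared error)
pred = np.full_like(y, y.mean())

learning_rate = 0.5
for m in range(3):                       # a few boosting rounds
    residual = y - pred                  # negative gradient of 1/2 * (y - f)^2
    # "Weak learner": a stump predicting the mean residual on each half of x
    left = x < 0.5
    stump = np.where(left, residual[left].mean(), residual[~left].mean())
    pred = pred + learning_rate * stump  # move the model along the correction direction
    print(f"round {m}: MSE = {np.mean((y - pred) ** 2):.4f}")
```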

2. Key points in GBDT

  • Boosting

    • The basic idea of boosting is to combine weak learners into a strong learner. It trains weak learners sequentially, each one trying to correct its predecessor.

    • Since each iteration corrects the bias of the whole model, boosting tends to produce a low-bias, high-variance model.

  • Gradient vs. Residual

    • In AdaBoost, each iteration increases the weights of the incorrectly classified observations before training the next weak learner.

    • In GBDT, except for the first iteration, each new tree is fit to the negative gradients of the loss function:

    $$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$

    • Gradient descent to minimize the loss function

      • Traditionally, gradient descent updates a set of parameters to minimize the error.

      • In GBDT, gradient descent works in function space: each new tree is parameterized to take a step in the negative-gradient direction, so the loss keeps decreasing.

    • If the loss function of the GBDT is the squared error, $L(y_i, f(x_i)) = \frac{1}{2}(y_i - f(x_i))^2$, its negative gradient equals the residual: $r_{mi} = y_i - f(x_i)$ (a numerical check of this appears at the end of this section).

  • Loss functions:

    • The loss function depends on the type of problem:

      • Regression problem: Squared Error

      • Binary classification: log-loss, $L(y, f(x)) = \log(1 + \exp(-y f(x)))$

      • Multi-class classification: $L(y, f(x)) = -\sum_{k=1}^K y_k \log p_k(x)$

        • When the true class is $k$, $y_k = 1$; the probability of class $k$, $p_k(x)$, is:

        • $p_k(x) = \exp(f_k(x)) / \sum_{l=1}^K \exp(f_l(x))$

      • For binary classification, the predicted probability is $\pi_i = \mathrm{sigmoid}(2 f(x_i))$, with $y_i \in \{-1, 1\}$.

  • CART (Classification and Regression Tree)

    • CART is the basic weak learner in GBDT. It can be used for both classification and regression.

      • For classification, it uses the Gini index to choose splits (see the sketch after this list)

      • For regression, it uses mean squared error to choose splits
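
For reference, a toy sketch of how a classification split is scored with the Gini index (the `gini` and `split_gini` helpers are just for this illustration; a real CART implementation searches over all features and thresholds, and for regression it would score splits by mean squared error instead):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(labels, mask):
    """Weighted Gini impurity after splitting the node by a boolean mask."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    return len(left) / n * gini(left) + len(right) / n * gini(right)

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y))                 # 0.5 before splitting
print(split_gini(y, x <= 3))   # 0.0: x <= 3 separates the classes perfectly
```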
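
And the numerical check promised above: a small sketch (helper names such as `neg_grad_log_loss` are only for this illustration) showing that the negative gradient of the squared error is the residual, and what the pseudo-residuals and class probabilities look like for the log-loss and softmax formulas:

```python
import numpy as np

def neg_grad_squared_error(y, f):
    """-dL/df for L = 1/2 * (y - f)^2, i.e. the residual y - f."""
    return y - f

def neg_grad_log_loss(y, f):
    """-dL/df for L = log(1 + exp(-y*f)) with y in {-1, +1}."""
    return y / (1.0 + np.exp(y * f))

def softmax_probs(f):
    """p_k(x) = exp(f_k(x)) / sum_l exp(f_l(x)) for multi-class GBDT."""
    e = np.exp(f - f.max())              # subtract the max for numerical stability
    return e / e.sum()

print(neg_grad_squared_error(np.array([3.0, -1.0]), np.array([2.5, 0.0])))  # [ 0.5 -1. ]  == residuals
print(neg_grad_log_loss(np.array([1.0, -1.0]), np.array([0.2, 0.7])))       # pseudo-residuals for log-loss
print(softmax_probs(np.array([1.0, 2.0, 0.5])))                             # probabilities sum to 1
```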

3. Steps

  • Initialize the model

    • Find a constant $c$ that minimizes the loss:

    $$f_0(x) = \arg\min_c \sum_{i=1}^N L(y_i, c)$$

  • For the 1st to the Mth tree, $m = 1, 2, ..., M$:

    • For each observation $i = 1, 2, ..., N$, calculate the negative gradient:

      • $r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$

    • Set the gradients $r_{mi}$ as the new targets, i.e. train on $(x_i, r_{mi})$, and obtain a new CART tree $f_m(x)$

      • $R_{mj}$ is the region of leaf $j$ in the new tree, $j = 1, 2, ..., J$

      • $J$ is the number of leaves in the CART tree

      • For each leaf, the optimal output value minimizes the loss over the observations that fall in it (for squared error this reduces to the mean of the residuals in the leaf):

        • $r_{mj} = \arg\min_{r} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + r)$

  • Add up the weak learners

    • $f_m(x) = f_{m-1}(x) + \rho \sum_{j=1}^J r_{mj} \, I(x \in R_{mj})$

    • $\rho$ is the learning rate (a from-scratch sketch of these steps follows this list)
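
Putting these steps together, a compact from-scratch sketch for squared-error regression (it assumes scikit-learn's DecisionTreeRegressor as the CART weak learner; with squared error the pseudo-targets and the optimal leaf values both reduce to residual means, so the explicit leaf-optimization step is absorbed into the tree fit):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for regression with squared-error loss."""
    f0 = y.mean()                      # f_0: the constant minimizing the sum of squared errors
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        r = y - pred                   # negative gradient = residual for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                 # fit the next CART tree to the pseudo-targets
        pred += learning_rate * tree.predict(X)   # f_m = f_{m-1} + rho * new tree
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Usage on toy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
f0, trees = gbdt_fit(X, y)
print("train MSE:", np.mean((y - gbdt_predict(X, f0, trees)) ** 2))
```

For other losses (e.g. log-loss), the leaf values would be re-optimized with the $\arg\min$ formula above rather than taken directly from the fitted tree.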

4. Pros and Cons

  • Pros:

    • Handles both continuous and discrete feature values

    • Requires relatively little time for parameter tuning

    • With a robust loss function, the model is resistant to outliers

  • Cons:

    • Training is sequential, which makes it hard to parallelize

5. Improve the basic GBDT

  • Tree Constraints

  • Stochastic Gradient Boosting

  • Penalized Gradient Boosting
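
As a rough illustration of how these variants show up in practice, the sketch below maps them to parameters of scikit-learn's GradientBoostingRegressor (penalized boosting in the sense of explicit L1/L2 regularization on leaf weights is more a feature of libraries such as XGBoost and LightGBM):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=200,      # number of boosting rounds (trees)
    learning_rate=0.05,    # shrinkage rho; smaller values usually need more trees
    max_depth=3,           # tree constraint: limit the depth of each weak learner
    min_samples_leaf=5,    # tree constraint: minimum observations per leaf
    subsample=0.8,         # stochastic gradient boosting: each tree sees 80% of the rows
)
# model.fit(X_train, y_train) would then train it like any other scikit-learn estimator
```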

https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
