GBDT

GBDT stands for Gradient Boosting Decision Tree.

1. Basic idea

The idea is similar to an AdaBoost tree: it trains a series of weak learners sequentially, optimizing each learner so that the ensemble achieves the best performance.

What makes GBDT special is that, except for the first step, it uses the negative gradients of the loss at each point under the current model as the target values for the next learner. The reason is that the gradient tells the model in which direction each prediction should be corrected. A minimal sketch of this loop is shown below.
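
To make the loop concrete, here is a minimal sketch (not the full algorithm: it assumes squared-error loss, so the negative gradients are simply residuals, and it uses a hard-coded stump as the weak learner):

```python
import numpy as np

# Toy data: y is roughly 2x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.05, size=x.shape)

# Step 0: start from a constant model (the mean minimizes squared error)
pred = np.full_like(y, y.mean())

learning_rate = 0.5
for m in range(3):                       # a few boosting rounds
    residual = y - pred                  # negative gradient of 1/2 * (y - f)^2
    # "Weak learner": a stump predicting the mean residual on each half of x
    left = x < 0.5
    stump = np.where(left, residual[left].mean(), residual[~left].mean())
    pred = pred + learning_rate * stump  # move the model along the correction direction
    print(f"round {m}: MSE = {np.mean((y - pred) ** 2):.4f}")
```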

2. Key points in GBDT

  • Boosting

    • The basic idea of boosting is to combine weak learners into a strong learner. It trains weak learners sequentially, each one trying to correct its predecessor.

    • Since each iteration corrects the bias of the whole model, boosting tends to produce a low-bias, high-variance model.

  • Gradient vs. Residual

    • In AdaBoost, each iteration increases the weights of the incorrectly classified observations before training the next weak learner.

    • In GBDT, except for the first iteration, each new tree is fit to the negative gradients of the loss function:

    $$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$

    • Gradient descent to minimize the loss function

      • Traditionally, gradient descent updates a set of parameters to minimize the error.

      • In GBDT, gradient descent works in function space: each new tree is parameterized to take a step in the negative-gradient direction, so the loss keeps decreasing.

    • If the loss function of the GBDT is the squared error, $L(y_i, f(x_i)) = \frac{1}{2}(y_i - f(x_i))^2$, its negative gradient equals the residual: $r_{mi} = y_i - f(x_i)$ (a numerical check of this appears at the end of this section).

  • Loss functions:

    • The loss function depends on the type of problem:

      • Regression problem: Squared Error

      • Binary classification: log-loss, $L(y, f(x)) = \log(1 + \exp(-y f(x)))$

      • Multi-class classification: $L(y, f(x)) = -\sum_{k=1}^K y_k \log p_k(x)$

        • When the true class is $k$, $y_k = 1$; the probability of class $k$, $p_k(x)$, is:

        • $p_k(x) = \exp(f_k(x)) / \sum_{l=1}^K \exp(f_l(x))$

      • For binary classification, the predicted probability is $\pi_i = \mathrm{sigmoid}(2 f(x_i))$, with $y_i \in \{-1, 1\}$.

  • CART (Classification and Regression Tree)

    • CART is the basic weak learner in GBDT. It can be used for both classification and regression.

      • For classification, it uses the Gini index to choose splits (see the sketch after this list)

      • For regression, it uses mean squared error to choose splits
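
For reference, a toy sketch of how a classification split is scored with the Gini index (the `gini` and `split_gini` helpers are just for this illustration; a real CART implementation searches over all features and thresholds, and for regression it would score splits by mean squared error instead):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(labels, mask):
    """Weighted Gini impurity after splitting the node by a boolean mask."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    return len(left) / n * gini(left) + len(right) / n * gini(right)

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y))                 # 0.5 before splitting
print(split_gini(y, x <= 3))   # 0.0: x <= 3 separates the classes perfectly
```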
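
And the numerical check promised above: a small sketch (helper names such as `neg_grad_log_loss` are only for this illustration) showing that the negative gradient of the squared error is the residual, and what the pseudo-residuals and class probabilities look like for the log-loss and softmax formulas:

```python
import numpy as np

def neg_grad_squared_error(y, f):
    """-dL/df for L = 1/2 * (y - f)^2, i.e. the residual y - f."""
    return y - f

def neg_grad_log_loss(y, f):
    """-dL/df for L = log(1 + exp(-y*f)) with y in {-1, +1}."""
    return y / (1.0 + np.exp(y * f))

def softmax_probs(f):
    """p_k(x) = exp(f_k(x)) / sum_l exp(f_l(x)) for multi-class GBDT."""
    e = np.exp(f - f.max())              # subtract the max for numerical stability
    return e / e.sum()

print(neg_grad_squared_error(np.array([3.0, -1.0]), np.array([2.5, 0.0])))  # [ 0.5 -1. ]  == residuals
print(neg_grad_log_loss(np.array([1.0, -1.0]), np.array([0.2, 0.7])))       # pseudo-residuals for log-loss
print(softmax_probs(np.array([1.0, 2.0, 0.5])))                             # probabilities sum to 1
```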

3. Steps

  • Initialize the model

    • Find a constant $c$ that minimizes the loss:

    $$f_0(x) = \arg\min_c \sum_{i=1}^N L(y_i, c)$$

  • For the 1st to the Mth tree, $m = 1, 2, ..., M$:

    • For each observation $i = 1, 2, ..., N$, calculate the negative gradient:

      • $r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$

    • Set the gradients $r_{mi}$ as the new targets, i.e. train on $(x_i, r_{mi})$, and obtain a new CART tree $f_m(x)$

      • $R_{mj}$ is the region of leaf $j$ in the new tree, $j = 1, 2, ..., J$

      • $J$ is the number of leaves in the CART tree

      • For each leaf, the optimal output value minimizes the loss over the observations that fall in it (for squared error this reduces to the mean of the residuals in the leaf):

        • $r_{mj} = \arg\min_{r} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + r)$

  • Add up the weak learners

    • $f_m(x) = f_{m-1}(x) + \rho \sum_{j=1}^J r_{mj} \, I(x \in R_{mj})$

    • $\rho$ is the learning rate (a from-scratch sketch of these steps follows this list)
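
Putting these steps together, a compact from-scratch sketch for squared-error regression (it assumes scikit-learn's DecisionTreeRegressor as the CART weak learner; with squared error the pseudo-targets and the optimal leaf values both reduce to residual means, so the explicit leaf-optimization step is absorbed into the tree fit):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for regression with squared-error loss."""
    f0 = y.mean()                      # f_0: the constant minimizing the sum of squared errors
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        r = y - pred                   # negative gradient = residual for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                 # fit the next CART tree to the pseudo-targets
        pred += learning_rate * tree.predict(X)   # f_m = f_{m-1} + rho * new tree
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Usage on toy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
f0, trees = gbdt_fit(X, y)
print("train MSE:", np.mean((y - gbdt_predict(X, f0, trees)) ** 2))
```

For other losses (e.g. log-loss), the leaf values would be re-optimized with the $\arg\min$ formula above rather than taken directly from the fitted tree.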

4. Pros and Cons

  • Pros:

    • Handles both continuous and discrete feature values

    • Requires relatively little time for parameter tuning

    • With a robust loss function, the model is resistant to outliers

  • Cons:

    • Training is sequential, which makes it hard to parallelize

5. Improve the basic GBDT

  • Tree Constraints

  • Stochastic Gradient Boosting

  • Penalized Gradient Boosting
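
As a rough illustration of how these variants show up in practice, the sketch below maps them to parameters of scikit-learn's GradientBoostingRegressor (penalized boosting in the sense of explicit L1/L2 regularization on leaf weights is more a feature of libraries such as XGBoost and LightGBM):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=200,      # number of boosting rounds (trees)
    learning_rate=0.05,    # shrinkage rho; smaller values usually need more trees
    max_depth=3,           # tree constraint: limit the depth of each weak learner
    min_samples_leaf=5,    # tree constraint: minimum observations per leaf
    subsample=0.8,         # stochastic gradient boosting: each tree sees 80% of the rows
)
# model.fit(X_train, y_train) would then train it like any other scikit-learn estimator
```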

https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
