GBDT
GBDT stands for Gradient Boosted Decision Trees.
1. Basic idea
The idea is similar to an AdaBoost tree: it trains a series of weak learners sequentially, optimizing the parameters of each learner to obtain the best overall performance.
What makes GBDT special is that, after the first step, it uses the negative gradient of the loss at each point under the current model as the target value for the next learner. The gradient indicates the direction in which each prediction should be corrected.
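In symbols (standard notation, not taken verbatim from these notes), the final model is an additive combination of the weak learners:

$$F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} f_m(x)$$

where each tree $f_m$ is fit to the negative gradients of the loss evaluated at the previous model $F_{m-1}$, and $\nu$ is the learning rate.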
2. Key points in GBDT
Boosting
The basic idea of boosting is to convert weak learners into a strong learner. It trains weak learners sequentially, each one trying to correct its predecessor.
Since each iteration corrects the bias of the whole ensemble, boosting produces a low-bias, relatively high-variance model.
Gradient vs. Residual
In AdaBoost, the weak learners change the weights of the incorrectly classified observations at every iteration.
In GBDT, after the first iteration, each new tree is fit to the negative gradients of the loss function.
Using the gradient to minimize the loss function
Traditionally, gradient descent updates a set of parameters in order to minimize the error.
In GBDT, gradient descent works in function space: instead of updating parameters directly, each step adds a tree whose predictions move the model in the direction that decreases the loss.
If the loss function of the GBDT is the squared error, $L(y, F(x)) = \frac{1}{2}\bigl(y - F(x)\bigr)^2$, its negative gradient equals the residual: $-\frac{\partial L}{\partial F(x)} = y - F(x)$.
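As a quick numeric sanity check of this fact (an illustrative snippet of mine, not part of the original notes):

```python
def squared_error(y, f):
    # L = 0.5 * (y - f)^2
    return 0.5 * (y - f) ** 2

y, f, eps = 3.0, 2.2, 1e-6
# Finite-difference estimate of dL/df, which should equal -(y - f)
grad = (squared_error(y, f + eps) - squared_error(y, f - eps)) / (2 * eps)
print(-grad)   # negative gradient, ≈ 0.8
print(y - f)   # residual, ≈ 0.8 -- matches the negative gradient
```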
Loss functions:
The loss function depends on the type of problem:
Regression: squared error, $L = \frac{1}{2}\bigl(y - F(x)\bigr)^2$
Binary classification: log-loss, $L = -\bigl[y \log p + (1 - y)\log(1 - p)\bigr]$, where $p$ is the predicted probability of the positive class
Multiclass classification: cross-entropy. With one score function $F_k(x)$ per class and a one-hot target ($y_k = 1$ when the output class is $k$, otherwise $0$), the probability of class $k$ is given by the softmax $p_k(x) = \frac{e^{F_k(x)}}{\sum_{l=1}^{K} e^{F_l(x)}}$, and the loss is $L = -\sum_{k=1}^{K} y_k \log p_k(x)$
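A small sketch (my own illustration in NumPy, not from the original notes) of the negative gradients of these losses, i.e. the values each new tree is trained to predict:

```python
import numpy as np

def squared_error_neg_grad(y, f):
    """Regression: L = 0.5*(y - f)^2, negative gradient = residual."""
    return y - f

def log_loss_neg_grad(y, f):
    """Binary classification with y in {0, 1} and raw score f (log-odds):
    p = sigmoid(f), negative gradient of the log-loss = y - p."""
    p = 1.0 / (1.0 + np.exp(-f))
    return y - p

def softmax_neg_grad(Y, F):
    """Multiclass: Y is one-hot (n, K), F holds one raw score per class (n, K).
    p = softmax(F), negative gradient of the cross-entropy = Y - p."""
    e = np.exp(F - F.max(axis=1, keepdims=True))  # numerically stabilized softmax
    p = e / e.sum(axis=1, keepdims=True)
    return Y - p
```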
CART (Classification and Regression Tree)
CART is the basic weak learner of GBDT. CART itself can be used for both classification and regression; note that GBDT fits regression trees even for classification problems, because the targets (the gradients) are continuous values.
For classification, it uses the Gini index to choose splits
For regression, it uses mean squared error to choose splits
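A rough sketch (illustrative only; a real CART implementation also enumerates the candidate split points) of how a single split is scored under each criterion:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (classification splits)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def mse(values):
    """Mean squared error around the mean (regression splits)."""
    return np.mean((values - values.mean()) ** 2)

def split_score(y_left, y_right, impurity):
    """Weighted impurity of a split; CART picks the split with the lowest score."""
    n_l, n_r = len(y_left), len(y_right)
    n = n_l + n_r
    return (n_l / n) * impurity(y_left) + (n_r / n) * impurity(y_right)

# Example: a classification split scored with Gini, a regression split with MSE.
print(split_score(np.array([0, 0, 1]), np.array([1, 1, 1]), gini))
print(split_score(np.array([1.0, 1.2]), np.array([3.0, 3.4, 2.9]), mse))
```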
3. Steps
Initialize the model
Find a constant value that minimizes the loss over all observations: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$
For the 1st to the Mth tree:
For each observation $i$, calculate the negative gradient (pseudo-residual): $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$
Set the gradients $r_{im}$ as the new targets => train on $(x_i, r_{im})$ and get a new CART tree with leaf regions $R_{jm},\ j = 1, \ldots, J_m$
$R_{jm}$ is the region of leaf $j$ in tree $m$
$J_m$ is the count of leaves in the CART tree
For each leaf, the optimal value is $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr)$; for squared error this is just the mean of the pseudo-residuals in that leaf at this iteration
Add up the weak learners: $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$
$\nu$ is the learning rate
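A minimal from-scratch sketch of these steps for squared-error loss (my own illustration; it assumes scikit-learn's DecisionTreeRegressor as the CART weak learner, which is not named in the original notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for squared-error loss (negative gradient = residual)."""
    # Step 1: initialize with the constant that minimizes squared error -- the mean.
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        # Step 2a: pseudo-residuals = negative gradients of the loss at the current model.
        residuals = y - pred
        # Step 2b: fit a CART regression tree to the pseudo-residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Steps 2c/2d: for squared error, each leaf already predicts the mean residual,
        # so the update is simply the (shrunken) tree prediction.
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
f0, trees = gbdt_fit(X, y)
print(np.mean((gbdt_predict(X, f0, trees) - y) ** 2))  # training MSE
```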
4. Pros and Cons
Pros:
Handles both continuous and discrete values
Requires relatively little time for parameter tuning
A robust loss function (e.g., the Huber loss) makes the model robust to abnormal data (outliers)
Cons:
Training is sequential, which makes parallel computing difficult
5. Improving the basic GBDT
Tree Constraints
Stochastic Gradient Boosting
Penalized Gradient Boosting
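Roughly how these three ideas show up as hyperparameters in practice, illustrated here with scikit-learn's GradientBoostingRegressor (the mapping and the example values are my own illustration, not from the original notes):

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=200,      # number of boosting iterations
    learning_rate=0.05,    # shrinkage: a simple form of regularization
    max_depth=3,           # tree constraint: limit tree depth
    min_samples_leaf=10,   # tree constraint: minimum observations per leaf
    subsample=0.8,         # stochastic gradient boosting: fit each tree on 80% of rows
)
# Explicit L1/L2 penalties on leaf weights ("penalized gradient boosting") are not
# exposed here; libraries such as XGBoost offer them as reg_alpha / reg_lambda.
```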
https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/