Random Forest
Random Forest is a bagging ensemble model built from decision trees.
1. Bagging vs. Boosting
Bagging is used to decrease the variance of the prediction (see the sketch after this list):
Subsample from the training data (with replacement)
Train a different model on each subsample
Combining the models:
For classification: majority voting
For regression: averaging
Pros:
Lower variance
Increased accuracy
Helps unstable learners (e.g., deep decision trees)
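A minimal sketch of this bagging loop, assuming NumPy arrays, integer class labels, and sklearn decision trees as the base learners (function names here are illustrative, not a standard API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    """Train n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: majority vote over the models' predictions."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```

For regression, bagging_predict would return preds.mean(axis=0) instead of the vote.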
Boosting combines the outputs of several different models trained sequentially:
Train the models in stages
At each stage, focus on the samples that the previous learners got wrong (see the AdaBoost-style sketch after this list)
Pros:
Focuses new learners on the samples that earlier learners misclassified
Trains learners sequentially
Converts weak learners into a strong classifier
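A minimal AdaBoost-style sketch of this reweighting idea, assuming labels in {-1, +1} and depth-1 sklearn trees (stumps) as the weak learners:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """y must contain labels in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)           # start with uniform sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()      # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)  # upweight the misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    scores = sum(a * m.predict(X) for m, a in zip(learners, alphas))
    return np.sign(scores)
```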
2. Idea of Random Forest
Using the bagging idea, Random Forest builds many decision trees, each on a random subset of features and a random subset of instances.
This de-correlates the trees:
It lets less predictive features enter the model. Suppose there is one very predictive feature and a couple of moderately predictive ones:
Without subsampling features, the trees would be very similar and highly correlated, because most of them would put the very predictive feature in their top split.
Averaging many highly correlated results does not reduce variance nearly as much as averaging uncorrelated results.
The more trees in the forest, the better the performance (it plateaus rather than overfits); the only practical limit is computational cost.
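In practice, this per-split feature subsampling is exposed as a single parameter; a sketch with sklearn's RandomForestClassifier (the dataset here is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           random_state=0)

# max_features="sqrt": each split considers only sqrt(20) ~ 4 random features,
# so trees that would otherwise all split on the strongest feature diverge.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
print(rf.score(X, y))
```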
3. Steps:
Initialize the model by setting the number of trees, the subsampling ratio of features, the subsampling ratio of instances, the minimum samples per leaf, and the maximum depth of a tree
Training:
For each tree in the forest:
X_sub, Y_sub = sample_with_replacement(X, Y)
model = DecisionTree()
While not at a terminal node and not at max_depth:
d = subsample_features(all_features)
Choose the best split among the d features by information gain
Split the instances on the best split point and continue
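A runnable sketch of this training loop, assuming sklearn's DecisionTreeClassifier handles the per-split feature subsampling via max_features (the parameter names mirror the initialization step above and are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_fit(X, Y, n_trees=100, instance_ratio=1.0,
               feature_subset="sqrt", min_samples_leaf=1, max_depth=None,
               seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        # bootstrap: subsample instances with replacement
        idx = rng.integers(0, n, size=int(instance_ratio * n))
        tree = DecisionTreeClassifier(max_features=feature_subset,
                                      min_samples_leaf=min_samples_leaf,
                                      max_depth=max_depth,
                                      criterion="entropy")  # information gain
        forest.append(tree.fit(X[idx], Y[idx]))
    return forest
```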
4. Key points
Control the depth of each tree by:
Minimum samples in leaves
Maximum depth of a tree
Always cross-validate these parameters (see the sketch after this list)
Recommended size of the feature subset, with p total features:
Classification: sqrt(p)
Regression: p/3
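A cross-validation sketch with sklearn's GridSearchCV; the parameter grid below is illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = {"max_depth": [5, 10, None],             # maximum depth of a tree
        "min_samples_leaf": [1, 5, 10],         # minimum samples in leaves
        "max_features": ["sqrt", "log2", 0.5]}  # feature-subset size
search = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                      grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```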
Random Forest can also be used to estimate feature importance:
Look at how much the tree nodes use each feature to split the data (e.g., the total impurity decrease it contributes)
With the importance scores, decide which features to drop (see the sketch below)
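A sketch of reading sklearn's impurity-based importances from a fitted forest (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ sums each feature's impurity decrease across all trees
ranked = np.argsort(rf.feature_importances_)[::-1]
print(ranked)  # feature indices, most to least important; drop from the tail
```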