Random Forest

Random forest is a bagging ensemble model built from decision trees.

1. Bagging vs. Boosting

  • Bagging is used to decrease the variance of the prediction (a code sketch contrasting bagging and boosting follows this list)

    • Subsample from the training data (with replacement)

    • Train different models with the subsampled data

    • Tasks:

      • For classification: Voting

      • For regression: Averaging

    • Pros:

      • Lower variance

      • Improved accuracy

      • Helps unstable learners (e.g., decision trees)

  • Boosting combines the outputs of several different models

    • Train the models in stages

    • At each following stage, focus on the samples that the previous models got wrong

    • Pros:

      • Focuses new learners on the samples that earlier learners got wrong

      • Trains learners sequentially

      • Combines weak learners into a strong classifier
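As a quick illustration, below is a minimal sketch contrasting the two ideas on the same base learner with scikit-learn (assuming scikit-learn >= 1.2; the dataset and parameter values are purely illustrative):

```python
# Bagging vs. boosting with decision trees (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: trees are trained independently on bootstrap subsamples, then vote.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # unstable, high-variance learner
    n_estimators=100,
    max_samples=0.8,                      # subsample instances for each tree
    random_state=0,
)

# Boosting: weak learners (stumps) are trained sequentially, each one
# re-weighting the samples the previous ones got wrong.
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)

print("bagging  accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```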

2. Idea of Random Forest

Using the bagging idea, Random Forest builds a collection of decision trees by randomly selecting a subset of instances for each tree and a random subset of features at each split.

  • This helps de-correlate the trees

    • It gives less predictive features a chance to appear in the model. Suppose there is one very predictive feature and a couple of moderately predictive features.

    • Without subsampling features, most of the trees would use that very predictive feature in the top split, so the trees would be very similar and highly correlated.

    • Averaging many highly correlated results does not reduce variance nearly as much as averaging uncorrelated results (see the identity after this list).

  • The more trees in the Random Forest, the better the performance; the only limit is the computational cost.
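A standard identity (added here for context, not from the original notes) makes the correlation point precise: for n identically distributed trees with variance \sigma^2 and pairwise correlation \rho, the variance of their average is

```latex
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{f}_i(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{n}\,\sigma^{2}
```

Adding more trees only shrinks the second term; the first term \rho\sigma^2 remains, so reducing the correlation \rho between trees (which is what the random feature subsampling does) is essential for the averaging to keep paying off.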

3. Steps:

  • Initialize the model by setting the number of trees, the number of features to subsample at each split, the subsampling ratio of instances, the minimum samples per leaf, and the maximum depth of a tree (a runnable sketch of these steps follows the list)

  • Training:

    • For the i-th tree in the forest:

      • (X_i, Y_i) = sample_with_replacement(X, Y)

      • Model = DecisionTree()

        • While the current node is not terminal and max_depth has not been reached:

          • Select d features randomly: (X_{id}, Y_{id}) = subsample_features(X_i, Y_i)

          • Choose the best split among the d features by information gain

          • Split the instances based on the best split point and continue
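The steps above can be collected into a short, self-contained sketch. This is an illustrative implementation rather than the canonical one: it reuses scikit-learn's DecisionTreeClassifier with max_features to get the per-split feature subsampling and criterion="entropy" for information gain, and it assumes integer class labels for the majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SimpleRandomForest:
    """Bagged decision trees with per-split feature subsampling (illustrative)."""

    def __init__(self, n_trees=100, max_features="sqrt",
                 min_samples_leaf=1, max_depth=None, random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features          # d features considered per split
        self.min_samples_leaf = min_samples_leaf
        self.max_depth = max_depth
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n = X.shape[0]
        self.trees_ = []
        for i in range(self.n_trees):
            # (X_i, Y_i) = sample_with_replacement(X, Y)
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier(
                criterion="entropy",              # split by information gain
                max_features=self.max_features,   # random feature subset per split
                min_samples_leaf=self.min_samples_leaf,
                max_depth=self.max_depth,
                random_state=i,
            )
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        # Majority vote across trees (assumes integer class labels).
        votes = np.stack([t.predict(X) for t in self.trees_]).astype(int)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```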

4. Key points

  • Control depth of each tree by

    • Minimum samples in leaves

    • Maximum depth of a tree

  • Always cross-validate the parameters (see the sketch after this list)

  • Recommended size of the feature subset d (out of D features in total):

    • Classification: d = floor(sqrt(D)) or 1

    • Regression: d = floor(D/3) or 5

  • Random Forest can be used to estimate feature importance:

    • Look at how much the tree nodes use each feature for splitting

    • Based on the feature importance, decide which features to drop
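Both of these key points can be sketched with scikit-learn's RandomForestClassifier; the parameter grid below is illustrative, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Cross-validate the parameters that control tree depth and feature subsampling.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", 0.3],   # ~floor(sqrt(D)) vs. a fixed fraction of D
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)
print("best params:", search.best_params_)

# Impurity-based importances: low-ranked features are candidates to drop.
importances = search.best_estimator_.feature_importances_
print("features, most to least important:", np.argsort(importances)[::-1])
```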
