Random Forest

Random forest is a bagging ensemble model built from decision trees.

1. Bagging vs. Boosting

  • Bagging is used to decrease the variance of the prediction (a code sketch contrasting bagging and boosting follows this list)

    • Subsample from the training data (with replacement)

    • Train different models with the subsampled data

    • Tasks:

      • For classification: Voting

      • For regression: Averaging

    • Pros:

      • Lower variance

      • Improved accuracy

      • Helps unstable learners (e.g., decision trees)

  • Boosting combines the outputs of several different models

    • Train the models in stages

    • At each following stage, focus on the samples that the previous models got wrong

    • Pros:

      • Focuses new learners on the samples that earlier learners got wrong

      • Trains learners sequentially

      • Combines weak learners into a strong classifier
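As a quick illustration, below is a minimal sketch contrasting the two ideas on the same base learner with scikit-learn (assuming scikit-learn >= 1.2; the dataset and parameter values are purely illustrative):

```python
# Bagging vs. boosting with decision trees (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: trees are trained independently on bootstrap subsamples, then vote.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # unstable, high-variance learner
    n_estimators=100,
    max_samples=0.8,                      # subsample instances for each tree
    random_state=0,
)

# Boosting: weak learners (stumps) are trained sequentially, each one
# re-weighting the samples the previous ones got wrong.
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)

print("bagging  accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```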

2. Idea of Random Forest

Using the bagging idea, Random Forest builds a collection of decision trees by randomly selecting a subset of instances for each tree and a random subset of features at each split.

  • This helps de-correlate the trees

    • It gives less predictive features a chance to appear in the model. Suppose there is one very predictive feature and a couple of moderately predictive features.

    • Without subsampling features, most of the trees would use that very predictive feature in the top split, so the trees would be very similar and highly correlated.

    • Averaging many highly correlated results does not reduce variance nearly as much as averaging uncorrelated results (see the identity after this list).

  • The more trees in the Random Forest, the better the performance; the only limit is the computational cost.
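A standard identity (added here for context, not from the original notes) makes the correlation point precise: for n identically distributed trees with variance \sigma^2 and pairwise correlation \rho, the variance of their average is

```latex
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}\hat{f}_i(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{n}\,\sigma^{2}
```

Adding more trees only shrinks the second term; the first term \rho\sigma^2 remains, so reducing the correlation \rho between trees (which is what the random feature subsampling does) is essential for the averaging to keep paying off.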

3. Steps:

  • Initialize the model by setting the number of trees, the number of features to subsample at each split, the subsampling ratio of instances, the minimum samples per leaf, and the maximum depth of a tree (a runnable sketch of these steps follows the list)

  • Training:

    • For the i-th tree in the forest:

      • (X_i, Y_i) = sample_with_replacement(X, Y)

      • Model = DecisionTree()

        • While the current node is not terminal and max_depth has not been reached:

          • Select d features randomly: (X_{id}, Y_{id}) = subsample_features(X_i, Y_i)

          • Choose the best split among the d features by information gain

          • Split the instances based on the best split point and continue
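The steps above can be collected into a short, self-contained sketch. This is an illustrative implementation rather than the canonical one: it reuses scikit-learn's DecisionTreeClassifier with max_features to get the per-split feature subsampling and criterion="entropy" for information gain, and it assumes integer class labels for the majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SimpleRandomForest:
    """Bagged decision trees with per-split feature subsampling (illustrative)."""

    def __init__(self, n_trees=100, max_features="sqrt",
                 min_samples_leaf=1, max_depth=None, random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features          # d features considered per split
        self.min_samples_leaf = min_samples_leaf
        self.max_depth = max_depth
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n = X.shape[0]
        self.trees_ = []
        for i in range(self.n_trees):
            # (X_i, Y_i) = sample_with_replacement(X, Y)
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier(
                criterion="entropy",              # split by information gain
                max_features=self.max_features,   # random feature subset per split
                min_samples_leaf=self.min_samples_leaf,
                max_depth=self.max_depth,
                random_state=i,
            )
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        # Majority vote across trees (assumes integer class labels).
        votes = np.stack([t.predict(X) for t in self.trees_]).astype(int)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```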

4. Key points

  • Control depth of each tree by

    • Minimum samples in leaves

    • Maximum depth of a tree

  • Always cross-validate the parameters (see the sketch after this list)

  • Recommended size of the feature subset d (out of D features in total):

    • Classification: d = floor(sqrt(D)) or 1

    • Regression: d = floor(D/3) or 5

  • Random Forest can be used to estimate feature importance:

    • Look at how much the tree nodes use each feature for splitting

    • Based on the feature importance, decide which features to drop
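Both of these key points can be sketched with scikit-learn's RandomForestClassifier; the parameter grid below is illustrative, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Cross-validate the parameters that control tree depth and feature subsampling.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", 0.3],   # ~floor(sqrt(D)) vs. a fixed fraction of D
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)
print("best params:", search.best_params_)

# Impurity-based importances: low-ranked features are candidates to drop.
importances = search.best_estimator_.feature_importances_
print("features, most to least important:", np.argsort(importances)[::-1])
```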
