CART (Classification and Regression Tree)

CART is the base learner in XGBoost. It can be used to build (1) a regression tree and (2) a classification tree.

Dealing with categorical and numerical features

  • Categorical

    • If the feature has more than 2 categories:

      • Enumerate all possible binary splitting combinations of the categories (see the sketch after this list)

      • Pick the split with the lowest Gini index

    • If the feature has exactly 2 categories:

      • Split directly

  • Numerical

    • Sort the samples by the value of the numerical feature

    • Consider splitting points between consecutive values:

      • Calculate the Gini index of every possible splitting point

      • Pick the splitting point with the minimum Gini index
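To make the categorical case concrete, here is a minimal Python sketch (the helper name `candidate_categorical_splits` is illustrative, not from any library) that enumerates the binary partitions a CART learner would score; for $K$ categories there are $2^{K-1} - 1$ distinct splits.

```python
from itertools import combinations

def candidate_categorical_splits(categories):
    """Enumerate all binary partitions of a category set.

    For K categories there are 2^(K-1) - 1 distinct splits,
    since a subset and its complement describe the same split.
    """
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Pin the first category to the left side so each partition
    # is generated exactly once (not also as its complement).
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:  # skip the trivial split with an empty side
                splits.append((left, right))
    return splits

# Three categories yield 2^2 - 1 = 3 candidate splits:
# ({'a'}, {'b','c'}), ({'a','b'}, {'c'}), ({'a','c'}, {'b'})
print(candidate_categorical_splits(["a", "b", "c"]))
```

Each candidate partition would then be scored with the Gini index defined below, and the lowest-Gini split kept.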

CART regression tree

Target:

$$\min_{j,s}\left\{\min_{c_1}\sum_{x_i \in R_1}(y_i-c_1)^2 + \min_{c_2}\sum_{x_i \in R_2}(y_i-c_2)^2\right\}$$

where $c_1 = \mathrm{avg}(y_i \mid x_i \in R_1)$ and $c_2 = \mathrm{avg}(y_i \mid x_i \in R_2)$.

The CART regression tree is

$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$

where $c_m = \mathrm{avg}(y_i \mid x_i \in R_m)$.

Steps to generate a CART regression tree (a code sketch follows the steps):

  • Iterate over all features:

    • For each feature, scan all possible splitting points

    • For each splitting point, measure the sum of squared errors

    • --> For each feature: find its best splitting point

    • --> Across all features: find the overall best splitting point

  • Split the samples based on the best splitting point $s$ of feature $j$:

    • Output the two child nodes:

    $$R_1 = \{x \mid x^{(j)} \leq s\} \quad \text{and} \quad R_2 = \{x \mid x^{(j)} > s\}$$

$$c_1 = \frac{1}{|R_1|}\sum_{x_i \in R_1} y_i \quad \text{and} \quad c_2 = \frac{1}{|R_2|}\sum_{x_i \in R_2} y_i$$

  • Finally, the input space is split into $M$ regions $R_1, R_2, \ldots, R_M$, and the CART model is output as $f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$
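The steps above translate almost directly into code. Below is a minimal sketch of the split search, assuming a dense NumPy feature matrix; `best_regression_split` is a hypothetical helper, not a library function.

```python
import numpy as np

def best_regression_split(X, y):
    """Exhaustively search for the (j, s) minimizing the total squared
    error over both child regions; the optimal c1, c2 are the region means."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):                      # iterate over all features
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:   # midpoints between sorted values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Sum of squared errors around each region's mean
            err = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 1.2, 0.9, 5.0])
print(best_regression_split(X, y))  # feature 0, threshold 6.5: splits off the large target
```

Growing the full tree is then a matter of applying this search recursively to $R_1$ and $R_2$ until a stopping criterion (e.g. maximum depth or minimum samples per node) is met; each leaf predicts its mean $c_m$.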

CART classification tree

The target (splitting criterion) is the Gini index:

$$Gini(D) = 1 - \sum_k \left(\frac{|C_k|}{|D|}\right)^2, \qquad Gini(D_i) = 1 - \sum_k \left(\frac{|C_{ik}|}{|D_i|}\right)^2$$

where:
  • $D_i$ is the subset of samples with feature $A = a_i$

  • $C_{ik}$ is the subset of samples in $D_i$ with label $k$

The smaller the Gini index, the purer the node: smaller is better. A candidate split of $D$ on feature $A$ is scored by the weighted sum $Gini(D, A) = \sum_i \frac{|D_i|}{|D|}\,Gini(D_i)$.
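A minimal sketch of the impurity computation itself (plain Python; the `gini` helper is illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "b", "b"]))  # 0.5 -> maximally impure for two classes
print(gini(["a", "a", "a", "a"]))  # 0.0 -> pure node
```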

Steps to generate a CART classification tree (a code sketch follows the steps):

  • Iterate over all features:

    • For each feature, scan all possible splitting points

    • For each splitting point, measure the weighted Gini index $Gini(D, A)$ of the resulting split

    • --> For each feature: find its best splitting point

  • Select the feature and splitting point with the minimum Gini index as the best split

  • Recurse on the subsamples of each child node to grow the subtrees
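Putting these steps together, here is a hedged sketch of the classification split search, reusing the `gini` helper above and scoring each candidate with the weighted sum $\frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$ (numerical features only, for brevity):

```python
import numpy as np

def best_classification_split(X, y):
    """Pick the (feature j, threshold s) minimizing the weighted
    Gini index of the two resulting child nodes.
    Assumes the gini() helper defined earlier is in scope."""
    n = len(y)
    best_j, best_s, best_score = None, None, np.inf
    for j in range(X.shape[1]):                      # iterate over all features
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:   # candidate splitting points
            mask = X[:, j] <= s
            left, right = y[mask], y[~mask]
            score = (len(left) / n) * gini(left.tolist()) \
                  + (len(right) / n) * gini(right.tolist())
            if score < best_score:
                best_j, best_s, best_score = j, s, score
    return best_j, best_s

X = np.array([[2.0], [3.0], [10.0], [11.0]])
y = np.array(["a", "a", "b", "b"])
print(best_classification_split(X, y))  # feature 0, threshold 6.5: a perfect split
```

Recursing on the two children, and stopping when a node is pure or too small, yields the full classification tree; each leaf predicts its majority label.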
