CART (Classification and Regression Tree)

CART is the base learner in XGBoost. It can be used to build (1) a regression tree and (2) a classification tree.

Dealing with categorical and numerical features

  • Categorical

    • If the feature has more than 2 categories:

      • Enumerate all possible binary splitting combinations of the categories (see the sketch after this list)

      • Pick the split with the lowest Gini index

    • If the feature has exactly 2 categories:

      • Split directly

  • Numerical

    • Sort the samples by the value of the numerical feature

    • Consider splitting points between consecutive values:

      • Calculate the Gini index of every possible splitting point

      • Pick the splitting point with the minimum Gini index
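To make the categorical case concrete, here is a minimal Python sketch (the helper name `candidate_categorical_splits` is illustrative, not from any library) that enumerates the binary partitions a CART learner would score; for $K$ categories there are $2^{K-1} - 1$ distinct splits.

```python
from itertools import combinations

def candidate_categorical_splits(categories):
    """Enumerate all binary partitions of a category set.

    For K categories there are 2^(K-1) - 1 distinct splits,
    since a subset and its complement describe the same split.
    """
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Pin the first category to the left side so each partition
    # is generated exactly once (not also as its complement).
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:  # skip the trivial split with an empty side
                splits.append((left, right))
    return splits

# Three categories yield 2^2 - 1 = 3 candidate splits:
# ({'a'}, {'b','c'}), ({'a','b'}, {'c'}), ({'a','c'}, {'b'})
print(candidate_categorical_splits(["a", "b", "c"]))
```

Each candidate partition would then be scored with the Gini index defined below, and the lowest-Gini split kept.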

CART regression tree

Target:

$$\min_{j,s}\left\{\min_{c_1}\sum_{x_i \in R_1}(y_i-c_1)^2 + \min_{c_2}\sum_{x_i \in R_2}(y_i-c_2)^2\right\}$$

where $c_1 = \mathrm{avg}(y_i \mid x_i \in R_1)$ and $c_2 = \mathrm{avg}(y_i \mid x_i \in R_2)$.

The CART regression tree is

$$f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$$

where $c_m = \mathrm{avg}(y_i \mid x_i \in R_m)$.

Steps to generate a CART regression tree (a code sketch follows the steps):

  • Iterate over all features:

    • For each feature, scan all possible splitting points

    • For each splitting point, measure the sum of squared errors

    • --> For each feature: find its best splitting point

    • --> Across all features: find the overall best splitting point

  • Split the samples based on the best splitting point $s$ of feature $j$:

    • Output the two child nodes:

    $$R_1 = \{x \mid x^{(j)} \leq s\} \quad \text{and} \quad R_2 = \{x \mid x^{(j)} > s\}$$

$$c_1 = \frac{1}{|R_1|}\sum_{x_i \in R_1} y_i \quad \text{and} \quad c_2 = \frac{1}{|R_2|}\sum_{x_i \in R_2} y_i$$

  • Finally, the input space is split into $M$ regions $R_1, R_2, \ldots, R_M$, and the CART model is output as $f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$
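The steps above translate almost directly into code. Below is a minimal sketch of the split search, assuming a dense NumPy feature matrix; `best_regression_split` is a hypothetical helper, not a library function.

```python
import numpy as np

def best_regression_split(X, y):
    """Exhaustively search for the (j, s) minimizing the total squared
    error over both child regions; the optimal c1, c2 are the region means."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):                      # iterate over all features
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:   # midpoints between sorted values
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Sum of squared errors around each region's mean
            err = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 1.2, 0.9, 5.0])
print(best_regression_split(X, y))  # feature 0, threshold 6.5: splits off the large target
```

Growing the full tree is then a matter of applying this search recursively to $R_1$ and $R_2$ until a stopping criterion (e.g. maximum depth or minimum samples per node) is met; each leaf predicts its mean $c_m$.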

CART classification tree

The target (splitting criterion) is the Gini index:

$$Gini(D) = 1 - \sum_k \left(\frac{|C_k|}{|D|}\right)^2, \qquad Gini(D_i) = 1 - \sum_k \left(\frac{|C_{ik}|}{|D_i|}\right)^2$$

where:
  • $D_i$ is the subset of samples with feature $A = a_i$

  • $C_{ik}$ is the subset of samples in $D_i$ with label $k$

The smaller the Gini index, the purer the node: smaller is better. A candidate split of $D$ on feature $A$ is scored by the weighted sum $Gini(D, A) = \sum_i \frac{|D_i|}{|D|}\,Gini(D_i)$.
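A minimal sketch of the impurity computation itself (plain Python; the `gini` helper is illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "b", "b"]))  # 0.5 -> maximally impure for two classes
print(gini(["a", "a", "a", "a"]))  # 0.0 -> pure node
```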

Steps to generate a CART classification tree (a code sketch follows the steps):

  • Iterate over all features:

    • For each feature, scan all possible splitting points

    • For each splitting point, measure the weighted Gini index $Gini(D, A)$ of the resulting split

    • --> For each feature: find its best splitting point

  • Select the feature and splitting point with the minimum Gini index as the best split

  • Recurse on the subsamples of each child node to grow the subtrees
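Putting these steps together, here is a hedged sketch of the classification split search, reusing the `gini` helper above and scoring each candidate with the weighted sum $\frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$ (numerical features only, for brevity):

```python
import numpy as np

def best_classification_split(X, y):
    """Pick the (feature j, threshold s) minimizing the weighted
    Gini index of the two resulting child nodes.
    Assumes the gini() helper defined earlier is in scope."""
    n = len(y)
    best_j, best_s, best_score = None, None, np.inf
    for j in range(X.shape[1]):                      # iterate over all features
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:   # candidate splitting points
            mask = X[:, j] <= s
            left, right = y[mask], y[~mask]
            score = (len(left) / n) * gini(left.tolist()) \
                  + (len(right) / n) * gini(right.tolist())
            if score < best_score:
                best_j, best_s, best_score = j, s, score
    return best_j, best_s

X = np.array([[2.0], [3.0], [10.0], [11.0]])
y = np.array(["a", "a", "b", "b"])
print(best_classification_split(X, y))  # feature 0, threshold 6.5: a perfect split
```

Recursing on the two children, and stopping when a node is pure or too small, yields the full classification tree; each leaf predicts its majority label.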
