# XGBoost

## What is XGBoost?

XGBoost (eXtreme Gradient Boosting) is an optimized and scalable implementation of Gradient Boosting designed for speed and performance. It builds on the classic gradient boosting algorithm with several enhancements, making it one of the most widely used machine learning libraries for structured/tabular data.


## Key Differences Between XGBoost and Classic Gradient Boosting

| Feature | Classic Gradient Boosting | XGBoost |
| --- | --- | --- |
| Regularization | Basic or none | L1 (Lasso) and L2 (Ridge) penalties on leaf weights to control overfitting |
| Loss Function | Standard loss functions (e.g., MSE) | Customizable loss functions, optimized via a second-order Taylor approximation |
| Tree Construction | Level-wise growth (splits all nodes at a given depth) | Level-wise by default, with optional loss-guided (leaf-wise) growth under depth constraints |
| Parallelism | Limited | Parallelized tree construction for faster computation |
| Missing Values | Must be imputed | Handled internally by learning a default split direction for missing entries |
| Sparsity Awareness | Not optimized | Handles sparse data efficiently by skipping zero entries |
| Pruning | None or minimal | Post-pruning removes splits that do not reduce the loss enough |
| Performance | Moderate speed and scalability | Highly optimized for speed and memory efficiency, often faster in practice |
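
As a rough illustration of how these features surface in practice, the sketch below configures the xgboost Python package's `XGBRegressor` with the regularization, split-penalty, and histogram options mentioned above; the dataset and hyperparameter values are made-up placeholders, not recommended settings.

```python
import numpy as np
import xgboost as xgb

# Toy regression data with missing values (np.nan); XGBoost handles them
# internally by learning a default direction at each split.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = np.nansum(X, axis=1) + rng.normal(scale=0.1, size=200)

model = xgb.XGBRegressor(
    n_estimators=100,      # number of boosted trees
    learning_rate=0.1,     # eta: shrinks each tree's contribution
    max_depth=4,           # maximum depth of each tree
    reg_alpha=0.0,         # L1 (Lasso) penalty on leaf weights
    reg_lambda=1.0,        # L2 (Ridge) penalty on leaf weights
    gamma=0.1,             # minimum loss reduction required to make a split
    tree_method="hist",    # histogram-based, parallelized split finding
)
model.fit(X, y)
print(model.predict(X[:5]))
```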

## XGBoost Loss Function

XGBoost allows users to define a custom loss function; internally it relies on a second-order Taylor expansion (using both the gradient and the Hessian) to optimize the objective. The general objective in XGBoost consists of two components:

$$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$

1. First term: training loss $l(y_i, \hat{y}_i)$
   - Measures how well the predictions $\hat{y}_i$ match the true labels $y_i$.
   - Common choices:
     - Mean Squared Error (MSE) for regression: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
     - Log loss for binary classification: $l(y_i, \hat{y}_i) = -\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
     - Multiclass log loss for multiclass classification.
2. Second term: regularization $\Omega(h_k)$
   - Adds a penalty for model complexity to avoid overfitting: $\Omega(h_k) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$
     - $T$: number of leaves in the tree.
     - $w_j$: weights of the leaves.
     - $\gamma$: penalizes additional leaves.
     - $\lambda$: penalizes large leaf weights (L2 regularization).
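
As a quick numerical check of the objective above (a toy sketch with made-up leaf weights and hyperparameter values, not library code), the snippet below evaluates the training loss and the regularization penalty for a small regression example:

```python
import numpy as np

# L(Theta) = sum_i (y_i - y_hat_i)^2  +  gamma*T + 0.5*lambda*sum_j w_j^2
y_true = np.array([1.0, 0.5, -0.2, 0.8])
y_pred = np.array([0.9, 0.4, 0.1, 0.7])
training_loss = np.sum((y_true - y_pred) ** 2)

leaf_weights = np.array([0.3, -0.1, 0.2])   # w_j of a hypothetical 3-leaf tree
gamma, lam = 0.1, 1.0                       # complexity penalties (made-up values)
regularization = gamma * len(leaf_weights) + 0.5 * lam * np.sum(leaf_weights ** 2)

print(training_loss, regularization, training_loss + regularization)
```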

## Optimization in XGBoost

XGBoost uses a second-order Taylor approximation to expand the loss function around the current prediction:

$$L(\Theta) \approx \sum_{i=1}^n \left[ g_i h_k(x_i) + \frac{1}{2} h_i h_k(x_i)^2 \right] + \Omega(h_k)$$

- Gradient ($g_i$): the first derivative of the loss function with respect to the prediction.
- Hessian ($h_i$): the second derivative of the loss function with respect to the prediction.

This allows XGBoost to:

  1. Use both gradient and curvature (Hessian) information for more precise optimization.
  2. Efficiently determine splits and optimize leaf weights during tree construction.
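
To make the second point concrete, here is a brief sketch of the standard derivation. Grouping the examples that fall into leaf $j$ and writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the approximated objective becomes a separate quadratic in each leaf weight $w_j$, which can be minimized in closed form:

$$\tilde{L} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \quad\Longrightarrow\quad w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \tilde{L}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

These closed-form expressions are exactly the building blocks of the split gain and leaf-weight formulas discussed further below.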

## Advantages of XGBoost Over Classic Gradient Boosting

  1. Speed: Parallel computation and optimized algorithms for faster training.
  2. Regularization: L1 and L2 regularization reduce overfitting.
  3. Handling Missing Data: Automatically manages missing values during training.
  4. Scalability: Works efficiently with large datasets and sparse data.
  5. Customizability: Allows custom loss functions and objective tuning.
  6. Pruning and Sparsity Awareness: More efficient model structures.

XGBoost has become the go-to algorithm in many data science competitions and practical applications due to these advantages.

In XGBoost, the gradient and Hessian of the loss function are used to update the model efficiently by guiding the optimization process during tree construction. These values provide first- and second-order information about the behavior of the loss function, allowing for more precise updates.

Here's a breakdown of how they are used:


### 1. Gradient and Hessian Computation

For a given loss function $l(y, \hat{y})$, the gradient ($g$) and Hessian ($h$) are computed for each training example:

- Gradient ($g_i$): measures the direction and magnitude of the steepest ascent of the loss function with respect to the model's prediction:

  $$g_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}$$

- Hessian ($h_i$): measures the curvature (second derivative) of the loss function with respect to the model's prediction:

  $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$$
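
These per-example values are exactly what a user supplies when plugging a custom loss into the library. The sketch below uses the xgboost Python package's `obj` callback with the squared-error loss $l(y, \hat{y}) = (y - \hat{y})^2$; the toy dataset is a made-up stand-in.

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Per-example gradient and Hessian of l(y, y_hat) = (y - y_hat)^2."""
    labels = dtrain.get_label()
    grad = 2.0 * (preds - labels)        # g_i = dl / dy_hat
    hess = 2.0 * np.ones_like(preds)     # h_i = d^2 l / dy_hat^2
    return grad, hess

# Toy regression data (stand-in for a real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=20, obj=squared_error_obj)
print(booster.predict(dtrain)[:5])
```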


### 2. Tree Splitting Using Gradient and Hessian

#### Split Criterion

XGBoost constructs decision trees by finding splits that minimize the loss function. At each split, the gain is calculated using the gradient and Hessian.

For a given split, the gain is computed as:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Where:

- $G_L$, $G_R$: sums of gradients in the left and right child nodes.
- $H_L$, $H_R$: sums of Hessians in the left and right child nodes.
- $\lambda$: L2 regularization parameter (smooths the model).
- $\gamma$: minimum loss reduction required to make a split (controls tree complexity).

The algorithm selects the split that maximizes the gain.
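
The gain itself is just a few sums over the node's examples. The hand-rolled sketch below (an illustration, not library code) evaluates the formula above for one candidate split described by a boolean mask; the gradient and Hessian values are made up.

```python
import numpy as np

def split_gain(grad, hess, go_left, lam=1.0, gamma=0.1):
    """Gain of splitting one node into left/right children (toy illustration)."""
    G_L, H_L = grad[go_left].sum(), hess[go_left].sum()
    G_R, H_R = grad[~go_left].sum(), hess[~go_left].sum()
    score = lambda G, H: G**2 / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Gradients/Hessians of 6 samples in a node and one candidate split.
grad = np.array([-1.2, -0.8, -0.5, 0.4, 0.9, 1.1])
hess = np.ones(6)
go_left = np.array([True, True, True, False, False, False])
print(split_gain(grad, hess, go_left))   # positive gain => the split is worth making
```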


### 3. Leaf Weight Optimization

Once a tree structure is determined, the weight of each leaf is optimized using both the gradients and Hessians. The optimal weight $w_j$ for a leaf $j$ is calculated as:

$$w_j = -\frac{G_j}{H_j + \lambda}$$

Where:

- $G_j$: sum of gradients for all examples in the leaf.
- $H_j$: sum of Hessians for all examples in the leaf.
- $\lambda$: L2 regularization parameter.

This weight minimizes the loss for that leaf, balancing model complexity and predictive accuracy.
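
Continuing the toy example (made-up gradients, Hessians, and $\lambda$), the optimal weight of a single leaf is a one-liner:

```python
import numpy as np

# Gradients/Hessians of the examples that landed in one leaf (toy values).
grad_in_leaf = np.array([-1.2, -0.8, -0.5])
hess_in_leaf = np.ones(3)
lam = 1.0

G_j, H_j = grad_in_leaf.sum(), hess_in_leaf.sum()
w_j = -G_j / (H_j + lam)   # optimal leaf weight: 0.625
print(w_j)
```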


### 4. Model Update

After computing the optimal splits and leaf weights, the predictions for the dataset are updated:

$$\hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta \cdot w(x_i)$$

Where:

- $\hat{y}_i^{(t)}$: prediction for sample $i$ at iteration $t$.
- $\eta$: learning rate (controls the step size).
- $w(x_i)$: weight of the leaf to which $x_i$ belongs in the new tree.

This iterative process improves the model's predictions by reducing the residual errors at each step.
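
To make the update rule concrete, here is a minimal hand-written boosting loop for squared error (a pedagogical sketch using scikit-learn decision trees as weak learners, not how XGBoost is implemented): each round fits a small tree to the negative gradients and the predictions move by $\eta$ times the tree's output. For squared error with unit Hessians this plain regression fit coincides with the Hessian-weighted leaf optimization (up to $\lambda$).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

eta, n_rounds = 0.1, 50
y_hat = np.zeros_like(y)                      # y_hat^(0)

for _ in range(n_rounds):
    grad = y_hat - y                          # g_i for l = 0.5 * (y - y_hat)^2
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, -grad)                        # weak learner approximates -g
    y_hat = y_hat + eta * tree.predict(X)     # y_hat^(t+1) = y_hat^(t) + eta * w(x_i)

print(np.mean((y - y_hat) ** 2))              # training MSE shrinks over the rounds
```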


## Why Use Gradient and Hessian?

1. Gradient ($g$): indicates the direction and magnitude of the adjustment needed to reduce the loss.
2. Hessian ($h$): accounts for the curvature of the loss function, leading to more precise updates (second-order optimization).

By leveraging both, XGBoost:

- Makes more informed splits and weight calculations.
- Optimizes the model efficiently while avoiding overfitting.