From 6608588f7a011c05291363e3db6e9bcd893518c0 Mon Sep 17 00:00:00 2001
From: Marco Realacci
Date: Sat, 18 Jan 2025 00:00:19 +0100
Subject: [PATCH] vault backup: 2025-01-18 00:00:19

---
 Foundation of data science/notes/9 XGBoost.md | 32 +++++++++----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/Foundation of data science/notes/9 XGBoost.md b/Foundation of data science/notes/9 XGBoost.md
index 3dfaa49..2046107 100644
--- a/Foundation of data science/notes/9 XGBoost.md
+++ b/Foundation of data science/notes/9 XGBoost.md
@@ -23,9 +23,9 @@ XGBoost (**eXtreme Gradient Boosting**) is an optimized and scalable implementat
 XGBoost allows users to define a custom loss function, but it relies on second-order Taylor expansion (both gradient and Hessian) to optimize the objective.
 
 The general loss function in XGBoost consists of two components:
-L(Θ)=∑i=1nl(yi,y^i)+∑k=1TΩ(hk)L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)
+$$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
 
-1. **First Term: Training Loss (l(yi,y^i)l(y_i, \hat{y}_i))**
+1. **First Term: Training Loss $l(y_i, \hat{y}_i)$**
 
     - Measures how well the predictions y^i\hat{y}_i match the true labels yiy_i.
    - Common choices:
@@ -100,10 +100,10 @@ $$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_
 
 Where:
 
-- GLG_L, GRG_R: Sum of gradients for the left and right child nodes.
-- HLH_L, HRH_R: Sum of Hessians for the left and right child nodes.
-- λ\lambda: L2 regularization parameter (smooths the model).
-- γ\gamma: Minimum loss reduction required to make a split (controls tree complexity).
+- $G_L$, $G_R$: Sum of gradients for the left and right child nodes.
+- $H_L$, $H_R$: Sum of Hessians for the left and right child nodes.
+- $\lambda$: L2 regularization parameter (smooths the model).
+- $\gamma$: Minimum loss reduction required to make a split (controls tree complexity).
 
 The algorithm selects the split that maximizes the gain.
 
@@ -113,13 +113,13 @@ The algorithm selects the split that maximizes the gain.
 
 Once a tree structure is determined, the weight of each leaf is optimized using both the gradients and Hessians. The optimal weight wjw_j for a leaf jj is calculated as:
 
-wj=−GjHj+λw_j = -\frac{G_j}{H_j + \lambda}
+$$w_j = -\frac{G_j}{H_j + \lambda}$$
 
 Where:
 
-- GjG_j: Sum of gradients for all examples in the leaf.
-- HjH_j: Sum of Hessians for all examples in the leaf.
-- λ\lambda: L2 regularization parameter.
+- $G_j$: Sum of gradients for all examples in the leaf.
+- $H_j$: Sum of Hessians for all examples in the leaf.
+- $\lambda$: L2 regularization parameter.
 
 This weight minimizes the loss for that leaf, balancing model complexity and predictive accuracy.
 
@@ -129,13 +129,13 @@ This weight minimizes the loss for that leaf, balancing model complexity and pre
 
 After computing the optimal splits and leaf weights, the predictions for the dataset are updated:
 
-y^i(t+1)=y^i(t)+η⋅w(xi)\hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta \cdot w(x_i)
+$$\hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta \cdot w(x_i)$$
 
 Where:
 
-- y^i(t)\hat{y}_i^{(t)}: Prediction for sample ii at iteration tt.
-- η\eta: Learning rate (controls step size).
-- w(xi)w(x_i): Weight of the leaf to which xix_i belongs in the new tree.
+- $\hat{y}_i^{(t)}$: Prediction for sample $i$ at iteration $t$.
+- $\eta$: Learning rate (controls step size).
+- $w(x_i)$: Weight of the leaf to which $x_i$ belongs in the new tree.
 
 This iterative process improves the model's predictions by reducing the residual errors at each step.
 
@@ -143,8 +143,8 @@
 
 ### **Why Use Gradient and Hessian?**
 
-1. **Gradient (gg):** Indicates the direction and magnitude of adjustments needed to reduce the loss.
-2. **Hessian (hh):** Helps adjust for the curvature of the loss function, leading to more precise updates (second-order optimization).
+1. **Gradient ($g$):** Indicates the direction and magnitude of adjustments needed to reduce the loss.
+2. **Hessian ($h$):** Helps adjust for the curvature of the loss function, leading to more precise updates (second-order optimization).
 
 By leveraging both, XGBoost:
 
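A minimal Python sketch of the formulas touched by these hunks, i.e. the split gain, the optimal leaf weight $w_j = -\frac{G_j}{H_j + \lambda}$, and the shrunken update $\hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta \cdot w(x_i)$. It is illustrative only: the helper names (`split_gain`, `leaf_weight`), the toy data, the candidate split, and the values of `lam`, `gamma`, and `eta` are assumptions, and squared-error loss is assumed so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$.

```python
import numpy as np

# Illustrative sketch of the XGBoost quantities above; names and values are
# assumptions for this example, not part of the patched note.

def leaf_weight(G, H, lam):
    # Optimal leaf weight: w_j = -G_j / (H_j + lambda)
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    # Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Toy data with squared-error loss: g_i = y_hat_i - y_i, h_i = 1.
y = np.array([1.0, 1.2, 3.0, 3.5])
y_hat = np.zeros_like(y)                 # current predictions at iteration t
g, h = y_hat - y, np.ones_like(y)

lam, gamma, eta = 1.0, 0.0, 0.3          # L2 penalty, min split gain, learning rate

# Candidate split: samples 0-1 go to the left child, samples 2-3 to the right.
left, right = [0, 1], [2, 3]
gain = split_gain(g[left].sum(), h[left].sum(), g[right].sum(), h[right].sum(), lam, gamma)

# Leaf weights and the shrunken update y_hat^(t+1) = y_hat^(t) + eta * w(x_i).
w_L = leaf_weight(g[left].sum(), h[left].sum(), lam)
w_R = leaf_weight(g[right].sum(), h[right].sum(), lam)
y_hat[left] += eta * w_L
y_hat[right] += eta * w_R

print(f"gain={gain:.3f}, w_L={w_L:.3f}, w_R={w_R:.3f}, y_hat={y_hat}")
```

With these numbers the candidate split has positive gain, and both leaf weights move the initial predictions toward the targets; a larger `lam` shrinks the weights, while a larger `gamma` raises the bar a split must clear.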