From 21bd52ea7b46765a5330933bd6d863712f649a66 Mon Sep 17 00:00:00 2001
From: Marco Realacci
Date: Sat, 18 Jan 2025 00:03:39 +0100
Subject: [PATCH] vault backup: 2025-01-18 00:03:39

---
 Foundation of data science/notes/9 XGBoost.md | 20 +++++++++----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/Foundation of data science/notes/9 XGBoost.md b/Foundation of data science/notes/9 XGBoost.md
index 2046107..1ef7da9 100644
--- a/Foundation of data science/notes/9 XGBoost.md
+++ b/Foundation of data science/notes/9 XGBoost.md
@@ -27,12 +27,12 @@ $$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
 
 1. **First Term: Training Loss $l(y_i, \hat{y}_i))$**
 
-    - Measures how well the predictions y^i\hat{y}_i match the true labels yiy_i.
+    - Measures how well the predictions $\hat{y}_i$ match the true labels $y_i$.
     - Common choices:
-        - Mean Squared Error (MSE) for regression: l(yi,y^i)=(yi−y^i)2l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2
-        - Log Loss for binary classification: l(yi,y^i)=−[yilog⁡(y^i)+(1−yi)log⁡(1−y^i)]l(y_i, \hat{y}_i) = - \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
+        - Mean Squared Error (MSE) for regression: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
+        - Log Loss for binary classification: $l(y_i, \hat{y}_i) = - \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
         - Multiclass Log Loss for multiclass classification.
-2. **Second Term: Regularization Term (Ω(hk)\Omega(h_k))**
+2. **Second Term: Regularization Term ($\Omega(h_k)$)**
 
     - Adds penalties for model complexity to avoid overfitting: $\Omega(h_k) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$
         - T: Number of leaves in the tree.
@@ -46,7 +46,7 @@ $$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
 
 XGBoost uses a **second-order Taylor approximation** to expand the loss function around the current prediction:
 
-L(Θ)≈∑i=1n[gihk(xi)+12hihk(xi)2]+Ω(hk)L(\Theta) \approx \sum_{i=1}^n \left[ g_i h_k(x_i) + \frac{1}{2} h_i h_k(x_i)^2 \right] + \Omega(h_k)
+$$L(\Theta) \approx \sum_{i=1}^n \left[ g_i h_k(x_i) + \frac{1}{2} h_i h_k(x_i)^2 \right] + \Omega(h_k)$$
 
 - **Gradient (gig_i):** First derivative of the loss function with respect to predictions.
 - **Hessian (hih_i):** Second derivative of the loss function with respect to predictions.
@@ -79,12 +79,10 @@ Here’s a breakdown of how they are used:
 
 For a given loss function l(y,y^)l(y, \hat{y}), the **gradient** (gg) and **Hessian** (hh) are computed for each training example:
 
-- **Gradient (gig_i)**: Measures the direction and magnitude of the steepest ascent in the loss function with respect to the model's prediction:
-    
-    gi=∂l(yi,y^i)∂y^ig_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}
-- **Hessian (hih_i)**: Measures the curvature (second derivative) of the loss function with respect to the model's prediction:
-    
-    hi=∂2l(yi,y^i)∂y^i2h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}
+- **Gradient ($g_i$)**: Measures the direction and magnitude of the steepest ascent in the loss function with respect to the model's prediction:
+  $$g_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}$$
+- **Hessian ($h_i$)**: Measures the curvature (second derivative) of the loss function with respect to the model's prediction:
+  $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$$
 
 ---
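
For reference, here is a minimal NumPy sketch of the objective the patched note formats: the training-loss term (squared error or binary log loss) plus the regularization term $\Omega(h_k) = \gamma T + \frac{1}{2}\lambda\sum_j w_j^2$. The function names, default hyperparameter values, and the toy representation of a tree as just its vector of leaf weights are illustrative assumptions, not XGBoost's actual API.

```python
import numpy as np

def squared_error(y, y_hat):
    """Loss from the note: l(y_i, y_hat_i) = (y_i - y_hat_i)^2."""
    return (y - y_hat) ** 2

def log_loss(y, p):
    """Binary log loss: -[y*log(p) + (1-y)*log(1-p)], with p in (0, 1)."""
    eps = 1e-12                       # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def omega(leaf_weights, gamma, lam):
    """Regularization term: Omega(h_k) = gamma*T + 0.5*lambda*sum_j w_j^2."""
    w = np.asarray(leaf_weights)
    T = w.size                        # number of leaves in the tree
    return gamma * T + 0.5 * lam * np.sum(w ** 2)

def objective(y, y_hat, trees_leaf_weights, gamma=0.1, lam=1.0, loss=squared_error):
    """Total objective L(Theta) = sum_i l(y_i, y_hat_i) + sum_k Omega(h_k)."""
    training_loss = np.sum(loss(y, y_hat))
    regularization = sum(omega(w, gamma, lam) for w in trees_leaf_weights)
    return training_loss + regularization

# Toy usage: two "trees", each represented only by its leaf weights (hypothetical).
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.3, 0.6])
trees = [np.array([0.2, -0.1]), np.array([0.05, 0.3, -0.2])]
print(objective(y, y_hat, trees))                 # squared-error objective
print(objective(y, y_hat, trees, loss=log_loss))  # log-loss objective
```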
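A second sketch covers the gradient/Hessian formulas and the second-order Taylor score from the same hunks. The derivatives are taken with respect to the prediction $\hat{y}_i$, exactly as the note writes them; the library itself differentiates the logistic loss with respect to the raw margin, which yields the simpler $g = p - y$, $h = p(1 - p)$. All names below are illustrative, not XGBoost internals.

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    """Derivatives of l = (y - y_hat)^2 with respect to y_hat:
    g = 2*(y_hat - y),  h = 2."""
    g = 2.0 * (y_hat - y)
    h = np.full_like(y_hat, 2.0)
    return g, h

def grad_hess_log_loss(y, p):
    """Derivatives of l = -[y*log(p) + (1-y)*log(1-p)] with respect to p:
    g = (p - y) / (p*(1 - p)),  h = y/p^2 + (1 - y)/(1 - p)^2."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    g = (p - y) / (p * (1 - p))
    h = y / p ** 2 + (1 - y) / (1 - p) ** 2
    return g, h

def taylor_objective(g, h, f_x, omega_k=0.0):
    """Second-order approximation from the note (constant terms dropped):
    L approx. sum_i [ g_i * h_k(x_i) + 0.5 * h_i * h_k(x_i)^2 ] + Omega(h_k),
    where f_x holds the candidate tree's outputs h_k(x_i)."""
    return np.sum(g * f_x + 0.5 * h * f_x ** 2) + omega_k

# Toy usage: score one candidate tree's outputs under the approximation.
y, p = np.array([1.0, 0.0, 1.0]), np.array([0.7, 0.4, 0.9])
g, h = grad_hess_log_loss(y, p)
candidate_outputs = np.array([0.1, -0.2, 0.05])   # hypothetical h_k(x_i) values
print(taylor_objective(g, h, candidate_outputs, omega_k=0.3))
```

The point of the Taylor form is visible here: once `g` and `h` are computed, scoring any candidate tree only needs its outputs `f_x`, not the original loss function.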