From 21bd52ea7b46765a5330933bd6d863712f649a66 Mon Sep 17 00:00:00 2001
From: Marco Realacci
Date: Sat, 18 Jan 2025 00:03:39 +0100
Subject: [PATCH] vault backup: 2025-01-18 00:03:39

---
 Foundation of data science/notes/9 XGBoost.md | 20 +++++++++----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/Foundation of data science/notes/9 XGBoost.md b/Foundation of data science/notes/9 XGBoost.md
index 2046107..1ef7da9 100644
--- a/Foundation of data science/notes/9 XGBoost.md
+++ b/Foundation of data science/notes/9 XGBoost.md
@@ -27,12 +27,12 @@ $$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
 
 1. **First Term: Training Loss $l(y_i, \hat{y}_i))$**
 
-    - Measures how well the predictions y^i\hat{y}_i match the true labels yiy_i.
+    - Measures how well the predictions $\hat{y}_i$ match the true labels $y_i$.
     - Common choices:
-        - Mean Squared Error (MSE) for regression: l(yi,y^i)=(yi−y^i)2l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2
-        - Log Loss for binary classification: l(yi,y^i)=−[yilog⁡(y^i)+(1−yi)log⁡(1−y^i)]l(y_i, \hat{y}_i) = - \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
+        - Mean Squared Error (MSE) for regression: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
+        - Log Loss for binary classification: $l(y_i, \hat{y}_i) = - \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
         - Multiclass Log Loss for multiclass classification.
-2. **Second Term: Regularization Term (Ω(hk)\Omega(h_k))**
+2. **Second Term: Regularization Term ($\Omega(h_k)$)**
 
     - Adds penalties for model complexity to avoid overfitting: $\Omega(h_k) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$
         - T: Number of leaves in the tree.
@@ -46,7 +46,7 @@ $$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
 
 XGBoost uses a **second-order Taylor approximation** to expand the loss function around the current prediction:
 
-L(Θ)≈∑i=1n[gihk(xi)+12hihk(xi)2]+Ω(hk)L(\Theta) \approx \sum_{i=1}^n \left[ g_i h_k(x_i) + \frac{1}{2} h_i h_k(x_i)^2 \right] + \Omega(h_k)
+$$L(\Theta) \approx \sum_{i=1}^n \left[ g_i h_k(x_i) + \frac{1}{2} h_i h_k(x_i)^2 \right] + \Omega(h_k)$$
 
 - **Gradient (gig_i):** First derivative of the loss function with respect to predictions.
 - **Hessian (hih_i):** Second derivative of the loss function with respect to predictions.
@@ -79,12 +79,10 @@ Here’s a breakdown of how they are used:
 
 For a given loss function l(y,y^)l(y, \hat{y}), the **gradient** (gg) and **Hessian** (hh) are computed for each training example:
 
-- **Gradient (gig_i)**: Measures the direction and magnitude of the steepest ascent in the loss function with respect to the model's prediction:
-    
-    gi=∂l(yi,y^i)∂y^ig_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}
-- **Hessian (hih_i)**: Measures the curvature (second derivative) of the loss function with respect to the model's prediction:
-    
-    hi=∂2l(yi,y^i)∂y^i2h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}
+- **Gradient ($g_i$)**: Measures the direction and magnitude of the steepest ascent in the loss function with respect to the model's prediction:
+  $$g_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}$$
+- **Hessian ($h_i$)**: Measures the curvature (second derivative) of the loss function with respect to the model's prediction:
+  $$h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$$
 
 ---
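
For reference, here is a minimal NumPy sketch of the objective the patched note formats: the training-loss term (squared error or binary log loss) plus the regularization term $\Omega(h_k) = \gamma T + \frac{1}{2}\lambda\sum_j w_j^2$. The function names, default hyperparameter values, and the toy representation of a tree as just its vector of leaf weights are illustrative assumptions, not XGBoost's actual API.

```python
import numpy as np

def squared_error(y, y_hat):
    """Loss from the note: l(y_i, y_hat_i) = (y_i - y_hat_i)^2."""
    return (y - y_hat) ** 2

def log_loss(y, p):
    """Binary log loss: -[y*log(p) + (1-y)*log(1-p)], with p in (0, 1)."""
    eps = 1e-12                       # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def omega(leaf_weights, gamma, lam):
    """Regularization term: Omega(h_k) = gamma*T + 0.5*lambda*sum_j w_j^2."""
    w = np.asarray(leaf_weights)
    T = w.size                        # number of leaves in the tree
    return gamma * T + 0.5 * lam * np.sum(w ** 2)

def objective(y, y_hat, trees_leaf_weights, gamma=0.1, lam=1.0, loss=squared_error):
    """Total objective L(Theta) = sum_i l(y_i, y_hat_i) + sum_k Omega(h_k)."""
    training_loss = np.sum(loss(y, y_hat))
    regularization = sum(omega(w, gamma, lam) for w in trees_leaf_weights)
    return training_loss + regularization

# Toy usage: two "trees", each represented only by its leaf weights (hypothetical).
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.3, 0.6])
trees = [np.array([0.2, -0.1]), np.array([0.05, 0.3, -0.2])]
print(objective(y, y_hat, trees))                 # squared-error objective
print(objective(y, y_hat, trees, loss=log_loss))  # log-loss objective
```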
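A second sketch covers the gradient/Hessian formulas and the second-order Taylor score from the same hunks. The derivatives are taken with respect to the prediction $\hat{y}_i$, exactly as the note writes them; the library itself differentiates the logistic loss with respect to the raw margin, which yields the simpler $g = p - y$, $h = p(1 - p)$. All names below are illustrative, not XGBoost internals.

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    """Derivatives of l = (y - y_hat)^2 with respect to y_hat:
    g = 2*(y_hat - y),  h = 2."""
    g = 2.0 * (y_hat - y)
    h = np.full_like(y_hat, 2.0)
    return g, h

def grad_hess_log_loss(y, p):
    """Derivatives of l = -[y*log(p) + (1-y)*log(1-p)] with respect to p:
    g = (p - y) / (p*(1 - p)),  h = y/p^2 + (1 - y)/(1 - p)^2."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    g = (p - y) / (p * (1 - p))
    h = y / p ** 2 + (1 - y) / (1 - p) ** 2
    return g, h

def taylor_objective(g, h, f_x, omega_k=0.0):
    """Second-order approximation from the note (constant terms dropped):
    L approx. sum_i [ g_i * h_k(x_i) + 0.5 * h_i * h_k(x_i)^2 ] + Omega(h_k),
    where f_x holds the candidate tree's outputs h_k(x_i)."""
    return np.sum(g * f_x + 0.5 * h * f_x ** 2) + omega_k

# Toy usage: score one candidate tree's outputs under the approximation.
y, p = np.array([1.0, 0.0, 1.0]), np.array([0.7, 0.4, 0.9])
g, h = grad_hess_log_loss(y, p)
candidate_outputs = np.array([0.1, -0.2, 0.05])   # hypothetical h_k(x_i) values
print(taylor_objective(g, h, candidate_outputs, omega_k=0.3))
```

The point of the Taylor form is visible here: once `g` and `h` are computed, scoring any candidate tree only needs its outputs `f_x`, not the original loss function.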