$$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
1. **First Term: Training Loss $l(y_i, \hat{y}_i)$**
- Measures how well the predictions $\hat{y}_i$ match the true labels $y_i$.
- Common choices:
- Mean Squared Error (MSE) for regression: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
- Log Loss for binary classification: $l(y_i, \hat{y}_i) = - \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
- Multiclass Log Loss for multiclass classification.
2. **Second Term: Regularization Term ($\Omega(h_k)$)**
- Adds penalties for model complexity to avoid overfitting: $\Omega(h_k) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$
- $T$: Number of leaves in the tree $h_k$ (see the numerical sketch after this list).
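To make the two terms concrete, here is a minimal numerical sketch of the objective, using MSE as the training loss and the penalty above. All numbers (labels, predictions, leaf weights, and the $\gamma$/$\lambda$ values) are made up for illustration:

```python
import numpy as np

# Toy labels and current ensemble predictions (illustrative values).
y = np.array([3.0, 2.5, 4.0])
y_hat = np.array([2.8, 2.9, 3.5])

# First term: training loss, here squared error summed over examples.
train_loss = np.sum((y - y_hat) ** 2)

# Second term: complexity penalty for a single tree h_k with T leaves
# and leaf weights w_j:  Omega = gamma * T + 0.5 * lambda * sum(w_j^2)
gamma, lam = 1.0, 1.0            # hypothetical penalty strengths
w = np.array([0.4, -0.2, 0.1])   # hypothetical leaf weights, so T = 3
omega = gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

objective = train_loss + omega
print(f"training loss = {train_loss:.3f}, Omega = {omega:.3f}, L = {objective:.3f}")
```

The loss term rewards fitting the data, while $\Omega$ grows with every extra leaf and with large leaf weights, so the two terms pull in opposite directions.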
---
XGBoost uses a **second-order Taylor approximation** to expand the loss function around the current prediction (a small numerical sketch follows the definitions of $g_i$ and $h_i$ below):
$$L(\Theta) \approx \sum_{i=1}^n \left[ g_i h_k(x_i) + \frac{1}{2} h_i h_k(x_i)^2 \right] + \Omega(h_k)$$
- **Gradient ($g_i$):** First derivative of the loss function with respect to predictions.
- **Hessian ($h_i$):** Second derivative of the loss function with respect to predictions.
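As a sketch of how this expansion is evaluated, the snippet below uses the squared-error loss from above, for which $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$, and scores a hypothetical new tree's outputs $h_k(x_i)$ (all values are illustrative):

```python
import numpy as np

# Labels and current predictions (same toy values as before).
y = np.array([3.0, 2.5, 4.0])
y_hat = np.array([2.8, 2.9, 3.5])

# For squared error l = (y - y_hat)^2:
g = 2.0 * (y_hat - y)        # gradient g_i
h = np.full_like(y, 2.0)     # Hessian h_i (constant for this loss)

# Hypothetical outputs h_k(x_i) of the candidate tree on each example.
f = np.array([0.15, -0.30, 0.45])

# Second-order approximation of the loss change, with Omega(h_k) left out.
approx = np.sum(g * f + 0.5 * h * f ** 2)
print(approx)
```

Because the approximation depends on the candidate tree only through $g_i$ and $h_i$, the same split-finding machinery works for any twice-differentiable loss.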
Here’s a breakdown of how they are used:
For a given loss function $l(y, \hat{y})$, the **gradient** ($g$) and **Hessian** ($h$) are computed for each training example (a minimal code sketch follows these bullets):
- **Gradient ($g_i$)**: Measures the direction and magnitude of the steepest ascent in the loss function with respect to the model's prediction:
$$g_i = \frac{\partial l(y_i, \hat{y}_i)}{\partial \hat{y}_i}$$
- **Hessian ($h_i$)**: Measures the curvature (second derivative) of the loss function with respect to the model's prediction:
$$h_i = \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$$
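For binary log loss, taking these derivatives with respect to the raw (pre-sigmoid) score gives the familiar closed forms $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$, where $p_i = \sigma(\hat{y}_i)$. Below is a minimal sketch of those formulas plugged into the `xgboost` Python package's custom-objective hook; the tiny dataset and parameter values are illustrative only:

```python
import numpy as np
import xgboost as xgb

def logloss_obj(preds, dtrain):
    """Per-example gradient and Hessian of binary log loss,
    taken w.r.t. the raw (pre-sigmoid) score in preds."""
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))  # predicted probability
    grad = p - y                      # g_i
    hess = p * (1.0 - p)              # h_i
    return grad, hess

# Tiny illustrative dataset: one feature, binary labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train({"max_depth": 2, "eta": 0.3}, dtrain,
                    num_boost_round=10, obj=logloss_obj)
```

Supplying `obj=` this way computes roughly what the built-in `binary:logistic` objective does internally, which is why a custom loss only needs to return these two vectors.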
---