vault backup: 2025-01-18 00:00:19

Marco Realacci 2025-01-18 00:00:19 +01:00
parent 779b4c8fc4
commit 6608588f7a


@@ -23,9 +23,9 @@ XGBoost (**eXtreme Gradient Boosting**) is an optimized and scalable implementat
XGBoost allows users to define a custom loss function, but it relies on second-order Taylor expansion (both gradient and Hessian) to optimize the objective. The general loss function in XGBoost consists of two components:
$$L(\Theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^T \Omega(h_k)$$
1. **First Term: Training Loss $l(y_i, \hat{y}_i)$**
    - Measures how well the predictions $\hat{y}_i$ match the true labels $y_i$.
    - Common choices:
@@ -100,10 +100,10 @@ $$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_
Where:
- $G_L$, $G_R$: Sum of gradients for the left and right child nodes.
- $H_L$, $H_R$: Sum of Hessians for the left and right child nodes.
- $\lambda$: L2 regularization parameter (smooths the model).
- $\gamma$: Minimum loss reduction required to make a split (controls tree complexity).
The algorithm selects the split that maximizes the gain.
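As a concrete illustration, here is a minimal NumPy sketch of this gain computation (the function name `split_gain`, the boolean `left_mask` argument, and the default values for `lam` and `gamma` are illustrative choices, not part of the XGBoost API):

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node into a left and a right child.

    g, h      : per-sample gradients and Hessians at the node (1-D arrays)
    left_mask : boolean array, True where a sample goes to the left child
    lam, gamma: L2 regularization and minimum split-loss penalty
    """
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam)
                  + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma
```

This score is evaluated for every candidate feature/threshold, and the split with the largest gain is kept; a gain that is not positive after subtracting $\gamma$ means the split is not worth making.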
@@ -113,13 +113,13 @@ The algorithm selects the split that maximizes the gain.
Once a tree structure is determined, the weight of each leaf is optimized using both the gradients and Hessians. The optimal weight $w_j$ for a leaf $j$ is calculated as:
$$w_j = -\frac{G_j}{H_j + \lambda}$$
Where:
- $G_j$: Sum of gradients for all examples in the leaf.
- $H_j$: Sum of Hessians for all examples in the leaf.
- $\lambda$: L2 regularization parameter.
This weight minimizes the loss for that leaf, balancing model complexity and predictive accuracy.
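A small sketch of this formula, assuming a squared-error loss written with a $\tfrac{1}{2}$ factor so that each $h_i = 1$ (the helper name `leaf_weight` is illustrative):

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w_j = -G_j / (H_j + lambda)."""
    return -g.sum() / (h.sum() + lam)

# With squared-error loss, g_i = y_hat_i - y_i and h_i = 1, so a leaf holding
# residuals [2.0, 1.0, 3.0] with lambda = 1 gets -(2 + 1 + 3) / (3 + 1) = -1.5:
# roughly the negative mean residual, shrunk toward zero by the regularization.
print(leaf_weight(np.array([2.0, 1.0, 3.0]), np.ones(3)))  # -1.5
```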
@@ -129,13 +129,13 @@ This weight minimizes the loss for that leaf, balancing model complexity and pre
After computing the optimal splits and leaf weights, the predictions for the dataset are updated:
$$\hat{y}_i^{(t+1)} = \hat{y}_i^{(t)} + \eta \cdot w(x_i)$$
Where:
- $\hat{y}_i^{(t)}$: Prediction for sample $i$ at iteration $t$.
- $\eta$: Learning rate (controls step size).
- $w(x_i)$: Weight of the leaf to which $x_i$ belongs in the new tree.
This iterative process improves the model's predictions by reducing the residual errors at each step.
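The update itself is a one-liner once each sample has been routed to a leaf of the new tree; the array names below (`leaf_of`, `leaf_weights`) are illustrative:

```python
import numpy as np

def update_predictions(y_hat, leaf_of, leaf_weights, eta=0.1):
    """One boosting step: y_hat_i <- y_hat_i + eta * w(x_i).

    leaf_of     : integer index of the leaf each sample falls into
    leaf_weights: optimal weight of every leaf in the newly built tree
    """
    return y_hat + eta * np.asarray(leaf_weights)[leaf_of]
```

The learning rate $\eta$ deliberately shrinks each tree's contribution, so later trees can correct what earlier ones missed without any single tree dominating.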
@@ -143,8 +143,8 @@ This iterative process improves the model's predictions by reducing the residual
### **Why Use Gradient and Hessian?**
1. **Gradient ($g$):** Indicates the direction and magnitude of adjustments needed to reduce the loss.
2. **Hessian ($h$):** Helps adjust for the curvature of the loss function, leading to more precise updates (second-order optimization); see the sketch after this list.
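For concreteness, a hedged sketch of how both quantities enter a user-defined objective through XGBoost's Python `train` interface: the callback receives the current raw predictions and the training `DMatrix`, and returns per-sample gradients and Hessians. The data and hyperparameter values (`X`, `y`, `max_depth`, `eta`) are placeholders.

```python
import numpy as np
import xgboost as xgb

def logistic_obj(predt, dtrain):
    """Binary log-loss objective: return per-sample gradient and Hessian."""
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-predt))   # sigmoid of the raw margin
    grad = p - y                       # first derivative of the loss w.r.t. the margin
    hess = p * (1.0 - p)               # second derivative (curvature)
    return grad, hess

# Hypothetical usage:
# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
#                     num_boost_round=50, obj=logistic_obj)
```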
By leveraging both, XGBoost: