I was going to write a post about how prediction contributions in XGBoost are calculated. But I quickly realized that it would make sense to go through a few other things first, namely Quality and Cover. Although this is all well described in the documentation, a practical example is sometimes useful, and working through one has helped me gain a better understanding of how the algorithm works. So I will simply go through the model dump and try to explain how the numbers are calculated and how they are affected by the model inputs.
The data
I’m using the SingaporeAuto dataset from the insuranceData R package, but it can be fetched from other sources as well. The dataset contains claim frequencies from a general insurance auto portfolio together with a few policyholder characteristics.
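If you want to follow along, the data and the xgboost package can be loaded like this, with str() producing the output below.
library(insuranceData)  # provides the SingaporeAuto dataset
library(xgboost)        # used for all modelling below
data(SingaporeAuto)
str(SingaporeAuto)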
## 'data.frame': 7483 obs. of 15 variables:
## $ SexInsured : Factor w/ 3 levels "F","M","U": 3 3 3 3 3 3 3 3 3 3 ...
## $ Female : int 0 0 0 0 0 0 0 0 0 0 ...
## $ VehicleType: Factor w/ 9 levels "A","G","M","P",..: 7 7 7 7 7 7 3 3 3 3 ...
## $ PC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Clm_Count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Exp_weights: num 0.668 0.567 0.504 0.914 0.537 ...
## $ LNWEIGHT : num -0.4034 -0.5679 -0.6856 -0.0894 -0.6225 ...
## $ NCD : int 30 30 30 20 20 20 20 20 20 20 ...
## $ AgeCat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge : int 0 0 0 0 0 0 0 0 0 0 ...
## $ VAgeCat : int 0 0 0 0 0 0 6 6 6 6 ...
## $ VAgecat1 : int 2 2 2 2 2 2 6 6 6 6 ...
The target, Clm_Count, is the number of claims during the year, and LNWEIGHT is the logarithm of the exposure weight, i.e. the fraction of the year that the policy has been in force. Features ending with an integer are just predefined interactions. To keep things simple I’m only using PC, NCD, AgeCat and VAgeCat. The factors could contain some useful information but would have to be encoded in a smart way, and I’m dropping them out of convenience. Besides, you’re not allowed to use gender for pricing anyway!
Some kind of fish
In insurance frequency models a Poisson distribution is often assumed together with a log link function. To adjust for different exposure weights, these can be added as an offset after taking the logarithm. This is equivalent to dividing the target by the exposure, ensuring that all observations are on an annualized scale.
\[\log(\hat y_i) = \sum_{k=1}^{K}f_k(x_i) + \log(exp.weight_i)\]
Here \(K\) is the number of trees. Exponentiating both sides gives \(\hat y_i = exp.weight_i \cdot \exp\left(\sum_{k=1}^{K}f_k(x_i)\right)\), so the trees effectively model the annualized claim frequency. The offset is set as base_margin on the matrix object.
features <- as.matrix(SingaporeAuto[c("PC", "NCD", "AgeCat", "VAgeCat")])
label <- SingaporeAuto$Clm_Count
offset <- SingaporeAuto$LNWEIGHT
dmat <- xgb.DMatrix(features, info = list(label = label, base_margin = offset))
When specifying a Poisson objective, max_delta_step defaults to 0.7 (explicitly written out below). We will see how this affects the calculations further down. I’m setting max_depth to two just to keep things simple, and base_score to one so it doesn’t affect our calculations (since we’re using a log link and \(\log(1) = 0\)). Other parameters are set to their defaults. Note that subsample needs to be set to one for the numbers to add up further down; otherwise we’d have no way of knowing which observations were used for growing which tree.
params <- list(
booster = "gbtree",
objective = "count:poisson",
eta = 0.3,
max_depth = 2,
max_delta_step = 0.7,
lambda = 1,
base_score = 1
)
In this example I’m only growing two trees, which are then dumped to a data table by calling xgb.model.dt.tree().
bst <- xgb.train(params = params, data = dmat, nrounds = 2)
dt <- xgb.model.dt.tree(model = bst)
Dumping XGBoost
The model dump contains information about trees, nodes, split criteria and paths. The first tree is displayed below.
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 0 0 0-0 VAgeCat 3.5 0-1 0-2 0-1 3.0499489 7833.704
## 2: 0 1 0-1 NCD 15.0 0-3 0-4 0-3 2.3427782 5359.799
## 3: 0 2 0-2 VAgeCat 4.5 0-5 0-6 0-5 0.2904137 2473.904
## 4: 0 3 0-3 Leaf NA <NA> <NA> <NA> -0.1173621 2338.160
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> -0.1304656 3021.639
## 6: 0 5 0-5 Leaf NA <NA> <NA> <NA> -0.1327546 1025.383
## 7: 0 6 0-6 Leaf NA <NA> <NA> <NA> -0.1414221 1448.521
The columns Quality and Cover are described in the documentation as
- Quality: either the split gain (change in loss) or the leaf value
- Cover: metric related to the number of observations either seen by a split or collected by a leaf during training.
You might recognize these as being returned from the importance function xgb.importance().
xgb.importance(model = bst)
## Feature Gain Cover Frequency
## 1: VAgeCat 0.5912865 0.6572412 0.6666667
## 2: NCD 0.4087135 0.3427588 0.3333333
The function aggregates Quality, Cover and Frequency by feature (excluding leaves), and then normalizes each column so it sums to one.
normalize <- function(x) x / sum(x)
imp <- dt[
Feature != "Leaf",
.(Gain = sum(Quality), Cover = sum(Cover), Frequency = .N),
by = .(Feature)
]
for (j in setdiff(names(imp), "Feature")) {
data.table::set(imp, j = j, value = normalize(imp[[j]]))
}
imp
## Feature Gain Cover Frequency
## 1: VAgeCat 0.5912865 0.6572412 0.6666667
## 2: NCD 0.4087135 0.3427588 0.3333333
That might bring some clarity to what’s going on. But instead of using the results from the output table, we can also calculate these numbers ourselves.
Taking derivatives
The value of the objective function in XGBoost depends on only two inputs: the first and second derivative of the loss function w.r.t. the predictions coming from the previous step, \(\hat y^{t-1}\). So in any node \(j\) (using the notation from the docs)
\[G_j=\sum_{i\in I_j}\partial_{\hat y_i^{t-1}}l(y_i, \hat y_i^{t-1})\]
\[H_j=\sum_{i\in I_j}\partial_{\hat y_i^{t-1}}^2l(y_i, \hat y_i^{t-1})\]
where \(I_j\) is the set of data points belonging to the jth node. The optimal weights are calculated as
\[w_j = -\frac{G_j}{H_j + \lambda}\]
where \(\lambda\) is a regularization constant specified by the user. It shrinks the weights toward zero, analogously to Ridge regression.
Using the loss function
\[l(y, \hat y) = \hat\mu - y\log(\hat\mu)\]
where \(\hat\mu=\exp(\hat y)\), and taking derivatives with respect to \(\hat y\) yields
\[\frac{\partial l}{\partial \hat y}=\frac{\partial l}{\partial \hat\mu}\frac{\partial \hat\mu}{\partial \hat y}=\left(1-\frac{y}{\hat\mu}\right)\hat\mu=\hat\mu - y\]
and
\[\frac{\partial^2 l}{\partial \hat y^2}=\frac{\partial}{\partial \hat y}\left(\hat\mu-y\right)=\hat\mu\]
So to get \(G\) and \(H\), all we have to do is sum up these two quantities.
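As a quick sanity check (the values of y and y_hat below are arbitrary, chosen only for illustration), the analytical gradient \(\hat\mu - y\) and hessian \(\hat\mu\) can be compared with finite differences of the loss.
# Poisson loss written in terms of the margin y_hat: mu - y*log(mu) with mu = exp(y_hat)
pois_loss <- function(y, y_hat) exp(y_hat) - y * y_hat
y <- 2; y_hat <- 0.3; eps <- 1e-4
mu0 <- exp(y_hat)
# gradient: analytical mu - y vs central difference (both about -0.65 here)
c(mu0 - y, (pois_loss(y, y_hat + eps) - pois_loss(y, y_hat - eps)) / (2 * eps))
# hessian: analytical mu vs second difference (both about 1.35 here)
c(mu0, (pois_loss(y, y_hat + eps) - 2 * pois_loss(y, y_hat) + pois_loss(y, y_hat - eps)) / eps^2)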
No Cover, no Gain
At this point we have everything we need to calculate Cover (which is just \(H\)) and Quality. We already know that Quality in a leaf is the leaf weight, but in all other nodes it is the Gain, which is defined as
\[Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma\]
As described in the documentation, it is the reduction in loss from splitting a node (the first two terms) compared to not splitting (the third term). If this is not greater than zero we’re better off not splitting. This is also where gamma comes into play; it sets a threshold for adding a new branch to the tree. In our case it was set to zero (the default), but if we had set it to 0.3, Node 2 would not have been split and would instead have been a leaf. This is demonstrated by the example below.
params0.3 <- params
params0.3$gamma <- 0.3
bst0.3 <- xgb.train(params = params0.3, data = dmat, nrounds = 1L)
xgb.model.dt.tree(model = bst0.3)
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 0 0 0-0 VAgeCat 3.5 0-1 0-2 0-1 3.0499489 7833.704
## 2: 0 1 0-1 NCD 15.0 0-3 0-4 0-3 2.3427782 5359.799
## 3: 0 2 0-2 Leaf NA <NA> <NA> <NA> -0.1378847 2473.904
## 4: 0 3 0-3 Leaf NA <NA> <NA> <NA> -0.1173621 2338.160
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> -0.1304656 3021.639
To calculate Cover we should only have to sum up our predictions in a given node. However, max_delta_step gets added in under the summation sign. Take the root node of the first tree as an example.
cat("Cover (root node of the first tree): ", sum(exp(offset + params$max_delta_step)))
## Cover (root node of the first tree): 7833.703
If we hadn’t set an offset and instead used the default value of base_score, then all predictions at this stage would equal 0.5. To calculate Gain in the same node we just have to keep track of which observations go left and right. Note that the Gain reported in the dump does not seem to include the constant factor \(\frac{1}{2}\) or the \(\gamma\) term, so I leave them out below.
# initial predictions
mu <- exp(offset)
# indices of observations going left or right; the split criterion
# comes from the model dump table
L <- which(SingaporeAuto$VAgeCat < 3.5)
R <- which(SingaporeAuto$VAgeCat > 3.5)
# first derivatives
GL <- sum(mu[L] - label[L])
GR <- sum(mu[R] - label[R])
# second derivatives with max_delta_step added
HL <- sum(exp(offset[L] + params$max_delta_step))
HR <- sum(exp(offset[R] + params$max_delta_step))
gain <- GL ^ 2 / (HL + params$lambda) +
GR ^ 2 / (HR + params$lambda) -
(GL + GR ) ^ 2 / (HL + HR + params$lambda)
cat("Quality: ", gain)
## Quality: 3.04999
In the best of worlds these numbers would be identical. I don’t know why they differ in the last few decimals, but I’m just going to go ahead and act as if they don’t.
Branching out
Continuing down the tree we can do the same calculations for the second layer as well, here just on the left branch.
LL <- intersect(which(SingaporeAuto$NCD < 15), L)
LR <- intersect(which(SingaporeAuto$NCD > 15), L)
GLL <- sum(mu[LL] - label[LL])
GLR <- sum(mu[LR] - label[LR])
HLL <- sum(exp(offset[LL] + params$max_delta_step))
HLR <- sum(exp(offset[LR] + params$max_delta_step))
gainL <- GLL ^ 2 / (HLL + params$lambda) +
GLR ^ 2 / (HLR + params$lambda) -
(GLL + GLR ) ^ 2 / (HLL + HLR + params$lambda)
coverL <- sum(exp(offset[L] + params$max_delta_step))
cat("Quality: ", gainL)
cat("\nCover: ", coverL)
## Quality: 2.342806
## Cover: 5359.799
Again, Quality is not exactly the same as in the table, but hey, at least Cover is spot on! Finally, we can also calculate the leaf weights. You have probably noticed that using a low learning rate often requires more trees. This is because eta gets multiplied in when calculating the weights, thus requiring additional trees to build up the predictions.
wLL <- - GLL / (HLL + params$lambda) * params$eta
wLR <- - GLR / (HLR + params$lambda) * params$eta
cat("Weight node 3: ", wLL)
cat("\nWeight node 4: ", wLR)
## Weight node 3: -0.1173621
## Weight node 4: -0.1304656
Ascending the second tree
The second tree splits on the same features and conditions as the first, but Quality and Cover have changed since the predictions were updated by the first tree.
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 1 0 1-0 VAgeCat 3.5 1-1 1-2 1-1 3.3744485 6886.6934
## 2: 1 1 1-1 NCD 15.0 1-3 1-4 1-3 2.5446322 4731.2920
## 3: 1 2 1-2 VAgeCat 4.5 1-5 1-6 1-5 0.3558144 2155.4016
## 4: 1 3 1-3 Leaf NA <NA> <NA> <NA> -0.1134273 2079.2393
## 5: 1 4 1-4 Leaf NA <NA> <NA> <NA> -0.1278871 2652.0525
## 6: 1 5 1-5 Leaf NA <NA> <NA> <NA> -0.1304543 897.9076
## 7: 1 6 1-6 Leaf NA <NA> <NA> <NA> -0.1402755 1257.4940
By setting ntreelimit = 1, the predict() function will return predictions after only adding the first tree.
p <- predict(bst, dmat, ntreelimit = 1, outputmargin = TRUE)
We can verify that we get the same value by first determining which path the first observation took down the first tree, and then adding up its offset and leaf weight.
## PC NCD AgeCat VAgeCat
## 0 30 0 0
With VAgeCat = 0 and NCD = 30, the observation should end up in Node 4.
cat("Adding weight and offset: ", dt[Tree == 0 & Node == 4]$Quality + offset[1])
cat("\nOutput from predict(): ", p[1])
## Adding weight and offset: -0.5338795
## Output from predict(): -0.5338795
So to calculate Cover for the root node of the second tree we sum the predictions coming out of the first tree, remembering to take max_delta_step into account.
cat("Cover: ", sum(exp(p + params$max_delta_step)))
## Cover: 6886.694
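For completeness, the root-node Gain of the second tree can be recovered with the same recipe as before, just swapping in the updated predictions p. It should land close to the 3.3744485 reported in the dump, give or take the same small discrepancy as for the first tree.
# same Gain calculation as before, now based on the predictions after the first tree
mu1 <- exp(p)
GL1 <- sum(mu1[L] - label[L])
GR1 <- sum(mu1[R] - label[R])
HL1 <- sum(exp(p[L] + params$max_delta_step))
HR1 <- sum(exp(p[R] + params$max_delta_step))
GL1 ^ 2 / (HL1 + params$lambda) +
  GR1 ^ 2 / (HR1 + params$lambda) -
  (GL1 + GR1) ^ 2 / (HL1 + HR1 + params$lambda)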
Too much weight
As a last point I want to mention that weights get truncated by max_delta_step when it is set to a number greater than zero. This might not have any practical implications, but it could be worth knowing. It definitely had me confused for a while when I was trying to figure out why I didn’t get the expected leaf weights. To demonstrate, I’m multiplying the claim count by 100 for observations with VAgeCat < 4.5.
label100 <- ifelse(SingaporeAuto$VAgeCat < 4.5, label * 100, label)
dmat100 <- xgb.DMatrix(features, info = list(label = label100, base_margin = offset))
bst100 <- xgb.train(params = params, data = dmat100, nrounds = 1)
xgb.model.dt.tree(model = bst100)
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 0 0 0-0 VAgeCat 4.5 0-1 0-2 0-1 1988.53577000 7833.70361
## 2: 0 1 0-1 Leaf NA <NA> <NA> <NA> 0.21000001 6385.18262
## 3: 0 2 0-2 NCD 25.0 0-3 0-4 0-3 0.01097419 1448.52087
## 4: 0 3 0-3 Leaf NA <NA> <NA> <NA> -0.14215256 1383.86572
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> -0.12385987 64.65511
The first tree now looks a little different, with Leaf 1 having the suspicious looking weight of 0.21. This is because it has been truncated by max_delta_step and then multiplied by eta.
cat("Weight Leaf 1: ", params$eta * params$max_delta_step)
## Weight Leaf 1: 0.21
Summary
That’s it. In this post I have described how
- the numbers in the model dump are calculated
- some of the parameters affect these numbers and how they impact the model. More specifically:
- eta gets multiplied with the leaf weights
- lambda shrinks weights toward zero
- gamma sets a threshold for the Gain required to split a node
- max_delta_step affects Cover and truncates the weights (see the sketch after this list)
- base_margin and base_score are the initial predictions before the first tree
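Putting these pieces together, a minimal sketch of how a single leaf weight is produced could look like the function below. This is only an illustration of the behaviour observed above, not the actual xgboost source, and the G and H values in the example call are made up.
leaf_weight <- function(G, H, lambda = 1, eta = 0.3, max_delta_step = 0.7) {
  w <- -G / (H + lambda)          # raw weight, shrunk toward zero by lambda
  if (max_delta_step > 0 && abs(w) > max_delta_step) {
    w <- sign(w) * max_delta_step # truncation, as in the last example
  }
  eta * w                         # learning rate applied last
}
# a made-up node with a large negative gradient sum reproduces the 0.21 above
leaf_weight(G = -1e5, H = 10)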
In the next post I will have a look at prediction contributions using the predcontrib option.