I was going to write a post about how prediction contributions in XGBoost are calculated. But I quickly realized that it would make sense to go through a few other things first, namely Quality and Cover. Although this is all well described in the documentation, a practical example is sometimes useful, and working through one has helped me gain a better understanding of how the algorithm works. So I will simply go through the model dump and try to explain how the numbers are calculated and how they are affected by the model inputs.
The data
I’m using the SingaporeAuto dataset from the insuranceData R package, but it can be fetched from other sources as well. The dataset contains claim frequencies from a general insurance auto portfolio together with a few policyholder characteristics.
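If you want to follow along, the data and the xgboost package can be loaded like this, with str() producing the output below.
library(insuranceData)  # provides the SingaporeAuto dataset
library(xgboost)        # used for all modelling below
data(SingaporeAuto)
str(SingaporeAuto)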
## 'data.frame': 7483 obs. of 15 variables:
## $ SexInsured : Factor w/ 3 levels "F","M","U": 3 3 3 3 3 3 3 3 3 3 ...
## $ Female : int 0 0 0 0 0 0 0 0 0 0 ...
## $ VehicleType: Factor w/ 9 levels "A","G","M","P",..: 7 7 7 7 7 7 3 3 3 3 ...
## $ PC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Clm_Count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Exp_weights: num 0.668 0.567 0.504 0.914 0.537 ...
## $ LNWEIGHT : num -0.4034 -0.5679 -0.6856 -0.0894 -0.6225 ...
## $ NCD : int 30 30 30 20 20 20 20 20 20 20 ...
## $ AgeCat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge : int 0 0 0 0 0 0 0 0 0 0 ...
## $ VAgeCat : int 0 0 0 0 0 0 6 6 6 6 ...
## $ VAgecat1 : int 2 2 2 2 2 2 6 6 6 6 ...
The target, Clm_Count, is the number of claims during the year, and LNWEIGHT is the logarithm of the exposure weight, i.e. the fraction of the year that the policy has been in force. Features ending with an integer are just predefined interactions. To keep things simple I’m only using PC, NCD, AgeCat and VAgeCat. The factors could contain some useful information but would have to be encoded in a smart way, and I’m dropping them out of convenience. Besides, you’re not allowed to use gender for pricing anyway!
Some kind of fish
In insurance frequency models a Poisson distribution is often assumed together with a log link function. To adjust for different exposure weights, these can be added as an offset after taking the logarithm. This is equivalent to dividing the target by the exposure, ensuring that all observations are on an annualized scale.
\[\log(\hat y_i) = \sum_{k=1}^{K}f_k(x_i) + \log(exp.weight_i)\]
Here \(K\) is the number of trees. Exponentiating both sides gives \(\hat y_i = exp.weight_i \cdot \exp\left(\sum_{k=1}^{K}f_k(x_i)\right)\), so the trees effectively model the annualized claim frequency. The offset is set as base_margin on the matrix object.
features <- as.matrix(SingaporeAuto[c("PC", "NCD", "AgeCat", "VAgeCat")])
label <- SingaporeAuto$Clm_Count
offset <- SingaporeAuto$LNWEIGHT
dmat <- xgb.DMatrix(features, info = list(label = label, base_margin = offset))
When specifying a Poisson objective, max_delta_step defaults to 0.7 (explicitly written out below). We will see how this affects the calculations further down. I’m setting max_depth to two just to keep things simple, and base_score to one so it doesn’t affect our calculations (since we’re using a log link and \(\log(1) = 0\)). Other parameters are set to their defaults. Note that subsample needs to be set to one for the numbers to add up further down; otherwise we’d have no way of knowing which observations were used for growing which tree.
params <- list(
booster = "gbtree",
objective = "count:poisson",
eta = 0.3,
max_depth = 2,
max_delta_step = 0.7,
lambda = 1,
base_score = 1
)
In this example I’m only growing two trees, which are then dumped to a data table by calling xgb.model.dt.tree().
bst <- xgb.train(params = params, data = dmat, nrounds = 2)
dt <- xgb.model.dt.tree(model = bst)
Dumping XGBoost
The model dump contains information about trees, nodes, split criteria and paths. The first tree is displayed below.
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 0 0 0-0 VAgeCat 3.5 0-1 0-2 0-1 3.0499489 7833.704
## 2: 0 1 0-1 NCD 15.0 0-3 0-4 0-3 2.3427782 5359.799
## 3: 0 2 0-2 VAgeCat 4.5 0-5 0-6 0-5 0.2904137 2473.904
## 4: 0 3 0-3 Leaf NA <NA> <NA> <NA> -0.1173621 2338.160
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> -0.1304656 3021.639
## 6: 0 5 0-5 Leaf NA <NA> <NA> <NA> -0.1327546 1025.383
## 7: 0 6 0-6 Leaf NA <NA> <NA> <NA> -0.1414221 1448.521
The columns Quality and Cover are described in the documentation as
- Quality: either the split gain (change in loss) or the leaf value
- Cover: metric related to the number of observations either seen by a split or collected by a leaf during training.
You might recognize these as being returned from the importance function xgb.importance().
xgb.importance(model = bst)
## Feature Gain Cover Frequency
## 1: VAgeCat 0.5912865 0.6572412 0.6666667
## 2: NCD 0.4087135 0.3427588 0.3333333
The function aggregates Quality, Cover and Frequency by feature (excluding leaves), and then normalizes each column so it sums to one.
normalize <- function(x) x / sum(x)
imp <- dt[
Feature != "Leaf",
.(Gain = sum(Quality), Cover = sum(Cover), Frequency = .N),
by = .(Feature)
]
for (j in setdiff(names(imp), "Feature")) {
data.table::set(imp, j = j, value = normalize(imp[[j]]))
}
imp
## Feature Gain Cover Frequency
## 1: VAgeCat 0.5912865 0.6572412 0.6666667
## 2: NCD 0.4087135 0.3427588 0.3333333
That might bring some clarity to what’s going on. But instead of using the results from the output table, we can also calculate these numbers ourselves.
Taking derivatives
The value of the objective function in XGBoost depends on only two inputs: the first and second derivative of the loss function w.r.t. the predictions coming from the previous step, \(\hat y^{t-1}\). So in any node \(j\) (using the notation from the docs)
\[G_j=\sum_{i\in I_j}\partial_{\hat y_i^{t-1}}l(y_i, \hat y_i^{t-1})\]
\[H_j=\sum_{i\in I_j}\partial_{\hat y_i^{t-1}}^2l(y_i, \hat y_i^{t-1})\]
where \(I_j\) is the set of data points belonging to the jth node. The optimal weights are calculated as
\[w_j = -\frac{G_j}{H_j + \lambda}\]
where \(\lambda\) is a regularization constant specified by the user. It shrinks the weights toward zero, analogously to Ridge regression.
Using the loss function
\[l(y, \hat y) = \hat\mu - y\log(\hat\mu)\]
where \(\hat\mu=\exp(\hat y)\), and taking derivatives with respect to \(\hat y\) yields
\[\frac{\partial l}{\partial \hat y}=\frac{\partial l}{\partial \hat\mu}\frac{\partial \hat\mu}{\partial \hat y}=\left(1-\frac{y}{\hat\mu}\right)\hat\mu=\hat\mu - y\]
and
\[\frac{\partial^2 l}{\partial \hat y^2}=\frac{\partial}{\partial \hat y}\left(\hat\mu-y\right)=\hat\mu\]
So to get \(G\) and \(H\), all we have to do is sum up these two quantities.
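As a quick sanity check (the values of y and y_hat below are arbitrary, chosen only for illustration), the analytical gradient \(\hat\mu - y\) and hessian \(\hat\mu\) can be compared with finite differences of the loss.
# Poisson loss written in terms of the margin y_hat: mu - y*log(mu) with mu = exp(y_hat)
pois_loss <- function(y, y_hat) exp(y_hat) - y * y_hat
y <- 2; y_hat <- 0.3; eps <- 1e-4
mu0 <- exp(y_hat)
# gradient: analytical mu - y vs central difference (both about -0.65 here)
c(mu0 - y, (pois_loss(y, y_hat + eps) - pois_loss(y, y_hat - eps)) / (2 * eps))
# hessian: analytical mu vs second difference (both about 1.35 here)
c(mu0, (pois_loss(y, y_hat + eps) - 2 * pois_loss(y, y_hat) + pois_loss(y, y_hat - eps)) / eps^2)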
No Cover, no Gain
At this point we have everything we need to calculate Cover (which is just \(H\)) and Quality. We already know that Quality in a leaf is the leaf weight, but in all other nodes it is the Gain, which is defined as
\[Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma\]
As described in the documentation, it is the reduction in loss from splitting a node (the first two terms) compared to not splitting (the third term). If this is not greater than zero we’re better off not splitting. This is also where gamma comes into play; it sets a threshold for adding a new branch to the tree. In our case it was set to zero (the default), but if we had set it to 0.3, Node 2 would not have been split and would instead have been a leaf. This is demonstrated by the example below.
params0.3 <- params
params0.3$gamma <- 0.3
bst0.3 <- xgb.train(params = params0.3, data = dmat, nrounds = 1L)
xgb.model.dt.tree(model = bst0.3)
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 0 0 0-0 VAgeCat 3.5 0-1 0-2 0-1 3.0499489 7833.704
## 2: 0 1 0-1 NCD 15.0 0-3 0-4 0-3 2.3427782 5359.799
## 3: 0 2 0-2 Leaf NA <NA> <NA> <NA> -0.1378847 2473.904
## 4: 0 3 0-3 Leaf NA <NA> <NA> <NA> -0.1173621 2338.160
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> -0.1304656 3021.639
To calculate Cover we should only have to sum up our predictions in a given node. However, max_delta_step gets added in under the summation sign. Take the root node of the first tree as an example.
cat("Cover (root node of the first tree): ", sum(exp(offset + params$max_delta_step)))
## Cover (root node of the first tree): 7833.703
If we hadn’t set an offset and instead used the default value of base_score, then all predictions at this stage would equal 0.5. To calculate Gain in the same node we just have to keep track of which observations go left and right. Note that the Gain reported in the dump does not seem to include the constant factor \(\frac{1}{2}\) or the \(\gamma\) term, so I leave them out below.
# initial predictions
mu <- exp(offset)
# indices of observations going left or right; the split criterion
# comes from the model dump table
L <- which(SingaporeAuto$VAgeCat < 3.5)
R <- which(SingaporeAuto$VAgeCat > 3.5)
# first derivatives
GL <- sum(mu[L] - label[L])
GR <- sum(mu[R] - label[R])
# second derivatives with max_delta_step added
HL <- sum(exp(offset[L] + params$max_delta_step))
HR <- sum(exp(offset[R] + params$max_delta_step))
gain <- GL ^ 2 / (HL + params$lambda) +
GR ^ 2 / (HR + params$lambda) -
(GL + GR ) ^ 2 / (HL + HR + params$lambda)
cat("Quality: ", gain)
## Quality: 3.04999
In the best of worlds these numbers would be identical. I don’t know why they differ in the last few decimals, but I’m just going to go ahead and act as if they don’t.
Branching out
Continuing down the tree we can do the same calculations for the second layer as well, here just on the left branch.
LL <- intersect(which(SingaporeAuto$NCD < 15), L)
LR <- intersect(which(SingaporeAuto$NCD > 15), L)
GLL <- sum(mu[LL] - label[LL])
GLR <- sum(mu[LR] - label[LR])
HLL <- sum(exp(offset[LL] + params$max_delta_step))
HLR <- sum(exp(offset[LR] + params$max_delta_step))
gainL <- GLL ^ 2 / (HLL + params$lambda) +
GLR ^ 2 / (HLR + params$lambda) -
(GLL + GLR ) ^ 2 / (HLL + HLR + params$lambda)
coverL <- sum(exp(offset[L] + params$max_delta_step))
cat("Quality: ", gainL)
cat("\nCover: ", coverL)
## Quality: 2.342806
## Cover: 5359.799
Again, Quality is not exactly the same as in the table, but hey, at least Cover is spot on! Finally, we can also calculate the leaf weights. You have probably noticed that using a low learning rate often requires more trees. This is because eta gets multiplied in when calculating the weights, thus requiring additional trees to build up the predictions.
wLL <- - GLL / (HLL + params$lambda) * params$eta
wLR <- - GLR / (HLR + params$lambda) * params$eta
cat("Weight node 3: ", wLL)
cat("\nWeight node 4: ", wLR)
## Weight node 3: -0.1173621
## Weight node 4: -0.1304656
Ascending the second tree
The second tree splits on the same features and conditions as the first, but Quality and Cover have changed since the predictions were updated by the first tree.
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 1 0 1-0 VAgeCat 3.5 1-1 1-2 1-1 3.3744485 6886.6934
## 2: 1 1 1-1 NCD 15.0 1-3 1-4 1-3 2.5446322 4731.2920
## 3: 1 2 1-2 VAgeCat 4.5 1-5 1-6 1-5 0.3558144 2155.4016
## 4: 1 3 1-3 Leaf NA <NA> <NA> <NA> -0.1134273 2079.2393
## 5: 1 4 1-4 Leaf NA <NA> <NA> <NA> -0.1278871 2652.0525
## 6: 1 5 1-5 Leaf NA <NA> <NA> <NA> -0.1304543 897.9076
## 7: 1 6 1-6 Leaf NA <NA> <NA> <NA> -0.1402755 1257.4940
By setting ntreelimit = 1, the predict() function will return predictions after only adding the first tree.
p <- predict(bst, dmat, ntreelimit = 1, outputmargin = TRUE)
We can verify that we get the same value by first determining which path the first observation took down the first tree, and then adding up its offset and leaf weight.
## PC NCD AgeCat VAgeCat
## 0 30 0 0
With VAgeCat = 0 and NCD = 30, the observation should end up in Node 4.
cat("Adding weight and offset: ", dt[Tree == 0 & Node == 4]$Quality + offset[1])
cat("\nOutput from predict(): ", p[1])
## Adding weight and offset: -0.5338795
## Output from predict(): -0.5338795
So to calculate Cover for the root node of the second tree we sum the predictions coming out of the first tree, remembering to take max_delta_step into account.
cat("Cover: ", sum(exp(p + params$max_delta_step)))
## Cover: 6886.694
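For completeness, the root-node Gain of the second tree can be recovered with the same recipe as before, just swapping in the updated predictions p. It should land close to the 3.3744485 reported in the dump, give or take the same small discrepancy as for the first tree.
# same Gain calculation as before, now based on the predictions after the first tree
mu1 <- exp(p)
GL1 <- sum(mu1[L] - label[L])
GR1 <- sum(mu1[R] - label[R])
HL1 <- sum(exp(p[L] + params$max_delta_step))
HR1 <- sum(exp(p[R] + params$max_delta_step))
GL1 ^ 2 / (HL1 + params$lambda) +
  GR1 ^ 2 / (HR1 + params$lambda) -
  (GL1 + GR1) ^ 2 / (HL1 + HR1 + params$lambda)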
Too much weight
As a last point I want to mention that weights get truncated by max_delta_step when it is set to a number greater than zero. This might not have any practical implications, but it could be worth knowing. It definitely had me confused for a while when I was trying to figure out why I didn’t get the expected leaf weights. To demonstrate, I’m multiplying the claim count by 100 for observations with VAgeCat < 4.5.
label100 <- ifelse(SingaporeAuto$VAgeCat < 4.5, label * 100, label)
dmat100 <- xgb.DMatrix(features, info = list(label = label100, base_margin = offset))
bst100 <- xgb.train(params = params, data = dmat100, nrounds = 1)
xgb.model.dt.tree(model = bst100)
## Tree Node ID Feature Split Yes No Missing Quality Cover
## 1: 0 0 0-0 VAgeCat 4.5 0-1 0-2 0-1 1988.53577000 7833.70361
## 2: 0 1 0-1 Leaf NA <NA> <NA> <NA> 0.21000001 6385.18262
## 3: 0 2 0-2 NCD 25.0 0-3 0-4 0-3 0.01097419 1448.52087
## 4: 0 3 0-3 Leaf NA <NA> <NA> <NA> -0.14215256 1383.86572
## 5: 0 4 0-4 Leaf NA <NA> <NA> <NA> -0.12385987 64.65511
The first tree now looks a little different, with Leaf 1 having the suspicious looking weight of 0.21. This is because it has been truncated by max_delta_step and then multiplied by eta.
cat("Weight Leaf 1: ", params$eta * params$max_delta_step)
## Weight Leaf 1: 0.21
Summary
That’s it. In this post I have described how
- the numbers in the model dump are calculated
- some of the parameters affect these numbers and how they impact the model. More specifically:
- eta gets multiplied with the leaf weights
- lambda shrinks weights toward zero
- gamma sets a threshold for the Gain required to split a node
- max_delta_step affects Cover and truncates the weights (see the sketch after this list)
- base_margin and base_score are the initial predictions before the first tree
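Putting these pieces together, a minimal sketch of how a single leaf weight is produced could look like the function below. This is only an illustration of the behaviour observed above, not the actual xgboost source, and the G and H values in the example call are made up.
leaf_weight <- function(G, H, lambda = 1, eta = 0.3, max_delta_step = 0.7) {
  w <- -G / (H + lambda)          # raw weight, shrunk toward zero by lambda
  if (max_delta_step > 0 && abs(w) > max_delta_step) {
    w <- sign(w) * max_delta_step # truncation, as in the last example
  }
  eta * w                         # learning rate applied last
}
# a made-up node with a large negative gradient sum reproduces the 0.21 above
leaf_weight(G = -1e5, H = 10)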
In the next post I will have a look at prediction contributions using the predcontrib option.