
Boosting and calibrating a probabilistic classifier

In this vignette, we will see how we can use AdaBoost to turn a mediocre model into a less mediocre one.

julia
using SpeciesDistributionToolkit
const SDT = SpeciesDistributionToolkit
using CairoMakie

Data preparation

We will use the demonstration data, and set up a simple decision tree. We will refine its hyper-parameters later on.

julia
aoi = getpolygon(PolygonData(OpenStreetMap, Places); place = "Corse")
provider = RasterData(CHELSA2, BioClim)
L = SDMLayer{Float32}[
    SDMLayer(
        provider;
        layer = x,
        SDT.boundingbox(aoi)...,
    ) for x in 1:19
];
mask!(L, aoi)
19-element Vector{SDMLayer{Float32}}:
 🗺️  A 205 × 124 layer (13685 Float32 cells)
 🗺️  A 205 × 124 layer (13685 Float32 cells)
 ⋮
 🗺️  A 205 × 124 layer (13685 Float32 cells)

Fundamentals of model use

There is a vignette on declaring, training, and using models, which covers a lot of ground that will help make sense of the syntax.

julia
model = SDM(RawData, DecisionTree, SDeMo.__demodata()...)
❎  RawData → DecisionTree → P(x) ≥ 0.5 🗺️

This is the dataset we start from:

Code for the figure
julia
fig = Figure()
ax = Axis(fig[1, 1]; aspect = DataAspect())
poly!(ax, aoi; color = :grey90, strokewidth = 1, strokecolor = :grey20)
scatter!(ax, presences(model); color = :black, markersize = 6)
scatter!(ax, absences(model); color = :grey50, markersize = 4)
hidedecorations!(ax)
hidespines!(ax)

Initial model

Before building the boosting model, we need to construct a weak learner.

The BIOCLIM model (Booth et al., 2014) is one of the first SDMs ever published. Because it is a climate envelope model (locations within the extrema of environmental variables for observed presences are assumed suitable, more or less), it does not usually give very good predictions. BIOCLIM underfits, and for this reason it is also a good candidate for boosting. Naive Bayes classifiers are similarly suitable for AdaBoost: they tend to achieve good enough performance in aggregate, but due to the way they learn from the data, they rarely excel at any single prediction.
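The envelope idea can be sketched in a few lines of plain Julia. This is a minimal illustration of the climate-envelope logic, not the SDeMo implementation: a site is deemed suitable when every environmental variable falls within the range observed at known presences.

```julia
# Minimal climate-envelope sketch (not the SDeMo implementation): a site is
# suitable when every variable lies within the extrema observed at presences.
function envelope_suitable(presence_env::Matrix{Float64}, site::Vector{Float64})
    lo = vec(minimum(presence_env; dims = 1))  # per-variable minima at presences
    hi = vec(maximum(presence_env; dims = 1))  # per-variable maxima at presences
    return all(lo .<= site .<= hi)
end

# Two hypothetical variables (e.g. temperature, precipitation) at four presences
presence_env = [12.0 800.0; 15.0 950.0; 13.5 700.0; 14.0 1100.0]
envelope_suitable(presence_env, [13.0, 900.0])   # within both ranges
envelope_suitable(presence_env, [20.0, 900.0])   # too warm, outside the envelope
```

Because the decision is a hard all-or-nothing test on each variable, a single envelope is a coarse classifier, which is precisely why it pairs well with boosting.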

The traditional lore is that AdaBoost should be used only with really weak classifiers, but Wyner et al. (2017) have convincing arguments in favor of treating it more like a random forest, by showing that it also works with comparatively stronger learners. In keeping with this idea, we will train a moderately deep tree as our base classifier.

julia
hyperparameters!(classifier(model), :maxdepth, 4)
hyperparameters!(classifier(model), :maxnodes, 4)
DecisionTree(SDeMo.DecisionNode(nothing, nothing, nothing, 0, 0.0, 0.5, false), 4, 4)

Decision stumps

Using a maximum depth of 1 and a maximum number of nodes of 2 would turn this model into a decision stump, a very interesting type of classifier that relies on stacking many mediocre predictions into an often surprisingly good one.

This tree is not quite a decision stump (a one-node tree), but it is still relatively small, so a single tree is unlikely to give a good prediction. We can train the model and generate the initial prediction:

julia
train!(model)
prd = predict(model, L; threshold = false)
🗺️  A 205 × 124 layer with 13685 Float64 cells
   Spatial Reference System: +proj=longlat +datum=WGS84 +no_defs

As expected, although it roughly identifies the range of the species, it is not doing a spectacular job at it. This model is a good candidate for boosting.

Code for the figure
julia
f = Figure()
ax = Axis(f[1,1]; aspect=DataAspect())
hm = heatmap!(ax, prd; colormap = :tempo, colorrange = (0, 1))
lines!(ax, aoi; color = :grey20)
hidedecorations!(ax)
hidespines!(ax)
Colorbar(f[1, 2], hm; height = Relative(0.6))

Boosting

To train this model using the AdaBoost algorithm, we wrap it into an AdaBoost object, and specify the number of iterations (how many learners we want to train). The default is 50.

julia
bst = AdaBoost(model; iterations = 60)
AdaBoost {RawData → DecisionTree → P(x) ≥ 0.393} × 60 iterations

We will decrease the learning rate of this model a little bit:

julia
hyperparameters!(bst, :η, 0.9)
AdaBoost {RawData → DecisionTree → P(x) ≥ 0.393} × 60 iterations

Training the model is done in the exact same way as other SDeMo models, with one important exception. AdaBoost uses the predicted class (presence/absence) internally, so we will always train learners to return a thresholded prediction. For this reason, even if the threshold argument is used, it will be ignored during training.

julia
train!(bst)
AdaBoost {RawData → DecisionTree → P(x) ≥ 0.393} × 60 iterations

We can look at the weight (i.e. how much a weak learner is trusted during the final prediction) over time. It should decrease, and more than that, it should ideally decrease fast (exponentially). Some learners may have a negative weight, which means that they are consistently wrong. These are accounted for by flipping their prediction in the final recommendation.
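A common way to assign these weights (a sketch of the classic AdaBoost update, which may differ in detail from the SDeMo internals) is α = ½ log((1 - ε) / ε), where ε is the weighted error of the learner:

```julia
# Classic AdaBoost learner weight (sketch, not necessarily the SDeMo formula):
# better-than-chance learners get a positive weight, chance-level learners get
# zero, and consistently wrong learners get a negative weight.
learner_weight(ε) = 0.5 * log((1 - ε) / ε)

learner_weight(0.1)   # much better than chance → large positive weight
learner_weight(0.5)   # coin flip → weight of zero, contributes nothing
learner_weight(0.9)   # consistently wrong → negative weight; flip its votes
```

This form makes the behavior described above explicit: a learner with ε = 0.5 is ignored, and a learner with ε > 0.5 receives a negative weight, so flipping its prediction turns it back into a useful vote.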

Code for the figure
julia
scatterlines(bst.weights; color = :black)

Models with a weight of 0 are not contributing new information to the boosted model. These are expected to become increasingly common at later iterations, so it is a good idea to examine the cumulative weights to see whether we get close to a plateau.

Code for the figure
julia
scatterlines(cumsum(bst.weights); color = :black)

This is close enough. We can now apply this model to the bioclimatic variables:

julia
brd = predict(bst, L; threshold = false)
🗺️  A 205 × 124 layer with 13685 Float64 cells
   Spatial Reference System: +proj=longlat +datum=WGS84 +no_defs

This gives the following map:

Code for the figure
julia
f = Figure()
ax = Axis(f[1,1]; aspect=DataAspect())
hm = heatmap!(ax, brd; colormap = :tempo, colorrange = (0, 1))
lines!(ax, aoi; color = :grey20)
hidedecorations!(ax)
hidespines!(ax)
Colorbar(f[1, 2], hm; height = Relative(0.6))

Calibration

The scores returned by the boosted model are not calibrated probabilities: they are centered on 0.5, and typically have low variance. This can be explained by the fact that each component model tends to make one-sided errors near 0 and 1, which accumulate as more models are added. This problem is typically more serious when adding multiple classifiers together, as with bagging and boosting. Some classifiers tend to exaggerate the proportion of extreme values (Naive Bayes classifiers, for example), some others tend to create peaks near values slightly above 0 and below 1 (tree-based methods), while methods like logistic regression typically do a fair job at predicting probabilities.

The type of bias that a classifier accumulates is usually pretty obvious from looking at the histogram of the results:

Code for the figure
julia
f = Figure(; size = (600, 300))
ax = Axis(f[1, 1]; xlabel = "Predicted score")
hist!(ax, brd; color = :lightgrey, bins = 30)
hideydecorations!(ax)
hidexdecorations!(ax; ticks = false, ticklabels = false)
hidespines!(ax, :l)
hidespines!(ax, :r)
hidespines!(ax, :t)
xlims!(ax, 0, 1)
tightlimits!(ax)

Here, we see that the predictions are starting to cluster away from 0 and 1, and converge on about one half. This is not a problem when thinking about the species range, because the thresholding will account for it; but it is a problem when thinking about the predicted score.

We can also look at the reliability curve for this model. The reliability curve divides the predictions into bins, and compares the average score in each bin to the proportion of events in the training data. When these two quantities are almost equal, the score output by the model is a good prediction of the probability of the event.
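The binning step can be sketched in plain Julia (the SDeMo.reliability internals may differ): scores go into equal-width bins, and each bin reports its mean predicted score next to its observed event frequency.

```julia
# Sketch of a binned reliability curve: for each equal-width bin, compare the
# mean predicted score to the observed frequency of the event in that bin.
function reliability_sketch(ŷ::Vector{Float64}, y::Vector{Bool}; bins = 5)
    edges = range(0.0, 1.0; length = bins + 1)
    mean_score = Float64[]
    event_rate = Float64[]
    for b in 1:bins
        # the last bin is closed on the right so that a score of 1.0 is binned
        inbin = findall(s -> edges[b] <= s < edges[b + 1] || (b == bins && s == 1.0), ŷ)
        isempty(inbin) && continue
        push!(mean_score, sum(ŷ[inbin]) / length(inbin))
        push!(event_rate, count(y[inbin]) / length(inbin))
    end
    return mean_score, event_rate
end

ŷ = [0.1, 0.15, 0.4, 0.45, 0.8, 0.9]
y = [false, false, false, true, true, true]
mean_score, event_rate = reliability_sketch(ŷ, y; bins = 3)
```

A well-calibrated model produces pairs that fall close to the 1:1 line; systematic departures from it are what the figures below diagnose.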

Code for the figure
julia
f = Figure()
ax = Axis(
    f[1, 1];
    aspect = 1,
    xlabel = "Average predicted probability",
    ylabel = "Average empirical probability",
)
lines!(ax, [0, 1], [0, 1]; color = :grey, linestyle = :dash)
xlims!(ax, 0, 1)
ylims!(ax, 0, 1)
scatterlines!(ax, SDeMo.reliability(bst)...; color = :black, label = "Raw scores")

Based on the reliability curve, it is clear that the model outputs are not probabilities; otherwise, they would lie on the 1:1 line. In order to turn these scores into values that are closer to probabilities, we can estimate calibration parameters for this model:

julia
C = calibrate(bst)
PlattCalibration(-8.763135276226246, 4.4798401313050595)

This step uses the Platt (1999) scaling approach, where the outputs from the model are regressed against the true class probabilities, to attempt to bring the model prediction more in line with true probabilities (Niculescu-Mizil and Caruana, 2005). Internally, the package uses the fast and robust algorithm of Lin et al. (2007).
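Under Platt scaling, the corrected probability is a sigmoid of the raw score; the sketch below reuses the (A, B) pair printed above and assumes the standard functional form p = 1 / (1 + exp(A·s + B)), which may differ cosmetically from the SDeMo internals:

```julia
# Platt correction sketch: squeeze raw scores through a fitted sigmoid.
# The (A, B) values reuse the PlattCalibration parameters printed above.
platt(s; A = -8.763135276226246, B = 4.4798401313050595) = 1 / (1 + exp(A * s + B))

platt(0.0)   # a raw score of 0 is pushed close to 0
platt(0.5)   # the bulk of the scores, near one half, stays near one half
platt(1.0)   # a raw score of 1 is pushed close to 1
```

Because A is negative and large in magnitude, the sigmoid is steep: it stretches the low-variance scores clustered around 0.5 back towards the full [0, 1] range.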

These parameters can be turned into a correction function:

julia
f = correct(C)
#correct##0 (generic function with 1 method)

It is a good idea to check that the calibration function is indeed bringing us closer to the 1:1 line, which is to say that it provides us with a score that is closer to a true probability.

Code for the figure
julia
scatterlines!(
    ax,
    SDeMo.reliability(bst; link = f)...;
    color = :blue,
    marker = :rect,
    label = "Calibrated (Platt)",
)
axislegend(ax; position = :lt)

This calibration function is not quite right for this problem, so we can switch to the more flexible method of isotonic calibration:

julia
C = calibrate(IsotonicCalibration, bst)
calfunc = correct(C)
#correct##0 (generic function with 1 method)
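Isotonic calibration fits the least-squares non-decreasing curve through the binned data, typically with the pool-adjacent-violators algorithm (PAVA). The sketch below is an unweighted version of that idea; SDeMo additionally weights bins by their membership, which this sketch omits:

```julia
# Pool-adjacent-violators sketch (unweighted): any decreasing run of values is
# replaced by its mean, yielding a non-decreasing calibration curve.
function pava(y::Vector{Float64})
    fitted = copy(y)
    counts = ones(Int, length(y))
    i = 1
    while i < length(fitted)
        if fitted[i] > fitted[i + 1]
            # pool the two violating blocks into their weighted mean
            pooled = (fitted[i] * counts[i] + fitted[i + 1] * counts[i + 1]) /
                     (counts[i] + counts[i + 1])
            fitted[i] = pooled
            counts[i] += counts[i + 1]
            deleteat!(fitted, i + 1)
            deleteat!(counts, i + 1)
            i = max(i - 1, 1)  # pooling may create a new violation upstream
        else
            i += 1
        end
    end
    # expand block means back to one value per input point
    return vcat([fill(fitted[b], counts[b]) for b in eachindex(fitted)]...)
end

pava([0.1, 0.3, 0.2, 0.6])  # the 0.3/0.2 violation is pooled to 0.25
```

Unlike Platt scaling, this makes no assumption about the shape of the miscalibration beyond monotonicity, which is why it can follow the reliability curve more closely.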

Code for the figure
julia
scatterlines!(
    ax,
    SDeMo.reliability(bst; link = calfunc)...;
    color = :orange,
    marker = :rect,
    label = "Calibrated (isotonic)",
)
axislegend(ax; position = :lt)

This gives a much better result, and so we will use this calibration function to return the probability prediction.

Predicting probabilities

Of course the calibration function can (like any other function) be applied to a layer, so we can now use AdaBoost to estimate the probability of species presence. Dormann (2020) has several strong arguments in favor of this approach.

Code for the figure
julia
f = Figure()
ax = Axis(f[1,1]; aspect=DataAspect())
hm = heatmap!(ax, calfunc.(brd); colormap = :tempo, colorrange = (0, 1))
lines!(ax, aoi; color = :grey20)
hidedecorations!(ax)
hidespines!(ax)
Colorbar(f[1, 2], hm; height = Relative(0.6))

Note that the calibration is not required to obtain a binary map, as the transformation applied to the scores is monotonic. For this reason, the AdaBoost function will internally work on the raw scores, and turn them into binary predictions using thresholding.

The default threshold for AdaBoost is 0.5, which is explained by the fact that the final decision step is a weighted voting based on the sum of all weak learners. We can get the predicted range under both models:
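The weighted vote can be sketched as follows (a simplified illustration, not the exact SDeMo code): each learner casts a binary vote, learners with a negative weight have their vote flipped, and the weighted average is compared to the 0.5 threshold.

```julia
# Weighted-vote sketch: negative-weight learners vote with their prediction
# flipped, and the weighted share of "presence" votes is thresholded at 0.5.
function weighted_vote(votes::Vector{Bool}, α::Vector{Float64})
    flipped = [α[i] < 0 ? !votes[i] : votes[i] for i in eachindex(votes)]
    score = sum(abs.(α) .* flipped) / sum(abs.(α))
    return score >= 0.5
end

votes = [true, true, false]
α = [1.2, 0.4, -0.8]         # the third learner is consistently wrong
weighted_vote(votes, α)      # its "absence" vote is flipped into "presence"
```

With all weights normalized, a unanimous (after flipping) vote lands at exactly 0 or 1, which is why 0.5 is the natural default threshold.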

julia
sdm_range = predict(model, L)
bst_range = predict(bst, L)
🗺️  A 205 × 124 layer with 13685 Bool cells
   Spatial Reference System: +proj=longlat +datum=WGS84 +no_defs

We now plot the difference, where the areas gained by boosting are in green, and the areas lost by boosting are in light grey. The part of the range that is conserved is in black.

Code for the figure
julia
fg, ax, pl = heatmap(
    gainloss(sdm_range, bst_range);
    colormap = [:forestgreen, :grey10, :grey75],
    colorrange = (-1, 1),
)
ax.aspect = DataAspect()
hidedecorations!(ax)
hidespines!(ax)
lines!(ax, aoi; color = :grey20)
SDeMo.AdaBoost Type
julia
AdaBoost <: AbstractBoostedSDM

A type for AdaBoost that contains the model, a vector of learners, a vector of learner weights, a number of boosting iterations, and the weights w of each point.

Note that this type uses training by re-sampling data according to their weights, as opposed to re-training on all samples and weighting internally. Some learners may have negative weights, in which case their predictions are flipped at prediction time.
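Re-sampling in proportion to the point weights can be sketched with inverse-CDF sampling (an illustration of the general technique; the SDeMo internals may differ):

```julia
# Inverse-CDF weighted resampling sketch: draw indices with probability
# proportional to their weights, so hard (heavily weighted) points are
# over-represented in the next learner's training sample.
function weighted_resample(w::Vector{Float64}, n::Int)
    cdf = cumsum(w ./ sum(w))
    # min(...) guards against floating-point round-off at the upper edge
    return [min(searchsortedfirst(cdf, rand()), length(w)) for _ in 1:n]
end

w = [0.05, 0.05, 0.8, 0.1]   # the third point is hard, so it dominates
idx = weighted_resample(w, 1000)
count(==(3), idx)             # roughly 80% of the draws
```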

The learning rate η defaults to one, but can be lowered to improve stability / prevent over-fitting. The learning rate must be larger than 0, and decreasing it should lead to a larger number of iterations.

source
SDeMo.calibrate Function
julia
calibrate(sdm::T; kwargs...)

Returns a function for model calibration, using Platt scaling, optimized with the Newton method. The returned function can be applied to a model output.

source
julia
calibrate(::Type{IsotonicCalibration}, sdm::T; bins=25, kwargs...)

Returns the isotonic calibration result for a given SDM. Isotonic regression is done using the bin membership as a weight, which gives less importance to bins with a lower membership.

source
SDeMo.correct Function
julia
correct(cal::AbstractCalibration)

Returns a function that gives a probability given a calibration result.

source
julia
correct(cal::Vector{<:AbstractCalibration})

Returns a function that gives the average of probabilities from a vector of calibration results. This is used when bootstrapping or cross-validating probabilities using a pre-trained model.

source
SDeMo.reliability Function
julia
reliability(yhat, y; bins=9)

Returns a binned reliability curve for a series of predicted quantitative scores and a series of truth values.

source
julia
reliability(sdm::AbstractSDM, link::Function=identity; bins=9, kwargs...)

Returns a binned reliability curve for a trained model, where the raw scores are transformed with a specified link function (which defaults to identity). Keyword arguments other than bins are passed to predict.

source

References