Use with the SDeMo package

In this tutorial, we will see how to deal with ensemble models. It is assumed that you will have read the previous tutorial, explaining how to use prediction functions more generally.

julia

using SpeciesDistributionToolkit
using CairoMakie
using Statistics
using Dates

Getting the data

julia

CHE = SpeciesDistributionToolkit.openstreetmap("Switzerland");

We get the data on the BioClim variables from CHELSA2:

julia

provider = RasterData(CHELSA2, BioClim)
L = SDMLayer{Float32}[
    SDMLayer(
        provider;
        layer = x,
        SpeciesDistributionToolkit.boundingbox(CHE)...,
    ) for x in 1:19
];

And we then clip and trim to the polygon describing Switzerland:

julia

L = mask!(L, CHE)

19-element Vector{SDMLayer{Float32}}:
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)
 🗺️  A 240 × 546 layer (70065 Float32 cells)

The next step is to get the data, using the eBird dataset:

julia

ouzel = taxon("Turdus torquatus")
presences = occurrences(
    ouzel,
    first(L),
    "occurrenceStatus" => "PRESENT",
    "limit" => 300,
    "datasetKey" => "4fa7b334-ce0d-4e88-aaae-2e0c138d049e",
)
while length(presences) < count(presences)
    occurrences!(presences)
end

And after this, we prepare a layer with presence data.

julia

presencelayer = mask(first(L), Occurrences(mask(presences, CHE)))

🗺️  A 240 × 546 layer with 70065 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

The next step is to generate a pseudo-absence mask. There is a longer explanation on the ways to generate pseudo-absences in the "How-to" section of the manual. We will sample based on the distance to an observation, by also preventing pseudo-absences to be less than 4km from an observation:

julia

background = pseudoabsencemask(DistanceToEvent, presencelayer)
bgpoints = backgroundpoints(nodata(background, d -> d < 4), 2sum(presencelayer))

🗺️  A 240 × 546 layer with 45362 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

Setting up the model

We will use a decision tree on data transformed through a PCA. Because we will eventually use an ensemble, we want the trees to be shallow (but not too shallow):

julia

sdm = SDM(PCATransform, DecisionTree, L, presencelayer, bgpoints)
variables!(sdm, ForwardSelection, kfold(sdm))

PCATransform → DecisionTree → P(x) ≥ 0.458

Making the first prediction

We can predict using the model by giving a vector of layers as an input. This will return the output as a layer as well:

julia

prd = predict(sdm, L; threshold = false)

🗺️  A 240 × 546 layer with 70065 Float64 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

This gives the following initial prediction:

Code for the figure

julia

f = Figure(; size = (600, 300))
ax = Axis(f[1, 1]; aspect = DataAspect(), title = "Prediction (tree)")
hm = heatmap!(ax, prd; colormap = :linear_worb_100_25_c53_n256, colorrange = (0, 1))
contour!(ax, predict(sdm, L); color = :black, linewidth = 0.5)
Colorbar(f[1, 2], hm)
lines!(ax, CHE; color = :black)
hidedecorations!(ax)
hidespines!(ax)

We can now create an ensemble model, by boostrapping random instances to train a larger number of trees:

julia

ensemble = Bagging(sdm, 30)

{PCATransform → DecisionTree → P(x) ≥ 0.458} × 30

In order to further ensure that the models are learning from different parts of the dataset, we can also bootstrap which variables are accessible to each model:

julia

bagfeatures!(ensemble)

{PCATransform → DecisionTree → P(x) ≥ 0.458} × 30

By default, the bagfeatures! function called on an ensemble will sample the variables forom the model, so that each model in the ensemble has the square root (rounded up) of the number of original variables.

A probably better alternative is to reset the variables of our model:

julia

variables!(ensemble, AllVariables)

{PCATransform → DecisionTree → P(x) ≥ 0.498} × 30

And then perform variable selection with feature bagging:

julia

variables!(ensemble, ForwardSelection, kfold(sdm); bagfeatures = true)

{PCATransform → DecisionTree → P(x) ≥ 0.524} × 30

This will take longer to run, but will usually provide better results.

About this ensemble model

The model we have constructed here is essentially a rotation forest. Note that the PCA is applied to each tree, so it is accounting for the selection of features, and does not introduce data leakage.

We can now train our ensemble model – for an ensemble, this will go through each model, and retrain them with the correct input data/features.

julia

train!(ensemble)

{PCATransform → DecisionTree → P(x) ≥ 0.524} × 30

Let's have a look at the out-of-bag error:

julia

ensemble |> outofbag |> (M) -> 1 - accuracy(M)

0.15537270087124877

Because this ensemble is a collection of model, we set the method to summarize all the scores as the median:

julia

prd = predict(ensemble, L; consensus = median, threshold = false)

🗺️  A 240 × 546 layer with 70065 Float64 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

When the model is trained, we can make a prediction asking to report the inter-quartile range of the predictions of all models, as a measure of uncertainty:

julia

unc = predict(ensemble, L; consensus = iqr, threshold = false)

🗺️  A 240 × 546 layer with 70065 Float64 cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

Code for the figure

julia

f = Figure(; size = (600, 600))
ax = Axis(f[1, 1]; aspect = DataAspect(), title = "Prediction")
hm = heatmap!(ax, prd; colormap = :linear_worb_100_25_c53_n256, colorrange = (0, 1))
Colorbar(f[1, 2], hm)
contour!(
    ax,
    predict(ensemble, L; consensus = majority);
    color = :black,
    linewidth = 0.5,
)
lines!(ax, CHE; color = :black)
hidedecorations!(ax)
hidespines!(ax)
ax2 = Axis(f[2, 1]; aspect = DataAspect(), title = "Uncertainty")
hm =
    heatmap!(ax2, quantize(unc); colormap = :linear_gow_60_85_c27_n256, colorrange = (0, 1))
Colorbar(f[2, 2], hm)
contour!(
    ax2,
    predict(ensemble, L; consensus = majority);
    color = :black,
    linewidth = 0.5,
)
lines!(ax2, CHE; color = :black)
hidedecorations!(ax2)
hidespines!(ax2)

julia

# Heterogeneous ensembles

We can also build ensembles of different models. Let's bring in the model we used in the previous tutorial:

julia

sdm2 = SDM(ZScore, Logistic, L, presencelayer, bgpoints)
variables!(sdm2, [1, 11, 5, 8, 6])
hyperparameters!(classifier(sdm2), :η, 1e-4);
hyperparameters!(classifier(sdm2), :interactions, :all);
hyperparameters!(classifier(sdm2), :epochs, 10_000);

As well as a Naive Bayes classifier:

julia

sdm3 = SDM(RawData, NaiveBayes, L, presencelayer, bgpoints)
variables!(sdm3, ForwardSelection)

RawData → NaiveBayes → P(x) ≥ 0.03

These models can all be merged into an heterogeneous ensemble:

julia

hens = Ensemble(sdm2, sdm3)
train!(hens)

An ensemble model with:
	 ZScore → Logistic → P(x) ≥ 0.318
	 RawData → NaiveBayes → P(x) ≥ 0.03

Heterogeneous ensembles can be used in the exact same way as bagged models, so we can for example predict the range according to the three models:

julia

p_range = predict(hens, L; threshold = true, consensus = majority)

🗺️  A 240 × 546 layer with 70065 Bool cells
   Projection: +proj=longlat +datum=WGS84 +no_defs

Code for the figure

julia

f = Figure(; size = (600, 300))
ax = Axis(f[1, 1]; aspect = DataAspect())
hm = heatmap!(ax, p_range; colormap = Reverse(:terrain), colorrange = (0, 1))
contour!(ax, p_range; color = :black, linewidth = 0.5)
lines!(ax, CHE; color = :black)
hidedecorations!(ax)
hidespines!(ax)

Use with the SDeMo package ​

Getting the data ​

Setting up the model ​

Making the first prediction ​

Use with the SDeMo package

Getting the data

Setting up the model

Making the first prediction