The model training pipeline

The purpose of this vignette is to present the ways to declare, train, and predict with a model. It is not a complete overview of the features of the package, but rather a compendium of the core model-training workflow.

julia
using SpeciesDistributionToolkit
import Statistics
using CairoMakie

The rest of the vignettes in this section present more advanced functionalities. In particular, the ones on cross-validation, variable selection, and hyper-parameters are key.

Data for a model

Models require two sources of information to be trained: first, the features (a matrix of predictors); second, the labels (a vector of Boolean values representing the presence of the species). Thankfully, the package is built so that we can provide this information in several ways, notably:

  • a vector of spatial layers for the features, and two layers with boolean values for the presence/absence

  • a vector of spatial layers for the features, and a collection of occurrences for the presence/absence

In addition, models can have geospatial information about the occurrences.

A lot of vignettes will use a demonstration dataset which contains all three:

julia
X, y, c = SDeMo.__demodata();

Components of a model

All models have three steps: the data transformation, the classifier, and the threshold. These are trained in sequence. The data transformation step is done in a way that avoids data leakage, which means that it is only trained on values that are accessible to the model at the time when it is trained.

The classifier is the algorithm used to transform the input values (possibly transformed) into a quantitative score. This score is then compared to the threshold, in order to return a yes/no answer.

The syntax to build a model is

julia
model = SDM(ZScore, NaiveBayes, X, y, c)
❎  ZScore → NaiveBayes → P(x) ≥ 0.5 🗺️

About coordinates

When constructing a model from layers and occurrences, the coordinates of the instances will be added automatically. Having coordinates is useful for plotting models.

We can check whether this model is georeferenced:

julia
isgeoreferenced(model)
true

And whether it is trained:

julia
istrained(model)
false

Training a model

A model will by default train using all its available information.

julia
train!(model)
☑️  ZScore → NaiveBayes → P(x) ≥ 0.034 🗺️

We can verify that the training actually happened:

julia
istrained(model)
true

It is possible to specify which instances are used for training. Generally speaking, this is useful for cross-validation, and the function for cross-validation will take care of this automatically.
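As a hedged sketch (we assume here that train! accepts a training keyword taking a vector of instance positions, which is the mechanism relied upon by the cross-validation functions), restricting training to the first 100 instances would look like:

julia
# Hypothetical: the `training` keyword is assumed to accept instance positions
train!(model; training = 1:100)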

Thresholded vs. unthresholded predictions

A model can be used to make a prediction with no argument, in which case it will predict on its entire training set. For example, we can check how many predictions of "presence" the model makes:

julia
sum(predict(model))
857

The predict function can also take a vector (prediction for a single instance), or a matrix. Finally, it can also take a vector of spatial information.
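To illustrate the first two of these forms (the spatial case is covered in the callout below), and assuming as elsewhere in this vignette that each column of X is one instance:

julia
predict(model, X[:,1])   # a single instance: a vector of features
predict(model, X[:,1:5]) # several instances: a matrix with one column per instance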

Models in space

The use of models with spatial information is covered in more depth in the vignette on training spatial models.

The predict function takes a threshold keyword argument, which defaults to true. When it is false, the function will return the score coming directly from the classifier. For example, we can get the score associated with the second training instance:

julia
predict(model, X[:,2]; threshold=false)
0.1572394311781998

Scores and probabilities

There is additional documentation covering the calibration functions, as well as an illustrative vignette on spatial prediction of probabilities.

The threshold of the model is optimized during training, and can be accessed with

julia
threshold(model)
0.03355704697986577
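As a sanity check, the thresholded prediction is simply the comparison of the unthresholded score with this value (assuming, as the model display suggests, that the comparison is score ≥ threshold):

julia
score = predict(model, X[:,2]; threshold=false)
(score >= threshold(model)) == predict(model, X[:,2])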

Modifying a model

The components of a model can be changed in place.

But should they?

It may be better to declare a new model. One case in which rapidly changing the classifier or transformer is useful is when trying different combinations in order to get the best fit.

For example, we can move away from naive Bayes and use logistic regression:

julia
classifier!(model, Logistic)
❎  ZScore → Logistic → P(x) ≥ 0.034 🗺️

Note that this marked the model as untrained:

julia
istrained(model)
false

So we can re-train it:

julia
train!(model)
☑️  ZScore → Logistic → P(x) ≥ 0.582 🗺️
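This swap-and-retrain pattern is what makes changing components in place useful when comparing classifiers, as in this minimal sketch:

julia
for C in (NaiveBayes, Logistic)
    classifier!(model, C) # marks the model as untrained
    train!(model)
    # evaluate the re-trained model here before moving to the next classifier
end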

Bagging

Models can be aggregated into homogeneous ensembles by using them as arguments to the Bagging function.

julia
tree = SDM(PCATransform, DecisionTree, X, y, c)
forest = Bagging(tree, 50)
{PCATransform → DecisionTree → P(x) ≥ 0.5} × 50

We can also set each component model to use a different set of features:

julia
bagfeatures!(forest)
{PCATransform → DecisionTree → P(x) ≥ 0.5} × 50

And we can now train this model:

julia
train!(forest)
{PCATransform → DecisionTree → P(x) ≥ 0.172} × 50

The prediction for these models takes another argument, consensus, which is a function that, when called on a vector (the output of each model), will return a single value. For example, we can measure how many models agreed that the first training instance is a presence:

julia
predict(forest, X[:,1]; consensus=sum)
5

A particularly useful consensus function is majority, which will return the outcome of a majority consensus:

julia
predict(forest, X[:,1]; consensus=majority)
false

When predicting without the threshold, this can be used to measure the variability of the different models:

julia
predict(forest, X[:,1]; threshold=false, consensus=iqr)
0.08736720305043286

Boosting

We can use the AdaBoost approach to boost a single model by re-training it sequentially on the least well predicted instances. For the sake of argument, let's do a boosted BIOCLIM. We only live once, and it's a very confusing experience, so we might as well embrace the weird.

julia
why = AdaBoost(SDM(RawData, BIOCLIM, X, y, c), 50)
train!(why)
AdaBoost {RawData → BIOCLIM → P(x) ≥ 0.029} × 50 iterations

More on boosting

There is a more complete vignette on AdaBoost, which also touches upon the issue of model calibration.

This model can be used just like any other model:

julia
predict(why, X[:,4])
false

Heterogeneous ensembles

Several models can be combined into an ensemble, which is done by using an array of models:

julia
ensemble = Ensemble([
    SDM(RawData, Maxent, X, y, c),
    SDM(PCATransform, Logistic, X, y, c),
    SDM(RawData, NaiveBayes, X, y, c),
    Bagging(
        SDM(RawData, DecisionTree, X, y, c),
        50
    )
])
train!(ensemble)
An ensemble model with:
	 ☑️  RawData → Maxent → P(x) ≥ 0.418 🗺️ 
	 ☑️  PCATransform → Logistic → P(x) ≥ 0.482 🗺️ 
	 ☑️  RawData → NaiveBayes → P(x) ≥ 0.034 🗺️ 
	 {RawData → DecisionTree → P(x) ≥ 0.242} × 50

This model can be used for prediction just like a homogeneous ensemble, including using the threshold and consensus keywords:

julia
predict(ensemble, X[:,3]; threshold=false, consensus=Statistics.median)
0.5720326878872128
SDeMo.SDM Type
julia
SDM

This type specifies a full model, which is composed of a transformer (which applies a transformation on the data), a classifier (which returns a quantitative score), and a threshold (above which the score corresponds to the prediction of a presence).

In addition, the SDM carries with it the training features and labels, as well as a vector of indices indicating which variables are actually used by the model.

The coordinates for each observation that is used to train the model are given in the coordinates field, and as with OccurrencesInterface, they must be given as longitude,latitude. If there are no known coordinates for the observations, this field must be an empty vector of the correct type. As of now, there is no plan to support datasets that only have some coordinates known.

source
SDeMo.Bagging Type
julia
Bagging

A bagged (bootstrap aggregated) model is the same template model repeated an arbitrary number of times, with each replicated model having access to a bootstrapped sample of the data. These models are represented by three fields:

  • model is the base model

  • bags is a vector of tuples with the in-sample and out-of-sample instances

  • models is an array of replicated models
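Assuming these field names are part of the public interface, the pieces of a bagged model can be inspected directly:

julia
forest = Bagging(SDM(ZScore, DecisionTree, X, y, c), 50)
forest.model  # the base (template) model
forest.bags   # in-sample and out-of-sample instances for each replicate
forest.models # the replicated models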

source