The model training pipeline

The purpose of this vignette is to present the ways to declare, train, and predict with a model. It is not a complete overview of the features of the package, but rather a compendium of the core model-training workflow.

julia
using SpeciesDistributionToolkit
import Statistics
using CairoMakie

The rest of the vignettes in this section present more advanced functionalities. In particular, the ones on cross-validation, variable selection, and hyper-parameters are key.

Data for a model

Models require two sources of information to be trained: first, the features (a matrix of predictors); second, the labels (a vector of Boolean values representing the presence of the species). Thankfully, the package is built so that we can provide this information in several ways, notably:

  • a vector of spatial layers for the features, and two layers with boolean values for the presence/absence

  • a vector of spatial layers for the features, and a collection of occurrences for the presence/absence

In addition, models can have geospatial information about the occurrences.

A lot of vignettes will use a demonstration dataset which contains all three:

julia
X, y, c = SDeMo.__demodata();

Components of a model

All models have three steps: the data transformation, the classifier, and the threshold. These are trained in sequence. The data transformation step is done in a way that avoids data leakage, which means that it is only trained on values that are accessible to the model at the time when it is trained.

The classifier is the algorithm used to transform the input values (possibly transformed) into a quantitative score. This score is then compared to the threshold, in order to return a yes/no answer.

The syntax to build a model is

julia
model = SDM(ZScore, NaiveBayes, X, y, c)
❎  ZScore → NaiveBayes → P(x) ≥ 0.5 🗺️

About coordinates

When constructing a model from layers and occurrences, the coordinates of the instances will be added automatically. Having coordinates is useful for plotting models.

We can check whether this model is georeferenced:

julia
isgeoreferenced(model)
true

And whether it is trained:

julia
istrained(model)
false

Training a model

A model will by default train using all its available information.

julia
train!(model)
☑️  ZScore → NaiveBayes → P(x) ≥ 0.034 🗺️

We can verify that the training actually happened:

julia
istrained(model)
true

It is possible to specify which instances are used for training. Generally speaking, this is useful for cross-validation, and the function for cross-validation will take care of this automatically.
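As a hedged sketch (we assume here that train! accepts a training keyword taking a vector of instance positions, which is the mechanism relied upon by the cross-validation functions), restricting training to the first 100 instances would look like:

julia
# Hypothetical: the `training` keyword is assumed to accept instance positions
train!(model; training = 1:100)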

Thresholded vs. unthresholded predictions

A model can be used to make a prediction with no argument, in which case it will predict on its entire training set. For example, we can check how many predictions of "presence" the model makes:

julia
sum(predict(model))
857

The predict function can also take a vector (prediction for a single instance), or a matrix. Finally, it can also take a vector of spatial information.
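To illustrate the first two of these forms (the spatial case is covered in the callout below), and assuming as elsewhere in this vignette that each column of X is one instance:

julia
predict(model, X[:,1])   # a single instance: a vector of features
predict(model, X[:,1:5]) # several instances: a matrix with one column per instance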

Models in space

The use of models with spatial information is covered in more depth in the vignette on training spatial models.

The predict function takes a threshold keyword argument, which defaults to true. When it is false, the function will return the score coming directly from the classifier. For example, we can get the score associated with the second training instance:

julia
predict(model, X[:,2]; threshold=false)
0.1572394311781998

Scores and probabilities

There is additional documentation covering the calibration functions, as well as an illustrative vignette on spatial prediction of probabilities.

The threshold of the model is optimized during training, and can be accessed with

julia
threshold(model)
0.03355704697986577
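As a sanity check, the thresholded prediction is simply the comparison of the unthresholded score with this value (assuming, as the model display suggests, that the comparison is score ≥ threshold):

julia
score = predict(model, X[:,2]; threshold=false)
(score >= threshold(model)) == predict(model, X[:,2])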

Modifying a model

The components of a model can be changed in place.

But should they?

It may be better to declare a new model. One case in which rapidly changing the classifier or transformer is useful is when trying different combinations in order to get the best fit.

For example, we can move away from naive Bayes and use logistic regression:

julia
classifier!(model, Logistic)
❎  ZScore → Logistic → P(x) ≥ 0.034 🗺️

Note that this marked the model as untrained:

julia
istrained(model)
false

So we can re-train it:

julia
train!(model)
☑️  ZScore → Logistic → P(x) ≥ 0.582 🗺️
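This swap-and-retrain pattern is what makes changing components in place useful when comparing classifiers, as in this minimal sketch:

julia
for C in (NaiveBayes, Logistic)
    classifier!(model, C) # marks the model as untrained
    train!(model)
    # evaluate the re-trained model here before moving to the next classifier
end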

Bagging

Models can be aggregated into homogeneous ensembles by using them as arguments to the Bagging function.

julia
tree = SDM(PCATransform, DecisionTree, X, y, c)
forest = Bagging(tree, 50)
{PCATransform → DecisionTree → P(x) ≥ 0.5} × 50

We can also set each component model to use a different set of features:

julia
bagfeatures!(forest)
{PCATransform → DecisionTree → P(x) ≥ 0.5} × 50

And we can now train this model:

julia
train!(forest)
{PCATransform → DecisionTree → P(x) ≥ 0.172} × 50

The prediction for these models takes another argument, consensus, which is a function that, when called on a vector (the output of each model), will return a single value. For example, we can measure how many models agreed that the first training instance is a presence:

julia
predict(forest, X[:,1]; consensus=sum)
5

A particularly useful consensus function is majority, which will return the outcome of a majority consensus:

julia
predict(forest, X[:,1]; consensus=majority)
false

When predicting without the threshold, this can be used to measure the variability of the different models:

julia
predict(forest, X[:,1]; threshold=false, consensus=iqr)
0.08736720305043286

Boosting

We can use the AdaBoost approach to boost a single model by re-training it sequentially on the least well predicted instances. For the sake of argument, let's do a boosted BIOCLIM. We only live once, and it's a very confusing experience, so we might as well embrace the weird.

julia
why = AdaBoost(SDM(RawData, BIOCLIM, X, y, c), 50)
train!(why)
AdaBoost {RawData → BIOCLIM → P(x) ≥ 0.029} × 50 iterations

More on boosting

There is a more complete vignette on AdaBoost, which also touches upon the issue of model calibration.

This model can be used just like any other model:

julia
predict(why, X[:,4])
false

Heterogeneous ensembles

Several models can be combined into an ensemble, which is done by using an array of models:

julia
ensemble = Ensemble([
    SDM(RawData, Maxent, X, y, c),
    SDM(PCATransform, Logistic, X, y, c),
    SDM(RawData, NaiveBayes, X, y, c),
    Bagging(
        SDM(RawData, DecisionTree, X, y, c),
        50
    )
])
train!(ensemble)
An ensemble model with:
	 ☑️  RawData → Maxent → P(x) ≥ 0.418 🗺️ 
	 ☑️  PCATransform → Logistic → P(x) ≥ 0.482 🗺️ 
	 ☑️  RawData → NaiveBayes → P(x) ≥ 0.034 🗺️ 
	 {RawData → DecisionTree → P(x) ≥ 0.242} × 50

This model can be used for prediction just like a homogeneous ensemble, including using the threshold and consensus keywords:

julia
predict(ensemble, X[:,3]; threshold=false, consensus=Statistics.median)
0.5720326878872128
SDeMo.SDM Type
julia
SDM

This type specifies a full model, which is composed of a transformer (which applies a transformation on the data), a classifier (which returns a quantitative score), and a threshold (above which the score corresponds to the prediction of a presence).

In addition, the SDM carries with it the training features and labels, as well as a vector of indices indicating which variables are actually used by the model.

The coordinates for each observation that is used to train the model are given in the coordinates field, and as with OccurrencesInterface, they must be given as longitude,latitude. If there are no known coordinates for the observations, this field must be an empty vector of the correct type. As of now, there is no plan to support datasets that only have some coordinates known.

source
SDeMo.Bagging Type
julia
Bagging

A bagged (bootstrap aggregated) model is the same template model repeated an arbitrary number of times, with each replicated model having access to a bootstrapped sample of the data. These models are represented by three fields:

  • model is the base model

  • bags is a vector of tuples with the in-sample and out-of-sample instances

  • models is an array of replicated models
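Assuming these field names are part of the public interface, the pieces of a bagged model can be inspected directly:

julia
forest = Bagging(SDM(ZScore, DecisionTree, X, y, c), 50)
forest.model  # the base (template) model
forest.bags   # in-sample and out-of-sample instances for each replicate
forest.models # the replicated models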

source