The model training pipeline
The purpose of this vignette is to present the ways to train and predict with a model. This is not a complete overview of the features of the package, but rather a compendium of the ways to declare, train, and predict with a model.
```julia
using SpeciesDistributionToolkit
import Statistics
using CairoMakie
```

The rest of the vignettes in this section present more advanced functionalities. In particular, the ones on cross-validation, variable selection, and hyper-parameters are key.
Data for a model
Models require two sources of information to be trained: first, the features (a matrix of predictors); second, the labels (a vector of boolean values representing the presence of the species). Thankfully, the package is built so that this information can be supplied in several ways, notably:
a vector of spatial layers for the features, and two layers with boolean values for the presence/absence
a vector of spatial layers for the features, and a collection of occurrences for the presence/absence
In addition, models can have geospatial information about the occurrences.
A lot of vignettes will use a demonstration dataset which contains all three:
```julia
X, y, c = SDeMo.__demodata();
```

Components of a model
All models have three steps: the data transformation, the classifier, and the threshold. These are trained in sequence. The data transformation step is done in a way that avoids data leakage, meaning it is only trained on values that are accessible to the model at the time it is trained.
The classifier is the algorithm used to transform the input values (possibly transformed) into a quantitative score. This score is then compared to the threshold, in order to return a yes/no answer.
The syntax to build a model is
```julia
model = SDM(ZScore, NaiveBayes, X, y, c)
```

❎ ZScore → NaiveBayes → P(x) ≥ 0.5 🗺️

About coordinates
When constructing a model from layers and occurrences, the coordinates of the instances will be added automatically. Having coordinates is useful for plotting models.
We can check whether this model is georeferenced:
```julia
isgeoreferenced(model)
```

true

And whether it is trained:
```julia
istrained(model)
```

false

Training a model
A model will by default train using all its available information.
```julia
train!(model)
```

☑️ ZScore → NaiveBayes → P(x) ≥ 0.034 🗺️

We can verify that the training actually happened:
```julia
istrained(model)
```

true

It is possible to specify which instances are used for training. Generally speaking, this is useful for cross-validation, and the function for cross-validation will take care of this automatically.
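As a sketch of training on a subset of instances, we can declare a second model and pass it a list of indices. Note that the `training` keyword used below is an assumption on our part; the cross-validation vignette documents the supported interface.

```julia
# Hypothetical sketch -- the `training` keyword is assumed, not confirmed here;
# see the cross-validation vignette for the exact interface
m2 = SDM(ZScore, NaiveBayes, X, y, c)
subset = 1:2:size(X, 2)        # every other instance
train!(m2; training=subset)
istrained(m2)
```

Using a fresh model (`m2`) avoids re-training the `model` object used in the rest of this vignette.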
Thresholded v. unthresholded predictions
A model can be used to make a prediction with no argument, in which case it will predict on its entire training set. For example, we can check how many predictions of "presence" the model makes:
```julia
sum(predict(model))
```

857

The predict function can also take a vector (for a prediction on a single instance), or a matrix. Finally, it can also take a vector of spatial layers.
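The different input shapes mentioned above can be sketched as follows, using the model and data already defined in this vignette:

```julia
# Prediction on the entire training set (no argument)
all_predictions = predict(model)

# Prediction for a single instance (a vector of features)
single = predict(model, X[:, 1])

# Predictions for several instances at once (a matrix of features)
several = predict(model, X[:, 1:10])
```

In each case, the output is thresholded by default, so the values returned are boolean.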
Models in space
The use of models with spatial information is covered in more depth in the vignette on training spatial models.
The predict function takes a threshold keyword argument, which defaults to true. When it is false, the function will return the score coming directly from the classifier. For example, we can get the score associated with the second training instance:
```julia
predict(model, X[:,2]; threshold=false)
```

0.1572394311781998

Scores and probabilities
There is additional documentation covering the calibration functions, as well as an illustrative vignette on spatial prediction of probabilities.
The threshold of the model is optimized during training, and can be accessed with
```julia
threshold(model)
```

0.03355704697986577

Modifying a model
The components of a model can be changed in real time.
But should they?
It may be better to declare a new model. One case in which rapidly changing the classifier or transformer is useful is when trying different combinations in order to get the best fit.
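As a minimal sketch of such a comparison, we can swap classifiers in a loop and look at the in-sample accuracy (an optimistic measure; the cross-validation vignette covers proper evaluation). Only functions introduced in this vignette are used:

```julia
for candidate in (NaiveBayes, Logistic)
    classifier!(model, candidate)   # this marks the model as untrained
    train!(model)
    # In-sample accuracy: fraction of training labels predicted correctly
    accuracy = sum(predict(model) .== y) / length(y)
    @info "In-sample fit" candidate accuracy
end
```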
For example, we can move away from naive Bayes and use logistic regression:
```julia
classifier!(model, Logistic)
```

❎ ZScore → Logistic → P(x) ≥ 0.034 🗺️

Note that this marked the model as untrained:
```julia
istrained(model)
```

false

So we can re-train it:
```julia
train!(model)
```

☑️ ZScore → Logistic → P(x) ≥ 0.582 🗺️

Bagging
Models can be aggregated into homogeneous ensembles, by using them as arguments to the Bagging function.
```julia
tree = SDM(PCATransform, DecisionTree, X, y, c)
forest = Bagging(tree, 50)
```

{PCATransform → DecisionTree → P(x) ≥ 0.5} × 50

We can also set each component model to use a different set of features:
```julia
bagfeatures!(forest)
```

{PCATransform → DecisionTree → P(x) ≥ 0.5} × 50

And we can now train this model:
```julia
train!(forest)
```

{PCATransform → DecisionTree → P(x) ≥ 0.172} × 50

The prediction for these models takes another argument, consensus, which is a function that, when called on a vector (the output of each model), will return a single value. For example, we can measure how many models agreed that the first training instance is a presence:
```julia
predict(forest, X[:,1]; consensus=sum)
```

5

A particularly useful consensus function is majority, which will return the outcome of a majority consensus:
```julia
predict(forest, X[:,1]; consensus=majority)
```

false

When predicting without the threshold, this can be used to measure the variability of the different models:
```julia
predict(forest, X[:,1]; threshold=false, consensus=iqr)
```

0.08736720305043286

Boosting
We can use the AdaBoost approach to boost a single model by re-training it sequentially on the least well predicted instances. For the sake of argument, let's do a boosted BIOCLIM. We only live once, and it's a very confusing experience, so we might as well embrace the weird.
```julia
why = AdaBoost(SDM(RawData, BIOCLIM, X, y, c), 50)
train!(why)
```

AdaBoost {RawData → BIOCLIM → P(x) ≥ 0.029} × 50 iterations

More on boosting
There is a more complete vignette on AdaBoost, which also touches upon the issue of model calibration.
This model can be used just like any other model:
```julia
predict(why, X[:,4])
```

false

Heterogeneous ensembles
Several models can be combined into an ensemble, which is done by using an array of models:
```julia
ensemble = Ensemble([
    SDM(RawData, Maxent, X, y, c),
    SDM(PCATransform, Logistic, X, y, c),
    SDM(RawData, NaiveBayes, X, y, c),
    Bagging(
        SDM(RawData, DecisionTree, X, y, c),
        50
    )
])
train!(ensemble)
```

An ensemble model with:
☑️ RawData → Maxent → P(x) ≥ 0.418 🗺️
☑️ PCATransform → Logistic → P(x) ≥ 0.482 🗺️
☑️ RawData → NaiveBayes → P(x) ≥ 0.034 🗺️
{RawData → DecisionTree → P(x) ≥ 0.242} × 50

This model can be used for prediction just like a homogeneous ensemble, including with the threshold and consensus keywords:
```julia
predict(ensemble, X[:,3]; threshold=false, consensus=Statistics.median)
```

0.5720326878872128

Related documentation
SDeMo.SDM Type
SDM

This type specifies a full model, which is composed of a transformer (which applies a transformation on the data), a classifier (which returns a quantitative score), and a threshold (above which the score corresponds to the prediction of a presence).
In addition, the SDM carries with it the training features and labels, as well as a vector of indices indicating which variables are actually used by the model.
The coordinates for each observation that is used to train the model are given in the coordinates field, and as with OccurrencesInterface, they must be given as longitude, latitude. If there are no known coordinates for the observations, this field must be an empty vector of the correct type. As of now, there is no plan to support datasets that only have some coordinates known.
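Assuming the coordinates field described above, its link with isgeoreferenced can be sketched using the model from this vignette:

```julia
# For a georeferenced model, the coordinates field is non-empty,
# and entries are given as (longitude, latitude)
if isgeoreferenced(model)
    first(model.coordinates)   # the coordinates of the first instance
else
    # a non-georeferenced model must have an empty coordinates vector
    isempty(model.coordinates)
end
```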
SDeMo.Bagging Type
Bagging

A bagged (bootstrap aggregated) model is the same template model repeated an arbitrary number of times, with each replicated model having access to a bootstrapped sample of the data. These models are represented by three fields:
model is the base model
bags is a vector of tuples with the in- and out-of-sample instances
models is an array of replicated models
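Assuming the fields listed above, the structure of a bagged model can be inspected directly; for the forest created earlier in this vignette:

```julia
forest.model            # the base model used as a template
length(forest.models)   # one replicated model per bag (50 in the example above)
length(forest.bags)     # one (in-sample, out-of-sample) tuple per replicate
```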