SDeMo
SDeMo.__classsplit — Method__classsplit(y)Returns a tuple with the indices of the presences and the indices of the absences. This is used to maintain class balance in cross-validation and bagging
SDeMo._explain_many_instances — Method_explain_many_instances(f, Z, X, j, n)Applies _explain_one_instance to the matrix Z
SDeMo._explain_one_instance — Method_explain_one_instance(f, instance, X, j, n)This method returns the explanation for the instance at variable j, based on training data X. This is the most granular version of the Shapley values algorithm.
SDeMo._mcsample — Method_mcsample(x::Vector{T}, X::Matrix{T}, j::Int64, n::Int64) where {T <:Number}This generates a Monte-Carlo sample for Shapley values. The arguments are, in order:
x: a single instance (as a vector) to explain
X: a matrix of training data providing the samples for explanation
j: the index of the variable to explain
n: the number of samples to generate for evaluation
SDeMo._validate_one_model! — Method_validate_one_model!(model::AbstractSDM, fold, τ, kwargs...)Trains the model and returns the validation (Cv) and training (Ct) confusion matrices. Used internally by cross-validation.
SDeMo.accuracy — Functionaccuracy(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of accuracy using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.accuracy — Methodaccuracy(M::ConfusionMatrix)Accuracy
$\frac{TP + TN}{TP + TN + FP + FN}$
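A minimal sketch of how these measures are computed in practice, using the ConfusionMatrix constructor for boolean vectors documented further down this page (the prediction and truth vectors below are made up for illustration):

```julia
using SDeMo

# Hypothetical predictions and ground truths, only to illustrate the API
pred  = [true, true, false, false, true, false]
truth = [true, false, false, false, true, true]

C = ConfusionMatrix(pred, truth)  # stores TP, TN, FP, FN
accuracy(C)                       # (TP + TN) / (TP + TN + FP + FN)
mcc(C)                            # default performance measure in SDeMo
```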
SDeMo.backwardselection! — Methodbackwardselection!(model, folds, pool; verbose::Bool = false, optimality=mcc, kwargs...)Removes variables one at a time until the optimality measure stops increasing. Variables included in pool are not removed.
All keyword arguments are passed to crossvalidate and train!.
SDeMo.backwardselection! — Methodbackwardselection!(model, folds; verbose::Bool = false, optimality=mcc, kwargs...)Removes variables one at a time until the optimality measure stops increasing.
All keyword arguments are passed to crossvalidate and train!.
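A minimal sketch of how backward selection is typically wired together, assuming sdm is an SDM that has already been constructed with its training data (model construction is not covered in this section):

```julia
using SDeMo

# `sdm` is assumed to be an existing SDM holding its training data
folds = kfold(sdm)   # stratified k-fold splits on the training data
backwardselection!(sdm, folds; verbose = true, optimality = mcc)
variables(sdm)       # the variables retained after selection
```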
SDeMo.balancedaccuracy — Functionbalancedaccuracy(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of balancedaccuracy using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.balancedaccuracy — Methodbalancedaccuracy(M::ConfusionMatrix)Balanced accuracy
$\frac{1}{2} (TPR + TNR)$
SDeMo.bootstrap — Methodbootstrap(y, X; n = 50)Generates a series of n bootstrap samples for model bagging. The present and absent classes are bootstrapped separately so that the in-bag and out-of-bag samples respect (on average) the class balance.
SDeMo.bootstrap — Methodbootstrap(sdm::SDM; kwargs...)
SDeMo.calibrate — Methodcalibrate(sdm::T; kwargs...) where {T <: AbstractSDM}Returns a function for model calibration, using Platt scaling, optimized with the Newton method. The returned function can be applied to a model output.
SDeMo.ci — Methodci(C::Vector{<:ConfusionMatrix}, f)Applies f to all confusion matrices in the vector, and returns the 95% CI.
SDeMo.ci — Methodci(C::Vector{<:ConfusionMatrix})Applies the MCC (mcc) to all confusion matrices in the vector, and returns the 95% CI.
SDeMo.classifier — Methodclassifier(model::Bagging)Returns the classifier of the model that serves as a template for the bagged model
SDeMo.classifier — Methodclassifier(model::SDM)Returns the classifier used by the model
SDeMo.coinflip — Methodcoinflip(ensemble::Bagging)Version of coinflip using the training labels for an homogeneous ensemble.
SDeMo.coinflip — Methodcoinflip(sdm::SDM)Version of coinflip using the training labels for an SDM.
SDeMo.coinflip — Methodcoinflip(labels::Vector{Bool})Returns the confusion matrix for the no-skill classifier given a vector of labels. Predictions are made at random, with each class being selected with a probability of one half.
SDeMo.constantnegative — Methodconstantnegative(ensemble::Bagging)Version of constantnegative using the training labels for an homogeneous ensemble.
SDeMo.constantnegative — Methodconstantnegative(sdm::SDM)Version of constantnegative using the training labels for an SDM.
SDeMo.constantnegative — Methodconstantnegative(labels::Vector{Bool})Returns the confusion matrix for the constant negative classifier given a vector of labels. Predictions are assumed to always be negative.
SDeMo.constantpositive — Methodconstantpositive(ensemble::Bagging)Version of constantpositive using the training labels for an homogeneous ensemble.
SDeMo.constantpositive — Methodconstantpositive(sdm::SDM)Version of constantpositive using the training labels for an SDM.
SDeMo.constantpositive — Methodconstantpositive(labels::Vector{Bool})Returns the confusion matrix for the constant positive classifier given a vector of labels. Predictions are assumed to always be positive.
SDeMo.counterfactual — Methodcounterfactual(model::AbstractSDM, x::Vector{T}, yhat, λ; maxiter=100, minvar=5e-5, kwargs...) where {T <: Number}Generates one counterfactual explanation given an input vector x, and a target rule to reach yhat. The learning rate is λ. The maximum number of iterations used in the Nelder-Mead algorithm is maxiter, and the variance improvement under which the model will stop is minvar. Other keywords are passed to predict.
SDeMo.crossvalidate — Methodcrossvalidate(sdm, folds; thr = nothing, kwargs...)Performs cross-validation on a model, given a vector of tuples representing the data splits. The threshold can be fixed through the thr keyword arguments. All other keywords are passed to the train! method.
This method returns two vectors of ConfusionMatrix, with the confusion matrix for each set of validation data first, and the confusion matrix for the training data second.
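A sketch of a typical cross-validation workflow, again assuming an already constructed SDM named sdm:

```julia
using SDeMo

folds = kfold(labels(sdm), features(sdm); k = 10)
Cv, Ct = crossvalidate(sdm, folds)  # validation matrices first, training second
mcc(Cv)                             # mean MCC across the validation folds
ci(Cv, mcc)                         # 95% CI of the MCC across folds
```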
SDeMo.crossvalidate — Methodcrossvalidate(sdm::T, args...; kwargs...) where {T <: AbstractSDM}Performs cross-validation using 10-fold validation as a default. Called when crossvalidate is used without a folds second argument.
SDeMo.dor — Functiondor(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of dor using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.dor — Methoddor(M::ConfusionMatrix)Diagnostic odds ratio, defined as plr/nlr. A useful test has a value larger than unity, and this value has no upper bound.
SDeMo.explain — Methodexplain(model::AbstractSDM, j; observation = nothing, instances = nothing, samples = 100, kwargs..., )Uses the Monte-Carlo approximation of Shapley values to provide explanations for specific predictions. The second argument j is the variable for which the explanation should be provided.
The observation keyword is a row in the instances dataset for which explanations must be provided. If instances is nothing, the explanations will be given on the training data.
All other keyword arguments are passed to predict.
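A sketch of how explanations can be requested, assuming sdm is a trained SDM:

```julia
using SDeMo

# Shapley-style explanations for variable 1, on the training data
expl = explain(sdm, 1; samples = 100)

# Explanation for a single training instance (here, the third one)
explain(sdm, 1; observation = 3, samples = 100)
```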
SDeMo.f1 — Functionf1(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of f1 using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.f1 — Methodf1(M::ConfusionMatrix)F₁ score, defined as the harmonic mean between precision and recall:
$2\times\frac{PPV\times TPR}{PPV + TPR}$
This uses the more general fscore internally.
SDeMo.fdir — Functionfdir(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of fdir using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.fdir — Methodfdir(M::ConfusionMatrix)False discovery rate, 1 - ppv
SDeMo.features — Methodfeatures(sdm::SDM, n)Returns the n-th feature stored in the field X of the SDM.
SDeMo.features — Methodfeatures(sdm::SDM)Returns the features stored in the field X of the SDM. Note that the features are an array, and this does not return a copy of it – any change made to the output of this function will change the content of the SDM features.
SDeMo.fnr — Functionfnr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of fnr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.fnr — Methodfnr(M::ConfusionMatrix)False-negative rate
$\frac{FN}{FN+TP}$
SDeMo.fomr — Functionfomr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of fomr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.fomr — Methodfomr(M::ConfusionMatrix)False omission rate, 1 - npv
SDeMo.forwardselection! — Methodforwardselection!(model, folds, pool; verbose::Bool = false, optimality=mcc, kwargs...)Adds variables one at a time until the optimality measure stops increasing. The variables in pool are added at the start.
All keyword arguments are passed to crossvalidate and train!.
SDeMo.forwardselection! — Methodforwardselection!(model, folds; verbose::Bool = false, optimality=mcc, kwargs...)Adds variables one at a time until the optimality measure stops increasing.
All keyword arguments are passed to crossvalidate and train!.
SDeMo.fpr — Functionfpr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of fpr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.fpr — Methodfpr(M::ConfusionMatrix)False-positive rate
$\frac{FP}{FP+TN}$
SDeMo.fscore — Functionfscore(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of fscore using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.fscore — Functionfscore(M::ConfusionMatrix, β=1.0)Fᵦ score, defined as the harmonic mean between precision and recall, using a positive factor β indicating the relative importance of recall over precision:
$(1 + \beta^2)\times\frac{PPV\times TPR}{(\beta^2 \times PPV) + TPR}$
SDeMo.fscore — Methodfscore(β::Real)Creates a function for the Fᵦ score, which takes a confusion matrix as an input.
SDeMo.gmean — Methodgmean(M::ConfusionMatrix)Geometric mean of sensitivity and specificity.
SDeMo.holdout — Methodholdout(y, X; proportion = 0.2, permute = true)Sets aside a proportion (given by the proportion keyword, defaults to 0.2) of observations to use for validation, and the rest for training. An additional argument permute (defaults to true) can be used to shuffle the order of observations before they are split.
This method returns a single tuple with the training data first and the validation data second. To use this with crossvalidate, it must be put in [].
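A sketch of how a single holdout split can be passed to crossvalidate, assuming an already constructed SDM named sdm:

```julia
using SDeMo

y, X = labels(sdm), features(sdm)
split = holdout(y, X; proportion = 0.3)  # one (training, validation) tuple
Cv, Ct = crossvalidate(sdm, [split])     # note that the split is wrapped in a vector
```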
SDeMo.holdout — Methodholdout(sdm::Bagging)Version of holdout using the instances and labels of a bagged SDM. In this case, the instances of the model used as a reference to build the bagged model are used.
SDeMo.holdout — Methodholdout(sdm::SDM)Version of holdout using the instances and labels of an SDM.
SDeMo.hyperparameters! — Methodhyperparameters!(tr::HasHyperParams, hp::Symbol, val)Sets the hyper-parameters for a transformer or a classifier
SDeMo.hyperparameters — Methodhyperparameters(::Type{<:HasHyperParams}) = nothingReturns the hyper-parameters for a type of classifier or transformer
SDeMo.hyperparameters — Methodhyperparameters(::HasHyperParams)Returns the hyper-parameters for a classifier or a transformer
SDeMo.hyperparameters — Methodhyperparameters(::HasHyperParams, ::Symbol)Returns the value for an hyper-parameter
SDeMo.instance — Methodinstance(sdm::SDM, n; strict=true)Returns the n-th instance stored in the field X of the SDM. If the keyword argument strict is true, only the variables used for prediction are returned.
SDeMo.iqr — Functioniqr(x, m=0.25, M=0.75)Returns the inter-quantile range, by default between 25% and 75% of observations.
SDeMo.kfold — Methodkfold(y, X; k = 10, permute = true)Returns splits of the data in which 1 group is used for validation, and k-1 groups are used for training. All k groups have (approximately) the same size, and each instance is only used once for validation (and k-1 times for training). The groups are stratified (so that they have the same prevalence).
This method returns a vector of tuples, with each entry having the training data first, and the validation data second.
SDeMo.kfold — Methodkfold(sdm::Bagging)Version of kfold using the instances and labels of a bagged SDM. In this case, the instances of the model used as a reference to build the bagged model are used.
SDeMo.kfold — Methodkfold(sdm::SDM)Version of kfold using the instances and labels of an SDM.
SDeMo.labels — Methodlabels(sdm::SDM)Returns the labels stored in the field y of the SDM – note that this is not a copy of the labels, but the object itself.
SDeMo.leaveoneout — Methodleaveoneout(y, X)Returns the splits for leave-one-out cross-validation. Each sample is used once, on its own, for validation.
This method returns a vector of tuples, with each entry having the training data first, and the validation data second.
SDeMo.leaveoneout — Methodleaveoneout(sdm::Bagging)Version of leaveoneout using the instances and labels of a bagged SDM. In this case, the instances of the model used as a reference to build the bagged model are used.
SDeMo.leaveoneout — Methodleaveoneout(sdm::SDM)Version of leaveoneout using the instances and labels of an SDM.
SDeMo.loadsdm — Methodloadsdm(file::String; kwargs...)Loads a model from a JSON file. The keyword arguments are passed to train!. The model is trained in full upon loading.
SDeMo.markedness — Functionmarkedness(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of markedness using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.markedness — Methodmarkedness(M::ConfusionMatrix)Markedness, a measure similar to informedness (TSS) that emphasizes negative predictions
$PPV + NPV -1$
SDeMo.mcc — Functionmcc(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of mcc using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.mcc — Methodmcc(M::ConfusionMatrix)Matthews correlation coefficient. This is the default measure of model performance, and there are rarely good reasons to use anything else to decide which model to use.
SDeMo.montecarlo — Methodmontecarlo(y, X; n = 100, kwargs...)Returns n (def. 100) samples of holdout. Other keyword arguments are passed to holdout.
This method returns a vector of tuples, with each entry having the training data first, and the validation data second.
SDeMo.montecarlo — Methodmontecarlo(sdm::Bagging)Version of montecarlo using the instances and labels of a bagged SDM. In this case, the instances of the model used as a reference to build the bagged model are used.
SDeMo.montecarlo — Methodmontecarlo(sdm::SDM)Version of montecarlo using the instances and labels of an SDM.
SDeMo.nlr — Functionnlr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of nlr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.nlr — Methodnlr(M::ConfusionMatrix)Negative likelihood ratio
$\frac{FNR}{TNR}$
SDeMo.noselection! — Methodnoselection!(model, folds; verbose::Bool = false, kwargs...)Returns the model to the state where all variables are used.
All keyword arguments are passed to train!.
SDeMo.noselection! — Methodnoselection!(model; verbose::Bool = false, kwargs...)Returns the model to the state where all variables are used.
All keyword arguments are passed to train!. For convenience, this version does not require a folds argument, as it would be unused anyway.
SDeMo.noskill — Methodnoskill(ensemble::Bagging)Version of noskill using the training labels for an homogeneous ensemble.
SDeMo.noskill — Methodnoskill(sdm::SDM)Version of noskill using the training labels for an SDM.
SDeMo.noskill — Methodnoskill(labels::Vector{Bool})Returns the confusion matrix for the no-skill classifier given a vector of labels. Predictions are made at random, with each class being selected by its proportion in the training data.
SDeMo.npv — Functionnpv(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of npv using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.npv — Methodnpv(M::ConfusionMatrix)Negative predictive value
$\frac{TN}{TN+FN}$
SDeMo.outofbag — Methodoutofbag(ensemble::Bagging; kwargs...)This method returns the confusion matrix associated with the out-of-bag error, wherein the success in predicting instance i is calculated on the basis of all models that have not been trained on i. The consensus of the different models is a simple majority rule.
The additional keyword arguments are passed to predict.
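A sketch of an out-of-bag evaluation for a bagged model, assuming sdm is an already constructed SDM used as the template:

```julia
using SDeMo

ensemble = Bagging(sdm, 20)  # 20 bootstrap models based on the template SDM
train!(ensemble)
C_oob = outofbag(ensemble)   # confusion matrix from out-of-bag majority votes
mcc(C_oob)
```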
SDeMo.partialresponse — Methodpartialresponse(model::T, i::Integer, j::Integer, s::Tuple=(50, 50); inflated::Bool, kwargs...)This method returns the partial response of applying the trained model to a simulated dataset where all variables except i and j are set to their mean value.
This function will return a grid corresponding to evenly spaced values of i and j, the size of which is given by the last argument s (defaults to 50 × 50).
All keyword arguments are passed to predict.
SDeMo.partialresponse — Methodpartialresponse(model::T, i::Integer, args...; inflated::Bool, kwargs...)This method returns the partial response of applying the trained model to a simulated dataset where all variables except i are set to their mean value. The inflated keyword, when set to true, will instead pick a random value within the range of the observations.
The different arguments that can follow the variable position are:
- nothing, in which case the unique values for the i-th variable are used (sorted)
- a number, in which case that many evenly spaced points within the range of the variable are used
- an array, in which case each value of this array is evaluated
All keyword arguments are passed to predict.
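A sketch of both versions of partialresponse, assuming sdm is a trained SDM (the structure of the returned values is not assumed here):

```julia
using SDeMo

# Response to variable 1, at 100 evenly spaced points within its observed range
pr1 = partialresponse(sdm, 1, 100; inflated = false)

# Response surface for variables 1 and 2 on a 50 × 50 grid,
# with all other variables held at their mean value
pr12 = partialresponse(sdm, 1, 2, (50, 50); inflated = false)
```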
SDeMo.plr — Functionplr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of plr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.plr — Methodplr(M::ConfusionMatrix)Positive likelihood ratio
$\frac{TPR}{FPR}$
SDeMo.ppv — Functionppv(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of ppv using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.ppv — Methodppv(M::ConfusionMatrix)Positive predictive value
$\frac{TP}{TP+FP}$
SDeMo.precision — Functionprecision(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of precision using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.precision — Methodprecision(M::ConfusionMatrix)Alias for ppv, the positive predictive value
SDeMo.prune! — Methodprune!(tree, X, y)This function will take each twig in a tree, and merge the one with the worst contribution to information gain.
SDeMo.recall — Functionrecall(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of recall using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.recall — Methodrecall(M::ConfusionMatrix)Alias for tpr, the true positive rate
SDeMo.reliability — Methodreliability(sdm::AbstractSDM, link::Function=identity; bins=9, kwargs...)Returns a binned reliability curve for a trained model, where the raw scores are transformed with a specified link function (which defaults to identity). Keyword arguments other than bins are passed to predict.
SDeMo.reliability — Methodreliability(yhat, y; bins=9)Returns a binned reliability curve for a series of predicted quantitative scores and a series of truth values.
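A sketch using simulated scores and outcomes (both made up for illustration); in practice the scores would come from a model prediction with the threshold turned off:

```julia
using SDeMo

scores = rand(200)              # hypothetical raw scores in [0, 1]
outcomes = rand(200) .< scores  # hypothetical outcomes consistent with the scores
rel = reliability(scores, outcomes; bins = 9)
```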
SDeMo.reset! — Functionreset!(sdm::SDM, thr=0.5)Resets a model, optionally specifying a new value for the threshold. This amounts to re-using all the variables, and discarding the tuned threshold.
SDeMo.sensitivity — Functionsensitivity(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of sensitivity using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.sensitivity — Methodsensitivity(M::ConfusionMatrix)Alias for tpr, the true positive rate
SDeMo.specificity — Functionspecificity(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of specificity using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.specificity — Methodspecificity(M::ConfusionMatrix)Alias for tnr, the true negative rate
SDeMo.stepwisevif! — Functionstepwisevif!(model::SDM, limit, tr=:;kwargs...)Drops the variables with the largest variance inflation from the model, until all VIFs are under the threshold. The last positional argument (defaults to :) is the indices to use for the VIF calculation. All keyword arguments are passed to train!.
SDeMo.threshold! — Methodthreshold!(sdm::SDM, τ)Sets the value of the threshold.
SDeMo.threshold! — Methodthreshold!(sdm::SDM, folds::Vector{Tuple{Vector{Int}, Vector{Int}}}; optimality=mcc)Optimizes the threshold for an SDM using cross-validation, as given by the folds. This is meant to be used after cross-validation, as it will cross-validate the threshold across all the training data in a way that is a little more robust than the version in train!.
The specific technique used is to train one model per fold, then aggregate all of their predictions on the validation data, and find the value of the threshold that maximizes the average performance across folds.
SDeMo.threshold! — Methodthreshold!(sdm::SDM; kwargs...)Version of threshold! without folds, for which the default of 10-fold validation will be used.
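A sketch of threshold tuning after cross-validation, assuming an already constructed SDM named sdm:

```julia
using SDeMo

folds = kfold(sdm)
threshold!(sdm, folds; optimality = mcc)
threshold(sdm)  # score above which a prediction is counted as a presence
```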
SDeMo.threshold — Methodthreshold(sdm::SDM)This returns the value above which the score returned by the SDM is considered to be a presence.
SDeMo.tnr — Functiontnr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of tnr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.tnr — Methodtnr(M::ConfusionMatrix)True-negative rate
$\frac{TN}{TN+FP}$
SDeMo.tpr — Functiontpr(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of tpr using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.tpr — Methodtpr(M::ConfusionMatrix)True-positive rate
$\frac{TP}{TP+FN}$
SDeMo.train! — Methodtrain!(b::AdaBoost; kwargs...)Trains all the models in an ensemble model - the keyword arguments are passed to train! for each model. Note that this also retrains the original model. If the original model contains transformers, they are re-trained for each learner that is added to the ensemble. This is crucial as learners are re-trained on proportionally weighted samples of the training data, and not re-training the transformers would create data leakage.
SDeMo.train! — Methodtrain!(ensemble::Bagging; kwargs...)Trains all the models in an ensemble model - the keyword arguments are passed to train! for each model. Note that this retrains the entire model, which includes the transformers.
SDeMo.train! — Methodtrain!(ensemble::Ensemble; kwargs...)Trains all the models in an heterogeneous ensemble model - the keyword arguments are passed to train! for each model. Note that this retrains the entire model, which includes the transformers.
The keyword arguments are passed to train! and can include the training indices.
SDeMo.train! — Methodtrain!(sdm::SDM; threshold=true, training=:, optimality=mcc)This is the main training function to train an SDM.
The keyword arguments are:
- training: defaults to :, and is the range (or alternatively the indices) of the data that are used to train the model
- threshold: defaults to true, and performs moving threshold by evaluating 200 possible values between the minimum and maximum output of the model, and returning the one that is optimal
- optimality: defaults to mcc, and is the function applied to the confusion matrix to evaluate which value of the threshold is the best
- absences: defaults to false, and indicates whether the (pseudo)absences are used to train the transformer; when using actual absences, this should be set to true
Internally, this function trains the transformer, then projects the data, then trains the classifier. If threshold is true, the threshold is then optimized.
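A sketch of the basic train-and-evaluate loop, assuming an already constructed SDM named sdm:

```julia
using SDeMo

train!(sdm; optimality = mcc)  # trains the transformer and classifier, tunes the threshold
predict(sdm)                   # predictions on the training data
ConfusionMatrix(sdm)           # predictions compared to the training labels
```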
SDeMo.transformer — Methodtransformer(model::Bagging)Returns the transformer of the model that serves as a template for the bagged model
SDeMo.transformer — Methodtransformer(model::SDM)Returns the transformer used by the model
SDeMo.trueskill — Functiontrueskill(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of trueskill using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.trueskill — Methodtrueskill(M::ConfusionMatrix)True skill statistic (a.k.a. Youden's J, or informedness)
$TPR + TNR - 1$
SDeMo.variableimportance — Methodvariableimportance(model, folds; kwargs...)Returns the importance of all variables in the model. The keywords are passed to variableimportance.
SDeMo.variableimportance — Methodvariableimportance(model, folds, variable; reps=10, optimality=mcc, kwargs...)Returns the importance of one variable in the model. The reps keyword fixes the number of bootstrap replicates to run (defaults to 10, which is not enough!).
The keywords are passed to ConfusionMatrix.
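A sketch of how variable importances are typically computed, assuming an already constructed SDM named sdm:

```julia
using SDeMo

folds = kfold(sdm)
variableimportance(sdm, folds)                # importance of every variable in the model
variableimportance(sdm, folds, 2; reps = 50)  # importance of variable 2, with 50 replicates
```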
SDeMo.variables! — Methodvariables!(ensemble::Bagging, v::Vector{Int})Sets the variables of the top-level model, and then sets the variables of each model in the ensemble.
SDeMo.variables! — Methodvariables!(sdm::SDM, v)Sets the list of variables.
SDeMo.variables! — Methodvariables!(model::AbstractSDM, ::Type{T}, folds::Vector{Tuple{Vector{Int}, Vector{Int}}}; included=Int[], optimality=mcc, verbose::Bool=false, bagfeatures::Bool=false, kwargs...) where {T <: VariableSelectionStrategy}Performs variable selection based on a selection strategy, with a possible folds for cross-validation. If omitted, this defaults to k-folds.
The model is retrained on the optimal set of variables after training.
Keywords:
- included (Int[]), a list of variables that must be included in the model
- optimality (mcc), the measure to optimise at each round of variable selection
- verbose (false), whether the performance should be returned after each round of variable selection
- bagfeatures (false), whether bagfeatures! should be called on each model in an homogeneous ensemble
- all other keywords are passed to train! and crossvalidate
Important notes:
- When using bagfeatures with a pool of included variables, they will always be present in the overall model, but not necessarily in each model of the ensemble
- When using VarianceInflationFactor, the variable selection will stop even if the VIF is above the threshold, when continuing would produce a model with a lower performance – using variables! will always lead to a better model
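A sketch of strategy-based variable selection, assuming an already constructed SDM named sdm:

```julia
using SDeMo

folds = kfold(sdm)
variables!(sdm, ForwardSelection, folds; included = [1], optimality = mcc, verbose = true)
variables(sdm)  # variable 1 is guaranteed to be retained
```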
SDeMo.variables! — Methodvariables!(model::M, ::Type{StrictVarianceInflationFactor{N}}, args...; included::Vector{Int}=Int[], optimality=mcc, verbose::Bool=false, bagfeatures::Bool=false, kwargs...) where {M <: Union{SDM, Bagging}, N}Version of the variable selection for the strict VIF case. This may result in a worse model, and for this reason there is no cross-validation.
SDeMo.variables — Methodvariables(sdm::SDM)Returns the list of variables used by the SDM – these may be ordered by importance. This does not return a copy of the variables array, but the array itself.
SDeMo.vif — Methodvif(::Matrix)Returns the variance inflation factor for each variable in a matrix, as the diagonal of the inverse of the correlation matrix between predictors.
SDeMo.vif — Methodvif(::AbstractSDM, tr=:)Returns the VIF for the variables used in an SDM, optionally restricting to some training instances (defaults to : for all points). The VIF is calculated on the de-meaned predictors.
SDeMo.writesdm — Methodwritesdm(file::String, model::SDM)Writes a model to a JSON file. This method is very bare-bones, and only saves the structure of the model, as well as the data.
SDeMo.κ — Functionκ(C::Vector{<:ConfusionMatrix}, full::Bool=false)Version of κ using a vector of confusion matrices. Returns the mean, and when the second argument is true, returns a tuple where the second element is the CI.
SDeMo.κ — Methodκ(M::ConfusionMatrix)Cohen's κ
StatsAPI.predict — MethodTODO
StatsAPI.predict — MethodStatsAPI.predict(ensemble::Bagging; kwargs...)Predicts the ensemble model for all training data.
StatsAPI.predict — MethodStatsAPI.predict(ensemble::Ensemble; kwargs...)Predicts the heterogeneous ensemble model for all training data.
StatsAPI.predict — MethodStatsAPI.predict(sdm::SDM; kwargs...)This method performs the prediction on the entire set of training data available for the training of an SDM.
StatsAPI.predict — MethodTODO
StatsAPI.predict — MethodStatsAPI.predict(ensemble::Bagging, X; consensus = median, kwargs...)Returns the prediction for the ensemble of models on a dataset X. The function used to aggregate the outputs from different models is consensus (defaults to median). All other keyword arguments are passed to predict.
To get a direct estimate of the variability, the consensus function can be changed to iqr (inter-quantile range), or any measure of variance.
StatsAPI.predict — MethodStatsAPI.predict(ensemble::Ensemble, X; consensus = median, kwargs...)Returns the prediction for the heterogeneous ensemble of models on a dataset X. The function used to aggregate the outputs from different models is consensus (defaults to median). All other keyword arguments are passed to predict.
To get a direct estimate of the variability, the consensus function can be changed to iqr (inter-quantile range), or any measure of variance.
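A sketch of consensus and variability predictions, assuming ensemble is a trained Bagging or Ensemble model and Xnew is a matrix of features laid out like the training data (both are assumptions, not objects defined on this page):

```julia
using SDeMo

yhat = predict(ensemble, Xnew; threshold = false)                    # median across models
spread = predict(ensemble, Xnew; consensus = iqr, threshold = false) # between-model IQR
```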
StatsAPI.predict — MethodStatsAPI.predict(sdm::SDM, X; threshold = true)This is the main prediction function, and it takes as input an SDM and a matrix of features. The only keyword argument is threshold, which determines whether the prediction is returned raw or as a binary value (default is true).
SDeMo.AbstractBoostedSDM — TypeAbstractBoostedSDMThis type covers models that use boosting to iteratively improve on the least well predicted instances of a problem.
SDeMo.AbstractEnsembleSDM — TypeAbstractEnsembleSDMThis abstract type covers models that combine different SDMs to make a prediction, which currently covers Bagging and Ensemble.
SDeMo.AbstractSDM — TypeAbstractSDMThis abstract type covers the regular, ensemble, and boosted models.
SDeMo.AdaBoost — TypeAdaBoost <: AbstractBoostedSDMA type for AdaBoost that contains the model, a vector of learners, a vector of learner weights, a number of boosting iterations, and the weights w of each point.
Note that this type uses training by re-sampling data according to their weights, as opposed to re-training on all samples and weighting internally.
SDeMo.AllVariables — TypeAllVariablesAll variables in the training dataset are used. Note that this also crossvalidates and trains the model.
SDeMo.BIOCLIM — TypeBIOCLIMBIOCLIM
SDeMo.BackwardSelection — TypeBackwardSelectionVariables are removed one at a time until the performance of the models stops improving.
SDeMo.Bagging — TypeBagging
SDeMo.Bagging — MethodBagging(model::SDM, n::Integer)Creates a bag of n models from an SDM
SDeMo.Bagging — MethodBagging(model::SDM, bags::Vector)Creates a bagged model from a template SDM and a vector of pre-computed bags (such as the output of bootstrap)
SDeMo.ChainedTransform — TypeChainedTransform{T1, T2}A transformer that applies, in sequence, a pair of other transformers. This can be used to, for example, do a PCA then a z-score on the projected space. This is limited to two steps because the value of chaining more transformers is doubtful. We may add support for more complex transformations in future versions.
The first and second steps are accessible through first and last.
SDeMo.Classifier — TypeClassifierThis abstract type covers all algorithms to convert transformed data into prediction.
SDeMo.ConfusionMatrix — TypeConfusionMatrix{T <: Number}A structure to store the true positives, true negatives, false positives, and false negatives counts (or proportion) during model evaluation. Empty confusion matrices can be created using the zero method.
SDeMo.ConfusionMatrix — MethodConfusionMatrix(ensemble::Bagging; kwargs...)Performs the predictions for an SDM, and compares them to the labels used for training. The keyword arguments are passed to the predict method.
SDeMo.ConfusionMatrix — MethodConfusionMatrix(sdm::SDM; kwargs...)Performs the predictions for an SDM, and compares them to the labels used for training. The keyword arguments are passed to the predict method.
SDeMo.ConfusionMatrix — MethodConfusionMatrix(pred::Vector{Bool}, truth::Vector{Bool})Given a vector of binary predictions and a vector of ground truths, returns the confusion matrix.
SDeMo.ConfusionMatrix — MethodConfusionMatrix(pred::Vector{T}, truth::Vector{Bool}, τ::T) where {T <: Number}Given a vector of scores and a vector of ground truths, as well as a threshold, transforms the score into binary predictions and returns the confusion matrix.
SDeMo.ConfusionMatrix — MethodConfusionMatrix(pred::Vector{T}, truth::Vector{Bool}) where {T <: Number}Given a vector of scores and a vector of truth, returns the confusion matrix under the assumption that the score are probabilities and that the threshold is one half.
SDeMo.DecisionTree — TypeDecisionTreeThe depth and number of nodes can be adjusted with maxnodes! and maxdepth!.
SDeMo.Ensemble — TypeEnsembleAn heterogeneous ensemble model is defined as a vector of SDMs. Bagging models can also be used.
SDeMo.ForwardSelection — TypeForwardSelectionVariables are included one at a time until the performance of the models stops improving.
SDeMo.Logistic — TypeLogisticLogistic regression with default learning rate of 0.01, penalization (L2) of 0.1, and 2000 epochs. Note that interaction terms can be turned on and off through the use of the interactions field. Possible values are :all (default), :self (only squared terms), and :none (no interactions).
The verbose field (defaults to false) can be used to show the progress of gradient descent, by showing the loss every 100 epochs, or at the interval set by the verbosity field. Note that when doing cross-validation, the loss on the validation data will be automatically reported.
SDeMo.MultivariateTransform — TypeMultivariateTransform{T} <: TransformerT is a multivariate transformation, likely offered through the MultivariateStats package. The transformations currently supported are PCA, PPCA, KernelPCA, and Whitening, and they are documented through their type aliases (e.g. PCATransform).
SDeMo.NaiveBayes — TypeNaiveBayesNaive Bayes Classifier
By default, upon training, the prior probability will be set to the prevalence of the training data.
SDeMo.PCATransform — TypePCATransformThe PCA transform will project the model features, which also serves as a way to decrease the dimensionality of the problem. Note that this method will only use the training instances, and unless the absences=true keyword is used, only the present cases. This ensures that there is no data leakage (neither validation data nor the data from the raster are used).
This is an alias for MultivariateTransform{PCA}.
SDeMo.RawData — TypeRawDataA transformer that does nothing to the data. This is passing the raw data to the classifier, and can be a good first step for models that assume that the features are independent, or are not sensitive to the scale of the features.
SDeMo.SDM — TypeSDMThis type specifies a full model, which is composed of a transformer (which applies a transformation on the data), a classifier (which returns a quantitative score), and a threshold (above which the score corresponds to the prediction of a presence).
In addition, the SDM carries with it the training features and labels, as well as a vector of indices indicating which variables are actually used by the model.
SDeMo.StrictVarianceInflationFactor — TypeStrictVarianceInflationFactor{N}Removes variables one at a time until the largest VIF is lower than N (a floating point number). By contrast with VarianceInflationFactor, this approach to variable selection will not cross-validate the model, and might result in a model that is far worse than any other variable selection technique.
SDeMo.Transformer — TypeTransformerThis abstract type covers all transformations that are applied to the data before fitting the classifier.
SDeMo.VariableSelectionStrategy — TypeVariableSelectionStrategyThis is an abstract type to which all variable selection types belong. The variable selection methods should define a method for variables!, whose first argument is a model, and the second argument is a selection strategy. The third and fourth positional arguments are, respectively, a list of variables to be included, and the folds to use for cross-validation. They can be omitted and would default to no default variables, and k-fold cross-validation.
SDeMo.VarianceInflationFactor — TypeVarianceInflationFactor{N}Removes variables one at a time until the largest VIF is lower than N (a floating point number), or the performance of the model stops increasing. Note that the resulting set of variables may have a largest VIF larger than the threshold. See StrictVarianceInflationFactor for an alternative.
SDeMo.WhiteningTransform — TypeWhiteningTransformThe whitening transformation is a linear transformation of the input variables, after which the new variables have unit variance and no correlation. The input is transformed into white noise.
Because this transform will usually keep the first variable "as is", and then apply increasingly important perturbations on the subsequent variables, it is sensitive to the order in which variables are presented, and is less useful when applying tools for interpretation.
This is an alias for MultivariateTransform{Whitening}.
SDeMo.ZScore — TypeZScoreA transformer that scales and centers the data, using only the data that are available to the model at training time.
For all variables in the SDM features (regardless of whether they are used), this transformer will store the observed mean and standard deviation. There is no correction on the sample size, because there is no reason to expect that the sample size will be the same for the training and prediction situation.