The dataset interface

This page is meant for contributors to the package, and specifically provides information on the interface, what to overload, and why.

All of the methods that form the interface have two versions: one for current data, and one for future data. The default behavior of the interface is for the version on future data to fall back to the version for current data (i.e. we assume that future data are provided with the same format as current data). This means that most of the functions will not need to be overloaded when adding a provider with support for future data.

The interface relies on Julia's multiple dispatch: when several methods match a call, Julia uses the most specific one, and falls back to a more generic method only when no more specific method applies. A good example is the BioClim dataset, which is provided by a number of sources, often with different URLs and filenames. This is handled (in e.g. CHELSA2) by writing a method for the general case of any dataset, RasterData{CHELSA2,T} (where T is restricted using a Union type), and then a specific method on RasterData{CHELSA2,BioClim}. In the case of CHELSA2, the general method handles all datasets except BioClim, which makes the code much easier to write and maintain.
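The dispatch pattern described above can be sketched as follows. This is a minimal illustration, not the package's code: the types are stand-ins declared locally so the block runs on its own, and `urlstem` and `AverageTemperature` are hypothetical names chosen for the example.

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end

struct CHELSA2 <: RasterProvider end
struct BioClim <: RasterDataset end
struct AverageTemperature <: RasterDataset end  # hypothetical second dataset

struct RasterData{P <: RasterProvider, D <: RasterDataset} end

# General method: applies to any dataset served by CHELSA2.
# `urlstem` is a hypothetical helper, not a function of the package.
urlstem(::RasterData{CHELSA2, T}) where {T <: RasterDataset} = "general"

# Specific method: BioClim gets its own rule and takes precedence.
urlstem(::RasterData{CHELSA2, BioClim}) = "bioclim"

general_case = urlstem(RasterData{CHELSA2, AverageTemperature}())  # uses the general method
special_case = urlstem(RasterData{CHELSA2, BioClim}())             # uses the specific method
```

Because the BioClim method is strictly more specific, no conditional logic is needed in the general method to exclude it.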

Compatibility between datasets and providers

The inner constructor for RasterData involves a call to provides, which must return true for the type to be constructed. The generic method for provides returns false, so additional provider/dataset pairs must be overloaded to return true in order for the corresponding RasterData type to exist.

In practice, especially when there are multiple datasets for a single provider, the easiest way is to define a Union type and overload based on membership in this union type, as touched upon earlier in this document.
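As a rough sketch of this mechanism, assuming stand-in types that mimic the ones described on this page (`DemoProvider`, the `Demo*` datasets, and `DemoDatasets` are all hypothetical), the generic `provides` returns `false`, a Union-based overload opts a group of datasets in, and the inner constructor refuses anything else:

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end

# Generic fallback: no provider/dataset pair is supported by default.
provides(::Type{P}, ::Type{D}) where {P <: RasterProvider, D <: RasterDataset} = false

# Simplified stand-in for RasterData; the inner constructor is the only way in.
struct RasterData{P <: RasterProvider, D <: RasterDataset}
    function RasterData(::Type{P}, ::Type{D}) where {P <: RasterProvider, D <: RasterDataset}
        provides(P, D) || error("$(P) does not provide $(D)")
        return new{P, D}()
    end
end

# Hypothetical provider and datasets.
struct DemoProvider <: RasterProvider end
struct DemoTemperature <: RasterDataset end
struct DemoPrecipitation <: RasterDataset end
struct DemoElevation <: RasterDataset end

# One overload covers every dataset in the union.
const DemoDatasets = Union{DemoTemperature, DemoPrecipitation}
provides(::Type{DemoProvider}, ::Type{D}) where {D <: DemoDatasets} = true

ok = RasterData(DemoProvider, DemoPrecipitation)

# DemoElevation is not in the union, so construction fails.
rejected = try
    RasterData(DemoProvider, DemoElevation)
    false
catch
    true
end
```

Adding a dataset to the provider then only requires adding it to the union.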

SimpleSDMDatasets.provides - Function
provides(::Type{P}, ::Type{D}) where {P <: RasterProvider, D <: RasterDataset}

This is the core function upon which the entire interface is built. Its purpose is to specify whether a specific dataset is provided by a specific provider. Note that this function takes two arguments, as opposed to a RasterData argument, because it is called in the inner constructor of RasterData: you cannot instantiate a RasterData with an incompatible provider/dataset combination.

The default value of this function is false, and to allow the use of a dataset with a provider, it is required to overload it for this specific pair so that it returns true.

provides(::R, ::F) where {R <: RasterData, F <: Future}

This method for provides specifies whether a RasterData combination has support for the value of the Future (a combination of a FutureScenario and a FutureModel) given as the second argument. Note that this function is not called as part of the Future constructor (because models and scenarios are messy and dataset dependent), but is still called when requesting data.

The default value of this function is false, and to allow the use of a future dataset with a given provider, it is required to overload it so that it returns true.


Type of object downloaded

The specification about the format of downloaded files is managed by downloadtype. By default, we assume that a request to a usable dataset is returning a single file, but this can be overloaded for the providers who return an archive.

SimpleSDMDatasets.downloadtype - Function
downloadtype(::R) where {R <: RasterData}

This method returns a RasterDownloadType that is used internally to be more explicit about the type of object that is downloaded from the raster source. The supported values are _file (the default, which is an ascii, geotiff, NetCDF, etc. single file), and _zip (a zip archive containing files). This is a trait because we cannot trust file extensions.

downloadtype(data::R, ::F) where {R <: RasterData, F <: Future}

This method provides the type of the downloaded object for a combination of a raster source and a future scenario as a RasterDownloadType.

If no overload is given, this will default to downloadtype(data), as we can assume that the type of downloaded object is the same for both current and future scenarios.
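The fallback from the future version to the current version can be sketched like this. The enum and method signatures follow the descriptions on this page, but the types are local stand-ins, and `ArchiveProvider`, `SingleFileProvider`, `DemoDataset`, and `DemoFuture` are hypothetical names.

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
abstract type Future end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

@enum RasterDownloadType _file _zip

# Default for current data: a single file.
downloadtype(::R) where {R <: RasterData} = _file

# Default for future data: reuse whatever the current-data method says.
downloadtype(data::R, ::F) where {R <: RasterData, F <: Future} = downloadtype(data)

# Hypothetical providers, dataset, and future.
struct ArchiveProvider <: RasterProvider end
struct SingleFileProvider <: RasterProvider end
struct DemoDataset <: RasterDataset end
struct DemoFuture <: Future end

# Only the current-data method is overloaded for the archive provider.
downloadtype(::RasterData{ArchiveProvider, D}) where {D <: RasterDataset} = _zip

archive = RasterData{ArchiveProvider, DemoDataset}()
single = RasterData{SingleFileProvider, DemoDataset}()

future_archive = downloadtype(archive, DemoFuture())  # follows the overload for current data
future_single = downloadtype(single, DemoFuture())    # follows the generic default
```

The same fallback pattern applies to the other two-argument methods of the interface, so a provider usually overloads the single-argument version only.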


The value returned by downloadtype must be an instance of the RasterDownloadType enum, which can be extended if adding a new provider requires a new format for the download.

SimpleSDMDatasets.RasterDownloadType - Type
RasterDownloadType

This enum stores the possible types of downloaded files. They are listed with instances(RasterDownloadType), and are currently limited to _file (a file, can be read directly) and _zip (an archive, must be unzipped).


Type of object stored

The specification about the format of the information contained in the downloaded type is managed by filetype. By default, we assume that a request to a usable dataset is returning a tiff, but this can be overloaded for the providers who return data in another format. Note that if the download type is an archive, the file type describes the format of the files within the archive.

SimpleSDMDatasets.filetype - Function
filetype(::R) where {R <: RasterData}

This method returns a RasterFileType that represents the format of the raster data. RasterFileType is an enumerated type. This overload is particularly important as it will determine how the returned file path should be read.

The default value is _tiff.

filetype(data::R, ::F) where {R <: RasterData, F <: Future}

This method provides the format of the stored raster for a combination of a raster source and a future scenario as a RasterFileType.

If no overload is given, this will default to filetype(data), as we can assume that the raster format is the same for both current and future scenarios.


The value returned by filetype must be an instance of the RasterFileType enum, which can be extended if adding a new provider requires a new file format.

Available resolutions

SimpleSDMDatasets.resolutions - Function
resolutions(::R) where {R <: RasterData}

This method controls whether the dataset is available at several resolutions, i.e. grid sizes. If it returns nothing (the default), the dataset is only available at a single, fixed resolution.

An overload of this method is required when there are multiple resolutions available, and must return a Dict with numeric keys (for the resolution) and string values (giving the textual representation of these keys, usually in the way that is usable to build the url).

Any dataset with a return value that is not nothing must accept the resolution keyword.
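A possible overload is sketched below, with stand-in types so that it runs on its own. The resolution values, the URL, and the `demo_url` helper are made up for the illustration and do not correspond to a real provider.

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct DemoProvider <: RasterProvider end
struct DemoDataset <: RasterDataset end

# Default: the dataset comes at a single, fixed resolution.
resolutions(::R) where {R <: RasterData} = nothing

# Overload: numeric keys, with the text used to build the URL as values.
resolutions(::RasterData{DemoProvider, D}) where {D <: RasterDataset} =
    Dict(2.5 => "2.5m", 5.0 => "5m", 10.0 => "10m")

# Hypothetical helper showing how the `resolution` keyword maps to a URL fragment.
function demo_url(data::RasterData; resolution = 10.0)
    fragment = resolutions(data)[resolution]
    return "https://example.org/demo_$(fragment).zip"
end

data = RasterData{DemoProvider, DemoDataset}()
url = demo_url(data; resolution = 2.5)
```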

resolutions(data::R, ::F) where {R <: RasterData, F <: Future}

This method controls the resolutions for a future dataset. Unless overloaded, it will return resolutions(data).


Available layers

SimpleSDMDatasets.layers - Function
layers(::R) where {R <: RasterData}

This method controls whether the dataset has named layers. If this is nothing (the default), it means that the dataset will have a single layer.

An overload of this method is required when there are multiple layers available, and must return a Vector, usually of String. Note that by default, the layers can also be accessed by using an Integer, in which case layer=i will be the i-th entry in layers(data).

Any dataset with a return value that is not nothing must accept the layer keyword.

SimpleSDMDatasets.layerdescriptions - Function
layerdescriptions(data::R) where {R <: RasterData}

Human-readable names of the layers. This must be a dictionary mapping the layer names (as returned by layers) to a string explaining the contents of each layer.
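The relationship between layers, layerdescriptions, and the layer keyword might look like the sketch below. The types are local stand-ins, and `resolve_layer` is a hypothetical helper that only illustrates how an Integer can be turned into a layer name; the package may handle this differently.

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct DemoProvider <: RasterProvider end
struct DemoClimate <: RasterDataset end

# Default: a single, unnamed layer.
layers(::R) where {R <: RasterData} = nothing

# Overload: the names of the available layers.
layers(::RasterData{DemoProvider, DemoClimate}) = ["BIO1", "BIO12"]

# One description per name returned by `layers`.
layerdescriptions(::RasterData{DemoProvider, DemoClimate}) = Dict(
    "BIO1" => "Annual mean temperature",
    "BIO12" => "Annual precipitation",
)

# Hypothetical helper: layer=i selects the i-th entry of layers(data).
resolve_layer(data::RasterData, layer::Integer) = layers(data)[layer]
resolve_layer(data::RasterData, layer::AbstractString) =
    layer in layers(data) ? layer : throw(ArgumentError("unknown layer $(layer)"))

data = RasterData{DemoProvider, DemoClimate}()
by_index = resolve_layer(data, 2)
by_name = resolve_layer(data, "BIO1")
```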


Available months

SimpleSDMDatasets.months - Function
months(::R) where {R <: RasterData}

This method controls whether the dataset has monthly layers. If this is nothing (the default), it means that the dataset is not accessible at a monthly resolution.

An overload of this method is required when there are multiple months available, and must return a Vector{Dates.Month}.

Any dataset with a return value that is not nothing must accept the month keyword.
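For a dataset with one layer per calendar month, an overload could be as short as the following sketch (stand-in types; `DemoProvider` and `DemoMonthly` are hypothetical):

```julia
using Dates: Month

# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct DemoProvider <: RasterProvider end
struct DemoMonthly <: RasterDataset end

# Default: no monthly layers.
months(::R) where {R <: RasterData} = nothing

# Overload: all twelve months, as a Vector{Month}.
months(::RasterData{DemoProvider, DemoMonthly}) = Month.(1:12)

data = RasterData{DemoProvider, DemoMonthly}()
available = months(data)
```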


Available years

SimpleSDMDatasets.timespans - Function
timespans(data::R, ::F) where {R <: RasterData, F <: Future}

For datasets with a Future scenario, this method should return a Vector of Pairs, which are formatted as

Year(start) => Year(end)

There is a method working on a single RasterData argument, defaulting to returning nothing, but it should never be overloaded.
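A sketch of a timespans overload, under the assumption of stand-in types and made-up periods (`DemoFuture`, `DemoProvider`, and `DemoDataset` are hypothetical):

```julia
using Dates: Year

# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
abstract type Future end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct DemoProvider <: RasterProvider end
struct DemoDataset <: RasterDataset end
struct DemoFuture <: Future end

# Single-argument method: returns nothing and is left alone.
timespans(::R) where {R <: RasterData} = nothing

# Two-argument method: each entry is Year(start) => Year(end).
timespans(::RasterData{DemoProvider, DemoDataset}, ::DemoFuture) =
    [Year(2021) => Year(2040), Year(2041) => Year(2060)]

data = RasterData{DemoProvider, DemoDataset}()
spans = timespans(data, DemoFuture())
```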


Additional keyword arguments

SimpleSDMDatasets.extrakeys - Function
extrakeys(::R) where {R <: RasterData}

This method controls whether the dataset has additional keys. If this is nothing (the default), it means that the dataset can be accessed using only the default keys specified in this interface.

An overload of this method is required when additional keywords are needed to access the data (e.g. full=true for the EarthEnv land-cover data). It must return a Dict with Symbol keys and Tuple values, where each key is a keyword argument passed to downloader and the corresponding tuple lists all accepted values.

Any dataset with a return value that is not nothing must accept the keyword arguments specified in the return value.
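The shape of the returned Dict, and one way it could be used to validate keyword arguments, is sketched below. The types are stand-ins, and `keys_accepted` is a hypothetical helper, not part of the package.

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct DemoProvider <: RasterProvider end
struct DemoLandCover <: RasterDataset end

# Default: no additional keywords.
extrakeys(::R) where {R <: RasterData} = nothing

# Overload: keyword name => tuple of accepted values.
extrakeys(::RasterData{DemoProvider, DemoLandCover}) = Dict(:full => (true, false))

# Hypothetical helper: checks every keyword against the accepted values.
function keys_accepted(data::RasterData; kwargs...)
    allowed = extrakeys(data)
    for (key, value) in kwargs
        isnothing(allowed) && return false
        haskey(allowed, key) || return false
        value in allowed[key] || return false
    end
    return true
end

data = RasterData{DemoProvider, DemoLandCover}()
valid = keys_accepted(data; full = true)
unknown_key = keys_accepted(data; other = 1)
bad_value = keys_accepted(data; full = "yes")
```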


URL for the data to download

SimpleSDMDatasets.source - Function
source(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset}

This method specifies the URL for the data. It defaults to nothing, so this method must be overloaded.


Path to the data locally

SimpleSDMDatasets.destination - Function
destination(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset}

This method specifies where the data should be stored locally. By default, it is the _LAYER_PATH, followed by the provider name, followed by the dataset name.
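The source and destination methods can be sketched together as follows. This is an approximation under stated assumptions: the types are stand-ins, the URL is a placeholder, the value given to `_LAYER_PATH` here is invented for the example, and using `nameof` to derive folder names is a guess at how the provider and dataset names are obtained.

```julia
# Stand-in types, declared here so the sketch runs without the package.
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct DemoProvider <: RasterProvider end
struct DemoDataset <: RasterDataset end

# Assumed storage root for this example only.
const _LAYER_PATH = joinpath(tempdir(), "demo_layers")

# Default source: nothing, so every provider has to overload it.
source(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset} = nothing

# Overload with a placeholder URL.
source(::RasterData{DemoProvider, DemoDataset}; kwargs...) = "https://example.org/demo/layer.tif"

# Default destination: storage root, then provider name, then dataset name.
destination(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset} =
    joinpath(_LAYER_PATH, string(nameof(P)), string(nameof(D)))

data = RasterData{DemoProvider, DemoDataset}()
url = source(data)
folder = destination(data)
```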
