The dataset interface
This page is meant for contributors to the package, and specifically provides information on the interface, what to overload, and why.
All of the methods that form the interface have two versions: one for current data, and one for future data. The default behavior of the interface is for the version on future data to fall back to the version for current data (i.e. we assume that future data are provided with the same format as current data). This means that most of the functions will not need to be overloaded when adding a provider with support for future data.
The interface is built around the idea that Julia will use the most specific method that matches a call, and fall back to more generic ones otherwise. A good example is the `BioClim` dataset, which is provided by a number of sources and often has different URLs and filenames. This is handled (in e.g. `CHELSA2`) by writing a method for the general case of any dataset `RasterData{CHELSA2,T}` (using a `Union` type), and then a specific method on `RasterData{CHELSA2,BioClim}`. In the case of `CHELSA2`, the general method handles all datasets except `BioClim`, which makes the code much easier to write and maintain.
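The dispatch pattern described above can be sketched with stand-in types. Everything below is a simplified assumption made for illustration: the type definitions only mirror the names used in the text, and `naming_rule` is a hypothetical function, not part of the package.

```julia
# Stand-in type hierarchy (assumed; the package's real definitions may differ)
abstract type RasterProvider end
abstract type RasterDataset end

struct CHELSA2 <: RasterProvider end
struct BioClim <: RasterDataset end
struct AverageTemperature <: RasterDataset end
struct Precipitation <: RasterDataset end

struct RasterData{P <: RasterProvider, D <: RasterDataset} end

# Union of the datasets assumed to be offered by this provider
const CHELSA2Dataset = Union{BioClim, AverageTemperature, Precipitation}

# General method: applies to any dataset in the union
naming_rule(::RasterData{CHELSA2, T}) where {T <: CHELSA2Dataset} = "general"

# Specific method: only BioClim; Julia selects it because it is more specific
naming_rule(::RasterData{CHELSA2, BioClim}) = "BioClim-specific"

naming_rule(RasterData{CHELSA2, Precipitation}())  # "general"
naming_rule(RasterData{CHELSA2, BioClim}())        # "BioClim-specific"
```

Only the exceptional dataset needs its own method; every other member of the union is covered by the general one.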
Compatibility between datasets and providers
The inner constructor for `RasterData` involves a call to `provides`, which must return `true` for the type to be constructed. The generic method for `provides` returns `false`, so additional provider/dataset pairs must be overloaded to return `true` in order for the corresponding `RasterData` type to exist.
In practice, especially when there are multiple datasets for a single provider, the easiest way is to define a `Union` type and overload based on membership in this union type, as touched upon earlier in this document.
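As a minimal sketch of this mechanism, the block below pairs a `provides` fallback with an inner constructor that refuses incompatible combinations. The struct layout, the error message, and the `ToyProvider`, `Elevation`, and `LandCover` names are assumptions made for illustration, not the package's actual code.

```julia
abstract type RasterProvider end
abstract type RasterDataset end

struct ToyProvider <: RasterProvider end  # hypothetical provider
struct Elevation <: RasterDataset end     # hypothetical datasets
struct LandCover <: RasterDataset end
struct BioClim <: RasterDataset end

# Generic fallback: nothing is provided unless explicitly declared
provides(::Type{P}, ::Type{D}) where {P <: RasterProvider, D <: RasterDataset} = false

# A single overload covers every dataset in the union
const ToyDatasets = Union{Elevation, LandCover}
provides(::Type{ToyProvider}, ::Type{D}) where {D <: ToyDatasets} = true

struct RasterData{P <: RasterProvider, D <: RasterDataset}
    provider::Type{P}
    dataset::Type{D}
    # The inner constructor checks compatibility before building the object
    function RasterData(::Type{P}, ::Type{D}) where {P <: RasterProvider, D <: RasterDataset}
        provides(P, D) || error("The $(D) dataset is not provided by $(P)")
        return new{P, D}(P, D)
    end
end

RasterData(ToyProvider, Elevation)   # constructs a RasterData{ToyProvider, Elevation}
# RasterData(ToyProvider, BioClim)   # would throw, since provides returns false
```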
SimpleSDMDatasets.provides Function
provides(::Type{P}, ::Type{D}) where {P <: RasterProvider, D <: RasterDataset}
This is the core function upon which the entire interface is built. Its purpose is to specify whether a specific dataset is provided by a specific provider. Note that this function takes two arguments, as opposed to a `RasterData` argument, because it is called in the inner constructor of `RasterData`: you cannot instantiate a `RasterData` with an incompatible provider/dataset combination.
The default value of this function is `false`, and to allow the use of a dataset with a provider, it is required to overload it for this specific pair so that it returns `true`.
provides(::R, ::F) where {R <: RasterData, F <: Future}
This method for `provides` specifies whether a `RasterData` combination has support for the value of the `Future` (a combination of a `FutureScenario` and a `FutureModel`) given as the second argument. Note that this function is not called as part of the `Future` constructor (because models and scenarios are messy and dataset dependent), but is still called when requesting data.
The default value of this function is `false`, and to allow the use of a future dataset with a given provider, it is required to overload it so that it returns `true`.
Type of object downloaded
The format of downloaded files is specified by `downloadtype`. By default, we assume that a request to a usable dataset returns a single file, but this can be overloaded for providers that return an archive.
SimpleSDMDatasets.downloadtype Function
downloadtype(::R) where {R <: RasterData}
This method returns a `RasterDownloadType` that is used internally to be more explicit about the type of object that is downloaded from the raster source. The supported values are `_file` (the default, which is a single file: ASCII, GeoTIFF, NetCDF, etc.), and `_zip` (a zip archive containing files). This is a trait because we cannot trust file extensions.
downloadtype(data::R, ::F) where {R <: RasterData, F <: Future}
This method provides the type of the downloaded object for a combination of a raster source and a future scenario as a `RasterDownloadType`.
If no overload is given, this will default to `downloadtype(data)`, as we can assume that the type of downloaded object is the same for both current and future scenarios.
The return value of `downloadtype` must be a member of the `RasterDownloadType` enum, which can be extended if adding a new provider requires a new format for the download.
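The fallback from the future version to the current version can be sketched as follows. The enum values come from the text above, while the `Future` stand-in, `ToyScenario`, `ArchiveProvider`, and `Elevation` are hypothetical names introduced for the example.

```julia
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

# Simplified stand-in for the package's Future type (assumption)
abstract type Future end
struct ToyScenario <: Future end

@enum RasterDownloadType _file _zip

# Default for current data: a single file
downloadtype(::R) where {R <: RasterData} = _file

# Default for future data: defer to the method for current data
downloadtype(data::R, ::F) where {R <: RasterData, F <: Future} = downloadtype(data)

# Hypothetical provider that ships zip archives
struct ArchiveProvider <: RasterProvider end
struct Elevation <: RasterDataset end
downloadtype(::RasterData{ArchiveProvider, D}) where {D <: RasterDataset} = _zip

data = RasterData{ArchiveProvider, Elevation}()
downloadtype(data)                 # _zip
downloadtype(data, ToyScenario())  # _zip, through the fallback
```

Overloading only the one-argument method is enough here; a two-argument method is needed only when future data come in a different form.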
Type of object stored
The format of the information contained in the downloaded object is specified by `filetype`. By default, we assume that a request to a usable dataset returns a `tiff`, but this can be overloaded for providers that return data in another format. Note that if the download type is an archive, the file type describes the format of the files within the archive.
SimpleSDMDatasets.filetype Function
filetype(::R) where {R <: RasterData}
This method returns a `RasterFileType` that represents the format of the raster data. `RasterFileType` is an enumerated type. This overload is particularly important as it will determine how the returned file path should be read.
The default value is `_tiff`.
filetype(data::R, ::F) where {R <: RasterData, F <: Future}
This method provides the format of the stored raster for a combination of a raster source and a future scenario as a `RasterFileType`.
If no overload is given, this will default to `filetype(data)`, as we can assume that the raster format is the same for both current and future scenarios.
The return value of `filetype` must be a member of the `RasterFileType` enum, which can be extended if adding a new provider requires a new file format.
Available resolutions
SimpleSDMDatasets.resolutions Function
resolutions(::R) where {R <: RasterData}
This method lists the resolutions (i.e. grid sizes) at which the dataset is available. If it returns `nothing` (the default), it means that the dataset is only given at a single, fixed resolution.
An overload of this method is required when there are multiple resolutions available, and must return a `Dict` with numeric keys (for the resolution) and string values giving an explanation of each resolution.
Any dataset with a return value that is not `nothing` must accept the `resolution` keyword.
resolutions(data::R, ::F) where {R <: RasterData, F <: Future}
This method controls the `resolutions` for a future dataset. Unless overloaded, it will return `resolutions(data)`.
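A `resolutions` overload might look like the sketch below. The providers, the dataset, the resolution values and their descriptions, and the `supports_resolution` helper are all hypothetical; only the shape of the return value (a `Dict` from numbers to strings, or `nothing`) follows the description above.

```julia
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct ToyProvider <: RasterProvider end    # hypothetical, several resolutions
struct OtherProvider <: RasterProvider end  # hypothetical, single resolution
struct Elevation <: RasterDataset end

# Default: the dataset comes at a single, fixed resolution
resolutions(::R) where {R <: RasterData} = nothing

# Overload: numeric keys, human-readable explanations as values
resolutions(::RasterData{ToyProvider, D}) where {D <: RasterDataset} =
    Dict(0.5 => "30 arc seconds", 5.0 => "5 arc minutes", 10.0 => "10 arc minutes")

# Hypothetical helper checking a `resolution` keyword against the declared values
function supports_resolution(data::RasterData, resolution)
    available = resolutions(data)
    return !isnothing(available) && haskey(available, resolution)
end

supports_resolution(RasterData{ToyProvider, Elevation}(), 5.0)    # true
supports_resolution(RasterData{ToyProvider, Elevation}(), 2.5)    # false
supports_resolution(RasterData{OtherProvider, Elevation}(), 5.0)  # false
```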
Available layers
SimpleSDMDatasets.layers Function
layers(::R) where {R <: RasterData}
This method controls whether the dataset has named layers. If this is `nothing` (the default), it means that the dataset will have a single layer.
An overload of this method is required when there are multiple layers available, and must return a `Vector`, usually of `String`. Note that by default, the layers can also be accessed by using an `Integer`, in which case `layer=i` will be the i-th entry in `layers(data)`.
Any dataset with a return value that is not `nothing` must accept the `layer` keyword.
SimpleSDMDatasets.layerdescriptions Function
layerdescriptions(data::R) where {R <: RasterData}
Human-readable names of the layers. This must be a dictionary mapping the layer names (as returned by `layers`) to a string explaining the contents of the layers.
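The relationship between `layers`, `layerdescriptions`, and integer access can be sketched as follows. The layer names, the placeholder descriptions, and the `layername` helper are assumptions made for illustration; the package resolves the `layer` keyword in its own way.

```julia
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct ToyProvider <: RasterProvider end  # hypothetical provider
struct BioClim <: RasterDataset end

# Default: a single, unnamed layer
layers(::R) where {R <: RasterData} = nothing

# Overload: a vector of layer names (names assumed for the example)
layers(::RasterData{ToyProvider, BioClim}) = ["BIO$(i)" for i in 1:19]

# Dictionary from layer name to explanation (placeholder text)
layerdescriptions(data::RasterData{ToyProvider, BioClim}) =
    Dict(name => "Placeholder description for $(name)" for name in layers(data))

# Hypothetical helper: an integer i maps to the i-th entry of layers(data)
layername(data::RasterData, layer::Integer) = layers(data)[layer]
function layername(data::RasterData, layer::AbstractString)
    layer in layers(data) || throw(ArgumentError("unknown layer $(layer)"))
    return String(layer)
end

data = RasterData{ToyProvider, BioClim}()
layername(data, 12)                                # "BIO12"
layerdescriptions(data)[layername(data, "BIO3")]   # "Placeholder description for BIO3"
```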
Available months
SimpleSDMDatasets.months Function
months(::R) where {R <: RasterData}
This method controls whether the dataset has monthly layers. If this is `nothing` (the default), it means that the dataset is not accessible at a monthly resolution.
An overload of this method is required when there are multiple months available, and must return a `Vector{Dates.Month}`.
Any dataset with a return value that is not `nothing` must accept the `month` keyword.
Available years
SimpleSDMDatasets.timespans Function
timespans(data::R, ::F) where {R <: RasterData, F <: Future}
For datasets with a `Future` scenario, this method should return a `Vector` of `Pair`s, which are formatted as
Year(start) => Year(end)
There is a method working on a single `RasterData` argument, defaulting to returning `nothing`, but it should never be overloaded.
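A `timespans` overload could be sketched as below, using the `Year` type from the `Dates` standard library. The `Future` stand-in, `ToyScenario`, `ToyProvider`, and the year ranges are illustrative assumptions, not values taken from any real provider.

```julia
using Dates

abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

# Simplified stand-in for the package's Future type (assumption)
abstract type Future end
struct ToyScenario <: Future end

struct ToyProvider <: RasterProvider end  # hypothetical provider
struct BioClim <: RasterDataset end

# Single-argument method: returns nothing and is left alone
timespans(::R) where {R <: RasterData} = nothing

# Future version: a vector of Year(start) => Year(end) pairs (placeholder years)
timespans(::RasterData{ToyProvider, BioClim}, ::F) where {F <: Future} =
    [Year(2021) => Year(2040), Year(2041) => Year(2060)]

spans = timespans(RasterData{ToyProvider, BioClim}(), ToyScenario())
first(spans).first   # Year(2021)
```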
Additional keyword arguments
SimpleSDMDatasets.extrakeys Function
extrakeys(::R) where {R <: RasterData}
This method controls whether the dataset has additional keys. If this is `nothing` (the default), it means that the dataset can be accessed using only the default keys specified in this interface.
An overload of this method is required when there are additional keywords needed to access the data (e.g. `full=true` for the `EarthEnv` land-cover data), and must return a `Dict`, with `Symbol` keys and `Tuple`s of pairs as values.
The key is the keyword argument passed to `downloader` and the tuple lists all accepted values, in the format `value => explanation`.
Any dataset with a return value that is not `nothing` must accept the keyword arguments specified in the return value.
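The expected shape of the `extrakeys` return value can be sketched as follows. The `full` keyword comes from the example in the text; the explanation strings are placeholders, and `checkkeys` is a hypothetical validation helper rather than a function of the package.

```julia
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct EarthEnv <: RasterProvider end
struct LandCover <: RasterDataset end

# Default: no keywords beyond those defined by the interface
extrakeys(::R) where {R <: RasterData} = nothing

# Overload: Symbol keys, tuples of `value => explanation` pairs as values
extrakeys(::RasterData{EarthEnv, LandCover}) = Dict(
    :full => (
        true => "placeholder explanation for full=true",
        false => "placeholder explanation for full=false",
    ),
)

# Hypothetical helper: reject keywords or values that were not declared
function checkkeys(data::RasterData; kwargs...)
    accepted = extrakeys(data)
    for (key, value) in kwargs
        isnothing(accepted) && throw(ArgumentError("no extra keywords are accepted"))
        haskey(accepted, key) || throw(ArgumentError("unknown keyword $(key)"))
        value in first.(accepted[key]) || throw(ArgumentError("invalid value for $(key)"))
    end
    return true
end

checkkeys(RasterData{EarthEnv, LandCover}(); full = true)   # true
```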
URL for the data to download
SimpleSDMDatasets.source Function
source(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset}
This method specifies the URL for the data. It defaults to `nothing`, so this method must be overloaded.
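A `source` overload typically builds the URL from the keyword arguments. In the sketch below, the provider, the dataset, the `resolution` keyword default, and the `example.org` address are all placeholders chosen for illustration.

```julia
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct ToyProvider <: RasterProvider end  # hypothetical provider
struct Elevation <: RasterDataset end     # hypothetical dataset

# Generic fallback: no URL is known
source(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset} = nothing

# Overload: the URL depends on the requested resolution (placeholder address)
function source(::RasterData{ToyProvider, Elevation}; resolution = 10, kwargs...)
    return "https://example.org/toy/elevation_$(resolution)m.tif"
end

source(RasterData{ToyProvider, Elevation}(); resolution = 5)
# "https://example.org/toy/elevation_5m.tif"
```

Accepting `kwargs...` in the overload lets callers pass keywords the method does not use without triggering an error.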
Path to the data locally
SimpleSDMDatasets.destination Function
destination(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset}
This method specifies where the data should be stored locally. By default, it is the `_LAYER_PATH`, followed by the provider name, followed by the dataset name.
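The default layout described above can be sketched with `joinpath`. The value given to `_LAYER_PATH` here is a placeholder, and the use of `nameof` to obtain the provider and dataset names is an assumption about how such a default could be written, not the package's actual implementation.

```julia
abstract type RasterProvider end
abstract type RasterDataset end
struct RasterData{P <: RasterProvider, D <: RasterDataset} end

struct ToyProvider <: RasterProvider end  # hypothetical provider
struct Elevation <: RasterDataset end     # hypothetical dataset

# Placeholder root folder; the package defines its own _LAYER_PATH
const _LAYER_PATH = joinpath(tempdir(), "toy_layers")

# Root folder, then provider name, then dataset name
destination(::RasterData{P, D}; kwargs...) where {P <: RasterProvider, D <: RasterDataset} =
    joinpath(_LAYER_PATH, String(nameof(P)), String(nameof(D)))

destination(RasterData{ToyProvider, Elevation}())
# <tempdir>/toy_layers/ToyProvider/Elevation
```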
URL for additional information
The `url` method will display one URL redirecting users to either the description of the provider, or the description of the dataset. At a minimum, the version for the `RasterProvider` should be specified. Note that this must return a Markdown string.
Most `RasterDataset`s will have a default blurb, but more specific ones (i.e. adapted to a particular provider) can be provided.
SimpleSDMDatasets.url Function
url(::P) where {P <: DataProvider}
The URL for the data provider - if there is no specific URL for each dataset, it is enough to define this one.
Additional information about a dataset
The `blurb` is a short text explaining what the dataset / provider is about. At a minimum, the version for the `RasterProvider` should be specified. In some cases, it is acceptable to only define a version for one `RasterDataset` and any `RasterProvider`, although a more specific dispatch can be implemented.