... get data from GBIF?
using SpeciesDistributionToolkit
using DataFrames
Identify the taxa
The first step is to understand how GBIF represents taxonomic information. The taxon function will take a string (or a GBIF taxonomic ID, but most people tend to call species by their names...) and return a representation of this taxon.
species = taxon("Sitta whiteheadi")
GBIF taxon -- Sitta whiteheadi
An interesting property of the GBIF API is that it returns the full taxonomic information, so we can for example check the phylum of this species:
species.phylum
"Chordata" => 44
Establish search parameters
Now that we are fairly confident that we have the right animal, we can start setting up some search parameters. The search parameters are not given as keyword arguments, but as a vector of pairs (there is a reason, and it is not sufficiently important to spend a paragraph on at this point). We will limit our search to France and Italy (the species is endemic to Corsica), retrieve occurrences 300 at a time (the maximum allowed by the GBIF API), and only focus on georeferenced observations. Of course, we only care about the places where the observations represent a presence, so we will use the "occurrenceStatus" flag to get these records only.
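The text leaves the reason for using pairs unstated. One property that plain Julia makes easy to check is that a vector of pairs keeps repeated keys such as "country", whereas a Dict keeps only one value per key. Whether this is the package's actual motivation is an assumption; the sketch below only demonstrates the language behaviour, using a reduced, hypothetical `params` vector.

```julia
# A reduced set of parameters with a repeated "country" key.
# `params`, `countries`, and `as_dict` are illustrative names only.
params = [
    "country" => "FR",
    "country" => "IT",
    "limit" => 300,
]

# Both "country" entries survive in the vector of pairs...
countries = [last(p) for p in params if first(p) == "country"]

# ...but a Dict keeps a single value per key (the last one given).
as_dict = Dict(params)

println(countries)
println(as_dict["country"])
```

The vector holds both countries, while the Dict retains only "IT" for the "country" key.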
query = [
    "hasCoordinate" => true,
    "country" => "FR",
    "country" => "IT",
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
]
5-element Vector{Pair{String, Any}}:
"hasCoordinate" => true
"country" => "FR"
"country" => "IT"
"limit" => 300
"occurrenceStatus" => "PRESENT"
Query occurrence data
We have enough information to start our search of occurrences:
places = occurrences(species, query...)
GBIF records: downloaded 300 out of 4835
This step is doing a few important things. First, it uses the taxon object to filter the results of the API query, so that we only get observations associated with this taxon. Second, it bundles the query parameters with the returned object, so that we can modify it with subsequent requests. Internally, it also keeps track of the total number of results, in order to retrieve them sequentially. Retrieving results sequentially is useful if you want to perform some operations while you collect results, for example checking that you have enough data and then stopping the queries to the API.
We can count the total number of observations known to GBIF with count:
count(places)
4835
Similarly, we can count how many we actually have with length:
length(places)
300
The package is set up so that the entire array of observations is allocated when we establish contact with the API for the first time, but we can only view the results we have actually retrieved (this is because the records are exposed to the user as a view).
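The idea of allocating once and exposing only what has been retrieved can be illustrated with a plain array and `view`. This is a toy sketch and not the package's internals: the names `storage`, `retrieved`, and `visible` are hypothetical, and integers stand in for occurrence records.

```julia
total = 10   # pretend the server reports 10 records in total

# Storage for every record is allocated once, up front.
storage = Vector{Union{Missing, Int}}(missing, total)

# Pretend a first page of 3 records has arrived.
storage[1:3] = [101, 102, 103]
retrieved = 3

# A view exposes only the filled part, without copying it.
visible = view(storage, 1:retrieved)

println(length(storage))   # size of the full allocation
println(length(visible))   # number of records the user can see
```

In this toy version, the difference between the two lengths mirrors the difference between count and length on the real object.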
As we know the current and total number of points, we can do a little looping to get all occurrences. Note that the GBIF streaming API has a hard limit at 200000 records, and that querying this amount of data using the streaming API is woefully inefficient. For data volumes above 10000 observations, the suggested solution is to rely on the download interface on GBIF.
while length(places) < count(places)
    occurrences!(places)
end
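To stop early once enough records are in hand, the loop condition can carry a cap. The sketch below uses a small stand-in type so that it runs without network access. `MockRecords`, `next_page!`, and `enough` are hypothetical names; with real data, the same condition would presumably wrap `length(places)`, `count(places)`, and `occurrences!(places)`.

```julia
# Stand-in for a paged result set: it only tracks how many records exist
# and how many have been fetched so far. Not part of the package.
mutable struct MockRecords
    total::Int       # records known to the (pretend) server
    retrieved::Int   # records fetched so far
    page::Int        # page size
end

# Fetch one more page, never going past the total.
function next_page!(r::MockRecords)
    r.retrieved = min(r.retrieved + r.page, r.total)
    return r
end

records = MockRecords(4835, 300, 300)  # the first page is already fetched
enough = 1000                          # stop once we hold at least this many

while records.retrieved < min(enough, records.total)
    next_page!(records)
end

println(records.retrieved)
```

Because pages arrive whole, the loop can overshoot the cap by up to one page: here it stops at 1200 records rather than exactly 1000.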
Get information on occurrence data
When this is done, we can have a look at the countries in which the observations were made:
sort(unique([place.country for place in places]))
1-element Vector{String}:
"France"
We can also establish the time of the first and last observations:
extrema(filter(!ismissing, [place.date for place in places]))
(DateTime("2019-07-04T09:56:09"), DateTime("2024-09-14T10:25:10"))
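An equivalent way to ignore missing dates is `skipmissing`, which avoids building the filtered array. The sketch uses a small hand-made vector (the two timestamps from the output above plus a `missing`) rather than real records; `dates`, `first_seen`, and `last_seen` are illustrative names.

```julia
using Dates

# Hand-made stand-in for the dates of a few records.
dates = [
    DateTime("2019-07-04T09:56:09"),
    missing,
    DateTime("2024-09-14T10:25:10"),
]

# skipmissing iterates lazily over the non-missing values.
first_seen, last_seen = extrema(skipmissing(dates))

println(first_seen, " to ", last_seen)
```

With real records, the same pattern would presumably read `extrema(skipmissing(place.date for place in places))`.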
The GBIF results can interact seamlessly with the layer types, which is covered in other vignettes.
Finally, the package implements the interface to Tables.jl, so that we may write:
fields_to_keep = [:key, :publishingCountry, :country, :latitude, :longitude, :date]
select(DataFrame(places), fields_to_keep)[1:10,:]
Row | key | publishingCountry | country | latitude | longitude | date |
---|---|---|---|---|---|---|
 | Int64 | String | String | Float64 | Float64 | DateTime? |
1 | 4596933246 | US | France | 41.7587 | 9.36375 | missing |
2 | 4881328036 | NL | France | 42.3044 | 8.9204 | missing |
3 | 4885726743 | NL | France | 42.312 | 8.93905 | missing |
4 | 4887037892 | NL | France | 42.3203 | 8.94863 | missing |
5 | 4888052469 | NL | France | 42.3188 | 8.95731 | missing |
6 | 4891339312 | NL | France | 42.3 | 8.8999 | missing |
7 | 4886590221 | NL | France | 42.2988 | 8.91033 | missing |
8 | 4891105014 | NL | France | 42.2784 | 8.85528 | missing |
9 | 4886317377 | NL | France | 42.3002 | 8.88372 | missing |
10 | 4863745024 | US | France | 42.4634 | 8.99781 | missing |
Related documentation
GBIF.taxon Function
Get information about a taxon at any level
taxon(name::String)
This function will look for a taxon by its (scientific) name in the GBIF reference taxonomy.
Optional arguments are:
rank::Union{Symbol,Nothing}=:SPECIES – the rank of the taxon you want. This is part of a controlled vocabulary, and can only be one of :DOMAIN, :CLASS, :CULTIVAR, :FAMILY, :FORM, :GENUS, :INFORMAL, :ORDER, :PHYLUM, :SECTION, :SUBCLASS, :VARIETY, :TRIBE, :KINGDOM, :SUBFAMILY, :SUBFORM, :SUBGENUS, :SUBKINGDOM, :SUBORDER, :SUBPHYLUM, :SUBSECTION, :SUBSPECIES, :SUBTRIBE, :SUBVARIETY, :SUPERCLASS, :SUPERFAMILY, :SUPERORDER, and :SPECIES
strict::Bool=true – whether the match should be strict, or fuzzy
Finally, one can also specify other levels of the taxonomy, using kingdom, phylum, class, order, family, and genus, all of which can either be String or Nothing.
If a match is found, the result will be given as a GBIFTaxon. If not, this function will return nothing and give a warning.
Get information about a taxon at any level using taxonID
taxon(id::Int)
This function will look for a taxon by its taxonID in the GBIF reference taxonomy.
GBIF.occurrences Function
occurrences(query::Pair...)
This function will return the latest occurrences matching the queries – usually 20, but this is entirely determined by the server default page size. The query parameters must be given as pairs, and are optional. Omitting the query will return the latest recorded occurrences for all taxa.
The arguments accepted as queries are documented on the GBIF API website.
Note that this function will return even observations where the "occurrenceStatus" is "ABSENT"; therefore, for the majority of uses, your query will at least contain "occurrenceStatus" => "PRESENT".
occurrences(t::GBIFTaxon, query::Pair...)
Returns occurrences for a given taxon – the query arguments are the same as for the occurrences function.
occurrences(t::Vector{GBIFTaxon}, query::Pair...)
Returns occurrences for a series of taxa – the query arguments are the same as for the occurrences function.
GBIF.occurrences! Function
Get the next page of results
This function will retrieve the next page of results. By default, it will walk through queries 20 at a time. This can be modified by changing the .query["limit"] value, to any value up to 300, which is the limit set by GBIF for the queries.
If filters have been applied to this query before, they will be removed to ensure that the previous and the new occurrences have the same status, but only for records that have already been retrieved.