
... get data from GBIF?

julia
using SpeciesDistributionToolkit
using DataFrames

Identify the taxa

The first step is to understand how GBIF represents the taxonomic information. The taxon function will take a string (or a GBIF taxonomic ID, but most people tend to call species by their names...) and return a representation of this taxon.

julia
species = taxon("Sitta whiteheadi")
GBIF taxon -- Sitta whiteheadi

An interesting property of the GBIF API is that it returns the full taxonomic information, so we can for example check the phylum of this species:

julia
species.phylum
"Chordata" => 44

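The value shown above is an ordinary Julia Pair holding a name and a GBIF identifier. As a minimal sketch that does not contact GBIF (the pair is copied by hand from the output above rather than read from a taxon object), the two parts can be separated with first and last:

```julia
# Pair copied by hand from the output above; with a live taxon object this
# would be `species.phylum`.
phylum = "Chordata" => 44

phylum_name = first(phylum)  # the name part of the pair
phylum_id = last(phylum)     # the numeric identifier part of the pair

println(phylum_name, " has identifier ", phylum_id)
```

Going by the taxon(id::Int) method documented at the end of this page, the identifier should also be usable to look the phylum up directly, although that is not tried here.
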
Establish search parameters

Now that we are fairly confident that we have the right animal, we can start setting up some search parameters. The search parameters are not given as keyword arguments, but as a vector of pairs (there is a reason, and it is not sufficiently important to spend a paragraph on at this point). We will limit our search to France and Italy (the species is endemic to Corsica), retrieve occurrences 300 at a time (the maximum allowed by the GBIF API), and only focus on georeferenced observations. Of course, we only care about the places where the observations represent a presence, so we will use the "occurrenceStatus" flag to get these records only.

julia
query = [
    "hasCoordinate" => true,
    "country" => "FR",
    "country" => "IT",
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
]
5-element Vector{Pair{String, Any}}:
    "hasCoordinate" => true
          "country" => "FR"
          "country" => "IT"
            "limit" => 300
 "occurrenceStatus" => "PRESENT"

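The page does not say why pairs are used, so the following is only one plausible reading: a vector of pairs can hold the same key twice (as with "country" above), which neither keyword arguments nor a Dict allow. A minimal sketch in plain Julia, with no call to GBIF:

```julia
query = [
    "hasCoordinate" => true,
    "country" => "FR",
    "country" => "IT",
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
]

# A Dict keeps one value per key, so one of the two countries is lost
as_dict = Dict(query)

# The vector keeps both entries, in the order they were written
countries = [last(pair) for pair in query if first(pair) == "country"]

println(length(query), " pairs, ", length(as_dict), " dictionary entries")
println(countries)
```
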
Query occurrence data

We have enough information to start our search for occurrences:

julia
places = occurrences(species, query...)
GBIF records: downloaded 300 out of 4867

This step is doing a few important things. First, it is using the taxon object to filter the results of the API query, so that we will only get observations associated with this taxon. Second, it is bundling the query parameters with the object, so that we can modify it with subsequent requests. Internally, it is also keeping track of the total number of results, in order to retrieve them sequentially. Retrieving results sequentially is useful if you want to perform some operations while you collect results, for example checking that you have enough data and then stopping further queries to the API.

We can count the total number of observations known to GBIF with count:

julia
count(places)
4867

Similarly, we can count how many we actually have with length:

julia
length(places)
300

The package is set up so that the entire array of observations is allocated when we establish contact with the API for the first time, but we can only view the results we have actually retrieved (this is because the records are exposed to the user as a view).

As we know the current and total number of points, we can do a little looping to get all occurrences. Note that the GBIF streaming API has a hard limit at 200000 records, and that querying this amount of data using the streaming API is woefully inefficient. For data volumes above 10000 observations, the suggested solution is to rely on the download interface on GBIF.

julia
while length(places) < count(places)
    occurrences!(places)
end
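
Because each call to occurrences! fetches only one more page, the same loop can stop early once there is enough data. The sketch below uses only the functions introduced on this page. The target of 1000 records is an arbitrary value chosen for illustration, and running it requires network access to GBIF:

```julia
using SpeciesDistributionToolkit

species = taxon("Sitta whiteheadi")
places = occurrences(
    species,
    "hasCoordinate" => true,
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
)

# Arbitrary target for illustration; never ask for more than GBIF reports
target = min(1000, count(places))

# Each iteration retrieves one more page of up to 300 records
while length(places) < target
    occurrences!(places)
end

println("Retrieved ", length(places), " of ", count(places), " records")
```

Since records arrive a full page at a time, the final number retrieved can be somewhat larger than the target.
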

Get information on occurrence data

When this is done, we can have a look at the countries in which the observations were made:

julia
sort(unique([place.country for place in places]))
1-element Vector{String}:
 "France"

We can also establish the time of the first and last observations:

julia
extrema(filter(!ismissing, [place.date for place in places]))
(DateTime("2019-07-04T09:56:09"), DateTime("2024-09-14T10:25:10"))
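
The filter(!ismissing, ...) step matters because some records have no date. The same pattern can be tried without contacting GBIF. In this sketch the dates vector is hand-made stand-in data (the two timestamps are the ones shown above), and skipmissing from Base is shown as an alternative that avoids building an intermediate array:

```julia
using Dates

# Hand-made stand-in for [place.date for place in places]
dates = [
    DateTime("2024-09-14T10:25:10"),
    missing,
    DateTime("2019-07-04T09:56:09"),
]

# Same approach as above: drop the missing values, then take the extrema
first_obs, last_obs = extrema(filter(!ismissing, dates))

# Alternative: iterate lazily over the non-missing values
lazy_first, lazy_last = extrema(skipmissing(dates))

println(first_obs, " to ", last_obs)
```
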

The GBIF results can interact seamlessly with the layer types, which is covered in other vignettes.

Finally, the package implements the interface to Tables.jl, so that we may write:

julia
fields_to_keep = [:key, :publishingCountry, :country, :latitude, :longitude, :date]
select(DataFrame(places), fields_to_keep)[1:10,:]
10×6 DataFrame
 Row │ key         publishingCountry  country  latitude  longitude  date
     │ Int64       String             String   Float64   Float64    DateTime?
─────┼───────────────────────────────────────────────────────────────────────
   1 │ 5063770098  US                 France    42.2857    8.85273  missing
   2 │ 5081660887  NL                 France    42.2662    8.84707  missing
   3 │ 5081683793  NL                 France    42.2655    8.8474   missing
   4 │ 5081657866  NL                 France    42.2657    8.8476   missing
   5 │ 5081675875  NL                 France    42.2656    8.84761  missing
   6 │ 5081645866  NL                 France    42.2681    8.84582  missing
   7 │ 5081682830  NL                 France    42.2689    8.84457  missing
   8 │ 5081662067  NL                 France    42.2657    8.84734  missing
   9 │ 4596933246  US                 France    41.7587    9.36375  missing
  10 │ 4881328036  NL                 France    42.3044    8.9204   missing

GBIF.taxon Function

Get information about a taxon at any level

taxon(name::String)

This function will look for a taxon by its (scientific) name in the GBIF reference taxonomy.

Optional arguments are

  • rank::Union{Symbol,Nothing}=:SPECIES – the rank of the taxon you want. This is part of a controlled vocabulary, and can only be one of :DOMAIN, :CLASS, :CULTIVAR, :FAMILY, :FORM, :GENUS, :INFORMAL, :ORDER, :PHYLUM, :SECTION, :SUBCLASS, :VARIETY, :TRIBE, :KINGDOM, :SUBFAMILY, :SUBFORM, :SUBGENUS, :SUBKINGDOM, :SUBORDER, :SUBPHYLUM, :SUBSECTION, :SUBSPECIES, :SUBTRIBE, :SUBVARIETY, :SUPERCLASS, :SUPERFAMILY, :SUPERORDER, and :SPECIES

  • strict::Bool=true – whether the match should be strict, or fuzzy

Finally, one can also specify other levels of the taxonomy, using kingdom, phylum, class, order, family, and genus, all of which can either be String or Nothing.

If a match is found, the result will be given as a GBIFTaxon. If not, this function will return nothing and give a warning.
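
The optional arguments described above can be sketched as follows. This assumes they are passed as keyword arguments, which the description suggests but does not spell out. The misspelled name and the choice of "Aves" as the class are illustrative choices. Whether the fuzzy match resolves depends on GBIF's name matching, and the calls need network access:

```julia
using SpeciesDistributionToolkit

# Look up a genus rather than a species
genus = taxon("Sitta"; rank = :GENUS)

# Fuzzy match on a deliberately misspelled species name
fuzzy = taxon("Sitta whitehedi"; strict = false)

# Narrow the search by also giving a higher taxonomic level
bird = taxon("Sitta whiteheadi"; class = "Aves")

# Per the description above, a failed match is returned as `nothing`
for candidate in (genus, fuzzy, bird)
    println(isnothing(candidate) ? "no match" : candidate)
end
```
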


Get information about a taxon at any level using taxonID

taxon(id::Int)

This function will look for a taxon by its taxonID in the GBIF reference taxonomy.

GBIF.occurrences Function
julia
occurrences(query::Pair...)

This function will return the latest occurrences matching the queries – usually 20, but this is entirely determined by the server default page size. The query parameters must be given as pairs, and are optional. Omitting the query will return the latest recorded occurrences for all taxa.

The arguments accepted as queries are documented on the GBIF API website.

Note that this function will also return observations where the "occurrenceStatus" is "ABSENT"; therefore, for the majority of uses, your query should at least contain "occurrenceStatus" => "PRESENT".

julia
occurrences(t::GBIFTaxon, query::Pair...)

Returns occurrences for a given taxon – the query arguments are the same as for the occurrences function.

julia
occurrences(t::Vector{GBIFTaxon}, query::Pair...)

Returns occurrences for a series of taxa – the query arguments are the same as for the occurrences function.
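
A sketch of the multi-taxon method, under the assumption that both lookups succeed so that the array literal has element type GBIFTaxon. The second species and the page size of 50 are illustrative choices, and the calls need network access:

```julia
using SpeciesDistributionToolkit

# Two nuthatch species; the second name is only an illustrative choice
taxa = [taxon("Sitta whiteheadi"), taxon("Sitta europaea")]

# Same query syntax as for a single taxon
records = occurrences(
    taxa,
    "hasCoordinate" => true,
    "country" => "FR",
    "limit" => 50,
    "occurrenceStatus" => "PRESENT",
)

println("Retrieved ", length(records), " of ", count(records), " records")
```
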

GBIF.occurrences! Function

Get the next page of results

This function will retrieve the next page of results. By default, it will walk through queries 20 at a time. This can be modified by changing the .query["limit"] value, to any value up to 300, which is the limit set by GBIF for the queries.

If filters have been applied to this query before, they will be removed to ensure that the previous and the new occurrences have the same status, but only for records that have already been retrieved.
