... get data from GBIF?

julia
using SpeciesDistributionToolkit
using DataFrames

Identify the taxa

The first step is to understand how GBIF represents the taxonomic information. The taxon function will take a string (or a GBIF taxonomic ID, but most people tend to call species by their names...) and return a representation of this taxon.

julia
species = taxon("Sitta whiteheadi")
GBIF taxon -- Sitta whiteheadi

An interesting property of the GBIF API is that it returns the full taxonomic information, so we can for example check the phylum of this species:

julia
species.phylum
"Chordata" => 44

Establish search parameters

Now that we are fairly confident that we have the right animal, we can start setting up some search parameters. The search parameters are not given as keyword arguments, but as a vector of pairs (there is a reason, and it is not sufficiently important to spend a paragraph on at this point). We will limit our search to France and Italy (the species is endemic to Corsica), retrieve occurrences 300 at a time (the maximum allowed by the GBIF API), and only keep georeferenced observations. Of course, we only care about the places where the observations represent a presence, so we will use the "occurrenceStatus" flag to get these records only.

julia
query = [
    "hasCoordinate" => true,
    "country" => "FR",
    "country" => "IT",
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
]
5-element Vector{Pair{String, Any}}:
    "hasCoordinate" => true
          "country" => "FR"
          "country" => "IT"
            "limit" => 300
 "occurrenceStatus" => "PRESENT"
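
One practical consequence of passing the parameters as a vector of pairs, rather than as a dictionary or keyword arguments, is that the same key can appear more than once, as "country" does above. The sketch below uses only base Julia to illustrate this. It is not taken from the package, and it makes no claim about why the package was designed this way.

```julia
# A vector of pairs keeps both "country" entries
query = [
    "hasCoordinate" => true,
    "country" => "FR",
    "country" => "IT",
]
countries = [last(p) for p in query if first(p) == "country"]

# A Dict keeps a single value per key, so one of the two countries is lost
as_dict = Dict(query)

@show length(query) length(as_dict) countries
```

Here the vector holds three entries while the dictionary holds only two, because the second "country" pair overwrites the first.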

Query occurrence data

We have enough information to start our search of occurrences:

julia
places = occurrences(species, query...)
GBIF records: downloaded 300 out of 4835

This step is doing a few important things. First, it is using the taxon object to filter the results of the API query, so that we will only get observations associated with this taxon. Second, it is bundling the query parameters with the object, so that we can modify it with subsequent requests. Internally, it is also keeping track of the total number of results, in order to retrieve them sequentially. Retrieving results sequentially is useful if you want to perform some operations while you collect results, for example to check that you have enough data and stop querying the API.

We can count the total number of observations known to GBIF with count:

julia
count(places)
4835

Similarly, we can count how many we actually have with length:

julia
length(places)
300

The package is set up so that the entire array of observations is allocated when we establish contact with the API for the first time, but we can only view the results we have actually retrieved (this is because the records are exposed to the user as a view).

As we know the current and total number of points, we can loop to get all occurrences. Note that the GBIF streaming API has a hard limit of 200000 records, and that querying this amount of data using the streaming API is woefully inefficient. For data volumes above 10000 observations, the suggested solution is to rely on the download interface on GBIF.

julia
while length(places) < count(places)
    occurrences!(places)
end
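
Because results arrive one page at a time, the loop can also stop early once there is enough data, as mentioned earlier. The following is a minimal sketch that relies only on the functions used on this page (occurrences, occurrences!, length, and count). The `target` variable is an illustrative threshold, not a package setting.

```julia
using SpeciesDistributionToolkit

species = taxon("Sitta whiteheadi")
places = occurrences(
    species,
    "hasCoordinate" => true,
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
)

# Hypothetical threshold: stop once we hold at least this many records,
# or once everything known to GBIF has been retrieved
target = 1000
while length(places) < min(target, count(places))
    occurrences!(places)
end
```

Since records are fetched a full page at a time, the final number of records can exceed `target` by up to one page.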

Get information on occurrence data

When this is done, we can have a look at the countries in which the observations were made:

julia
sort(unique([place.country for place in places]))
1-element Vector{String}:
 "France"

We can also establish the time of the first and last observations:

julia
extrema(filter(!ismissing, [place.date for place in places]))
(DateTime("2019-07-04T09:56:09"), DateTime("2024-09-14T10:25:10"))
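
The same pattern applies to other fields of a record. As a sketch, and assuming that each record exposes `latitude` and `longitude` fields in the same way as `country` and `date` (these names match the columns in the table export further down), the bounding box of the retrieved observations could be computed as:

```julia
using SpeciesDistributionToolkit

species = taxon("Sitta whiteheadi")
places = occurrences(
    species,
    "hasCoordinate" => true,
    "occurrenceStatus" => "PRESENT",
)

# Assumption: records expose latitude and longitude fields;
# missing values are dropped as a precaution
lat_range = extrema(filter(!ismissing, [place.latitude for place in places]))
lon_range = extrema(filter(!ismissing, [place.longitude for place in places]))
```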

The GBIF results can interact seamlessly with the layer types, which is covered in other vignettes.

Finally, the package implements the interface to Tables.jl, so that we may write:

julia
fields_to_keep = [:key, :publishingCountry, :country, :latitude, :longitude, :date]
select(DataFrame(places), fields_to_keep)[1:10,:]
10×6 DataFrame
 Row │ key         publishingCountry  country  latitude  longitude  date
     │ Int64       String             String   Float64   Float64    DateTime?
─────┼────────────────────────────────────────────────────────────────────────
   1 │ 4596933246  US                 France    41.7587    9.36375  missing
   2 │ 4881328036  NL                 France    42.3044     8.9204  missing
   3 │ 4885726743  NL                 France     42.312    8.93905  missing
   4 │ 4887037892  NL                 France    42.3203    8.94863  missing
   5 │ 4888052469  NL                 France    42.3188    8.95731  missing
   6 │ 4891339312  NL                 France       42.3     8.8999  missing
   7 │ 4886590221  NL                 France    42.2988    8.91033  missing
   8 │ 4891105014  NL                 France    42.2784    8.85528  missing
   9 │ 4886317377  NL                 France    42.3002    8.88372  missing
  10 │ 4863745024  US                 France    42.4634    8.99781  missing

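
Once the records are in a DataFrame, the usual DataFrames.jl verbs apply. The sketch below is one possible follow-up, not part of the package itself. It counts records per publishing country and keeps only the rows with a known date. The column names are those selected above, and `per_publisher` and `dated` are illustrative variable names.

```julia
using SpeciesDistributionToolkit
using DataFrames

species = taxon("Sitta whiteheadi")
places = occurrences(
    species,
    "hasCoordinate" => true,
    "limit" => 300,
    "occurrenceStatus" => "PRESENT",
)

fields_to_keep = [:key, :publishingCountry, :country, :latitude, :longitude, :date]
df = select(DataFrame(places), fields_to_keep)

# Number of records contributed by each publishing country
per_publisher = combine(groupby(df, :publishingCountry), nrow => :records)

# Keep only the rows for which the observation date is known
dated = dropmissing(df, :date)
```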
GBIF.taxon Function

Get information about a taxon at any level

taxon(name::String)

This function will look for a taxon by its (scientific) name in the GBIF reference taxonomy.

Optional arguments are

  • rank::Union{Symbol,Nothing}=:SPECIES – the rank of the taxon you want. This is part of a controlled vocabulary, and can only be one of :DOMAIN, :CLASS, :CULTIVAR, :FAMILY, :FORM, :GENUS, :INFORMAL, :ORDER, :PHYLUM, :SECTION, :SUBCLASS, :VARIETY, :TRIBE, :KINGDOM, :SUBFAMILY, :SUBFORM, :SUBGENUS, :SUBKINGDOM, :SUBORDER, :SUBPHYLUM, :SUBSECTION, :SUBSPECIES, :SUBTRIBE, :SUBVARIETY, :SUPERCLASS, :SUPERFAMILY, :SUPERORDER, and :SPECIES

  • strict::Bool=true – whether the match should be strict, or fuzzy

Finally, one can also specify other levels of the taxonomy, using kingdom, phylum, class, order, family, and genus, all of which can either be String or Nothing.

If a match is found, the result will be given as a GBIFTaxon. If not, this function will return nothing and give a warning.
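
As a brief illustration of these options, the sketch below assumes that rank, strict, and the higher taxonomic levels are passed as keyword arguments, which is what the signature above suggests. It also guards against the nothing returned by a failed lookup.

```julia
using SpeciesDistributionToolkit

# Ask for a genus instead of the default species rank
genus = taxon("Sitta"; rank = :GENUS)

# Relax the match, which may help with small spelling differences
fuzzy = taxon("Sitta whiteheadi"; strict = false)

# Narrow the lookup with a higher taxonomic level
constrained = taxon("Sitta whiteheadi"; kingdom = "Animalia")

# Guard against a failed lookup, which returns nothing
result = taxon("Sitta whiteheadi")
if isnothing(result)
    @warn "No match found in the GBIF reference taxonomy"
else
    println(result.phylum)
end
```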

Get information about a taxon at any level using taxonID

taxon(id::Int)

This function will look for a taxon by its taxonID in the GBIF reference taxonomy.

GBIF.occurrences Function
julia
occurrences(query::Pair...)

This function will return the latest occurrences matching the queries – usually 20, but this is entirely determined by the server default page size. The query parameters must be given as pairs, and are optional. Omitting the query will return the latest recorded occurrences for all taxa.

The arguments accepted as queries are documented on the GBIF API website.

Note that this function will return even observations where the "occurrenceStatus" is "ABSENT"; therefore, for the majority of uses, your query will at least contain "occurrenceStatus" => "PRESENT".
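
For instance, a query that is not tied to any taxon can be sketched as follows. The parameter names are the ones used earlier on this page, and the value of 50 for "limit" is an arbitrary choice for the example.

```julia
using SpeciesDistributionToolkit

# Latest georeferenced presence records, across all taxa
recent = occurrences(
    "occurrenceStatus" => "PRESENT",
    "hasCoordinate" => true,
    "limit" => 50,
)

# Records retrieved so far versus records matching the query
println(length(recent), " out of ", count(recent))
```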

julia
occurrences(t::GBIFTaxon, query::Pair...)

Returns occurrences for a given taxon – the query arguments are the same as for the occurrences function above.

julia
occurrences(t::Vector{GBIFTaxon}, query::Pair...)

Returns occurrences for a series of taxa – the query arguments are the same as for the occurrences function above.
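
A minimal sketch of the multi-taxon form is given below. It assumes that both lookups succeed, so that the array literal has element type GBIFTaxon, as the signature requires. The second species is only there for illustration.

```julia
using SpeciesDistributionToolkit

# Assumption: both names resolve, so this is a Vector{GBIFTaxon}
nuthatches = [taxon("Sitta whiteheadi"), taxon("Sitta europaea")]

records = occurrences(
    nuthatches,
    "hasCoordinate" => true,
    "country" => "FR",
    "occurrenceStatus" => "PRESENT",
)
```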

GBIF.occurrences! Function

Get the next page of results

This function will retrieve the next page of results. By default, it will walk through queries 20 at a time. This can be modified by changing the .query["limit"] value, to any value up to 300, which is the limit set by GBIF for the queries.

If filters have been applied to this query before, they will be removed to ensure that the previous and the new occurrences have the same status, but only for records that have already been retrieved.
