Use-case: the portal data

In this example, we will use NCBITaxonomy to validate the names of the species used in the Portal teaching dataset:

Ernest, Morgan; Brown, James; Valone, Thomas; White, Ethan P. (2017): Portal Project Teaching Database. figshare. https://doi.org/10.6084/m9.figshare.1314459.v6

We will download a list of species from figshare, which is given as a JSON file:

using NCBITaxonomy
using DataFrames
using JSON
using StringDistances

species_file = download("https://ndownloader.figshare.com/files/3299486")
species = JSON.parsefile(species_file)
54-element Vector{Any}:
 Dict{String, Any}("species" => "bilineata", "genus" => "Amphispiza", "taxa" => "Bird", "species_id" => "AB")
 Dict{String, Any}("species" => "harrisi", "genus" => "Ammospermophilus", "taxa" => "Rodent", "species_id" => "AH")
 Dict{String, Any}("species" => "savannarum", "genus" => "Ammodramus", "taxa" => "Bird", "species_id" => "AS")
 Dict{String, Any}("species" => "taylori", "genus" => "Baiomys", "taxa" => "Rodent", "species_id" => "BA")
 Dict{String, Any}("species" => "brunneicapillus", "genus" => "Campylorhynchus", "taxa" => "Bird", "species_id" => "CB")
 Dict{String, Any}("species" => "melanocorys", "genus" => "Calamospiza", "taxa" => "Bird", "species_id" => "CM")
 Dict{String, Any}("species" => "squamata", "genus" => "Callipepla", "taxa" => "Bird", "species_id" => "CQ")
 Dict{String, Any}("species" => "scutalatus", "genus" => "Crotalus", "taxa" => "Reptile", "species_id" => "CS")
 Dict{String, Any}("species" => "tigris", "genus" => "Cnemidophorus", "taxa" => "Reptile", "species_id" => "CT")
 Dict{String, Any}("species" => "uniparens", "genus" => "Cnemidophorus", "taxa" => "Reptile", "species_id" => "CU")
 ⋮
 Dict{String, Any}("species" => "tereticaudus", "genus" => "Spermophilus", "taxa" => "Rodent", "species_id" => "ST")
 Dict{String, Any}("species" => "undulatus", "genus" => "Sceloporus", "taxa" => "Reptile", "species_id" => "SU")
 Dict{String, Any}("species" => "sp.", "genus" => "Sigmodon", "taxa" => "Rodent", "species_id" => "SX")
 Dict{String, Any}("species" => "sp.", "genus" => "Lizard", "taxa" => "Reptile", "species_id" => "UL")
 Dict{String, Any}("species" => "sp.", "genus" => "Pipilo", "taxa" => "Bird", "species_id" => "UP")
 Dict{String, Any}("species" => "sp.", "genus" => "Rodent", "taxa" => "Rodent", "species_id" => "UR")
 Dict{String, Any}("species" => "sp.", "genus" => "Sparrow", "taxa" => "Bird", "species_id" => "US")
 Dict{String, Any}("species" => "leucophrys", "genus" => "Zonotrichia", "taxa" => "Bird", "species_id" => "ZL")
 Dict{String, Any}("species" => "macroura", "genus" => "Zenaida", "taxa" => "Bird", "species_id" => "ZM")

Cleaning up the portal names

There are two things we want to do at this point: extract the species names from the file, and then validate that they are spelled correctly, or that they are the most recent taxonomic name according to NCBI.

The portal data are already identified as belonging to a group of taxa, so we can get a unique list of them:

taxo_groups = unique([tax["taxa"] for tax in species])
4-element Vector{String}:
 "Bird"
 "Rodent"
 "Reptile"
 "Rabbit"

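Before going further, it can help to know how many records fall in each group. The snippet below is a minimal sketch: `group_counts` is a hypothetical helper (not part of NCBITaxonomy), and `records` holds three entries copied from the listing above so the example runs on its own. The full `species` vector should work the same way.

```julia
# Three records copied from the JSON listing above.
records = [
    Dict("species" => "bilineata", "genus" => "Amphispiza", "taxa" => "Bird", "species_id" => "AB"),
    Dict("species" => "harrisi", "genus" => "Ammospermophilus", "taxa" => "Rodent", "species_id" => "AH"),
    Dict("species" => "savannarum", "genus" => "Ammodramus", "taxa" => "Bird", "species_id" => "AS"),
]

# Hypothetical helper: count how many records carry each "taxa" label.
function group_counts(recs)
    counts = Dict{String, Int}()
    for rec in recs
        counts[rec["taxa"]] = get(counts, rec["taxa"], 0) + 1
    end
    return counts
end

counts = group_counts(records)
```
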
We will store our results in a data frame:

cleanup = DataFrame(
    code = String[],
    portal = String[],
    name = String[],
    rank = Symbol[],
    order = String[],
    taxid = Int[],
    same = Bool[],
    fuzzy = Bool[]
)
0×8 DataFrame
Row  code    portal  name    rank    order   taxid  same  fuzzy
     String  String  String  Symbol  String  Int64  Bool  Bool

The next step is to loop through the species, and figure out what to do with them:

for sp in species
    # Records with the placeholder epithet "sp." are searched by genus only
    portal_name = sp["species"] == "sp." ? sp["genus"] : sp["genus"]*" "*sp["species"]
    local ncbi_tax
    local fuzzy = false
    try
        # First attempt: exact match on the name
        ncbi_tax = taxon(portal_name)
    catch y
        if isa(y, NameHasNoDirectMatch)
            # No exact match: fall back to fuzzy matching and flag the row
            fuzzy = true
            ncbi_tax = taxon(portal_name; strict=false)
        else
            # Any other error: skip this record
            continue
        end
    end
    # Compute the lineage once and reuse it to find the order
    ncbi_lin = lineage(ncbi_tax)
    push!(cleanup,
        (
            sp["species_id"], portal_name, ncbi_tax.name, rank(ncbi_tax),
            first(filter(t -> isequal(:order)(rank(t)), ncbi_lin)).name,
            ncbi_tax.id, portal_name == ncbi_tax.name, fuzzy
        )
    )
end

first(cleanup, 5)
5×8 DataFrame
Row  code    portal                           name                             rank     order          taxid   same   fuzzy
     String  String                           String                           Symbol   String         Int64   Bool   Bool
1    AB      Amphispiza bilineata             Amphispiza bilineata             species  Passeriformes  198939  true   false
2    AH      Ammospermophilus harrisi         Ammospermophilus harrisii        species  Rodentia       45487   false  true
3    AS      Ammodramus savannarum            Ammodramus savannarum            species  Passeriformes  135422  true   false
4    BA      Baiomys taylori                  Baiomys taylori                  species  Rodentia       56219   true   false
5    CB      Campylorhynchus brunneicapillus  Campylorhynchus brunneicapillus  species  Passeriformes  141853  true   false
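
The rule used in the loop to build the search name can be isolated and checked on its own. This is a minimal sketch: `portal_name` is written here as a hypothetical helper function (in the loop above it is a plain variable), and the two records are taken from the species listing.

```julia
# Hypothetical helper: build the name to search for from one record.
# Records with the placeholder epithet "sp." are searched by genus only;
# all others are searched by the full binomial.
function portal_name(sp::AbstractDict)
    return sp["species"] == "sp." ? sp["genus"] : string(sp["genus"], " ", sp["species"])
end

full_name = portal_name(Dict("species" => "bilineata", "genus" => "Amphispiza"))
genus_only = portal_name(Dict("species" => "sp.", "genus" => "Sigmodon"))
```

Searching by genus alone explains why some rows of the results can have a rank other than species.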

Looking at species with a name discrepancy

Finally, we can look at the codes for which there is a likely issue because the names do not match – this can be because of new names, improper use of vernacular, or spelling issues:

filter(r -> r.portal != r.name, cleanup)
14×8 DataFrame
Row  code    portal                     name                           rank     order          taxid   same   fuzzy
     String  String                     String                         Symbol   String         Int64   Bool   Bool
1    AH      Ammospermophilus harrisi   Ammospermophilus harrisii      species  Rodentia       45487   false  true
2    CS      Crotalus scutalatus        Crotalus scutulatus            species  Squamata       8737    false  true
3    CT      Cnemidophorus tigris       Aspidoscelis tigris            species  Squamata       52180   false  false
4    CU      Cnemidophorus uniparens    Aspidoscelis uniparens         species  Squamata       37197   false  false
5    EO      Eumeces obsoletus          Plestiodon obsoletus           species  Squamata       463535  false  false
6    GS      Gambelia silus             Gambelia sila                  species  Squamata       475046  false  true
7    PH      Perognathus hispidus       Chaetodipus hispidus           species  Rodentia       38665   false  false
8    PU      Pipilo fuscus              Melozone fusca                 species  Passeriformes  40205   false  false
9    SC      Sceloporus clarki          Sceloporus clarkii             species  Squamata       235405  false  false
10   SS      Spermophilus spilosoma     Xerospermophilus spilosoma     species  Rodentia       45471   false  false
11   ST      Spermophilus tereticaudus  Xerospermophilus tereticaudus  species  Rodentia       99860   false  false
12   UL      Lizard                     Lisarda                        genus    Hemiptera      204543  false  true
13   UR      Rodent                     Rodentia                       order    Rodentia       9989    false  true
14   US      Sparrow                    Passeridae                     family   Passeriformes  9158    false  true
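
To get a rough sense of which kind of discrepancy each row represents, the two names can be compared word by word. The function below is a hypothetical heuristic written for illustration only (it is not part of NCBITaxonomy, and its category labels are assumptions): a different first word suggests a change of genus, an identical genus with a different epithet suggests a spelling issue, and anything that is not a two-word name is set aside.

```julia
# Hypothetical heuristic: guess why a portal name and an NCBI name differ.
function discrepancy_kind(portal::AbstractString, ncbi::AbstractString)
    portal == ncbi && return :same
    p, n = split(portal), split(ncbi)
    # Single-word entries (genus only, or vernacular such as "Lizard")
    # cannot be compared as binomials.
    (length(p) < 2 || length(n) < 2) && return :not_binomial
    # A different first word points to a change of genus.
    first(p) != first(n) && return :genus_changed
    # Same genus, different epithet: most likely a spelling issue.
    return :epithet_changed
end

kinds = [
    discrepancy_kind("Crotalus scutalatus", "Crotalus scutulatus"),
    discrepancy_kind("Cnemidophorus tigris", "Aspidoscelis tigris"),
    discrepancy_kind("Lizard", "Lisarda"),
]
```

A labelling like this is only a first pass; rows flagged as `:not_binomial` are the ones most likely to need manual inspection.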

Out of these, some required fuzzy matching to get a proper name, so we can look at these taxa, as they are likely to require manual curation:

filter(r -> r.fuzzy, cleanup)
6×8 DataFrame
Row  code    portal                    name                       rank     order          taxid   same   fuzzy
     String  String                    String                     Symbol   String         Int64   Bool   Bool
1    AH      Ammospermophilus harrisi  Ammospermophilus harrisii  species  Rodentia       45487   false  true
2    CS      Crotalus scutalatus       Crotalus scutulatus        species  Squamata       8737    false  true
3    GS      Gambelia silus            Gambelia sila              species  Squamata       475046  false  true
4    UL      Lizard                    Lisarda                    genus    Hemiptera      204543  false  true
5    UR      Rodent                    Rodentia                   order    Rodentia       9989    false  true
6    US      Sparrow                   Passeridae                 family   Passeriformes  9158    false  true

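Out of these, only Lizard has an implausible identification: it was matched to a member of Hemiptera. We can confirm the problem by looking at the class in the lineage of the matched taxon: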

filter(t -> isequal(:class)(rank(t)), lineage(ncbi"Lisarda"))
1-element Vector{NCBITaxon}:
 Insecta (ncbi:50557)

We can dig into this example a little more, because it shows how strongly data entry can affect the success of name matching.

similarnames("Lizard"; threshold=0.7)
1-element Vector{Pair{NCBITaxon, Float64}}:
 Lisarda (ncbi:204543) => 0.7142857142857143

The Lisarda taxon (which is an insect!) is the closest match, simply because "Lizard" is not a classification we can use – lizards are a paraphyletic group, containing a handful of different groups. Based on the information available, the only thing we can say about the taxon identified as "Lizard" is that it belongs to Squamata.
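
The score of about 0.714 is consistent with a normalized Levenshtein similarity, that is, one minus the edit distance divided by the length of the longer name. The source does not state which measure `similarnames` uses, so this is an assumption; the sketch below implements the distance by hand, with the hypothetical helper names `levenshtein` and `similarity`, purely to show where a number like this can come from.

```julia
# Classic dynamic-programming Levenshtein distance
# (insertions, deletions, and substitutions each cost 1).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))   # distances from the empty prefix of `a`
    for i in eachindex(s)
        curr = similar(prev)
        curr[1] = i               # deleting the first i characters of `a`
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Similarity in [0, 1]: one minus the distance over the longer length.
function similarity(a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    return longest == 0 ? 1.0 : 1 - levenshtein(a, b) / longest
end

score = similarity("Lizard", "Lisarda")
```

Under this measure, "Lizard" and "Lisarda" differ by two edits (one substitution and one insertion) over seven characters, which is just enough to pass a threshold of 0.7 even though the two names have nothing to do with each other biologically.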