Use-case: the portal data
In this example, we will use NCBITaxonomy to validate the names of the species used in the Portal teaching dataset:
Ernest, Morgan; Brown, James; Valone, Thomas; White, Ethan P. (2017): Portal Project Teaching Database. figshare. https://doi.org/10.6084/m9.figshare.1314459.v6
We will download a list of species from figshare, which is given as a JSON file:
using NCBITaxonomy
using DataFrames
using JSON
using StringDistances
species_file = download("https://ndownloader.figshare.com/files/3299486")
species = JSON.parsefile(species_file)
54-element Vector{Any}:
Dict{String, Any}("species" => "bilineata", "genus" => "Amphispiza", "taxa" => "Bird", "species_id" => "AB")
Dict{String, Any}("species" => "harrisi", "genus" => "Ammospermophilus", "taxa" => "Rodent", "species_id" => "AH")
Dict{String, Any}("species" => "savannarum", "genus" => "Ammodramus", "taxa" => "Bird", "species_id" => "AS")
Dict{String, Any}("species" => "taylori", "genus" => "Baiomys", "taxa" => "Rodent", "species_id" => "BA")
Dict{String, Any}("species" => "brunneicapillus", "genus" => "Campylorhynchus", "taxa" => "Bird", "species_id" => "CB")
Dict{String, Any}("species" => "melanocorys", "genus" => "Calamospiza", "taxa" => "Bird", "species_id" => "CM")
Dict{String, Any}("species" => "squamata", "genus" => "Callipepla", "taxa" => "Bird", "species_id" => "CQ")
Dict{String, Any}("species" => "scutalatus", "genus" => "Crotalus", "taxa" => "Reptile", "species_id" => "CS")
Dict{String, Any}("species" => "tigris", "genus" => "Cnemidophorus", "taxa" => "Reptile", "species_id" => "CT")
Dict{String, Any}("species" => "uniparens", "genus" => "Cnemidophorus", "taxa" => "Reptile", "species_id" => "CU")
⋮
Dict{String, Any}("species" => "tereticaudus", "genus" => "Spermophilus", "taxa" => "Rodent", "species_id" => "ST")
Dict{String, Any}("species" => "undulatus", "genus" => "Sceloporus", "taxa" => "Reptile", "species_id" => "SU")
Dict{String, Any}("species" => "sp.", "genus" => "Sigmodon", "taxa" => "Rodent", "species_id" => "SX")
Dict{String, Any}("species" => "sp.", "genus" => "Lizard", "taxa" => "Reptile", "species_id" => "UL")
Dict{String, Any}("species" => "sp.", "genus" => "Pipilo", "taxa" => "Bird", "species_id" => "UP")
Dict{String, Any}("species" => "sp.", "genus" => "Rodent", "taxa" => "Rodent", "species_id" => "UR")
Dict{String, Any}("species" => "sp.", "genus" => "Sparrow", "taxa" => "Bird", "species_id" => "US")
Dict{String, Any}("species" => "leucophrys", "genus" => "Zonotrichia", "taxa" => "Bird", "species_id" => "ZL")
Dict{String, Any}("species" => "macroura", "genus" => "Zenaida", "taxa" => "Bird", "species_id" => "ZM")
Cleaning up the portal names
There are two things we want to do at this point: extract the species names from the file, and then check that they are spelled correctly and that they match the most recent taxonomic name according to NCBI.
Each record in the Portal data is already assigned to a broad taxonomic group, so we can get a unique list of these groups:
taxo_groups = unique([tax["taxa"] for tax in species])
4-element Vector{String}:
"Bird"
"Rodent"
"Reptile"
"Rabbit"
We will store our results in a data frame:
cleanup = DataFrame(
    code = String[],
    portal = String[],
    name = String[],
    rank = Symbol[],
    order = String[],
    taxid = Int[],
    same = Bool[],
    fuzzy = Bool[]
)
Row | code | portal | name | rank | order | taxid | same | fuzzy |
---|---|---|---|---|---|---|---|---|
String | String | String | Symbol | String | Int64 | Bool | Bool |
The next step is to loop through the species, and figure out what to do with them:
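for sp in species
    # Records without a species epithet ("sp.") are looked up by genus alone
    portal_name = sp["species"] == "sp." ? sp["genus"] : sp["genus"] * " " * sp["species"]
    local ncbi_tax
    local fuzzy = false
    try
        # First attempt: strict matching on the name
        ncbi_tax = taxon(portal_name)
    catch y
        if isa(y, NameHasNoDirectMatch)
            # No direct match: fall back to fuzzy matching and flag the row
            fuzzy = true
            ncbi_tax = taxon(portal_name; strict=false)
        else
            # Any other error: skip this record
            continue
        end
    end
    # Compute the lineage once and reuse it to find the order
    ncbi_lin = lineage(ncbi_tax)
    push!(cleanup,
        (
            sp["species_id"], portal_name, ncbi_tax.name, rank(ncbi_tax),
            first(filter(t -> isequal(:order)(rank(t)), ncbi_lin)).name,
            ncbi_tax.id, portal_name == ncbi_tax.name, fuzzy
        )
    )
end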
first(cleanup, 5)
Row | code | portal | name | rank | order | taxid | same | fuzzy |
---|---|---|---|---|---|---|---|---|
String | String | String | Symbol | String | Int64 | Bool | Bool | |
1 | AB | Amphispiza bilineata | Amphispiza bilineata | species | Passeriformes | 198939 | true | false |
2 | AH | Ammospermophilus harrisi | Ammospermophilus harrisii | species | Rodentia | 45487 | false | true |
3 | AS | Ammodramus savannarum | Ammodramus savannarum | species | Passeriformes | 135422 | true | false |
4 | BA | Baiomys taylori | Baiomys taylori | species | Rodentia | 56219 | true | false |
5 | CB | Campylorhynchus brunneicapillus | Campylorhynchus brunneicapillus | species | Passeriformes | 141853 | true | false |
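The first line of the loop decides which string is sent to NCBI, and this matters for the records that carry no species epithet. The following is a minimal sketch of that step on its own, using three records copied from the JSON output above; the helper name `portal_query` is hypothetical and is not part of NCBITaxonomy.

```julia
# Hypothetical helper mirroring the first line of the loop:
# records marked "sp." are looked up by their "genus" field alone.
portal_query(record) =
    record["species"] == "sp." ? record["genus"] : record["genus"] * " " * record["species"]

# Three records copied from the Portal species file shown earlier
records = [
    Dict("species" => "bilineata", "genus" => "Amphispiza"),
    Dict("species" => "sp.", "genus" => "Sigmodon"),
    Dict("species" => "sp.", "genus" => "Lizard"),
]

queries = portal_query.(records)
```

Note that the last record puts a vernacular name in the `genus` field, so the query sent to NCBI is simply "Lizard".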
Looking at species with a name discrepancy
We can now look at the codes for which there is a likely issue because the Portal name and the NCBI name do not match. This can be because of new names, improper use of vernacular names, or spelling issues:
filter(r -> r.portal != r.name, cleanup)
Row | code | portal | name | rank | order | taxid | same | fuzzy |
---|---|---|---|---|---|---|---|---|
String | String | String | Symbol | String | Int64 | Bool | Bool | |
1 | AH | Ammospermophilus harrisi | Ammospermophilus harrisii | species | Rodentia | 45487 | false | true |
2 | CS | Crotalus scutalatus | Crotalus scutulatus | species | Squamata | 8737 | false | true |
3 | CT | Cnemidophorus tigris | Aspidoscelis tigris | species | Squamata | 52180 | false | false |
4 | CU | Cnemidophorus uniparens | Aspidoscelis uniparens | species | Squamata | 37197 | false | false |
5 | EO | Eumeces obsoletus | Plestiodon obsoletus | species | Squamata | 463535 | false | false |
6 | GS | Gambelia silus | Gambelia sila | species | Squamata | 475046 | false | true |
7 | PH | Perognathus hispidus | Chaetodipus hispidus | species | Rodentia | 38665 | false | false |
8 | PU | Pipilo fuscus | Melozone fusca | species | Passeriformes | 40205 | false | false |
9 | SC | Sceloporus clarki | Sceloporus clarkii | species | Squamata | 235405 | false | false |
10 | SS | Spermophilus spilosoma | Xerospermophilus spilosoma | species | Rodentia | 45471 | false | false |
11 | ST | Spermophilus tereticaudus | Xerospermophilus tereticaudus | species | Rodentia | 99860 | false | false |
12 | UL | Lizard | Lisarda | genus | Hemiptera | 204543 | false | true |
13 | UR | Rodent | Rodentia | order | Rodentia | 9989 | false | true |
14 | US | Sparrow | Passeridae | family | Passeriformes | 9158 | false | true |
Some of these names could only be resolved through fuzzy matching. We can look at these taxa separately, as they are likely to require manual curation:
filter(r -> r.fuzzy, cleanup)
Row | code | portal | name | rank | order | taxid | same | fuzzy |
---|---|---|---|---|---|---|---|---|
String | String | String | Symbol | String | Int64 | Bool | Bool | |
1 | AH | Ammospermophilus harrisi | Ammospermophilus harrisii | species | Rodentia | 45487 | false | true |
2 | CS | Crotalus scutalatus | Crotalus scutulatus | species | Squamata | 8737 | false | true |
3 | GS | Gambelia silus | Gambelia sila | species | Squamata | 475046 | false | true |
4 | UL | Lizard | Lisarda | genus | Hemiptera | 204543 | false | true |
5 | UR | Rodent | Rodentia | order | Rodentia | 9989 | false | true |
6 | US | Sparrow | Passeridae | family | Passeriformes | 9158 | false | true |
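The two filters above suggest a simple triage of the results: names that match exactly, names that NCBI resolved directly to a different name, and names that needed fuzzy matching. The sketch below applies that idea to four rows copied from the tables, stored as plain named tuples so that it runs without DataFrames; the `triage` function and its labels are hypothetical, not part of NCBITaxonomy.

```julia
# A few rows copied from the tables above, as plain NamedTuples
rows = [
    (code = "AB", portal = "Amphispiza bilineata", name = "Amphispiza bilineata", fuzzy = false),
    (code = "CT", portal = "Cnemidophorus tigris", name = "Aspidoscelis tigris", fuzzy = false),
    (code = "CS", portal = "Crotalus scutalatus", name = "Crotalus scutulatus", fuzzy = true),
    (code = "UL", portal = "Lizard", name = "Lisarda", fuzzy = true),
]

# Hypothetical triage labels (an illustration, not a library feature)
function triage(row)
    row.portal == row.name && return :unchanged
    # Fuzzy matches are guesses, so they are the ones to review by hand
    return row.fuzzy ? :check_manually : :resolved_to_other_name
end

labels = Dict(row.code => triage(row) for row in rows)
```

With a data frame, the same function could presumably be applied to each row of `cleanup`, since it only relies on the `portal`, `name`, and `fuzzy` fields.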
Of these fuzzy matches, only Lizard has a strange identification, as a member of Hemiptera:
filter(t -> isequal(:class)(rank(t)), lineage(ncbi"Lisarda"))
1-element Vector{NCBITaxon}:
Insecta (ncbi:50557)
So the match is an insect. We can dig into this example a little more, because it shows how much the way names were entered affects the success of name matching.
similarnames("Lizard"; threshold=0.7)
1-element Vector{Pair{NCBITaxon, Float64}}:
Lisarda (ncbi:204543) => 0.7142857142857143
The Lisarda taxon (which is an insect!) is the closest match, simply because "Lizard" is not a classification we can use: lizards are a paraphyletic group, containing a handful of different groups. Based on the information available, the only thing we can say about the taxon recorded as "Lizard" is that it belongs to Squamata.
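The score of about 0.714 reported by `similarnames` is what a length-normalised Levenshtein similarity gives for this pair: "Lizard" becomes "Lisarda" with one substitution and one insertion, and 1 - 2/7 ≈ 0.714. That this is the measure used internally is an assumption, not something stated here. The sketch below reimplements the distance in plain Julia rather than calling NCBITaxonomy or StringDistances, and the function names are illustrative.

```julia
# Dynamic-programming Levenshtein distance
# (counts insertions, deletions, and substitutions).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))  # distances from the empty prefix of `a`
    for i in eachindex(s)
        curr = similar(prev)
        curr[1] = i              # deleting the first i characters of `a`
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Similarity in [0, 1]: one minus the distance divided by the longer length
function similarity(a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    return longest == 0 ? 1.0 : 1 - levenshtein(a, b) / longest
end

score = similarity("Lizard", "Lisarda")
```

A two-character difference is enough to clear the 0.7 threshold for a seven-letter name, which is why a vernacular word can land on an unrelated genus.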