Finding taxa
The taxon function
NCBITaxonomy.taxon — Function
taxon(df::DataFrame, id::Integer)
Returns a fully formed NCBITaxon based on its id. The name of the taxon will be the valid scientific name associated with this id.
taxon(id::Integer)
Performs a search in the entire taxonomy backbone based on a known ID.
taxon(name::AbstractString; kwargs...)
The taxon function is the core entry point in the NCBI taxonomy. It takes a string and a series of keywords, and looks for this taxon in the dataframe (by default the entire names table).
The keywords are:
strict (def. true): when set to false, allows fuzzy matching
dist (def. Levenshtein): the string distance function to use
casesensitive (def. true): when set to false, strict matching is done on lowercased names
rank (def. nothing): the taxonomic rank to which the search is limited
preferscientific (def. false): whether scientific names are preferred when the query also matches non-scientific names (synonyms, vernaculars, blast names, ...); this is most likely useful when paired with casesensitive=true, and does not work with strict=false
onlysynonyms (def. false): limits the search to synonyms, which may be useful in case the taxonomy is particularly outdated
taxon(df::DataFrame, name::AbstractString; kwargs...)
Additional method for taxon with an extra dataframe argument, used most often with a namefilter. Accepts the usual taxon keyword arguments.
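As a hedged illustration of the methods and keywords described above, the calls below use only arguments listed in this docstring. The names and the id come from the examples on this page, except the lowercased "bos taurus" query, which is an assumption about what a case-insensitive search accepts. Results depend on the taxonomy version installed locally.

```julia
using NCBITaxonomy

# Default behaviour: strict, case-sensitive match over the entire names table.
bos = taxon("Bos taurus")

# Lookup by a known identifier (9913 is the id shown for Bos taurus on this page).
bos_by_id = taxon(9913)

# Assumption: with casesensitive=false, a lowercased query still matches strictly.
bos_lower = taxon("bos taurus"; casesensitive=false)

# Fuzzy matching tolerates a misspelled query.
fuzzy = taxon("Paradiplozon homion"; strict=false)

println(bos.name, " has id ", bos.id)
```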
NCBITaxonomy.vernacular — Function
vernacular(t::NCBITaxon)
This function will return nothing if no vernacular name is known, and an array of names if found. It searches the "common name" and "genbank common name" categories of the NCBI taxonomy name table.
NCBITaxonomy.synonyms — Function
synonyms(t::NCBITaxon)
This function will return nothing if no synonyms exist, and an array of names if they do. It returns all of the
NCBITaxonomy.authority — Function
authority(t::NCBITaxon)
This function will return nothing if no authority exists, and a string with the authority if found.
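Because vernacular, synonyms, and authority may all return nothing, calling code usually needs a fallback. The following is a minimal sketch of one way to handle this with Base's something and isnothing; the variable names are illustrative, and the functions are called with the module prefix in case they are not all exported.

```julia
using NCBITaxonomy

bos = taxon("Bos taurus")

# vernacular and synonyms return `nothing` when no name is known,
# so fall back to an empty vector before iterating.
common = something(NCBITaxonomy.vernacular(bos), String[])
syns = something(NCBITaxonomy.synonyms(bos), String[])

# authority returns either `nothing` or a string; use the taxon name as a fallback label.
auth = NCBITaxonomy.authority(bos)
label = isnothing(auth) ? bos.name : auth

println(length(common), " vernacular names, ", length(syns), " synonyms; label: ", label)
```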
NCBITaxonomy.alternativetaxa — Function
alternativetaxa(df::DataFrame, name::AbstractString)
Generic version of alternativetaxa with an arbitrary data frame.
alternativetaxa(name::AbstractString)
Returns an array of taxa that share the same name. Note that this function does strict, case-sensitive searches only at the moment, but this may be extended through keyword arguments in a future release.
NCBITaxonomy.similarnames — Function
similarnames(name::AbstractString)
Returns a list (as a vector of pairs) mapping an NCBI taxon to a similarity score for the name given as argument.
Note that the function can return the same taxon more than once with different scores, because it will look through the entire list of names, and not only the scientific ones.
It may also return multiple taxa with the same score if the names are ambiguous, in which case all alternatives are given.
That being said, the taxa/score pairs will always be unique. For example, the string "mouse" will match both the vernacular for Bryophyta ("mosses") and its synonym ("Musci") with an equal dissimilarity under the Levenshtein distance; the pair will be returned only once.
Additional keywords are rank (limit to a given rank) and onlysynonyms.
similarnames(df::DataFrame, name::AbstractString; dist::Type{SD}=Levenshtein, threshold::Float64=0.8) where {SD <: StringDistance}
Generic version of similarnames.
The taxon function will return a NCBITaxon object, which has two fields: name and id. We do not return the class attribute, because the package will always return the scientific name, as the examples below illustrate:
using NCBITaxonomy
taxon("Bos taurus")
Bos taurus (ncbi:9913)
There is a convenience string macro to replace the taxon function:
ncbi"Bos taurus"
Bos taurus (ncbi:9913)
Note that because the names database contains vernacular and deprecated names, the scientific name will be returned no matter which name you search for:
taxon("cow")
Bos taurus (ncbi:9913)
This may be a good point to note that we can use the vernacular function to get a list of NCBI-known vernacular names:
taxon("cow") |> vernacular
8-element Vector{String}:
"bovine"
"cattle"
"cow"
"dairy cow"
"domestic cattle"
"domestic cow"
"ox"
"oxen"
It also works with authorities:
taxon("cow") |> authority
"Bos taurus Linnaeus, 1758"
You can pass an additional strict=false keyword argument to the taxon function to perform fuzzy name matching using the Levenshtein distance:
taxon("Paradiplozon homion", strict=false)
Paradiplozoon homoion (ncbi:147838)
Note that fuzzy searching comes at a performance cost, so it is preferable to use strict matching unless necessary. As a final note, you can specify any distance function from the StringDistances package, using the dist argument.
Some valid names refer to more than one entry in the NCBI taxonomy. This is, for example, the case for Mus (the genus and the sub-genus):
alternativetaxa("Mus")
2-element Vector{NCBITaxon}:
Mus (ncbi:10088)
Mus (ncbi:862507)
In some cases, the fuzzy matched name may not be the one you want. There is a function to get the names ordered by similarity:
similarnames("mouse"; threshold=0.6)
12-element Vector{Pair{NCBITaxon, Float64}}:
Mus (ncbi:10088) => 1.0
Mus musculus (ncbi:10090) => 1.0
Alces americanus (ncbi:999462) => 0.8
Bryophyta (ncbi:3208) => 0.6666666666666667
Ectromelia virus (ncbi:12643) => 0.625
Ulex europaeus (ncbi:3902) => 0.6
Dinornithiformes (ncbi:8808) => 0.6
Anser sp. (ncbi:8847) => 0.6
Equus caballus (ncbi:9796) => 0.6
Milicia excelsa (ncbi:58664) => 0.6
Sousa (ncbi:103599) => 0.6
Equus asinus x Equus caballus (ncbi:319699) => 0.6
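The scores in this listing are consistent with a normalized Levenshtein similarity, 1 - d / max(length(a), length(b)), although this page does not state the formula. The sketch below is a plain-Julia reimplementation written for illustration only: levenshtein and similarity are hypothetical helpers, not part of NCBITaxonomy, and "moose" and "mousepox" are assumed to be the names that matched for Alces americanus and Ectromelia virus.

```julia
# Dynamic-programming Levenshtein distance (insertions, deletions, substitutions).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of `a`
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i                      # deleting the first i characters of `a`
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Normalized similarity in [0, 1]; undefined (NaN) when both strings are empty.
similarity(a, b) = 1 - levenshtein(a, b) / max(length(a), length(b))

for candidate in ("moose", "mosses", "mousepox")
    d = levenshtein("mouse", candidate)  # distances are 1, 2 and 3 respectively
    println(candidate, ": distance ", d, ", similarity ", round(similarity("mouse", candidate), digits=3))
end
```

Under these assumptions, the three similarities round to 0.8, 0.667 and 0.625, in line with the scores listed for Alces americanus, Bryophyta and Ectromelia virus.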
Errors
NCBITaxonomy.NameHasNoDirectMatch — Type
NameHasNoDirectMatch
This exception is thrown when the name passed as an argument does not have a direct match, in which case using strict=false to switch to fuzzy matching may be advised.
NCBITaxonomy.NameHasMultipleMatches — Type
NameHasMultipleMatches
This exception is thrown when the name is an "in-part" name, which is not a valid node but an aggregation of multiple nodes. It is also thrown when the name is valid for several nodes. The error message will list the taxa that could be used instead. "Reptilia" is an example of a name that will throw this exception (in-part name); "Mus" will also throw this exception, as it is the name of both a genus and a subgenus.
Note that the error object has a taxa field, which stores the NCBITaxon objects that were matched; this allows you to catch the error and look for the taxon you want without relying on e.g. alternativetaxa.
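The taxa field described above can be used in a try/catch block. The following is a minimal sketch: candidates is a hypothetical helper written for this example, and the exception type is referenced with the module prefix in case it is not exported.

```julia
using NCBITaxonomy

# Hypothetical helper: resolve a name, falling back to the taxa stored in the error.
function candidates(name::AbstractString)
    try
        return [taxon(name)]
    catch err
        # Only handle the ambiguity error; anything else is rethrown unchanged.
        err isa NCBITaxonomy.NameHasMultipleMatches || rethrow()
        return collect(err.taxa)
    end
end

matches = candidates("Mus")

# Pick one entry by its identifier (10088 is one of the two ids listed for Mus on this page).
mus = first(filter(t -> t.id == 10088, matches))
println(mus.name, " resolved to id ", mus.id)
```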
Building a better namefilter
The taxon function, by default, searches in the entire names table. In many cases, we can restrict the scope of the search quite a lot by searching only in the range of names that match a given condition. For this reason, the taxon function also has a method with a first argument being a data frame of names. These are generated using namefilter, as well as a variety of helper functions.
NCBITaxonomy.namefilter — Function
namefilter(ids::Vector{T}) where {T <: Integer}
Returns a subset of the names table where only the given taxids are present.
namefilter(taxa::Vector{NCBITaxon})
Returns a subset of the names table dataset, where the taxids of the taxa are present. This includes all names, not only the scientific names.
namefilter(division::Symbol)
Returns a subset of the names table for all names under a given NCBI division.
namefilter(division::Vector{Symbol})
Returns a subset of the names table for all names under multiple NCBI divisions.
Here is an illustration of why using namefilters makes sense. Let's say we have to search for a potentially misspelled name:
@time taxon("Ebulavurus"; strict=false);
Ebolavirus (ncbi:186536)
We can use the virusfilter() function to generate a table with viruses only:
viruses = virusfilter()
@time taxon(viruses, "Bumbulu ebolavirus"; strict=false);
Bombali ebolavirus (ncbi:2010960)
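A names table can also be built from explicit identifiers, using the namefilter(ids) method documented above. The following is a minimal sketch that uses two ids shown elsewhere on this page; it assumes that the vernacular "cow" is among the names attached to id 9913 in the local copy of the taxonomy.

```julia
using NCBITaxonomy

# Restrict the names table to two taxa, using ids that appear on this page:
# 9913 (Bos taurus) and 10090 (Mus musculus).
subset = namefilter([9913, 10090])

# The search now only scans the names attached to those two ids.
cow = taxon(subset, "cow")
println(cow.name, " (ncbi:", cow.id, ")")
```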
Standard namefilters
To save some time, there are namefilters pre-populated with the large-level taxonomic divisions.
NCBITaxonomy.bacteriafilter — Function
bacteriafilter()
Returns a namefilter limited to the bacterial division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
NCBITaxonomy.virusfilter — Function
virusfilter()
Returns a namefilter limited to the viral division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments. Note that phages are covered by phagefilter.
NCBITaxonomy.mammalfilter — Function
mammalfilter(;inclusive::Bool=true)
Returns a namefilter limited to the mammal division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
If the keyword argument inclusive is set to false, this will not search for organisms assigned to a lower division, in this case rodents (covered by rodentfilter) and primates (covered by primatefilter). The default behavior is to include these groups.
NCBITaxonomy.vertebratefilter — Function
vertebratefilter(;inclusive::Bool=true)
Returns a namefilter limited to the vertebrate division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
If the keyword argument inclusive is set to false, this will not search for organisms assigned to a lower division, in this case mammals (covered by mammalfilter). The default behavior is to include these groups, which also include the groups covered by mammalfilter itself.
NCBITaxonomy.plantfilter — Function
plantfilter()
Returns a namefilter limited to the "plant and fungi" division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
NCBITaxonomy.invertebratefilter — Function
invertebratefilter()
Returns a namefilter limited to the invertebrate division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
Note that this is limited to organisms not covered by plantfilter, bacteriafilter, and virusfilter.
NCBITaxonomy.rodentfilter — Function
rodentfilter()
Returns a namefilter limited to the rodent division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
NCBITaxonomy.primatefilter — Function
primatefilter()
Returns a namefilter limited to the primate division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
NCBITaxonomy.phagefilter — Function
phagefilter()
Returns a namefilter limited to the phage division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
NCBITaxonomy.environmentalsamplesfilter — Function
environmentalsamplesfilter()
Returns a namefilter limited to the environmental samples division of the NCBI taxonomy. See the documentation for namefilter and taxid for more information about arguments.
All of these return a dataframe which can be passed to the taxon function as a first argument.