Finding taxa
The taxon function
NCBITaxonomy.taxon — Function
taxon(df::DataFrame, id::Integer)
Returns a fully formed NCBITaxon based on its id. The name of the taxon will be the valid scientific name associated with this id.
taxon(id::Integer)
Performs a search in the entire taxonomy backbone based on a known ID.
taxon(name::AbstractString; kwargs...)
The taxon function is the core entry point to the NCBI taxonomy. It takes a string and a series of keywords, and looks for this taxon in the dataframe (by default the entire names table).
The keywords are:
- strict (def. true): set to false to allow fuzzy matching
- dist (def. Levenshtein): the string distance function to use
- casesensitive (def. true): set to false to strict match on lowercased names
- rank (def. nothing): the taxonomic rank to limit the search to
- preferscientific (def. false): whether scientific names are preferred when the query also matches non-scientific names (synonyms, vernaculars, blast names, ...); this is most likely useful when paired with casesensitive=true, and does not work with strict=false
- onlysynonyms (def. false): limits the search to synonyms, which may be useful if the taxonomy is particularly outdated
taxon(df::DataFrame, name::AbstractString; kwargs...)
Additional method for taxon with an extra dataframe argument, used most often with a namefilter. Accepts the usual taxon keyword arguments.
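The keywords listed above can be combined in a single call. The following is a minimal sketch rather than a reference: the keyword names come from the docstring, but passing the rank as a Symbol (`:genus`) is an assumption, since the docstring only states that `rank` defaults to `nothing`.

```julia
using NCBITaxonomy

# Strict, case-sensitive lookup (the documented defaults).
bos = taxon("Bos taurus")

# Strict lookup that ignores case: the query is matched on lowercased names.
bos_lower = taxon("bos taurus"; casesensitive=false)

# Restrict the search to a single rank.
# ASSUMPTION: ranks are given as Symbols; the docstring does not say so.
mus_genus = taxon("Mus"; rank=:genus)
```

Limiting the rank is one way to resolve a name such as "Mus", which is shared by more than one node in the taxonomy.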
NCBITaxonomy.vernacular — Function
vernacular(t::NCBITaxon)
This function will return nothing if no vernacular name is known, and an array of names if found. It searches the "common name" and "genbank common name" categories of the NCBI taxonomy name table.
NCBITaxonomy.synonyms — Function
synonyms(t::NCBITaxon)
This function will return nothing if no synonyms exist, and an array of names if they do. It returns all of the
NCBITaxonomy.authority — Function
authority(t::NCBITaxon)
This function will return nothing if no authority exists, and a string with the authority if found.
NCBITaxonomy.alternativetaxa — Function
alternativetaxa(df::DataFrame, name::AbstractString)
Generic version of alternativetaxa with an arbitrary data frame.
alternativetaxa(name::AbstractString)
Returns an array of taxa that share the same name. Note that this function does strict, case-sensitive searches only at the moment, but this may be extended through keyword arguments in a future release.
NCBITaxonomy.similarnames — Function
similarnames(name::AbstractString)
Returns a list (as a vector of pairs) mapping an NCBI taxon to a similarity score for the name given as argument.
Note that the function can return the same taxon more than once with different scores, because it will look through the entire list of names, and not only the scientific ones.
It may also return multiple taxa with the same score if the names are ambiguous, in which case all alternatives are given.
That being said, the taxa/score pairs will always be unique. For example, the string "mouse" will match both the vernacular for Bryophyta ("mosses") and its synonym ("Musci") with an equal dissimilarity under the Levenshtein distance; the pair will be returned only once.
Additional keywords are rank (limit to a given rank) and onlysynonyms.
similarnames(df::DataFrame, name::AbstractString; dist::Type{SD}=Levenshtein, threshold::Float64=0.8) where {SD <: StringDistance}
Generic version of similarnames.
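To give an intuition for the scores and the threshold, here is a self-contained sketch of a normalized Levenshtein similarity. It is not the package's implementation (which relies on StringDistances); the function names `levenshtein` and `similarity` are hypothetical, and the normalization by the longer string's length is an assumption that happens to agree with the scores shown in the examples on this page.

```julia
# Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of `a`
    for i in eachindex(s)
        curr = similar(prev)
        curr[1] = i                      # deleting the first i characters of `a`
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev = curr
    end
    return prev[end]
end

# Normalized similarity in [0, 1]; 1.0 means the strings are identical.
function similarity(a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    return longest == 0 ? 1.0 : 1 - levenshtein(a, b) / longest
end

similarity("mouse", "mosses")   # two edits over six characters
```

Under this definition, a name is kept when its similarity to the query is at least the threshold, so lowering the threshold returns more, and less similar, candidates.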
The taxon function will return an NCBITaxon object, which has two fields: name and id. We do not return the class attribute, because the package will always return the scientific name, as the examples below illustrate:
using NCBITaxonomy
taxon("Bos taurus")
Bos taurus (ncbi:9913)
There is a convenience string macro to replace the taxon function:
ncbi"Bos taurus"
Bos taurus (ncbi:9913)
Note that because the names database contains vernacular and deprecated names, the scientific name will be returned, no matter what you search for:
taxon("cow")
Bos taurus (ncbi:9913)
This may be a good point to note that we can use the vernacular function to get a list of NCBI-known vernacular names:
taxon("cow") |> vernacular
8-element Vector{String}:
"bovine"
"cattle"
"cow"
"dairy cow"
"domestic cattle"
"domestic cow"
"ox"
"oxen"
It also works with authorities:
taxon("cow") |> authority
"Bos taurus Linnaeus, 1758"
You can pass an additional strict=false keyword argument to the taxon function to perform fuzzy name matching using the Levenshtein distance:
taxon("Paradiplozon homion", strict=false)
Paradiplozoon homoion (ncbi:147838)
Note that fuzzy searching comes at a performance cost, so it is preferable to use strict matching unless necessary. As a final note, you can specify any distance function from the StringDistances package, using the dist argument.
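As a sketch of the dist argument, the same misspelled query can be scored with another distance from StringDistances. The method signature of similarnames suggests that dist takes the distance type itself rather than an instance; applying that convention to taxon is an assumption here, and the result is not guaranteed to match the Levenshtein one.

```julia
using NCBITaxonomy
using StringDistances   # provides Levenshtein, DamerauLevenshtein, and others

# Fuzzy lookup scored with the Damerau-Levenshtein distance, which also
# counts a transposition of two adjacent characters as a single edit.
# ASSUMPTION: `dist` is passed as a type, as in the similarnames signature.
hit = taxon("Paradiplozon homion"; strict=false, dist=DamerauLevenshtein)
```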
Some valid names refer to more than one entry in the NCBI taxonomy. This is, for example, the case for Mus (the genus and the sub-genus):
alternativetaxa("Mus")
2-element Vector{NCBITaxon}:
Mus (ncbi:10088)
Mus (ncbi:862507)
In some cases, the fuzzy matched name may not be the one you want. There is a function to get the names ordered by similarity:
similarnames("mouse"; threshold=0.6)
12-element Vector{Pair{NCBITaxon, Float64}}:
Mus (ncbi:10088) => 1.0
Mus musculus (ncbi:10090) => 1.0
Alces americanus (ncbi:999462) => 0.8
Bryophyta (ncbi:3208) => 0.6666666666666667
Ectromelia virus (ncbi:12643) => 0.625
Ulex europaeus (ncbi:3902) => 0.6
Dinornithiformes (ncbi:8808) => 0.6
Anser sp. (ncbi:8847) => 0.6
Equus caballus (ncbi:9796) => 0.6
Milicia excelsa (ncbi:58664) => 0.6
Sousa (ncbi:103599) => 0.6
Equus asinus x Equus caballus (ncbi:319699) => 0.6
Errors
NCBITaxonomy.NameHasNoDirectMatch — Type
NameHasNoDirectMatch
This exception is thrown when the name passed as an argument does not have a direct match, in which case using strict=false to switch to fuzzy matching may be advised.
NCBITaxonomy.NameHasMultipleMatches — Type
NameHasMultipleMatches
This exception is thrown when the name is an "in-part" name, which is not a valid node but an aggregation of multiple nodes. It is also thrown when the name is valid for several nodes. The error message will return the taxa that could be used instead. "Reptilia" is an example of a name that will throw this exception (in-part name); "Mus" will also throw this exception, as it is a valid name for both the genus and its subgenus.
Note that the error object has a taxa field, which stores the NCBITaxon objects that were matched; this allows you to catch the error and look for the taxon you want without relying on e.g. alternativetaxa.
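The pattern described above, catching the error and reading its taxa field, could look like the following sketch. The helper name `candidates` is hypothetical, and the exception type is written with its module prefix because this page does not say whether it is exported.

```julia
using NCBITaxonomy

# Return every taxon a name could refer to, instead of failing on ambiguity.
function candidates(name::AbstractString)
    try
        return [taxon(name)]          # unambiguous name: a single match
    catch err
        # Only handle the ambiguity error; anything else is passed along.
        err isa NCBITaxonomy.NameHasMultipleMatches || rethrow()
        return err.taxa               # the taxa stored in the error object
    end
end

candidates("Mus")
```

Rethrowing every other error keeps a NameHasNoDirectMatch visible to the caller, who may then decide to retry with strict=false.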
Building a better namefilter
The taxon function, by default, searches in the entire names table. In many cases, we can restrict the scope of the search quite a lot by searching only in the range of names that match a given condition. For this reason, the taxon function also has a method whose first argument is a data frame of names. These are generated using namefilter, as well as a variety of helper functions.
NCBITaxonomy.namefilter — Function
namefilter(ids::Vector{T}) where {T <: Integer}
Returns a subset of the names table where only the given taxids are present.
namefilter(taxa::Vector{NCBITaxon})
Returns a subset of the names table dataset, where the taxids of the taxa are present. This includes all names, not only the scientific names.
namefilter(division::Symbol)
Returns a subset of the names table for all names under a given NCBI division.
namefilter(division::Vector{Symbol})
Returns a subset of the names table for all names under several NCBI divisions.
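As a small sketch of the first method, a names table can be built from a handful of known identifiers and then searched. The two ids are the ones shown elsewhere on this page (9913 for Bos taurus, 10090 for Mus musculus); the variable name `subset` is only illustrative.

```julia
using NCBITaxonomy

# Keep only the names attached to two known taxids.
subset = namefilter([9913, 10090])   # Bos taurus, Mus musculus

# Search within the subset; this is the taxon method that takes a data frame first.
cow = taxon(subset, "cow")
```

Because the search only sees the rows of `subset`, a name that belongs to any other taxon will not be matched.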
Here is an illustration of why using namefilters makes sense. Let's say we have to search for a potentially misspelled name:
@time taxon("Ebulavurus"; strict=false);
Ebolavirus (ncbi:186536)
We can use the virusfilter() function to generate a table with viruses only:
viruses = virusfilter()
@time taxon(viruses, "Bumbulu ebolavirus"; strict=false);
Bombali ebolavirus (ncbi:2010960)
Standard namefilters
To save some time, there are namefilters pre-populated with the high-level taxonomic divisions.
NCBITaxonomy.bacteriafilter — Function
bacteriafilter()
Returns a namefilter limited to the bacterial division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
NCBITaxonomy.virusfilter — Function
virusfilter()
Returns a namefilter limited to the viral division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments. Note that phage are covered by phagefilter.
NCBITaxonomy.mammalfilter — Function
mammalfilter(;inclusive::Bool=true)
Returns a namefilter limited to the mammal division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
If the keyword argument inclusive is set to false, this will not search for organisms assigned to a lower division, in this case rodents (covered by rodentfilter) and primates (covered by primatefilter). The default behavior is to include these groups.
NCBITaxonomy.vertebratefilter — Function
vertebratefilter(;inclusive::Bool=true)
Returns a namefilter limited to the vertebrate division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
If the keyword argument inclusive is set to false, this will not search for organisms assigned to a lower division, in this case mammals (covered by mammalfilter). The default behavior is to include these groups, which also include the groups covered by mammalfilter itself.
NCBITaxonomy.plantfilter — Function
plantfilter()
Returns a namefilter limited to the "plant and fungi" division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
NCBITaxonomy.invertebratefilter — Function
invertebratefilter()
Returns a namefilter limited to the invertebrate division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
Note that this is limited to organisms not covered by plantfilter, bacteriafilter, and virusfilter.
NCBITaxonomy.rodentfilter — Function
rodentfilter()
Returns a namefilter limited to the rodent division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
NCBITaxonomy.primatefilter — Function
primatefilter()
Returns a namefilter limited to the primate division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
NCBITaxonomy.phagefilter — Function
phagefilter()
Returns a namefilter limited to the phage division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
NCBITaxonomy.environmentalsamplesfilter — Function
environmentalsamplesfilter()
Returns a namefilter limited to the environmental samples division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.
All of these return a dataframe which can be passed to the taxon function as a first argument.
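The effect of the inclusive keyword can be checked by comparing table sizes, as in this sketch. It assumes that Bos taurus is assigned to the mammal division itself rather than to a lower division, which this page does not state.

```julia
using NCBITaxonomy

all_mammals = mammalfilter()                      # includes rodents and primates
strict_mammals = mammalfilter(; inclusive=false)  # mammal division only

# The non-inclusive table should never have more rows than the inclusive one.
size(strict_mammals, 1) <= size(all_mammals, 1)

# Either table can be passed to taxon as its first argument.
# ASSUMPTION: Bos taurus belongs to the mammal division proper.
bos = taxon(strict_mammals, "Bos taurus")
```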