Finding taxa

The taxon function

NCBITaxonomy.taxon - Function
taxon(df::DataFrame, id::Integer)

Returns a fully formed NCBITaxon based on its id. The name of the taxon will be the valid scientific name associated with this id.

taxon(id::Integer)

Performs a search in the entire taxonomy backbone based on a known ID.

taxon(name::AbstractString; kwargs...)

The taxon function is the core entry point into the NCBI taxonomy. It takes a string and a series of keywords, and looks for this taxon in the dataframe (by default, the entire names table).

The keywords are:

  • strict (def. true), whether to require an exact match; set to false to allow fuzzy matching
  • dist (def. Levenshtein), the string distance function to use
  • casesensitive (def. true), whether matching is case-sensitive; set to false to match on lowercased names
  • rank (def. nothing), the taxonomic rank to limit the search
  • preferscientific (def. false), whether scientific names are preferred when the query also matches non-scientific names (synonyms, vernaculars, blast names, ...); this is most likely useful when paired with casesensitive=true, and does not work with strict=false
  • onlysynonyms (def. false), limits the search to synonyms, which may be useful in case the taxonomy is particularly outdated
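
As a rough illustration of how these keywords combine, the sketch below runs a few searches. It assumes the package is installed and the taxonomy files are available locally. The query strings are arbitrary choices, and the comments restate the documented behavior rather than verified output.

```julia
using NCBITaxonomy

# Default behavior: strict, case-sensitive match against the entire names table
cattle = taxon("Bos taurus")

# Same query in lowercase, with case-insensitive matching requested
cattle_lower = taxon("bos taurus"; casesensitive=false)

# Ask for the scientific name to win if the query also matches other name classes
cattle_sci = taxon("Bos taurus"; preferscientific=true)

# Fuzzy matching for a misspelled query (slower than strict matching)
fuzzy = taxon("Paradiplozon homion"; strict=false)
```

Each call returns a single NCBITaxon or throws one of the exceptions described in the Errors section.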
taxon(df::DataFrame, name::AbstractString; kwargs...)

Additional method for taxon with an extra dataframe argument, used most often with a namefilter. Accepts the usual taxon keyword arguments.

NCBITaxonomy.vernacular - Function
vernacular(t::NCBITaxon)

This function will return nothing if no vernacular name is known, and an array of names if found. It searches the "common name" and "genbank common name" categories of the NCBI taxonomy name table.

NCBITaxonomy.synonyms - Function
synonyms(t::NCBITaxon)

This function will return nothing if no synonyms exist, and an array of names if they do. It returns all of the

NCBITaxonomy.authority - Function
authority(t::NCBITaxon)

This function will return nothing if no authority exists, and a string with the authority if found.

NCBITaxonomy.alternativetaxa - Function
alternativetaxa(df::DataFrame, name::AbstractString)

Generic version of alternativetaxa with an arbitrary data frame

alternativetaxa(name::AbstractString)

Returns an array of taxa that share the same name – note that this function does strict, case-sensitive searches only at the moment, but this may be extended through keyword arguments in a future release.

NCBITaxonomy.similarnames - Function
similarnames(name::AbstractString)

Returns a list (as a vector of pairs) mapping an NCBI taxon to a similarity score for the name given as argument.

Note that the function can return the same taxon more than once with different scores, because it will look through the entire list of names, and not only the scientific ones.

It may also return multiple taxa with the same score if the names are ambiguous, in which case all alternatives are given.

That being said, the taxa/score pairs will always be unique. For example, the string "mouse" will match both the vernacular for Bryophyta ("mosses") and its synonym ("Musci") with an equal dissimilarity under the Levenshtein distance - the pair will be returned only once.

Additional keywords are rank (limit to a given rank) and onlysynonyms.

similarnames(df::DataFrame, name::AbstractString; dist::Type{SD}=Levenshtein, threshold::Float64=0.8) where {SD <: StringDistance}

Generic version of similarnames


The taxon function returns an NCBITaxon object, which has two fields: name and id. We do not return the class attribute, because the package will always return the scientific name, as the examples below illustrate:

using NCBITaxonomy
taxon("Bos taurus")
Bos taurus (ncbi:9913)

There is a convenience string macro to replace the taxon function:

ncbi"Bos taurus"
Bos taurus (ncbi:9913)

Note that because the names database contains vernacular and deprecated names, the scientific name will be returned, no matter what you search for:

taxon("cow")
Bos taurus (ncbi:9913)

This may be a good point to note that we can use the vernacular function to get a list of NCBI-known vernacular names:

taxon("cow") |> vernacular
8-element Vector{String}:
 "bovine"
 "cattle"
 "cow"
 "dairy cow"
 "domestic cattle"
 "domestic cow"
 "ox"
 "oxen"

The same works with authorities:

taxon("cow") |> authority
"Bos taurus Linnaeus, 1758"
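
Because vernacular, synonyms, and authority are documented to return nothing when no value is known, calling code may want a fallback. The following is a minimal sketch of one way to handle that using Base's something function. The fallback values are arbitrary choices for illustration.

```julia
using NCBITaxonomy

t = taxon("Bos taurus")

# Each of these may be `nothing` when the names table has no matching entry
common_names = vernacular(t)
other_names = synonyms(t)
auth = authority(t)

# something(x, default) returns the default when x is `nothing`
println("Vernacular names: ", join(something(common_names, String[]), ", "))
println("Synonyms: ", join(something(other_names, String[]), ", "))
println("Authority: ", something(auth, "unknown"))
```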

You can pass an additional strict=false keyword argument to the taxon function to perform fuzzy name matching using the Levenshtein distance:

taxon("Paradiplozon homion", strict=false)
Paradiplozoon homoion (ncbi:147838)

Note that fuzzy searching comes at a performance cost, so it is preferable to use strict matching unless necessary. As a final note, you can specify any distance function from the StringDistances package, using the dist argument.
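
A sketch of swapping the distance function is given below. It assumes that StringDistances is installed alongside NCBITaxonomy, and that dist takes the distance type itself (as in the similarnames signature shown earlier) rather than an instance. DamerauLevenshtein is only one possible choice, and the result may differ from the one obtained with the default distance.

```julia
using NCBITaxonomy
using StringDistances

# Same misspelled query as above, scored with the Damerau-Levenshtein distance
# instead of the default Levenshtein distance
hit = taxon("Paradiplozon homion"; strict=false, dist=DamerauLevenshtein)
```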

Some valid names refer to more than one entry in the NCBI taxonomy. This is, for example, the case for Mus (the genus and the sub-genus):

alternativetaxa("Mus")
2-element Vector{NCBITaxon}:
 Mus (ncbi:10088)
 Mus (ncbi:862507)

In some cases, the fuzzy matched name may not be the one you want. There is a function to get the names ordered by similarity:

similarnames("mouse"; threshold=0.6)
12-element Vector{Pair{NCBITaxon, Float64}}:
                            Mus (ncbi:10088) => 1.0
                   Mus musculus (ncbi:10090) => 1.0
              Alces americanus (ncbi:999462) => 0.8
                       Bryophyta (ncbi:3208) => 0.6666666666666667
               Ectromelia virus (ncbi:12643) => 0.625
                  Ulex europaeus (ncbi:3902) => 0.6
                Dinornithiformes (ncbi:8808) => 0.6
                       Anser sp. (ncbi:8847) => 0.6
                  Equus caballus (ncbi:9796) => 0.6
                Milicia excelsa (ncbi:58664) => 0.6
                         Sousa (ncbi:103599) => 0.6
 Equus asinus x Equus caballus (ncbi:319699) => 0.6
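
Since the result is a vector of pairs, it can be processed with ordinary Julia iteration. The sketch below prints every candidate and then keeps the ones that reach the highest score. The variable names are illustrative, and the code assumes at least one candidate passes the threshold.

```julia
using NCBITaxonomy

candidates = similarnames("mouse"; threshold=0.6)

# Each element is a Pair: the taxon comes first, the similarity score second
for (candidate, score) in candidates
    println(candidate.name, " (ncbi:", candidate.id, "): ", round(score; digits=2))
end

# Keep every taxon that reaches the highest score (there may be ties)
top_score = maximum(last, candidates)
best = [first(p) for p in candidates if last(p) == top_score]
```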

Errors

NCBITaxonomy.NameHasNoDirectMatch - Type
NameHasNoDirectMatch

This exception is thrown when the name passed as an argument does not have a direct match, in which case using strict=false to switch to fuzzy matching may be advised.

NCBITaxonomy.NameHasMultipleMatches - Type
NameHasMultipleMatches

This exception is thrown when the name is an "in-part" name, which is not a valid node but an aggregation of multiple nodes. It is also thrown when the name is valid for several nodes. The error message lists the taxa that could be used instead. "Reptilia" is an example of a name that will throw this exception (in-part name); "Mus" will also throw this exception, as the name is valid for both the genus and a subgenus.

Note that the error object has a taxa field, which stores the NCBITaxon objects that were matched; this makes it possible to catch the error and look for the taxon you want without relying on e.g. alternativetaxa.
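
Catching the error and reading its taxa field could look like the sketch below. The rule used to choose among the candidates (keeping the smallest id) is purely hypothetical and stands in for whatever criterion fits your data. The exception type is written with its module prefix in case it is not exported.

```julia
using NCBITaxonomy

# "Mus" is described above as a name that is valid for several nodes
selected = try
    taxon("Mus")
catch err
    # Only handle the ambiguity case; any other error is rethrown
    err isa NCBITaxonomy.NameHasMultipleMatches || rethrow()
    # Hypothetical selection rule: keep the candidate with the smallest id
    first(sort(err.taxa; by = t -> t.id))
end
```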


Building a better namefilter

The taxon function, by default, searches in the entire names table. In many cases, we can restrict the scope of the search quite a lot by searching only in the range of names that match a given condition. For this reason, the taxon function also has a method whose first argument is a data frame of names. These are generated using namefilter, as well as a variety of helper functions.

NCBITaxonomy.namefilter - Function
namefilter(ids::Vector{T}) where {T <: Integer}

Returns a subset of the names table where only the given taxids are present.

namefilter(taxa::Vector{NCBITaxon})

Returns a subset of the names table dataset, where the taxids of the taxa are present. This includes all names, not only the scientific names.

namefilter(division::Symbol)

Returns a subset of the names table for all names under a given NCBI division.

namefilter(division::Vector{Symbol})

Returns a subset of the names table for all names under multiple NCBI divisions.

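
A small sketch of the first two methods follows: it builds a names table from two known taxa, then runs a search against that table only. The choice of taxa and the variable names are illustrative, and the ids are the ones shown in the examples on this page.

```julia
using NCBITaxonomy

# Build a names table from a vector of taxa
livestock = namefilter([taxon("Bos taurus"), taxon("Equus caballus")])

# The same kind of table can be built directly from the ids
livestock_by_id = namefilter([9913, 9796])

# The search is now limited to names attached to these two taxa,
# which includes their non-scientific names
cow = taxon(livestock, "cow")
```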

Here is an illustration of why using namefilters makes sense. Let's say we have to search for a potentially misspelled name:

@time taxon("Ebulavurus"; strict=false);
Ebolavirus (ncbi:186536)

We can use the virusfilter() function to generate a table with viruses only:

viruses = virusfilter()
@time taxon(viruses, "Bumbulu ebolavirus"; strict=false);
Bombali ebolavirus (ncbi:2010960)

Standard namefilters

To save some time, there are namefilters pre-populated with the high-level taxonomic divisions.

NCBITaxonomy.bacteriafilter - Function
bacteriafilter()

Returns a namefilter limited to the bacterial division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

NCBITaxonomy.virusfilter - Function
virusfilter()

Returns a namefilter limited to the viral division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments. Note that phages are covered by phagefilter.

NCBITaxonomy.mammalfilter - Function
mammalfilter(;inclusive::Bool=true)

Returns a namefilter limited to the mammal division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

If the keyword argument inclusive is set to false, this will not search for organisms assigned to a lower division, in this case rodents (covered by rodentfilter) and primates (covered by primatefilter). The default behavior is to include these groups.

NCBITaxonomy.vertebratefilter - Function
vertebratefilter(;inclusive::Bool=true)

Returns a namefilter limited to the vertebrate division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

If the keyword argument inclusive is set to false, this will not search for organisms assigned to a lower division, in this case mammals (covered by mammalfilter). The default behavior is to include these groups, which also include the groups covered by mammalfilter itself.

NCBITaxonomy.plantfilter - Function
plantfilter()

Returns a namefilter limited to the "plant and fungi" division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

NCBITaxonomy.invertebratefilter - Function
invertebratefilter()

Returns a namefilter limited to the invertebrate division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

Note that this is limited to organisms not covered by plantfilter, bacteriafilter, and virusfilter.

NCBITaxonomy.rodentfilter - Function
rodentfilter()

Returns a namefilter limited to the rodent division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

NCBITaxonomy.primatefilter - Function
primatefilter()

Returns a namefilter limited to the primate division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

NCBITaxonomy.phagefilter - Function
phagefilter()

Returns a namefilter limited to the phage division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.

NCBITaxonomy.environmentalsamplesfilter - Function
environmentalsamplesfilter()

Returns a namefilter limited to the environmental samples division of the NCBI taxonomy. See the documentation for namefilter and taxon for more information about arguments.


All of these return a dataframe that can be passed to the taxon function as its first argument.
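
As a brief sketch of how the standard filters plug into taxon, the example below builds two of them and searches within each. The queries are arbitrary, and the variable names are illustrative. Building a filter once and reusing it across many searches is the assumed usage pattern.

```julia
using NCBITaxonomy

# Mammal names only, leaving out the rodent and primate divisions
mammals = mammalfilter(; inclusive=false)
cattle = taxon(mammals, "Bos taurus")

# Fuzzy search restricted to the viral division
viruses = virusfilter()
virus = taxon(viruses, "Bumbulu ebolavirus"; strict=false)
```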