[Bioperl-l] Bio::DB::Taxonomy:: mishandles species, subspecies/variant names
Chris Fields
cjfields at uiuc.edu
Mon May 15 21:29:14 UTC 2006
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Nadeem Faruque
> Sent: Monday, May 15, 2006 2:47 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::DB::Taxonomy:: mishandles
> species,subspecies/variant names
>
> >> My personal view is that having it as an annotation would serve no
> >> real purpose. For me the whole point of any kind of species
> >> representation in bioperl is to allow you to compare species in a
> >> biologically meaningful way. If it's just some annotation then that
> >> means it's basically
>
> I understand the need to find the species name of entries, especially
> now that so many complete genomes have been given their own strain-
> specific tax nodes, and I also think it is a shame that the ncbi tax
> dump does not give a rank to entries such as these (they cannot
> easily be distinguished from unofficial ranks higher in the tree
> without ascending the tree).
> Would it be useful for the species name to be included within EMBL
> file headers, eg in a line called OB (OB is a terrible suggestion
> based on 'Organism Binomial' since OS is already in use)?
>
> eg two examples of the species 'Apple stem grooving virus', where the
> second one would appear to be a different species without delving
> into the tax tree or the inclusion of an OB line.
>
> AC D14995; S47260;
> DE Apple stem grooving virus genome, complete sequence.
> OS Apple stem grooving virus
> OB Apple stem grooving virus
> OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flexiviridae;
> OC Capillovirus.
>
> AC AY646511;
> DE Citrus tatter leaf virus strain Kumquat 1, complete genome.
> OS Citrus tatter leaf virus
> OB Apple stem grooving virus
> OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flexiviridae;
> OC Capillovirus.
Jason also mentions a few examples (see below). The problem is that EMBL and
GenBank flatfiles do not record the rank of each node in the taxonomic
lineage, so any rank assignment is a best guess. What I'm seeing is that the
guess is wrong more often than not for complex scientific names (viruses,
bacteria, etc). Notice the doubled strain in the SOURCE lines of the following
GenBank files passed through SeqIO (genbank->genbank conversion, BTW; I
haven't tried EMBL):
SOURCE Azoarcus sp. EbN1 EbN1
ORGANISM Azoarcus sp.
Bacteria; Proteobacteria; Betaproteobacteria; Rhodocyclales;
Rhodocyclaceae; Azoarcus.
SOURCE Mycobacterium sp. KMS KMS
ORGANISM Mycobacterium sp.
Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
Corynebacterineae; Mycobacteriaceae; Mycobacterium.
SOURCE Mycobacterium tuberculosis C C
ORGANISM Mycobacterium tuberculosis
Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
Corynebacterineae; Mycobacteriaceae; Mycobacterium; Mycobacterium;
tuberculosis complex; Mycobacterium.
SOURCE Bacillus subtilis subsp. subtilis str. 168 subtilis str. 168
ORGANISM Bacillus subtilis subsp.
Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
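For anyone who wants to scan converted files for this symptom, here is a rough
sketch in Python (not BioPerl code; the function name find_doubled_suffix is
made up for illustration). It only flags a SOURCE line whose trailing words
are repeated back to back, and makes no claim about why the doubling happens.

```python
def find_doubled_suffix(source_line):
    """Return the trailing word sequence that occurs twice in a row, or None."""
    words = source_line.split()
    # Drop the flatfile keyword if the caller passed the whole line.
    if words and words[0] == "SOURCE":
        words = words[1:]
    # Try the longest candidate first so 'subtilis str. 168' wins over '168'.
    for size in range(len(words) // 2, 0, -1):
        if words[-size:] == words[-2 * size:-size]:
            return " ".join(words[-size:])
    return None


for line in [
    "SOURCE Azoarcus sp. EbN1 EbN1",
    "SOURCE Mycobacterium tuberculosis C C",
    "SOURCE Bacillus subtilis subsp. subtilis str. 168 subtilis str. 168",
    "SOURCE Staphylococcus haemolyticus JCSC1435",
]:
    print(line, "->", find_doubled_suffix(line))
```

Note that a legitimately repetitive name (a trinomial that repeats the species
epithet, say) would also be flagged, so treat a hit as a prompt to look at the
record, not as proof of a bug.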
Here are Jason's examples, for posterity:
Can you guess which value is the strain and which is the sub-species? What
happens when there is a two-part strain name (space separated) and a
sub-species or variety designation?
SOURCE Staphylococcus haemolyticus JCSC1435
ORGANISM Staphylococcus haemolyticus JCSC1435
Bacteria; Firmicutes; Bacillales; Staphylococcus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=279808
strain is JCSC1435
versus
SOURCE Muntiacus muntjak vaginalis
ORGANISM Muntiacus muntjak vaginalis
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia;
Pecora; Cervidae; Muntiacinae; Muntiacus.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9887
species is muntjak, sub-species vaginalis ?
versus
SOURCE Aspergillus nidulans FGSC A4
ORGANISM Aspergillus nidulans FGSC A4
Eukaryota; Fungi; Ascomycota; Pezizomycotina; Eurotiomycetes;
Eurotiales; Trichocomaceae; Emericella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=227321
Genus should be Aspergillus or Emericella ?
Strain and subspecies/variety in the same entry
SOURCE Cryptococcus neoformans var. grubii H99
ORGANISM Cryptococcus neoformans var. grubii H99
Eukaryota; Fungi; Basidiomycota; Hymenomycetes;
Heterobasidiomycetes; Tremellomycetidae; Tremellales; Tremellaceae;
Filobasidiella.
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=235443
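To make the ambiguity in these examples concrete, here is a small Python
sketch of a purely positional split (genus, species, everything else). The
function naive_split is hypothetical and is not how Bio::Species works
internally; it just shows that the leftover text has the same shape whether
it is a strain, a sub-species, or both.

```python
def naive_split(organism):
    """Split a scientific name on whitespace: genus, species, remainder."""
    parts = organism.split()
    genus = parts[0] if parts else ""
    species = parts[1] if len(parts) > 1 else ""
    # Whatever is left could be a strain, a sub-species, a variety, or a mix.
    remainder = " ".join(parts[2:])
    return genus, species, remainder


for name in [
    "Staphylococcus haemolyticus JCSC1435",        # remainder is a strain
    "Muntiacus muntjak vaginalis",                 # remainder looks like a sub-species
    "Aspergillus nidulans FGSC A4",                # remainder has two words
    "Cryptococcus neoformans var. grubii H99",     # variety and strain together
]:
    print(naive_split(name))
```

Nothing in the string itself says which reading applies, and the Aspergillus
record adds a genus that does not match the last node of its lineage
(Emericella), so position alone can't settle it.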
> > My point is, a large number of users do NOT use, nor care about,
> > taxonomic information to the degree they need to know the entire
> > classification of the organism; many are just as happy about getting
> > the scientific name only, which is in the GenBank/EMBL file itself.
> > To take one extreme, it is not productive to force every user to
> > download the NCBI tax database and use lookups just to convert
> > sequences from EMBL format to GenBank format. It's not productive to
> > allow users to spam the NCBI tax database remotely either, so
> > hardcoding lookups is, IMHO, a big mistake.
>
> I don't think you need to add any information to turn an embl-format
> file into a Genbank flatfile, but maybe I'm missing something obvious.
The issue is the way the SOURCE and ORGANISM lines are handled (OS/OC lines
in EMBL, I believe), which uses a Bio::Species object. The problem is, as I
mentioned above, that the flat file carries no hierarchical rank information,
just the order of the nodes in the lineage. We can try to make a best guess
based on that but it's obviously very tricky, particularly when dealing with
subspecies, strains, etc.
NCBI also states that the classification is often too long for a file and so
may be incomplete (I think they leave out nodes tagged 'no rank', but I can't
be completely sure), so that's another issue.
Anyway, this is where the lookup would come in. It would require a local
taxonomy database (we can't spam the NCBI remote database, that would just
be rude), which would give the complete taxonomic classification if it worked
properly.
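As a rough illustration of what such a local lookup buys you, here is a
Python sketch that walks up a parent/rank table until it reaches a node
ranked 'species'. It assumes the taxonomy has already been loaded into a
dict; the NODES table, its tax ids, and the function species_of are invented
stand-ins for illustration, not real dump rows and not the Bio::DB::Taxonomy
API.

```python
# tax_id -> (parent_id, rank, name). Illustrative rows only; the ids are
# placeholders and do not come from a real taxonomy dump.
NODES = {
    1: (1, "no rank", "root"),
    2: (1, "genus", "Staphylococcus"),
    3: (2, "species", "Staphylococcus haemolyticus"),
    4: (3, "no rank", "Staphylococcus haemolyticus JCSC1435"),
}


def species_of(tax_id, nodes=NODES):
    """Ascend from tax_id and return the name of the first 'species' node."""
    current = tax_id
    seen = set()  # guards against cycles in a damaged table
    while current in nodes and current not in seen:
        seen.add(current)
        parent, rank, name = nodes[current]
        if rank == "species":
            return name
        if parent == current:  # the root points at itself, so stop here
            break
        current = parent
    return None


print(species_of(4))  # strain-level node, resolved by ascending one step
print(species_of(2))  # a genus has no species above it
```

With rank information available the strain node resolves cleanly to its
species, which is exactly what the flat file alone can't tell us.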
So now we have three possible situations:
1) One extreme : We require a lookup to get it right (which, BTW, the
current lookup doesn't); this by default requires a local database.
2) Middle of the road : we try to guess as best we can from the information
given (the current situation); this breaks more and more often now, so it is
becoming unreliable.
3) Other extreme : we punt and absolve ourselves of even trying to parse the
data and just have a strict tagname->value or similar simple construct to
handle the data.
#3 as the default with an option to do #1 is probably best (least error prone,
with the option of getting the most information), with caching to speed up
lookups as Sendu Bala does now.
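Here is a minimal sketch of what I mean by that combination, in Python to
keep it short. The class names OrganismFields and CachedLookup, and the
fake_db table, are hypothetical; this is not existing BioPerl code, and the
real design would need more thought. The fields are stored verbatim by tag
name, and a classification is only fetched if the user supplies a lookup.

```python
class CachedLookup:
    """Wrap a lookup function so each name is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.misses = 0  # how many times the wrapped function was called

    def __call__(self, name):
        if name not in self._cache:
            self.misses += 1
            self._cache[name] = self._fetch(name)
        return self._cache[name]


class OrganismFields:
    """Option #3: keep tag -> value as read; option #1 only on request."""

    def __init__(self, fields, lookup=None):
        self.fields = dict(fields)
        self._lookup = lookup

    def get(self, tag):
        return self.fields.get(tag)

    def classification(self):
        # No lookup configured means no guessing at all.
        if self._lookup is None:
            return None
        return self._lookup(self.fields.get("ORGANISM"))


# Stand-in for a local taxonomy database (assumption for the example).
fake_db = {
    "Staphylococcus haemolyticus JCSC1435":
        ["Bacteria", "Firmicutes", "Bacillales", "Staphylococcus"],
}
lookup = CachedLookup(fake_db.get)

plain = OrganismFields({"ORGANISM": "Staphylococcus haemolyticus JCSC1435"})
rich = OrganismFields({"ORGANISM": "Staphylococcus haemolyticus JCSC1435"},
                      lookup=lookup)

print(plain.classification())   # nothing is guessed without a lookup
print(rich.classification())
print(rich.classification(), lookup.misses)  # second call served from cache
```

Format conversion then only needs get(), so nobody is forced to install a
taxonomy database just to turn EMBL into GenBank.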
Chris
> Nadeem
>
>
> --
> Dr S.M. Nadeem N. Faruque
> 9 Barley Court
> Saffron Walden
> Essex CB11 3HG
> 01799 500 120