[Bioperl-l] Bio::DB::Taxonomy:: mishandles species, subspecies/variant names

Fri May 12 17:08:11 UTC 2006

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Friday, May 12, 2006 5:25 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names
> 
> In bioperl up to at least 1.5.1, when one of the database modules comes
> across a species rank it does:
> 
> if ($rank eq 'species') {
>    # get rid of genus from species name
>    (undef,$taxon_name) = split(/\s+/,$taxon_name,2);
> }

The XML example from NCBI Taxonomy I mentioned previously seems to have
everything in the classification, from superkingdom down to species (no
strain unfortunately, and I'm not sure about subspecies); if a rank is
missing, then that designation either doesn't exist or is tagged as 'no
rank'.  As I mentioned before, I'm not intimately familiar with
Bio::Taxonomy, Bio::DB::Taxonomy, or Bio::Species, so I don't have a clue
as to how everything is parsed and plugged into Bio::Taxonomy objects.  I
do know that XML::Twig is used for parsing the data, so it shouldn't be too
hard to change what you want.

I haven't tried using Bio::DB::Taxonomy directly yet, but I would have
thought that the binomial is just built from the XML twig 'LineageEx'
Rank=Genus + Rank=Species, that the genus comes from the tag 'Genus' and
species from 'Species', and that the scientific name is from the tag
'ScientificName'.  Guess not. 
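
What I had in mind is roughly the following.  It is only a sketch: the
lineage is hard-coded as a list of rank/name pairs standing in for whatever
the parser pulls out of 'LineageEx' (the XML::Twig step is left out
entirely), and the sub name name_for_rank is invented for the example; it is
not part of any Bio::DB::Taxonomy module.

```perl
use strict;
use warnings;

# Stand-in for parsed lineage data, ordered root -> leaf.
# In real use these pairs would come from the parsed XML.
my @lineage = (
    { rank => 'family',  name => 'Retroviridae' },
    { rank => 'genus',   name => 'Alpharetrovirus' },
    { rank => 'species', name => 'Avian leukosis virus' },
);

# Return the name recorded for a given rank, or undef if that rank
# is absent from the lineage.
sub name_for_rank {
    my ($rank, @taxa) = @_;
    for my $taxon (@taxa) {
        return $taxon->{name} if $taxon->{rank} eq $rank;
    }
    return;
}

my $genus   = name_for_rank('genus',   @lineage);
my $species = name_for_rank('species', @lineage);
print "genus:   $genus\n";
print "species: $species\n";
```

The point is that each name is looked up by its rank and kept whole, so
nothing has to be guessed by cutting up the scientific name.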

> However even though true scientific name is usually 'Genus species' in
> the database, note the 'usually' - sometimes the species is a multiword
> item that does not include the Genus, so we can't do some simple split
> and take the second word.
> The same applies to levels below species, eg. 'Avian erythroblastosis
> virus' is a variant of the species 'Avian leukosis virus' but 'Avian
> erythroblastosis virus (strain ES4)' is a variant of that variant...
> 
> My solution is to just remove whatever is the same between the current
> rank and the previous rank. Maybe even that's not so perfect, but it
> must be a lot better than turning the species 'Avian leukosis virus'
> into the species 'virus' (especially given that the genus here is
> 'Alpharetrovirus')!
> 
> # we need to be going root(kingdom) -> leaf (species or lower) order
> #
> # we need to be storing untouched versions of the scientific name of
> # the previous rank ($self->{_last_raw})
> #
> # probably only bother doing this once we get to genus
> my $last_raw = $self->{_last_raw};
> $self->{_last_raw} = $sci_name;
> if (defined $last_raw && length $last_raw) {
>    # \Q...\E so characters like '(' in a name are matched literally
>    $sci_name =~ s/\Q$last_raw\E//;
>    $sci_name =~ s/^\s+//;
> }
> 
> Are there even more strange species (and lower) names that would still
> not work well with the above solution?

I don't think taking Genus/Species directly from the scientific name
(normally what is in the SOURCE or ORGANISM annotation for GenBank or OS for
EMBL) is the best way to go about it, since it's really a best guess using a
regex; Jason pointed out several examples where this falls apart, and being
a bacterial man I have found many examples myself.  I'm also not sure that
forcing a lookup for every TaxID in every sequence every time it's passed
through SeqIO is the best way to go either, though I think it should be
required for storing sequences.  It's a tricky balance.

I still think that maybe we should stop treating the SOURCE/ORGANISM or
OS/OC information in GenBank and EMBL files as anything more than strict
annotation.  We could either rework Bio::Species into something like a
Bio::Annotation::Species object to handle that annotation, or deprecate
Bio::Species, or separate it completely from any Bio::Taxonomy objects.  It
would really simplify things.  Then, if anyone is interested in taxonomy,
they can either install a local database or use Entrez efetch, and then use
Bio::DB::Taxonomy (fixed, of course) to grab the TaxID info.  It seems like
we're running into more and more exceptions to the rule as more genomes are
made available.

Anyway, using Bio::Species for GenBank is really screwy for bacterial names,
so currently I get around BioPerl issues with bacterial names by grabbing
the 'source' seqfeature and pulling the 'organism' tag out.  But it really
shouldn't be that obfuscated, right?
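
For reference, the workaround looks roughly like this.  It is a sketch, not
a recommendation: the sub name organism_from_source and the filename in the
usage comment are made up, and I'm assuming the record has a 'source'
feature carrying an 'organism' tag, as GenBank records normally do.  The
sub only needs an object that answers get_SeqFeatures, and features that
answer primary_tag, has_tag and get_tag_values.

```perl
use strict;
use warnings;

# Return the /organism value from the first 'source' feature of a
# sequence object, or undef if no such feature or tag is present.
sub organism_from_source {
    my ($seq) = @_;
    for my $feat ($seq->get_SeqFeatures) {
        next unless $feat->primary_tag eq 'source';
        next unless $feat->has_tag('organism');
        my ($organism) = $feat->get_tag_values('organism');
        return $organism;
    }
    return;
}

# Typical use, assuming BioPerl is installed and 'example.gb' is a
# GenBank file (the filename is hypothetical):
#
#   use Bio::SeqIO;
#   my $in = Bio::SeqIO->new(-file => 'example.gb', -format => 'genbank');
#   while (my $seq = $in->next_seq) {
#       my $organism = organism_from_source($seq);
#       print defined $organism ? $organism : 'unknown', "\n";
#   }
```

This sidesteps Bio::Species altogether and returns the organism string
exactly as it appears in the feature table.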

Chris

> Cheers,
> Sendu.