[Bioperl-l] Bio::DB::Taxonomy:: mishandles species, subspecies/variant names

Sendu Bala sb at mrc-dunn.cam.ac.uk
Mon May 15 08:18:11 UTC 2006


Chris Fields wrote:
> Sendu Bala wrote:
>> In bioperl up to at least 1.5.1, when one of the database modules 
>> comes across a species rank it does:
>> 
>> if ($rank eq 'species') {
>>     # get rid of genus from species name
>>     (undef, $taxon_name) = split(/\s+/, $taxon_name, 2);
>> }
> 
> The XML example from NCBI Taxonomy I mentioned previously seems to 
> have everything in the classification, from superkingdom down to 
> species (no strain unfortunately, and I'm not sure about subspecies);
> if it's missing the rank then the designation doesn't exist or is 
> tagged as 'no rank'.  Like I mentioned before I'm not intimately 
> familiar with Bio::Taxonomy, Bio::DB::Taxonomy, or Bio::Species, so I 
> don't have a clue as to how everything is parsed and plugged in to 
> Bio::Taxonomy objects.  I do know that XML::Twig is used for parsing
> through the data so it shouldn't be too hard to change what you
> want.

Yes, that's all true, but I'm not sure what it has to do with what I was
saying. FYI, you do get a 'subspecies' rank but no 'variant' rank. In my
own implementation I change the rank of all 'no rank' Nodes below
species to 'variant'.
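
A minimal sketch of that relabelling, assuming the lineage is an ordered
root-to-leaf list of plain hashes. The sub name and the data layout are
made up for illustration and are not the bioperl API:

```perl
use strict;
use warnings;

# Hypothetical helper: change 'no rank' to 'variant' for every node that
# sits below the species node. Nodes are plain hash refs, ordered from
# root to leaf, and are modified in place.
sub relabel_below_species {
    my @lineage = @_;
    my $below_species = 0;
    for my $node (@lineage) {
        if ($below_species && $node->{rank} eq 'no rank') {
            $node->{rank} = 'variant';
        }
        # Everything after the species node counts as "below species".
        $below_species = 1 if $node->{rank} eq 'species';
    }
    return @lineage;
}

my @lineage = relabel_below_species(
    { rank => 'no rank', name => 'cellular organisms' },
    { rank => 'genus',   name => 'Escherichia' },
    { rank => 'species', name => 'Escherichia coli' },
    { rank => 'no rank', name => 'Escherichia coli K-12' },
);
print "$_->{rank}\t$_->{name}\n" for @lineage;
```

A 'no rank' node above the species is left alone; only the nodes that
follow the species node are touched.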


> I haven't tried using Bio::DB::Taxonomy directly yet, but I would 
> have thought that the binomial is just built from the XML twig 
> 'LineageEx' Rank=Genus + Rank=Species, that the genus comes from the
> tag 'Genus' and species from 'Species', and that the scientific name
> is from the tag 'ScientificName'.  Guess not.

No. See above for what it actually does. That is a copy/paste from the
code (there, $taxon_name holds the ScientificName). When it finds a
species rank it does that split because in the ncbi taxonomy database
the 'genus' rank for a human has a ScientificName
of 'Homo', whilst the 'species' rank has a ScientificName of 'Homo
sapiens', and the bioperl model (quite rightly, I think) wants the
'species' node to not have information of other nodes (well, except for
the classification array). So it removes the 'Homo' from 'Homo sapiens'
giving a species name of 'sapiens'. This then allows the binomial method
to return 'Homo sapiens' instead of 'Homo Homo sapiens'.

(though in a bizarre twist, and this is one of my problems with how
names are currently represented in the Taxonomy modules, 'Scientific
Name' and 'binomial' are synonymous)


[snip]
>> My solution is to just remove whatever is the same between the 
>> current rank and the previous rank. Maybe even that's not so 
>> perfect, but it must be a lot better than turning the species 
>> 'Avian leukosis virus' into the species 'virus' (especially given 
>> that the genus here is 'Alpharetrovirus')!
> 
> I don't think taking Genus/Species directly from the scientific 
> name (normally what is in the SOURCE or ORGANISM annotation for 
> GenBank or OS for EMBL) is the best way to go about it [snip]

Perhaps, but again I'm not sure what this has to do with what I was
saying. If you don't want your species name to contain your genus name
you have to do some kind of parsing. My post merely pointed out that the
parsing currently in bioperl does not work for viruses and possibly
other species. I'd like to think that someone cares about this error and
would do the simple fix I offered, or that they already know about the
problem and have done their own fix.
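
As a rough sketch of that kind of fix: strip the previous rank's name
from the front of the current name only when it is actually there. This
is one reading of "remove whatever is the same"; the sub name is
hypothetical and this is not the code in bioperl:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of bioperl: drop the parent node's name
# from the front of this node's ScientificName, and leave the name alone
# if it does not start with the parent's name.
sub strip_parent_name {
    my ($parent_name, $taxon_name) = @_;
    # \Q...\E quotes any regex metacharacters in the parent name.
    (my $stripped = $taxon_name) =~ s/^\Q$parent_name\E\s+//;
    return $stripped;
}

print strip_parent_name('Homo', 'Homo sapiens'), "\n";
print strip_parent_name('Alpharetrovirus', 'Avian leukosis virus'), "\n";
```

The first call prints 'sapiens'; the second prints 'Avian leukosis virus'
unchanged, because the species name does not begin with the genus name.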


> I'm also not sure that forcing a lookup for every TaxID in every 
> sequence every time it's passed through SeqIO is the best way to go 
> either, though I think it should be required for storing sequences. 
> It's a tricky balance.

In my own implementation any database lookups are cached, and you have
the option of not doing any database lookup at all and 'faking' a
taxonomy from the supplied list of names (so it works just like normal
Bio::Seq).
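
The caching part could look something like the sketch below. The lookup
sub is a stand-in for a real database query, and all names here are
assumptions for illustration rather than anything in bioperl:

```perl
use strict;
use warnings;

my %cache;        # taxid => lineage (array ref)
my $lookups = 0;  # counts real lookups, for demonstration only

# Hypothetical stand-in for a real taxonomy database query.
sub fetch_lineage_from_db {
    my ($taxid) = @_;
    $lookups++;
    return ['Homo', 'sapiens'] if $taxid == 9606;
    return [];
}

# Return the cached lineage, querying the "database" only on first use.
sub lineage_for {
    my ($taxid) = @_;
    $cache{$taxid} //= fetch_lineage_from_db($taxid);
    return $cache{$taxid};
}

lineage_for(9606) for 1 .. 3;
print "$lookups\n";
```

Three requests for the same TaxID result in a single lookup, so passing
many sequences from one species through a parser costs one query.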


> I still think that maybe we should absolve ourselves from using 
> SOURCE/ORGANISM or OS/OC information in GenBank files as anything 
> more than strictly annotation, or reconstruct Bio::Species to maybe a
>  Bio::Annotation::Species object to handle that annotation and either
>  deprecate Bio::Species or separate it completely from any 
> Bio::Taxonomy objects.  It would really simplify things.  Then, if 
> anyone is interested in taxonomy, either install a local database or
>  use Entrez efetch, and then use Bio::DB::Taxonomy (fixed of course)
>  to grab the TaxID info.

My personal view is that having it as an annotation would serve no real
purpose. For me the whole point of any kind of species representation in
bioperl is to allow you to compare species in a biologically meaningful
way. If it's just some annotation then that means it's basically
free-form text and you have no guarantee that two sequences from the
same species are annotated exactly the same - no guarantee that your
code would identify that those sequences are from the same species.
The only other useful thing that a species object needs to do is let you
know how related two different species are - you need to be able to ask
what a species' class, kingdom etc. are. Again, not viable with an
annotation - you need something strict like a properly constructed Taxonomy.

I guess it comes down to the philosophy of parsing a file. Do you try
and reflect exactly what the file contains, letter for letter, so that
your resulting object can recreate that file letter for letter, or do
you parse the file and extract the correct /meaning/ in order to be more
useful?
I think there can be a choice by the user, and this is best done by
making Bio::Species a clever wrapper around an improved Bio::Taxonomy,
as in my own implementation.


