[Bioperl-l] Bio::Species/Bio::Taxonomy changes

Sun Jul 23 20:53:32 UTC 2006

Sendu, Hilmar, et al,

I was looking through SeqIO::genbank and thought I would bring up a
couple of things to think about re: GenBank Taxonomy information.

This is how NCBI defines the names used for SOURCE and ORGANISM  
according to the latest GenBank release notes:

SOURCE	- Common name of the organism or the name most frequently used
in the literature. Mandatory keyword in all annotated entries/one or
more records/includes one subkeyword.

    ORGANISM	- Formal scientific name of the organism (first line)
and taxonomic classification levels (second and subsequent lines).
Mandatory subkeyword in all annotated entries/two or more records.

According to their sample file page
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), the SOURCE is
this:

Free-format information including an abbreviated form of the organism  
name, sometimes followed by a molecule type. (See section 3.4.10 of  
the GenBank release notes for more info.)

The SOURCE can also include the organelle and also may include  
additional information, such as an abbreviated name and a common name  
in parentheses.

...
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.

...

Setting scientific_name() isn't a problem; acc. to the above
definition, it is the full name on the ORGANISM line.  The lineage
(or classification() array) is also straightforward.  The
common_name(), though, as used by Bio::SeqIO::genbank, is the entire
SOURCE line (not just the abbreviated name, but the name and
everything else).  No additional parsing is performed on it.
write_seq() also seems to do the wrong thing when rebuilding the
SOURCE line, as the method writes the subspecies to the line.
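
To make the ORGANISM handling above concrete, here is a rough
standalone sketch in Python (not BioPerl code).  The function name
parse_organism is made up for illustration, and it assumes the
scientific name fits on the first ORGANISM line, which may not hold
for every record.

```python
def parse_organism(lines):
    """Split ORGANISM block lines into (scientific_name, lineage).

    Hypothetical helper: assumes the scientific name is entirely on the
    first line and the classification fills the remaining lines.
    """
    first = lines[0].strip()
    if not first.startswith("ORGANISM"):
        raise ValueError("expected an ORGANISM line")
    # Everything after the keyword on the first line is the formal name.
    scientific_name = first[len("ORGANISM"):].strip()
    # The remaining lines are a semicolon-separated list ending in a period.
    joined = " ".join(line.strip() for line in lines[1:]).rstrip(".")
    lineage = [taxon.strip() for taxon in joined.split(";") if taxon.strip()]
    return scientific_name, lineage


record = [
    "  ORGANISM  Saccharomyces cerevisiae",
    "            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;",
    "            Saccharomycetales; Saccharomycetaceae; Saccharomyces.",
]
name, lineage = parse_organism(record)
```

With the yeast record quoted earlier, name holds the full ORGANISM
name and lineage holds the classification terms in file order.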

I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try  
using Bio::Taxonomy::Node objects instead of Bio::Species, then get  
the parsing for these lines corrected and simplified.  Essentially,  
the way NCBI describes it, the main name on the line is actually the  
free-form abbreviated name, the name in parentheses is the common  
name (optionally present), and the organelle precedes all of these if  
present.  I want to try getting common_name() to match the common  
name found for taxonomy (baker's yeast) rather than have it be a  
simple container, add an abbreviated_name() method for the name  
container for the SOURCE line, and have the organelle() method  
actually be used if an organelle is present (it doesn't seem to be  
set at the moment in SeqIO::genbank).
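
As a rough illustration of the SOURCE split proposed above (organelle
first, then the abbreviated name, then an optional common name in
parentheses), here is a standalone Python sketch.  It is not BioPerl
code: parse_source and the ORGANELLES list are assumptions made up for
the example, and anything trailing the parentheses (a molecule type,
say) is simply left in the abbreviated name.

```python
import re

# Assumed, incomplete list of organelle prefixes (illustration only).
ORGANELLES = ("mitochondrion", "chloroplast", "plastid")

# Abbreviated name, optionally followed by "(common name)" at end of line.
SOURCE_RE = re.compile(r"^(?P<name>.*?)(?:\s*\((?P<common>[^()]*)\))?\s*$")


def parse_source(line):
    """Split a SOURCE line into organelle, abbreviated name, common name."""
    text = line.strip()
    if text.startswith("SOURCE"):
        text = text[len("SOURCE"):].strip()
    organelle = None
    for prefix in ORGANELLES:
        # The organelle, if present, precedes everything else on the line.
        if text.lower().startswith(prefix + " "):
            organelle = text[:len(prefix)]
            text = text[len(prefix):].strip()
            break
    match = SOURCE_RE.match(text)
    return {
        "organelle": organelle,
        "abbreviated_name": match.group("name").strip(),
        "common_name": match.group("common"),  # None when no parentheses
    }


yeast = parse_source("SOURCE      Saccharomyces cerevisiae (baker's yeast)")
mito = parse_source("SOURCE      mitochondrion Homo sapiens (human)")
plain = parse_source("SOURCE      Saccharomyces cerevisiae")
```

The three fields correspond to the proposed organelle(),
abbreviated_name() and common_name() accessors; a writer could rebuild
the SOURCE line from the same three pieces without touching the
subspecies.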

Right now, I have NO idea how EMBL, DDBJ, and other formats deal with
organism info; I would think that the main three
(GenBank/EMBL-SwissProt/DDBJ) handle them similarly... (Famous Last
Words)

I also propose (I'll probably get yelled at here) NOT actively  
supporting additional parsing of species, subspecies, etc directly  
from a file w/o a DB lookup.  As in, leave species, subspecies, genus  
parsing from the flatfile as is (no longer support it) or remove it  
completely and leave them unset.

I haven't looked, but I have a strong feeling that the species  
parsing in Bio::SeqIO is different from format to format.  It really  
seems like more trouble than it's worth to maintain this, especially  
as there is perfectly valid Taxonomy database information available  
either locally using a flatfile or via Entrez.  If people want to
have reliable $species->species or $species->genus for taxonomy
information, they will need to have the db_handle() set for the
Bio::Taxonomy::Node object and have a Node-based method to reset
species, genus, etc to the tax database information (maybe
reset_taxon or something along those lines).

Okay, rambled on enough.  Any thoughts?

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign