[Bioperl-l] EMBL/genbank organism parsing

Fri Mar 10 19:49:05 UTC 2006

James -
Wonderful, thanks for stepping in.

This may be a good time to note that species data can be better
represented in the taxonomy objects, so we could ditch Bio::Species and
move to Bio::Taxonomy::Node (a sexy name, I know).  There is a little
about this on the wiki in the project priority list,
http://bioperl.org/wiki/Project_priority_list - I *think* the fields in
the Taxonomy::Node object should be sufficient to separate out the field
you are talking about.

As to whether or not to break common_name behavior, I don't have any  
opinion right now, but perhaps those who use this data from a file  
can speak better to it.
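
For anyone skimming the thread, the two common_name behaviors James
describes below can be sketched roughly like this.  This is a Python
illustration only, not the BioPerl parser code: the function names are
hypothetical, and the input is assumed to be the organism text with the
leading OS / SOURCE tag already stripped.

```python
import re

def split_organism(text):
    """Split 'Homo sapiens (human)' into (scientific, common).

    Hypothetical helper: 'common' is None when there is no trailing
    bracketed name.
    """
    m = re.match(r"^\s*(.*?)\s*\(([^()]*)\)\s*\.?\s*$", text)
    if m:
        return m.group(1), m.group(2)
    return text.strip(), None

def common_name_embl_style(os_text):
    # Behavior described for embl.pm: only the bracketed English name.
    return split_organism(os_text)[1]

def common_name_genbank_style(source_text):
    # Behavior described for genbank.pm: the whole SOURCE text.
    return source_text.strip()

print(common_name_embl_style("Homo sapiens (human)"))     # human
print(common_name_genbank_style("Homo sapiens (human)"))  # Homo sapiens (human)
```

The sketch handles only the simple "name (common name)" form; the more
esoteric entries James mentions would need additional cases.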

I encourage you to add some text on the wiki pages about whatever you  
plan so that we can document what has happened - feel free to just  
create a new page for this project and it can be linked in  
appropriately.

-jason

On Mar 9, 2006, at 9:16 AM, James Abbott wrote:

> Hi Folks,
>
> The current parsing of OS lines by Bio::SeqIO::embl.pm fails with many
> of the organisms currently found in the database, since the OS lines
> differ considerably from the specification in the EMBL User Manual,
> which appears to have been used as the basis for the current parser. In
> an attempt to improve matters, I have collected a set of examples which
> hopefully cover the majority of the different ways of writing an
> organism name, and managed to get embl.pm to 'correctly' parse these
> (correctly being open to debate with some of the more esoteric
> examples). I'm sure there are plenty of entries which still don't parse
> correctly, but it's a start. I'll post the patches to bugzilla once I
> get a few loose ends tidied up.
>
> In the interests of consistency, I have also obtained the same set of
> sequences from Genbank, and am trying to make both parsers behave the
> same way; however, they currently behave in different ways with respect
> to parsing the common name. According to the EMBL spec, the common name
> is the English name for the organism given in brackets after the Latin
> name, so calling the common_name method on an embl.pm parsed
> Bio::Species object returns 'human' for Homo sapiens (human). The
> genbank parser, however, currently takes the entire SOURCE line,
> including the Latin name, so calling the common_name method on a
> genbank.pm parsed species object returns 'Homo sapiens (human)'. This
> would appear to be the intended behavior, since this is considered the
> correct response by the tests.
>
> Is it considered better to maintain consistency between the EMBL and
> Genbank parsers and risk breaking any code which relies upon the current
> behavior of genbank->species->common_name(), or to have the two parsers
> behaving differently, but consistently with their existing behavior?
>
> Cheers,
> James
>
> -- 
> Dr. James Abbott <j.abbott at imperial.ac.uk>
> Bioinformatics Software Developer, Bioinformatics Support Service
> Imperial College, London

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12