[Bioperl-l] EMBL/genbank organism parsing

James Abbott j.abbott at imperial.ac.uk
Thu Mar 9 14:16:03 UTC 2006


Hi Folks,

The current parsing of OS lines by Bio::SeqIO::embl fails for many of 
the organisms currently found in the database, because the OS lines 
differ considerably from the specification in the EMBL User Manual, 
which appears to have been the basis for the current parser. In an 
attempt to improve matters, I have collected a set of examples which 
hopefully cover the majority of the different ways of writing an 
organism name, and have managed to get embl.pm to parse these 
'correctly' ('correctly' being open to debate for some of the more 
esoteric examples). I'm sure there are plenty of entries which still 
don't parse correctly, but it's a start. I'll post the patches to 
bugzilla once I have tidied up a few loose ends.

In the interests of consistency, I have also obtained the same set of 
sequences from GenBank, and am trying to make both parsers behave the 
same way. However, they currently differ in how they parse the common 
name. According to the EMBL spec, the common name is the English name 
for the organism, given in brackets after the Latin name. Consequently, 
calling the common_name method on a Bio::Species object parsed by 
embl.pm returns 'human' for the OS line 'Homo sapiens (human)'. The 
genbank parser, however, currently takes the entire SOURCE line, 
including the Latin name, so calling the common_name method on a 
species object parsed by genbank.pm returns 'Homo sapiens (human)'. 
This would appear to be the intended behavior, since the tests treat 
it as the correct response.
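
To make the difference concrete, here is a minimal sketch of the two 
behaviors. It is written in Python purely for illustration and is not 
the BioPerl code: the function names and the regular expression are my 
own assumptions, and the pattern is deliberately naive (it gives up on 
nested brackets, which some of the more esoteric entries contain).

```python
import re

# Hypothetical pattern: a Latin name, optionally followed by one
# bracketed English name. This is an assumption for illustration,
# not the expression used by embl.pm.
_OS_PATTERN = re.compile(
    r"^(?P<latin>[^(]+?)\s*(?:\((?P<common>[^()]*)\))?\s*$"
)


def embl_style_common_name(os_value):
    """Return only the bracketed English name, or None if there is none."""
    match = _OS_PATTERN.match(os_value.strip())
    if match is None:
        # Nested or unbalanced brackets are not handled by this sketch.
        return None
    return match.group("common")


def genbank_style_common_name(source_value):
    """Return the whole SOURCE value, Latin name included."""
    return source_value.strip()


if __name__ == "__main__":
    line = "Homo sapiens (human)"
    print(embl_style_common_name(line))
    print(genbank_style_common_name(line))
```

With the input 'Homo sapiens (human)', the first function yields only 
the bracketed name and the second yields the full string, which 
mirrors the mismatch described above.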

Is it considered better to make the EMBL and GenBank parsers 
consistent with each other, and risk breaking any code which relies 
upon the current behavior of genbank->species->common_name(), or to 
leave the two parsers behaving differently from each other but 
consistently with their existing behavior?

Cheers,
James

-- 
Dr. James Abbott <j.abbott at imperial.ac.uk>
Bioinformatics Software Developer, Bioinformatics Support Service
Imperial College, London




