[Bioperl-l] EMBL/genbank organism parsing
James Abbott
j.abbott at imperial.ac.uk
Thu Mar 9 14:16:03 UTC 2006
Hi Folks,
The current parsing of OS lines by Bio::SeqIO::embl.pm fails with many
of the organisms currently found in the database, since the OS lines
differ considerably from the specification in the EMBL User Manual,
which appears to have been used as the basis for the current parser. In
an attempt to improve matters, I have collected a set of examples which
hopefully cover the majority of the different ways of writing an
organism name, and managed to get embl.pm to 'correctly' parse these
(correctly being open to debate with some of the more esoteric
examples). I'm sure there are plenty of entries which still don't parse
correctly, but it's a start. I'll post the patches to bugzilla once I
get a few loose ends tidied up.
In the interests of consistency, I have also obtained the same set of
sequences from GenBank, and am trying to make both parsers behave the
same way. However, they currently differ in how they parse the common
name. According to the EMBL spec, the common name is the English name
for the organism, given in brackets after the Latin name. Consequently,
calling the common_name method on a Bio::Species object parsed by
embl.pm returns 'human' for an OS line of 'Homo sapiens (human)'. The
genbank parser, however, currently takes the entire SOURCE line,
including the Latin name, so calling the common_name method on a species
object parsed by genbank.pm returns 'Homo sapiens (human)'. This would
appear to be the intended behavior, since the tests treat it as the
correct response.
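To make the difference concrete, here is a rough sketch of the two
behaviors. It is written in Python purely for illustration and is not
the code in embl.pm or genbank.pm; the function names and the regular
expression are my own assumptions, and it only handles the simple
'Latin name (common name)' form, not the more esoteric organism names.

```python
import re

# Hypothetical pattern: a Latin name followed by one bracketed common name.
# Names containing nested brackets deliberately fall through to "no match".
_OS_RE = re.compile(r"^\s*(?P<latin>.+?)\s*\((?P<common>[^()]*)\)\s*\.?\s*$")


def embl_style(os_value):
    """Split an OS value into (latin_name, common_name) per the EMBL spec.

    Returns common_name as None when no bracketed name is present.
    """
    m = _OS_RE.match(os_value)
    if m:
        return m.group("latin"), m.group("common")
    return os_value.strip().rstrip("."), None


def genbank_style(source_value):
    """Mimic the current GenBank behavior: the whole SOURCE value is
    reported as the common name, Latin name included."""
    return source_value.strip()


if __name__ == "__main__":
    print(embl_style("Homo sapiens (human)"))     # ('Homo sapiens', 'human')
    print(genbank_style("Homo sapiens (human)"))  # Homo sapiens (human)
```

The same input line therefore yields 'human' under the EMBL-style split
and 'Homo sapiens (human)' under the GenBank-style handling, which is
exactly the inconsistency in question.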
Is it considered better to make the EMBL and GenBank parsers consistent
with each other, at the risk of breaking any code which relies upon the
current behavior of genbank->species->common_name(), or to leave the two
parsers behaving differently, but consistently with their existing
behavior?
Cheers,
James
--
Dr. James Abbott <j.abbott at imperial.ac.uk>
Bioinformatics Software Developer, Bioinformatics Support Service
Imperial College, London