[Bioperl-l] EMBL/genbank organism parsing
James Abbott
j.abbott at imperial.ac.uk
Tue Mar 14 11:28:34 UTC 2006
Hi Hilmar/Jason,
Thanks for the comments. Please excuse the breach of netiquette by
replying to you both in one message, but given the overlaps it's the
easiest way....
Jason Stajich wrote:
> I *think* the fields in the Taxonomy::Node object should be sufficient
> to separate out the field you are talking about.
I've had a look at Taxonomy::Node, and it looks like it will indeed hold
the necessary fields. There are some distinctions below species level,
such as serovars and pathovars, which I thought might need special
handling, but NCBI taxonomy seems happy to treat these as separate
species. Well... they provide distinct nodes with the rank of 'species'
for each one, which probably means they consider them separate species.
Hilmar Lapp wrote:
> I don't think tweaking individual parsers until they behave as desired
> on a then-current set of examples is going to put an end to this
I agree with this completely. I haven't looked so closely at Genbank,
but the EMBL User Manual dictates a 'standard' which does not appear to
be enforced, to the extent that certain OS lines are little more than
free text. The situation looks even worse in Uniprot, where there can
be multiple bracketed names following the Latin name, which may
represent synonyms, strains or common names, but with little contextual
information to allow you to determine what the data is. I think,
certainly for Uniprot, and probably for EMBL/Genbank, there is little
chance of reliably parsing organism names.
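
To make the ambiguity concrete, here is a minimal sketch in Python (not
BioPerl code; the function name split_os_line and the sample strings are
purely illustrative assumptions). It splits an OS value into the leading
name and its bracketed groups, and shows that nothing in the line itself
says whether a group is a synonym, a strain or a common name.

```python
import re


def split_os_line(os_value):
    """Split an OS value into (leading name, list of bracketed groups).

    Hypothetical helper for illustration only. It handles flat,
    non-nested brackets; nested brackets are left for the caller.
    """
    # Collect the text inside each (...) group, in order of appearance.
    groups = re.findall(r"\(([^()]*)\)", os_value)
    # Remove the groups, then trim whitespace and the trailing full stop.
    leading = re.sub(r"\s*\([^()]*\)", "", os_value).strip().rstrip(".")
    return leading, groups


# Illustrative inputs: the groups come back as plain strings, with no
# indication of which kind of name each one is.
for line in ("Homo sapiens (Human).", "Escherichia coli (strain K12)."):
    name, extras = split_os_line(line)
    print(name, extras)
```

The split itself is easy; classifying each group is the part that cannot
be done reliably without outside information.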
Hilmar Lapp wrote:
> Or, quite radical in approach, we require the NCBI taxonomy database
> (or any other implementation of Bio::DB::Taxonomy, e.g. could be
> through BioSQL or what not) and otherwise disclaim responsibility for
> correctly parsing the species.
This seems perhaps the most pragmatic option, although I'd be worried
about not providing any means of getting at species information in
situations where access to a taxonomy database is not available for
whatever reason (laziness included!), and the probable loss of speed
associated with carrying out these queries.
I guess there are numerous approaches to get round this. Two which
immediately spring to mind:
1) A hybrid system which retains the parsing of OS lines as best as
possible (accessed via Bio::Seq->species), but with the addition of a
set of Bio::Seq->taxonomy methods to query taxonomy if more reliable
data is required. Pros - doesn't break the existing API. Cons - I can
see considerable user confusion arising from providing essentially the
same data through two different routes.
2) Carry out minimal parsing of OS lines, populating only the
genus/species/binomial fields (i.e. the bits we can probably parse
reliably), and throw a warning if accessors to unpopulated fields are
called. Add a new method to Bio::Seq to repopulate the Taxonomy::Node
object on demand via a taxonomy query if more detailed info is required.
Pros - adds only one extra method. Cons - breaks the existing API if
calls are made to common_name etc. prior to fully populating the
Taxonomy::Node object.
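
As a rough illustration of option 2, here is a sketch in Python rather
than Perl. Everything in it is an assumption made for illustration:
TaxonNode, parse_minimal and populate_from_taxonomy are hypothetical
names, not BioPerl API, and the taxonomy lookup is stood in for by a
plain callable.

```python
import warnings


class TaxonNode:
    """Hypothetical stand-in for a Taxonomy::Node-like object."""

    def __init__(self, genus=None, species=None):
        self.genus = genus
        self.species = species
        self._common_name = None
        self._populated = False  # True once a taxonomy lookup has succeeded

    @property
    def binomial(self):
        # Only defined when both parts were parsed from the OS line.
        if self.genus is None or self.species is None:
            return None
        return f"{self.genus} {self.species}"

    @property
    def common_name(self):
        # Warn rather than guess when the field was never populated.
        if not self._populated:
            warnings.warn("common_name not populated; "
                          "call populate_from_taxonomy() first")
        return self._common_name

    def populate_from_taxonomy(self, lookup):
        """Fill in the remaining fields on demand.

        `lookup` is any callable mapping a binomial to a dict-like
        record or None (an assumed interface, for illustration).
        """
        record = lookup(self.binomial)
        if record is not None:
            self._common_name = record.get("common_name")
            self._populated = True
        return self._populated


def parse_minimal(os_value):
    """Take only the first two words of an OS value as genus/species."""
    words = os_value.split()
    if len(words) < 2:
        return TaxonNode()
    return TaxonNode(genus=words[0], species=words[1].rstrip("."))


# Usage with an in-memory dict standing in for a taxonomy database.
fake_db = {"Homo sapiens": {"common_name": "human"}}
node = parse_minimal("Homo sapiens (Human).")
node.populate_from_taxonomy(fake_db.get)
print(node.binomial, node.common_name)
```

The cost is visible in the common_name accessor: callers that used to get
a parsed value now get a warning and an empty result until they ask for
the lookup explicitly.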
I'm sure there are plenty of other ways...including just biting the
bullet and enforcing the use of a taxonomy database, but that seems a
little draconian when many entries will be easily parseable.
> ideally someone (you?) can take charge and spearhead overhauling this.
Doh...walked into that one... :-) I'll give it a go and see where we
end up...I'm not hugely familiar with bioperl's internals, but I'm sure
there are plenty of folk to holler if I do something stupid. I'll do
some more digging, and as Jason suggested, create a page on the wiki
with my thoughts and see what people think of it.
Cheers,
James
--
Dr. James Abbott <j.abbott at imperial.ac.uk>
Bioinformatics Software Developer, Bioinformatics Support Service
Imperial College, London