[Bioperl-l] EMBL/genbank organism parsing

Tue Mar 14 11:28:34 UTC 2006

Hi Hilmar/Jason,

Thanks for the comments. Please excuse the breach of netiquette by 
replying to you both in one message, but given the overlaps it's the 
easiest way....

Jason Stajich wrote:
> I *think* the fields in the Taxonomy::Node object should be sufficient 
> to separate out the field you are talking about. 
I've had a look at Taxonomy::Node, and it looks like it will indeed hold 
the necessary fields. There are some distinctions below species level 
such as serovars and pathovars which I thought might need special 
handling, but NCBI taxonomy seems happy to treat these as separate 
species. Well....they provide distinct nodes with the rank of 'species' 
for each one, which probably means they consider them separate species...

Hilmar Lapp wrote:
> I don't think tweaking individual parsers until they behave as desired 
> on a then-current set of examples is going to put an end to this
I agree with this completely. I haven't looked so closely at Genbank, 
but the EMBL User Manual dictates a 'standard' which does not appear to 
be enforced, to the extent that certain OS lines are little more than 
free text. This situation looks even worse in Uniprot, where there can 
be multiple bracketed names following the latin name, which may 
represent synonyms, strains or common-names, but with little contextual 
information to allow you to determine what the data is. I think, 
certainly for Uniprot, and probably for EMBL/Genbank, there is little 
chance of reliably parsing organism names.
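
To illustrate the ambiguity, here is a minimal sketch (in Python purely 
for brevity; this is not BioPerl code, and the function name 
split_os_line is made up for the example). It splits an OS value into 
the leading name and any trailing bracketed groups, and deliberately 
leaves those groups unlabelled, since nothing in the line itself says 
whether a group is a synonym, a strain or a common name:

```python
import re

def split_os_line(os_value):
    """Split an OS value into (leading name, list of bracketed groups).

    Hypothetical helper for illustration only. The bracketed groups are
    returned as plain strings because the line gives no reliable clue as
    to what each one represents.
    """
    value = os_value.strip().rstrip(".")
    # Collect the contents of every non-nested (...) group.
    groups = re.findall(r"\(([^()]*)\)", value)
    # Whatever remains once the groups are removed is the leading name.
    name = re.sub(r"\s*\([^()]*\)", "", value).strip()
    return name, groups

name, groups = split_os_line("Escherichia coli (strain K12)")
print(name)    # leading name
print(groups)  # bracketed groups, meaning unknown
```

A parser can get this far mechanically; deciding what each group means 
is the part that needs outside knowledge such as a taxonomy lookup.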

Hilmar Lapp wrote:
> Or, quite radical in approach, we require the NCBI taxonomy database 
> (or any other implementation of Bio::DB::Taxonomy, e.g. could be 
> through BioSQL or what not) and otherwise disclaim responsibility for 
> correctly parsing the species. 
This seems perhaps the most pragmatic option, although I'd be worried 
about not providing any means of getting at species information in 
situations where access to a taxonomy database is not available for 
whatever reason (laziness included!), and the probable loss of speed 
associated with carrying out these queries.

I guess there are numerous approaches to get round this. Two which 
immediately spring to mind:
1) A hybrid system which retains the parsing of OS lines as well as 
possible (accessed via Bio::Seq->species), but with the addition of a 
set of Bio::Seq->taxonomy methods to query the taxonomy if more reliable 
data is required. Pros - doesn't break the existing API. Cons - I can see 
considerable user confusion arising from providing essentially the same 
data through different routes.

2) Carry out minimal parsing of OS lines, populating only the 
genus/species/binomial fields (i.e. the bits we can probably parse 
reliably), and throw a warning if accessors to unpopulated fields are 
called. Add a new method to Bio::Seq to repopulate the Taxonomy::Node 
object on demand via a taxonomy query if more detailed info is required. 
Pros - adds only one extra method. Cons - breaks the existing API if calls 
are made to common_name etc. prior to fully populating the Taxonomy::Node 
object.
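
Option 2 could be sketched roughly as follows (again in Python for 
brevity, not BioPerl; the class name Species, the method 
populate_from_taxonomy and the lookup callable are all hypothetical 
stand-ins, and a real version would sit behind Bio::Seq and a 
Bio::DB::Taxonomy implementation):

```python
import warnings

class Species:
    """Minimal OS parsing, with richer fields filled in on demand."""

    def __init__(self, os_value):
        # Only take what can probably be parsed reliably: the first two words.
        words = os_value.strip().rstrip(".").split()
        self.genus = words[0] if words else None
        self.species = words[1] if len(words) > 1 else None
        self._common_name = None
        self._populated = False

    @property
    def binomial(self):
        return " ".join(p for p in (self.genus, self.species) if p)

    @property
    def common_name(self):
        # Warn rather than guess when the field has not been populated yet.
        if not self._populated:
            warnings.warn("common_name is not populated; "
                          "call populate_from_taxonomy() first")
        return self._common_name

    def populate_from_taxonomy(self, lookup):
        # 'lookup' is an assumed callable: binomial -> dict of fields, or None.
        record = lookup(self.binomial)
        if record is not None:
            self._common_name = record.get("common_name")
            self._populated = True

sp = Species("Homo sapiens (Human).")
print(sp.binomial)
# Stand-in for a real taxonomy database query.
sp.populate_from_taxonomy(lambda name: {"common_name": "human"})
print(sp.common_name)
```

The API breakage mentioned above shows up here as the warning: code 
that reads common_name straight after parsing gets nothing back until 
the taxonomy query has been run.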

I'm sure there are plenty of other ways...including just biting the 
bullet and enforcing the use of a taxonomy database, but that seems a 
little draconian when many entries will be easily parseable.
> ideally someone (you?) can take charge and spearhead overhauling this. 
Doh...walked into that one... :-)  I'll give it a go and see where we 
end up...I'm not hugely familiar with bioperl's internals, but I'm sure 
there are plenty of folk to holler if I do something stupid. I'll do 
some more digging, and as Jason suggested, create a page on the wiki 
with my thoughts and see what people think of it.

Cheers,
James

-- 
Dr. James Abbott <j.abbott at imperial.ac.uk>
Bioinformatics Software Developer, Bioinformatics Support Service
Imperial College, London