[Biopython-dev] [Bug 3069] More robust feature parser for GenBank/EMBL records

Fri Apr 30 07:46:49 UTC 2010

http://bugzilla.open-bio.org/show_bug.cgi?id=3069

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2010-04-30 03:46 EST -------
(In reply to comment #6)
> Generally I agree with you.  However, based on my knowledge of the people at
> IMGT, this is highly unlikely.  From their perspective, they invested a very
> large amount of time into their ontology/database structure, and I don't think
> they'll really be prepared to shorten their feature keys to be in compliance
> with EMBL.

You're in a much better position to assess this - but could you ask them about
this anyway? They may at least clarify how they bend the EMBL specification.

Do they have a preferred file format (e.g. XML)?

> I will try to cook up a parser for IMGT that integrates into biopython (but I
> can't guarantee success, as I'm not extremely familiar with the internals). 
> I'll keep you posted.

The way I would try this is to write a new scanner subclassing the EMBL
scanner in Bio/GenBank/Scanner.py (which probably only needs to override the
feature parsing), then add new functions in Bio/SeqIO/InsdcIO.py to call it
(matching the GenBank and EMBL functions), and finally define a new format
name (maybe "embl-imgt") in the dictionary in Bio/SeqIO/__init__.py.
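
To make the shape of that concrete, here is a rough sketch using toy stand-in
classes rather than the real ones from Bio/GenBank/Scanner.py. Every name in
it (EmblLikeScanner, ImgtLikeScanner, FEATURE_KEY_WIDTH, FORMAT_TO_SCANNER,
read_features) is made up for illustration, and the wider key column used for
the IMGT variant is an assumption, not something taken from their files.

```python
# Hypothetical stand-ins: these only mimic the structure of the idea and are
# NOT the real scanner classes or the real SeqIO format dictionary.


class EmblLikeScanner:
    """Toy scanner for the first line of each EMBL-style feature (FT lines)."""

    # Standard EMBL puts the feature key in columns 6-20 and the location
    # from column 22, so key plus separator occupy 16 characters after "FT   ".
    FEATURE_KEY_WIDTH = 16

    def split_feature_line(self, line):
        """Split an FT line into (key, location) using fixed columns."""
        body = line[5:]  # drop the "FT   " prefix
        key = body[:self.FEATURE_KEY_WIDTH].strip()
        location = body[self.FEATURE_KEY_WIDTH:].strip()
        return key, location

    def parse_features(self, lines):
        """Return (key, location) for each line that starts a new feature."""
        features = []
        for line in lines:
            # A feature starts where column 6 is non-blank; qualifier and
            # continuation lines have spaces there and are skipped here.
            if line.startswith("FT   ") and line[5:6].strip():
                features.append(self.split_feature_line(line))
        return features


class ImgtLikeScanner(EmblLikeScanner):
    """Variant that changes only the feature column layout."""

    FEATURE_KEY_WIDTH = 20  # assumed wider key column, for illustration only


# Hypothetical registry mapping a format name to its scanner class.
FORMAT_TO_SCANNER = {
    "embl": EmblLikeScanner,
    "embl-imgt": ImgtLikeScanner,
}


def read_features(lines, format_name):
    """Look up the scanner for format_name and parse the feature lines."""
    return FORMAT_TO_SCANNER[format_name]().parse_features(lines)
```

The point is only that the subclass overrides the feature handling and nothing
else, and the new format name is a single extra dictionary entry.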

However, if the only out-of-specification things in the IMGT EMBL files are
the feature indentation and long feature keys, maybe your original request to
make the EMBL parser more tolerant is the best route.
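
As a rough illustration of what "more tolerant" could mean, the sketch below
splits the first line of a feature on whitespace instead of at a fixed column.
The function name split_feature_line_tolerant is hypothetical, and it assumes
feature keys never contain spaces and that the caller has already decided the
line starts a feature (it cannot tell qualifier or continuation lines apart).

```python
def split_feature_line_tolerant(line):
    """Split an FT feature-start line into (key, location) on whitespace.

    Hypothetical helper: it ignores column positions, so it accepts both the
    standard EMBL layout and wider layouts with longer feature keys.
    """
    if not line.startswith("FT"):
        raise ValueError("Not a feature table line: %r" % line)
    # Drop the "FT" line code, then split once on the first run of whitespace
    # after the key (leading whitespace is skipped by split(None, 1)).
    parts = line[2:].split(None, 1)
    if len(parts) != 2:
        raise ValueError("Expected a feature key and a location: %r" % line)
    key, location = parts
    return key, location.strip()
```

For example, this accepts both "FT   source          1..100" and a line whose
key runs past column 20, at the cost of no longer checking the column layout.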

Thinking ahead, would you also want to be able to write out IMGT variant EMBL
files?
