[Biopython-dev] EMBL flatfile parsing
Peter (biopython-dev at maubp.freeserve.co.uk)
Tue Feb  6 15:16:42 UTC 2007

Albert Krewinkel wrote:
>> I am trying to parse an EMBL-formatted file with Biopython, but I
>> couldn't find any working parser for this. When I try to use the
>> Martel-based parser as described in one of the mailing list threads, I
>> get the following error...
Peter wrote:
> OK, we have the following files in BioPython:
> 
> Bio/formatdefs/embl.py (wrapper)
> Bio/expressions/embl/__init__.py (dummy file)
> Bio/expressions/embl/embl65.py (contains Martel definition)
>
> ...
>
> It does look like an out of date [Martel] file format definition in
> BioPython (assuming that example code from Jeff Chang is fine).
I haven't touched the Martel file format definition, but I have been
looking at EMBL parsing for Bio.SeqIO.
Based on my experience with the poor performance of the old Martel
GenBank parser on large files, I would expect the same issue to apply to
the Martel EMBL parser (even if it were updated).
So, I have been looking at rewriting my Python-based GenBank parser (in
Bio.GenBank) instead.
Notes and an attachment showing the idea are here:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c14
I am thinking of sticking with the current scanner/consumer model in 
Bio/GenBank/__init__.py but simply replacing the (GenBank only) _Scanner 
class with a "GenBank scanner" and an "EMBL scanner" (based on a common 
base class which will handle the feature table).
These new scanners would both feed into the existing consumers, in
particular the "Feature Consumer", which builds a SeqRecord with
SeqFeature objects.  I have this more or less working.
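
To make the scanner/consumer idea concrete, here is a rough, self-contained sketch. It is not the actual Bio.GenBank code: the class names (RecordConsumer, BaseScanner, GenBankScanner, EmblScanner), the parse() helper and the two tiny sample records are all made up for illustration, and qualifiers and multi-line locations are simply ignored. The point is only that the feature table layout is shared, so the base class can handle it while each subclass deals with its own header lines.

```python
class RecordConsumer:
    """Hypothetical stand-in for an existing consumer; it just collects events."""

    def __init__(self):
        self.name = None
        self.features = []
        self.sequence = ""

    def locus(self, name):
        self.name = name

    def feature(self, key, location):
        self.features.append((key, location))

    def sequence_chunk(self, residues):
        self.sequence += residues


class BaseScanner:
    """Shared logic; subclasses only describe what differs between formats."""

    FEATURE_PREFIX = None  # five-character prefix of feature table lines
    SEQUENCE_START = None  # start of the line that introduces the sequence

    def feed(self, lines, consumer):
        in_sequence = False
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith("//"):  # end of record in both formats
                break
            if in_sequence:
                # Drop the position counters, keep the residue blocks.
                consumer.sequence_chunk(
                    "".join(t for t in line.split() if t.isalpha()))
            elif line.startswith(self.SEQUENCE_START):
                in_sequence = True
            elif self.is_feature_start(line):
                key, location = line[5:].split(None, 1)
                consumer.feature(key, location.strip())
            else:
                self.parse_header_line(line, consumer)

    def is_feature_start(self, line):
        # A new feature has its key directly after the prefix. Qualifier and
        # continuation lines are indented further; here they fall through to
        # the header handler, which ignores them.
        return (line.startswith(self.FEATURE_PREFIX)
                and len(line) > 5 and not line[5].isspace())

    def parse_header_line(self, line, consumer):
        raise NotImplementedError


class GenBankScanner(BaseScanner):
    FEATURE_PREFIX = " " * 5
    SEQUENCE_START = "ORIGIN"

    def parse_header_line(self, line, consumer):
        if line.startswith("LOCUS"):
            consumer.locus(line.split()[1])


class EmblScanner(BaseScanner):
    FEATURE_PREFIX = "FT   "
    SEQUENCE_START = "SQ   "

    def parse_header_line(self, line, consumer):
        if line.startswith("ID   "):
            consumer.locus(line[5:].split()[0].rstrip(";"))


def parse(lines, scanner):
    """Run one scanner over the lines and return the filled-in consumer."""
    consumer = RecordConsumer()
    scanner.feed(lines, consumer)
    return consumer


# Made-up sample records describing the same 12 bp sequence.
GENBANK_LINES = [
    "LOCUS       DEMO1        12 bp    DNA     linear   UNA 06-FEB-2007",
    "FEATURES             Location/Qualifiers",
    "     CDS             1..12",
    '                     /gene="demo"',
    "ORIGIN",
    "        1 atggcctaag gg",
    "//",
]
EMBL_LINES = [
    "ID   DEMO1; SV 1; linear; genomic DNA; STD; UNC; 12 BP.",
    "FH   Key             Location/Qualifiers",
    "FT   CDS             1..12",
    'FT                   /gene="demo"',
    "SQ   Sequence 12 BP; 2 A; 3 C; 4 G; 3 T; 0 other;",
    "     atggcctaag gg                                                12",
    "//",
]
```

Calling parse(GENBANK_LINES, GenBankScanner()) and parse(EMBL_LINES, EmblScanner()) should leave the two consumers with the same name, features and sequence, which is the property that would let both scanners share the existing consumers.
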
Does this sound like a sensible way to include EMBL support?
While it would be possible to use the new EMBL parser in much the same
way as the current GenBank parser, I would recommend that most users
simply invoke either parser via Bio.SeqIO for normal work.
I could put most of the new code in Bio/GenBank and create a new 
module/directory called Bio/EMBL, or just stick everything in 
Bio/GenBank - I'm not that fussed either way given I want to push 
Bio.SeqIO as the main interface.
(Once that is settled I can rearrange the new code to slot in as 
appropriate.)
Michiel - how does this plan sound?  And should I try to get these
changes done and tested in time for the next release, or wait until
afterwards?
Peter
    
    