[Biopython-dev] GenBank parser ?

Mon Oct 30 16:26:10 EST 2000

Thomas:
> > Do we - or someone else - have a genbank parser ? I remember something came
> > up in the news groups, but I cannot find it anymore ...

Jeff:
> No.

There are a couple of ways around this that I have found which still
allow you to use python to get at GenBank:

1. Use jpython and the biojava libraries for parsing GenBank. I
attached a file which shows a basic example of doing this.

2. Use the python BioCorba interface (biopython-corba).
I use a bioperl-based server and a biopython client, and this works
quite well, at least for what I'm doing (parsing out CDS
info). Sometime soon I hope to make a new release of biopython-corba
with documentation on how to do stuff like this. I just need to revise
the docs, and do some more testing to make sure everything in CVS is
kosher. If you are interested in trying this way, I would definitely be
willing to help (hey, it would be quite exciting to have someone
using biopython-corba besides me :-).

Jeff:  
> The current plan is to use this as a test case for Martel.  Any
> takers?  :)

I think one of our biggest sticking points is that we don't really
have anything in terms of features, which would be really really
useful to parse the GenBank files into. It seems like it is pretty
tricky to have classes which can deal with all of the possible
complexities of GenBank (also EMBL) formats, so it would be nice to
think of and implement some feature classes which do this first. There 
was an interesting discussion about some of this on the biocorba list
(in the October archives under the threads 'Biocorba IDL --
Clarifications' and 'SeqFeatures and the EMBL IDL').
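
To make the feature class idea a bit more concrete, here is a rough
sketch in plain python of the kind of thing I mean. All of the names
here (Location, Feature, parse_location) are made up for illustration
and are not part of biopython or biocorba. It only handles simple
'a..b', 'join(...)' and 'complement(...)' locations, which is a long
way from everything GenBank and EMBL allow.

```python
# Hypothetical sketch only: these classes do not exist in biopython.

class Location:
    """A single span on a sequence, 1-based and inclusive as in GenBank."""
    def __init__(self, start, end, strand=1):
        self.start = start
        self.end = end
        self.strand = strand  # 1 for forward, -1 for complement

    def __len__(self):
        return self.end - self.start + 1


class Feature:
    """One feature table entry (e.g. a CDS) with its qualifiers."""
    def __init__(self, key, locations, qualifiers=None):
        self.key = key
        self.locations = list(locations)
        self.qualifiers = qualifiers or {}

    def length(self):
        # total number of bases covered by all spans of the feature
        return sum(len(loc) for loc in self.locations)


def parse_location(text):
    """Parse 'a..b', 'join(a..b,c..d)' or 'complement(...)' of either.

    Not handled: fuzzy ends ('<1..>200'), single positions ('467'),
    and a complement() nested inside a join().
    """
    strand = 1
    text = text.strip()
    if text.startswith('complement(') and text.endswith(')'):
        strand = -1
        text = text[len('complement('):-1]
    if text.startswith('join(') and text.endswith(')'):
        text = text[len('join('):-1]
    locations = []
    for part in text.split(','):
        start, end = part.split('..')
        locations.append(Location(int(start), int(end), strand))
    return locations


# a two-exon CDS as it might appear in a feature table
cds = Feature('CDS', parse_location('join(12..78,134..202)'),
              {'gene': 'example'})
```

A parser would then only need to hand each feature table entry to
something like Feature, and the tricky location cases could be added to
the classes one at a time without touching the parser itself.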

Anyways, I don't have much time at the moment to work on this 100%,
but would be willing to do part o' the coding/hashing things out if
other people are willing to work on it as well. I think once we have 
a feature class, the GenBank parser won't be too incredibly horrible
to do from Martel (fingers crossed :-).

Brad

-------------- next part --------------
#!/usr/bin/env jpython
"""Read info from GenBank files.

This uses jpython and biojava (http://www.biojava.org) to read from a
GenBank file.

This is basically a jpython translation of demos/seq/TestGenbank.java"""
# standard python libs
import os

# java stuff
from java.io import *

# biojava
from org.biojava.bio.seq.io import *
from org.biojava.bio import *
from org.biojava.bio.symbol import *
from org.biojava.bio.seq import *

# set up the files
# path to the GenBank file to read (here, test.gb in the current directory)
gb_path = 'test.gb'

gb_file = File(gb_path)
reader = BufferedReader(InputStreamReader(FileInputStream(gb_file)))

# set up biojava stuff to parse the files
alphabet = DNATools.getDNA()
seq_factory = SimpleSequenceFactory()
parser = alphabet.getParser("token")
gb_format = GenbankFormat()

iterator = StreamReader(reader, gb_format, parser, seq_factory)

while iterator.hasNext():
    seq = iterator.nextSequence()

    print 'name:', seq.getName()
    print 'num features:', seq.countFeatures()