[Biopython-dev] Second go at GenBank parser

Brad Chapman chapmanb at arches.uga.edu
Tue Jan 9 18:05:53 EST 2001


Hi Jeff!
   Thanks for getting back with me. Sorry I have been so slow. I was
away over the break and then was so excited to code when I got back
that I jumped right into working hard-core on Biocorba (more on
that fairly soon :-).

Jeff:
> This is great!  You've filled two gaping holes in biopython functionality.  
> Please check these in, as I'm sure people will want to start using the
> code.

Okee dokee, I would be more than happy to do this. Are there any
objections from anyone before I do it? I also am not totally clear
about where everything should go (more on that below). 

me:
> > the dreaded "fake /" cases
> > (found some more hideous ones like that in a bacterial
> > dataset). GenBank, wow, what a headache!
Jeff:
> Good.  GenBank is notoriously hard to deal with, and I suspect work on the
> format will be ongoing.

I hope so -- it will be good to get it in CVS so others can look at
it. I'm not very happy with my fix (it seems pretty inefficient to
me), but it was the best my small mind could come up with. Once it is
in there all of the brilliant minds at biopython can have a go at it :-).

me:
> > o Naming of modules -- right now my naming sucks (the "supplementary"
> > feature classes, like Location.py and Reference.py are in a module
> > called 'FeatureInfo', for instance. yeck.), so if people have good
> > ideas for how to name things I'll definitely take 'em.
Jeff:
> Are these meant to be used with SeqFeatures?  If so, how about just
> SeqFeature.Location and SeqFeature.Reference?

Do you mean put them all in the SeqFeature.py module? That sounds like 
a fine solution to me (just wanted to be sure I understand you). 

> > I'm also not sure where a good place for spark.py to live in Biopython
> > is (BTW, I think we should include it :-).
> 
> Where you have it now seems as good a place as any (without the
> PGML).  Including it is fine with me.

Okay -- I'll stick it there.

> > Finally, I noticed Jeff put his snazzy code in GenBank/__init__.py --
> > Should my GenBank.py go into __init__.py?
> 
> Yes.  GenBank is a good name for it, and as per Andrew's earlier email, we
> should avoid having code in both GenBank/__init__.py and
> GenBank/GenBank.py.

Okay, so you want me to integrate it with your __init__.py stuff? That 
is no problem; I just wanted to be sure. I definitely want to avoid
Andrew's __init__.py/GenBank.py-type problem.

One thing -- I added the ability to index files as a Dictionary (a la
the other Parsers). Is it too confusing to have Dictionary and
NCBIDictionary in the same module? Just curious.
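
For anyone unfamiliar with the indexing idea, here is a minimal sketch
of what "index a file as a Dictionary" can mean. This is not the
Dictionary or NCBIDictionary code under discussion; the function names
index_genbank and fetch_record are hypothetical, and it assumes each
record starts with a LOCUS line and ends with a "//" line.

```python
import io


def index_genbank(handle):
    """Map each LOCUS name to the byte offset where its record starts.

    `handle` must be opened in binary mode so tell()/seek() are reliable.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        fields = line.split()
        # Assumption: the record name is the second field of the LOCUS line.
        if len(fields) > 1 and fields[0] == b"LOCUS":
            index[fields[1].decode("ascii")] = offset
    return index


def fetch_record(handle, index, name):
    """Return the raw text of one record, up to and including its '//' line."""
    handle.seek(index[name])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.startswith(b"//"):
            break
    return b"".join(lines).decode("ascii")


if __name__ == "__main__":
    # Made-up two-record file, just to exercise the functions.
    sample = io.BytesIO(
        b"LOCUS       AB000001  10 bp\nORIGIN\n//\n"
        b"LOCUS       AB000002  5 bp\nORIGIN\n//\n"
    )
    idx = index_genbank(sample)
    print(fetch_record(sample, idx, "AB000002"))
```

Storing only offsets keeps the index small; the record text is read
(and would be parsed) only when someone asks for it by name.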

> Are the HTML-formatted files different?  Does it work if you just strip
> the HTML tags?  I guess for HTML-formatted data from GenBank, it would be
> nice to handle, but very low priority.  HTML-formatted data from other
> sources, no.  If someone needs that functionality, they can submit the
> patches!  :)

Most of the entry is in <pre> tags so it is not too bad, but I think
there will be some tricky issues because some of the feature names
have links in them -- this will be hard, especially considering how 
important whitespace is in the feature table. So I think I'll forgo
this and maybe if it turns out to be easy someone else will patch it
for me :-).
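
In case it helps whoever picks this up, here is a rough sketch of the
"just strip the HTML tags" approach Jeff suggested. It removes the
tags but leaves every other character, including spaces, where it was.
The function name strip_html and the sample line are made up for
illustration, and the sketch assumes a link only wraps the feature
name and does not otherwise change column positions, which would need
checking against real GenBank HTML output.

```python
import re
from html import unescape

# Matches one HTML tag, e.g. <pre>, </a>, <a href="...">.
_TAG = re.compile(r"<[^>]*>")


def strip_html(text):
    """Drop HTML tags, then decode entities such as &lt; and &gt;.

    Tags are removed before entities are decoded so that a partial
    location like &lt;1..&gt;206 is not mistaken for a tag.
    """
    return unescape(_TAG.sub("", text))


if __name__ == "__main__":
    # Hypothetical feature-table line with a link around the feature name.
    line = '     <a href="/x">CDS</a>             &lt;1..&gt;206'
    print(repr(strip_html(line)))
```

The order of operations matters here: decoding entities first would
turn the location's "<" and ">" into something the tag pattern might
eat.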

Thanks again for getting back with me. I will try to write up some
docs tonight, so I hope it'll be ready to go in whenever I'm sure
where to put things :-)

Brad



