[Biopython-dev] GenBank parser -- first go

Wed Dec 6 03:39:45 EST 2000

Jeff:
>> I don't believe there's any general
>> data structure in existence that can handle the genbank location
>> field.  It's described by a BNF grammar and requires a tree!

Viewed as a parsing problem, this cannot be done with regular
expressions, because the location grammar nests.  When something
like that occurs, it should be fine to leave it as an opaque block
of text, which is parsed elsewhere.

John Aycock wrote a really nice context-free parser in pure
Python called SPARK.  http://www.csr.uvic.ca/~aycock/python/
It is easy to use.  (Which means it is *much* easier to use than
lex/yacc.)
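
To make the "requires a tree" point concrete, here is a rough
sketch of a hand-written recursive-descent parser for a small
subset of the location syntax.  It does not use SPARK and is not
Biopython code.  It assumes only join(), order(), complement(),
plain ranges like 12..78 and single positions, and the names
tokenize, parse_location and the tuple layout of the tree are all
made up for illustration.  Real location strings have more forms
than this handles.

```python
import re

# One token: "..", a bracket or comma, an integer, or a word.
TOKEN_RE = re.compile(r"\s*(\.\.|[(),]|\d+|[A-Za-z_]+)")

# Operators that wrap a comma-separated list of sub-locations.
OPERATORS = ("join", "order", "complement")


def tokenize(text):
    """Split a location string into a flat list of token strings."""
    text = text.strip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def _expect(tokens, i, wanted):
    """Check that tokens[i] is the wanted token; return the next index."""
    if i >= len(tokens) or tokens[i] != wanted:
        raise ValueError("expected %r at token %d" % (wanted, i))
    return i + 1


def _parse_expr(tokens, i):
    """Parse one location starting at tokens[i]; return (tree, next index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[i]
    if tok in OPERATORS:
        # operator "(" expr { "," expr } ")"  -- this is the recursive part
        i = _expect(tokens, i + 1, "(")
        arg, i = _parse_expr(tokens, i)
        args = [arg]
        while i < len(tokens) and tokens[i] == ",":
            arg, i = _parse_expr(tokens, i + 1)
            args.append(arg)
        i = _expect(tokens, i, ")")
        return (tok, args), i
    if tok.isdigit():
        start = int(tok)
        if i + 1 < len(tokens) and tokens[i + 1] == "..":
            if i + 2 >= len(tokens) or not tokens[i + 2].isdigit():
                raise ValueError("expected end position after '..'")
            return ("range", start, int(tokens[i + 2])), i + 3
        return ("point", start), i + 1
    raise ValueError("unexpected token %r" % tok)


def parse_location(text):
    """Parse a (simplified) location string into a nested tuple tree."""
    tokens = tokenize(text)
    tree, i = _parse_expr(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing tokens: %r" % (tokens[i:],))
    return tree


if __name__ == "__main__":
    print(parse_location("join(12..78,complement(134..202))"))
```

Because operators can contain other operators to any depth, the
result is naturally a tree, which is the part a single regular
expression cannot capture.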

Brad:
>I use the ambiguous DNA and RNA
>alphabets so this should cover any letters in the sequence
>(hopefully). I'm not sure if this is ideal, but at least it associates 
>the type with the sequence. Suggestions about how to be more strict
>are welcome on this.

You could be more strict by being less strict.  There's a
ProteinAlphabet, DNAAlphabet and RNAAlphabet as part of the
Bio.Alphabet module.

You can't really do anything with them.  All they say is that the
sequence uses a single-letter alphabet of protein, DNA or RNA
residues.  They don't attempt to define what those letters mean.
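
As a rough illustration of "more strict by being less strict",
here is a standalone sketch.  It is not the Bio.Alphabet code
itself.  The three class names come from the paragraph above, but
the letters attribute, AmbiguousDNA, TypedSeq and is_dna are
assumptions made up for the example.

```python
class Alphabet:
    # None means "no particular set of letters is defined".
    letters = None


class ProteinAlphabet(Alphabet):
    """Says only: single-letter protein residues."""


class DNAAlphabet(Alphabet):
    """Says only: single-letter DNA residues."""


class RNAAlphabet(Alphabet):
    """Says only: single-letter RNA residues."""


class AmbiguousDNA(DNAAlphabet):
    # A more specific alphabet that does commit to a letter set
    # (the IUPAC nucleotide ambiguity codes).
    letters = "GATCRYWSMKHBVDN"


class TypedSeq:
    """Hypothetical stand-in for a sequence object with an alphabet."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet


def is_dna(seq):
    """True for any DNA alphabet, generic or specific."""
    return isinstance(seq.alphabet, DNAAlphabet)


if __name__ == "__main__":
    seq = TypedSeq("ACGTN", DNAAlphabet())
    print(is_dna(seq), seq.alphabet.letters)
```

The generic class records the molecule type without promising
anything about which letters appear, so a parser that cannot
verify the letters can still tag the sequence honestly.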

Jeff:
>> - There's a TaggingConsumer in Bio.ParserSupport.

Oops!  You can see I haven't read that bit of code.  I included
something pretty much like that in my earlier reply to Brad.

                    Andrew
                    dalke at acm.org