[BioPython] new parser questions

Fri, 5 Apr 2002 09:15:26 -0500

Hey Chris;

> I have more parser questions too.  I'm a bit behind current progress,
> but last time I looked at BioPython a few months ago, the parser I
> looked at seemed to be highly line-oriented.  

True enough. But then again, most of the files that we need to parse are
also line-oriented.

> I noticed this when
> fixing up a script that downloaded blast output from (I think)
> NCBI. The website had added a single newline in a nonsubstantive part
> of the html output and the parser choked.  

Well, the HTML spit out by NCBI is definitely a special case. The output
changes randomly based on phases of the moon and the current state of
the Dow Jones, so it is difficult to keep up with. Other file formats
are not necessarily more fun, but sometimes the changes are actually
documented.

> I found it kind of
> discouraging to see this lack of robustness (while at the same time
> being thankful that someone had done the work in the first place).

It depends how you want to think about it. This approach allows us to
say: "Look, the format has changed. We need to check the parser to make
sure it is still doing the right thing." With the current framework,
changes are easy to isolate and check; the last thing we want is to have
the parser happily parsing a file it doesn't understand and giving
incorrect output.
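
To make the "fail loudly" idea concrete, here is a minimal sketch in plain
Python. It is not the actual Biopython BLAST parser; the two-line header
format, the FormatError class and the parse_header function are all made up
for illustration.

```python
class FormatError(ValueError):
    """Raised when a line does not match what the parser expects."""


def parse_header(lines):
    """Strictly parse a two-line header (hypothetical format, for illustration)."""
    expected_prefixes = ["PROGRAM:", "DATABASE:"]
    result = {}
    pairs = zip(expected_prefixes, lines)
    for number, (prefix, line) in enumerate(pairs, start=1):
        if not line.startswith(prefix):
            # Fail loudly rather than guess at a format we don't understand.
            raise FormatError(
                "line %d: expected %r, got %r" % (number, prefix, line)
            )
        key = prefix.rstrip(":").lower()
        result[key] = line[len(prefix):].strip()
    if len(result) != len(expected_prefixes):
        raise FormatError("header ended early")
    return result
```

With this kind of parser, a stray blank line between the two header lines
raises FormatError and points at the offending line, instead of quietly
producing a result with a missing or wrong field.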

> Has the framework changed?  

BLAST parsing is still done under the old Producer/Consumer framework.
Currently, we develop parsers using Martel, which uses a regular-
expressions-on-steroids grammar to describe a file format, and then
transforms the file contents into XML, which is parsed with the standard
XML parsers. If you are looking at how we do things now, you'll want to
take a look at the Martel code. The GenBank parser is one large parser
that currently uses Martel.
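
The general idea (describe the format with a regular expression, turn the
matched pieces into XML, then hand that to a standard XML parser) can be
sketched with nothing but the Python standard library. This is only an
analogy and does not use Martel's API; the one-line record format and the
names RECORD, line_to_xml and parse_line are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical one-line record format: "ID <name> LEN <number>"
RECORD = re.compile(r"ID (?P<name>\S+) LEN (?P<length>\d+)$")


def line_to_xml(line):
    """Turn a matching line into a small XML document, one tag per named group."""
    match = RECORD.match(line)
    if match is None:
        raise ValueError("line does not match the format: %r" % line)
    fields = "".join(
        "<%s>%s</%s>" % (tag, escape(text), tag)
        for tag, text in match.groupdict().items()
    )
    return "<record>%s</record>" % fields


def parse_line(line):
    """Parse the generated XML with a standard XML parser."""
    root = ET.fromstring(line_to_xml(line))
    return {child.tag: child.text for child in root}
```

The point of the design is that the format description lives in one place
(the expression), and everything downstream only has to deal with XML.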

>   1. load the file (use mmap if possible in order to more easily deal
>      with large datafiles eg. 100's of MB and upwards.)

This is not a problem. We currently parse files one record at a time, so
as long as the record size is reasonable, you can parse huge files with
no issues.
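
As a rough sketch of record-at-a-time reading (not the actual Biopython
iterator code; iter_records is a hypothetical helper), a generator can
collect lines until it sees the record terminator, which is "//" on a line
by itself in GenBank flat files:

```python
import io


def iter_records(handle, terminator="//"):
    """Yield one record at a time so only the current record is held in memory."""
    record = []
    for line in handle:
        record.append(line)
        if line.rstrip() == terminator:
            yield "".join(record)
            record = []
    if record:
        # Trailing lines without a terminator: hand them back rather than
        # dropping them silently, so the caller can decide what to do.
        yield "".join(record)


if __name__ == "__main__":
    handle = io.StringIO("LOCUS A\n//\nLOCUS B\n//\n")
    for text in iter_records(handle):
        print(repr(text))
```

Memory use depends on the size of the largest record rather than the size
of the file, which is why a file of several hundred MB is not a problem as
long as the individual records are reasonable.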

>   2. parse languages that have a grammar (e.g. html/xml) with parsers
>      that understand that grammar.  

It's awfully nice of you to characterize HTML as having an actual
grammar. The problem with NCBI BLAST is that often the "changed newline"
is just a symptom of more substantial changes in the output.

We don't currently have a parser for the XML output from BLAST. If you'd
like to contribute one, that would be super!

Hope this answers your concerns.
Brad