[Biojava-l] Re: [Bioperl-l] looking for datafile parsers

Aaron J Mackey ajm6q@virginia.edu
Thu, 11 Jan 2001 08:52:32 -0500 (EST)


On Thu, 11 Jan 2001, Andrew Dalke wrote:

> Finally, the project I've been working on, Martel,
> lets you develop parsers which handle most, if not all, of
> these cases.

Excellent, I look forward to seeing your work.  Parsing is the meat and
potatoes of bioinformatics, and it's beginning to taste very stale (I
dunno, maybe it's been stale for a while now).  My own secret wish list is
focused more on result file parsing; I once spent a fair amount of time
building a "robust" FASTA result file parser, but found myself constantly
needing to tweak it to keep up with changes in the FASTA program's output.
You don't have that problem with SwissProt or other static file formats.

> grep - http://www.gnu.org/gnulist/production/grep.html
>   written in C
>   count (when used as "grep ^ID | wc")
>      takes 0m:57s to parse sprot38
>   offset (when used as "grep -b ^ID")
>   cannot be used for fasta, generic, all, validate, markup

I've actually found that I now use grep and a small mix of perl more than
any other parsing routine (mainly because of the predicament I mention
above: when a format changes, I have to fix the entire parser, even if I
just want to pull out a few relevant fields at the moment).  My result
file "parsers" often take a few 'grep swipes' at the file (since the
second grep on the same file is commonly much faster than the first), and
as you show, it's very fast to begin with.  The one extension to grep that I'd
dearly like to see (perhaps I'll submit a patch) would be to extend the -A
and -B flags (after-context and before-context) to take regexps
themselves (i.e. instead of printing N lines after the first match,
continue printing until the second regexp is matched, or other
possibilities depending on specified flags).  Then you could start using
(multiple) greps to get 'fasta', 'generic', 'all' satisfied.
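
To make the regexp-delimited context idea concrete, here is a minimal sketch
in Python rather than an actual grep patch.  The function name grep_until,
its include_stop option, and the sample records are all made up for
illustration; nothing here is an existing grep flag.  It prints from each
line matching the first regexp up to the next line matching the second.

```python
import re
from typing import Iterable, Iterator


def grep_until(lines: Iterable[str], start: str, stop: str,
               include_stop: bool = True) -> Iterator[str]:
    """Yield each line matching `start`, then every following line up to
    the next line matching `stop` (hypothetical helper, not a grep flag)."""
    start_re = re.compile(start)
    stop_re = re.compile(stop)
    printing = False
    for line in lines:
        if not printing:
            # Outside a block: only a match of the start pattern opens one.
            if start_re.search(line):
                printing = True
                yield line
            continue
        # Inside a block: the stop pattern closes it.
        if stop_re.search(line):
            if include_stop:
                yield line
            printing = False
        else:
            yield line


# Made-up records in a SwissProt-like layout, for illustration only.
sample = [
    "ID   DEMO1_TEST",
    "AC   X00001;",
    "//",
    "ID   DEMO2_TEST",
    "AC   X00002;",
    "//",
]

# Pull out one whole record: from its ID line to the terminating "//".
second_record = list(grep_until(sample, r"^ID   DEMO2", r"^//"))
```

With a pattern such as "^ID" as the start and "^//" as the stop, one pass
yields whole records rather than a fixed number of context lines, which is
the behaviour the 'generic' and 'all' cases would need.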

Use the shell, Luke.

-Aaron

-- 
 o ~   ~   ~   ~   ~   ~  o
/ Aaron J Mackey           \
\  Dr. Pearson Laboratory  /
 \ University of Virginia  \
 /  (804) 924-2821          \
 \  amackey@virginia.edu    /
  o ~   ~   ~   ~   ~   ~  o