[Biojava-l] Reading consensus sequence from phred/phrap ace files

Fri, 15 Jun 2001 07:22:18 -0500

This sounds interesting!  I know nothing about JLex and CUP (yet), so I
can't really comment on them.  I took a VERY brief look at some web pages
on them, though, and they look useful.  I would be curious to see the
source code regardless of whether I choose to change it or not.  It would
be a great learning example for me.

Given JLex and CUP, does anyone have an opinion on making a SAX parser for
the .ace file, and how that would compare to having a JLex/CUP parser
instead?  Considering that the JLex code is done, it would likely be more
pragmatic to just use that.  I imagine it would take me a bit of time to
write a SAX parser, but I know what can be done with SAX event streams.  I
have no idea what can be done with what is already written, as far as
feeding input to other programs or pulling data from the parsed file.
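
To make the SAX side of that comparison concrete, here is a minimal sketch
of a reader that turns the CO blocks of an .ace file into SAX events.  It
uses the standard org.xml.sax interfaces, but everything else is an
assumption for illustration: the class names AceContigEventSource and
ContigCollector, the "contig" element, and the "name" attribute are all
made up.  It also assumes each contig starts with a "CO <name> ..." header
line and that its bases run until the next blank line, and it ignores
every other section of the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: reports the CO blocks of an .ace file as SAX events. */
final class AceContigEventSource {
    static void parse(BufferedReader in, ContentHandler handler)
            throws IOException, SAXException {
        handler.startDocument();
        boolean inContig = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("CO ")) {
                // A new header also closes a contig that was still open.
                if (inContig) {
                    handler.endElement("", "contig", "contig");
                }
                String[] fields = line.trim().split("\\s+");
                String name = fields.length > 1 ? fields[1] : "";
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "name", "name", "CDATA", name);
                handler.startElement("", "contig", "contig", atts);
                inContig = true;
            } else if (inContig && line.trim().isEmpty()) {
                // Assumption: a blank line ends the contig's bases.
                handler.endElement("", "contig", "contig");
                inContig = false;
            } else if (inContig) {
                char[] bases = line.trim().toCharArray();
                handler.characters(bases, 0, bases.length);
            }
            // Lines outside a CO block (reads, qualities, ...) are skipped.
        }
        if (inContig) {
            handler.endElement("", "contig", "contig");
        }
        handler.endDocument();
    }
}

/** Hypothetical handler: collects contig name -> bases from the events. */
final class ContigCollector extends DefaultHandler {
    final Map<String, String> contigs = new LinkedHashMap<>();
    private String name;
    private StringBuilder bases;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("contig".equals(qName)) {
            name = atts.getValue("name");
            bases = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bases != null) {
            bases.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("contig".equals(qName) && bases != null) {
            contigs.put(name, bases.toString());
            bases = null;
        }
    }
}
```

The appeal of this style is that a handler interested only in contigs
never has to hold the rest of the file in memory; the cost is that every
other section of the .ace file would need its own events before the
reader could stand in for a full parser.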

I am more curious about the consensus contig, though.  I went back through
the consed documentation to see what more I could find.  It talks about
exporting a consensus sequence.  From what I can glean, the CO section is
a consensus, a mishmash of all the contigs put together, sort of an
average.  Is that not correct?

-Mat

-----Original Message-----
From: 	David Waring [mailto:dwaring@u.washington.edu] 
Sent:	Thursday, June 14, 2001 6:11 PM
To:	Wiepert, Mathieu; biojava-l@biojava.org
Subject:	RE: [Biojava-l] Reading consensus sequence from phred/phrap
ace files

We have a full .ace parser. It was not written to the biojava API, so
sequences are strings. Our parser (which I have not worked with) uses JLex
and CUP, and parses the entire .ace file into one really big Object holding
all the data, in a structure just like the .ace file itself. For this
reason it is not particularly fast. Anyone familiar with JLex and CUP
should be able to modify it to ignore the parts they are not interested in.

While you may not want everything in the file (and there is a lot), perhaps
a more complete data structure is in order. In fact, if I am not mistaken,
there really is no such thing as a consensus sequence in an .ace file. The
file consists of a list of contig sequences, the individual reads, and a
bunch more data. In a finished assembly project the "consensus sequence" is
just the longest contig. The other contigs may be junk. In an assembly
project that is not complete there are many "good" contigs and some
potential junk.

I would think that a structure that contained a collection of all the
contigs would be in order. Methods could then allow getting the largest,
removing sequences by size limits, etc.
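
As a rough sketch of what such a structure might look like (not Will's
code, and not the biojava API): the class and method names below (Contig,
ContigSet, largest, removeShorterThan) are hypothetical, and sequences are
kept as plain strings, as in our current parser. Length here is simply the
string length, so any pad characters in the sequence are counted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical value class: one contig as a name plus its bases. */
final class Contig {
    private final String name;
    private final String sequence;

    Contig(String name, String sequence) {
        this.name = name;
        this.sequence = sequence;
    }

    String name() { return name; }
    String sequence() { return sequence; }

    /** Raw string length, pad characters included. */
    int length() { return sequence.length(); }
}

/** Hypothetical container for all contigs parsed from one .ace file. */
final class ContigSet {
    private final List<Contig> contigs = new ArrayList<>();

    void add(Contig contig) { contigs.add(contig); }

    /** Read-only view of the contigs, in insertion order. */
    List<Contig> all() { return Collections.unmodifiableList(contigs); }

    /** The longest contig, or empty if the set has no contigs. */
    Optional<Contig> largest() {
        return contigs.stream().max(Comparator.comparingInt(Contig::length));
    }

    /** Drops contigs shorter than minLength; returns how many were removed. */
    int removeShorterThan(int minLength) {
        int before = contigs.size();
        contigs.removeIf(c -> c.length() < minLength);
        return before - contigs.size();
    }
}
```

With something like this, a caller working on a finished assembly could
ask for largest() and treat it as the "consensus", while a caller working
on an unfinished one could filter by size and look at everything that
remains.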

Will Gillett is the author. He says he has been thinking about modifying it
to fit the biojava API. If you could define a spec for the output data
structure, he would be willing to modify his code to parse the .ace file
into it. Otherwise we would gladly send you the source code.

David
University of Washington Genome Center