[Biojava-l] Reading consensus sequence from phred/phrap ace files

David Waring dwaring@u.washington.edu
Thu, 14 Jun 2001 16:10:42 -0700


We have a full .ace parser. It was not written to the biojava API, so
sequences are strings. Our parser (which I have not worked with) uses JLex
and CUP, and parses the entire .ace file into one very large object that
holds all the data in a structure mirroring the .ace file itself. For this
reason it is not particularly fast. Anyone familiar with JLex and CUP should
be able to modify it to ignore the parts they are not interested in.

While you may not want everything in the file (and there is a lot), perhaps
a more complete data structure is in order. In fact, if I am not mistaken,
there really is no such thing as a consensus sequence in an .ace file. The
file consists of a list of contig sequences, the individual reads, and a
bunch more data. In a finished assembly project the "consensus sequence" is
just the longest contig. The other contigs may be junk. In an assembly
project that is not complete there are many "good" contigs and some
potential junk.
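
To make the file layout concrete, here is a rough sketch (not our JLex/CUP
parser) that pulls only the contig sequences out of an .ace file and skips
everything else. It assumes the usual phrap layout, in which each contig
starts with a "CO <name> <bases> <reads> <segments> <U|C>" header line
followed by lines of bases up to the next blank line. The class and method
names are made up for illustration, and pad characters ("*") are kept as
they appear in the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects contig sequences from an .ace stream. */
final class AceContigReader {

    /**
     * Returns contig name -> padded sequence, in file order.
     * Assumes each contig is a "CO ..." header followed by base lines
     * that end at the first blank line; all other records are skipped.
     */
    static Map<String, String> readContigs(Reader in) throws IOException {
        Map<String, String> contigs = new LinkedHashMap<String, String>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.startsWith("CO ")) {
                continue; // reads, qualities, and other records are ignored
            }
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // malformed header without a contig name
            }
            StringBuilder bases = new StringBuilder();
            // Sequence lines run until a blank line or end of file.
            while ((line = reader.readLine()) != null
                    && line.trim().length() > 0) {
                bases.append(line.trim());
            }
            contigs.put(fields[1], bases.toString());
        }
        return contigs;
    }
}
```

Calling AceContigReader.readContigs(new FileReader(aceFile)) would then give
one entry per contig, and the caller decides which of them to treat as the
"consensus".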

I would think that a structure containing a collection of all the contigs
would be in order. Methods could then allow getting the largest contig,
removing sequences by size limits, etc.
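
A minimal sketch of what such a structure might look like is below. It is
only an illustration, not a proposed biojava interface: the names Contig,
ContigSet, getLargest and filterByMinLength are all invented here, and
sequences are plain strings, as in our current parser. Length is simply the
number of characters in the string, so any pad characters are counted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One contig from an assembly; the sequence is held as a plain string. */
final class Contig {
    private final String name;
    private final String bases;

    Contig(String name, String bases) {
        this.name = name;
        this.bases = bases;
    }

    String getName() { return name; }
    String getBases() { return bases; }
    int length() { return bases.length(); }
}

/** Hypothetical container for all contigs read from one .ace file. */
final class ContigSet {
    private final List<Contig> contigs = new ArrayList<Contig>();

    void add(Contig contig) { contigs.add(contig); }

    /** Read-only view of the contigs, in insertion order. */
    List<Contig> getContigs() {
        return Collections.unmodifiableList(contigs);
    }

    /** Returns the longest contig, or null if the set is empty. */
    Contig getLargest() {
        Contig best = null;
        for (Contig c : contigs) {
            if (best == null || c.length() > best.length()) {
                best = c;
            }
        }
        return best;
    }

    /** Returns a new set with only the contigs of at least minLength. */
    ContigSet filterByMinLength(int minLength) {
        ContigSet kept = new ContigSet();
        for (Contig c : contigs) {
            if (c.length() >= minLength) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

For a finished project, getLargest() would stand in for the "consensus
sequence"; for an unfinished one, filterByMinLength() is one way to drop the
likely junk before looking at the remaining contigs.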

Will Gillett is the author. He says he has been thinking about modifying it
to fit the biojava API. If you could define a spec for the output data
structure, he would be willing to modify his code to parse the .ace file
into it. Otherwise we would gladly send you the source code.

David
University of Washington Genome Center


> Has anyone written something that fills out a sequence from the consensus
> sequence found in the .ace files of phred/phrap?
>
> If not, I will be writing one, I was thinking of doing something like
> being able to do
>
> BufferedReader reader = new BufferedReader( new FileReader(phredFile));
> SequenceIterator si = SeqIOTools.readPhred(reader);
> Sequence sequence = si.getConsensusSequence();
>
> Don't really need a sequence iterator I suppose, there is only one
> consensus in the file, though there are all the sample sequences in the
> file.  And I don't want to add a method to the sequence iterator either.
> So perhaps some sort of sequencebuilder child or factory method?  Anyway,
> please advise...
>
> -Mat