[Biojava-l] FASTA reader problem: "Mark invalid"

Thomas Down td2@sanger.ac.uk
Fri, 18 Aug 2000 12:46:44 +0100


On Fri, Aug 18, 2000 at 01:03:44PM +0200, Christian Gruber wrote:
> Hi!
> 
> I wrote a Java test program that just reads in a sequence in FASTA
> format and prints the sequences out. I did this by making the
> appropriate changes to the file demos/seq/TestEmbl.java. As a test
> sequence file in FASTA format, I created one with random sequences.
> 
> Now the problem: There are some sequence files that are definitely
> correct FASTA format files, but create the following error message:

I've looked at this, and it's definitely an issue with long
description lines.  The trouble is, there is no `end of entry'
marker in a FASTA file, so the reader has to grab a line, then
`push it back' onto the stream if it turns out to be the 
description line for the next sequence in the file.

In the current Java I/O framework, the standard way to do
this is using the mark() and reset() methods.  Unfortunately,
mark() takes a numerical argument, and reset() MAY fail if
more than that number of bytes have been read since the mark().
So effectively we have a fixed-length buffer issue.
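
To make the failure mode concrete, here is a minimal sketch of the
mark()/reset() pattern using java.io.BufferedReader.  This is NOT the
actual BioJava code: the class MarkResetDemo, the method
canPushBackLine() and the small buffer size are made up for
illustration.  The small buffer is only there so that the reader has
to refill it while reading the line, which is the point at which a
mark with too small a limit can be dropped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class, not part of BioJava.
public class MarkResetDemo {

    /**
     * Marks the stream, reads one line, then tries to "push it back"
     * with reset(). Returns true if the line could be re-read, false
     * if reset() reported that the mark was no longer valid.
     */
    static boolean canPushBackLine(String input, int readAheadLimit, int bufferSize)
            throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(input), bufferSize);
        in.mark(readAheadLimit);          // promise to read at most this much before reset()
        String line = in.readLine();
        try {
            in.reset();                   // may throw if we read past the limit
        } catch (IOException e) {
            return false;
        }
        // After a successful reset() the same line must come back again.
        return line != null && line.equals(in.readLine());
    }

    public static void main(String[] args) throws IOException {
        String fasta =
            ">seq2 a description line that is a good deal longer than a few characters\n"
            + "ACGT\n";
        // Limit comfortably larger than the description line.
        System.out.println(canPushBackLine(fasta, 1024, 16));
        // Limit much shorter than the description line.
        System.out.println(canPushBackLine(fasta, 8, 16));
    }
}
```

With a limit at least as long as the line, reset() is guaranteed to
work.  With a shorter limit the documentation only says it may fail,
so whether the second call fails can depend on the buffer size and
the library implementation.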

I agree (!) that BioJava should be able to handle long description
lines (which are, as you pointed out, quite common in some areas).
As a TEMPORARY workaround, I've upped the readahead limit passed
to mark() from 120 to 1024 (this is done on both HEAD and
release-1_0-branch).  Given the state of the Java I/O infrastructure,
I think this is probably the best that can be done while maintaining
the current interface for BioJava SequenceFormat classes.

On the other hand, it should be possible to redesign the
BioJava sequence I/O to avoid this issue.  At the same time,
I can see potential for performance optimisations (at the moment,
most SequenceFormats tend to read one line at a time, convert
that to a SymbolList, then join them all together later -- can
this be improved?).  It would be kind-of nice if we could get
any updates to this framework out of the way well before we
branch for 1.1.  Does anyone else have any thoughts on this?
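
As a starting point for discussion, here is a rough sketch of one
possible approach.  It is only an illustration under my own
assumptions, not a proposal for the real interface: the class
PendingLineFastaReader and its next() method are hypothetical, and it
returns plain strings rather than BioJava Sequence objects.  The idea
is that the reader itself remembers the description line it has
already read, so nothing ever needs to be pushed back onto the stream
and the length of the line stops mattering.  It also collects the
residues of an entry in a single buffer instead of building one object
per line.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical sketch, not part of BioJava.
public class PendingLineFastaReader {
    private final BufferedReader in;
    // Description line of the NEXT entry, if we have already read it.
    private String pendingHeader;

    public PendingLineFastaReader(BufferedReader in) {
        this.in = in;
    }

    /**
     * Reads the next entry. Returns a two-element array
     * {description, residues}, or null when the input is exhausted.
     */
    public String[] next() throws IOException {
        String header = pendingHeader;
        pendingHeader = null;

        // No header carried over: scan forward to the first '>' line.
        // (Lines before the first header are silently skipped here.)
        while (header == null) {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            if (line.startsWith(">")) {
                header = line;
            }
        }

        // Accumulate all residue lines of this entry in one buffer.
        StringBuilder residues = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {
                // Start of the next entry: keep it for the next call
                // instead of pushing it back onto the stream.
                pendingHeader = line;
                break;
            }
            residues.append(line.trim());
        }
        return new String[] { header.substring(1), residues.toString() };
    }
}
```

The cost of this design is that the reader becomes stateful, so the
same reader object has to be used for every entry in a file.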

Thanks,
   Thomas.
-- 
One of the advantages of being disorderly is that one is
constantly making exciting discoveries.
                                       -- A. A. Milne