[Biojava-dev] Fasta parsing bug

Matthew Pocock matthew_pocock at yahoo.co.uk
Thu Apr 24 13:22:45 EDT 2003


The problem is that readSequenceData is designed to
read all the sequence characters up to the end of file
or the next '>' character indicating the beginning of
the next sequence. We are marking so that when the
method returns, the stream is positioned at the '>' as
if it had never been read.

Poking around in the BufferedReader source hasn't
thrown up anything really obvious. If you repeatedly
call mark(), the new mark obliterates the old one. So
by marking a file a hundred thousand times we will not
be creating a hundred thousand markers. If we mark
every 1k, then the most the BufferedReader will cache
is 1k. Mmm.
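
A minimal sketch of that re-marking behaviour, in case it
helps anyone check it locally. The class and method names
are made up for illustration and are not part of biojava:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class RemarkSketch {
    // Returns the first character read after reset() when mark() was called twice.
    static char firstCharAfterReset() {
        try (BufferedReader r = new BufferedReader(new StringReader("ABCDEF"))) {
            r.mark(4);   // first mark, positioned at 'A'
            r.read();    // consume 'A'
            r.read();    // consume 'B'
            r.mark(4);   // second mark, positioned at 'C'; the first is forgotten
            r.read();    // consume 'C'
            r.reset();   // goes back to the most recent mark only
            return (char) r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Here firstCharAfterReset() returns 'C', not 'A': only the
latest mark is remembered.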

Mark invalid happens when you attempt to reset()
after reading in more characters than you reserved in
the mark() call. But my reading of the fasta parsing
code is that we call read() between the mark() and the
reset() exactly once, and with a read limit that is
guaranteed to be legal. The only thing I can think of
is that we've hit a boundary condition in
BufferedReader: if we attempt to read /exactly/ the
same number of characters as we reserved in mark(),
the extra character needed for the position we would
reset() back to pushes us over the limit.
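
For reference, a small sketch that provokes the invalid
mark on purpose. It reads well past the reserved limit
through a deliberately tiny 2-character buffer, because
what happens exactly at the limit depends on when the
reader refills its buffer. The names are hypothetical and
this is not the FastaFormat code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class MarkLimitSketch {
    // Marks with the given read-ahead limit, reads charsToRead characters
    // one at a time, then reports whether reset() still works.
    static boolean resetSucceeds(int readAheadLimit, int charsToRead) {
        // A 2-character internal buffer forces frequent refills.
        try (BufferedReader r =
                 new BufferedReader(new StringReader("ABCDEFGHIJ"), 2)) {
            r.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                r.read();
            }
            r.reset();       // throws IOException if the mark was invalidated
            return true;
        } catch (IOException e) {
            return false;    // the mark is no longer valid
        }
    }
}
```

With the standard JDK BufferedReader, resetSucceeds(8, 3)
is true, while resetSucceeds(1, 3) is false because three
characters were read against a limit of one.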

Oh - was the logic at line 160 correct? My copy is
now different: I added some brackets and replaced an
&& with an ||.

     while (parseStart < bytesRead &&
       (cache[parseStart] == '\n' ||
        cache[parseStart] == '\r') )
       {

I've also modified line 139 to read:

     r.mark(cache.length + 1);

Hopefully this would remove my possible boundary
condition. This is now in CVS. Could you try your
sequences again?
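
For anyone following along, the mark/read/reset pattern
looks roughly like this. It is a simplified sketch, not
the actual FastaFormat code: the class, the method names
and the cacheSize parameter are made up, and it treats any
'>' as the start of the next record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetSketch {
    // Hypothetical helper: collects sequence characters up to the next '>'
    // (or end of stream) and leaves the reader positioned at that '>'.
    static String readSequenceChars(BufferedReader r, int cacheSize) {
        char[] cache = new char[cacheSize];
        StringBuilder seq = new StringBuilder();
        try {
            while (true) {
                r.mark(cache.length + 1);            // one spare char of headroom
                int n = r.read(cache, 0, cache.length);
                if (n < 0) {
                    return seq.toString();           // end of stream
                }
                for (int i = 0; i < n; i++) {
                    char c = cache[i];
                    if (c == '>') {
                        r.reset();                   // back to the start of this chunk
                        for (int k = 0; k < i; k++) {
                            r.read();                // re-consume chars before '>'
                        }
                        return seq.toString();
                    }
                    if (c != '\n' && c != '\r') {
                        seq.append(c);               // drop line terminators
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one record body, then shows what the reader would return next.
    static String demo() {
        BufferedReader r =
            new BufferedReader(new StringReader("ACGT\nAC\n>s2\nGG\n"));
        String seq = readSequenceChars(r, 4);
        try {
            return seq + "|" + (char) r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

demo() returns "ACGTAC|>": the sequence characters were
consumed and the stream is back at the '>' as if it had
never been read.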

Matthew



 --- "Schreiber, Mark" <mark.schreiber at agresearch.co.nz> wrote:
> Hi -
>  
> I have been slowly tracking down a bug in the
> reading of large (10K+ sequences) fasta files. The
> bug is caused by a mark being set in a
> BufferedReader by the FastaFormat object that
> later cannot be reset, causing an IOException.
>  
> A typical stack trace is:
>  
> java.io.IOException: Can't reset: Mark invalid parseStart=417 bytesRead=512
>         at org.biojava.bio.seq.io.FastaFormat.readSequenceData(FastaFormat.java:170)
>         at org.biojava.bio.seq.io.FastaFormat.readSequence(FastaFormat.java:120)
>         at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:100)
> rethrown as org.biojava.bio.BioException: Could not read sequence
>         at org.biojava.bio.seq.io.StreamReader.nextSequence(StreamReader.java:103)
>         at utils.SeqLength.main(SeqLength.java:42)
>  
> although the specifics of parseStart and bytesRead
> are dependent on the size of the BufferedReader.
>  
> Looking into the Java docs I found some hints about
> the size of the buffer. If you decrease the size of
> the buffer from the default 8192 then errors occur
> in smaller files, or earlier in the file. I then
> started doubling the size of the buffer; once I got
> to 65536 I could read the largest FASTA lib I had on
> my machine. This is a bit of a kludge and it may
> point to an error in the bowels of the JVM itself
> rather than in biojava.
>  
> This was on Windows XP with biojava-live and Java
> build 1.4.1_02-b06, but I think others have been
> periodically bugged by this as well; I'm not sure of
> the OS etc. being used.
>  
> Is there a way to avoid using the mark/reset
> paradigm in FastaFormat?
>  
> - Mark
