[Biojava-l] seeking comments on proposed changes

Thomas Down td2@sanger.ac.uk
Thu, 30 Nov 2000 11:31:59 +0000


On Wed, Nov 29, 2000 at 06:29:21PM -0800, Scott Markel wrote:
> We'd like to propose some changes and would like to get the group's
> feedback.

Great -- BioJava has grown quite a bit since the 1.0 release,
and the more review it gets before 1.1 the better.

>   * Location.empty.equals(Location.empty) evaluates to false.  The
>     problem is that EmptyLocation returns Integer.MIN_VALUE from the
>     getMax() method and the LocationComparator determines the distance
>     between the max of two Locations using subtraction.  In this case of
>     comparing Location.empty to itself the max values are both maximally
>     negative so subtracting does not result in 0.  We'd like to change
>     EmptyLocation's equals() method.

That sounds reasonably to me...

>   * FastaFormat doesn't use Java-like facilities such as reading lines
>     as Strings from a BufferedReader.  We tripped over this while
>     tracking down a bug regarding DOS formatted end-of-line characters
>     in a FASTA file.  we have a fix to the DOS format bug that could be
>     checked in, but we're wondering if using BufferedReader's readLine()
>     method might be a safer approach that avoids that kind of problem.

There is actually a very good reason why FastaFormat doesn't use
BufferedReader.readLine (I went to some trouble when I rewrote it
to stop using readLine).  The trouble is that some FASTA files have
long (potentially /very/ long) description lines.  The only way
to detect when you've hit the end of one sequence is to see the
start of the description line of the next sequence.  The contract
for the SequenceFormat.readSequence method is to read exactly one
sequence from the stream, and then leave it parked at the start
of the next sequence (this is important for allowing IndexedSequenceDB
to work).  Since Java doens't allow truly random access on normal
streams (only mark/restore), it's actually NOT safe to use readLine --
previous versions of BioJava did this, and ended up breaking if you
used files with description lines bigger than the buffer of the
BufferedReader :(.

That said, you've clearly found a bug with FASTA files containing
return characters -- glad someone found this sooner rather than later.
But it's safer to just accept return characters as well as newlines,
rather than using readLine() -- I'll check this in in a minute
(bad Thomas for being Unix-centric).

>   * We also noticed that when FastaFormat processes a sequence file a
>     new String object is instantiated for each character in the sequence
>     so that it can be parsed and added to the SymbolList.  We've noticed
>     a big performance hit for large sequences (100K - 10M bp).

I know this isn't ideal (although it was actually less of a problem
that I thought on the VMs I tested -- still worth fixing,
though).  I've been thinking about changes to the SymbolParser 
interface for a while, but haven't got round to doing anything.

>     We'd like to do one of the following.
> 
>     - Add a method that mimics parseToken(), but takes a primitive char.
>       This new method might live in either SymbolParser or a derived
>       interface.  Change the implementation of TokenParser's parse()
>       method to not use substring(), which causes more Strings to be
>       instantiated.
> 
>     - Change FastaFormat to use the current interface but instantiate a
>       String per symbol in the alphabet and reuse them rather than
>       creating a String per sequence character.

I'd be quite happy to see the first of these options implemented --
go ahead and do it now if you're being held back by the performance
issues.

An alternative solution which I've been thinking about is a
`symbol-stream' parsing approach.  The broad idea is that the
SymbolParser gains an extra method to create a `streaming context'
object.  Blocks of primitives chars go in, blocks of Symbols come
out.  There are two possible ways this might be done:

  - Have a method on SymbolParser which takes a java Reader
    and returns a SymbolReader (the same interface I used in
    my initial newio proposal).  SequenceFormats just provide
    a custom Reader implementation which exposes the raw sequence
    character data.

  - Have a special `streaming context' interface alongside the
    parser.  This has a (SAX-like) characters(char[], int, int)
    method.  A streaming context accepts character data, parses
    it, and passes blocks of symbols on to a SeqIOListener

I think I'm starting to prefer the second of these proposals,
and we then get rid of SymbolReader completely.

The reason I'd like to use one of these two systems in preference
to just having a parseToken(char) method is that, while these
approaches should be just as efficient for streams with a single
character -> Symbol encoding, they can also be used on multiple
character -> Symbol encoded streams.  I think the current
SymbolParser interfaces was designed with multi-char -> Symbol
encodings in mind.

On the other hand, I'm open to being told that this is overkill
and we should just concentrate on single-char -> Symbol parsing
for now.

Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett