[Biojava-dev] SeqIO maintenance

Keith James kdj@sanger.ac.uk
12 Nov 2002 18:02:44 +0000


I found one cause of the performance problem: code had been added to
buffer *all* stringified features until the last one arrived, and only
then write them. It used to stream them. Not good when the entry
contains a whole bacterial genome with 20k+ features.
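
A rough sketch of the difference between the two approaches. The class
and method names are made up for illustration and are not the actual
BioJava code; the only point is where the memory goes.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; these are not BioJava classes.
class BufferingFeatureWriter {
    private final List<String> pending = new ArrayList<String>();
    private final PrintWriter out;

    BufferingFeatureWriter(PrintWriter out) { this.out = out; }

    // Every stringified feature is held in memory until the entry ends.
    void feature(String text) { pending.add(text); }

    // Nothing reaches the output until the last feature has arrived.
    void endEntry() {
        for (String text : pending) {
            out.print(text);
            out.print('\n');
        }
        pending.clear();
        out.flush();
    }

    int buffered() { return pending.size(); }
}

class StreamingFeatureWriter {
    private final PrintWriter out;

    StreamingFeatureWriter(PrintWriter out) { this.out = out; }

    // Each feature is written as soon as it arrives, so memory use
    // does not grow with the number of features in the entry.
    void feature(String text) {
        out.print(text);
        out.print('\n');
    }

    void endEntry() { out.flush(); }
}
```

Both produce the same output; the buffering version simply holds all
20k+ strings at once before writing any of them.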

Also, the tokenizer speeds up measurably after changing

 return "" + _tokenizeSymbol(s).charValue();

to

 return String.valueOf(_tokenizeSymbol(s).charValue());

Other stuff -

SeqIOEventEmitter exists purely to create sequence writing events for
EMBL/Genbank & Co. I'd like to make this package private to reflect
this.

It is this class which should be determining the order of events being
sent to the writers (which are SeqIOListeners). Right now the writers
themselves are trying to enforce ordering on the data *after* they get
it and much kludgy hackery ensues. The *addSymbols* method is printing
the sequence properties and features, for heaven's sake.
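
A minimal sketch of an emitter that fixes the event order up front, so
the writers only write what they are handed. EventSink, SimpleFeature
and the rest are hypothetical stand-ins, not the real SeqIOListener
interface, and sorting by start position is just an assumed ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a listener; not the BioJava SeqIOListener API.
interface EventSink {
    void property(String key, String value);
    void feature(String type, int start);
    void symbols(String residues);
}

class SimpleFeature {
    final String type;
    final int start;
    SimpleFeature(String type, int start) { this.type = type; this.start = start; }
}

// A writer that records events in exactly the order it receives them.
class LoggingSink implements EventSink {
    final List<String> lines = new ArrayList<String>();
    public void property(String key, String value) { lines.add("P:" + key + "=" + value); }
    public void feature(String type, int start) { lines.add("F:" + type + "@" + start); }
    public void symbols(String residues) { lines.add("S:" + residues); }
}

class OrderedEmitter {
    // Assumed ordering: by start position, then by feature type name.
    static final Comparator<SimpleFeature> BY_START = new Comparator<SimpleFeature>() {
        public int compare(SimpleFeature a, SimpleFeature b) {
            if (a.start != b.start) return a.start < b.start ? -1 : 1;
            return a.type.compareTo(b.type);
        }
    };

    // The emitter decides the order: properties, sorted features, then symbols.
    static void emit(String id, List<SimpleFeature> features, String residues, EventSink sink) {
        sink.property("ID", id);
        List<SimpleFeature> sorted = new ArrayList<SimpleFeature>(features);
        Collections.sort(sorted, BY_START);
        for (SimpleFeature f : sorted) {
            sink.feature(f.type, f.start);
        }
        sink.symbols(residues);
    }
}
```

With the ordering decided in one place, the sink needs no reordering
logic of its own and its symbols handler only has to deal with symbols.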

I'll add Comparator-based ordering to the emitter and BreakIterator-based
line wrapping to the abstract base class, which should let me remove
masses of duplication from the file formers.
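
For the line wrapping, something along these lines using
java.text.BreakIterator. This is only a sketch: LineWrapper and wrap
are hypothetical names, and the handling of over-long unbreakable
chunks (they are left on a line of their own) is an assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not an existing BioJava class.
class LineWrapper {
    // Wraps text at line-break opportunities so that no line exceeds
    // width, unless a single unbreakable chunk is itself longer.
    static List<String> wrap(String text, int width) {
        List<String> lines = new ArrayList<String>();
        BreakIterator breaks = BreakIterator.getLineInstance(Locale.ENGLISH);
        breaks.setText(text);

        int lineStart = 0;
        int prev = breaks.first();
        for (int next = breaks.next(); next != BreakIterator.DONE; next = breaks.next()) {
            // Measure the candidate line without its trailing whitespace.
            int len = trimmedEnd(text, lineStart, next) - lineStart;
            if (len > width && prev > lineStart) {
                // Too long: cut at the previous break opportunity.
                lines.add(text.substring(lineStart, trimmedEnd(text, lineStart, prev)));
                lineStart = prev;
            }
            prev = next;
        }

        // Whatever remains after the last cut is the final line.
        int end = trimmedEnd(text, lineStart, text.length());
        if (end > lineStart) {
            lines.add(text.substring(lineStart, end));
        }
        return lines;
    }

    // Index just past the last non-whitespace character in [from, to).
    private static int trimmedEnd(String s, int from, int to) {
        while (to > from && Character.isWhitespace(s.charAt(to - 1))) {
            to--;
        }
        return to;
    }
}
```

Putting this in the base class once means each file former only has to
supply its own line width and prefix rather than its own wrapping loop.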

onwards...

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -