[Biojava-dev] EMBL/Genbank/Swissprot writing changes

Keith James kdj@sanger.ac.uk
15 Nov 2002 09:49:04 +0000


I've made a few changes here. The objectives were to restore expected
behaviour (e.g. you should not have to call addSymbols to print your
feature table), use less memory when writing big sequences, increase
speed and make the code more maintainable.

I think all have been achieved for EMBL writing and the changes still
required for Genbank, Swissprot etc will be moderate. However, I'm not
familiar enough with Genbank to make the changes myself without the
risk of creating subtly mangled files.

In rough tests, writing EMBL now takes about 60% of the time it used to
for a big file (5 Mb sequence + 10k features).

The key changes are the ability to pass Comparators to the
SeqIOEventEmitter to enforce ordering of the
addFeatureProperty/addSequenceProperty calls and addition of a
sequence property formatting method to AbstractGenEmblFileFormer.
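
To illustrate the ordering idea only, here is a minimal sketch of a
Comparator that fixes the order in which property keys would be
emitted. It does not use the real SeqIOEventEmitter API: the class
name PropertyOrderSketch, the method emissionOrder and the short tag
list are all hypothetical, and the tag order shown is an assumption
for the example, not the full EMBL line order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; none of these names come from BioJava.
public class PropertyOrderSketch {

    // Assumed, abbreviated tag order used purely for illustration.
    static final List<String> TAG_ORDER =
        Arrays.asList("ID", "AC", "DE", "KW", "OS", "OC");

    // Position of a tag in TAG_ORDER; unknown tags all rank last.
    static int rank(String tag) {
        int i = TAG_ORDER.indexOf(tag);
        return i < 0 ? TAG_ORDER.size() : i;
    }

    // Known tags sort by rank; ties (i.e. unknown tags) sort alphabetically.
    static final Comparator<String> TAG_COMPARATOR = (a, b) -> {
        int ra = rank(a);
        int rb = rank(b);
        return ra != rb ? Integer.compare(ra, rb) : a.compareTo(b);
    };

    // Returns the property keys in the order they would be written out,
    // regardless of the order in which they were stored.
    static List<String> emissionOrder(Map<String, String> props) {
        Map<String, String> sorted = new TreeMap<>(TAG_COMPARATOR);
        sorted.putAll(props);
        return new ArrayList<>(sorted.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("OS", "Plasmodium falciparum");
        props.put("ZZ", "unknown tag");
        props.put("ID", "example");
        props.put("DE", "example description");
        props.put("AC", "example accession");
        System.out.println(emissionOrder(props));
    }
}
```

The point of the design is that the writer no longer depends on the
order in which properties happen to be stored in the Annotation: the
Comparator alone decides the order of the output lines.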

Writing some aspects of the header (notably references) is not working
properly (it never has). Some of this needs to be addressed in the
parser - if there were a structured Annotation for this stuff common to
all Sequences then fixing this would be trivial. Will get to that
later.

Keith

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -