[Biojava-dev] SeqIO maintenance

Keith James kdj@sanger.ac.uk
04 Nov 2002 16:36:10 +0000


I've been looking through the SeqIO code and found a few things
which I think need attention prior to the 1.3 release.

1. Writing complex formats, e.g. EMBL, is too slow and seems to have
   become slower (no numbers to back this up, though). There are long
   pauses for GC during writing. I think it needs attention.

2. The SeqFileFormer runtime class loading stuff I wrote is both
   unnecessary and confusing. I think I can and should kill it (without
   affecting the interfaces).

3. We are referring to sequence formats by String names (e.g. Embl,
   Swissprot) in the interfaces, apart from in one method of both
   MSFAlignmentFormat and FastaAlignmentFormat which takes an
   int. However, the required int field is class private and in the
   case of MSFAlignmentFormat is different from the public int field
   for the same format in SeqIOTools.

   SeqIOTools uses Nimesh Singh's int fields to identify formats (or
   aspects thereof). Personally, I prefer this nomenclature. What do
   others prefer? At least we should provide a map between the two
   systems. (Also, the int fields probably belong in the
   SequenceFormat interface as this is the convention used
   elsewhere. Right now they're all in SeqIOTools.)

   It might also be nice to have SequenceFormat.FASTADNA equal to
   (SequenceFormat.FASTA | SequenceFormat.DNA) etc.

4. There is now almost identical file format guessing code
   cut'n'pasted into SeqIOTools, SeqAlignReadWrite, MSFAlignmentFormat
   and FastaAlignmentFormat. I'd like to move all this to a
   package-private class.
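
To make point 3 concrete, here is a rough sketch of how composable int
identifiers and a String-to-int map might fit together. Everything in
it is hypothetical: the class name FormatFlags, the constant values,
the masks and the helper methods are invented for illustration and are
not the existing SeqIOTools fields. It is written in current Java
rather than anything we could ship in 1.3. Note that the combined
constants are built with bitwise OR; the AND of two disjoint fields
would just be zero.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only: names and values are NOT taken from BioJava.
final class FormatFlags {
    // Low byte: file format identifier (illustrative values).
    static final int FASTA = 1;
    static final int EMBL = 2;
    static final int SWISSPROT = 3;

    // Second byte: alphabet flag (illustrative values).
    static final int DNA = 1 << 8;
    static final int PROTEIN = 1 << 9;

    static final int FORMAT_MASK = 0x00FF;
    static final int ALPHABET_MASK = 0xFF00;

    // Combined identifiers are composed with bitwise OR.
    static final int FASTA_DNA = FASTA | DNA;
    static final int EMBL_DNA = EMBL | DNA;
    static final int SWISSPROT_PROTEIN = SWISSPROT | PROTEIN;

    // Map between the String names and the int format identifiers.
    private static final Map<String, Integer> BY_NAME = new LinkedHashMap<>();
    static {
        BY_NAME.put("fasta", FASTA);
        BY_NAME.put("embl", EMBL);
        BY_NAME.put("swissprot", SWISSPROT);
    }

    private FormatFlags() { }

    // Extract the format part of a combined identifier.
    static int formatOf(int id) {
        return id & FORMAT_MASK;
    }

    // Extract the alphabet part of a combined identifier.
    static int alphabetOf(int id) {
        return id & ALPHABET_MASK;
    }

    // Look up the int identifier for a String name (case-insensitive).
    static int forName(String name) {
        Integer id = BY_NAME.get(name.toLowerCase(Locale.ROOT));
        if (id == null) {
            throw new IllegalArgumentException("Unknown format: " + name);
        }
        return id;
    }

    // Look up the String name for the format part of an identifier.
    static String nameOf(int id) {
        int format = formatOf(id);
        for (Map.Entry<String, Integer> e : BY_NAME.entrySet()) {
            if (e.getValue() == format) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("Unknown format id: " + id);
    }

    public static void main(String[] args) {
        System.out.println(nameOf(EMBL_DNA));
        System.out.println(alphabetOf(FASTA_DNA) == DNA);
        System.out.println(forName("Swissprot") == SWISSPROT);
    }
}
```

Keeping the format and alphabet in separate bit ranges means either
part can be recovered from a combined value with a mask, and the name
map only has to know about the format part.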

Can anyone think of more while I'm at it?

Keith

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -