[Biojava-dev] SeqIO maintenance

Schreiber, Mark mark.schreiber@agresearch.co.nz
Tue, 5 Nov 2002 09:54:41 +1300


> -----Original Message-----
> From: Keith James [mailto:kdj@sanger.ac.uk] 
> Sent: Tuesday, 5 November 2002 5:36 a.m.
> To: BioJava Dev List
> Subject: [Biojava-dev] SeqIO maintenance
> 
> 
> 
> I've been having a look through the seqIO stuff and found a 
> few things which I think need attention prior to 1.3 release.
> 
> 1. Writing complex formats e.g. EMBL is too slow and seems to have
>    become slower (no numbers to back this up, though). Big pauses for
>    GC during writing. I think it needs attention.
> 

I think some of the Genbank code was relying on exception handling to
process the file, which is always slow. Is that what's happening here?
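
To illustrate the concern, here is a minimal sketch contrasting the two
styles. It is not the BioJava Genbank parser; the class and method names
are hypothetical. The exception-driven version pays for constructing an
exception (including its stack trace) on every token that fails to
parse, while the explicit check does not.

```java
// Hypothetical helper, for illustration only; not part of BioJava.
final class TokenCheck {
    // Exception-driven: a non-numeric token is detected by catching the
    // exception thrown by the parser.
    static boolean isNumberByException(String token) {
        try {
            Integer.parseInt(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Explicit check: inspect the characters directly, no exception needed.
    // This sketch only accepts plain unsigned digit strings, so it is not
    // a drop-in replacement for Integer.parseInt (no sign, no range check).
    static boolean isNumberByCheck(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNumberByException("FEATURES"));
        System.out.println(isNumberByCheck("FEATURES"));
        System.out.println(isNumberByCheck("1234"));
    }
}
```

If the parser hits the failing branch for most lines of a large file,
the cost of the first style adds up quickly.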


> 2. The SeqFileFormer runtime class loading stuff I wrote is both
>    unnecessary and confusing. I think I can and should kill it (without
>    affecting the interfaces).
> 
> 3. We are referring to sequence formats by String names (e.g. Embl,
>    Swissprot) in the interfaces, apart from in one method of both
>    MSFAlignmentFormat and FastaAlignmentFormat which takes an
>    int. However, the required int field is class private and in the
>    case of MSFAlignmentFormat is different from the public int field
>    for the same format in SeqIOTools.
> 
>    SeqIOTools uses Nimesh Singh's int fields to identify formats (or
>    aspects thereof). Personally, I prefer this nomenclature. What do
>    others prefer? At least we should provide a map between the two
>    systems. (Also, the int fields probably belong in the
>    SequenceFormat interface as this is the convention used
>    elsewhere. Right now they're all in SeqIOTools.)
> 
>    It might also be nice to have SequenceFormat.FASTADNA equal to
>    (SequenceFormat.FASTA & SequenceFormat.DNA) etc.
> 

Agree
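
A minimal sketch of how such composite identifiers might look, assuming
the format and alphabet occupy separate bits. The class name and the
values are hypothetical, not the actual fields in SeqIOTools. Note that
composing two flags would normally use bitwise OR (|); bitwise AND (&)
is what you would use to test whether a flag is set.

```java
// Hypothetical constants, for illustration only; these are not the
// values currently defined in SeqIOTools or SequenceFormat.
final class FormatFlags {
    // Low bits identify the file format.
    static final int FASTA = 1 << 0;
    static final int EMBL = 1 << 1;
    static final int GENBANK = 1 << 2;

    // Higher bits identify the alphabet.
    static final int DNA = 1 << 8;
    static final int PROTEIN = 1 << 9;

    // A composite identifier is the bitwise OR of its parts.
    static final int FASTADNA = FASTA | DNA;

    // Test whether a composite identifier includes a given flag.
    static boolean has(int id, int flag) {
        return (id & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println(has(FASTADNA, FASTA));
        System.out.println(has(FASTADNA, DNA));
        System.out.println(has(FASTADNA, EMBL));
    }
}
```

With distinct bits, FASTA & DNA would be 0, which is why OR is used to
build the combined constant.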

> 4. There is now almost identical file format guessing code
>    cut'n'pasted in SeqIOTools, SeqAlignReadWrite, MSFAlignmentFormat,
>    FastaAlignmentFormat. I'd like to move all this to a package
>    private class.
> 

I had a look at the format guessing code. While all format guessing is
likely to fail sometimes, there are some weaknesses in this code that
should be fixed. In a number of cases, white space at the start of the
file is likely to cause errors. The guessing of the alphabet only takes
into account the first line of sequence; while speedy, it's probably not
ideal. If it is only going to read the first line, why not use a
combination of mark and reset to avoid reopening the file (expensive on
some OSs)? Much of the format guessing also relies on looking for a
keyword near the start of the file (which is fine), so it may be
possible to engineer format guessing on a Stream rather than just a
file, dependent on reset and mark being supported (not sure on this).
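
A rough sketch of the mark/reset idea, under some assumptions: the
class name, the 1024-byte peek limit and the returned strings are all
hypothetical, and the keyword rules are deliberately simplistic (FASTA
records start with ">", Genbank entries with "LOCUS", EMBL entries with
an "ID" line). It peeks at the head of any stream that supports mark,
skips leading white space, and then rewinds so the real parser can read
from the start without reopening anything.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not the existing guessing code.
final class FormatGuesser {
    // Assumed upper bound on how far we need to look into the stream.
    static final int PEEK_LIMIT = 1024;

    static String guess(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int total = 0;
        int n;
        // Read up to PEEK_LIMIT bytes, stopping early at end of stream.
        while (total < buf.length
                && (n = in.read(buf, total, buf.length - total)) > 0) {
            total += n;
        }
        in.reset(); // rewind so the caller sees the stream from the start

        // trim() drops leading white space, which otherwise hides the keyword.
        String head = new String(buf, 0, total, StandardCharsets.US_ASCII).trim();
        if (head.startsWith(">")) {
            return "FASTA";
        }
        if (head.startsWith("LOCUS")) {
            return "GENBANK";
        }
        if (head.startsWith("ID ")) {
            // Swiss-Prot entries also open with an ID line, so a real
            // implementation would need a further check here.
            return "EMBL";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\n\n>seq1\nACGT\n".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(guess(in));
    }
}
```

Wrapping an arbitrary stream in a BufferedInputStream is one way to get
mark support when the underlying stream lacks it.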

Possibly too much reliance is placed on the extension name of the file.
This is only my opinion, but there really aren't any standard
extensions for FASTA, Genbank etc. Also, the procedure used in format
guessing should be in the javadocs; otherwise someone might be left
confused as to why their perfectly well formatted file is not correctly
guessed.

> Can anyone think of more while I'm at it?
> 
> Keith
> 
> -- 
> 
> - Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
> - Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -
> 