[Biojava-dev] SeqIO maintenance

Keith James kdj@sanger.ac.uk
10 Nov 2002 16:58:59 +0000


>>>>> "Mark" == Schreiber, Mark <mark.schreiber@agresearch.co.nz> writes:

    Mark> I think some of the Genbank code was relying on exception
    Mark> handling for processing the file, which is always slow. Is
    Mark> this what's happening here?

I may have been wrong about the GC. Some of the code does suck though:
it deliberately overruns an array and catches the exception to set a
variable. I want to set up a benchmark before I start, so I haven't
applied any fixes yet.
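
For anyone who hasn't seen the pattern, here is a minimal sketch of the
two styles side by side. The class and method names are made up for
illustration and this is not the actual Genbank code; it only shows an
overrun-and-catch loop next to the equivalent bounds-checked loop.

```java
// Hypothetical illustration only; not taken from the Genbank parser.
final class OverrunSketch {
    // Anti-pattern: scan for a space and let the array overrun end the loop.
    static int tokenEndByException(char[] line, int start) {
        int i = start;
        try {
            while (line[i] != ' ') {
                i++;
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            // Ran off the end of the array: the token ends at line.length.
        }
        return i;
    }

    // Same result with an explicit bounds check and no exception.
    static int tokenEndByBoundsCheck(char[] line, int start) {
        int i = start;
        while (i < line.length && line[i] != ' ') {
            i++;
        }
        return i;
    }
}
```

Both methods return the same index, so a benchmark can compare them
directly before and after a fix.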

    >> 2. The SeqFileFormer runtime class loading stuff I wrote is
    >> both unnecessary and confusing. I think I can and should kill it
    >> (without affecting the interfaces).

I've started to deal with this. A lot of the complexity comes from
trying to deal with SequenceFormats which write more than one file
format. In these cases there need to be methods for specifying which
format, getting allowed formats etc. I'd like to move to having each
SequenceFormat write only one format. This would mean e.g. subclassing
EmblLikeFormat to create EmblFormat, SwissprotFormat with polymorphic
write methods. Then we can ditch writeFormat(Sequence seq, String
format, PrintStream os), getFormats() and getDefaultFormat().
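
Roughly what I have in mind, as a self-contained sketch. The Sketch*
types are stand-ins so the example compiles on its own; they are not
the real interfaces, and the lines written are placeholders rather
than the real EMBL or Swissprot record layouts.

```java
import java.io.PrintStream;

// Stand-in for SequenceFormat, reduced to a single write method.
interface SketchSequenceFormat {
    void writeSequence(String name, String residues, PrintStream os);
}

// Shared writing logic for EMBL-like files (hypothetical base class).
abstract class SketchEmblLikeFormat implements SketchSequenceFormat {
    // Each subclass supplies its own ID line; placeholder text only.
    protected abstract String idLine(String name);

    @Override
    public void writeSequence(String name, String residues, PrintStream os) {
        os.println(idLine(name));
        os.println("SQ   " + residues);
        os.println("//");
    }
}

final class SketchEmblFormat extends SketchEmblLikeFormat {
    @Override
    protected String idLine(String name) {
        return "ID   " + name + " (EMBL placeholder)";
    }
}

final class SketchSwissprotFormat extends SketchEmblLikeFormat {
    @Override
    protected String idLine(String name) {
        return "ID   " + name + " (Swissprot placeholder)";
    }
}

final class OneFormatPerClassDemo {
    public static void main(String[] args) {
        // The caller picks a format by picking a class; no String name is passed.
        SketchSequenceFormat[] formats = {
            new SketchEmblFormat(), new SketchSwissprotFormat()
        };
        for (SketchSequenceFormat f : formats) {
            f.writeSequence("X1", "acgt", System.out);
        }
    }
}
```

The format is chosen by the object you hold, so the String-taking
write method and the format-listing methods have nothing left to do.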

Right now I've removed the SeqFileFormerFactories and the getFormats()
method. Everything else behaves as it did before: in preparation, I've
simply deprecated the writeFormat method that takes a String argument.

    >> 3. We are referring to sequence formats by String names
    >> (e.g. Embl, Swissprot)

Given the above change to one file format per SequenceFormat, this
goes away as we never have to pass a format name as a String.

    >> SeqIOTools uses Nimesh Singh's int fields to identify formats
    >> (or aspects thereof). Personally, I prefer this
    >> nomenclature. What do others prefer? At least we should provide
    >> a map between the two systems. (Also, the int fields probably
    >> belong in the SequenceFormat interface as this is the
    >> convention used elsewhere. Right now they're all in
    >> SeqIOTools.)

    Mark> Agree

I've put some preliminary static ints in SequenceFormat. I've used the
most significant bytes to hold information about symbols (DNA vs. RNA
vs. AA etc.) and the least significant bytes to hold information about
the file format. These could change, but it's more systematic now than
it was.
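
A rough sketch of how such packed ints can be laid out and taken apart
again. It assumes a 16/16 bit split with the file format in the low
half; the constant names and values below are illustrative only, not
the ones I committed.

```java
// Hypothetical constants; names, values and the 16/16 split are assumptions.
final class FormatCodeSketch {
    // Low 16 bits: file format.
    static final int FASTA   = 1;
    static final int EMBL    = 2;
    static final int GENBANK = 3;

    // High 16 bits: symbol type.
    static final int DNA = 1 << 16;
    static final int RNA = 2 << 16;
    static final int AA  = 3 << 16;

    static final int FORMAT_MASK = 0x0000FFFF;
    static final int SYMBOL_MASK = 0xFFFF0000;

    // Extract the file format part of a combined code.
    static int formatOf(int code) {
        return code & FORMAT_MASK;
    }

    // Extract the symbol type part of a combined code.
    static int symbolsOf(int code) {
        return code & SYMBOL_MASK;
    }
}
```

A combined code is then just `FormatCodeSketch.DNA | FormatCodeSketch.EMBL`,
and either half can be recovered with the matching mask.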

    Mark> I had a look at the format guessing code. While all format
    Mark> guessing is likely to fail sometimes, there are some
    Mark> weaknesses in this code that should be fixed. In a number of
    Mark> cases white space at the start of the file is likely to
    Mark> cause errors. The guessing of the alphabet only takes into
    Mark> account the first line of sequence; while speedy, it's
    Mark> probably not ideal. If it is only going to read the first
    Mark> line, why not use a combination of mark and reset to avoid
    Mark> reopening the file (expensive on some OSes)? Much of the
    Mark> format guessing also relies on looking for a keyword near
    Mark> the start of the file (which is fine), so it may be
    Mark> possible to engineer format guessing on a Stream rather than
    Mark> just a file, dependent on reset and mark being supported
    Mark> (not sure on this).

    Mark> Possibly too much reliance is placed on the extension name
    Mark> of the file. This is only my opinion, as there really
    Mark> aren't any standard extensions for FASTA, Genbank etc. Also,
    Mark> the procedure used in format guessing should be in the
    Mark> javadocs, otherwise someone might be left confused as to why
    Mark> their perfectly well-formatted file is not correctly
    Mark> guessed.

Thanks for the pointers. I haven't started on this yet. Wanted to get
some feedback on SequenceFormats first.
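
For reference, the mark/reset idea might look something like the
sketch below. It is not the existing guessing code: the class name,
the enum and the 1024-byte peek limit are all assumptions, and it only
checks the leading keyword (">" for FASTA, "LOCUS" for Genbank, "ID"
for EMBL-like files) after skipping leading white space.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of stream-based guessing; not the current implementation.
final class StreamGuessSketch {
    enum Guess { FASTA, GENBANK, EMBL_LIKE, UNKNOWN }

    // How far we are willing to peek; an arbitrary choice for this sketch.
    static final int PEEK_LIMIT = 1024;

    static Guess guess(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int n = 0;
        // Fill the buffer, stopping early at end of stream.
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // Rewind so the real parser sees the stream from the start.
        in.reset();

        // trim() drops leading (and trailing) white space before the keyword test.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1).trim();
        if (head.startsWith(">")) {
            return Guess.FASTA;
        }
        if (head.startsWith("LOCUS")) {
            return Guess.GENBANK;
        }
        if (head.startsWith("ID ")) {
            return Guess.EMBL_LIKE;
        }
        return Guess.UNKNOWN;
    }
}
```

Wrapping any InputStream in a BufferedInputStream gives mark/reset
support, so this would work on streams as well as files without
reopening anything.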

Keith

-- 

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -