[Biojava-dev] Bug in SeqIOTools

Keith James kdj at sanger.ac.uk
Tue Mar 4 10:12:57 EST 2003


>>>>> "Mark" == Schreiber, Mark <mark.schreiber at agresearch.co.nz> writes:

    Mark> Hi - The new way of getting an int to identify your file
    Mark> type in SeqIOTools is somewhat buggy. The problem seems to
    Mark> stem from the use of the method
    Mark> SeqIOTools.identifyFormat(String formatName, String
    Mark> alphabetName) this method returns an int by doing some
    Mark> bitwise operations that should equal one of the constants in
    Mark> SeqIOConstants.

    Mark> There seems to be a problem however with formats like
    Mark> Genbank. If you supply the formatName "genbank" then the DNA
    Mark> alphabet is implied however you have to give an alphabetName
    Mark> as an argument. If you give the name DNA then the returned
    Mark> int no longer matches the SeqIOConstants value for GenBank so
    Mark> you can't use fileToBioJava() type methods ie it doesn't
    Mark> recognize the genbank | dna operation. If you use an empty
    Mark> string for the alphabetName it defaults to "Unknown" which
    Mark> again won't work. If you put null as the second argument you
    Mark> get a null pointer exception.
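
To make the mismatch concrete, here is a minimal sketch of the bitwise
scheme described above. The constant values and the class name
FormatIdSketch are assumptions for illustration and are not the real
SeqIOConstants values. The point is only that GENBANK | DNA is a
different int from a bare GENBANK, so an exact comparison fails while
a masked comparison still recovers the format.

```java
final class FormatIdSketch {
    // Hypothetical layout (NOT the real SeqIOConstants values):
    // format ids live in the low 16 bits, alphabet flags in the bits above.
    static final int GENBANK = 2;
    static final int DNA = 1 << 16;
    static final int RNA = 1 << 17;
    static final int AA  = 1 << 18;

    static final int FORMAT_MASK = 0xFFFF;

    // Build a combined identifier from a format id and an alphabet flag.
    static int combine(int format, int alphabet) {
        return format | alphabet;
    }

    // Recover the format part of a combined identifier.
    static int formatOf(int id) {
        return id & FORMAT_MASK;
    }

    // Recover the alphabet part of a combined identifier.
    static int alphabetOf(int id) {
        return id & ~FORMAT_MASK;
    }

    public static void main(String[] args) {
        int id = combine(GENBANK, DNA);
        System.out.println(id == GENBANK);           // prints false: exact match fails
        System.out.println(formatOf(id) == GENBANK); // prints true: masked match works
    }
}
```

Code that switches on the exact int only recognises the combinations
it was written for; masking out the alphabet bits first is one way to
accept any alphabet for a given format.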

Well, I had to hack it on the fly in Singapore in order to get the
OBDA stuff working. There's a bunch of methods which are now broken. I
found a couple of hours last night to fix more OBDA, but the
fileToBioJava() etc. methods in SeqIOTools are below that on my list.

I originally mapped the name "genbank" to GENBANK | DNA, but then
GENBANK | RNA is also valid. Plus you can coerce a sequence of any
alphabet into just about any format with EMBOSS (e.g. GENBANK | AA).

So the current state is that swissprot, genpept and pdb imply AA,
phred implies DNA and all others make no assumption. It would be more
consistent to make no assumptions at all about format name implying an
alphabet.
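
The current state described above amounts to a small lookup table. A
minimal sketch, assuming lower-case format names as keys and plain
strings for the alphabets; the class and method names are hypothetical
and not part of SeqIOTools.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class ImpliedAlphabetSketch {
    // Formats that currently imply an alphabet, as listed in this message.
    private static final Map<String, String> IMPLIED = new HashMap<>();

    static {
        IMPLIED.put("swissprot", "AA");
        IMPLIED.put("genpept", "AA");
        IMPLIED.put("pdb", "AA");
        IMPLIED.put("phred", "DNA");
    }

    // Returns the implied alphabet, or empty when the format name
    // makes no assumption (genbank, fasta and all others).
    static Optional<String> impliedAlphabet(String formatName) {
        String key = formatName.toLowerCase(Locale.ROOT);
        return Optional.ofNullable(IMPLIED.get(key));
    }
}
```

Dropping the assumptions altogether, as suggested above, would
correspond to an empty table: every caller would then have to name the
alphabet explicitly.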

    Mark> To be really robust we should probably have an overloaded
    Mark> identifyFormat() method: one that takes just the format
    Mark> name and complains if it really needs an alphabet (like for
    Mark> Fasta) and one that takes two and complains if your
    Mark> combination makes no sense, e.g. GenBank and RNA or
    Mark> something. We need to at least get it working before 1.3.

But GENBANK | RNA does make sense e.g. gb:HSA299431

You're right, though. I need to check in a test for all format/alpha
combinations for each method. I can't do this in work hours - it'll
take me a few days to find the necessary time.
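
As a rough idea of what such a test could look like, here is a
self-contained sketch. It does not call the real SeqIOTools; the
two-argument identifyFormat below, the bit layout and the class name
are all assumptions made for illustration. It rejects null and unknown
names with an explicit exception rather than a null pointer exception,
and it loops over every format/alphabet pair to check that both parts
can be recovered from the returned int.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class CombinationCheckSketch {
    // Hypothetical layout (NOT the real SeqIOConstants values):
    // format ids in the low 16 bits, alphabet flags above them.
    static final int FORMAT_MASK = 0xFFFF;
    static final Map<String, Integer> FORMATS = new LinkedHashMap<>();
    static final Map<String, Integer> ALPHABETS = new LinkedHashMap<>();

    static {
        String[] names = {"fasta", "genbank", "swissprot", "genpept", "pdb", "phred"};
        for (int i = 0; i < names.length; i++) {
            FORMATS.put(names[i], i + 1);
        }
        ALPHABETS.put("dna", 1 << 16);
        ALPHABETS.put("rna", 1 << 17);
        ALPHABETS.put("aa", 1 << 18);
    }

    // Hypothetical stand-in for a two-argument identifyFormat.
    static int identifyFormat(String formatName, String alphabetName) {
        Integer format = FORMATS.get(key(formatName, "formatName"));
        Integer alphabet = ALPHABETS.get(key(alphabetName, "alphabetName"));
        if (format == null) {
            throw new IllegalArgumentException("Unknown format: " + formatName);
        }
        if (alphabet == null) {
            throw new IllegalArgumentException("Unknown alphabet: " + alphabetName);
        }
        return format | alphabet;
    }

    // Normalise a name for lookup, rejecting null explicitly.
    private static String key(String name, String what) {
        if (name == null) {
            throw new IllegalArgumentException(what + " must not be null");
        }
        return name.toLowerCase(Locale.ROOT);
    }

    // Check every format/alphabet pair; returns the number of pairs checked.
    static int checkAllCombinations() {
        int checked = 0;
        for (Map.Entry<String, Integer> f : FORMATS.entrySet()) {
            for (Map.Entry<String, Integer> a : ALPHABETS.entrySet()) {
                int id = identifyFormat(f.getKey(), a.getKey());
                boolean formatOk = (id & FORMAT_MASK) == f.getValue();
                boolean alphabetOk = (id & ~FORMAT_MASK) == a.getValue();
                if (!formatOk || !alphabetOk) {
                    throw new AssertionError(f.getKey() + " | " + a.getKey());
                }
                checked++;
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println(checkAllCombinations() + " combinations checked");
    }
}
```

A real test would run the same loop against each SeqIOTools method
that accepts the combined int, not just against the identifier itself.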

cheers,

Keith

-- 

- Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -


