[Biojava-l] Fasta & EMBL feature table parsing

Matthew Pocock mrp@sanger.ac.uk
Mon, 27 Nov 2000 15:03:07 +0000


Hi Keith,

You should drop in some time and say hello (D322).

Keith James wrote:

> Hi,
>
> I'm one of the Sanger Pathogen Sequencing Unit annotators and I've
> been writing/using OO Perl stuff for EMBL feature table editing,
> Blast/Fasta/HMMER/EMBOSS etc. sequence analysis. I'm a Java newbie
> looking to see if the 'grass is greener' on the Java side of the
> fence.
>
> I spent a weekend reading the Javadoc and trying things out. No
> problems. Now I have some questions:
>

Wow - you could make stuff work from reading the docs? They must be better
than I remember...

>
> I want to implement a Fasta search output parser (for the nicer -m 10
> form of output). I have a Perl implementation right now. Going through
> the list archive I found lots of discussion regarding the Blast
> SAX-type parser. Would this be the preferred way to cope with Fasta?
> This might be a bit of a challenge for me as I am initially confused
> by the various layers of the SAX-type system, but I'm sure I'll sort
> it out.
>

SAX would be the ideal way to do this, but as you say, it does require a
level of effort that may be disproportionately high.

>
> (How does the SAX-type parser fit in with the code in
> org.biojava.bio.search?)
>

bio.search specifies how the biojava objects for representing search methods
& results should appear. The parsing framework specifies how the results
flow through the application as a stream of data. It is easy to build
bio.search objects from the xml streams by extracting interesting stuff.
However, with the streams, you can do on-the-fly translation into other
formats, e.g. HTML. You could also build the bio.search objects directly from
the Fasta search output, or build them to represent the results of your
personal search algorithm written in Java.
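
To make the stream idea concrete, here is a minimal sketch using the standard
org.xml.sax interfaces. Everything else in it is an assumption for
illustration: the class names, the "hits"/"hit" element names, and the
two-column "id score" input are invented, and this is neither the real
Fasta -m 10 format nor the actual biojava parsing framework. The point is
only that one event source can feed either an object builder or an HTML
writer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: turns simplified search output into SAX events. */
class SearchEventSketch {

    /** Emits one hit element per "id score" input line (assumed format). */
    static void parse(BufferedReader in, ContentHandler handler)
            throws IOException, SAXException {
        handler.startDocument();
        handler.startElement("", "hits", "hits", new AttributesImpl());
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 2) {
                continue; // skip anything that is not "id score"
            }
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "id", "id", "CDATA", fields[0]);
            atts.addAttribute("", "score", "score", "CDATA", fields[1]);
            handler.startElement("", "hit", "hit", atts);
            handler.endElement("", "hit", "hit");
        }
        handler.endElement("", "hits", "hits");
        handler.endDocument();
    }

    /** Consumer 1: builds objects (here just a list of hit ids). */
    static class HitCollector extends DefaultHandler {
        final List<String> ids = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if ("hit".equals(qName)) {
                ids.add(atts.getValue("id"));
            }
        }
    }

    /** Consumer 2: translates the same events into an HTML table
     *  (no escaping of values, to keep the sketch short). */
    static class HtmlWriter extends DefaultHandler {
        final StringBuilder html = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if ("hits".equals(qName)) {
                html.append("<table>");
            } else if ("hit".equals(qName)) {
                html.append("<tr><td>").append(atts.getValue("id"))
                    .append("</td><td>").append(atts.getValue("score"))
                    .append("</td></tr>");
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("hits".equals(qName)) {
                html.append("</table>");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String input = "HIT1 120\nHIT2 85\n";
        HtmlWriter writer = new HtmlWriter();
        parse(new BufferedReader(new StringReader(input)), writer);
        System.out.println(writer.html);
    }
}
```

Swapping HtmlWriter for HitCollector changes what comes out without touching
the parser, which is the main attraction of the event-stream design.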

>
> And an observation:
>
> The EMBL flatfile feature table parser (at least, as it was until the
> new io stuff) would overwrite qualifiers. e.g. where there were
> several /gene names in a feature, only the last one would be
> retained. Also quirks similar to earlier Bioperl (like discarding
> information from < and > in locations, which is important for us to
> keep). Are these going to be addressed in the io shakeup?
>

The qualifier overwriting should be addressed by the new IO (fingers
crossed). Fuzzy locations are evil. I ducked handling this one until
somebody required it. You require it, so I guess the days of ducking are
over. I am willing to add a new implementation of the Location interface
called FuzzyLocation. It will have isMinFuzzy() and isMaxFuzzy() boolean
methods, and will decorate another Location for all the other location
methods. This way I think we can store everything & lose nothing. Sounds
good?
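
Roughly what I have in mind, as a sketch only. SimpleLocation and
RangeLocation below are cut-down stand-ins invented for this example (the
real Location interface has more methods), and the parse() helper assumes
the simple "<min..>max" form of an EMBL location, nothing more elaborate.

```java
/** Hypothetical, cut-down stand-in for a Location interface. */
interface SimpleLocation {
    int getMin();
    int getMax();
    boolean contains(int position);
}

/** Plain closed range, used as the decorated location in this sketch. */
class RangeLocation implements SimpleLocation {
    private final int min;
    private final int max;

    RangeLocation(int min, int max) {
        this.min = min;
        this.max = max;
    }

    public int getMin() { return min; }
    public int getMax() { return max; }
    public boolean contains(int position) {
        return position >= min && position <= max;
    }
}

/** Decorator that remembers whether either end was marked with < or >. */
class FuzzyLocationSketch implements SimpleLocation {
    private final SimpleLocation delegate;
    private final boolean minFuzzy;
    private final boolean maxFuzzy;

    FuzzyLocationSketch(SimpleLocation delegate, boolean minFuzzy,
                        boolean maxFuzzy) {
        this.delegate = delegate;
        this.minFuzzy = minFuzzy;
        this.maxFuzzy = maxFuzzy;
    }

    public boolean isMinFuzzy() { return minFuzzy; }
    public boolean isMaxFuzzy() { return maxFuzzy; }

    // All ordinary location methods are forwarded to the wrapped location.
    public int getMin() { return delegate.getMin(); }
    public int getMax() { return delegate.getMax(); }
    public boolean contains(int position) {
        return delegate.contains(position);
    }

    /** Parses the assumed simple form, e.g. "<1..>200" or "5..10". */
    static FuzzyLocationSketch parse(String text) {
        String[] ends = text.split("\\.\\.");
        if (ends.length != 2) {
            throw new IllegalArgumentException("Expected min..max: " + text);
        }
        boolean minFuzzy = ends[0].startsWith("<");
        boolean maxFuzzy = ends[1].startsWith(">");
        int min = Integer.parseInt(minFuzzy ? ends[0].substring(1) : ends[0]);
        int max = Integer.parseInt(maxFuzzy ? ends[1].substring(1) : ends[1]);
        return new FuzzyLocationSketch(new RangeLocation(min, max),
                                       minFuzzy, maxFuzzy);
    }

    /** Writes the location back out with its < and > markers intact. */
    @Override
    public String toString() {
        return (minFuzzy ? "<" : "") + getMin() + ".."
             + (maxFuzzy ? ">" : "") + getMax();
    }
}
```

Because the fuzziness lives in the wrapper, code that only cares about
coordinates can keep using getMin()/getMax(), while a writer can check
isMinFuzzy()/isMaxFuzzy() and put the markers back on output.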

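On the qualifier overwriting: the usual cure is to store a list of values
per qualifier name rather than a single value. A minimal sketch, assuming a
made-up QualifierSketch class (not a biojava class) and made-up gene names:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for feature qualifiers that keeps repeated keys. */
class QualifierSketch {
    // LinkedHashMap keeps qualifiers in the order they were first seen.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value instead of replacing what is stored under the key. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns every value seen for the key, or an empty list if none. */
    List<String> get(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }

    public static void main(String[] args) {
        QualifierSketch q = new QualifierSketch();
        q.add("gene", "abcA"); // illustrative names only
        q.add("gene", "abcB");
        System.out.println(q.get("gene"));
    }
}
```

With a plain key-to-value map the second add() would silently replace the
first, which is the behaviour Keith describes.
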
>
> On a related note, if nobody is going to implement writeSequence for
> EMBL, then I'll offer to do it.

Thanks - once the new IO has settled down, this would be great.

>
>
> cheers,
>
> Keith
>

Matthew

>
> --
>
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA