[Biojava-l] Fasta & EMBL feature table parsing

Keith James kdj@sanger.ac.uk
27 Nov 2000 14:13:18 +0000


Hi,

I'm one of the Sanger Pathogen Sequencing Unit annotators and I've
been writing and using OO Perl code for EMBL feature table editing and
for sequence analysis with Blast, Fasta, HMMER, EMBOSS etc. I'm a Java
newbie looking to see if the 'grass is greener' on the Java side of
the fence.

I spent a weekend reading the Javadoc and trying things out. No
problems. Now I have some questions:

I want to implement a Fasta search output parser (for the nicer -m 10
form of output). I have a Perl implementation right now. Going through
the list archive I found lots of discussion regarding the Blast
SAX-type parser. Would this be the preferred way to cope with Fasta?
This might be a bit of a challenge for me as I am initially confused
by the various layers of the SAX-type system, but I'm sure I'll sort
it out.

(How does the SAX-type parser fit in with the code in
org.biojava.bio.search?)
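
To make the question concrete, an event-driven (SAX-type) approach to
-m 10 output could be sketched roughly as below. This is only a sketch
under stated assumptions: FastaM10Handler, FastaM10Scanner and
RecordingHandler are hypothetical names, not BioJava classes, and the
scanner assumes that each hit record starts with a ">>" line followed
by "; key: value" lines, with ">>>" lines marking the query header and
the end of the report.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface (not part of BioJava). */
interface FastaM10Handler {
    void startHit(String header);
    void property(String key, String value);
    void endHit();
}

/** Minimal line scanner that turns hit records into handler events. */
class FastaM10Scanner {
    static void parse(BufferedReader in, FastaM10Handler handler) throws IOException {
        boolean inHit = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">>>")) {
                // Assumed query header or end-of-report marker: close any open hit.
                if (inHit) {
                    handler.endHit();
                    inHit = false;
                }
            } else if (line.startsWith(">>")) {
                // Assumed start of a new hit record.
                if (inHit) {
                    handler.endHit();
                }
                handler.startHit(line.substring(2).trim());
                inHit = true;
            } else if (inHit && line.startsWith("; ")) {
                // "; key: value" line belonging to the current hit.
                int colon = line.indexOf(':');
                if (colon > 2) {
                    handler.property(line.substring(2, colon).trim(),
                                     line.substring(colon + 1).trim());
                }
            }
            // Anything else (sequence lines, single ">" sub-headers) is skipped here.
        }
        if (inHit) {
            handler.endHit();
        }
    }

    /** Convenience wrapper for in-memory text. */
    static void parseString(String text, FastaM10Handler handler) {
        try {
            parse(new BufferedReader(new StringReader(text)), handler);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Example handler that simply records the events it receives. */
class RecordingHandler implements FastaM10Handler {
    final List<String> events = new ArrayList<>();
    public void startHit(String header) { events.add("start:" + header); }
    public void property(String key, String value) { events.add("prop:" + key + "=" + value); }
    public void endHit() { events.add("end"); }
}
```

The scanner knows nothing about what is built from the events, so the
same pass could feed either a simple report or a richer object model
by swapping the handler.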

And an observation:

The EMBL flatfile feature table parser (at least, as it was until the
new io stuff) would overwrite qualifiers, e.g. where there were
several /gene names in a feature, only the last one would be
retained. It also has quirks similar to earlier Bioperl, such as
discarding the < and > markers in locations, which are important for
us to keep. Are these going to be addressed in the io shakeup?
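
The behaviour asked for above could be sketched as follows. This is a
minimal illustration in modern Java, not the BioJava implementation:
FeatureSketch is a hypothetical class, and setLocation only handles a
simple "start..end" range with optional < and > markers, not the full
EMBL location syntax (join, complement and so on).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical feature holder (not a BioJava class). */
class FeatureSketch {
    // Each qualifier key maps to a list, so repeated keys accumulate
    // instead of the last value overwriting the earlier ones.
    final Map<String, List<String>> qualifiers = new LinkedHashMap<>();

    int start;
    int end;
    boolean partialStart; // true when the location had a leading '<'
    boolean partialEnd;   // true when the location had a '>' before the end

    // Simple range only: optional '<', start, "..", optional '>', end.
    private static final Pattern RANGE =
            Pattern.compile("(<?)(\\d+)\\.\\.(>?)(\\d+)");

    void addQualifier(String key, String value) {
        qualifiers.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    void setLocation(String text) {
        Matcher m = RANGE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported location: " + text);
        }
        partialStart = !m.group(1).isEmpty();
        start = Integer.parseInt(m.group(2));
        partialEnd = !m.group(3).isEmpty();
        end = Integer.parseInt(m.group(4));
    }
}
```

With this, two /gene qualifiers on one feature are both retained, and a
location such as <1..>200 keeps its partial-end flags so that it can be
written back out unchanged.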

On a related note, if nobody is going to implement writeSequence for
EMBL, then I'll offer to do it.

cheers,

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA