[Biojava-l] Emblparsing

Thomas Down td2@sanger.ac.uk
Tue, 5 Dec 2000 11:04:06 +0000


On Tue, Dec 05, 2000 at 11:54:27AM +0100, Kristina Engdahl wrote:
> Hello everyone!
> I have recently started to work with BioJava with mostly satisfying
> results.
> At the moment I'm trying to parse an Embl flatfile to get all the
> features. It works fine and I get features such as repeat regions and
> misc_features etc. However, I would like to be able to retrieve the
> individual "exons" that are specified in the CDS feature. Like this:
> 
> FT   CDS             join(<5642..5793,10804..10976,12496..12656,14136..14266,
> FT                   14403..14532,16852..16987,17821..17959,18068..18122,
> FT                   19456..19570,23623..23753,25885..26053,29102..29240,
> FT                   32621..32738,33595..33771)

Hi...

Our current parsing behaviour (with either the old EmblParser or the
new EmblLikeParser-EmblProcessor combination) builds that
feature table entry into a single BioJava feature.  All the information
in the location part of the entry is preserved in the BioJava
Location, so you can retrieve the individual exons using the location's
blockIterator() method.  I hope this does what you want.
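
To make the idea of "blocks" concrete, here is a minimal standalone
sketch in plain Java.  It does not use BioJava at all: the ExonBlocks
class, its Block type and the parseBlocks() helper are hypothetical
names made up for this illustration.  It only shows which spans you
would expect to see, one per exon, when walking the blocks of a
location like the one in your CDS entry.  It assumes a simple join()
of plain min..max spans and ignores complement() and remote references.

```java
import java.util.ArrayList;
import java.util.List;

public class ExonBlocks {

    /** Hypothetical stand-in for one contiguous block of a location (inclusive coordinates). */
    static final class Block {
        final int min;
        final int max;

        Block(int min, int max) {
            this.min = min;
            this.max = max;
        }

        @Override
        public String toString() {
            return min + ".." + max;
        }
    }

    /**
     * Splits an EMBL-style location such as "join(<1..5,10..20)" into its
     * contiguous spans. Assumption: only plain spans inside an optional join().
     */
    static List<Block> parseBlocks(String location) {
        // Drop whitespace left over from line wrapping in the feature table.
        String s = location.replaceAll("\\s+", "");
        if (s.startsWith("join(") && s.endsWith(")")) {
            s = s.substring("join(".length(), s.length() - 1);
        }
        List<Block> blocks = new ArrayList<>();
        for (String span : s.split(",")) {
            // '<' and '>' mark partial ends; this sketch simply discards them.
            String[] ends = span.replaceAll("[<>]", "").split("\\.\\.");
            int min = Integer.parseInt(ends[0]);
            // A single position such as "42" is treated as a one-base block.
            int max = ends.length > 1 ? Integer.parseInt(ends[1]) : min;
            blocks.add(new Block(min, max));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String cds = "join(<5642..5793,10804..10976,12496..12656,14136..14266,"
                + "14403..14532,16852..16987,17821..17959,18068..18122,"
                + "19456..19570,23623..23753,25885..26053,29102..29240,"
                + "32621..32738,33595..33771)";
        int n = 1;
        for (Block b : parseBlocks(cds)) {
            System.out.println("exon " + (n++) + ": " + b);
        }
    }
}
```

With BioJava itself you should not need any parsing like this: the
parser has already built the Location, and iterating over
blockIterator() on the CDS feature's location should hand you one
contiguous location per exon.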

In the longer term (or as soon as anyone feels like coding it
up...):

In the current CVS development tree, we now have a larger set
of feature interfaces, including special interfaces for representing
genes, transcripts, exons, etc.  It would be good if, in future,
we could have a more sophisticated EmblProcessor which recognises
genes in EMBL feature tables and builds more appropriate feature
objects.  Since the newio changes landed a couple of weeks back,
all the infrastructure is there to allow something like this to
be plugged in, but we don't have any code yet.

Good luck,

   Thomas.
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett