[Biojava-l] EMBL/Genbank parse/write improvements

Keith James kdj@sanger.ac.uk
20 Mar 2001 10:27:03 +0000


Hi all,

I've made a few home-improvements in FeatureTableParser and
SeqFormatTools (which contains the static method for stringifying
EMBL/Genbank Locations).

Locations of the form

(123.456)..789
123..(456.789)
(123.345)..(456.678)

(plus combinations like <123..(456.789), (123.456)..>789)

should now be supported by both readSequence and writeSequence.
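
As a rough illustration of what the round trip involves for these forms,
here is a minimal, self-contained sketch. It is not the code in
FeatureTableParser or SeqFormatTools: the class FuzzyRangeSketch, its
field names and the regular expression are all made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not a BioJava class.
final class FuzzyRangeSketch {
    // One coordinate: either plain "123" or a fuzzy pair "(123.456)".
    private static final String COORD = "(?:(\\d+)|\\((\\d+)\\.(\\d+)\\))";
    // Optional '<' before the start, optional '>' before the end.
    private static final Pattern RANGE =
        Pattern.compile("(<?)" + COORD + "\\.\\.(>?)" + COORD);

    final int outerMin, innerMin, innerMax, outerMax;
    final boolean unboundedMin, unboundedMax;

    private FuzzyRangeSketch(int outerMin, int innerMin, int innerMax,
                             int outerMax, boolean unboundedMin,
                             boolean unboundedMax) {
        this.outerMin = outerMin;
        this.innerMin = innerMin;
        this.innerMax = innerMax;
        this.outerMax = outerMax;
        this.unboundedMin = unboundedMin;
        this.unboundedMax = unboundedMax;
    }

    // Parses a range location; anything else is rejected.
    static FuzzyRangeSketch parse(String text) {
        Matcher m = RANGE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported location: " + text);
        }
        boolean plainStart = m.group(2) != null;
        boolean plainEnd = m.group(6) != null;
        // A plain coordinate is stored as a fuzzy pair with equal bounds.
        int outerMin = Integer.parseInt(plainStart ? m.group(2) : m.group(3));
        int innerMin = Integer.parseInt(plainStart ? m.group(2) : m.group(4));
        int innerMax = Integer.parseInt(plainEnd ? m.group(6) : m.group(7));
        int outerMax = Integer.parseInt(plainEnd ? m.group(6) : m.group(8));
        return new FuzzyRangeSketch(outerMin, innerMin, innerMax, outerMax,
                !m.group(1).isEmpty(), !m.group(5).isEmpty());
    }

    // Writes the location back out in the same textual form.
    String format() {
        return (unboundedMin ? "<" : "") + coord(outerMin, innerMin)
                + ".." + (unboundedMax ? ">" : "") + coord(innerMax, outerMax);
    }

    private static String coord(int lo, int hi) {
        return lo == hi ? Integer.toString(lo) : "(" + lo + "." + hi + ")";
    }

    public static void main(String[] args) {
        String[] examples = {
            "(123.456)..789", "123..(456.789)", "(123.345)..(456.678)",
            "<123..(456.789)", "(123.456)..>789"
        };
        for (String s : examples) {
            System.out.println(s + " -> " + parse(s).format());
        }
    }
}
```

Storing a plain coordinate as a fuzzy pair with equal bounds keeps the
sketch down to one class; whether the real model should do the same is a
separate question.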

Unsupported are fuzzy points of the form (123.456), 'between residue'
locations like 123^456, remote locations like AL123456:(123...456) and
unbounded ranges which only have a single point within the entry
e.g. <123, >123 or <>123.

When these are encountered an Exception is thrown, then caught by some
code that Greg (I think) put in, resulting in a message to
System.err, rather than instant flaming death. However, I think the
Exceptions and sensible (documented) Feature repair/recovery options
need some work.
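
To make the current behaviour concrete, here is a small sketch of the
"throw, catch, report to System.err, carry on" pattern applied to the
unsupported forms listed above. It is an illustration under assumptions,
not the actual parser code: UnsupportedLocationSketch, its exception
class and the regular expressions are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not a BioJava class.
final class UnsupportedLocationSketch {
    // Hypothetical exception type for locations the parser cannot handle.
    static final class UnsupportedLocationException extends Exception {
        UnsupportedLocationException(String message) {
            super(message);
        }
    }

    // Description of each unsupported form, with a pattern that detects it.
    private static final Map<String, Pattern> UNSUPPORTED = new LinkedHashMap<>();
    static {
        UNSUPPORTED.put("fuzzy point",
                Pattern.compile("\\(\\d+\\.\\d+\\)"));
        UNSUPPORTED.put("between-residue location",
                Pattern.compile("\\d+\\^\\d+"));
        UNSUPPORTED.put("remote location",
                Pattern.compile("[A-Za-z]\\w*(?:\\.\\d+)?:.+"));
        UNSUPPORTED.put("single-point unbounded range",
                Pattern.compile("(?:<>?|>)\\d+"));
    }

    // Throws if the whole location string is one of the unsupported forms.
    static void check(String location) throws UnsupportedLocationException {
        for (Map.Entry<String, Pattern> e : UNSUPPORTED.entrySet()) {
            if (e.getValue().matcher(location).matches()) {
                throw new UnsupportedLocationException(
                        e.getKey() + ": " + location);
            }
        }
    }

    // Keeps the locations that pass the check; reports the rest on
    // System.err and carries on instead of aborting the whole entry.
    static List<String> keepSupported(List<String> locations) {
        List<String> kept = new ArrayList<>();
        for (String location : locations) {
            try {
                check(location);
                kept.add(location);
            } catch (UnsupportedLocationException ex) {
                System.err.println("Skipping feature, " + ex.getMessage());
            }
        }
        return kept;
    }
}
```

A documented recovery option could replace the System.err call with
something the caller controls, e.g. a callback or a list of skipped
features returned alongside the parsed ones.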

Locations like <123 are a bit odd, because they are really ranges, but
exist as points in the entry. So are they best represented as
FuzzyRange, or FuzzyPoint (along with (123.456))?

There is still a deficiency in the parser, as it makes no attempt to
interpret feature types e.g. CDS, gene etc. Therefore a gene still
ends up having its exons represented by a CompoundLocation on one
strand, rather than a set of sub-features, each with their own strand
information.

In the short term I'm going to add some code to store feature
information not fully preserved by the parser in the feature's
annotation bundle. It should therefore be possible to post-process
Feature(s) with a type-specific heuristic (one each for CDS, gene,
repeat, exon), rather than burden the parser with decision-making code.
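
A possible shape for that post-processing step is sketched below. It is
only a guess at how it might look: the Feature class here is a stand-in
rather than the BioJava interface, and the "raw_location" annotation key,
the Heuristic interface and the splitJoin rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not BioJava classes.
final class PostProcessSketch {
    // Assumed annotation key under which the parser keeps the raw location.
    static final String RAW_LOCATION = "raw_location";

    // Stand-in for a parsed feature with an annotation bundle.
    static final class Feature {
        final String type;
        final Map<String, String> annotation = new HashMap<>();
        final List<Feature> children = new ArrayList<>();

        Feature(String type, String rawLocation) {
            this.type = type;
            annotation.put(RAW_LOCATION, rawLocation);
        }
    }

    // A type-specific rule applied after parsing.
    interface Heuristic {
        void apply(Feature feature);
    }

    // Example rule: turn each member of join(...) into an exon
    // sub-feature that keeps its own location text, strand included.
    static void splitJoin(Feature feature) {
        String raw = feature.annotation.get(RAW_LOCATION);
        if (raw == null || !raw.startsWith("join(") || !raw.endsWith(")")) {
            return;
        }
        String inner = raw.substring("join(".length(), raw.length() - 1);
        for (String part : inner.split(",")) {
            feature.children.add(new Feature("exon", part.trim()));
        }
    }

    // Heuristics are looked up by feature type, so the parser itself
    // needs no decision-making code.
    static final Map<String, Heuristic> HEURISTICS = new HashMap<>();
    static {
        HEURISTICS.put("CDS", PostProcessSketch::splitJoin);
    }

    static void postProcess(List<Feature> features) {
        for (Feature feature : features) {
            Heuristic heuristic = HEURISTICS.get(feature.type);
            if (heuristic != null) {
                heuristic.apply(feature);
            }
        }
    }

    public static void main(String[] args) {
        List<Feature> features = new ArrayList<>();
        features.add(new Feature("CDS", "join(complement(1..10),20..30)"));
        postProcess(features);
        for (Feature child : features.get(0).children) {
            System.out.println(child.type + " "
                    + child.annotation.get(RAW_LOCATION));
        }
    }
}
```

Splitting on commas is only safe here because the sketch assumes join
members are single ranges, optionally wrapped in complement(...).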

At the moment this stuff may well break with some of the more scary
EMBL entries.

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA