[Biojava-l] EMBL/Genbank parse/write improvements

Matthew Pocock mrp@sanger.ac.uk
Tue, 20 Mar 2001 12:25:02 +0000


Keith James wrote:

> Hi all,
> 
> I've made a few home-improvements in FeatureTableParser and
> SeqFormatTools (which contains the static method for stringifying
> EMBL/Genbank Locations).
> 
> Locations of the form
> 
> (123.456)..789
> 123..(456.789)
> (123.345)..(456.678)
> 
> (plus combinations like <123..(456.789), (123.456)..>789)
> 
> should now be supported by both readSequence and writeSequence.
> 

Keith, this is sooo cool. Thanks.
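
For anyone following along, here is a minimal stand-alone sketch of the
range forms listed above. It is not the code in FeatureTableParser or
SeqFormatTools; the class name FuzzyLocationSketch, its fields and its
methods are all made up for illustration, and the only assumption about
the syntax is what appears in the examples in this message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not BioJava's parser or Location classes. */
final class FuzzyLocationSketch {
    // One end of a range: an optional '<' or '>', then "123" or "(123.456)".
    private static final String END = "([<>]?)(?:(\\d+)|\\((\\d+)\\.(\\d+)\\))";
    // Capture groups 1-4 describe the start, groups 5-8 describe the end.
    private static final Pattern RANGE = Pattern.compile(END + "\\.\\." + END);

    final int outerMin, innerMin, innerMax, outerMax;
    final boolean unboundedMin, unboundedMax;

    private FuzzyLocationSketch(int outerMin, int innerMin, int innerMax, int outerMax,
                                boolean unboundedMin, boolean unboundedMax) {
        this.outerMin = outerMin;
        this.innerMin = innerMin;
        this.innerMax = innerMax;
        this.outerMax = outerMax;
        this.unboundedMin = unboundedMin;
        this.unboundedMax = unboundedMax;
    }

    /** Parses a two-ended range; anything else is rejected with an exception. */
    static FuzzyLocationSketch parse(String text) {
        Matcher m = RANGE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported location: " + text);
        }
        boolean fuzzyStart = m.group(3) != null;
        int outerMin = Integer.parseInt(fuzzyStart ? m.group(3) : m.group(2));
        int innerMin = Integer.parseInt(fuzzyStart ? m.group(4) : m.group(2));
        boolean fuzzyEnd = m.group(7) != null;
        int innerMax = Integer.parseInt(fuzzyEnd ? m.group(7) : m.group(6));
        int outerMax = Integer.parseInt(fuzzyEnd ? m.group(8) : m.group(6));
        return new FuzzyLocationSketch(outerMin, innerMin, innerMax, outerMax,
                "<".equals(m.group(1)), ">".equals(m.group(5)));
    }

    /** Writes the location back out in the same textual form. */
    String toEmbl() {
        return (unboundedMin ? "<" : "") + end(outerMin, innerMin)
                + ".." + (unboundedMax ? ">" : "") + end(innerMax, outerMax);
    }

    // An end whose two bounds coincide is an exact position, so no parentheses.
    private static String end(int low, int high) {
        return low == high ? Integer.toString(low) : "(" + low + "." + high + ")";
    }
}
```

Parsing "(123.345)..(456.678)" and calling toEmbl() on the result gives
the original string back. The sketch covers two-ended ranges only; any
other form is rejected with an IllegalArgumentException.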

> Unsupported are fuzzy points of the form (123.456), 'between residue'
> locations like 123^456, remote locations like AL123456:(123...456) and
> unbounded ranges which only have a single point within the entry
> e.g. <123, >123 or <>123.
> 

Keith, do you want to write FuzzyPoint, or shall I?

> When these are encountered an Exception is thrown, then caught by some
> code that Greg (I think) put in, resulting in a message to
> System.err, rather than instant flaming death. However, I think the
> Exceptions and sensible (documented) Feature repair/recovery options
> need some work.
> 

Yes. This *should* go away once FuzzyPoint is in.

> Locations like <123 are a bit odd, because they are really ranges, but
> exist as points in the entry. So are they best represented as
> FuzzyRange, or FuzzyPoint (along with (123.456))?
> 
> There is still a deficiency in the parser, as it makes no attempt
> to interpret feature types e.g. CDS, gene etc. Therefore a gene still
> ends up having its exons represented by a CompoundLocation on one
> strand, rather than a set of sub-features, each with their own strand
> information.
> 

So - my take on this is that we slot an extra 'feature interpretation' 
layer into the listener pipeline that builds objects from our genomics 
package. You can then choose to get out CDSs as compound locations using 
the raw pipeline, or the full genomics-compliant model using the modified 
one.
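
To make the idea concrete, here is a rough sketch of such a layer as a
wrapper around a listener. Everything in it is hypothetical: the
FeatureListener interface, the Recorder and the CdsInterpreter are
invented for this example and are not the BioJava listener API or the
genomics package. Using the outer span for the parent CDS is likewise
just one possible choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; none of these types exist in BioJava. */
final class InterpretationLayerSketch {
    /** Stand-in for a parser event listener: one call per feature parsed. */
    interface FeatureListener {
        // Each block is {min, max}; strand is +1 or -1.
        void feature(String type, int strand, List<int[]> blocks);
    }

    /** End of the pipeline: records a printable form of each feature it is given. */
    static final class Recorder implements FeatureListener {
        final List<String> seen = new ArrayList<>();

        @Override
        public void feature(String type, int strand, List<int[]> blocks) {
            StringBuilder sb = new StringBuilder(type).append(strand < 0 ? "(-)" : "(+)");
            for (int[] b : blocks) {
                sb.append(' ').append(b[0]).append("..").append(b[1]);
            }
            seen.add(sb.toString());
        }
    }

    /** Optional layer: turns a multi-block CDS into a parent plus exon sub-features. */
    static final class CdsInterpreter implements FeatureListener {
        private final FeatureListener delegate;

        CdsInterpreter(FeatureListener delegate) {
            this.delegate = delegate;
        }

        @Override
        public void feature(String type, int strand, List<int[]> blocks) {
            if (!"CDS".equals(type) || blocks.size() < 2) {
                delegate.feature(type, strand, blocks); // pass through untouched
                return;
            }
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int[] b : blocks) {
                min = Math.min(min, b[0]);
                max = Math.max(max, b[1]);
            }
            // Parent feature spanning the whole CDS, then one exon per block.
            delegate.feature("CDS", strand, Collections.singletonList(new int[] {min, max}));
            for (int[] b : blocks) {
                delegate.feature("exon", strand, Collections.singletonList(b));
            }
        }
    }
}
```

Feeding the parser's events straight to a Recorder gives the raw
compound-location view; wrapping it as new CdsInterpreter(recorder)
gives the interpreted view, and the parser itself needs no changes.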

> In the short term I'm going to add some code to store feature
> information not fully preserved by the parser in the feature's
> annotation bundle. It should therefore be possible to post-process
> Feature(s) with a type (CDS, gene, repeat, exon) specific heuristic,
> rather than burden the parser with decision-making code.
> 
> At the moment this stuff may well break with some of the more scary
> EMBL entries.
> 
> Keith

Have you run the parser over a complete EMBL database file yet? This is 
the acid test.


Again, thanks for all this.

Matthew