[Biojava-l] Genbank parsing problem

Thomas Down td2@sanger.ac.uk
Tue, 30 Apr 2002 16:26:09 +0100


On Tue, Apr 30, 2002 at 09:12:59AM -0400, Simon Foote wrote:
> I've recently run across a problem with parsing of Genbank files 
> containing unbounded locations.
> Does anyone have any idea what's causing it?  I tried to trace it back 
> but got lost.  I think it has to do with the single <1 for 
> the -35_signal as shown in the example.
>
>      -35_signal      <1
>                       /gene="entD"

The default Feature implementations in the BioJava development
tree explicitly forbid construction of Features with locations
which aren't contained by the sequence to which they're attached.
As a quick fix, you can just remove the check from the
constructor of org.biojava.bio.seq.impl.SimpleFeature (lines
281--283 in my copy).
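
To make the check concrete, here is a minimal sketch of the kind of
containment test I mean.  This is NOT the SimpleFeature source: the
class BoundedFeature, its fields and canConstruct() are hypothetical
names made up for illustration, and I'm assuming 1-based inclusive
coordinates with `<1' modelled as a range whose min falls below 1.

```java
// Hypothetical stand-in for a feature implementation; not BioJava code.
final class BoundedFeature {
    final String type;
    final int min; // 1-based, inclusive (assumption)
    final int max; // 1-based, inclusive (assumption)

    BoundedFeature(String type, int min, int max, int seqLength) {
        // Refuse any location that is not fully contained by the sequence.
        if (min < 1 || max > seqLength) {
            throw new IllegalArgumentException(
                "Location " + min + ".." + max
                + " is not contained by a sequence of length " + seqLength);
        }
        this.type = type;
        this.min = min;
        this.max = max;
    }

    // Convenience helper: reports whether construction would succeed.
    static boolean canConstruct(int min, int max, int seqLength) {
        try {
            new BoundedFeature("probe", min, max, seqLength);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

Under those assumptions, a location lying entirely before position 1
fails the test, which is the sort of rejection you're seeing.
Removing the `if' block is the moral equivalent of the quick fix.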

I'm not sure what the proper solution for this problem is.  Normally,
features which extend beyond the sequence can be transformed into
RemoteFeatures.  However, this particular feature is nasty in that
it doesn't even partially overlap the sequence.  To my mind, it's
actually pretty much meaningless, and the best thing to do would be
to drop it.  But some people like to be able to represent the whole
of Genbank.
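
The three cases above can be sketched as a simple classifier.  Again,
this is only an illustration under assumptions: LocationClassifier,
Placement and classify() are hypothetical names, not part of BioJava,
and coordinates are taken to be 1-based and inclusive.

```java
// Hypothetical helper for illustration; not part of BioJava.
final class LocationClassifier {
    enum Placement {
        CONTAINED,     // lies entirely within the sequence
        PARTIAL,       // overlaps the sequence but extends beyond it
        WHOLLY_REMOTE  // shares no position with the sequence
    }

    // min and max are 1-based inclusive; seqLength is the sequence length.
    static Placement classify(int min, int max, int seqLength) {
        if (min >= 1 && max <= seqLength) {
            return Placement.CONTAINED;
        }
        if (max < 1 || min > seqLength) {
            return Placement.WHOLLY_REMOTE;
        }
        return Placement.PARTIAL;
    }
}
```

A parser could keep CONTAINED features as they are, consider PARTIAL
ones as candidates for RemoteFeatures, and either drop or specially
flag WHOLLY_REMOTE ones, depending on how the question below gets
answered.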

Does anyone know how many more 'wholly remote' features there are
in the databases?  And does anyone have any great ideas about how
they could be usefully represented?

   Thomas.