[Biojava-l] Remote Locations

Thomas Down td2@sanger.ac.uk
Tue, 6 Feb 2001 11:52:42 +0000


On Mon, Feb 05, 2001 at 02:36:58PM -0500, Cox, Greg wrote:
> I plugged some new data into the genbank and embl parsers, and there's a
> slight problem.  A location like "join(L41624.1:2858..5660,1..419)" is valid
> and refers to a different sequence, L41624.  I've coded up a new location
> type, RemoteLocation to handle this case, but I want some feedback before
> committing it.  

This is a really horrible issue...  It comes from the fact that
99% of the time we want to deal with EMBL/GENBANK/whatever
as simple files, but in reality you need to look at the database
as a whole.

> 	 I've attached my code, but the big problem I see is that
> RemoteLocation implements Location, and contains a Location.  I've dealt
> with this recursive inheritance before and not enjoyed the experience.  The
> other option, inheriting from a concrete location, begs the question of
> which one.  

I'm afraid I'm with Matthew on this one.  BioJava Locations represent
sets of points within some coordinate system.  EMBL locations,
which can include joins between two separate coordinate systems,
are a much more complicated case -- Features feel like a far more
appropriate place to keep this semantically rich information.

The 'nice' way to handle this case is to assemble all the sequences
involved into a single coordinate system, and build features there.
As an example, I've been working on a BioJava bridge for the Ensembl
database.  In their gene model, exons are always stored in the
coordinate system of the working-draft raw contigs.  You then get
transcripts which are simply sets of these exons.  In BioJava,
we try to create Transcript features on the raw contigs whenever
possible, but if a transcript spans two or more of these contigs
we create a feature on the assembled sequence instead.  It's been
a bit awkward to code efficiently, but does work very cleanly and
seems to be behaving itself in practice.
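
To make the rule above concrete, here is a rough sketch in plain Java.
It is not the actual Ensembl bridge code: TranscriptPlacementSketch,
Exon, chooseParent and toAssembly are hypothetical names, and it
assumes every contig sits on the forward strand of the assembly at a
known 1-based start position.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names are BioJava or Ensembl API.
class TranscriptPlacementSketch {
    static final String ASSEMBLY = "assembly";

    /** An exon stored in the coordinate system of one raw contig. */
    static final class Exon {
        final String contig;
        final int start, end;
        Exon(String contig, int start, int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the contig shared by all exons, or ASSEMBLY if they span several. */
    static String chooseParent(List<Exon> exons) {
        if (exons.isEmpty()) {
            throw new IllegalArgumentException("A transcript needs at least one exon");
        }
        Set<String> contigs = new LinkedHashSet<>();
        for (Exon e : exons) {
            contigs.add(e.contig);
        }
        return contigs.size() == 1 ? contigs.iterator().next() : ASSEMBLY;
    }

    /**
     * Maps an exon into assembly coordinates. contigStart gives the 1-based
     * assembly position of each contig's first base (assumed forward strand).
     */
    static int[] toAssembly(Exon e, Map<String, Integer> contigStart) {
        int offset = contigStart.get(e.contig) - 1;
        return new int[] { e.start + offset, e.end + offset };
    }
}
```

The awkward part in practice is doing this lazily and efficiently,
which the sketch does not attempt.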

Below I've suggested a possible roadmap for dealing with this issue.
How does this fit with your requirements?


For 1.1:

  - Add a boolean property on EmblProcessor (and GenbankProcessor) which
    defines the behaviour on seeing a remote location.  The
    options are:

          + Throw an exception, as we do at the moment (but
            hopefully with a rather clearer message).

          + Parse the location entry, including remote parts.
            Construct a BioJava location covering all the local
            parts, then add an Annotation bundle property to the
            feature giving the full EMBL (Genbank) location.
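
As a rough illustration of the two options, here is a sketch in plain
Java.  It deliberately avoids the real BioJava classes:
RemoteLocationSketch, Span, Parsed and the allowRemote flag are
hypothetical names, and a real implementation would build a BioJava
Location and set a property on the feature's Annotation instead.  It
only copes with a single range or a flat join(...), not complement(),
nested operators or fuzzy bounds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are BioJava API.
class RemoteLocationSketch {
    /** An inclusive range in the local entry's coordinate system. */
    static final class Span {
        final int min, max;
        Span(int min, int max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Local spans, remote parts, and the untouched original location string. */
    static final class Parsed {
        final List<Span> local = new ArrayList<>();
        final List<String> remote = new ArrayList<>();
        final String fullLocation; // would go into the annotation bundle
        Parsed(String fullLocation) {
            this.fullLocation = fullLocation;
        }
    }

    /**
     * Parses a location such as "join(L41624.1:2858..5660,1..419)".
     * With allowRemote == false a remote part causes an exception;
     * otherwise remote parts are recorded but left out of the local spans.
     */
    static Parsed parse(String loc, boolean allowRemote) {
        Parsed result = new Parsed(loc);
        String body = loc.trim();
        if (body.startsWith("join(") && body.endsWith(")")) {
            body = body.substring(5, body.length() - 1);
        }
        for (String part : body.split(",")) {
            part = part.trim();
            if (part.indexOf(':') >= 0) {
                // "ACCESSION.VERSION:range" refers to another entry.
                if (!allowRemote) {
                    throw new IllegalArgumentException(
                        "Location refers to another entry: " + part);
                }
                result.remote.add(part);
            } else {
                // "min..max", or a single position such as "42".
                String[] ends = part.split("\\.\\.");
                int min = Integer.parseInt(ends[0]);
                int max = Integer.parseInt(ends[ends.length - 1]);
                result.local.add(new Span(min, max));
            }
        }
        return result;
    }
}
```

For Greg's example, parse(..., true) would keep the local span 1..419,
note L41624.1:2858..5660 as remote, and retain the full string for the
annotation; parse(..., false) would throw.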

For early 1.2 development:

  - Write special SequenceDB implementations for EMBL and GENBANK,
    which offer all the single-entry sequences, but which can also
    construct assemblies when we need to represent remote locations.

    This should also make these databases really usable resources
    in BioJava.  There should be a simple interface (system properties?)
    for defining where the data comes from -- we should be able to
    support local files, web interfaces (SRS?), the EBI CORBA service,
    and probably some others.

    It should be possible to hide a lot of this behind naming
    and directory services.


How does this sound?

   Thomas.