[Bioperl-l] storing and retrieving partial sequences

Robson Francisco de Souza rfsouza@citri.iq.usp.br
Tue, 4 Dec 2001 12:23:39 -0200 (BRST)


	Hello,

	I believe this message is a bit off topic, but maybe some ideas
about my problem could help in developing bioperl-db.
	I have a problem that is similar to the one described by Jonathan,
though I'm not using bioperl-db. I built a PostgreSQL database to hold
annotation features from a complete bacterial genome sequence. Each
feature has its own table, and in every table there is a pair of
coordinates describing where the feature is (start, end). My problem is to
find overlapping features, like all clones covering a certain gene. But,
at least in PostgreSQL, when I run a search, all coordinate comparisons
(which are done using >=, <=, > and <) must do a sequential scan on the
table of overlapping features. That is slow, and may become too slow!
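	To make the comparison concrete: two features overlap when each one starts no later than the other ends. The sketch below is only an illustration in Python, not part of the PostgreSQL schema described above; the FeatureIndex class, its method names and the sample coordinates are all made up for the example. It assumes the features are kept sorted by start and that the length of the longest feature is known, which lets a binary search narrow the candidates instead of checking every row.

```python
from bisect import bisect_left, bisect_right


def overlaps(a_start, a_end, b_start, b_end):
    """True if closed ranges [a_start, a_end] and [b_start, b_end] share a position."""
    return a_start <= b_end and b_start <= a_end


class FeatureIndex:
    """Hypothetical in-memory index over (start, end, name) features."""

    def __init__(self, features):
        # Sort once by start coordinate so binary search can be used.
        self.features = sorted(features)
        self.starts = [f[0] for f in self.features]
        # Longest feature; bounds how far left an overlapping start can lie.
        self.max_len = max(
            (end - start + 1 for start, end, _ in self.features), default=0
        )

    def overlapping(self, q_start, q_end):
        # An overlapping feature starts at or before q_end and, because no
        # feature is longer than max_len, at or after q_start - max_len + 1.
        lo = bisect_left(self.starts, q_start - self.max_len + 1)
        hi = bisect_right(self.starts, q_end)
        # Only the narrowed window is checked with the full overlap test.
        return [
            f for f in self.features[lo:hi]
            if overlaps(f[0], f[1], q_start, q_end)
        ]


# Made-up coordinates: which clones cover any part of a gene at 100..900?
index = FeatureIndex([
    (50, 150, "clone1"),
    (800, 1200, "clone2"),
    (2000, 2500, "clone3"),
])
covering = index.overlapping(100, 900)
```

	Whether a database planner can exploit the same bound depends on the engine and its index types, so this is a thought experiment rather than a recommended design.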
	Now, do you guys think these searches could be performed faster with
a different design, or is this a problem that will probably affect any
design (including bioperl-db: note Heikki's proposal for bioperl-db)?
	Sorry if that is too database specific and a bit off topic.
	Cheers,
			Robson


On Tue, 4 Dec 2001, Heikki Lehvaslaiho wrote:

> Jason Eric Stajich wrote:
> > 
> > There is the capability for getting a subseq in the bioperl-db
> > implementation (bioperl layer on top of mysql).  We don't currently cache
> > anything though so each subseq requires a new db call.  However, there
> > should be capability there to build your own Bio::Seq::CachingSeq which
> > intercepts calls if need be.
> > 
> > Not sure I totally understand the scenario so not sure if this helps.
> > 
> > -jason
> 
> My guess is that what Jonathan is after is a way to store a sequence from a genomic
> build. bioperl-db could be used to cache the retrieved sequences.
>  
> The bioperl-db schema would need an additional table holding the ID of the
> subsequence, the ID of the main sequence (== biosequence.biosequence_id) and
> the range covered within the main sequence. When the next sequence query
> comes in, Bio::Seq::CachingSeq could use this biosubsequence table to find
> out if some of it is already in, retrieve it (from several separate
> sequences if needed) and calculate how much more is needed. The rest of the
> sequence could then be retrieved from the slow main database.
> 
> e.g.:
> 
> CREATE TABLE biosubsequence (
>   biosubsequence_id	int(10) unsigned NOT NULL
> 			PRIMARY KEY auto_increment,
>   biosequence_id	int(10) NOT NULL,	-- ID of the main sequence
>   sub_start		int(10) NOT NULL,	-- range start within the main sequence
>   sub_end		int(10) NOT NULL,	-- range end within the main sequence
>   KEY(biosequence_id),
>   KEY(sub_start),
>   KEY(sub_end)
> );
> 
> 
> If the cache database has a long lifetime it needs a method to remove
> redundant sequences (not in yet). Alternatively, one could just drop the whole
> database and start anew, but that code needs writing, too.
> 
> 	-Heikki
> 
> 
> > On Mon, 3 Dec 2001, Jonathan Epstein wrote:
> > 
> > > Hi,
> > >
> > > Does anyone have a good BioPerl or ACEDB way to handle storing and
> > > retrieval of partial sequences?
> > >
> > > The idea is that, say, I might have bp 50001-100000 of a particular
> > > sequence which is 500kb long.  I want to cache this local result,
> > > since obtaining the other sequence data may be computationally very
> > > complex and may even require manual intervention.  So, if subsequently
> > > there is a query for bp 56000-60000 I want to retrieve the data
> > > immediately from the local cache.  If there is a query for bp
> > > 40000-60000 I want to retrieve the cached portion of the data, and
> > > set in motion whatever is needed to obtain the missing data.
> > >
> > > For now we are starting a home-grown MySQL solution, but I really
> > > prefer to use a solution which is BioPerl-based or at least
> > > BioPerl-like.
> > >
> > > Can anyone suggest how we might hook into Bio::DB or Bio::Seq or ... ?
> > >
> > > Thanks,
> > >
> > > - Jonathan
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> > >
> > 
> > --
> > Jason Stajich
> > Duke University
> > jason@cgt.mc.duke.edu
> > 
> 
> -- 
> ______ _/      _/_____________________________________________________
>       _/      _/                      http://www.ebi.ac.uk/mutations/
>      _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
>     _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
>    _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
>   _/  _/  _/  Cambs. CB10 1SD, United Kingdom
>      _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
> ___ _/_/_/_/_/________________________________________________________
>
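
The cache lookup Heikki outlines, applied to Jonathan's example ranges, can be sketched roughly as follows. This is a minimal Python illustration, not bioperl-db or Bio::Seq::CachingSeq code: split_request is a hypothetical helper, and it assumes the cached (sub_start, sub_end) pairs for one biosequence_id have already been fetched from a table like biosubsequence, using closed, 1-based coordinates.

```python
def split_request(cached, q_start, q_end):
    """Split the closed range [q_start, q_end] into cached hits and gaps.

    cached: iterable of (sub_start, sub_end) ranges already stored locally
            (assumed to belong to the same main sequence).
    Returns (hits, gaps), each a list of (start, end) tuples.
    """
    hits, gaps = [], []
    pos = q_start  # first requested position not yet accounted for
    for sub_start, sub_end in sorted(cached):
        if sub_end < pos or sub_start > q_end:
            continue  # fragment lies outside what is still needed
        if sub_start > pos:
            # Nothing cached between pos and this fragment: a gap to fetch.
            gaps.append((pos, sub_start - 1))
        hit_start = max(sub_start, pos)
        hit_end = min(sub_end, q_end)
        hits.append((hit_start, hit_end))
        pos = hit_end + 1
        if pos > q_end:
            break  # request fully covered
    if pos <= q_end:
        gaps.append((pos, q_end))  # tail that no cached fragment reaches
    return hits, gaps


# Jonathan's scenario: bp 50001-100000 of a 500kb sequence is cached.
cache = [(50001, 100000)]

# Fully cached request: served locally, nothing to fetch.
hits, gaps = split_request(cache, 56000, 60000)
# hits == [(56000, 60000)], gaps == []

# Partially cached request: the missing part goes to the slow main database.
hits, gaps = split_request(cache, 40000, 60000)
# hits == [(50001, 60000)], gaps == [(40000, 50000)]
```

The gaps list is what would be requested from the slow source; once retrieved, each gap could be stored as a new row so later queries find it in the cache.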