[Bioperl-l] storing and retrieving partial sequences

Lincoln Stein lstein@cshl.org
Thu, 6 Dec 2001 17:40:46 -0500


Hi All,

The Bio::DB::GFF::Adaptor::mysqlopt module in bioperl-live uses a very simple 
optimization that is very effective at speeding up the following three types 
of query:

	Given a search range:

	1) find ranges that overlap with it
	2) find ranges that are contained within it
	3) find ranges that contain it

It uses a binning scheme in which there are several bin tiers of different 
orders of magnitude (the smallest tier consists of 1K bins and the largest is 
large enough to hold the largest range in the database; computed dynamically 
during database load).  The bins are then turned into floating point indexes. 
 Range queries do a series of range checks on the bins in order to reduce the 
search space.  Here's an overlap query for all features between positions 
500,000 and 520,000 on chromosome "I":

 SELECT fref,fstart,fstop
 FROM fdata
 WHERE fdata.fref='I'
        AND (fbin between 100000000.000000 and 100000000.000000
         OR fbin between 10000000.000000 and 10000000.000000
         OR fbin between 1000000.000000 and 1000000.000000
         OR fbin between 100000.000004 and 100000.000005
         OR fbin between 10000.000049 and 10000.000052
         OR fbin between 1000.000499 and 1000.000520)
        AND fdata.fstop>=500000 AND fdata.fstart<=520000
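The bin arithmetic that generates those BETWEEN clauses can be sketched as
follows.  This is a reconstruction from the query shown above, not the actual
Bio::DB::GFF code, and it assumes 1-based coordinates:

```python
# Sketch of the bin-index arithmetic behind the BETWEEN clauses above
# (reconstructed from the example query, not taken from Bio::DB::GFF).
# Each tier's bin index is packed into the fractional part of a float:
# fbin = tier_size + bin_index / 1e6.

TIERS = (1000, 10000, 100000, 1000000, 10000000, 100000000)

def bin_clauses(start, stop, tiers=TIERS):
    """Return (low_fbin, high_fbin) pairs, one per tier, for an
    overlap query on the 1-based range [start, stop]."""
    clauses = []
    for tier in tiers:
        lo = tier + ((start - 1) // tier) / 1e6   # bin holding start
        hi = tier + (stop // tier) / 1e6          # bin holding stop
        clauses.append((f"{lo:.6f}", f"{hi:.6f}"))
    return clauses

for lo, hi in bin_clauses(500000, 520000):
    print(f"fbin between {lo} and {hi}")
```

Run on the range 500,000-520,000, this reproduces the six bin conditions in
the query above, from the 1 kb tier up to the 100 Mb tier.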

And here's one that finds features that completely contain the same range 
(query type 3 above).  Note that only the larger bin tiers need to be 
checked, since a feature small enough to live in one of the smaller bins 
cannot span the whole 20 kb range:

 SELECT fref,fstart,fstop
 FROM fdata
 WHERE fdata.fref='I'
        AND (fbin between 100000000.000000 and 100000000.000000
         OR fbin between 10000000.000000 and 10000000.000000
         OR fbin between 1000000.000000 and 1000000.000000
         OR fbin between 100000.000004 and 100000.000005)
        AND fdata.fstart<=500000 AND fdata.fstop>=520000

On a MySQL database with 6 million ranges and reference segments averaging 15 
megabases, the unoptimized query (the one without the bin conditions) takes 
10.22 seconds to run, while the optimized query takes 0.03 seconds.  
Obviously you need a bit of a Perl script to generate the SQL, but that's 
what Bio::DB::GFF is for.

I'm writing the algorithm up, but feel free to use it if you think it's 
useful (you can steal the details from the code).

PostgreSQL has an R-Tree index type that is specialized for multi-dimensional 
range queries.  I experimented with it before arriving at this solution, but 
found that it doesn't do well with sequence feature data.  The problem is 
that R-Trees create a set of dynamically-sized bins, but the great 
differences in magnitude among the features in a typical data set (some 
features are megabases long, others are a single base pair) cause many 
features to be greedily allocated to the largest bin.

Lincoln


On Thursday 06 December 2001 15:05, Aaron J Mackey wrote:
>
> This is a bit off-topic, but the invocation of 'postgres' made my ears perk
> up ...
>
> On Wed, 5 Dec 2001, Chris Mungall wrote:
> > If you're using postgres there are a number of optimisations to
> > consider -
> >
> > you can use the native range type to store the coordinates, this is
> > presumably optimised for this sort of thing.
>
> "presumably" is the key word here.  I've recently migrated to using
> Postgres, mainly because of all the wonderful things it seemed to provide
> along these lines (more interesting datatypes and inheritance, namely).
> I've found that, with a little inspection, many of these things aren't
> actually optimized at all.  Provided, yes.  Optimized, not necessarily.
>
> > other than the speed issue, which is still open, i think postgres
> > could be better than mysql as an open source bioinformatics dbms.
>
> I want to agree.  I want to believe.  I want to read some documentation
> about inheritance that tells me more than just the trite cities/capitals
> example.  I've read the developers list where they talk about how
> inheritance isn't actually "finished".  How you cannot access children
> attributes without explicitly joining in the table (which is how you have
> to handle subtyping in MySQL).  How foreign keys amongst children aren't
> shared.  etc. etc. etc.
>
> Sorry, didn't mean to rant.  To get back on topic, to answer the range
> overlap/exclusion problem, here's how we do it (with MySQL, begin/end
> columns):
>
> There are two ways, and no I haven't explicitly compared them to see which
> is better (but my hunch is the first on Postgres, the second with MySQL).
>
> Method #1. Use set logic.  The key point here is that you need to create a
> temporary table (or VIEW in other RDBMs) that "explodes" each begin/end
> range pair into one row for each position.  You do this by joining to an
> auxiliary table of integers (that has at least as many rows as your
> longest sequence).  So if you have a table of "ranges" like this:
>
> CREATE TABLE ranges (
> range_id int unsigned not null auto_increment primary key,
> seq_id int unsigned not null references seq(id),
> begin int unsigned not null default 1,
> end int unsigned not null default 1,
> index(seq_id),
> index(begin),
> index(end)
> -- CONSTRAINT good_positions
> --   CHECK (begin < end AND
> --          end <= (SELECT length from seq where id = seq_id))
> );
>
> You need an auxiliary table "allnumbers" like this (adjust last select as
> necessary for your largest sequence length):
>
> CREATE TABLE allnumbers (
> number int unsigned not null primary key
> );
> INSERT INTO allnumbers (number)
> VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
> INSERT INTO allnumbers (number)
> SELECT DISTINCT N1.number +
>                (N2.number * 10) +
>                (N3.number * 100) +
>                (N4.number * 1000) +
>                (N5.number * 10000) +
>                (N6.number * 100000)
> FROM allnumbers as N1,
>      allnumbers as N2,
>      allnumbers as N3,
>      allnumbers as N4,
>      allnumbers as N5,
>      allnumbers as N6
> WHERE N1.number +
>      (N2.number * 10) +
>      (N3.number * 100) +
>      (N4.number * 1000) +
>      (N5.number * 10000) +
>      (N6.number * 100000) > 9;
>
> (Whew!)
>
> Now make your temporary tables (in MySQL, an in-memory heap; YMMV)
>
> CREATE TEMPORARY TABLE coverage (
> range_id int unsigned not null,
> seq_id int unsigned not null,
> position int unsigned not null,
> primary key (range_id, seq_id, position)
> ) TYPE = Heap;
>
> INSERT INTO coverage (range_id, seq_id, position)
> SELECT range_id, seq_id, number
> FROM ranges
> 	INNER JOIN allnumbers
> 	ON (number >= begin AND number <= end)
> ;
>
> Now you can work with "coverage" using the usual set operations
> (intersection, union, difference, etc.); see Celko's "SQL For Smarties".
> Note that this requires an RDBMS with sub-selects, i.e. not MySQL.
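The exploded-coverage idea can be demonstrated end-to-end with SQLite as a
stand-in engine (table and column names here are illustrative; rbeg/rend
replace begin/end, which are reserved words in SQLite):

```python
# Toy demonstration of Method #1: explode each range into one row per
# covered position, then use set logic on the resulting "coverage" table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ranges (range_id INTEGER PRIMARY KEY, seq_id INT, rbeg INT, rend INT);
CREATE TABLE allnumbers (number INT PRIMARY KEY);
INSERT INTO ranges (seq_id, rbeg, rend) VALUES (1, 5, 10), (1, 8, 12);
""")
db.executemany("INSERT INTO allnumbers VALUES (?)",
               [(n,) for n in range(1, 21)])

# "Explode" each range into one row per covered position.
db.execute("""
CREATE TEMPORARY TABLE coverage AS
SELECT range_id, seq_id, number AS position
FROM ranges JOIN allnumbers ON number BETWEEN rbeg AND rend
""")

# Set logic on coverage: positions covered by both ranges.
overlap = [row[0] for row in db.execute("""
SELECT position FROM coverage WHERE range_id = 1
INTERSECT
SELECT position FROM coverage WHERE range_id = 2
ORDER BY position
""")]
print(overlap)  # -> [8, 9, 10]
```

The INTERSECT here computes the overlap of ranges 5-10 and 8-12 as a set of
positions; UNION and EXCEPT give the other set operations the same way.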
>
>
> Method #2 - MySQL compatible!
>
> OK, given all that theoretical crap above, here's what we actually do;
> it's quite fast.  What follows is a loosely commented SQL "script" that we
> feed to fasta when we want to "search using all the intervening sequence
> between a set of orf calls/hits already in the database".  fasta passes
> off any command that starts with "do" to the database server; the first
> statement without a "do" is a "select" from which fasta gets its data.
>
> -- extract the coordinates of the hits we already have:
> do
> create table myhits type = heap
> select lib_id as libid,
>        if(strand = 'f', lbeg, lend) as begin,
>        if(strand = 'f', lend, lbeg) as end -- coordinates reversed if strand <> 'f'
> from hit, search
> where hit.search_id = search.id
> and hit.exp <= 1e-6
> and search.tag = 'my-favorite-search'
> ;
>
> -- grab the first bunch of intergenic regions
> do
> create temporary table ranges type = heap
> select h1.libid as libid,
>        max(h1.end + 1) as begin,
>        min(h2.begin - 1) as end
> from myhits as h1, myhits as h2
> where h1.libid = h2.libid
> and h2.begin > h1.end
> group by h1.libid, h1.begin
> ;
>
> -- add all the beginning ranges we missed
> do
> insert into ranges (libid, begin, end)
> select h1.libid as libid,
>        1 as begin,
>        min(h1.begin - 1) as end
> from myhits as h1
> group by h1.libid
> having end >= begin
> ;
>
> -- add all the ending ranges we missed
> do
> insert into ranges (libid, begin, end)
> select h1.libid as libid,
>        max(h1.end + 1) as begin,
>        libseq.len as end
> from myhits as h1 left join libseq on h1.libid = libseq.id
> group by h1.libid
> having end >= begin
> ;
>
> -- ok, now add ranges for those contigs that had no hits at all
> do
> create temporary table missing type = heap
> select libseq.id as libid,
>        1 as begin,
>        libseq.len as end
> from libseq
> 	left join ranges on libseq.id = ranges.libid
> 	left join search on libseq.search_id = search.id
> where search.tag = 'my-favorite-search'
>   and ranges.libid is null
> group by libseq.id
> order by libseq.id
> ;
>
> -- had to do this in two steps because mysql won't let you
> -- insert into a table you're currently selecting from.
> do
> insert into ranges (libid, begin, end)
> select libid, begin, end
> from missing
> ;
>
> -- lastly, we may have a few duplicate overlapping ranges,
> -- because our earlier strategy didn't remove them; we'll
> -- now keep the shortest of them:
> do
> create temporary table newranges (
> id int unsigned not null auto_increment primary key,
> libid bigint unsigned,
> begin bigint unsigned,
> end bigint unsigned
> );
>
> -- again, two steps for MySQL
> do
> insert into newranges (libid, begin, end)
> select libid, max(begin) as begin, end
> from ranges
> group by libid, end
> order by libid
> ;
>
> -- oh, we'll only search them if they're at least 150 bp long.
> do
> delete
> from newranges
> where (end - begin + 1) < 150
> ;
>
> -- OK, give fasta some data!
> -- the @C is a coordinate flag for fasta
> select newranges.id,
>        mid(contig.seq,
>            newranges.begin,
> 	   (newranges.end - newranges.begin + 1)
>        ),
>        concat("CONTIG_ID: ", contig.id, " @C:", newranges.begin, " ", contig.name)
> from newranges, libseq, contig
> where newranges.libid = libseq.id
> and libseq.seq_id = contig.id
> ;
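Stripped of the SQL plumbing, the pipeline above reduces to: complement the
hit intervals on each sequence, keeping only gaps of at least 150 bp.  A
plain-Python equivalent of that logic (function and variable names are
illustrative, not from the script):

```python
# Compute the "intergenic" gaps left uncovered by a set of hit intervals,
# mirroring the SQL pipeline above: gaps between hits, plus leading and
# trailing gaps, with short gaps (< min_len) discarded.

def intergenic_ranges(hits, seq_len, min_len=150):
    """hits: list of (begin, end) pairs, 1-based inclusive."""
    gaps = []
    pos = 1                      # start of the next candidate gap
    for begin, end in sorted(hits):
        if begin - pos >= min_len:
            gaps.append((pos, begin - 1))
        pos = max(pos, end + 1)  # max() handles overlapping hits
    if seq_len - pos + 1 >= min_len:
        gaps.append((pos, seq_len))
    return gaps

print(intergenic_ranges([(200, 400), (300, 500), (900, 1000)], 2000))
# -> [(1, 199), (501, 899), (1001, 2000)]
```

The sorted-sweep with `max()` also collapses the overlapping-range duplicates
that the SQL version has to clean up in a separate step.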
>
>
> That's all incredibly simple, right?
>
> -Aaron
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY

NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
PLEASE WRITE FOR DETAILS.
========================================================================