[BioSQL-l] how to quickly retrieve feature sequences

Mon Jun 21 09:42:52 EDT 2004

It turned out it's quick enough to retrieve sequences such as gene, promter
etc. The SQL I provided in last message had a 'bug' on "substring(t6.seq,
t2.start_pos,t2.end_pos) seq" line, which will retrieve the subsequence
starting at "t2.start_pos" with length of "t2.end_pos". But what I needed is
the gene sequences, which should be "substring(t6.seq,
t2.start_pos,t2.end_pos-t2.start_pos+1) seq".

If the average length of gene sequences is 1-1.5k, retriving every 1000 gene
sequences needs about 2-4 seconds on our server(Dell PowerEdge 2650 with
dual Xeon 2.6G, 512K). Is this fast enough for you guys?

Gang

-----Original Message-----
From: biosql-l-bounces at portal.open-bio.org
[mailto:biosql-l-bounces at portal.open-bio.org]On Behalf Of Gang Wu
Sent: Tuesday, June 15, 2004 1:31 PM
To: biosql-l at open-bio.org
Subject: RE: [BioSQL-l] how to quickly retrieve feature sequences

SQL again:

SELECT t1.seqfeature_id,t1.bioentry_id,t2.start_pos, t2.end_pos, t2.strand,
t4.value locus_tag,
substring(t6.seq, t2.start_pos,t2.end_pos) seq
FROM `seqfeature` t1 inner join location t2 on
t1.seqfeature_id=t2.seqfeature_id
inner join term t3 on t1.type_term_id=t3.term_id
inner join seqfeature_qualifier_value t4 on
t1.seqfeature_id=t4.seqfeature_id
inner join term t5 on t4.term_id=t5.term_id
inner join biosequence t6 on t1.bioentry_id=t6.bioentry_id
where t3.name='gene' and t5.name='locus_tag'
limit 2

Gang

-----Original Message-----
From: biosql-l-bounces at portal.open-bio.org
[mailto:biosql-l-bounces at portal.open-bio.org]On Behalf Of Gang Wu
Sent: Tuesday, June 15, 2004 12:38 PM
To: biosql-l at open-bio.org
Subject: [BioSQL-l] how to quickly retrieve feature sequences

Hi,

I just loaded the 5 Arabidopsis thalian Genbank genome files into my
sequence database(BioSQL 1.38). My question is: How can I efficiently
retrieve all gene sequences from the database? I tried to do that by joining
seqfeature, seqfeature_qualifier_value, location, term and biosequence
tables, but it turned out to be extremely slow(See the attached SQL, 2
records take about 20 seconds on my Dell PowerEdge 2650 with dual 2.6G
Xeons). Does anyone have a better way to do it?

All I can imagine to do this faster is(by Java or other languages): Pull all
gene location info; Pull erlated sequence from biosequence table; rotate
through the gene location list and retrieve the substring of the sequence.
But this does not seem attractive for me since for different applications, I
have to write code to pull the sequences by myself. Is it possible to
extend/modify the BioSQL schema to serve this purpose better?

My understanding is that a lot subsequent applications would be only
interested in certain pieces of the whole genome sequences and there must be
an efficient way to do that. If everyone has to invent their method, the
BioSQL might be a little bit too limited. Any idea on this?

Gang

_______________________________________________
BioSQL-l mailing list
BioSQL-l at open-bio.org
http://open-bio.org/mailman/listinfo/biosql-l

_______________________________________________
BioSQL-l mailing list
BioSQL-l at open-bio.org
http://open-bio.org/mailman/listinfo/biosql-l