[Bioperl-l] sequence proxy server

Fri Apr 6 13:49:32 UTC 2012

Hi all,

I'm an undergrad student in molecular biology at the ANU in Australia,
and my research projects are becoming increasingly bioinformatics
heavy. The latest one has involved quite a large amount of sequence
retrieval from GenBank and GenPept. The download speed to Australia
from NCBI's servers is rather slow, and I've been thinking about how
we can improve this. One solution would be to use Bio::DB::Flat with
GenBank sequences on a local computer. However, in a situation where
there are multiple people in a lab doing bioinformatics, it seems to
me a bit of a waste to have the entire GenBank/GenPept database, or
even the relevant sections thereof, on each computer. So I thought
about writing a "sequence proxy" CGI script, and a corresponding
module, which would work a bit like this:

1. The user calls Bio::DB::SeqProxy::GenBank as they would
   Bio::DB::GenBank, except that a parameter giving the address of
   the sequence proxy server is required.
2. The module then sends the sequence proxy server a request similar
   to the one that Bio::DB::GenBank->get_Seq_by_x() sends to NCBI's
   servers. I believe all requests go to the efetch page now (please
   correct me if I'm wrong; I have read the relevant BioPerl module
   code, but not thoroughly), so the CGI script on the sequence proxy
   would take arguments in a similar fashion, to make writing the
   client-side module easier.
3. The CGI script would use a Bio::DB::Flat database, or an interface
   to an SQL database, to determine whether the required sequence is
   stored locally. (As an aside, I'd like your thoughts on
   Bio::DB::Flat vs Bio::DB::Sql or similar.)
4. If the sequence exists locally, it would be returned to the user,
   either as plain text or inside an XML container (see below).
5. If not, it would be retrieved from the remote database using the
   relevant Bio::DB module, and returned.
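
To make the local-lookup-then-remote-fallback logic concrete, here is a rough sketch. It is written in Python rather than Perl purely so that it is self-contained and does not depend on any BioPerl call signatures. Everything in it is hypothetical: the class name SequenceProxy, the in-memory dict standing in for Bio::DB::Flat or an SQL table, and the fetch_remote callable standing in for the real Bio::DB lookup. It also assumes the proxy keeps a copy of each remotely fetched record, which the description above does not spell out.

```python
from typing import Callable, Dict, Tuple


class SequenceProxy:
    """Serve records from a local store, falling back to a remote fetcher."""

    def __init__(self, fetch_remote: Callable[[str], str]) -> None:
        # Stand-in for a Bio::DB::Flat index or an SQL table (assumption).
        self._local: Dict[str, str] = {}
        # Stand-in for the real remote lookup, e.g. a request to NCBI (assumption).
        self._fetch_remote = fetch_remote

    def get(self, accession: str) -> Tuple[str, str]:
        """Return (record_text, source_label) for one accession."""
        if accession in self._local:
            return self._local[accession], "Local Database"
        record = self._fetch_remote(accession)
        # Assumption: cache the record so the next request is served locally.
        self._local[accession] = record
        return record, "Remote Database"


if __name__ == "__main__":
    def fake_remote(accession: str) -> str:
        # Hypothetical fetcher returning a dummy GenBank-like record.
        return "LOCUS       " + accession + "\n//"

    proxy = SequenceProxy(fake_remote)
    print(proxy.get("AB000001")[1])  # first request goes to the remote fetcher
    print(proxy.get("AB000001")[1])  # second request is served from the local store
```

Passing the remote fetcher in as a callable keeps the caching logic independent of whichever remote database module ends up behind it.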

The sequence would either be returned as the relevant sequence format
(which would default to GenBank format) in plain text, or as an XML
document similar to:

<result>
<successful>1</successful>
<sequence>___YOUR GENBANK FILE HERE___</sequence>
<source>Local Database</source>
</result>
The aim of the XML document would be to simplify handling of server
errors and allow for the specification of other metadata such as which
database the sequence came from.
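
As a rough illustration of how such an envelope could be built on the server and unpacked on the client, here is a sketch using Python's standard xml.etree.ElementTree. The element names come from the example above. The helper names build_result and parse_result are hypothetical, and so is the use of "0" in <successful> to signal a failure, since only the success case is shown above.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def build_result(sequence: str, source: str, successful: bool = True) -> str:
    """Wrap a sequence record in the <result> envelope."""
    root = ET.Element("result")
    # Assumption: "1" means success and "0" means failure.
    ET.SubElement(root, "successful").text = "1" if successful else "0"
    # ElementTree escapes <, > and & in the record text when serialising.
    ET.SubElement(root, "sequence").text = sequence
    ET.SubElement(root, "source").text = source
    return ET.tostring(root, encoding="unicode")


def parse_result(xml_text: str) -> Tuple[str, str]:
    """Return (sequence, source), or raise if the proxy reported a failure."""
    root = ET.fromstring(xml_text)
    if root.findtext("successful") != "1":
        raise RuntimeError("sequence proxy reported a failure")
    return root.findtext("sequence", default=""), root.findtext("source", default="")


if __name__ == "__main__":
    doc = build_result("LOCUS       AB000001\n//", "Local Database")
    sequence, source = parse_result(doc)
    print(source)
```

One thing this makes visible: GenBank records can contain characters such as < and > (for example in partial feature locations), so the record has to be escaped or wrapped in CDATA rather than pasted into the XML verbatim.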

Firstly, I'd like to know if this sounds feasible, and if so, if
someone is already working on something similar? I don't want to
reinvent the wheel.
Secondly, I'd like to ask for your comments and advice. Being
reasonably new to BioPerl (I started using it about 6 months ago,
but I've been coding in various languages for 8 years), I don't expect
to have considered things that may seem obvious to a more experienced
BioPerl-er, so please be as brutally constructive in your criticism as
you see fit =].

I know this is a lot of questions, so thanks in advance for your help.

Cheers, and a happy Easter to those who celebrate it.

Regards
Kevin Murray