[BioSQL-l] Download from GenBank to BioSQL
Peter Cock
p.j.a.cock at googlemail.com
Wed Jan 15 13:45:45 UTC 2014
On Wed, Jan 15, 2014 at 12:47 PM, Trevor Bell
<trevorgrahambell at gmail.com> wrote:
> I would like to add a few thousand nucleotide sequences, which match a
> search phrase, to a BioSQL database. There are obviously many different
> ways of doing this. What is the "best practice" approach?
>
> Is it better to search via the website, download all the full records as a
> single file, parse this file offline and add to the database, rather than
> querying the entries one by one and adding them? I'm conscious of the
> overhead/load on the NCBI servers of fetching thousands of sequences
> individually.
Personally, I would download GenBank format files from the NCBI using
their FTP site if possible (this depends on which organism etc. you work
with), or with the Entrez web API (respecting their usage guidelines, and
including a sanity test for complete downloads). If you use Entrez, you can
make batch requests (e.g. 100 sequences at a time; adjust depending on the
size of the records you're working with, fewer if whole genomes).
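
As a rough illustration of the batched Entrez approach, here is a minimal
Python sketch using Biopython's Bio.Entrez module. The function names, the
default batch size, the pause between requests and the choice of counting
"//" terminator lines as the completeness check are all assumptions made for
this example, not something prescribed in this thread. The list of IDs is
assumed to come from an earlier search.

```python
import time


def batches(ids, size=100):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def count_genbank_records(text):
    """Count complete GenBank records; each one ends with a line that is just '//'."""
    return sum(1 for line in text.splitlines() if line.strip() == "//")


def download_genbank(ids, out_path, email, batch_size=100, pause=0.5):
    """Fetch GenBank records in batches and append them to one local file.

    Hypothetical helper: raises RuntimeError if a batch comes back with a
    different number of records than requested.
    """
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks clients to identify themselves
    with open(out_path, "w") as out:
        for chunk in batches(ids, batch_size):
            handle = Entrez.efetch(
                db="nucleotide",
                id=",".join(chunk),
                rettype="gb",
                retmode="text",
            )
            text = handle.read()
            handle.close()
            # Sanity test: one '//' terminator per requested record.
            if count_genbank_records(text) != len(chunk):
                raise RuntimeError(
                    "Incomplete download for batch starting at %s" % chunk[0]
                )
            out.write(text)
            time.sleep(pause)  # assumed pause, to stay gentle on the servers
```

For whole genomes, a smaller batch_size would be the obvious thing to try.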
Then, separately, I would import the GenBank files into BioSQL.
You should be able to do that in BioPerl/Biopython/BioRuby/BioJava.
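
For the Biopython route, the import step might look something like the
sketch below. It assumes the BioSQL schema has already been loaded into an
SQLite file, and the function names and the namespace argument are
hypothetical choices for this example. Other database drivers take different
connection arguments.

```python
def load_records(server, records, namespace):
    """Load records into the named BioSQL sub-database, creating it if needed.

    `server` is expected to behave like a BioSQL DBServer object.
    Returns whatever the sub-database's load() call returns.
    """
    try:
        db = server[namespace]  # reuse an existing sub-database
    except KeyError:
        db = server.new_database(namespace)
    try:
        count = db.load(records)
    except Exception:
        server.rollback()  # leave the database unchanged on failure
        raise
    server.commit()
    return count


def import_genbank_file(genbank_path, sqlite_path, namespace):
    """Parse a local GenBank file and load it into an SQLite BioSQL database."""
    # Imported here so load_records() can be used without Biopython installed.
    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    server = BioSeqDatabase.open_database(driver="sqlite3", db=sqlite_path)
    try:
        return load_records(server, SeqIO.parse(genbank_path, "genbank"), namespace)
    finally:
        server.close()
```

Keeping the download and the import as separate steps means a failed import
can be retried from the local file without going back to the NCBI.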
Peter