[Bioperl-l] Downloading multiple contigs using bioperl

Mon Sep 18 16:13:37 UTC 2006

> Hello,
> I think this might be a simple question - but I'm yet a novice...
> 
> Is there any way I can download, automatically and at once, all contigs of
> a given genome in GenBank, and ideally merge them all into one file? Or do
> I have to download every contig separately in order to receive the full
> genome?
> 
> In the latter case, is there some sort of list that provides the
> identifiers of all contigs of the genome I'm interested in?
> 
> Thank you very much,
> Schragi

It depends on the type of sequence record.  WGS files contain a WGS line
that gives the accession range of the sequence records that can be
retrieved:

LOCUS       AAFC03000000          131728 rc    DNA     linear   MAM 28-AUG-2006
DEFINITION  Bos taurus whole genome shotgun sequencing project.
ACCESSION   AAFC00000000
VERSION     AAFC00000000.3  GI:112180191
KEYWORDS    WGS.
....
FEATURES             Location/Qualifiers
     source          1..131728
                     /organism="Bos taurus"
                     /mol_type="genomic DNA"
                     /isolate="L1 Dominette 01449"
                     /db_xref="taxon:9913"
                     /sex="female"
                     /note="breed: Hereford"
WGS         AAFC03000001-AAFC03131728
WGS_SCAFLD  CM000177-CM000206
WGS_SCAFLD  CH974204-CH980624
//

The WGS line gives the accession range of the individual sequences, and the
WGS_SCAFLD lines give the ranges for the different scaffold (or supercontig)
builds.  The contig files list the subsequences that make up each build
(which can be pretty complex), but you don't need these if you only want the
sequence itself.  That can be retrieved directly from GenBank using
Bio::DB::GenBank with the default settings; if you use the web Entrez
interface, you can get the full sequences by selecting the format
'GenBank(full)'.
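
As for a list of identifiers: the range on the WGS (or WGS_SCAFLD) line can
be expanded into the individual accession numbers.  Here is a rough sketch
in Python, using only string handling (no BioPerl or Entrez calls).  The
function name expand_wgs_range is made up for illustration, and the code
assumes that both ends of the range share the same letter prefix and the
same number of digits, as in the record above:

```python
import re


def expand_wgs_range(wgs_range):
    """Yield every accession in a range like 'AAFC03000001-AAFC03131728'.

    Hypothetical helper: assumes both ends have the same letter prefix
    and the same zero-padded digit width.
    """
    first, last = wgs_range.strip().split("-")
    m_first = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m_last = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m_first or not m_last:
        raise ValueError("unrecognised accession format: %r" % wgs_range)
    prefix, start_digits = m_first.groups()
    last_prefix, end_digits = m_last.groups()
    if prefix != last_prefix or len(start_digits) != len(end_digits):
        raise ValueError("range ends do not match: %r" % wgs_range)
    width = len(start_digits)
    # Count from the first number to the last, keeping the zero padding.
    for n in range(int(start_digits), int(end_digits) + 1):
        yield "%s%0*d" % (prefix, width, n)


contigs = list(expand_wgs_range("AAFC03000001-AAFC03131728"))
print(len(contigs), contigs[0], contigs[-1])
# prints: 131728 AAFC03000001 AAFC03131728
```

The resulting list could then be fed, in batches, to whatever retrieval
method you settle on.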

Depending on what you are after, you may be better off downloading the
sequences via FTP, though.  Some of these files are very large (~100 MB or
more).  Retrieval via Bio::DB::GenBank converts everything into BioPerl
objects before saving, so fetching files this large that way may take a long
time, if it works at all.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign