[Bioperl-l] WGS sequences through Bio::DB::GenBank

Chris Fields cjfields at uiuc.edu
Thu Mar 2 17:07:09 UTC 2006


Brian,

I'm working out some of the WGS subsequence parsing and it's actually pretty
simple (much more so than CONTIG).  The WGS tag just gives the sequence
range, and the WGS_SCAFLD tag is a list of scaffolds, each of which can be a
chromosomal (CM*) supercontig or a smaller subchromosomal chunk (CH*, I
think) and is itself a contig of shorter sequences.  Essentially, CM* files
are contigs of CH* files, which are contigs of the base WGS files.  In many
cases there are only WGS files (no scaffolds), while a few have chromosomal
scaffolds and a smaller number have multiple scaffold types.  I'm starting
simple (WGS files only) and working my way up to the more complex types
before I commit anything.
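
To make the "WGS tag just gives the sequence range" point concrete, here is a
minimal sketch (in Python rather than Perl, purely for illustration) of
expanding such a range into individual accessions.  It assumes the range looks
like FIRST-LAST with a shared letter prefix and zero-padded numbers; the
function name expand_wgs_range and the accessions in the usage note are
hypothetical, not anything in BioPerl.

```python
import re

# Assumption: an accession is a run of letters followed by a run of digits.
_ACCESSION = re.compile(r"([A-Za-z_]+)(\d+)")


def expand_wgs_range(wgs_range):
    """Expand 'FIRST-LAST' from a WGS line into a list of accessions.

    A single accession (no '-') is returned as a one-element list.
    """
    first, sep, last = wgs_range.strip().partition("-")
    if not sep:
        return [first]

    m_first = _ACCESSION.fullmatch(first)
    m_last = _ACCESSION.fullmatch(last)
    if not m_first or not m_last:
        raise ValueError(f"unrecognised accession range: {wgs_range!r}")

    prefix, start = m_first.groups()
    prefix_last, end = m_last.groups()
    if prefix != prefix_last or len(start) != len(end):
        raise ValueError(f"range endpoints do not match: {wgs_range!r}")

    # Keep the zero padding of the original numeric part.
    width = len(start)
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]
```

For example, expand_wgs_range("AAOH01000001-AAOH01000003") yields the three
accessions from AAOH01000001 through AAOH01000003 (illustrative values only).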

An issue I can foresee is that many WGS file ranges are huge (the O. sativa
WGS master file lists ~52000 WGS subfiles, 12 chromosomal supercontigs, and
~3000 subchromosomal scaffold contigs).  So which one should we choose, or
set as the default (I'm guessing the largest, using recursion to piece
everything else together)?  We'll also run into an issue with the max # of
ids per request for many of these.
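
One possible way around the id limit is to split the accession list into
batches and fetch each batch separately.  A minimal sketch (again Python, for
illustration only); the helper name batches and the batch size in the usage
note are hypothetical, since I don't know the actual limit.

```python
def batches(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

With a hypothetical size of 500, a ~52000-entry list would be fetched as 104
separate requests rather than one oversized one.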

Also, in relation to CONTIG: I came across this blurb in the eutils document
in the NCBI short courses
(http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.sample-apps),
which I found very interesting:

Application 4: Downloading Contigs
I want to download a flatfile with the full sequence of an assembly (eg. a
contig).
Solution: Use EFetch with &rettype=gbwithparts
URL:efetch.fcgi?db=nucleotide&id=27479347&rettype=gbwithparts

I tried it out and it works well.  Should we be using this for contig
building instead of the loop built into NCBIHelper?  It seems much more
direct and quicker.  I won't really mess with it any further until I have
WGS figured out.
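
For reference, the request from Application 4 can be assembled like this.  It
is only a sketch in Python: the db, id and rettype parameters come from the
blurb above, while the base URL is an assumption borrowed from the esearch URL
quoted further down in this thread, and the function name efetch_url is
hypothetical.

```python
from urllib.parse import urlencode

# Assumption: same host and path as the esearch URL quoted below.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(seq_id, rettype="gbwithparts", db="nucleotide"):
    """Build an EFetch URL asking for a flatfile with the full sequence."""
    query = urlencode({"db": db, "id": seq_id, "rettype": rettype})
    return f"{EUTILS_BASE}efetch.fcgi?{query}"
```

Calling efetch_url(27479347) reproduces the query string from the blurb; the
sketch only builds the URL and does not contact NCBI.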

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Wednesday, March 01, 2006 10:04 PM
> To: 'Brian Osborne'; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] WGS sequences through Bio::DB::GenBank
> 
> Thanks, Brian.  I was actually typing this up when you responded.
> 
> Okay, to answer my own question somewhat (and to confirm your answer),
> there IS no direct way; efetch doesn't complete these files, so the best
> way is with a query.  I'm posting this so anybody searching the mail list
> with the same question will maybe find this.  The NCBI help desk basically
> told me to use a query like so:
> 
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=AAOH00000000[accn]+AND+wgs_contig[prop]
> 
> which needs to be parsed for the individual contigs.  I tried the same
> query using Bio::DB::Query::GenBank and got it to work.
> 
> As for NCBIHelper, I'll give it a look and try adding this in but it won't
> be until next week.
> 
> Christopher Fields
> Postdoctoral Researcher - Switzer Lab
> Dept. of Biochemistry
> University of Illinois Urbana-Champaign
> 
> 
> > -----Original Message-----
> > From: Brian Osborne [mailto:osborne1 at optonline.net]
> > Sent: Wednesday, March 01, 2006 9:55 PM
> > To: Chris Fields
> > Subject: Re: [Bioperl-l] WGS sequences through Bio::DB::GenBank
> >
> > Chris,
> >
> > No, NCBIHelper.pm doesn't handle the WGS block; presumably this is where
> > it should be coded. The approach would be very similar to that used for
> > the CONTIG block: piece the sequence together by retrieving the CONTIG
> > information specified by the WGS_SCAFLD entries.
> >
> > Brian O.
> >
> >
> > On 2/28/06 9:41 PM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> >
> > > I know that a recent post showed that you could retrieve CONTIG
> > > sequences from GenBank files fairly easily:
> > >
> > > http://bioperl.org/pipermail/bioperl-l/2006-February/020891.html
> > >
> > > I'm driving myself a bit buggy looking for this, and I may be blind to
> > > it, but can the same be done with WGS files?  I've tried
> > > Bio::DB::GenBank and a few other Bio::DB* modules to see if it's been
> > > implemented but haven't had any luck yet.  I may try getting around it
> > > using Bio::DB::Query::GenBank, but just trying to find a more direct
> > > route.
> > >
> > > Christopher Fields
> > > Postdoctoral Researcher - Switzer Lab
> > > Dept. of Biochemistry
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 



