[Bioperl-l] CONTIG sequence files from the NCBI

michael watson (IAH-C) michael.watson at bbsrc.ac.uk
Thu Feb 23 10:17:39 UTC 2006


What I mean is, you have accession1, which is a contig file referring to
n other sequence files.  Accession1 has a version number.  Is that
version number increased when one of the sequences that constitute it is
updated? 

-----Original Message-----
From: Brian Osborne [mailto:osborne1 at optonline.net] 
Sent: 18 February 2006 04:56
To: michael watson (IAH-C); bioperl-l
Subject: Re: [Bioperl-l] CONTIG sequence files from the NCBI

Michael,

Yes, BioPerl has done this for you. Essentially, what it does is take all
the ids in the CONTIG section, query for each individually, then use
the sequences and the location data to create the single large sequence.
This sequence is appended to the annotation and feature section of the
initial GenBank entry. If you want to study this yourself, take a look at
Bio::DB::NCBIHelper::postprocess_data.
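
The joining step described above can be sketched in a few lines. This is
only an illustration in Python of the general idea, not a port of
Bio::DB::NCBIHelper::postprocess_data: the function names and the fetch
callback are hypothetical, and it assumes a CONTIG value of the form
join(ACC.V:start..end,gap(N),complement(ACC.V:start..end)) with gaps of
known length only.

```python
import re

# Translation table for complementing plain DNA (ambiguity codes left as-is).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# One CONTIG element: either ACC.V:start..end (optionally wrapped in
# complement(...)) or gap(N). Other forms, e.g. gap(unk100), are not handled.
_ELEMENT = re.compile(
    r"(complement\()?([A-Za-z_0-9]+\.\d+):(\d+)\.\.(\d+)\)?|gap\((\d+)\)"
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assemble_contig(contig, fetch):
    """Build one sequence from a CONTIG join(...) string.

    `fetch` is a hypothetical callback that maps 'ACC.V' to the full
    sequence of that component (in real use, one query per id).
    """
    inner = contig.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[5:-1]  # drop the outer join( ... )

    pieces = []
    for m in _ELEMENT.finditer(inner):
        if m.group(5) is not None:
            # gap(N): pad with N unknown bases
            pieces.append("N" * int(m.group(5)))
            continue
        start, end = int(m.group(3)), int(m.group(4))
        # GenBank coordinates are 1-based and inclusive
        sub = fetch(m.group(2))[start - 1:end]
        pieces.append(reverse_complement(sub) if m.group(1) else sub)
    return "".join(pieces)


if __name__ == "__main__":
    # Toy component sequences standing in for the per-id queries.
    toy = {"AE000001.1": "AAAACCCC", "AE000002.1": "GGGGTTTT"}
    line = "join(AE000001.1:1..4,gap(3),complement(AE000002.1:2..5))"
    print(assemble_contig(line, toy.__getitem__))
```

Because the features in the CONTIG entry are already expressed in the
coordinates of the joined sequence, nothing in this sketch needs to shift
feature locations.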

OK, to answer your first question, with my assumption: NCBI is simply
providing a shorthand rather than the entire large sequence, so no
feature coordinates change, whether the entry uses the shorthand (CONTIG)
or the longhand (ORIGIN). Second, as my explanation implies, all the
sequences are the very latest versions of each sequence; that's how
eutils works by default.
However, I don't think I've answered your question because I'm not sure
I understand what you mean by "when I ask bioperl if these sequences
have been updated, I will be told no". All Bioperl does is read the file
provided by GenBank and use its stated version, nothing fancy.
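
For anyone who wants to check component versions independently of the
CONTIG entry's own version, one option is to compare the accession.version
ids listed in the CONTIG line of two downloads. The Python sketch below
assumes both CONTIG strings have been saved as plain text and use the
ACC.V:start..end form; the function names are hypothetical, and it says
nothing about when NCBI increments the version of the CONTIG record itself.

```python
import re

# Matches ACC.V:start..end and captures the accession and its version.
_COMPONENT = re.compile(r"([A-Za-z_0-9]+)\.(\d+):\d+\.\.\d+")


def component_versions(contig):
    """Map each component accession in a CONTIG string to its version."""
    return {acc: int(ver) for acc, ver in _COMPONENT.findall(contig)}


def changed_components(old_contig, new_contig):
    """Return {accession: (old_version, new_version)} for every component
    whose version differs; None marks a component missing on one side."""
    old = component_versions(old_contig)
    new = component_versions(new_contig)
    return {
        acc: (old.get(acc), new.get(acc))
        for acc in sorted(old.keys() | new.keys())
        if old.get(acc) != new.get(acc)
    }


if __name__ == "__main__":
    # Made-up CONTIG lines from an earlier and a later download.
    earlier = "join(AE000001.1:1..4,gap(3),complement(AE000002.1:2..5))"
    later = "join(AE000001.2:1..4,gap(3),complement(AE000002.1:2..5))"
    print(changed_components(earlier, later))
```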

Brian O.


On 2/16/06 5:31 AM, "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk> wrote:

> Hi
> 
> I have two questions really.  I fetched bacterial genome sequences 
> from the NCBI using Bio::DB::GenBank.
> 
> Some of these sequence entries are CONTIG sequences, ie they just 
> point to other sequences that need to be joined together to form the 
> entire genome.
> 
> Looking at my downloads, it looks as if bioperl has done all the 
> necessary joining for me - or maybe it was the NCBI that did the 
> joining?
> 
> OK, so firstly, did bioperl do the joining, and if so, are all the 
> co-ordinates of the features updated to reflect their new location on 
> the new, joined sequence?
> 
> And secondly, sequence versions... I'm thinking that possibly the 
> sequence version of the CONTIG may be 1 (as it hasn't changed) yet the
> versions of the sequences it refers to might have changed, so when I 
> ask bioperl if these sequences have been updated, I will be told no 
> because the CONTIG sequence version is 1, but I should be told yes 
> because the underlying sequences have...?
> 
> Make sense?
> 
> Thanks
> Mick
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l





