[Bioperl-l] CONTIG sequence files from the NCBI

Brian Osborne osborne1 at optonline.net
Sat Feb 18 04:56:08 UTC 2006


Michael,

Yes, BioPerl has done this for you. Essentially, what it does is take all the
ids in the CONTIG section and query for each one individually, then use the
sequences and the location data to create the single large sequence. This
sequence is appended to the annotation and feature sections of the initial
GenBank entry. If you want to study this yourself, take a look at
Bio::DB::NCBIHelper::postprocess_data.
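
To make the joining step concrete, here is a minimal sketch in Python of how
a CONTIG join(...) expression could be turned into one sequence. This is not
BioPerl's actual code. The function name assemble_contig, the ids, and the
sequences are all made up for illustration, and the "parts" dictionary stands
in for the per-id queries described above. It handles only plain ranges,
gap(N), and complement(...) elements.

```python
import re

# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
# One element of the join: accession.version:start..end
ELEMENT = re.compile(r"(?P<id>[\w.]+):(?P<start>\d+)\.\.(?P<end>\d+)")


def assemble_contig(contig, parts):
    """Build a single sequence from a CONTIG join(...) string.

    contig: the join expression, e.g. "join(ID1.1:1..4,gap(3),...)"
    parts:  dict mapping each id to its full sequence (hypothetical stand-in
            for fetching every id individually)
    """
    match = re.fullmatch(r"\s*join\((.*)\)\s*", contig)
    if match is None:
        raise ValueError("expected a join(...) expression")

    pieces = []
    for element in match.group(1).split(","):
        element = element.strip()

        # gap(N) contributes N unknown bases.
        gap = re.fullmatch(r"gap\((\d+)\)", element)
        if gap:
            pieces.append("N" * int(gap.group(1)))
            continue

        # complement(ID:start..end) means the reverse strand of that range.
        reverse = element.startswith("complement(")
        if reverse:
            element = element[len("complement("):-1]

        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element}")

        # Coordinates are 1-based and inclusive.
        seq = parts[m["id"]][int(m["start"]) - 1:int(m["end"])]
        if reverse:
            seq = seq.translate(COMPLEMENT)[::-1]
        pieces.append(seq)

    return "".join(pieces)


# Made-up ids and sequences, for illustration only.
parts = {"AAA00001.1": "ACGTAC", "AAA00002.1": "GGCCTTAA"}
contig = "join(AAA00001.1:1..4,gap(3),complement(AAA00002.1:2..5))"
genome = assemble_contig(contig, parts)

# A feature annotated at 8..11 on the CONTIG record is read straight from
# the assembled sequence, with no coordinate shift.
feature = genome[8 - 1:11]
```

The feature coordinates in the CONTIG record already refer to positions on
the joined sequence, which is why nothing has to be recalculated once the
pieces are put together.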

OK, to answer your first question, going on my assumption about what you're
asking: NCBI is simply providing a shorthand rather than the entire large
sequence, so no feature coordinates change, whether the sequence is given in
shorthand (CONTIG) or longhand (ORIGIN). Second, my explanation implies that
all the sequences are the very latest versions of each sequence; that's how
eutils works by default. However, I don't think I've answered your question,
because I'm not sure I understand what you mean by "when I ask bioperl if
these sequences have been updated, I will be told no". All BioPerl does is
read the file provided by GenBank and use its stated version, nothing fancy.

Brian O.


On 2/16/06 5:31 AM, "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk>
wrote:

> Hi
> 
> I have two questions really.  I fetched bacterial genome sequences from
> the NCBI using Bio::DB::GenBank.
> 
> Some of these sequence entries are CONTIG sequences, ie they just point
> to other sequences that need to be joined together to form the entire
> genome.
> 
> Looking at my downloads, it looks as if bioperl has done all the
> necessary joining for me - or maybe it was the NCBI that did the
> joining?
> 
> OK, so firstly, did bioperl do the joining, and if so, are all the
> co-ordinates of the features updated to reflect their new location on
> the new, joined sequence?
> 
> And secondly, sequence versions... I'm thinking that possibly the
> sequence version of the CONTIG may be 1 (as it hasn't changed) yet the
> versions of the sequences it refers to might have changed, so when I ask
> bioperl if these sequences have been updated, I will be told no because
> the CONTIG sequence version is 1, but I should be told yes because the
> underlying sequences have...?
> 
> Make sense?
> 
> Thanks
> Mick




