[Bioperl-l] CON(structed) sequence databases?
cjfields at uiuc.edu
Wed Jan 31 14:05:46 UTC 2007
On Jan 30, 2007, at 1:45 AM, JK (Jesper Agerbo Krogh) wrote:
> What do you do about parsing sequences from the "CON"-divisions of
> EMBL/Genbank? The entries look just like this one:
> The bioperl 1.4 parser dies on the embl-version and the 1.5 parser
> reads the complete .dat file as a single entry.
For GenBank CONTIG/WGS line parsing you'll have to update to Bioperl
1.5.2 (I added that in after 1.5.1). The CONTIG data is currently
just carved up by newline and stored as SimpleValue annotation when
parsing GenBank records; I don't believe the EMBL parser handles it at
all at this time. Although we could probably do something using
Bio::Location objects, there really hasn't been much demand for it
since one can retrieve the sequences assembled by NCBI by requesting
the full GenBank record (automatically set up in Bio::DB::GenBank) or
requesting return format 'gbwithparts' when using eutils.
To retrieve the parsed data from a GenBank record in a Bio::Seq object:
my @contigs = $seq->annotation->get_Annotations('CONTIG');
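
Since the CONTIG text is only split by newline, anyone who wants the individual pieces has to reassemble and parse it themselves. Below is a rough sketch of one way to do that in plain Perl. The helper name parse_contig_chunks and the accessions are made up for illustration, and it assumes the chunks together form a single join(...) expression made of accession:start..end ranges, complement(...) wrappers and gap(...) entries. In real use the chunks would presumably come from something like map { $_->value } @contigs.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): glue the newline-split CONTIG
# chunks back together and break the join(...) expression into segments.
sub parse_contig_chunks {
    my @chunks = @_;
    my $text = join '', @chunks;
    $text =~ s/\s+//g;                 # drop line-wrap whitespace
    $text =~ s/^join\((.*)\)$/$1/;     # strip the outer join( ... )

    my @segments;
    for my $part (split /,/, $text) {
        if ($part =~ /^gap\((?:unk)?(\d*)\)$/) {
            # gap(100), gap(unk100) or gap()
            push @segments, { type => 'gap', length => $1 };
        }
        elsif ($part =~ /^(complement\()?([^:()]+):(\d+)\.\.(\d+)\)?$/) {
            # ACC.V:start..end, optionally wrapped in complement( ... )
            push @segments, {
                type   => 'seq',
                acc    => $2,
                start  => $3,
                end    => $4,
                strand => $1 ? -1 : 1,
            };
        }
        else {
            # anything this sketch does not understand is kept verbatim
            push @segments, { type => 'unknown', text => $part };
        }
    }
    return @segments;
}

# Made-up example chunks, split across two lines the way a record might be.
my @chunks = (
    'join(AAAA01000001.1:1..1500,gap(100),',
    'complement(AAAA01000002.1:1..2300))',
);
my @segments = parse_contig_chunks(@chunks);

for my $s (@segments) {
    if ($s->{type} eq 'seq') {
        printf "%s %d..%d strand %d\n", @{$s}{qw(acc start end strand)};
    }
    elsif ($s->{type} eq 'gap') {
        print "gap of length $s->{length}\n";
    }
    else {
        print "unparsed: $s->{text}\n";
    }
}
```

This only returns plain hashes; turning them into Bio::Location objects would be a separate step.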
If the complete .dat file is read as a single entry then there's
definitely a bug (the end of a seq record isn't detected), which is
possible since I only tested against single CON files. Could you
point out the dat file you checked so I can test it out?