[Bioperl-l] GenBank files CONTIG line

Matthew Laird lairdm at sfu.ca
Thu Mar 19 06:09:01 UTC 2015


Hi all,

So I finally found time to dig in to this issue from last year, 
reflected in github issue #83, which was causing a significant slowdown 
in parsing genbank files. It turns out a change in Bio::PrimarySeqI made 
within the last year has some unintended consequences. Functions like 
trunc() and recom() now use $self->clone to make a new sequence object 
to return except for sequence objects of type Bio::Seq::LargePrimarySeq 
or Bio::Seq::LargeSeq, I think at the least Bio::Seq::RichSeq should be 
added to that list.

The case I'm seeing (simple driver attached to illustrate) is looping 
through a genbank file and doing a trunc() on the sequence based on the 
coordinates of an individual cds feature. By using clone() on the 
sequence first you're cloning the entire original sequence object 
including all the features - all the cds, gene, etc records - then 
simply writing the new sub-sequence in to this cloned object. I don't 
think that's what the user would expect to happen. It works fine for 
simple sequence objects, but for a rich sequence this is probably the 
wrong behaviour.  To see this uncomment the Dumper line in the driver 
below and you'll see all the returned truncated sub-sequence objects 
still have all the original features, including those outside of the 
target range. This issue is then further compounded if you try to do 
further manipulations on the returned sequence, assuming it'd be a 
simple sequence as it was in previous releases - suddenly you have a 
large sequence object with a lot of extra features to manipulate you 
might not have been expecting (the source of my original issue).

I'm creating a pull request adding Bio::Seq::RichSeq as a seq type not 
to use clone() for to trunc, but perhaps the change should also be made 
to all the other functions in Bio::PrimarySeqI that now use clone() 
since using clone on a rich sequence type in this situation has a very 
noticeable effect on performance. Alternatively on trunc() we should 
parse a rich sequence type to ensure all features in the returned object 
really are limited to the requested range, however that would have 
significant performance consequences.

Thoughts? Thanks gang!

On 14-09-16 02:16 PM, Matthew Laird wrote:
> Good afternoon,
>
> I wanted to report what I think is an issue but I'm not positive yet.  
> I found this old mailing list posting from May 
> (http://lists.open-bio.org/pipermail/bioperl-l/2014-May/071583.html) 
> about the changes to NCBI's genbank files, and I just grabbed the 
> latest bioperl live with August's patch to hopefully solve it. That 
> part worked great, instead of spewing a few GB of warns and the whole 
> sequence multiple times it read the genbank file and wrote out an embl 
> file perfectly fine.
>
> However the current bioperl live created a new issue.  I have a mirror 
> of NCBI's bacterial genomes directory (yes, I know, I need to move to 
> the new directory structure in the next 6 months) and this pipeline 
> takes the genbank file and makes the embl, ptt, faa, and fna as 
> needed.  This usually takes seconds.  Whatever changed in bioperl live 
> compared to BioPerl 1.6.922 causes the script to spin doing something 
> very intensely for tens of minutes, slowly writing out the ptt file.
>
> Simply copying genbank.pm from bioperl live to my install directory 
> solved both the CONTIG issue and kept the whole conversion process 
> speedy.  So I'm happy for now, but I wanted to mention this in case it 
> rings a bell with anyone on what could have changed to make parsing a 
> gbk in to a ptt so much less efficient now.
>
> Thanks.
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_bioperl.pl
Type: application/x-perl
Size: 737 bytes
Desc: not available
URL: <http://mailman.open-bio.org/pipermail/bioperl-l/attachments/20150318/97073d71/attachment.pl>


More information about the Bioperl-l mailing list