[Bioperl-l] [Attachment Removed] GenBank files CONTIG line
Fields, Christopher J
cjfields at illinois.edu
Thu Mar 19 18:14:45 UTC 2015
Matthew,
The attachment appears to be scrubbed, can you send it directly or post it as a gist? Just need the code and maybe the NCBI ID of the test case.
The fix seems sane enough; for large files it might be easier to have a back-end store (e.g. Bio::DB::SeqFeature::Store) that can house the original data persistently and sections of the data output as needed. Not sure if you want to take that on :)
chris
> On Mar 19, 2015, at 1:09 AM, Matthew Laird <lairdm at sfu.ca> wrote:
>
> Hi all,
>
> So I finally found time to dig in to this issue from last year, reflected in github issue #83, which was causing a significant slowdown in parsing genbank files. It turns out a change in Bio::PrimarySeqI made within the last year has some unintended consequences. Functions like trunc() and recom() now use $self->clone to make a new sequence object to return except for sequence objects of type Bio::Seq::LargePrimarySeq or Bio::Seq::LargeSeq, I think at the least Bio::Seq::RichSeq should be added to that list.
>
> The case I'm seeing (simple driver attached to illustrate) is looping through a genbank file and doing a trunc() on the sequence based on the coordinates of an individual cds feature. By using clone() on the sequence first you're cloning the entire original sequence object including all the features - all the cds, gene, etc records - then simply writing the new sub-sequence in to this cloned object. I don't think that's what the user would expect to happen. It works fine for simple sequence objects, but for a rich sequence this is probably the wrong behaviour. To see this uncomment the Dumper line in the driver below and you'll see all the returned truncated sub-sequence objects still have all the original features, including those outside of the target range. This issue is then further compounded if you try to do further manipulations on the returned sequence, assuming it'd be a simple sequence as it was in previous releases - suddenly you have a large sequence object with a lot of extra features to manipulate you might not have been expecting (the source of my original issue).
>
> I'm creating a pull request adding Bio::Seq::RichSeq as a seq type not to use clone() for to trunc, but perhaps the change should also be made to all the other functions in Bio::PrimarySeqI that now use clone() since using clone on a rich sequence type in this situation has a very noticeable effect on performance. Alternatively on trunc() we should parse a rich sequence type to ensure all features in the returned object really are limited to the requested range, however that would have significant performance consequences.
>
> Thoughts? Thanks gang!
>
> On 14-09-16 02:16 PM, Matthew Laird wrote:
>> Good afternoon,
>>
>> I wanted to report what I think is an issue but I'm not positive yet. I found this old mailing list posting from May (http://lists.open-bio.org/pipermail/bioperl-l/2014-May/071583.html) about the changes to NCBI's genbank files, and I just grabbed the latest bioperl live with August's patch to hopefully solve it. That part worked great, instead of spewing a few GB of warns and the whole sequence multiple times it read the genbank file and wrote out an embl file perfectly fine.
>>
>> However the current bioperl live created a new issue. I have a mirror of NCBI's bacterial genomes directory (yes, I know, I need to move to the new directory structure in the next 6 months) and this pipeline takes the genbank file and makes the embl, ptt, faa, and fna as needed. This usually takes seconds. Whatever changed in bioperl live compared to BioPerl 1.6.922 causes the script to spin doing something very intensely for tens of minutes, slowly writing out the ptt file.
>>
>> Simply copying genbank.pm from bioperl live to my install directory solved both the CONTIG issue and kept the whole conversion process speedy. So I'm happy for now, but I wanted to mention this in case it rings a bell with anyone on what could have changed to make parsing a gbk in to a ptt so much less efficient now.
>>
>> Thanks.
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list