[Bioperl-l] truncating a sequence and remapping annotations

Roy Chaudhuri roy.chaudhuri at gmail.com
Sat Aug 29 13:22:53 UTC 2009


Hi Joshua,

A couple of years ago I did implement (in a fairly hacky way) a
trunc_with_features method that does exactly this. It was incorporated
into Bio::SeqUtils and is still there as far as I know. Maybe it would
be suitable for your purposes?
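
For reference, the BioPerl call should be along the lines of
Bio::SeqUtils->trunc_with_features($seq, $start, $end), but please check
the module's POD for the exact behaviour. The coordinate arithmetic
behind it can be sketched in a language-neutral way as below. This is only
an illustration in Python, not the Bio::SeqUtils implementation: the
Feature class and both function names are made up for the example, and it
makes the simplifying assumption that features extending past the region
boundaries are simply dropped.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Feature:
    """Hypothetical feature record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: int = 1


def remap_feature(feat: Feature, region_start: int, region_end: int) -> Optional[Feature]:
    """Shift a feature so that region_start becomes position 1.

    Assumption for this sketch: a feature that is not fully contained
    in the region is dropped (None is returned) rather than clipped.
    """
    if feat.start < region_start or feat.end > region_end:
        return None
    offset = region_start - 1
    return Feature(feat.name, feat.start - offset, feat.end - offset, feat.strand)


def truncate_with_features(
    sequence: str, features: List[Feature], region_start: int, region_end: int
) -> Tuple[str, List[Feature]]:
    """Return the subsequence plus the features remapped onto it."""
    if not (1 <= region_start <= region_end <= len(sequence)):
        raise ValueError("region must lie within the sequence")
    # Convert 1-based inclusive coordinates to a Python slice.
    subsequence = sequence[region_start - 1:region_end]
    remapped = []
    for feat in features:
        moved = remap_feature(feat, region_start, region_end)
        if moved is not None:
            remapped.append(moved)
    return subsequence, remapped


# Example using the 150..250 region mentioned earlier in the thread.
genome = "A" * 300
annotations = [Feature("exon1", 160, 200), Feature("exon2", 240, 260)]
region_seq, region_features = truncate_with_features(genome, annotations, 150, 250)
```

Here exon1 (160..200) ends up at 11..51 on the 101 nt extract, while exon2
is left out because it runs past position 250. Doing this once per gene
while iterating over the features of the whole record gives the per-gene
records described below.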

Roy.

2009/8/28 Joshua Orvis <jorvis at gmail.com>:
> I should weigh in here since I am the above-mentioned 'user' who posed the
> question in #bioperl.
>
> To clarify, to train one particular gene finder I need to take a full
> genbank file with annotation for a whole genome and create separate gbk
> records, one for each gene.  Each record will then contain the gene, exon
> coordinates for the CDS and sequence for the gene.
>
> I can iterate through the features of the full record and do the math myself
> for each spliced coordinate, making/writing individual records as I go, but
> thought I would see if BioPerl had any mechanism to extract a region of an
> annotated record and treat the starting base of that extraction as position
> 1, recoordinating all the other features that were present.  Then I could
> just iterate through the features of the whole entry, extracting regions for
> each gene as I see them.
>
> Hopefully this makes sense.
>
> Joshua
>
> On Thu, Aug 27, 2009 at 2:41 PM, Jason Stajich <jason at bioperl.org> wrote:
>
>>
>> Yeah one thought that we batted around at a hackathon many moons ago had
>> been to use Bio::DB::SeqFeature in a lightweight way under the hood to
>> represent sequences more in layers, rather than the arbitrary data model
>> that was set up by focusing on handling GenBank records.  A lot of the architecture
>> development (that is like 10-15 years old now!) was initially just focused
>> on round-tripping the sequence files. We more recently felt like a new model
>> was more appropriate.  With the fast SQLite implementation that Lincoln has
>> put in for DB::SeqFeature we could in theory map every sequence into a
>> SQLite DB and then have the power of the interface.
>>
>> Some more bells and whistles might be needed but the basic API is respected
>> AFAIK and it prevents needing to store whole sequences in memory.  The
>> SeqIO->DB::SeqFeature loading would need some finessing so that as parsed
>> the sequence object could be updated efficiently.
>>
>> Actually this might also help reduce the number of objects needed to be
>> created by basically efficiently serializing sequences into the DB on
>> parsing (and with some simple caching this could make for a pretty fast
>> system).  Since disk is basically not a limitation now, this could be an
>> interesting experiment?  Maybe it is too out there, but if not it could be
>> something major enough that it has to go in a bioperl-2/bioperl-ng.   It
>> sort of assumes the data model of Bio::DB::SeqFeature is adequate for all
>> the messiness of sequence data formats and one problem for some people has
>> been the seq file format => GFF conversion needed to load it into a
>> SeqFeature DB for GBrowse... So I don't know what the boundary cases are
>> here.  Certainly for FASTA it should be straightforward.
>>
>> -jason
>>
>> On Aug 27, 2009, at 11:20 AM, Chris Fields wrote:
>>
>>> It's not implemented completely.  As Jason mentioned in the bug report, it
>>> was meant to be part of an overall system to truncate sequences with
>>> remapped features, but the implementation in place is substandard.  It's
>>> open for implementation if anyone wants to take it up.
>>>
>>> I should point out, though, that in my opinion Bio::DB::GFF/SeqFeature deal
>>> with this in a more elegant and lightweight way, and are probably the
>>> direction I would take.  YMMV.
>>>
>>> chris
>>>
>>> On Aug 27, 2009, at 12:40 PM, Robert Buels wrote:
>>>
>>>> Looks like bug 1572 is related to this:
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=1572
>>>>
>>>> Rob
>>>>
>>>> Robert Buels wrote:
>>>>
>>>>> Hi all,
>>>>> Recently a user came into #bioperl looking to truncate an annotated
>>>>> sequence (leaving the region between e.g. 150 and 250 nt), and have the
>>>>> annotations from the original sequence be remapped onto the new truncated
>>>>> sequence.
>>>>> Poking through code, I came across an undocumented function trunc() that
>>>>> from the comments looks like it was written by Jason as part of a master
>>>>> plan to implement this very functionality.
>>>>> Just wondering, what's the status of that?
>>>>> Rob
>>>>>
>>>>
>>>>
>>>> --
>>>> Robert Buels
>>>> Bioinformatics Analyst, Sol Genomics Network
>>>> Boyce Thompson Institute for Plant Research
>>>> Tower Rd
>>>> Ithaca, NY  14853
>>>> Tel: 503-889-8539
>>>> rmb32 at cornell.edu
>>>> http://www.sgn.cornell.edu
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>>
>> --
>> Jason Stajich
>> jason.stajich at gmail.com
>> jason at bioperl.org
>>
>>
>>
>



