[Bioperl-l] Bug in Contig.pm? How to compare two sequence objects?

Robson de Souza robfsouza at gmail.com
Mon Jul 26 17:54:22 UTC 2010


On Mon, Jul 26, 2010 at 12:37 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
> Thanks again Robson,
>
> One thing I was going to email the list about is a lack of tests for
> ace.pm and phrap.pm ... because of missing libs I'm not running the
> sam or the maq tests, so any changes that I'm making to Contig.pm /
> Scaffold.pm are not really being tested.
>
> Do you think anyone on the list (or you) would be able to generate a
> few hundred tests to help me start modifying code with confidence?

I hope so. I was reading Chris's presentation at BOSC and he mentioned
that people are working on the support for the samtools so I believe
somebody must have a dataset they could share.

> One idea was to actually make Contig.pm implement SeqFeatureI, so that
> you could call start and end and seq on the contig (consensus)
> directly.
>
> Here are some notes from IRC...
>
> 04:00 < dbolser> I think "$contig_object->start" is totally reasonable
> 04:01 < dbolser> otherwise its something like
> "$contig_object->get_feature_collection->get_feature_by_id("_main_contig_feature:".$contig_object->id)->start"
> 04:02 < dbolser> (untested ;-)
> 09:01 < genehack> dbolser: i think you've got an argument that there should be;
>                  it's probably going to be a lot easier to do that in
>                  something like Biome (i.e., BioPerl in Moose)

Never thought about it... but I can see that could be useful when
contigs are part of higher level assemblies, like scaffolds or
assembled chromosomes, but you could also achieve this by making
Contig.pm a Bio::RangeI or, perhaps better, a Bio::LocatableSeq. Any
reason to prefer the Bio::SeqFeatureI interface?

> Any reason not to cc this to the list?

Lack of attention. I was so concerned about the answer that I didn't
notice I pressed "Reply to" :)
Best,
Robson

> Thanks again.
>
> All the best,
> Dan.
>
>
>
>
> On 26 July 2010 17:23, Robson de Souza <robfsouza at gmail.com> wrote:
>> Hi,
>>
>> On Sun, Jul 25, 2010 at 12:35 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>> Cheers for the clarification Robson.
>>
>> You are welcome :). Thanks for improving my long abandoned child... :)
>>
>>> How come the 'SYNOPSIS' code does produce warnings about replacing the
>>> seq? (the workaround is easy enough, don't add_seq, but still...)
>>
>> I don't know. I remember testing it while I develop Contig.pm in
>> parallel to ace.pm and phrap.pm and not seeing unexpected warnings...
>> Did you try ace.pm recently? It uses set_seq_coord and this method
>> calls add_seq, while phrap.pm uses add_seq directly, but I'm unaware
>> of previous bug reports on ace.pm regarding unexpected warnings...
>> note that set_seq_coord calls remove_seq before adding but prints
>> another warning. Compare set_seq_coord to add_seq alone and maybe you
>> will figure out what is going on...
>>
>> Regarding why you can't just add Bio::LocatableSeq objects and have
>> add_seq just set the coordinates by itself, I investigated my long
>> forgotten code and just realized that the whole problem was that the
>> coordinates of a Bio::LocatableSeq object represent the location of
>> the object in the original sequence (i.e. unaligned read coordinates)
>> instead of its coordinates in the aligned consensus sequence.
>> Bio::LocatableSeq issues a warning whenever you try to set the object
>> start and end coordinates to the gapped consensus coordinates because
>> such coordinates lead to a length shorter than the actual read length:
>>
>> use Bio::LocatableSeq;
>> use Bio::SeqFeature::Collection;
>>  my $ls = Bio::LocatableSeq->new(-start=> 100500, -end => 100503, -seq
>> =>"agggcACT-Gccc", -id=>"uga");
>> my $sfc = Bio::SeqFeature::Collection->new();
>> $sfc->add_features([ $ls ]);
>> print $sfc->feature_count,"\t",$ls->length."\n";
>>
>> Additionally, Bio::LocatableSeq objects do not support the
>> primary_tag() method because they are just Bio::RangeI and not
>> Bio::SeqFeatureI objects. These two issues prevented me from using
>> Bio::LocatableSeq coordinates to represent gapped consensus
>> coordinates but the example code above suggests suppressing
>> Bio::LocatableSeq warnings and using the object class
>> ($feat->isa(Bio::LocatableSeq"")) instead of a regular expression on
>> the primary tag ("_unaligned_coord:$readID") would be enough to avoid
>> the need for extra Bio::SeqFeature::Generic objects to store read
>> coordinates.
>>
>>> Since you said 'feel free to act', I have been faffing around here:
>>> http://github.com/dbolser/bioperl-live
>>> I'm not really sure if that is so useful, please advise.
>>
>> The comments I made above points to one area where you could have some
>> improvements in both memory usage and usability: drop the need for
>> additional Bio::SeqFeatureI objects to represent gapped consensus
>> coordinates. It would require you to check all Contig.pm methods for
>> the use of the "_aligned_coord:$readID" tags (I don't expect to see
>> any references to this outside Contig.pm) and change set_seq_coord to
>> a method that changes/sets the start and end values of the
>> Bio::LocatableSeq objects.
>> This might be a bit of work so just go for it if you want and have
>> time to experiment.
>>
>> Another detail: could you please replace references in Contig.pm and
>> other Bio::Assembly classes POD to Bio::Assembly::* classes that do
>> not exists to the correct Bio::* classes? Somebody changed this to the
>> wrong names long ago...
>>
>>> Thanks again for help,
>>
>> Well these are my thoughts on the matter. Write again if you decide to
>> take this on and are unable to make progress.
>> Best,
>> Robson
>>
>>> Dan.
>>>
>>>
>>> P.S.
>>> How come you are not in irc://irc.freenode.net/#bioperl ;-)
>>>
>>>
>>> On 25 July 2010 15:42, Robson de Souza <robfsouza at gmail.com> wrote:
>>>> Hi Dan,
>>>>
>>>> It is been a long time since I last loooked at this but, if I remember
>>>> correctly, the point is that Bio::
>>>>
>>>> On Sun, Jul 25, 2010 at 9:23 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>>>> The following bug report boils down to this question:
>>>>> How should two sequence objects be compared for identity? Does the
>>>>> object override 'eq' or implement an 'identical' method?
>>>>
>>>> I think an 'identical' or 'equal' method would be the best alternative
>>>> since having a full method call would allow passing arguments like
>>>> '-mode => "complete"' to check all sequence features and annotations
>>>> if they exist and '-mode => "basic"' to check id() and seq() values.
>>>> Bio::Assembly::Contig depends mostly on the last one, although only
>>>> id() is tracked most of the time (because of the internal hashes).
>>>>
>>>>> I found the following apparent bug in Contig.pm while executing the
>>>>> documented 'SYNOPSIS' code:
>>>> [snip]
>>>>> It seems to be a bug in the documented behaviour of set_seq_coord:
>>>>>        "If the sequence was previously added using add_seq, its
>>>>> coordinates are changed/set.  Otherwise, add_seq is called and the
>>>>> sequence is added to the contig."
>>>>
>>>> In fact, it should not print warnings all the time....
>>>>
>>>>> The offending line in that function seems to be:
>>>>>  if( ... &&
>>>>>      ($seq ne $self->{'_elem'}{$seqID}{'_seq'}) ) {
>>>>>          ... <spew warnings>
>>>>>  }
>>>>>  $self->add_seq($seq);
>>>>> which compares the *passed* sequence object to the sequence string for
>>>>> the *stored* sequence object of the same name. This comparison is
>>>>> always fails if I understood correctly, therefore set_seq_coord always
>>>>> spews warnings if called after add_seq.
>>>>
>>>> Not the sequence string, but the objects themselves, i.e. the string
>>>> perl uses to represent Bio::LocatableSeq objects... it is a memory
>>>> based version of identical() :)
>>>>
>>>>> Out of curiosity, how come I can't just say:
>>>>> my $ls = Bio::LocatableSeq->
>>>>>  new( -seq      => 'ACCG-T',
>>>>>       -id       => 'r1',
>>>>>       -alphabet => 'dna'
>>>>>       -start    => 3,
>>>>>       -end      => 8,
>>>>>       -strand   => 1
>>>>>     );
>>>>> $c->add_seq( $ls );
>>>>
>>>> Oh, I don't remember but it was either a bad design decision I made 8
>>>> years ago to acommodate the Bio::Align::AlignI interface or a problem
>>>> with Bio::SeqFeature::Collection at that time. Whatever the case, it
>>>> would be nice to change it... you just need to create a
>>>> Bio::SeqFeature::Generic when
>>>> add_seq is called. I just won't have time to do it myself so feel free to act...
>>>>
>>>> Best,
>>>> Robson
>>>>
>>>>> I hope the above report can be of some use.
>>>>>
>>>>> Sincerely,
>>>>> Dan.
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>
>>
>




More information about the Bioperl-l mailing list