[Bioperl-l] Bio::Coordinate::Collection could DoWhatIMean better (w/ patch)

George Hartzell hartzell at alerce.com
Fri Sep 5 19:01:35 UTC 2008


Hi all,

Bio::Coordinate::Collection surprised me a bit.  At first I thought
there was a bug, but it's clearly doing what it's supposed to.  Now
I'm wondering if what it's supposed to be doing makes sense in some
context, or if what I expected would be better functionality.

t/CoordinateMapper.t sets up the following scenario:

#
# Collection
#
#         1   5     6   10
#         |---|     |---|
#-----|-----------------------
#     1   5   9     15  19
#         pair1     pair2

Then goes on to do the following query:

  # match more than two
  $pos = Bio::Location::Simple->new (-start => 5, -end => 19 );
  ok $res = $transcribe->map($pos);
  is $res->each_gap, 2;
  is $res->each_match, 2;

I was surprised to see that there were two gaps, one gene:10-19 and
one from gene:5-14.  Looking at the code, what's really happening is
that, for the exon1 mapper there's match with gene:5-9 and a gap with
gene:10-19 and for the exon2 mapper there's a gap with gene:5-14 and a
match with gene:15-19.  All four Result's just get tossed into the
return value.

The result my intuition wants is that there are two matches (gene:5-9
with exon1 and gene:15-19 with exon2) and a gap (gene:10-14).

Yes, I guess that I could just synthesize these myself from the result
in my app.

It still seems that the current result is a bug though, since there's
no way of knowing when you're walking through $res->each_Location that
the first "gap" is with respect to the exon1 mapper and that the
second "gap" is with respect to the exon2 mapper.  The gaps are
meaningless.

I "fixed" it to work the way I think it should (two matches, one
gap).  I actually extended the test case a bit so that there's a
multi-base gap, a match, another multibase-gap, another match, then a
single base gap (just to make sure I got that right...).  I had to
touch up the test file a bit to account for my new test.

The gaps that I return have a strand of 'undef', which seems to be The
Right Thing.  There's also a bit of funny business where I hang onto
the seq_id of the gapped sequence.  It assumes that the "in" sequence
is the same for all of the mappers.  This seems safe since otherwise
the entire query is kind of weird....

There's a patch to todays svn head at:

  http://shrimp.alerce.com/bioperl/collection-diffs.txt

The patch changes Build.PL to include a dependency on Set::IntSpan,
CoordinateMapper.t to update the tests, and
Bio/CoordinateMapper/Collection.pm for the new code.

Who's code would this break.

If anyone's relying on the current behaviour re: gaps, what's the
situation in which you find it useful?

Thanks!

g.



More information about the Bioperl-l mailing list