[Bioperl-l] Genscan exon frame computation

Mon, 29 Jan 2001 09:24:03 -0800

A revisit of this is on the task list. I had a discussion a while
ago with Mark Dalphin, because he claimed that he managed to
figured out the exon frame based on start coordinate and frame
value.

I still don't fully understand his code sample, as he was also
using his own definition of frame. Still, the discussion let me
see how one can figure out the frame. I've enclosed the relevant
code section of my implementation below. Whoever feels in the
position please review and double-check.

This will add a frame attribute to each individual exon, which
makes it possible to deliberately shuffle exons from one
prediction (for those who aren't aware: Genscan with default
parameters outputs only exons in the 'optimal path'; there may be
other exons which also achieve very good scores and the output of
which can be triggered by -subopt).

Things still to do in this respect comprise of a rigorous test
(take all exons of each prediction, translate them individually in
the frame they've been assigned, and check that there are no
intervening stops) and an adaptation of cds() in GeneStructure.pm
(when concatenating exons, make sure that the frame of one and
frame/phase of the previous match, and if not, fill with Ns).

If anyone volunteers to add the test to Genpred.t I'd be really
glad. This does not involve module design, just plain application
coding, and anyone literate in Perl/Bioperl should be able to jump
in here.

Comments welcome, esp. regarding the cds() comment I made above.

	Hilmar
-- 
-----------------------------------------------------------------
Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757
-----------------------------------------------------------------

# Figure out the frame of this exon. This is NOT the frame
# given by Genscan, which is the absolute frame of the base
# starting the first predicted complete codon. By comparing
# to the absolute frame of the first base we can compute the
# offset of the first complete codon to the first base of the
# exon, which determines the frame of the exon.
my $cod_offset;
if($predobj->strand() == 1) {
    $cod_offset = $flds[6] - (($predobj->start()-1) % 3);
    # Possible values are -2, -1, 0, 1, 2. -1 and -2 correspond
    # to offsets 2 and 1, resp. Offset 3 is the same as 0.
    $cod_offset += 3 if($cod_offset < 1);		    
} else {
    # On the reverse strand the Genscan frame also refers to
    # the first base of the first complete codon, but viewed
    # from forward, which is the third base viewed from
    # reverse.
    # Note that end() is in fact start() here because we always
    # annotate in forward direction (otherwise we wouldn't need
    # strand()).
    $cod_offset = $flds[6] - (($predobj->end()-3) % 3);
    # Possible values are -2, -1, 0, 1, 2. Due to the reverse
    # situation, {2,-1} and {1,-2} correspond to offsets
    # 1 and 2, resp. Offset 3 is the same as 0.
    $cod_offset -= 3 if($cod_offset >= 0);
    $cod_offset = -$cod_offset;
}
# Offsets 2 and 1 correspond to frame 1 and 2 (frame of exon
# is the frame of the first base relative to the exon, or the
# number of bases the first codon is missing).
$predobj->frame(3 - $cod_offset);