[Bioperl-l] Bio::SeqIO::staden::read make test error

Wed May 11 11:58:35 EDT 2005

Roy Chaudhuri wrote:

>Hi all.
>
>I'm attempting to install bioperl-ext as I need to read in sequences
>from abi files. However the Bio::SeqIO::staden::read module fails the
>tests (the output from "perl Makefile.PL; make; make test" in the
>bioperl-ext/Bio/SeqIO/staden directory is pasted below).
>
>[...]
>
Clark Tibbetts paper:

http://www-2.cs.cmu.edu/afs/cs/project/genome/WWW/Papers/clark.html

tells how to parse an ABIF file. So you can do it all in perl. The 
script below, for instance prints out the sequence embedded in an ABIF 
file. The unedited sequence. I'm not sure how common editing of an .ab1 
file is anymore. But you could get the edited sequence with:

quikABIFdata( $file, "PBAS", 2, "A*" )

Actually, if the ABI basecaller used is the relatively recent "KB 
basecaller", then quality values are stored in the PCON record of the 
trace file. But you would need a different unpacking method to get them:

quikABIFdata( $file, "PCON", 1, "C*" )

(Also the quality values are returned as separate elements in the array, 
but sequence is returned as a string containing the whole sequence in 
the first element of the array. So you would either need to join "PCON" 
or split "PBAS" to get them into equivalent formats.)

I would like to write this as a method for bioperl, but I'm more of a 
bench scientist than a programmer. So I haven't learned how to write an 
object oriented program yet. Basically, I can use bioperl modules, but I 
can't write them.

Anyway, my point is that it only takes about 40 lines of perl to read 
and find a record in an ABI trace file. The actual chromatograms for 
each base are also accessible using this technique. They are present in 
the DATA1-4 records (raw data, stored as signed shorts --need to unpack 
with "s", but might be different if you are not using a "big-endian" 
machine). The processed trace data is in DATA9-12. That is the data that 
phred uses to do base calling. I unpack that with "n" but my primitive 
null padding method below would need to be modified to get the 2 digit 
tag numbered records.

#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0];
my  ($sequence)  = quikABIFdata( $file, "PBAS", 1, "A*" );
print $sequence,"\n";

sub quikABIFdata {
=pod
$_[]    parameter      default
0       filename
1       tag             TUBE
2       tagnum          0001
3       unpack_method   none
If unpacked, then a list is returned. If no unpacking is done
the whole blob is returned as a string.
The unpack methods are described in perldoc -f pack.
=cut
    my $tag = $_[1] || "TUBE";
    if ( defined $_[2] ) {
        my $tagnum = chr($_[2]);
        $tag .= "\000\000\000$tagnum";
    } else {
        $tag = $tag."\000\000\000\001";
    }
    my $in = slurpABIFfile( $_[0] ) || return undef;
    return undef unless (  my ( $refarraylen, $datptr ) =
        $in =~ /$tag.{8}(.{4})(.{4})/s  );
    $refarraylen = unpack ( "N", $refarraylen );

    my $dat;
    if ( $refarraylen > 4 ) {
        $datptr  = unpack ( "N", $datptr );
        $dat = substr ( $in, $datptr, $refarraylen );
    } else {
        $dat = $datptr;
    }

    return( $dat ) unless ( defined $_[3] );
    my @dat = unpack ( $_[3], $dat );
}
sub slurpABIFfile {
    return ( undef ) unless ( defined($_[0]) && -r $_[0] );
    undef $/;
    open(INPUTFILE, "$_[0]") or die "Can't open $_[0], $!\n";
    my $whole_trace_file = <INPUTFILE>;
}