[Bioperl-l] Bio::SeqIO::staden::read make test error
Phillip San Miguel
pmiguel at purdue.edu
Wed May 11 11:58:35 EDT 2005
Roy Chaudhuri wrote:
>Hi all.
>
>I'm attempting to install bioperl-ext as I need to read in sequences
>from abi files. However the Bio::SeqIO::staden::read module fails the
>tests (the output from "perl Makefile.PL; make; make test" in the
>bioperl-ext/Bio/SeqIO/staden directory is pasted below).
>
>[...]
>
Clark Tibbetts paper:
http://www-2.cs.cmu.edu/afs/cs/project/genome/WWW/Papers/clark.html
tells how to parse an ABIF file. So you can do it all in perl. The
script below, for instance prints out the sequence embedded in an ABIF
file. The unedited sequence. I'm not sure how common editing of an .ab1
file is anymore. But you could get the edited sequence with:
quikABIFdata( $file, "PBAS", 2, "A*" )
Actually, if the ABI basecaller used is the relatively recent "KB
basecaller", then quality values are stored in the PCON record of the
trace file. But you would need a different unpacking method to get them:
quikABIFdata( $file, "PCON", 1, "C*" )
(Also the quality values are returned as separate elements in the array,
but sequence is returned as a string containing the whole sequence in
the first element of the array. So you would either need to join "PCON"
or split "PBAS" to get them into equivalent formats.)
I would like to write this as a method for bioperl, but I'm more of a
bench scientist than a programmer. So I haven't learned how to write an
object oriented program yet. Basically, I can use bioperl modules, but I
can't write them.
Anyway, my point is that it only takes about 40 lines of perl to read
and find a record in an ABI trace file. The actual chromatograms for
each base are also accessible using this technique. They are present in
the DATA1-4 records (raw data, stored as signed shorts --need to unpack
with "s", but might be different if you are not using a "big-endian"
machine). The processed trace data is in DATA9-12. That is the data that
phred uses to do base calling. I unpack that with "n" but my primitive
null padding method below would need to be modified to get the 2 digit
tag numbered records.
#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0];
my ($sequence) = quikABIFdata( $file, "PBAS", 1, "A*" );
print $sequence,"\n";
sub quikABIFdata {
=pod
$_[] parameter default
0 filename
1 tag TUBE
2 tagnum 0001
3 unpack_method none
If unpacked, then a list is returned. If no unpacking is done
the whole blob is returned as a string.
The unpack methods are described in perldoc -f pack.
=cut
my $tag = $_[1] || "TUBE";
if ( defined $_[2] ) {
my $tagnum = chr($_[2]);
$tag .= "\000\000\000$tagnum";
} else {
$tag = $tag."\000\000\000\001";
}
my $in = slurpABIFfile( $_[0] ) || return undef;
return undef unless ( my ( $refarraylen, $datptr ) =
$in =~ /$tag.{8}(.{4})(.{4})/s );
$refarraylen = unpack ( "N", $refarraylen );
my $dat;
if ( $refarraylen > 4 ) {
$datptr = unpack ( "N", $datptr );
$dat = substr ( $in, $datptr, $refarraylen );
} else {
$dat = $datptr;
}
return( $dat ) unless ( defined $_[3] );
my @dat = unpack ( $_[3], $dat );
}
sub slurpABIFfile {
return ( undef ) unless ( defined($_[0]) && -r $_[0] );
undef $/;
open(INPUTFILE, "$_[0]") or die "Can't open $_[0], $!\n";
my $whole_trace_file = <INPUTFILE>;
}
More information about the Bioperl-l
mailing list