[Bioperl-l] Strange behaviour in the write_seq function for large fasta

David Gacquer dgacquer at ulb.ac.be
Wed Dec 21 13:26:07 UTC 2011


Dear BioPerl users/developers,

I am facing a strange issue with the $seq_out->write_seq function when 
using large fasta files

I have downloaded the hg19 chromosome 1, and applied the following code 
(basically I wanted to mask some regions in it but the problem also 
appears when copying the sequence without modifications):

sub main{
     my $seq_in  = Bio::SeqIO->new( -format => 'largefasta', -file => 
$ARGV[0]);
     my $seq_out  = Bio::SeqIO->new( -format => 'largefasta', -file => 
'>'.$ARGV[1]);
     my $seq_obj_in = $seq_in->next_seq();
     my $modified_seq = $seq_obj_in->seq();
     my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq => 
$modified_seq, -id  => $seq_obj_in->id, -desc => $seq_obj_in->desc);
     $seq_out->write_seq($seq_obj_out);
}

when checking the output fasta file, the sequence of chr1 is 1-bp shorter.

I have noticed that in the original fasta file, each line contains 
exactly 50 nucleotides, while the output of the $seq_out->write_seq 
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that 
the very last base was missing, I created the following fasta files:

chr121.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG

chr122.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG

They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last 
character being a G. When running the above code:

chr121.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

chr122.out.fa

 >chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG

The output for the 122 bp chromosome is correct (2 lines of 60 bp and 
the last line with 2 bp, AG) but for the 121 bp chromosome, the last 
character is missing (2 lines of 60 bp only, last G is missing).

When replacing -format => 'largefasta' by -format => 'fasta' or writing 
the output without the write_seq function however, the problem is solved.

Am I missing something or is there a problem with the write_seq function 
used with large fasta files? (I am running BioPerl on a Mac under OS X 
Snow Leopard)

Best regards

David

-- 
David Gacquer, Ph. D.

IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium

Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be




More information about the Bioperl-l mailing list