[Bioperl-l] Strange behaviour in the write_seq function for large fasta
David Gacquer
dgacquer at ulb.ac.be
Wed Dec 21 13:26:07 UTC 2011
Dear BioPerl users/developers,
I am facing a strange issue with the $seq_out->write_seq function when
using large fasta files
I have downloaded the hg19 chromosome 1, and applied the following code
(basically I wanted to mask some regions in it but the problem also
appears when copying the sequence without modifications):
sub main{
my $seq_in = Bio::SeqIO->new( -format => 'largefasta', -file =>
$ARGV[0]);
my $seq_out = Bio::SeqIO->new( -format => 'largefasta', -file =>
'>'.$ARGV[1]);
my $seq_obj_in = $seq_in->next_seq();
my $modified_seq = $seq_obj_in->seq();
my $seq_obj_out = Bio::Seq::LargePrimarySeq->new( -seq =>
$modified_seq, -id => $seq_obj_in->id, -desc => $seq_obj_in->desc);
$seq_out->write_seq($seq_obj_out);
}
when checking the output fasta file, the sequence of chr1 is 1-bp shorter.
I have noticed that in the original fasta file, each line contains
exactly 50 nucleotides, while the output of the $seq_out->write_seq
function contains always 60 characters per line.
chr1 is exactly 249,250,621 bp (4,154,177 * 60 + 1) so to verify that
the very last base was missing, I created the following fasta files:
chr121.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAG
chr122.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAG
They contain respectively 121 (60*2+1) and 122 (60*2+2) bp, the last
character being a G. When running the above code:
chr121.out.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr122.out.fa
>chrA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AG
The output for the 122 bp chromosome is correct (2 lines of 60 bp and
the last line with 2 bp, AG) but for the 121 bp chromosome, the last
character is missing (2 lines of 60 bp only, last G is missing).
When replacing -format => 'largefasta' by -format => 'fasta' or writing
the output without the write_seq function however, the problem is solved.
Am I missing something or is there a problem with the write_seq function
used with large fasta files? (I am running BioPerl on a Mac under OS X
Snow Leopard)
Best regards
David
--
David Gacquer, Ph. D.
IRIBHM - Universite Libre de Bruxelles
Bldg C, room C.4.117
ULB, Campus Erasme, CP602
808 route de Lennik
B-1070 Brussels
Belgium
Phone: +32-2-555 4187
Fax: +32-2-555 4655
E-mail: dgacquer at ulb.ac.be
More information about the Bioperl-l
mailing list