[Bioperl-l] "Command-Line Bioinformatics"
Jay Hannah
Jay at jays.net
Wed Feb 28 09:39:55 UTC 2007
Reading this article:
http://www.linuxjournal.com/article/6977
Sequencing the SARS Virus - Linux Journal, Nov 2003
This guy needs Perl and/or BioPerl. :)
> The sequence file is in FASTA format consisting of a header line
> and the sequence, split into fixed-width lines. The following
> counts the number of Gs and Cs in the sequence and presents the
> total as a fraction of the total number of bases:
>
> > grep -v "^>" AY274119.fa | fold -w 1 |
> tr "ATGC" "..xx" | sort | uniq -c |
> sed 's/[^0-9]//g' | t -s "\012" " " |
> sed 's/\([0-9]*\) \([0-9]*\)/scale = 3;
> ↪\2 \/ (\1+\2)/' |
> bc -i
> scale = 3; 12127 / (17624+12127)
> .407
>
> Out of the 29,751 bases in our sequence, 12,127 are either G or C,
> giving a GC content of 41%.
BioPerl version:
use Bio::SeqIO;
my $io = Bio::SeqIO->new(
-file => 'AY274119.fa',
-format => 'Fasta'
);
my $seq = $io->next_seq->seq;
print ( ($seq =~ tr/GC/GC/) / length ($seq) );
Command-line Perl:
perl -e '$/ = undef; $_ = <>; s/>.*//; s/\n//g; print tr/GC/GC/ /
length($_)' AY274119.fa
I'm sure you can Perl Golf my stabs at it. :)
j
seqlab.net
http://www.bioperl.org/wiki/User:Jhannah
More information about the Bioperl-l
mailing list