[Bioperl-l] [ANNOUNCE] tab sequence file format
Heikki Lehvaslaiho
heikki at ebi.ac.uk
Thu Apr 17 14:53:40 EDT 2003
Philip Lijnzaad has written a new sequence format module called 'tab'.
It is in CVS. Here is the blurb he wrote:
The tab format puts each sequence on a single line (ID, tab, sequence),
which makes it very useful for large-scale work with the Unix command
line utilities (grep, sort, awk, sed, split, you name it). Imagine
that you have a format converter 'seqconvert' along the following
lines:
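use Bio::SeqIO;
use Getopt::Long;
GetOptions('from=s' => \my $from, 'to=s' => \my $to);  # e.g. -from fasta -to tab
my $in  = Bio::SeqIO->newFh(-fh => \*STDIN,  -format => $from);
my $out = Bio::SeqIO->newFh(-fh => \*STDOUT, -format => $to);
print $out $_ while <$in>;  # copy every sequence from input to output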
then you can very easily filter sequence files for duplicates as:
$ seqconvert < foo.fa -from fasta -to tab | sort -u |\
seqconvert -from tab -to fasta > foo-unique.fa
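To see why a plain `sort -u` is enough here, the following is a small sketch run on hand-written tab-format lines rather than real seqconvert output. It assumes the one-record-per-line layout (ID, tab, sequence); the IDs and sequences are made up for illustration.

```shell
# Three made-up records in tab layout: ID <TAB> sequence, one per line.
# seqA appears twice with an identical ID and sequence.
records=$(printf 'seqB\tGGCC\nseqA\tACGT\nseqA\tACGT\n')

# sort -u keeps a single copy of each identical line.
unique=$(printf '%s\n' "$records" | sort -u)
printf '%s\n' "$unique"
# prints two lines: "seqA<TAB>ACGT" and "seqB<TAB>GGCC"
```

Only records whose ID and sequence both match are collapsed; the same sequence stored under two different IDs would survive this filter.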
Or grep [-v] for certain sequences with:
$ seqconvert < foo.fa -from fasta -to tab | grep -v '^S[a-z]*control'|\
seqconvert -from tab -to fasta > foo-without-controls.fa
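The grep step can be tried the same way on a few made-up lines. This sketch assumes the ID is the first field of each line, so the `^` anchor ties the pattern to the ID and not to the sequence; the record names are hypothetical.

```shell
# Made-up tab-layout records: two control sequences and one real one.
records=$(printf 'Scontrol1\tACGT\nSnegcontrol2\tGGCC\ngene1\tTTAA\n')

# Drop every record whose ID starts with S, optional lowercase letters, then "control".
kept=$(printf '%s\n' "$records" | grep -v '^S[a-z]*control')
printf '%s\n' "$kept"
# prints one line: "gene1<TAB>TTAA"
```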
Or chop up a huge file with sequences into smaller chunks with:
$ seqconvert < all.fa -from fasta -to tab | split -l 10 - chunk-
$ for i in chunk-*; do seqconvert -from tab -to fasta < "$i" > "$i.fa"; done
# (this creates files chunk-aa.fa, chunk-ab.fa, ..., each containing
# 10 sequences)
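The chunk sizes are easy to check without any sequence data at hand. A minimal sketch, assuming one record per line and using 25 generated dummy records (the names and the temporary directory are illustrative only):

```shell
# Work in a throwaway directory so the chunk files do not clutter anything.
workdir=$(mktemp -d)

# Generate 25 dummy tab-layout records and split them 10 lines per file.
awk 'BEGIN { for (i = 1; i <= 25; i++) printf "seq%d\tACGT\n", i }' |
    split -l 10 - "$workdir/chunk-"

# Count the chunk files and the records in the last one.
nchunks=$(ls "$workdir"/chunk-* | wc -l | tr -d ' ')
last=$(wc -l < "$workdir/chunk-ac" | tr -d ' ')
echo "$nchunks chunks, last has $last records"
# prints "3 chunks, last has 5 records"
```

Because one line is one sequence, `split -l 10` can never cut a record in half, which it could easily do on a multi-line FASTA file.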