[Bioperl-l] [ANNOUNCE] tab sequence file format

Thu Apr 17 14:53:40 EDT 2003

Philip Lijnzaad has written a new sequence format module called 'tab'.
It is in CVS. Here is the blurb he wrote: 

It is very useful when doing large scale stuff using the Unix command
line utilities (grep, sort, awk, sed, split, you name it). Imagine
that you have a format converter 'seqconvert' along the following
lines:

 my $in  = Bio::SeqIO->newFh(-fh => \*STDIN , '-format' => $from);
 my $out = Bio::SeqIO->newFh(-fh=> \*STDOUT, '-format' => $to);
 print $out $_ while <$in>;

then you can very easily filter sequence files for duplicates as:

 $ seqconvert < foo.fa -from fasta -to tab | sort -u |\
       seqconvert -from tab -to fasta > foo-unique.fa

Or grep [-v] for certain sequences with:

 $ seqconvert < foo.fa -from fasta -to tab | grep -v '^S[a-z]*control'|\
       seqconvert -from tab -to fasta > foo-without-controls.fa

Or chop up a huge file with sequences into smaller chunks with:

 $ seqconvert < all.fa -from fasta -to tab | split -l 10 - chunk-
 $ for i in chunk-*; do seqconvert -from tab -to fasta <$i> $i.fa; done
  # (this creates files chunk-aa.fa, chunk-ab.fa, ..., each containing
  # 10 sequences)