fasta splitter

Tue Oct 8 22:51:16 UTC 2002

+----[ Thus spoke Peter Rice (peter.rice at uk.lionbioscience.com):
|

[ snipped ]

| 
| One problem doing this in EMBOSS is the need to generate filenames for your 
|  split files - but maybe a base filename would be enough to generate 
| names. 

Let me join the discussion. The splitter I use is called
'shatter' and is part of the SEALS package, which I guess is
unmaintained (and perhaps obsolete?) and is basically Perl.
ftp://ftp.ncbi.nih.gov/pub/walker/seals/software

The following discussion applies to splitting into individual
sequences, but not into groups of sequences. In that case a
different naming scheme should be used (though perhaps the
same argument specifier '-word' could be used?).
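
For the grouped case, one possible naming scheme is a base
filename plus a running counter. The sketch below is only an
illustration of that idea: the function name, the 'split'
base name and the zero-padded counter are my own assumptions,
not anything shatter or EMBOSS actually does.

```python
def group_fasta(text, per_file, base="split", ext=".fa"):
    """Return {filename: content} with at most per_file sequences per file.

    Hypothetical naming scheme: <base>_<counter><ext>, e.g. split_0001.fa.
    """
    if per_file < 1:
        raise ValueError("per_file must be at least 1")

    # Collect each record as a list of lines, starting at its '>' header.
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records:
            # Sequence line belonging to the most recent header;
            # anything before the first header is ignored.
            records[-1].append(line)

    # Slice the records into consecutive groups and name each group.
    groups = {}
    for start in range(0, len(records), per_file):
        name = f"{base}_{start // per_file + 1:04d}{ext}"
        chunk = records[start:start + per_file]
        groups[name] = "".join(l + "\n" for rec in chunk for l in rec)
    return groups
```

Returning a mapping instead of writing files keeps the naming
decision separate from the question of cleaning up old output.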

The approach of shatter (for splitting FASTA files, and also
of 'shatterblast', which splits concatenated BLAST reports)
is to let you choose the 'word' which will be used as a
basename. Both shatters know about the NCBI FASTA standard
and thus, given a FASTA header like the following:
>gi|123456|gb|AA123456|AA123456.1 Homo sapiens protein X etc

will take the gi as word 2 (123456), the accession number
(AA123456) as word 4, the accession.version (AA123456.1) as
word 5 and so on. 

On the command line you just say 'shatter -word 1 fastafile'
if you want the first word after the '>' to be the basename.

This produces files named after that basename, ending in .fa

The program will consider whitespace and the character '|'
as word delimiters.

In my own experience this is a good thing. I've used shatter
with many different FASTA flavours and adjusting the word to
be used as basename is plain easy.
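
To make the word-as-basename idea concrete, here is a minimal
sketch in Python. It is not shatter's code (shatter is Perl
and I have not copied anything from it); the function names
and the 'word', 'outdir' and 'ext' parameters are made up for
the example. The only things taken from the description above
are the delimiters (whitespace and '|'), the 1-based word
numbering and the .fa ending.

```python
import re
from pathlib import Path

# Whitespace and '|' both act as word delimiters, as described above.
_DELIMS = re.compile(r"[\s|]+")


def header_word(header, n):
    """Return the n-th word (1-based) of a FASTA header line."""
    words = [w for w in _DELIMS.split(header.lstrip(">").strip()) if w]
    if not 1 <= n <= len(words):
        raise ValueError(f"header has {len(words)} words, cannot take word {n}")
    return words[n - 1]


def split_fasta(text, word=1, outdir=".", ext=".fa"):
    """Write one file per sequence, named after the chosen header word.

    Returns the list of file names written, in input order.
    Two headers that yield the same word overwrite each other.
    """
    written = []
    handle = None
    try:
        for line in text.splitlines():
            if line.startswith(">"):
                # A new header closes the previous file and opens the next.
                if handle is not None:
                    handle.close()
                path = Path(outdir) / (header_word(line, word) + ext)
                handle = open(path, "w")
                written.append(path.name)
            if handle is not None:
                # Lines before the first header are silently skipped.
                handle.write(line + "\n")
    finally:
        if handle is not None:
            handle.close()
    return written
```

With the example header above, header_word(header, 2) gives
the gi and header_word(header, 4) the accession number, so
split_fasta(text, word=4) would name each file after its
accession.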

BLAST reports are also trivial, since query sequences are
usually in FASTA format too, and you get basically the same
header, though after the 'Query=' magic word. In this case
you get files with the same basename, but ending in .br
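
The same word rule applied to a 'Query=' line can be sketched
as follows. Again this is only an illustration with a made-up
function name, not shatterblast itself, and it deliberately
leaves out how the boundaries between concatenated reports
are found.

```python
import re


def query_basename(line, n=1, ext=".br"):
    """Build an output file name from a BLAST 'Query=' line.

    Uses the same delimiters as for FASTA headers (whitespace and '|')
    and 1-based word numbering.
    """
    prefix = "Query="
    if not line.startswith(prefix):
        raise ValueError("not a 'Query=' line")
    words = [w for w in re.split(r"[\s|]+", line[len(prefix):]) if w]
    if not 1 <= n <= len(words):
        raise ValueError(f"line has {len(words)} words, cannot take word {n}")
    return words[n - 1] + ext
```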

Just my 2 cents. Hope this makes it into EMBOSS.

Fernan

| and change the output file. You can add a command line option for the 
| number of sequences in an output file. Cleaning up output files for a rerun 
| is an exercise for the user (unless you want to invent a new ACD type that 
| does it :-)
| 
| Needs a modified version of the seqFileReopen function to handle the file 
| naming, but nothing complicated is involved.
| 
| regards.
| 
| Peter
| 
| -- 
| ------------------------------------------------
| Peter Rice, LION Bioscience Ltd, Cambridge, UK
| peter.rice at uk.lionbioscience.com +44 1223 224723
| 
| 
|
+----]

-- 
F e r n a n   A g u e r o
http://genoma.unsam.edu.ar/~fernan