fasta splitter

Tue Oct 8 19:00:12 UTC 2002

There's more than one way to split a fasta file...

1.  Split M entries into N files, file 1 receives 1->M/N,
file 2 receives M/N+1->2M/N, etc. Advantages - only one
file needs to be open at a time, simple.  Disadvantage -
the resulting split is typically uneven.  Do this with the
NCBI databases and you'll find that they are heavily weighted
toward shorter sequences at the beginning and longer ones at the
end.  If the point of the split is to load balance (this is
what I use it for, with parallel BLAST) some nodes will finish
much earlier than others. Implementation: (deleted, I found
this method not to be generally useful)
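
A minimal sketch of (1) in Python, for illustration only.  This is
not the deleted implementation; read_fasta and split_contiguous are
hypothetical names, and the whole input is held in memory so that M
is known before splitting.

```python
def read_fasta(handle):
    """Yield (header_line, sequence_lines) for each entry in a FASTA stream."""
    header, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def split_contiguous(entries, n):
    """Method (1): chunk k (0-based) gets entries k*M/N up to (k+1)*M/N - 1."""
    entries = list(entries)  # M must be known up front
    m = len(entries)
    return [entries[k * m // n:(k + 1) * m // n] for k in range(n)]
```

Counting residues per chunk on a file sorted short to long shows
the imbalance described above: equal entry counts, very unequal
amounts of sequence.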

1b.  head/tail/segment entries out of a fasta file.  While (1)
caused a lot of problems I've often needed to chop out a specific
part of a fasta file.  Why?  Because some piece of software was
blowing up on the 351,234th entry, but only if preceded by several
thousand other entries. Finding the smallest piece that will trigger
the bug can save hours of run time when debugging these sorts of
problems.
Implementation:

   ftp://saf.bio.caltech.edu/pub/software/molbio/fastarange.c
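
The idea in (1b) can be sketched in a few lines of Python.  This is
not the fastarange.c interface; the function name and its arguments
are hypothetical, and entries are assumed to be numbered from 1.

```python
def fasta_range(handle, first, last):
    """Yield the lines of entries first..last (1-based, inclusive)."""
    count = 0
    for line in handle:
        if line.startswith(">"):
            count += 1
            if count > last:
                break  # past the requested range, stop reading
        if first <= count <= last:
            yield line
```

Halving the range repeatedly and rerunning the failing program on
each piece is one way to narrow down the smallest input that still
triggers the bug.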

2.  Split M entries into N files, cycling output to each file.
That is, entry i goes to file i modulo N.  Advantage - resulting
files tend to be more even in size.  Disadvantage - N output files
must be open at once (or you have to cycle through N times, once
per phase); if M is small and each entry is large, the
resulting files will not generally be balanced.  Example: splitting
the yeast genome.  Heaven help us when full length human chromosomes
start showing up as single FASTA file entries.  Implementation:

  ftp://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c
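
A rough Python sketch of the cycling split in (2), assuming entries
are counted from 0 and that N output files can be held open at once.
It is not a port of fastasplitn.c; split_round_robin and out_paths
are hypothetical names.

```python
from contextlib import ExitStack


def split_round_robin(handle, out_paths):
    """Method (2): entry i (0-based) is written to out_paths[i % N]."""
    with ExitStack() as stack:
        # All N output files stay open for the whole pass.
        outs = [stack.enter_context(open(p, "w")) for p in out_paths]
        index = -1
        for line in handle:
            if line.startswith(">"):
                index += 1
            if index >= 0:  # skip anything before the first header
                outs[index % len(outs)].write(line)
```

If N exceeds the number of files the system allows open at once,
the alternative mentioned above (N passes, one phase per pass)
would be needed instead.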

3.  Split P bases in M entries into N files "evenly", fragmenting
sequences if they are too large.  Advantage:  fixes the genome
data problem from (2). Disadvantages:  even more complex than
(2) and "entries" in resulting files do not correspond one to
one with the original. Even with clever naming conventions 
(yeastII_100001_200000) end users will be confused.  Clever
names will be truncated by most software at the worst possible
place resulting in a "hit" on "yeastII_" :-(.  Implementation:
(well, partially, this one translates in all 6 frames, but
it has some of the naming/fragmenting features):

   ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c
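
One possible shape for (3), sketched in Python.  The fragment names
follow the name_start_end convention mentioned above with 1-based
inclusive coordinates.  The greedy "put it in the emptiest bin" rule
and the max_len parameter are assumptions made for this example, and
nothing here is taken from fasttrans.c.

```python
def fragment(name, seq, max_len):
    """Yield (fragment_name, piece) for pieces of at most max_len bases."""
    for start in range(0, len(seq), max_len):
        piece = seq[start:start + max_len]
        yield "%s_%d_%d" % (name, start + 1, start + len(piece)), piece


def split_by_bases(entries, n, max_len):
    """Method (3): give each fragment to the bin holding the fewest bases."""
    bins = [[] for _ in range(n)]
    totals = [0] * n
    for name, seq in entries:
        for frag_name, piece in fragment(name, seq, max_len):
            k = totals.index(min(totals))  # emptiest bin so far
            bins[k].append((frag_name, piece))
            totals[k] += len(piece)
    return bins, totals
```

Note that a sequence shorter than max_len still gets coordinates
appended to its name, which is exactly the kind of renaming that
confuses end users.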

4.  Split by content, i.e., strip all the human sequences out
of nr.  I don't believe there is a general solution because there
is no universally agreed upon FASTA header line format.
Implementation:  SRS or something similar.
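
For a single, known header convention a pattern match on the header
line is enough.  The Python sketch below assumes the organism name
appears in the header in square brackets, e.g. "[Homo sapiens]";
that is an assumption about the input, not a general rule, and
split_by_header is a hypothetical name.

```python
import re


def split_by_header(handle, pattern):
    """Return (matching_lines, other_lines), decided per entry by its header."""
    regex = re.compile(pattern)
    matched, others = [], []
    target = None  # list receiving the current entry's lines
    for line in handle:
        if line.startswith(">"):
            target = matched if regex.search(line) else others
        if target is not None:
            target.append(line)
    return matched, others
```

A database with a different header layout needs a different
pattern, which is the lack of a general solution noted above.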

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech