needle -filter

David Mathog mathog at mendel.bio.caltech.edu
Mon Jun 9 18:55:25 UTC 2003


> Peter Rice wrote:
> > simon andrews (BI) wrote:
> > In the manual for needle it suggests that it too can accept -filter
> > as a qualifier, but I can't get it to work.
> > 
> > cat seq1.txt seq2.txt | needle -filter -sformat1 fasta -sformat2 fasta
> > 

Hmm.  So presumably they're coming out of some other program
together in a stream and it's inconvenient for some reason
to write them to files.  Ok.

> You are trying to read 2 inputs from stdin. needle will accept one 
> sequence from stdin and another from "somewhere else".
> 
> But you can do this:
> 
> needle "cat seq1.txt|" "cat seq2.txt|" -sformat1 fasta -sformat2 fasta

That's got to be one of the ugliest syntaxes for reading 
in two files I've ever seen!  Plus I don't understand how
it differs from:

  needle seq1.txt seq2.txt

It might be possible on some platforms to come up with a
"firstfasta" filter program which would emit just the 
first fasta entry from the stream. It would have to run
character by character and be able to push the ">" of the
second entry back into the input stream, and I don't
think that's guaranteed to work everywhere.  Probably it
would work on Unix though, so you could maybe do something
like this:

  needle "firstfasta" "firstfasta" 
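
A minimal sketch of what such a "firstfasta" filter might look like,
in Python and assuming a Unix pipe. It reads one byte at a time so that
nothing beyond the first entry is taken from the stream. Since the ">"
of the second entry cannot be pushed back into a pipe, this version
uses a workaround that is purely an assumption of the sketch: a stream
that does not start with ">" is treated as an entry whose ">" was
consumed by the previous call, and the ">" is restored on output. The
name first_fasta is hypothetical.

```python
import os


def first_fasta(fd):
    """Read one FASTA entry from file descriptor fd, one byte at a time.

    Reading stops once the ">" that starts the next entry has been
    consumed. That byte cannot be pushed back into a pipe, so a stream
    that does not begin with ">" is assumed to be an entry whose ">"
    was consumed by an earlier call, and the ">" is restored.
    """
    out = bytearray()
    at_line_start = True
    while True:
        b = os.read(fd, 1)
        if not b:
            break  # end of stream
        if b == b">" and at_line_start and out:
            break  # start of the next entry; its ">" is now consumed
        out += b
        at_line_start = (b == b"\n")
    if out and not out.startswith(b">"):
        out[:0] = b">"  # restore the ">" lost to the previous call
    return bytes(out)
```

Called twice on the same pipe, this returns the first entry and then
the second. Whether two such filters started by needle would actually
share one stdin in this way is not something the sketch settles.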

What Simon needs, and what EMBOSS doesn't have, is a built-in
splitter for multisequence files that will allow the individual
sequences to be directed to specific inputs in a program like
needle.  Failing that, one could create two fifos, use an external
splitter to direct the bits into the fifos, and run needle
with the fifos as the input file names.
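
The external splitter could be sketched roughly as follows, in Python
on Unix. The function names (make_fifos, read_fasta_records,
split_to_paths) and the fifo names in1, in2 are made up for this
example. It assumes the reading program opens its inputs in the same
order the splitter writes them; if it does not, both sides would wait
on each other forever.

```python
import os


def make_fifos(directory, count):
    """Create `count` fifos named in1, in2, ... and return their paths."""
    paths = [os.path.join(directory, "in%d" % (i + 1)) for i in range(count)]
    for path in paths:
        os.mkfifo(path)
    return paths


def read_fasta_records(lines):
    """Yield each FASTA entry in `lines` as a list of lines."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def split_to_paths(lines, paths):
    """Write the i-th entry to the i-th path; extra entries are ignored."""
    for path, record in zip(paths, read_fasta_records(lines)):
        # For a fifo, open() blocks until a reader has opened it too.
        with open(path, "w") as out:
            out.writelines(record)
```

With the fifos created, the splitter would be fed the stream (for
example split_to_paths(sys.stdin, fifos)) while needle is started
separately with the two fifo paths as its input file names.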

Probably better to build the splitter into EMBOSS though.
Something like:

cat twosequences.fasta | program -filter -route 1:infile1,2:infile2

where infile1/infile2 are the command line names for things that
are typically called "-sequence" and the like.  The problem
with needle (and water) is that the sequences typically
go on the command line unadorned, like:

  needle seq1 seq2

for which the syntax might be:

  -route 1:1,2:2

-route without -filter would be an error.  The stream
properties would make it a bit awkward for something like this:

  -route 1:1,1:2

if the program works by loading input 1, then input 2: there is
no way to back up the stream so that input 2 could also load
entry 1.  The splitter/router could handle this, but in general
only by saving the contents of the first streamed sequence
somewhere for reuse.

For a program that compares one sequence to many there could also be:

  -route 1:1,2-END:2
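
To make the proposed syntax concrete, here is a small Python sketch of
how a -route string might be parsed. No such option exists in EMBOSS;
the functions parse_route and targets_for are hypothetical, and the
grammar (comma-separated "entries:target" items, with an optional
"first-last" range where last may be END) is only inferred from the
examples above.

```python
def parse_route(spec):
    """Parse a -route string such as "1:1,2-END:2".

    Returns a list of (first, last, target) tuples. `last` is None
    when the range runs to END. `target` is kept as a string so that
    both positions ("2") and names ("infile2") are accepted.
    """
    routes = []
    for item in spec.split(","):
        entries, target = item.split(":", 1)
        first, _, last = entries.partition("-")
        if not last:
            end = int(first)      # single entry, e.g. "1"
        elif last.upper() == "END":
            end = None            # open-ended range, e.g. "2-END"
        else:
            end = int(last)       # closed range, e.g. "2-5"
        routes.append((int(first), end, target))
    return routes


def targets_for(routes, entry_number):
    """Return every input that stream entry `entry_number` is routed to."""
    return [target for first, last, target in routes
            if first <= entry_number and (last is None or entry_number <= last)]
```

For "1:1,1:2", targets_for returns two inputs for entry 1, which is
exactly the case where the router would have to keep a copy of the
entry because the stream cannot be rewound.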

In theory this splitter/router shouldn't be too hard to implement.
In practice the various file inputs would need to read their
data in the order specified by route, and short of reading each
program's code one would have no way of knowing what that order is.
Which suggests:

  program -listroutes

which would emit the read order information that -route would
use later.


Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


