needle -filter

simon andrews (BI) simon.andrews at bbsrc.ac.uk
Tue Jun 10 07:53:52 UTC 2003



> From: David Mathog 
> > Peter Rice wrote:
> > > simon andrews (BI) wrote:
> > > In the manual for needle it suggests that it too can 
> > > accept -filter as a qualifier, but I can't get it to work.  
> > > 
> > > cat seq1.txt seq2.txt | needle -filter -sformat1 fasta -sformat2 fasta
> > > 
> 
> Hmm.  So presumably they're coming out of some other program 
> together in a stream and it's inconvenient for some reason to 
> write them to files.  Ok.

Yes, it's going to be used as part of a CGI script, so the sequences are already held in variables in the script.  I could go down the route of creating temp files, but that gets to be a real pain (making sure different processes don't clash, and cleaning up afterwards), and it would slow things down as well.
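
For comparison, the tempfile route might look roughly like the following. It is only a sketch, written in Python rather than in whatever language the CGI script uses. The names `write_fasta` and `align_via_tempfiles` are made up for illustration, and I'm assuming the general EMBOSS qualifiers `-auto` (no prompting) and `-stdout` behave as documented for needle.

```python
import os
import subprocess
import tempfile


def write_fasta(path, name, sequence):
    """Write one sequence to `path` as a single FASTA entry."""
    with open(path, "w") as handle:
        handle.write(">%s\n%s\n" % (name, sequence))


def align_via_tempfiles(seq_a, seq_b, command=("needle",)):
    """Run an aligner on two sequences that are held in variables.

    Each call gets its own private directory, so concurrent CGI processes
    cannot clash, and the directory is removed when the `with` block exits,
    even if the aligner fails.
    """
    with tempfile.TemporaryDirectory(prefix="needle_") as workdir:
        path_a = os.path.join(workdir, "a.fasta")
        path_b = os.path.join(workdir, "b.fasta")
        write_fasta(path_a, "seq1", seq_a)
        write_fasta(path_b, "seq2", seq_b)
        # Assumption: -auto and -stdout are accepted here; check your version.
        argv = list(command) + [
            path_a, path_b,
            "-sformat1", "fasta", "-sformat2", "fasta",
            "-auto", "-stdout",
        ]
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        return result.stdout
```

That deals with the clashing and the cleanup, but it still costs two file writes and a directory create/remove per request, which is the overhead I'd like to avoid.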


> > You are trying to read 2 inputs from stdin. needle will 
> > accept one sequence from stdin and another from "somewhere
> > else".
> > 
> > But you can do this:
> > 
> > needle "cat seq1.txt|" "cat seq2.txt|" -sformat1 fasta -sformat2 fasta

That's interesting to know, but I don't think it really helps me.  The cat in my example was only there to generate the pipe; the actual sequences would all have to come from one input stream.


> It might be possible on some platforms to come up with a 
> "firstfasta" filter program which would emit just the 
> first fasta entry from the stream. It would have to run 
> character by character and be able to push the ">" of the 
> second entry back into the input stream, and I don't think 
> that's guaranteed to work everywhere.  Probably it would work 
> on Unix though, so you could maybe do something like this:
> 
>   needle "firstfasta" "firstfasta" 

I don't think that would be any easier than going down the whole tempfile route.  You'd still have the problem of each firstfasta working out which stream it should take its sequence from when multiple instances were running.

 
> What Simon needs, and what Emboss doesn't have, is a built in 
> splitter for multisequence files that will allow the 
> individual sequences to be directed to specific inputs in a 
> program like needle.  Failing that one could create two 
> fifos, use an external splitter to direct the bits into the 
> fifos, and run  needle with the fifos for the input file names.
> 
> Probably better to build the splitter into EMBOSS though. 
> Something like:
> 
> cat twosequences.fasta | program -filter -route 1:infile1,2:infile2

This seems a good addition to the package, but for many programs you could probably make the -route part optional, and just say that -filter expects all its sequences on STDIN, all in the same format.  I don't think it would be too much of an imposition on the user to say that if you want to submit multiple sequences to a program in a single stream then they all have to come in the correct order.
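
Until something like that exists, the two-fifo workaround David describes could be sketched as below. This is Unix-only (it relies on `os.mkfifo`), the names `_feed` and `run_with_fifos` are hypothetical, and I haven't checked that needle is happy reading its inputs from a fifo, so treat it as an illustration of the plumbing rather than a tested recipe.

```python
import os
import subprocess
import tempfile
import threading


def _feed(path, text):
    # Opening a fifo for writing blocks until the reader opens the other
    # end, so each writer runs in its own thread.
    with open(path, "w") as fifo:
        fifo.write(text)


def run_with_fifos(entries, command):
    """Pass each FASTA entry to `command` through its own named pipe.

    The fifo paths are appended to `command` in the same order as
    `entries`, so the first entry becomes the first input file name.
    """
    with tempfile.TemporaryDirectory(prefix="fifo_") as workdir:
        paths = []
        writers = []
        for index, entry in enumerate(entries, start=1):
            path = os.path.join(workdir, "in%d.fasta" % index)
            os.mkfifo(path, 0o600)
            writer = threading.Thread(target=_feed, args=(path, entry), daemon=True)
            writer.start()
            paths.append(path)
            writers.append(writer)
        result = subprocess.run(
            list(command) + paths, capture_output=True, text=True, check=True
        )
        for writer in writers:
            # Do not wait forever if the program never opened one of the fifos.
            writer.join(timeout=5)
        return result.stdout
```

It avoids writing the sequences to disk, but you still have to create and remove uniquely named fifos for every request, so it doesn't really get away from the bookkeeping of the tempfile approach.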

Projects such as BioPerl have stream filters which will read multiple concatenated sequences in most formats, so hopefully reading several sequences from a single stream shouldn't present too many problems.  As David pointed out, though, you'd have to be able to rewind (by at least one line) in the stream for something like Fasta format.
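
If the reader keeps one line of lookahead itself, the stream never actually has to be rewound. A minimal sketch of the idea, with `read_fasta_entries` as a made-up name and no claim that this is how BioPerl or EMBOSS do it:

```python
import io


def read_fasta_entries(stream):
    """Yield (header, sequence) pairs from concatenated FASTA on `stream`.

    FASTA has no end-of-record marker, so an entry is only known to be
    complete once the next '>' line (or end of input) has been seen.
    Instead of pushing that line back, it is kept as the header of the
    next entry.
    """
    header = None
    chunks = []
    for line in stream:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None and line:
            # Anything before the first header is ignored.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Two entries arriving on one stream, in the order the program expects.
stream = io.StringIO(">seq1\nACGT\nAC\n>seq2\nTTGA\n")
entries = list(read_fasta_entries(stream))
```

Because the entries come out in stream order, the first could be assigned to the first input and the second to the second without any routing information.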


> If the program works by loading input 1, then input 2.  No 
> way to back up the stream so that input 2 could load with 
> entry 1.  The splitter/router could handle this, but only 
> generally by saving the contents of the first streamed 
> sequence somewhere for reuse.  

This sounds like the sort of scenario I was trying to avoid with my initial query!  What happens if you're putting 200 sequences into a multiple alignment, in a random order, and there are 5 copies running?  It all gets a bit complex and messy :-)  Again, I'd be tempted to just say that if you want to put your inputs on STDIN they have to come in the right order.  As a programmer that shouldn't be too hard to organise.

As an aside, the asis:: USA is a really nice way around this whole problem.  The trouble is that it's limited (I presume) by the command line length your shell allows, and you can't specify a name for the sequence.  If there were some easy way around these limitations, that would solve the problem for the situations I'm likely to encounter.
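
To make the length worry concrete, here is a rough sketch of building the arguments from variables and checking them against the system limit before launching anything. The helpers `asis_usa` and `fits_on_command_line` are hypothetical names, the 32768 fallback and the 4096-byte margin are arbitrary assumptions, and `-auto` / `-stdout` are assumed to be accepted as general EMBOSS qualifiers.

```python
import os


def asis_usa(sequence):
    """Build an 'asis' USA, which carries the residues inline."""
    cleaned = "".join(sequence.split())  # drop spaces and line breaks
    return "asis::" + cleaned


def fits_on_command_line(argv, limit=None, safety_margin=4096):
    """Rough check that argv plus the environment stays under the limit.

    If `limit` is not given, the system's ARG_MAX is used where available.
    """
    if limit is None:
        try:
            limit = os.sysconf("SC_ARG_MAX")
        except (AttributeError, ValueError, OSError):
            limit = -1
        if limit <= 0:
            limit = 32768  # assumed conservative fallback
    arg_bytes = sum(len(arg.encode()) + 1 for arg in argv)
    env_bytes = sum(len(k.encode()) + len(v.encode()) + 2
                    for k, v in os.environ.items())
    return arg_bytes + env_bytes + safety_margin < limit


seq_a = "ACGT ACGT"
seq_b = "TTGA"
argv = ["needle", asis_usa(seq_a), asis_usa(seq_b), "-auto", "-stdout"]
```

Some systems also cap the length of a single argument, which this check doesn't cover, and none of it helps with naming the sequences.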

Simon.


