[EMBOSS] Reading Two Sequences from stdin with water

Fri Jun 3 17:18:01 UTC 2005

On Fri, Jun 03, 2005 at 03:09:03PM +0100, pmr at ebi.ac.uk wrote:
> Jan T. Kim writes:
> > is it possible to read both input sequences to a pairwise alignment
> > from one input stream?
> >
> >     cat x.fasta | water -asequence fasta::stdin:seq1 -bsequence
> > fasta::stdin:seq2 -outfile stdout -auto
> >
> > gives
> >
> >    EMBOSS An error in ajfile.c at line 1926:
> > Error reading from file 'stdin'
> >
> > It may well be that water consumes the entire input stream on getting the
> > first sequence, thus rendering itself unable to acquire the second one.
> >
> > Is there a solution to this? I would really like to avoid the mess of
> > temporary files and run water in a clean pipe (pun intended  ;-)  )
> 
> EMBOSS will only cleanly read stdin as one input. We should probably trap
> that internally and give an error if we find stdin opening again. I wonder
> whether there is any useful way to share the stdin filebuffer. Hmmmm... in
> the early days of EMBOSS we decided not to allow it, but it could be worth
> a try. You would still be in trouble if you tried to read the second
> sequence first though.

Conceptually, this could be cleanly handled (which is why I tried in
the first place), by having the function for obtaining the input sequences
determine the source files in a first pass of the list of sources, and
then obtain all requested sequences that come from the same file in one
go through that file. This could be applied to the standard input just
as to any other file.

However, if the current code acquires the two sequences one after the
other and independently of each other, it will require a possibly less than
trivial rewrites to change that -- likely, the API for obtaining a
sequence specified by a USA would have to be extended such that multiple
sequences can be obtained from one file in one pass through that file,
and some functions to group lists of USAs into sublists of USAs that
refer to the same file would have to be provided.

> Assuming your x.fasta file has only seq1 and seq2 in that order, reading
> seq1 will continue until the first line of seq2 is reached. By then it
> would be too late for seq2 to be read cleanly.

Well, the approach outlined above does not have that limitation, and
it also works for interleaved sequence formats. But if the EMBOSS
internals are as I assume above, it's clear to me that this is something
for the long-term wishlist.

> At least you have fasta:: specified - with no specified format, EMBOSS has
> to read a long way into the input just to check whether it is really GCG
> format.

Yes, heuristic format determination and non-seekable inputs don't mix
too well generally...

> As for the asis format, I suppose an EMBOSS utility that reads x.fasta and
> outputs asis::ctagtacgatgcgatcg asis::tgatcgatggctacgtagc would be useful
> to you - then you could put `sillyname x.fasta` in your command line... at
> least until the command line gets too long. Hard to preserve the ID and
> description of the sequences though.

Yes -- in my case, I have the sequences available within a Python script
anyway, so the asis approach works fine for me (even with a popen
facility that goes through a shell -- I'll have to check how to eliminate
that for future occasions where sequences may be too long for the
command line, though).

> "If you think water is pure, just remember what fish do in it."

I like to boil my water, adding an all-natural disinfectant known as
"coffee" for this reason...  ;-)

Best regards, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |    *NEW*    email: jtk at cmp.uea.ac.uk                               |
 |    *NEW*    WWW:   http://www.cmp.uea.ac.uk/people/jtk             |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*