[Biopython-dev] Multiple alignment - Clustalw etc...

Tue Mar 31 12:56:21 UTC 2009

On Tue, Mar 31, 2009 at 12:25 PM, Cymon Cox <cy at cymon.org> wrote:
>> What I typically do fits pretty nicely with the SeqIO/AlignIO paradigm:
>> (1) use SeqIO to prepare the FASTA input file.
>> (2) run the command line tool (e.g. MUSCLE).
>> (3) use AlignIO (or SeqIO) to read the alignment output file.
>
> Well, yes - we can always not use the biopython interface.

Ideally step (2) in the above would be handled via a Biopython
command line wrapper, offering keyword arguments etc.

>> i.e. Have a Biopython wrapper use a temp file to record the
>> given records to in a format appropriate for the command line
>> tool selected, and capturing the output?  In the case of
>> ClustalW or MUSCLE this means making a temp FASTA input
>> file.  For ClustalW we'd then have to open the output file, read
>> it, and then delete it.
>
> Yes, that's what I'm suggesting.
>
> Here's my reasoning: it seems to me the input and output formats of the data
> required by a particular alignment tool are incidental and should be hidden
> from the user.

OK - I see this as doing some implicit behind the scenes magic.
Arguably this kind of thing is still nice to have if it makes things
simpler for the user.

I may over use this mantra, but "Explicit is better than implicit",
from the Zen of Python.  http://www.python.org/dev/peps/pep-0020/

> At present the Clustalw interface forces you to write a fasta
> formatted file of your records to disk, and then has Clustalw
> write an aligned matrix to disk in a format specified by the user.

The Clustalw tool only takes FASTA formatted input, so if you have
a bunch of sequences in memory you are forced to convert them
into FASTA format to use them as input.  The question is where
does this conversion take place - explicitly by the user, or implicitly
by a wrapper.

> If the latter is Clustal format, then the record is parsed and an alignment
> object is returned, else None is returned. In either case, an output file(s)
> remains on disk.

It should be a fairly simple enhancement to look at the arguments
to see if another output format we can parse was selected, e.g.
PHYLIP?) and also parse that.  Do you think that would be a
sensible addition to Bio.Clustalw.do_alignment?  Its never been
an issue for me as if you are using the Bio.Clustalw.do_alignment
interface you probably don't care about the output file format.

> So, say we have a bunch of sequences in pir format and we'd like them
> aligned and saved in stockholm format:
>
> from Bio import SeqIO
> from Bio import AlignIO
> from Bio import Clustalw
> from Bio.Clustalw import MultipleAlignCL
> records = SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir")
> AlignIO.write([records], open("temp.fasta", "w"), "fasta")

The above line is wrong - it should be:
SeqIO.write(records, open("temp.fasta", "w"), "fasta")
At this point your PIR sequences are not yet aligned, so they'll (probably)
have different lengths, so shouldn't be treated as an alignment.  If it
doesn't raise an error maybe it should...

Also you don't explicitly close the handle this way.

> cline = MultipleAlignCL("temp.fasta", command="clustalw")
> align = Clustalw.do_alignment(cline)
> AlignIO.write([align], open("temp.sth", "w"), "stockholm")

> we end up with 4 output files on disk: temp.aln,  temp.dnd,  temp.fasta,
> temp.sth - 3 of which are incidental.

Yes - but as the ClustalW doesn't read in PIR files, and doesn't output
Stockholm files on its own, so this has to happen.  It's just a question
of who does it (the user, or the wrapper code).

> (BTW, using the above procedure on the files "B_nuc.pir" and "Cw_prot.pir"
> in Tests/NBRF hangs on RH and Ubuntu linux: it seems to be waiting for the
> subprocess to return, which it never does: pid, sts = os.waitpid(self.pid,
> 0))

I would guess this is because you never properly closed the
temp.fasta file, so it may not have been flushed to disk when the
Clustalw tool was called.

> As I say, I'd like to see this:
>>>> from Bio.Align import MultipleAlignment
>>>> records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir"))
>>>> align = MultipleAlignment(records, executable="clustalw")
>>>> AlignIO.write([align], open("temp.sth", "w"), "stockholm")
>
> ie resulting in one file temp.sth, which we've explicitly written to disk.

So you'd like the wrapper to take care of creating and deleting the
temp input FASTA file, and also deleting the temp output ClustalW
file after parsing it.  This can probably be done quite cleanly using
python's NamedTemporaryFile object.

>> For other tools we may be able to just capture its output on
>> stdout and not have to clean up a temp output file.
>>
>> All the possible command line tools have their own arguments,
>> range of file formats, behaviour with respect to default filenames
>> etc.  Trying to capture all this in a single wrapper seems rather
>> ambitious.  For example, how would you handle gap penalties?
>> Keep in mind that different tools may use the same name for
>> a gap extension penalty but interpret the values differently.
>
> Sorry, I wasn't very clear about what I intended:
>
> MultipleAlignment(records, executable="clustalw", <keyword args>)
> returns Clustalw.do_alignment(records, <keyword args>)
> and
> MultipleAlignment(records, executable="muscle", <keyword args>)
> returns Muscle.do_alignments(records, <keyword args>)
>
> I'm not suggesting unifying all programme options into a single interface,
> just wrap the individual alignment tool modules in a common call,
> MulitpleAlignment(), align_records(), or whatever...

I see.

> As for the keyword options, at present the Clustalw interface supports the
> manual setting of some attributes to the MultipleAlignCL instance, but there
> is no type or value checking. I think as many options as possible should be
> supported through keyword arguments - tedious, but doable.
>
>> Also, while I can see this might be nice for short alignments
>> (which are quick to run), its rather implicit or magic.
>
> Not sure what you mean here? Why would the size of alignment matter?

Size of alignment influences the compute time, and therefore is an issue for
anyone doing things at the python prompt.  Moreover, if the alignments are
big and slow, you generally want to make sure the output file is kept on disk,
as you'll probably want to read it more than once.

> And as for it being magic, its seems to me it does, and only does, what
> it says on the label - aligns the data.

The magic is the behind the scenes creation/deletion of the input/output
files, and the conversion between file formats.

>> I personally prefer to have to deal with the files explicitly myself
>> - but then I have been dealing with large alignments which I want
>> to keep on disk.
>
> I tend to build many (small - <100 taxa) single gene alignments - in one
> use-case, 280 of them...

In your case I would assume the alignment takes minutes to run.  You tend
to care more about preserving the output files if they take hours to create ;)

>> > Secondly, the biopython interface does not support calling
>> > Clustalw to perform profile alignments,

That is something we should probably add.

> OK, well having had my say, I'm quite happy to write the Muscle module in
> the style of the current Clustalw interface, or whatever style is most
> appropriate for exposing the filename handles. But I'm not sure what that
> would be - perhaps you could elaborate on this a bit...

I've elaborated, perhaps too much? ;)

Basically you seem to be thinking about a high level abstraction for
multiple alignment tools (dependent on the Bio.SeqIO and Bio.AlignIO
module), while I am more focused on the low level abstraction for
wrapping any command line tool.  This isn't to say we can't have both,
but to me it makes sense to start with the low level stuff first.

We (unfortunately) have several styles of command line tool wrappers
in Biopython already - this is a wart that has been on my mental to do
list for some time.  I think we should focus on dealing with command
line strings, and keep this separate from how the tools are invoked
(e.g. subprocess or os.system), preparation of input files, and how
any output is parsed.  As long as this core is in place, more advanced
wrappers are possible on top of this basic infrastructure (Tiago may
have some comments here from his Bio.PopGen work).

Essentially all our command line wrappers start by building a command
line string.  In some cases this command line string is exposed to the
user (e.g. Bio.EMBOSS), and they can choose how they want to invoke
it.  For example, they can explicitly opt to use the Python subprocess
module and pipes if they want to - or use a standard invocation from
Bio.Applications (we may want to add a couple of variations to this
module).

Other wrappers (e.g. Bio.Blast.NCBIStandalone) instead call the tool
for you. In the case of Bio.Blast.NCBIStandalone, if you don't want
the handles because you've told Blast to save its output to a file,
our wrapper still returns the standard output and standard error
handles - it is forced on you (see Bug 2654).   Also, there is no easy
way to see what the actual command line string was, which can make
debugging hard, and also prevents certain things (e.g. submitting the
command line as a task to a cluster of workstations).  At least
Bio.Clustalw offers a command line string object (MultipleAlignCL),
its just the do_alignment helper function I'm not so keen on.

The Bio.Clustalw.do_alignment wrapper is rather unusual in that it
automatically parses the output - while most of our wrappers don't.
Decoupling the parsing is more modular - it makes it easy for the user
to use any parser for the output from a command line tool (either
using stdout, or by reading an output file).  I like this, and it fits
with the handle based approach in most of our parsers.

So, I would suggest we think about adding new wrappers under Bio.Align
(e.g. Bio.Align.Clustalw, Bio.Align.Muscle, Bio.Align.TCoffee - or
perhaps all together in Bio.Align.Applications or something) based on
the Bio.Application module as used in Bio.EMBOSS.  We could then
deprecate Bio.Clustalw, which should also help tidy up the top level
name space.  Initially at least, I wouldn't include any clever wrapper
code at all.

Once we have the basic command line objects done, these could be used
to later add another layer on top implementing Cymon's ideas for
multiple alignment wrappers taking care of intermediate file and
inter-converting file formats on the fly, although I remain to be
convinced about the value this.  If you can pull it off (cross
platform, on several versions of python) then a user friendly high
level interface would be impressive.

Peter