[Biopython-dev] Clustal alignment format header line

Tue May 12 15:43:47 UTC 2009

2009/5/12 Peter <biopython at maubp.freeserve.co.uk>

> On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> > Both Muscle (-clw) and Probcons (-clustalw) output a program-specific
> > header line for the clustal format alignment:
> >
> > "MUSCLE (3.7) multiple sequence alignment
> >
> >
> > AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"
> >
> > "PROBCONS version 1.12 multiple sequence alignment
> >
> > AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA
> >
> > "
> >
> > Bio.AlignIO will not read these alignments because of this check at
> > Bio/AlignIO/ClustalIO.py:94:
> >     if line[:7] != 'CLUSTAL':
> >         raise ValueError("Did not find CLUSTAL header")
> >
> > Muscle does have a -clwstrict flag but ProbCons doesn't.
> >
> > Would it be a good idea to relax the header parsing?
> >
> > C.
>
> Maybe.  Up until now the only example of this I had personally come
> across was MUSCLE, but they helpfully provide the -clwstrict argument
> so the issue wasn't important.
>
> There are also of course the official variants like:
>
> CLUSTAL W (1.81) multiple sequence alignment
> CLUSTAL 2.0.9 multiple sequence alignment
>
> How would you code this?  A flexible option would be to take anything
> where the first line ends with "multiple sequence alignment", but this
> risks letting a lot of non-Clustal files through, which will then
> (hopefully) fail, but probably with a much more cryptic error message.
> A white list of safe variants like "MUSCLE" and "PROBCONS" would be
> safest.
>
> Also I have a vague memory of some tool using something like "CLUSTAL
> ... from ToolX" but I don't recall the details.

T-COFFEE for one:
"CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE:
], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601"
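
For comparison, the whitelist idea might look roughly like the sketch
below. The helper name and the prefix tuple are made up for illustration;
none of this is existing Bio.AlignIO code.

```python
# Hypothetical whitelist of program names seen at the start of the header
# line. The three entries are the ones mentioned in this thread.
KNOWN_HEADER_PREFIXES = ("CLUSTAL", "MUSCLE", "PROBCONS")


def looks_like_clustal_header(line):
    """Return True if the header line starts with a whitelisted name."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return line.strip().startswith(KNOWN_HEADER_PREFIXES)
```

The T-COFFEE header passes only because it happens to start with
"CLUSTAL"; any other tool with its own header would need a new entry.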

Is it so bad to let it fail on the structure of the data, effectively
ignoring the header? Maybe have a general "this doesn't look like Clustal
formatted data" error based on the data structure...
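
Something along these lines, as a rough sketch only. The function name,
the regular expression and the error wording are my assumptions, not the
current ClustalIO parser, and the set of characters allowed in a sequence
is a guess that would need checking against real files.

```python
import re

# One alignment row: identifier, whitespace, sequence fragment, and an
# optional trailing residue count (some CLUSTAL versions write one).
ROW = re.compile(r"^(\S+)\s+([A-Za-z*.\-]+)(\s+\d+)?\s*$")


def check_clustal_body(lines):
    """Hypothetical check: ignore the header line, validate the rows.

    Returns the number of sequence rows seen. Raises ValueError if a line
    is neither blank, a consensus line, nor an alignment row.
    """
    rows = 0
    for line in lines[1:]:  # the first line is the header and is ignored
        if not line.strip():
            continue  # blank separator between blocks
        if line[0] in " \t":
            continue  # consensus line starts with whitespace
        if ROW.match(line) is None:
            raise ValueError(
                "This doesn't look like Clustal formatted data: %r" % line
            )
        rows += 1
    if rows == 0:
        raise ValueError(
            "This doesn't look like Clustal formatted data: no sequence rows"
        )
    return rows
```

With this, the MUSCLE and PROBCONS examples above would be accepted
whatever their first line says, while something like a FASTA file would
fail with one clear message instead of a cryptic one further down.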

C.
