[Biopython-dev] Clustal alignment format header line
cy at cymon.org
Tue May 12 15:43:47 UTC 2009
2009/5/12 Peter <biopython at maubp.freeserve.co.uk>
> On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> > Both Muscle (-clw) and Probcons (-clustalw) output a programme-specific
> > header line for the clustal format alignment:
> > "MUSCLE (3.7) multiple sequence alignment
> > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc"
> > "PROBCONS version 1.12 multiple sequence alignment
> > AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA
> > "
> > Bio.AlignIO will not read these alignments
> > Bio/AlignIO/ClustalIO.py:94
> > if line[:7] != 'CLUSTAL':
> >     raise ValueError("Did not find CLUSTAL header")
> > Muscle does have a -clwstrict flag but ProbCons doesn't.
> > Would it be a good idea to relax the header parsing?
> > C.
> Maybe. Up until now the only example of this I had personally come
> across was MUSCLE, but they helpfully provide the -clwstrict argument
> so the issue wasn't important.
> There are also of course the official variants like:
> CLUSTAL W (1.81) multiple sequence alignment
> CLUSTAL 2.0.9 multiple sequence alignment
> How would you code this? A flexible option would be to take anything
> where the first line ends with "multiple sequence alignment", but this
> risks letting a lot of non-clustal files through which will then
> (hopefully) fail, but probably with a much more cryptic error message.
> A white list of safe variants like "MUSCLE" and "PROBCONS" would be
> Also I have a vague memory of some tool using something like "CLUSTAL
> ... from ToolX" but I don't recall the details.
T-COFFEE for one:
"CLUSTAL FORMAT for T-COFFEE Version_6.92 [http://www.tcoffee.org] [MODE:
], CPU=0.00 sec, SCORE=100, Nseq=2, Len=601"
Is it so bad to let it fail on the structure of the data, effectively
ignoring the header? Maybe have a general "this doesn't look like clustal
formatted data" error based on the data structure...
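
To make the two options concrete, here is a rough sketch of both: a white
list check on the header, and a check on the structure of the first block
of data. This is not the ClustalIO code; the function names, the prefix
list and the regular expression are all made up for illustration, and the
row pattern is only a guess at what "looks like clustal" should mean.

```python
import re

# Hypothetical white list built from the headers quoted in this thread.
# T-COFFEE is covered because its header starts with "CLUSTAL FORMAT".
KNOWN_HEADER_PREFIXES = ("CLUSTAL", "MUSCLE", "PROBCONS")

# Assumed shape of a sequence row: identifier, whitespace, residues or
# gaps, and optionally a trailing residue count.
_ROW = re.compile(r"^(\S+)\s+([A-Za-z*.\-]+)(\s+\d+)?\s*$")


def header_is_whitelisted(first_line):
    """Option 1: accept only headers starting with a known program name."""
    return first_line.startswith(KNOWN_HEADER_PREFIXES)


def looks_like_clustal_body(lines):
    """Option 2: ignore the header and inspect the first block of rows."""
    block = []
    for line in lines[1:]:
        if not line.strip():
            if block:
                break  # a blank line closes the first block
            continue  # blank lines between header and first block
        if line[0].isspace():
            continue  # conservation line such as "   ***:. "
        match = _ROW.match(line)
        if match is None:
            return False
        block.append(match.group(2))
    # All rows of one block must have the same aligned width.
    return bool(block) and len({len(seq) for seq in block}) == 1


def check_clustal_like(lines, strict_header=False):
    """Raise a general error if the data does not look like clustal."""
    if not lines:
        raise ValueError("Empty file")
    if strict_header and not header_is_whitelisted(lines[0]):
        raise ValueError("Did not find a recognised CLUSTAL-style header")
    if not looks_like_clustal_body(lines):
        raise ValueError("This doesn't look like clustal formatted data")
```

With strict_header=False the header line is skipped unread, so MUSCLE,
PROBCONS and T-COFFEE output is accepted as long as the rows line up,
while a FASTA file fails straight away with the general error message.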