[Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010)

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed Jul 28 10:46:19 UTC 2010


http://bugzilla.open-bio.org/show_bug.cgi?id=3119





------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2010-07-28 06:46 EST -------
Created an attachment (id=1530)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1530&action=view)
Hand corrected NEXUS output from prank v100701

I am attaching a hand-edited version of the PRANK v100701 NEXUS output where I
have wrapped the names in single quotes and removed the stray comma in the
translate statement. See below for details. Bio.Nexus is happy with this file.

(In reply to comment #4)
> Slashes in Taxon names may cause troubles (even when properly quoted), not
> only for Bio.Nexus, but also for many other programs. If you want to use /
> or other special characters in taxon names, better use a " or ' around them.
> It might be best to avoid them entirely, my experience is that at one point
> during file processing there will be a software that complains.

I should have been clearer earlier: Yes, I understand that special characters
like slash will cause some tools problems, but they are nevertheless common.
In particular, Pfam alignments take the form name/start-end to encode which
subregion of a protein is being shown - like the example here, which uses
AK1H_ECOLI/1-378 and AKH_HAEIN/1-382 as the taxon names.

I have just checked in a change to the error message, which I think throws
more light on the issue:

http://github.com/biopython/biopython/commit/d8a4a6edc98fa69885b6865336020db02035ff0b

Now I get:

>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus("output_prank_v100701.nex")
Traceback (most recent call last):
...
Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in sequence
/1-378CPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRR (check
dimensions/interleaving)

Notice that the tail of the taxon name ('/1-378') is being treated as part of
the sequence. Having looked at the code and read the relevant bits of the NEXUS
specification (Maddison et al), I think that PRANK is producing invalid taxon
labels. Characters like slashes and dashes (minus signs) are considered
punctuation and thus indicate the end of a taxon label, so labels containing
them should have been wrapped in single quotes.

See the attachment.
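
To illustrate the quoting rule, here is a minimal sketch in plain Python (it
does not use Bio.Nexus). The helper names and the exact punctuation set are my
own assumptions based on my reading of the specification, so treat this as an
illustration rather than what Bio.Nexus or PRANK actually does:

```python
# Characters that end an unquoted NEXUS token. This set is an assumption
# based on my reading of the specification, not copied from Bio.Nexus.
NEXUS_PUNCTUATION = set("()[]{}/\\,;:=*'\"`+-<>")


def unquoted_prefix(label):
    """Return the part of a label read before the first punctuation
    or whitespace character (hypothetical helper)."""
    for i, ch in enumerate(label):
        if ch in NEXUS_PUNCTUATION or ch.isspace():
            return label[:i]
    return label


def quote_nexus_label(label):
    """Wrap a label in single quotes if it contains punctuation or
    whitespace (hypothetical helper)."""
    needs_quotes = any(ch in NEXUS_PUNCTUATION or ch.isspace() for ch in label)
    if not needs_quotes:
        return label
    # Inside a quoted label, a literal single quote is written twice.
    return "'" + label.replace("'", "''") + "'"


for name in ["AK1H_ECOLI/1-378", "AKH_HAEIN/1-382"]:
    print(unquoted_prefix(name), "->", quote_nexus_label(name))
```

The first column is what a parser following this rule would take as the name
if the label is left unquoted; the second is the quoted form used in the
hand-corrected attachment.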

> The translate statement in the nexus file ends both with a , AND a ; after the
> second taxon, which is also not nexus compliant.

(In reply to comment #6)
> I think this is a bug - taxa in a translate statement are separated by commas,
> and after the last one, there is a semicolon, not both. Which makes sense.

I have not looked at this aspect in detail, but will take your word for it.
See the attachment.
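
Going by the quoted comments, the translate entries are separated by commas
and the statement ends with a single semicolon. A rough sketch of removing the
stray comma automatically, rather than by hand, might look like this. The
function name and the sample text are made up for illustration (the sample is
not the real PRANK output), and the regular expression assumes no semicolon
appears inside a quoted label:

```python
import re

# Match from the word "translate" up to a comma that is followed only by
# whitespace and the terminating semicolon. Assumes no ';' inside labels.
_STRAY_COMMA = re.compile(r"(\btranslate\b[^;]*?),(\s*;)", re.IGNORECASE)


def drop_trailing_translate_comma(nexus_text):
    """Remove a comma sitting directly before the semicolon that ends a
    translate statement (hypothetical helper)."""
    return _STRAY_COMMA.sub(r"\1\2", nexus_text)


# Made-up sample, only meant to show the stray comma.
sample = """begin trees;
  translate
    1 'AK1H_ECOLI/1-378',
    2 'AKH_HAEIN/1-382',
  ;
end;
"""

fixed = drop_trailing_translate_comma(sample)
print(fixed)
```

A translate statement that already ends correctly is left unchanged, since
the pattern only matches a comma directly before the closing semicolon.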

(In reply to comment #6)
> 
> You're welcome to report it - probably you have more info at hand how the file
> was generated...
>

For the record, the file was generated as follows. The input file is in FASTA
format and has two sequences which already have gaps in them:

http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/fa01
http://github.com/biopython/biopython/raw/master/Tests/Fasta/fa01

Then run prank (here using v081202), from the same directory:

$ prank -d=fa01 -f=17 -noxml -notree

Warning: option '+F' is not selected. You can select it by adding flag "+F".

PRANK: aligning sequences in 'fa01', writing results to 'output.?.nex' [plain
alignment].

Generating approximate guidetree.
Generating multiple alignment.            
#1#(1/1): 95% aligned                    
Generating improved guidetree.
Generating improved multiple alignment.
#1#(1/1): computing full probability               
Alignment done. Total time 1s

$ diff output.1.nex output.2.nex
$ more output.2.nex
#NEXUS
...

See the previously attached file (attachment 1524) for the output.


(In reply to comment #6)
> 
> Frank
> 
> PS. I Updated tree parsing in Nexus to handle the 
> 
> tree * PRANK = ...
> 
> statement.
> 

Great.

Peter




