[Bioperl-l] Comparative genomics
   
    Arlin Stoltzfus
    arlin@carb.nist.gov
    Tue, 02 Oct 2001 10:00:10 -0400
    
    
  
Bioperlers--
Those interested in representing phylogenetic trees and associated 
inferences might benefit from an ongoing e-mail discussion among 
evolutionary systematists and computational biologists who wish to 
develop an XML format for the transfer of phylogenetic data.  The 
archives of the mailing list are here: 
 http://evolution.genetics.washington.edu/pipermail/xml/
though perhaps it would be more useful to look at this page: 
  http://evolve.zoo.ox.ac.uk/PhyloXML/
A few points: 
1. The nested parentheses format for representing a phylogeny like this: 
  (fish, ((cat, dog), rat)); 
is called the "Newick" or "New Hampshire" standard. Branch lengths are 
added by putting ":<number>" after the descendant node, as in 
"(cat:0.1, dog:0.2)".  Internal nodes can be named, as in 
"(fish, ((cat, dog) Carnivora, rat) Mammalia);".  It is conventional to 
allow a block of multiple trees with weights and names, one tree per 
line.  Newick is the standard for trees, as universal as FASTA is for 
sequences. 
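
To make the grammar concrete, here is a minimal sketch of a recursive 
reader for the format, in Python.  The names Node, parse_newick and 
leaf_names are my own inventions for illustration, not part of any 
library, and the sketch assumes unquoted labels and no bracketed 
comments. 

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted tree (hypothetical class for this sketch)."""
    name: str = ""
    length: Optional[float] = None  # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse one Newick string; assumes no quoted labels and no comments."""
    s = "".join(text.split())  # unquoted labels cannot contain whitespace
    if s.endswith(";"):
        s = s[:-1]
    pos = 0

    def label() -> str:
        # Read characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:":
            pos += 1
        return s[start:pos]

    def subtree() -> Node:
        nonlocal pos
        node = Node()
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            node.children.append(subtree())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                node.children.append(subtree())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        node.name = label()  # may be empty for unnamed internal nodes
        if pos < len(s) and s[pos] == ":":
            pos += 1  # consume ":"
            node.length = float(label())
        return node

    root = subtree()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_names(node: Node) -> List[str]:
    """Collect the names of the tips, left to right."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


tree = parse_newick("(fish, ((cat, dog) Carnivora, rat) Mammalia);")
print(leaf_names(tree))  # ['fish', 'cat', 'dog', 'rat']
```

A real reader would also need to handle quoted labels, comments, and 
several trees per file. 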
2. NEXUS is a standard format with separate blocks for representing 
alignments, trees, assumptions, etc., as used in phylogenetic analysis.  
NEXUS incorporates a TREES block for Newick trees.  OTUs (i.e., 
sequences) and characters (i.e., alignment columns) can be assigned 
to subsets in a SETS block to allow differential treatment in analyses 
(e.g., different models for 1st, 2nd and 3rd codon positions).  
NEXUS has been in use for close to a decade as an input format for 
phylogenetic analysis programs such as PAUP and MacClade, though my 
guess would be that it is not used by the majority of such programs.  
A proper format description has been published:  
 Maddison, D. R., D. L. Swofford, et al. (1997). "NEXUS: an
 extensible file format for systematic information." Systematic
 Biology 46: 590-621.
The published standard is much more flexible and extensive than any current 
implementation. For instance, the standard allows the specification of 
genetic codes for different nucleotide sequences, but this feature is 
not used in any program, to my knowledge.  Probably anything you 
want to do could be done within the published NEXUS standard.  
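
As an illustration of the block structure, here is a small hand-written 
NEXUS fragment with a naive reader, in Python.  The taxon labels, the 
tree name "example" and the charset names are made up, and the functions 
read_blocks and read_trees are hypothetical helpers.  The reader ignores 
most of the standard (comments, quoted tokens, TRANSLATE tables), so it 
is a sketch and not a usable parser. 

```python
import re

# A made-up file with TAXA, TREES and SETS blocks.  The CHARSET commands
# pick out 1st, 2nd and 3rd codon positions ("." means the last character).
NEXUS_TEXT = r"""#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=4;
  TAXLABELS fish cat dog rat;
END;
BEGIN TREES;
  TREE example = (fish,((cat,dog),rat));
END;
BEGIN SETS;
  CHARSET pos1 = 1-.\3;
  CHARSET pos2 = 2-.\3;
  CHARSET pos3 = 3-.\3;
END;
"""

BLOCK = re.compile(r"BEGIN\s+(\w+)\s*;(.*?)\bEND\s*;", re.IGNORECASE | re.DOTALL)
TREE = re.compile(r"TREE\s+(\S+)\s*=\s*(.+)$", re.IGNORECASE | re.DOTALL)


def read_blocks(nexus_text):
    """Return {BLOCKNAME: [command, ...]}, splitting commands on ';'."""
    blocks = {}
    for match in BLOCK.finditer(nexus_text):
        name = match.group(1).upper()
        body = match.group(2)
        blocks[name] = [c.strip() for c in body.split(";") if c.strip()]
    return blocks


def read_trees(blocks):
    """Return {tree name: Newick string} from the TREES block, if any."""
    trees = {}
    for command in blocks.get("TREES", []):
        match = TREE.match(command)
        if match:
            # Restore the terminating ';' that the command split removed.
            trees[match.group(1)] = match.group(2).strip() + ";"
    return trees


blocks = read_blocks(NEXUS_TEXT)
print(sorted(blocks))      # ['SETS', 'TAXA', 'TREES']
print(read_trees(blocks))  # {'example': '(fish,((cat,dog),rat));'}
```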
3.  The Newick tree format is limited.  There is no general mechanism to 
annotate whole trees, nodes or branches, only a mechanism to add branch 
lengths.  The nested-parentheses format is hierarchical, which means that 
it implies a rooted tree, although most trees used in phylogenetics are not 
rooted.  The hierarchical representation allows polytomies (a parent with 
more than two children) but not anastomosis (a node with more than one 
parent, as when symbiosis or recombination occurs).  The Newick format 
could be extended to allow more annotation 
of specific nodes and branches with analytical results, cross-references, 
and display parameters.  In fact, the NEXUS standard suggests that in its 
TREES block, such additional information can be put in square brackets 
(which is how comments are demarcated in a NEXUS file) following the 
optional branch length.  
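
A rough sketch of how a program might pull such bracketed annotations 
out of a tree description, in Python.  The function name split_comments 
and the "boot=87" annotation are invented for illustration; no syntax 
for the text inside the brackets is assumed here, and nested brackets 
are simply kept as part of the enclosing comment. 

```python
def split_comments(newick):
    """Separate [bracketed] comments from a Newick string.

    Returns (plain, comments), where plain is the string without comments
    and comments is a list of (offset, text) pairs; offset is the position
    in plain at which the comment was found.
    """
    plain = []
    comments = []
    current = []
    depth = 0
    for ch in newick:
        if ch == "[":
            if depth > 0:
                current.append(ch)  # nested bracket: keep it in the text
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                comments.append((len(plain), "".join(current)))
                current = []
            else:
                current.append(ch)
        elif depth > 0:
            current.append(ch)
        else:
            plain.append(ch)
    if depth != 0:
        raise ValueError("unclosed '[' in tree description")
    return "".join(plain), comments


plain, comments = split_comments("((cat:0.1,dog:0.1):0.2[boot=87],rat:0.3);")
print(plain)  # ((cat:0.1,dog:0.1):0.2,rat:0.3);
for offset, text in comments:
    # The text just before the offset shows which branch the comment follows.
    print(repr(plain[:offset][-4:]), text)  # ':0.2' boot=87
```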
4. A more general way to represent a tree is a (non-hierarchical) list 
of nodes and edges.  This is the basis for graph modelling languages 
such as XGMML.  Perhaps XGMML could be used with little modification. 
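
To show what the node-and-edge view buys, here is a minimal sketch in 
Python (plain lists, not XGMML).  The node labels and the helper names 
parents_of and reticulate_nodes are my own, and the extra "fish to rat" 
edge is a purely hypothetical event added only to produce a node with 
two parents. 

```python
from collections import defaultdict

# The tree (fish, ((cat, dog), rat)) as a flat list of nodes and
# directed (parent, child) edges.  Internal node labels are made up.
nodes = ["root", "mammals", "carnivores", "fish", "cat", "dog", "rat"]
tree_edges = [
    ("root", "fish"),
    ("root", "mammals"),
    ("mammals", "carnivores"),
    ("mammals", "rat"),
    ("carnivores", "cat"),
    ("carnivores", "dog"),
]


def parents_of(edges):
    """Map each child node to the list of its parents."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    return dict(parents)


def reticulate_nodes(edges):
    """Nodes with more than one parent, which nested parentheses cannot express."""
    return sorted(child for child, ps in parents_of(edges).items() if len(ps) > 1)


# Anastomosis is just one more edge in the list (hypothetical example).
network_edges = tree_edges + [("fish", "rat")]

print(reticulate_nodes(tree_edges))     # []
print(reticulate_nodes(network_edges))  # ['rat']
```

Dropping the direction of the edges gives an unrooted tree in the same 
representation, and annotations can hang off any node or edge record. 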
Arlin