[Bioperl-l] Comparative genomics
Arlin Stoltzfus
arlin@carb.nist.gov
Tue, 02 Oct 2001 10:00:10 -0400
Bioperlers--
Those interested in representing phylogenetic trees and associated
inferences might benefit from an ongoing e-discussion among evolutionary
systematists and computational biologists who wish to develop an
XML format for the transfer of phylogenetic data. The archives of the
mailing list are here:
http://evolution.genetics.washington.edu/pipermail/xml/
though perhaps it would be more useful to look at this page:
http://evolve.zoo.ox.ac.uk/PhyloXML/
A few points:
1. The nested parentheses format for representing a phylogeny like this:
(fish, ((cat, dog), rat));
is called the "Newick" or "New Hampshire" standard. Branch lengths are
added by putting ":<number>" after the descendant node. Internal nodes
can be named, as in "(fish, ((cat, dog) Carnivora, rat) Mammalia);". It is
conventional to allow a block of multiple trees with weights and names,
one tree per line. Newick is the standard for trees, as universal as
FASTA is for sequences.
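
To make point 1 concrete, here is a minimal Python sketch of a reader for
the subset of Newick described above (leaf names, internal node names and
":<number>" branch lengths). The names Node, parse_newick and leaf_names
are made up for this illustration; quoted labels, bracketed comments and
multi-tree blocks are assumed away, so it is not a complete Newick parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted tree (hypothetical helper type for this sketch)."""
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse one Newick string; assumes no quoted labels or [comments]."""
    s = "".join(text.split())  # whitespace is treated as insignificant here
    if s.endswith(";"):
        s = s[:-1]
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            while True:
                node.children.append(parse_node())
                if pos < len(s) and s[pos] == ",":
                    pos += 1
                    continue
                if pos < len(s) and s[pos] == ")":
                    pos += 1
                    break
                raise ValueError(f"expected ',' or ')' at position {pos}")
        # Optional label (leaf name or internal node name).
        start = pos
        while pos < len(s) and s[pos] not in "(),:":
            pos += 1
        node.name = s[start:pos]
        # Optional branch length after ":".
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in "(),:":
                pos += 1
            node.length = float(s[start:pos])
        return node

    root = parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_names(node: Node) -> List[str]:
    """Return leaf names in left-to-right order."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


tree = parse_newick("(fish, ((cat, dog) Carnivora, rat) Mammalia);")
```

For the tree above, leaf_names(tree) gives the four leaves in order, and
tree.children[1].name is the "Mammalia" label on the inner node.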
2. NEXUS is a standard format with separate blocks for representing
alignments, trees, assumptions, etc., used in phylogenetic analysis.
NEXUS incorporates a TREES block for Newick trees. OTUs (i.e.,
sequences) and characters (i.e., alignment columns) can be assigned
to subsets in a SETS block to allow differential treatment in analyses
(e.g., different models for 1st, 2nd and 3rd codon positions).
NEXUS has been in use for close to a decade as an input format for
phylogenetic analysis programs such as PAUP and MacClade, though my
guess would be that it is not used by the majority of such programs.
A proper format description has been published:
Maddison, D. R., D. L. Swofford, and W. P. Maddison (1997). "NEXUS: an
extensible file format for systematic information." Systematic
Biology 46: 590-621.
The published standard is much more flexible and extensive than any current
implementation. For instance, the standard allows the specification of
genetic codes for different nucleotide sequences, but this feature is
not used in any program, to my knowledge. Probably anything you
want to do could be done within the published NEXUS standard.
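
As a rough illustration of the block structure in point 2, the Python
sketch below splits a NEXUS-style file into its "BEGIN name; ... END;"
blocks. The function name nexus_blocks and the tiny example file are
invented for this note; the sketch assumes no bracketed comments, no
quoted tokens and no repeated block names, so it is far from a
conforming NEXUS reader.

```python
import re

# "BEGIN <name>; ... END;" with the body captured lazily.
BLOCK = re.compile(r"\bbegin\s+(\w+)\s*;(.*?)\bend\s*;",
                   re.IGNORECASE | re.DOTALL)


def nexus_blocks(text):
    """Map each upper-cased block name to its list of commands."""
    blocks = {}
    for name, body in BLOCK.findall(text):
        # Commands are ';'-terminated; collapse internal whitespace.
        commands = [" ".join(c.split()) for c in body.split(";") if c.strip()]
        blocks[name.upper()] = commands
    return blocks


# Illustrative file only: a TREES block holding a Newick tree and a SETS
# block assigning alignment columns to 1st, 2nd and 3rd codon positions.
EXAMPLE = r"""#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=4;
  TAXLABELS fish cat dog rat;
END;
BEGIN TREES;
  TREE example = (fish, ((cat, dog), rat));
END;
BEGIN SETS;
  CHARSET first = 1-.\3;
  CHARSET second = 2-.\3;
  CHARSET third = 3-.\3;
END;
"""

blocks = nexus_blocks(EXAMPLE)
```

Here blocks["TREES"] holds the single TREE command, whose right-hand side
is an ordinary Newick string.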
3. The Newick tree format is limited. There is no general mechanism to
annotate whole trees, nodes or branches, only a mechanism to add branch
lengths. The nested parentheses format is hierarchical, which means that
it implies a rooted tree, although most trees used in phylogenetics are not
rooted. The hierarchical representation allows polytomies (>2 children of a
parent) but not anastomosis (>1 parent, as when symbiosis or recombination
occurs). The Newick format could be extended to allow more annotation
of specific nodes and branches with analytical results, cross-references,
and display parameters. In fact, the NEXUS standard suggests that in its
TREES block, such additional information can be put in square brackets
(which is how comments are demarcated in a NEXUS file) following the
optional branch length.
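
The bracket convention in point 3 could be handled along the following
lines. This Python sketch pulls "[...]" annotations that directly follow
a label and optional branch length out of a Newick string. The function
name extract_annotations and the "bootstrap=..." annotation text are
assumptions made for the example; nested brackets and whitespace before
the bracket are not handled.

```python
import re

# Optional label, optional ":length", then one bracketed annotation.
ANNOTATED = re.compile(
    r"([^\s(),:;\[\]]*)(?::([^\s(),:;\[\]]+))?\[([^\]]*)\]"
)


def extract_annotations(newick):
    """Return (plain_newick, [(label, length, annotation), ...])."""
    annotations = []

    def keep(match):
        label, length, note = match.groups()
        annotations.append((label, length, note))
        whole = match.group(0)
        # Keep the label and length, drop the bracketed part.
        return whole[: whole.index("[")]

    plain = ANNOTATED.sub(keep, newick)
    return plain, annotations


plain, notes = extract_annotations(
    "(fish:0.4[bootstrap=100],(cat:0.1,dog:0.1)Carnivora:0.2[bootstrap=87]);"
)
```

After this call, plain is a Newick string that any ordinary reader can
take, and notes pairs each annotation with the node it followed.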
4. A more general way to represent a tree is a (non-hierarchical) list
of nodes and edges. This is the basis for general graph modelling languages
such as XGMML. Perhaps XGMML could be used with little modification.
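
A small Python sketch of point 4, under the assumption that a tree is
held as nested (name, children) pairs: it flattens the hierarchy into a
node list and an edge list, after which a node with more than one parent
is just one more edge. The names to_edge_list and parents_of, the "root"
label and the node "X" are all hypothetical and only serve the example.

```python
def to_edge_list(tree):
    """Flatten a nested (name, children) tree into (nodes, edges)."""
    nodes, edges = [], []

    def visit(node, parent):
        name, children = node
        nodes.append(name)
        if parent is not None:
            edges.append((parent, name))  # directed: parent -> child
        for child in children:
            visit(child, name)

    visit(tree, None)
    return nodes, edges


def parents_of(edges, name):
    """List every parent of a node; more than one means anastomosis."""
    return [parent for parent, child in edges if child == name]


# The example tree from point 1, with an invented label for the root.
tree = ("root", [
    ("fish", []),
    ("Mammalia", [
        ("Carnivora", [("cat", []), ("dog", [])]),
        ("rat", []),
    ]),
])
nodes, edges = to_edge_list(tree)

# Purely illustrative reticulation: node "X" gets two parents, which the
# nested parentheses format cannot express.
nodes.append("X")
edges.extend([("Carnivora", "X"), ("rat", "X")])
```

With the edge list in hand, parents_of(edges, "X") returns both parents,
while every node of the original tree still has at most one.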
Arlin