[Biopython-dev] Distance Matrix Parsers

Mon Jun 12 21:57:36 UTC 2006

[Send to the Dev list only - forward to the main discussion list if you 
think best Marc]

One general question about the architecture: Are you thinking of having 
a generic "distance matrix object", and parsers/formats defined for 
several different file formats?

Peter (me) wrote:
>>In my experience, most software tools usually write the distances as a
>>full symmetric matrix.  However, the "standard" explicitly discusses
>>lower triangular form (missing out the diagonal distance zero entries)
>>which has the significant advantage of using about half the disk  
>>space. This is significant once you get into thousands of taxa.

Marc Colosimo wrote:
> This is still small potatoes compared to the input needed to generate  
> the distance matrixs (especially with DNA/RNA sequences of any  
> decently sized gene).

Regarding the size of the matrix file versus the size of the alignment
file, that isn't always true.

(*) The matrix file size goes as the square of the number of taxa, the 
alignment file only linearly.

(*) The matrix file is invariant with respect to the length of the 
sequences/number of columns in the alignment.

(*) The matrix file size goes linearly with the precision (number of 
decimal places) used.

As you are using "decently sized genes" you will have large
alignment files, but I would imagine you have at most hundreds of genes
per alignment - not thousands (?).

For my own examples, I have about two thousand domains (not full genes) 
and the phylip distance matrix file was MUCH bigger than the alignment file.
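
To put rough numbers on the scaling points above, here is a back-of-envelope sketch. The function names, the one-digit-before-the-point assumption and the 10 character name column are all mine for illustration; real files will differ in the details.

```python
def square_matrix_chars(taxa, decimals, name_width=10):
    """Approximate character count of a full square distance matrix file."""
    # Assumed layout per value: one leading digit, the decimal point,
    # the decimal places, and one separating space.
    per_value = 1 + 1 + decimals + 1
    return taxa * (name_width + taxa * per_value)


def alignment_chars(taxa, columns, name_width=10):
    """Approximate character count of an alignment: one row per taxon."""
    return taxa * (name_width + columns)


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from a real dataset.
    for taxa in (100, 1000, 2000):
        print(taxa, square_matrix_chars(taxa, 4), alignment_chars(taxa, 300))
```

Doubling the number of taxa roughly quadruples the matrix estimate but only doubles the alignment estimate, and the number of alignment columns never enters the matrix estimate at all.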

Peter (me) wrote:
>>So, make sure any parser can cope with both full symmetric, and lower
>>triangular forms - ideally without the user having to care.

Marc Colosimo wrote:
> Phylip does ask you which to either read or write; this is a pain at  
> times. So, having a parser figure this out would be nice. However,  
> the user should know about the choices.

It's fairly easy for the parser to cope with either: for each line of
input, only use the "lower triangular" portion and ignore any
remaining text, which would be present in a full (square) matrix file
and absent in a lower triangular file.
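
A minimal sketch of that idea follows. The function name parse_phylip_distances is hypothetical, and I am assuming whitespace-separated names without embedded spaces and exactly one line per taxon (no wrapped rows), which is looser than the strict fixed-width PHYLIP layout.

```python
def parse_phylip_distances(lines):
    """Parse a square or lower triangular distance matrix.

    Returns (names, rows), where rows[i] holds the i distances from
    taxon i to taxa 0..i-1, i.e. the strict lower triangle.
    """
    lines = [line for line in lines if line.strip()]
    count = int(lines[0].split()[0])
    if len(lines) < count + 1:
        raise ValueError("expected %d taxa, found %d" % (count, len(lines) - 1))
    names = []
    rows = []
    for i, line in enumerate(lines[1:count + 1]):
        fields = line.split()
        names.append(fields[0])
        # Keep only the first i values. Anything after them (the diagonal
        # and upper triangle of a square file) is simply ignored.
        values = fields[1:1 + i]
        if len(values) < i:
            raise ValueError("row %d has too few distances" % (i + 1))
        rows.append([float(value) for value in values])
    return names, rows
```

Fed the same distances in either layout, this returns identical results, so the caller does not need to know which form was on disk.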

Peter wrote:
>>This also raises the point about how to store the matrix in memory.
>>Does Numeric/NumPy have an efficient way of storing symmetric  
>>matrices? This is less flexible than the suggested list of lists,
>>but for large datasets would need much less memory.

Marc Colosimo wrote:
> I believe that SciPy  (Numeric/NumPy/etc..) is more efficient at  
> storing these things. But you lose that when you want to do pythonish  
> things to it (like write it back out).

It depends on our target audience.  My experience with two thousand taxa 
means that I am slightly concerned about the memory, and would lean 
towards storing the data using Numeric/NumPy.  This could be done within 
a nice python object, with methods to write it out again in phylip 
format etc - so it could still behave "nicely".
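
Something along these lines is what I have in mind. The class DistanceMatrix and its methods are hypothetical, not an existing Biopython API. To keep the sketch runnable without extra packages it uses the standard library array module as a stand-in; a one-dimensional Numeric/NumPy array of the same length would slot into the same place.

```python
from array import array


class DistanceMatrix:
    """Symmetric matrix with a zero diagonal, stored as a strict lower triangle."""

    def __init__(self, names):
        self.names = list(names)
        n = len(self.names)
        # n*(n-1)/2 doubles instead of n*n Python floats in a list of lists.
        self._data = array("d", [0.0]) * (n * (n - 1) // 2)

    def _index(self, i, j):
        if i < j:
            i, j = j, i
        if not 0 <= j < i < len(self.names):
            raise IndexError("taxon index out of range")
        # Row i of the lower triangle starts after 0 + 1 + ... + (i-1) entries.
        return i * (i - 1) // 2 + j

    def __getitem__(self, key):
        i, j = key
        if i == j:
            return 0.0
        return self._data[self._index(i, j)]

    def __setitem__(self, key, value):
        i, j = key
        if i == j:
            raise ValueError("diagonal distances are fixed at zero")
        self._data[self._index(i, j)] = value

    def format_square(self, precision=4):
        """Return the matrix as text in full square layout."""
        n = len(self.names)
        lines = [str(n)]
        for i, name in enumerate(self.names):
            row = " ".join("%.*f" % (precision, self[i, j]) for j in range(n))
            lines.append("%s %s" % (name, row))
        return "\n".join(lines)
```

The user still indexes it as matrix[i, j] in either order and can write it back out, while the storage is about half that of a full square array.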

Peter wrote:
>>Second point - the "official" PHYLIP distance matrix file format
>>truncates the taxa names at 10 characters.  Some tools (e.g. clustalw)
>>ignore this limitation and will use as many as needed for the full  
>>name.

Marc Colosimo wrote:
> ...
> 
> By definition this isn't a variant of Phylip, but another format. So,  
> one would need two parsers: PhylipDist and Dist (or ClustalDist).

That would be another way of looking at the issue, sure.  [See below]

Peter wrote:
>>For writing matrices to file, the issue of following the strict 10
>>character taxa limit might best be handled as an option (default to  
>>max 10, with a warning if any names are truncated, and an error if
>>truncation renders names non-unique?).

Marc Colosimo wrote:
> DON'T give an option of 10 or more. That is NOT the definition of the  
> Phylip file Matrix structure, so why give the option? Make another  
> class that outputs the whole name (ClustalDist).

I like clustal's "long name variant of Phylip distance format", as for 
my datasets my gene/domain names are longer than 10 characters.  I may 
well be in a minority here (for now).

I suppose it would be "good practice" to follow the official (but not
overly precise) phylip definition on this issue.

So your idea of defining two similar formats would resolve this.  In 
terms of implementation, one could probably just subclass the other to 
reduce the amount of duplicated code.
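
As a rough sketch of the subclassing idea (the class names PhylipDistWriter and ClustalDistWriter and their methods are made up here, loosely following Marc's PhylipDist/ClustalDist naming), the only thing the long-name variant needs to override is the name width:

```python
class PhylipDistWriter:
    """Formats rows with the strict 10 character PHYLIP name column."""

    name_width = 10

    def format_name(self, name):
        if self.name_width is None:
            return name
        # Truncate or pad to the fixed width. This sketch truncates
        # silently and does not check that the results are unique.
        return name[:self.name_width].ljust(self.name_width)

    def format_row(self, name, distances, precision=4):
        values = " ".join("%.*f" % (precision, d) for d in distances)
        return ("%s %s" % (self.format_name(name), values)).rstrip()


class ClustalDistWriter(PhylipDistWriter):
    """Long-name variant: identical apart from leaving names untouched."""

    name_width = None
```

Everything else (row formatting, precision handling, and presumably the parsing side too) would be shared between the two formats.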

> I am pretty sure that Phylip doesn't care about non-unique names so  
> why error out? However, the class should have a means for the user to  
> ask this question.

Because the (truncated) taxa names are going to be used as tree node 
names by any tree building program, they really should be unique.  I 
would expect any tree program to throw an error in this case, which is 
why I suggested we should try not to create such files in the first place.
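
The check I suggested could look roughly like this. The helper name truncate_names is hypothetical, and the choice of ValueError for the clash is just one option:

```python
import warnings


def truncate_names(names, width=10):
    """Truncate taxa names to width characters.

    Warns if any name is shortened and raises ValueError if truncation
    makes two names identical.
    """
    names = list(names)
    truncated = [name[:width] for name in names]
    if truncated != names:
        warnings.warn("some taxa names were truncated to %d characters" % width)
    seen = {}
    for original, short in zip(names, truncated):
        if short in seen:
            raise ValueError(
                "names %r and %r both become %r" % (seen[short], original, short)
            )
        seen[short] = original
    return truncated
```

Note that this also rejects names which were already duplicates before truncation, which seems reasonable for the same tree-building reason.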

Peter wrote:
>>Likewise an option to save matrices as either fully symmetric or lower
>>triangular.  I would lean towards using fully symmetric as the default
>>as it seems to be more common.

Marc Colosimo wrote:
> Phylip's default seems to be a "Square" distance matrix, i.e. fully  
> symmetric. Keep this in mind when naming or documentation.

Good point.

Peter