[Biojava-dev] phylo code

Tue Aug 7 14:34:44 UTC 2007

Hi Richard and Thasso,

On Aug 7, 2007, at 3:48 AM, Richard Holland wrote:

> Thanks for your feedback Thasso.
>
> The fire/events thing is certainly a misnomer - guilty as charged (I
> wrote the code...) - but I suppose I wasn't expecting the naming to
> matter much. I'll bear that in mind for future code. We can't really
> change the existing interfaces now as they've been released and it is
> not nice to users for us to change public interfaces that might
> already be in use.
>
> The PHYLIP format handler was written by Jim Balhoff. Jim - do you
> have any responses to Thasso's comments about the output options?

I think it would be great to have import and export classes for  
PHYLIP trees and distance matrices.  The current code handles only  
alignments.  The other data would be in separate files, and so not  
part of this parser.

> I like the sound of your PHYLIP short-name map. You could
> definitely go ahead and contribute an update which implemented
> that. (Don't forget to make your code clear the map between one
> file and the next!)

Yes, I think the map is a great idea.  The first edition of the  
PHYLIP parser was simple and strictly stuck to the format  
specification.  The map would be a great way to transparently use  
longer names when running PHYLIP behind the scenes.  If the user is  
actually exporting a PHYLIP formatted alignment to disk, it might be  
nice to have a few options for what should happen - the current  
truncation method could be one option, another might be to simply put  
in the long name and put a space before the sequence starts (not  
strictly PHYLIP, but it is a simple alignment format recognized by  
some programs), another might be to raise an exception or otherwise  
alert if sequence names are too long.
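
To make the idea concrete, here is a minimal sketch of how such a map and the export options might look. PhylipNameMap, LongNamePolicy and formatLine are hypothetical names invented for illustration, not existing BioJava classes, and the generated "S000000001"-style short names are just one assumed scheme for producing unique 10-character names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not an existing BioJava class). */
final class PhylipNameMap {
    /** Width of the name field in strict PHYLIP files. */
    static final int NAME_WIDTH = 10;

    /** Illustrative export policies for names longer than NAME_WIDTH. */
    enum LongNamePolicy { TRUNCATE, RELAXED, FAIL }

    private final Map<String, String> longToShort = new HashMap<>();
    private final Map<String, String> shortToLong = new HashMap<>();

    /** Returns a generated 10-character name, reusing it on repeat calls. */
    String shortNameFor(String longName) {
        String shortName = longToShort.get(longName);
        if (shortName == null) {
            // "S" plus a zero-padded counter is exactly 10 characters wide.
            shortName = String.format(Locale.ROOT, "S%09d", longToShort.size() + 1);
            longToShort.put(longName, shortName);
            shortToLong.put(shortName, longName);
        }
        return shortName;
    }

    /** Translates a short name back; unknown names are returned unchanged. */
    String longNameFor(String shortName) {
        String longName = shortToLong.get(shortName);
        return longName != null ? longName : shortName;
    }

    /** Call between files so mappings from one alignment never leak into the next. */
    void clear() {
        longToShort.clear();
        shortToLong.clear();
    }

    /** Formats one alignment line according to the chosen policy. */
    static String formatLine(String name, String sequence, LongNamePolicy policy) {
        if (name.length() > NAME_WIDTH) {
            switch (policy) {
                case RELAXED:
                    // Not strict PHYLIP: full name, one space, then the sequence.
                    return name + " " + sequence;
                case FAIL:
                    throw new IllegalArgumentException(
                            "Name longer than " + NAME_WIDTH + " characters: " + name);
                default:
                    // TRUNCATE: the current behaviour, keep the first 10 characters.
                    name = name.substring(0, NAME_WIDTH);
            }
        }
        // Strict PHYLIP: name padded to 10 characters, sequence follows directly.
        return String.format("%-" + NAME_WIDTH + "s", name) + sequence;
    }
}
```

The map would be used when running PHYLIP behind the scenes (write short names, translate results back with longNameFor), while formatLine covers the case where the user is exporting to disk and picks a policy explicitly.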

Another enhancement to the PHYLIP classes would be to let the
developer specify interleaved or sequential alignment format for
import and export (and, for both formats, the line length to use on
export).  Right now I think there are some files which will not be
parsed correctly - most likely a sequential-style file with newlines
within the sequences (if a "sequential" alignment has no newlines, it
is equivalent to "interleaved").  Or, instead of having the developer
specify interleaved or sequential, we could figure out how to detect
the two reliably.  Here are the examples:

<http://evolution.genetics.washington.edu/phylip/doc/main.html#inputfiles>
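
As a rough illustration of the "let the developer specify" option, the sketch below reads a PHYLIP alignment with the layout given explicitly. PhylipLayoutSketch, Layout and parse are hypothetical names, not the existing BioJava parser, and the code assumes a well-formed file with strict 10-character name fields and no options on the header line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the existing BioJava PHYLIP parser). */
final class PhylipLayoutSketch {
    enum Layout { INTERLEAVED, SEQUENTIAL }

    private static final int NAME_WIDTH = 10;

    /** Parses PHYLIP text into name -> sequence, in file order. */
    static Map<String, String> parse(String text, Layout layout) {
        // Drop blank lines; they only separate blocks.
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        // Header: number of taxa, then number of characters per sequence.
        String[] header = lines.get(0).trim().split("\\s+");
        int ntax = Integer.parseInt(header[0]);
        int nchar = Integer.parseInt(header[1]);

        List<String> names = new ArrayList<>();
        List<StringBuilder> seqs = new ArrayList<>();
        int i = 1;
        if (layout == Layout.SEQUENTIAL) {
            // Each taxon is complete before the next begins; keep reading
            // lines until the declared number of characters has been seen.
            for (int t = 0; t < ntax; t++) {
                String first = lines.get(i++);
                names.add(nameOf(first));
                StringBuilder seq = new StringBuilder(residuesAfterName(first));
                while (seq.length() < nchar) {
                    seq.append(stripWhitespace(lines.get(i++)));
                }
                seqs.add(seq);
            }
        } else {
            // First block: one named line per taxon.
            for (int t = 0; t < ntax; t++) {
                String first = lines.get(i++);
                names.add(nameOf(first));
                seqs.add(new StringBuilder(residuesAfterName(first)));
            }
            // Later blocks: unnamed lines, cycling through the taxa in order.
            for (int t = 0; i < lines.size(); t = (t + 1) % ntax) {
                seqs.get(t).append(stripWhitespace(lines.get(i++)));
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int t = 0; t < ntax; t++) {
            result.put(names.get(t), seqs.get(t).toString());
        }
        return result;
    }

    private static String nameOf(String line) {
        return line.substring(0, Math.min(NAME_WIDTH, line.length())).trim();
    }

    private static String residuesAfterName(String line) {
        return line.length() > NAME_WIDTH ? stripWhitespace(line.substring(NAME_WIDTH)) : "";
    }

    private static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }
}
```

With no newlines inside the sequences, both layouts give the same result. Once a sequential file wraps its sequences, reading it as interleaved takes the continuation line of the first taxon for the name line of the second, which is the failure case described above.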

Best regards,
Jim

____________________________________________
James P. Balhoff, Ph.D.
National Evolutionary Synthesis Center
2024 West Main St., Suite A200
Durham, NC 27705
USA