[Bioperl-l] AlignIO::* match_char, gap_char and missing_char etc

Nathan Haigh n.haigh at sheffield.ac.uk
Thu May 12 06:10:23 EDT 2005


I've noticed some inconsistency in the way sequence alignments are read and
stored and printed when match_char, gap_char and missing_char are used.

Should sequences be stored exactly the way they are represented in the
file? Should there be default values for formats that support one or more of
match_char, gap_char and missing_char, or should these only be set if they
are used in the alignment file? Should formats that don't support match_char
check for match characters and call unmatch during a write_aln? Should
formats that use specific characters for match_char, gap_char and
missing_char check for other characters and call map_chars if required
during a write_aln?

I was going to have a look through AlignIO::* and try to make the modules
more consistent with regard to these characters. What I propose to do is:

1) Have default values for match_char, gap_char and missing_char for
those formats that only support a particular character

2) Have match_char, gap_char and missing_char set when the appropriate
command for setting these characters is found in the file

3) Store the sequences exactly as they are in the alignment file
(except maybe for match_char)

4) During write_aln, conduct checks to ensure the sequences are
compliant with the features (match_char, gap_char and missing_char)
supported by that format, and call map_chars or unmatch/match as required.
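
Step 4 can be sketched roughly as follows. This is an illustration only, written in Python so that it does not depend on the Bioperl API. The names CharSet and prepare_for_write are hypothetical, and the sketch assumes that the first row of the alignment is the reference for match characters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CharSet:
    """Special characters in use (None means unused or unsupported).

    Hypothetical helper type for this sketch; not part of Bioperl.
    """
    gap: Optional[str] = "-"
    missing: Optional[str] = "?"
    match: Optional[str] = None


def prepare_for_write(rows: List[str], current: CharSet, target: CharSet) -> List[str]:
    """Return rows rewritten so they only use characters the target format allows."""
    # Expand match characters when the target format cannot represent them.
    # Assumption: the first row is the reference sequence.
    if current.match is not None and target.match is None:
        ref = rows[0]
        rows = [ref] + [
            "".join(r if c == current.match else c for r, c in zip(ref, row))
            for row in rows[1:]
        ]

    # Build a one-pass translation table for characters that differ.
    # A character the target does not support at all is left unchanged here;
    # a real writer would need to decide whether to warn or fail instead.
    table = {}
    for name in ("gap", "missing", "match"):
        src, dst = getattr(current, name), getattr(target, name)
        if src is not None and dst is not None and src != dst:
            table[ord(src)] = dst
    return [row.translate(table) for row in rows]


# Example: an alignment using "~" gaps and "." match characters, written to a
# format that wants "-" gaps and has no match character.
in_memory = CharSet(gap="~", missing="?", match=".")
plain_format = CharSet(gap="-", missing="?", match=None)
out_rows = prepare_for_write(["ACG~TA", "..T~?."], in_memory, plain_format)
```

Doing the conversion at write time leaves the stored alignment untouched, which fits with step 3.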

I suppose the only open question is whether unmatch should be called during
read_aln, so that sequences are stored with the actual residue characters
instead of the match_char. The reasoning is that many formats don't support
match characters, and the user can always call "match" on the SimpleAlign
object afterwards, which would bring some level of consistency to the use of
this feature.
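
To illustrate why expanding match characters on read loses nothing, here is a small Python sketch of the two operations. The function names mirror the SimpleAlign methods but the code is hypothetical, not the Bioperl implementation. It assumes the first row is the reference and that gap columns are never turned into match characters.

```python
from typing import List


def unmatch(rows: List[str], match_char: str = ".") -> List[str]:
    """Replace each match character with the residue from the first row."""
    ref = rows[0]
    return [ref] + [
        "".join(r if c == match_char else c for r, c in zip(ref, row))
        for row in rows[1:]
    ]


def match(rows: List[str], match_char: str = ".", gap_char: str = "-") -> List[str]:
    """Replace residues identical to the first row with the match character.

    Assumption for this sketch: gap columns are left as gaps.
    """
    ref = rows[0]
    return [ref] + [
        "".join(match_char if c == r and c != gap_char else c for r, c in zip(ref, row))
        for row in rows[1:]
    ]


# Rows as they might appear in a file that uses "." as the match character.
file_rows = ["ACGT-A", "..T.-."]
stored = unmatch(file_rows)   # what read_aln would keep in memory
compact = match(stored)       # what the user gets back by calling match
```

One caveat: the round trip only reproduces the file exactly when the file used a match character at every identical position. A file that spelled out some matching residues would come back in the fully compacted form.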

This will be my first foray into making bigger changes in Bioperl as a
developer! Yikes! So I'd like to know what people think, as well as their
experiences with similar problems. I'm most familiar with nexus, clustal,
phylip and fasta, so it would be nice to hear comments on, or problems with,
some of the other formats!

Cheers

Nath

----------------------------------
Nathan Haigh
PostDoctoral Research Associate

Department of Animal and Plant Sciences
University of Sheffield
Western Bank
Sheffield
S10 2TN

Tel: +44 (0)114 22 20112
Mob: +44 (0)7742 533 569
Fax: +44 (0)114 22 20002



