[Bioperl-l] SimpleAlign and character sets

Heikki Lehvaslaiho heikki@ebi.ac.uk
Tue, 05 Jun 2001 11:05:52 +0100


[long and rambling, sorry]


BioPerlers,

I've continued playing with SimpleAlign. I was trying thinking what is
needed to make it more useful and realized that it is necessary to
know what certain special characters mean in the sequence.

Bioperl is build around idea that libraries do not define character
sets used. Programs should work with both upper and lower case letters
and so on. However, Bio::PrimarySeq::validate_seq throws an error
unless 
$seqstring =~ /^[A-Za-z\-\.\*]+$/. In practise this means that gaps in
the alignment can be marked with either '-' or '.'. '*' is commonly
used only for stops in proteins. Other characters like '~' has to be
changed into something else before put into a sequence object.

QUESTION 1. Is this wide enough character set or should it be
extended? (see below)

There are two really common MSE formats which are not in BioPerl:
phylip and nexus. Both are used heavily in evolutionary studies.

QUESTION 2. Is anyone  interested in binding bioperl better to
phylogenetic programs?

Jason mentioned EMBOSS some time ago. It has bindings to PHYLIP. In my
opinion someone should look into that. 


phylip MSE format is an easy one. I have it almost ready. Only really
old PHYLIP programs use sequential format; I am implementing only the
interleaved format. 

nexus used by PAUP, MacClade and other Mac/MS programs is a lot more
complex. I've started writing it. The input is not too bad but to
output the SimpleAlign object needs to know what characters are used
for 


1. missing char    (usually '?')
2. unknown         (N for nucleotides, X for protein)
3. match char      (usually .)
4. gap char        (usually -)

if any of these are present. 

Have a look at an example:
...
Begin data;
    Dimensions ntax=14 nchar=7262;
    Format datatype=DNA interleave gap=- equate="U=T ~=."
           matchchar=.;
    Matrix
   TFnode1  GCCGAACTTA GGAAATTAGT CTGAACAGGT GAGAGGGTGC GCCAGAGAAC 
       Das  ---------- ---------- ---------- ---------- ---------- 
...

(equate is yet an other nice nexus command which is luckily easy to
parse.)


 '?' seems to me to be a good candidate to add into sequence set.
Would it break something? What is there against adding it?

On the plus side is that one could take a consensus_string() from an
alignment (e.g. 'ATGCG?TG?') and put it into a sequence object. 

On the other hand I am not too sure how necessary the missing
character is in sequence data (PAUP can handle general cladistic
character data) and there is the '*' character left.


To output this kind of format, I am planning to add a set/get method
into SimpleAlign for each of the four character classes. In my opinion
they balance the need to have that information in the sequence object
without adding unnecessary complexity to the class. Only IO methods
needing them, like nexus, will require them and there are obvious
default values for each of them.

QUESTION 4. Comments or suggestions on this from anyone.



Speaking of complexity of the SimpleAlign class, I am going to
reorganize the methods in the file and add headers.  My plan is below:


	-Heikki


     Modifier methods
     * add_seq
     * remove_seq
     * purge
     * sort_alphabetically

     Sequence selection methods
     * each_seq
     * each_alphabetically
     * each_seq_with_id

     Create new alignments
     * select
     * slice

     Change sequences within the MSE
     * map_chars
     * uppercase
     * match          [ RENAMED FROM dot()  ]
     * unmatch        [ TO BE ADDED ]

     MSE attibutes
     * id
     * missing_char   [ TO BE ADDED ]
     * unknown_char   [ TO BE ADDED ]
     * match_char     [ TO BE ADDED ]
     * gap_char       [ TO BE ADDED ]

     Alignment descriptors
     * consensus_string
     * consensus_aa
     * is_flush
     * length
     * maxdisplayname_length
     * no_residues
     * no_sequences
     * percentage_identity

     Alignment positions
     * column_from_residue_number

     Sequence names
     * displayname
     * set_displayname_count
     * set_displayname_flat
     * set_displayname_normal
     _________________________________________________________________

                               Modifier methods

   These methods modify the MSE by adding or removing complete 
   sequences.

add_seq
remove_seq
purge
sort_alphabetically
     _________________________________________________________________

                          Sequence selection methods

   Methods returning one or more sequences objects.

each_alphabetically
each_seq
each_seq_with_id
     _________________________________________________________________

                             Create new alignments

   The result of these methods are horizontal or vertical subsets of 
   the current MSE.

select
slice
     _________________________________________________________________

                        Change sequences within the MSE

   These methods affect characters in all sequences without changeing 
   the alignment.

map_chars
uppercase
match
unmatch
     _________________________________________________________________

                                 MSE attibutes

   Methods for setting and reading the MSE attributes.

id
missing_char
unknown_char
match_char
gap_char
     _________________________________________________________________

                             Alignment descriptors

   These read only methods describe the MSE in various ways.

consensus_string
consensus_aa
is_flush
length
maxdisplayname_length
no_residues
no_sequences
percentage_identity
     _________________________________________________________________

                              Alignment positions

   A method to map a sequence position into an alignment column.

column_from_residue_number
     _________________________________________________________________

                                Sequence names

   Methods to manipulate the display name. The default name based on   
   the sequence id and subsequence positions can be overridden in 
   various ways.

displayname
set_displayname_count
set_displayname_flat
set_displayname_normal