[Bioperl-l] SimpleAlign and character sets
Heikki Lehvaslaiho
heikki@ebi.ac.uk
Tue, 05 Jun 2001 11:05:52 +0100
[long and rambling, sorry]
BioPerlers,
I've continued playing with SimpleAlign. I was trying thinking what is
needed to make it more useful and realized that it is necessary to
know what certain special characters mean in the sequence.
Bioperl is build around idea that libraries do not define character
sets used. Programs should work with both upper and lower case letters
and so on. However, Bio::PrimarySeq::validate_seq throws an error
unless
$seqstring =~ /^[A-Za-z\-\.\*]+$/. In practise this means that gaps in
the alignment can be marked with either '-' or '.'. '*' is commonly
used only for stops in proteins. Other characters like '~' has to be
changed into something else before put into a sequence object.
QUESTION 1. Is this wide enough character set or should it be
extended? (see below)
There are two really common MSE formats which are not in BioPerl:
phylip and nexus. Both are used heavily in evolutionary studies.
QUESTION 2. Is anyone interested in binding bioperl better to
phylogenetic programs?
Jason mentioned EMBOSS some time ago. It has bindings to PHYLIP. In my
opinion someone should look into that.
phylip MSE format is an easy one. I have it almost ready. Only really
old PHYLIP programs use sequential format; I am implementing only the
interleaved format.
nexus used by PAUP, MacClade and other Mac/MS programs is a lot more
complex. I've started writing it. The input is not too bad but to
output the SimpleAlign object needs to know what characters are used
for
1. missing char (usually '?')
2. unknown (N for nucleotides, X for protein)
3. match char (usually .)
4. gap char (usually -)
if any of these are present.
Have a look at an example:
...
Begin data;
Dimensions ntax=14 nchar=7262;
Format datatype=DNA interleave gap=- equate="U=T ~=."
matchchar=.;
Matrix
TFnode1 GCCGAACTTA GGAAATTAGT CTGAACAGGT GAGAGGGTGC GCCAGAGAAC
Das ---------- ---------- ---------- ---------- ----------
...
(equate is yet an other nice nexus command which is luckily easy to
parse.)
'?' seems to me to be a good candidate to add into sequence set.
Would it break something? What is there against adding it?
On the plus side is that one could take a consensus_string() from an
alignment (e.g. 'ATGCG?TG?') and put it into a sequence object.
On the other hand I am not too sure how necessary the missing
character is in sequence data (PAUP can handle general cladistic
character data) and there is the '*' character left.
To output this kind of format, I am planning to add a set/get method
into SimpleAlign for each of the four character classes. In my opinion
they balance the need to have that information in the sequence object
without adding unnecessary complexity to the class. Only IO methods
needing them, like nexus, will require them and there are obvious
default values for each of them.
QUESTION 4. Comments or suggestions on this from anyone.
Speaking of complexity of the SimpleAlign class, I am going to
reorganize the methods in the file and add headers. My plan is below:
-Heikki
Modifier methods
* add_seq
* remove_seq
* purge
* sort_alphabetically
Sequence selection methods
* each_seq
* each_alphabetically
* each_seq_with_id
Create new alignments
* select
* slice
Change sequences within the MSE
* map_chars
* uppercase
* match [ RENAMED FROM dot() ]
* unmatch [ TO BE ADDED ]
MSE attibutes
* id
* missing_char [ TO BE ADDED ]
* unknown_char [ TO BE ADDED ]
* match_char [ TO BE ADDED ]
* gap_char [ TO BE ADDED ]
Alignment descriptors
* consensus_string
* consensus_aa
* is_flush
* length
* maxdisplayname_length
* no_residues
* no_sequences
* percentage_identity
Alignment positions
* column_from_residue_number
Sequence names
* displayname
* set_displayname_count
* set_displayname_flat
* set_displayname_normal
_________________________________________________________________
Modifier methods
These methods modify the MSE by adding or removing complete
sequences.
add_seq
remove_seq
purge
sort_alphabetically
_________________________________________________________________
Sequence selection methods
Methods returning one or more sequences objects.
each_alphabetically
each_seq
each_seq_with_id
_________________________________________________________________
Create new alignments
The result of these methods are horizontal or vertical subsets of
the current MSE.
select
slice
_________________________________________________________________
Change sequences within the MSE
These methods affect characters in all sequences without changeing
the alignment.
map_chars
uppercase
match
unmatch
_________________________________________________________________
MSE attibutes
Methods for setting and reading the MSE attributes.
id
missing_char
unknown_char
match_char
gap_char
_________________________________________________________________
Alignment descriptors
These read only methods describe the MSE in various ways.
consensus_string
consensus_aa
is_flush
length
maxdisplayname_length
no_residues
no_sequences
percentage_identity
_________________________________________________________________
Alignment positions
A method to map a sequence position into an alignment column.
column_from_residue_number
_________________________________________________________________
Sequence names
Methods to manipulate the display name. The default name based on
the sequence id and subsequence positions can be overridden in
various ways.
displayname
set_displayname_count
set_displayname_flat
set_displayname_normal