[BioPython] Performing sequence alignments, etc.

Sun Oct 14 17:38:32 UTC 2007

Caitlin wrote:
> Hi all.
> 
> I'm relatively new to the field of bioinformatics and I'm trying to
> perform a multiple sequence alignment on 5-6 sequences (fasta format -
> dna sequences). I'd like the output to be formatted in the following
> manner (clustalw standalone output):

For reading and writing Clustalw alignment files, you could either use 
Bio.SeqIO (format name "clustal") or the Bio.Clustalw module.
http://biopython.org/wiki/SeqIO

> When one or more nucleotide columns are identical, clustalw displays
> an asterisk. If not, a blank space is displayed. Is this a standard
> feature of BioPython?

There is an example of Clustalw output online here - note there can also 
be a column of numbers on the right hand side (not shown here):
http://www.bioperl.org/wiki/ClustalW_multiple_alignment_format

It sounds like you are describing the simple consensus string which 
clustalw outputs under the alignment (using the characters *, : and . 
plus a space).

Biopython has a SummaryInfo object which can calculate simple consensus 
sequences (see the tutorial). Perhaps this would be close to what you 
want to do.
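
If all you need is the line of asterisks, it is also easy to work out 
by hand. The following is only a rough sketch in plain Python, not 
Biopython's or clustalw's own code: the function name identity_line and 
the example sequences are made up for illustration, and I have assumed 
that a column containing a gap never gets an asterisk.

```python
def identity_line(aligned_seqs):
    """Return '*' under fully identical columns and ' ' elsewhere.

    aligned_seqs is a list of equal-length strings that have already
    been aligned, with '-' used for gaps. Assumption: a column that
    contains a gap is never marked as identical.
    """
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    marks = []
    # zip(*...) walks over the alignment one column at a time
    for column in zip(*aligned_seqs):
        residues = {base.upper() for base in column}
        if len(residues) == 1 and "-" not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical aligned DNA sequences, purely for illustration
example = ["ACGT-ACGT", "ACGTTACGA", "ACGA-ACGT"]
for seq in example:
    print(seq)
print(identity_line(example))
```

You would print the returned string directly under the last row of the 
alignment, as in the loop above.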

> Also, I'm evaluating several sequences but I'd like to obtain the most
> recent complete genomes possible from various countries. Is there a
> convenient source to use (GenBank?) if I don't know the accession
> numbers?

What sort of genomes? Bacteria? Vertebrates? You could start by 
looking at EMBL, NCBI/GenBank or the Japanese DDBJ (these three 
databases are kept in sync with each other).

Biopython has quite a nice interface for searching and downloading 
sequences from GenBank (again, see the tutorial) so that would be my 
first suggestion.
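
To give a feel for what such a search looks like underneath, here is a 
small sketch that only builds a query URL for NCBI's ESearch service 
using the standard library. It does not contact NCBI and it is not the 
Biopython interface itself. The helper name build_esearch_url, the 
organism and the extra search term are my own assumptions for the sake 
of an example, so adjust them to whatever you are actually studying.

```python
from urllib.parse import urlencode

# Base address of NCBI's ESearch service (part of the E-utilities)
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism, extra_terms=("complete genome[Title]",), retmax=20):
    """Build (but do not fetch) an ESearch URL for nucleotide records.

    organism    -- organism name, searched in the [Organism] field
    extra_terms -- further Entrez terms, combined with AND
    retmax      -- maximum number of record identifiers to return
    """
    terms = ['"%s"[Organism]' % organism]
    terms.extend(extra_terms)
    params = {
        "db": "nuccore",  # the nucleotide database that includes GenBank
        "term": " AND ".join(terms),
        "retmax": retmax,
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query, purely for illustration
print(build_esearch_url("Influenza A virus", retmax=5))
```

The identifiers that come back from a search like this can then be used 
to download the records themselves. Please also read NCBI's usage 
guidelines before running many queries from a script.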

Peter