[Biopython] pairwise sequence alignment programs in biopython

Michiel de Hoon mjldehoon at yahoo.com
Wed Jul 11 01:53:11 UTC 2018


Dear John,



> I’m looking for the best tool to use to do this in biopython
It depends on what you mean by "best". Both pairwise2 and Align.PairwiseAligner implement a dynamic programming algorithm that is guaranteed to find the optimal alignment as defined by the gap penalties and match/mismatch scores. However, dynamic programming may be slow for long sequences. It's up to you to decide whether the runtime is acceptable.
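
To make "optimal as defined by the scores" concrete, here is a minimal pure-Python sketch of the textbook dynamic programming recurrence for a global alignment score. It is an illustration only, not the code used by pairwise2 or Align.PairwiseAligner; the function name, the default scores, and the single linear gap penalty are assumptions made for this example.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score under a linear gap penalty (illustrative)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Full (len_a + 1) x (len_b + 1) matrix; score[i][j] is the best score
    # for aligning the first i letters of seq_a with the first j of seq_b.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # seq_a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # seq_b prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Every cell is filled once, so the running time grows with the product of the two sequence lengths, which is why long sequences can be slow.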



> So far I have performed tests with pairwise2 and Align.PairwiseAligner. 

> From my tests it seems that pairwise2 has a limit of ~2000 residues – i.e. if I give it a sequence of 2500 residues to compare against itself it crashes. PairwiseAligner seems to be able to handle much longer sequences without issue. 

I am not sure where the difference is coming from. Align.PairwiseAligner and pairwise2 are based on the same algorithm, though the implementation details will differ.
In both cases, the memory requirements scale as the sequence length squared, and you should not run into memory issues for a sequence of 2500 residues.
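
As a rough back-of-the-envelope check of the quadratic scaling, the sketch below counts the cells of a full dynamic programming matrix. The 8 bytes per cell is an assumption chosen for illustration; the actual per-cell storage in either implementation may differ.

```python
def dp_matrix_cells(len_a, len_b):
    """Number of cells in a full (len_a + 1) x (len_b + 1) matrix."""
    return (len_a + 1) * (len_b + 1)


def estimated_megabytes(len_a, len_b, bytes_per_cell=8):
    """Rough memory estimate; bytes_per_cell is an assumed value."""
    return dp_matrix_cells(len_a, len_b) * bytes_per_cell / 1e6


# A 2500-residue sequence aligned against itself.
print(dp_matrix_cells(2500, 2500))
print(estimated_megabytes(2500, 2500))
```

Under that assumption a 2500 x 2500 comparison needs a few tens of megabytes, and doubling both lengths roughly quadruples the requirement.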

 
> They may include non-standard residues which will be denoted as X.

In Bio.Align.PairwiseAligner, these will get a match and mismatch score of 0.
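
The effect of that rule can be sketched with a small scoring function. This is a hypothetical illustration of the behaviour described above, not Biopython's own code; the names residue_score and ungapped_score and the default scores are assumptions.

```python
def residue_score(a, b, match=1.0, mismatch=-1.0, wildcard="X"):
    """Score one aligned pair; any pair involving the wildcard scores 0."""
    if a == wildcard or b == wildcard:
        return 0.0
    return match if a == b else mismatch


def ungapped_score(seq_a, seq_b):
    """Sum of pair scores for two equal-length sequences, no gaps."""
    return sum(residue_score(a, b) for a, b in zip(seq_a, seq_b))


print(ungapped_score("ACXT", "ACGT"))
```

An X therefore neither rewards nor penalizes the position it is aligned to.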

>The sequences will be of varying length from around 20 residues up to several thousand residues – put simply the range of sequences in the PDB.

That sounds doable.

> I need to be able to set gap penalties – which is possible in both of these programs. 


Both can set gap penalties, but Align.PairwiseAligner is more flexible in terms of the gap penalties it can accept (at the same running speed).
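
For example, a minimal sketch of setting affine gap penalties on Align.PairwiseAligner, assuming a Biopython release that includes it (1.72 or later). The sequences and score values are arbitrary choices for illustration.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"

# Substitution scores (arbitrary illustrative values).
aligner.match_score = 2
aligner.mismatch_score = -1

# Affine gap penalty: the first position of a gap scores open_gap_score,
# each further position of the same gap scores extend_gap_score.
aligner.open_gap_score = -2
aligner.extend_gap_score = -1

score = aligner.score("ACGT", "AGT")
print(score)
```

Here aligner.score returns only the optimal score; aligner.align returns the alignments themselves.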


> Are they both maintained / stable? 


Both are maintained, but since their functionality overlaps, I think at some point we will choose one over the other.


> Are they comparable in their results?

If the gap penalties and match/mismatch scores are the same, then the results should be identical.

The main differences are running speed (Align.PairwiseAligner is slightly faster), the gap penalties each accepts directly (both can also use gap penalties calculated on the fly by a Python function, but that is much slower), and the user interface.

> Is the limitation in sequence length in pairwise2 a known issue? A quick google search suggests most people use pairwise2, which is strange given its sequence length limitation. 
The reason for this may be that pairwise2 has been around for many years (more than 10, I think), while Align.PairwiseAligner was introduced only in Biopython release 1.72, which is the most recent version.

Full disclosure: I wrote Align.PairwiseAligner.

Best,
-Michiel


    On Wednesday, July 11, 2018, 5:06:18 AM GMT+9, John Berrisford <jmb at ebi.ac.uk> wrote:  
 
Hi 

  

I’m looking at performing pairwise alignments of polymer sequences in biopython. 

These will be protein or nucleotide sequences. They may include non-standard residues which will be denoted as X. 

The sequences will be of varying length from around 20 residues up to several thousand residues – put simply the range of sequences in the PDB. 

  

I’m looking for the best tool to use to do this in biopython

  

So far I have performed tests with pairwise2 and Align.PairwiseAligner. 

From my tests it seems that pairwise2 has a limit of ~2000 residues – i.e. if I give it a sequence of 2500 residues to compare against itself it crashes. PairwiseAligner seems to be able to handle much longer sequences without issue. 

  

I need to be able to set gap penalties – which is possible in both of these programs. 

  

So my questions are:

Are these the only options in biopython? – I would prefer a python implementation rather than something that requires external compilation i.e. Emboss Needle

Are these the best options?

Are they both maintained / stable? 

Are they comparable in their results?

Is the limitation in sequence length in pairwise2 a known issue? A quick google search suggests most people use pairwise2, which is strange given its sequence length limitation. 

  

Thank you

  

John 

  

--

John Berrisford

PDBe

European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory

Wellcome Genome Campus

Hinxton

Cambridge CB10 1SD UK

Tel: +44 1223 492529

  

https://www.pdbe.org

https://www.facebook.com/proteindatabank

https://twitter.com/PDBeurope

  
_______________________________________________
Biopython mailing list  -  Biopython at mailman.open-bio.org
http://mailman.open-bio.org/mailman/listinfo/biopython  