[Biopython] pairwise sequence alignment programs in biopython

John Berrisford jmb at ebi.ac.uk
Wed Jul 11 08:47:26 UTC 2018


Dear Marcus and Peter

I'm writing a program that will run on many different machines whose specs (OS, RAM, etc.) I will have no control over.
My test machine is a 64-bit Windows 10 laptop with 8 GB of RAM.

My tests are a work in progress on GitHub:
https://github.com/berrisfordjohn/adding_stats_to_mmcif/blob/master/tests/test_seq_align.py

All I'm doing is taking a long sequence and aligning prefixes of varying length against themselves, i.e. take a 5500-residue sequence and align its first 2000 residues against the same first 2000 residues.
In my tests on my machine 2000 residues is OK with pairwise2, but 2500 residues fails. As this appears to be machine-specific, your results may vary.

However, I am pleased to report that PairwiseAligner is working with large sequences, and I'm glad to hear that its alignment results are similar to those of pairwise2. The next check is to ensure that the alignments are what I expect.
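
In case it helps, here is a rough sketch of the PairwiseAligner side of the test described above. The sequence, the helper name prefix_self_score and the scoring values are placeholders made up for illustration, and the attribute names are the ones I understand recent Biopython releases to use, so treat it as a sketch rather than the real test code.

```python
# Sketch only: align a prefix of a sequence against itself with PairwiseAligner.
try:
    from Bio import Align  # Bio.Align.PairwiseAligner needs Biopython >= 1.72
except ImportError:  # the target machines may not have Biopython installed
    Align = None


def prefix_self_score(sequence, length):
    """Return the global alignment score of sequence[:length] against itself."""
    if Align is None:
        raise RuntimeError("Biopython >= 1.72 is required")
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    # Illustrative scores only (assumed values, not tuned for real data).
    aligner.match_score = 1
    aligner.mismatch_score = -1
    aligner.open_gap_score = -10
    aligner.extend_gap_score = -0.5
    prefix = sequence[:length]
    return aligner.score(prefix, prefix)


if Align is not None:
    # Made-up 5500-residue sequence; X stands for a non-standard residue.
    long_seq = "MKTAYIAKQRX" * 500
    for n in (2000, 2500, 5500):
        print(n, prefix_self_score(long_seq, n))
```

With these scores, an identical pair should score one point per residue, which gives a quick sanity check before looking at the alignments themselves.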

Thanks

John

-----Original Message-----
From: Peter Cock <p.j.a.cock at googlemail.com> 
Sent: 11 July 2018 08:52
To: John Berrisford <jmb at ebi.ac.uk>
Cc: Biopython Mailing List <biopython at mailman.open-bio.org>
Subject: Re: [Biopython] pairwise sequence alignment programs in biopython

To clarify on sequence length (I had forgotten the details), see:

https://github.com/biopython/biopython/pull/1655#issuecomment-390180240

If you just want the alignment lengths, the new Align.PairwiseAligner wins; if you want the alignments themselves, then pairwise2 wins.

On the other hand, with random sequences of 5000bp, Michiel reported his new Align.PairwiseAligner was faster.

How much memory (RAM) do you have, and are you using a 32-bit operating system? It is likely memory limits that are stopping you from aligning sequences longer than about 2000 residues.
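
As a back-of-envelope illustration of why memory could be the limit, the sketch below counts the cells in a full dynamic-programming matrix. It assumes the aligner keeps the whole matrix in memory, and the 8 bytes per cell is an assumed figure for illustration only; Python objects typically cost more, and the function names are made up here.

```python
def dp_cells(len_a, len_b):
    """Cells in a full dynamic-programming matrix for two sequences."""
    return (len_a + 1) * (len_b + 1)


def dp_megabytes(len_a, len_b, bytes_per_cell=8):
    """Rough matrix size in MB; bytes_per_cell=8 is an assumed figure."""
    return dp_cells(len_a, len_b) * bytes_per_cell / 1e6


# The cell count grows quadratically with sequence length.
for n in (2000, 2500, 5500):
    print(n, dp_cells(n, n), round(dp_megabytes(n, n), 1))
```

Going from 2000 to 2500 residues increases the cell count by a factor of about 1.56, so a modest increase in length can tip a full-matrix approach over a memory limit.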

Peter

On Wed, Jul 11, 2018 at 12:12 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi John,
>
> The Align.PairwiseAligner code is new in Biopython 1.72, and better 
> support for longer sequences was one of the improvements.
>
> You would probably find it useful to read over the pull request:
> https://github.com/biopython/biopython/pull/1655
>
>
> Peter
>
> On Tue, Jul 10, 2018 at 7:51 PM, John Berrisford <jmb at ebi.ac.uk> wrote:
>> Hi
>>
>>
>>
>> I’m looking at performing pairwise alignments of polymer sequences in 
>> biopython.
>>
>> These will be protein or nucleotide sequences. They may include 
>> non-standard residues which will be denoted as X.
>>
>> The sequences will be of varying length from around 20 residues up to 
>> several thousand residues – put simply the range of sequences in the PDB.
>>
>>
>>
>> I’m looking for the best tool to use to do this in biopython
>>
>>
>>
>> So far I have performed tests with pairwise2 and Align.PairwiseAligner.
>>
>> From my tests it seems that pairwise2 has a limit of ~2000 residues – i.e.
>> if I give it a sequence of 2500 residues to compare against itself it 
>> crashes. PairwiseAligner seems to be able to handle much longer 
>> sequences without issue.
>>
>>
>>
>> I need to be able to set gap penalties – which is possible in both of 
>> these programs.
>>
>>
>>
>> So my questions are:
>>
>> Are these the only options in Biopython? I would prefer a Python 
>> implementation rather than something that requires external compilation, e.g.
>> EMBOSS Needle.
>>
>> Are these the best options?
>>
>> Are they both maintained / stable?
>>
>> Are they comparable in their results?
>>
>> Is the limitation in sequence length in pairwise2 a known issue? A 
>> quick google search suggests most people use pairwise2, which is 
>> strange given its sequence length limitation.
>>
>>
>>
>> Thank you
>>
>>
>>
>> John
>>
>>
>>
>> --
>>
>> John Berrisford
>>
>> PDBe
>>
>> European Bioinformatics Institute (EMBL-EBI)
>>
>> European Molecular Biology Laboratory
>>
>> Wellcome Genome Campus
>>
>> Hinxton
>>
>> Cambridge CB10 1SD UK
>>
>> Tel: +44 1223 492529
>>
>>
>>
>> https://www.pdbe.org
>>
>> https://www.facebook.com/proteindatabank
>>
>> https://twitter.com/PDBeurope
>>
>>
>>
>>
>> _______________________________________________
>> Biopython mailing list  -  Biopython at mailman.open-bio.org 
>> http://mailman.open-bio.org/mailman/listinfo/biopython



