[Biopython] Divergent sequence data set

Peter biopython at maubp.freeserve.co.uk
Wed Nov 18 10:24:48 UTC 2009


On Wed, Nov 18, 2009 at 8:19 AM, Animesh Agrawal
<animesh.agrawal at anu.edu.au> wrote:
>
> Hi,
>
> I have been trying to develop a divergent sequence data set for a
> phylogenetic analysis. Do we have something in Biopython where, for a given
> set of sequences, we can choose an identity threshold to reduce redundancy
> in the dataset?
>
> Cheers,
>
> Animesh

Hi Animesh,

There are probably 100s of ways to do this. I think you should consult
the literature as to the best approach (in terms of the algorithm), or
talk to a phylogeneticist. Once you have an algorithm in mind, it can
probably be done with Python.

For example, you could do pairwise BLAST alignments (e.g. using the
NCBI standalone tools) or maybe pairwise Needleman-Wunsch global
alignment (e.g. using the EMBOSS needle tool) and construct a distance
matrix in terms of percentage identity.
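
As a rough illustration of the distance matrix idea, here is a minimal
pure-Python sketch. It is a stand-in for the external tools, not a
replacement: the function names (global_align, percent_identity,
identity_matrix) are hypothetical, the match/mismatch/gap scores are
arbitrary assumptions, and identity is taken as matching columns divided
by alignment length, which is only one of several conventions in use.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended
    settings. Returns the two gapped strings.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns as a percentage of the alignment length."""
    aligned_a, aligned_b = global_align(a, b)
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


def identity_matrix(seqs):
    """Symmetric matrix of pairwise percentage identities."""
    n = len(seqs)
    matrix = [[100.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = percent_identity(seqs[i], seqs[j])
    return matrix


# Toy sequences, made up for illustration only.
for row in identity_matrix(["ACGT", "ACGA", "ACGT"]):
    print(row)
```

This is quadratic in both sequence length and number of sequences, so for
a real dataset the pairwise alignments would be better delegated to BLAST
or needle, with Python only collecting the identities into the matrix.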

You could build a rough phylogenetic tree (perhaps using NJ if your
starting dataset is very large), and use that to sample the nodes to
get a fairly uniform distribution w.r.t. the phylogenetic space.
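
Sampling from a tree takes more machinery than fits in an email, so the
sketch below swaps in something much simpler that works directly on a
percentage identity matrix: a greedy filter that keeps a sequence only if
it falls below a chosen identity threshold against everything kept so far.
This is not the tree-based sampling described above, the function name
pick_representatives is hypothetical, and the names and identity values
are made-up toy data.

```python
def pick_representatives(names, identity, threshold=90.0):
    """Greedy redundancy filter over a percentage identity matrix.

    A sequence is kept only if its identity to every sequence already
    kept is below the threshold. The result depends on input order.
    """
    kept = []
    for i in range(len(names)):
        if all(identity[i][j] < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]


# Toy data for illustration only: two pairs of near-identical sequences.
names = ["seqA", "seqB", "seqC", "seqD"]
identity = [
    [100.0, 95.0, 40.0, 42.0],
    [95.0, 100.0, 41.0, 43.0],
    [40.0, 41.0, 100.0, 92.0],
    [42.0, 43.0, 92.0, 100.0],
]
print(pick_representatives(names, identity, threshold=90.0))
# prints ['seqA', 'seqC']
```

Because the filter is order dependent, sorting the input first (for
example, longest sequence first) changes which member of each redundant
group survives.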

These are just rough ideas - I am not a phylogenetics specialist.

I have a vague recollection that one of the sequence alignment
tools includes an option to do something like this for you... but I
can't remember the details.

Peter



