[BioPython] Eliminating redundancy : how?
Iddo Friedberg
idoerg@cc.huji.ac.il
Thu, 12 Jul 2001 18:14:58 +0300 (GMT+0300)

Hi Quoc-Dien,
On Thu, 12 Jul 2001, Quoc-Dien Trinh wrote:
: I have a medium selection of protein sequences (about 500) and I wish to
: eliminate redundancy. The only method I have thought of so far is to blast
: each sequence vs a Blast db created with this selection, and proceed to
: eliminate everything of threshold < 0.02 (using BioPython, of course).
Problems I see with that are:
1) You are doing 500*500 comparisons.
2) Your e-values are based on a very small database, which means that you
have to make your threshold more severe, and may throw away false
negatives.
3) You have to parse the whole file.
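
To illustrate point 2, here is a rough sketch using the standard relation
between bit score and e-value, E = m * n * 2^(-S'), where m is the query
length and n is the total database length in residues. The function name and
all the sizes below are made up for illustration; they are not measurements
of any real database.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value for a given bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are in residues. This is the textbook
    relation, not the full statistics BLAST applies (no edge-effect
    correction, for example).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers, for illustration only.
QUERY_LEN = 300             # residues in one query protein
SMALL_DB_LEN = 500 * 300    # ~500 proteins of ~300 residues each
LARGE_DB_LEN = 200_000_000  # stand-in for a large database such as nr

if __name__ == "__main__":
    for label, db_len in (("small db", SMALL_DB_LEN), ("large db", LARGE_DB_LEN)):
        e = expected_hits(40.0, QUERY_LEN, db_len)
        print(f"{label}: bit score 40 -> E = {e:.3g}")
```

The same alignment score gets an e-value that scales linearly with the
database length, so a cutoff such as 0.02 means something much looser against
a 500-sequence database than against a large one.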
Alternatively, you can use bl2seq, the pairwise blast, to make pairwise
alignments, then look at each alignment's e-value. That way you halve
the number of comparisons, to (500*499)/2. Bioperl has a bl2seq parser,
or you can write a throwaway script for your own purposes. An awk
one-liner will do if all you need is to look at the e-value. The e-value
you get is calculated based on nr's size, so there is less chance of
false negatives.
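
A minimal Python sketch of the throwaway-script route, under some
assumptions: that the report text contains a line of the form
"Expect = <value>" (as classic BLAST text output does), that a value may be
printed without a leading digit (e.g. "e-105"), and that you have already
collected one e-value per unordered pair. All function names and the greedy
keep-the-first strategy are illustrative choices, not something bl2seq or
BioPython provides.

```python
import re
from itertools import combinations

# Assumed report format: "... Expect = 2e-30" or "Expect(2) = 0.002".
EXPECT_RE = re.compile(r"Expect(?:\(\d+\))?\s*=\s*([0-9.eE+-]+)")


def first_evalue(report_text):
    """Return the first e-value found in a text report, or None if absent."""
    match = EXPECT_RE.search(report_text)
    if match is None:
        return None
    raw = match.group(1).rstrip(".")
    if raw[0] in "eE":  # handle values printed as "e-105"
        raw = "1" + raw
    return float(raw)


def all_pairs(ids):
    """Each unordered pair once: n*(n-1)/2 comparisons."""
    return list(combinations(ids, 2))


def drop_redundant(ids, evalues, threshold=0.02):
    """Greedy filter: keep a sequence unless it matches an already kept
    one with an e-value below the threshold.

    evalues maps frozenset({id_a, id_b}) -> e-value; missing pairs are
    treated as "no significant alignment".
    """
    kept = []
    for seq_id in ids:
        redundant = False
        for kept_id in kept:
            e = evalues.get(frozenset((seq_id, kept_id)))
            if e is not None and e < threshold:
                redundant = True
                break
        if not redundant:
            kept.append(seq_id)
    return kept
```

You would loop over all_pairs(ids), run bl2seq on each pair, feed the output
to first_evalue(), and store the result under frozenset((a, b)). Note that
the greedy filter depends on input order: whichever member of a redundant
group comes first is the one that survives.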
Hope this helped,
Iddo
PS: bl2seq can be downloaded from NCBI's ftp site:
ftp://ncbi.nlm.nih.gov/blast/executables/
I.
--
Iddo Friedberg                                  | Tel: +972-2-6758647
Dept. of Molecular Genetics and Biotechnology   | Fax: +972-2-6757308
The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il
POB 12272, Jerusalem 91120                      |
Israel                                          |
http://bioinfo.md.huji.ac.il/marg/people-home/iddo/