[Biopython] Is this feasible?

Brad Chapman chapmanb at 50mail.com
Tue Jun 8 14:17:54 UTC 2010


Jordan;

> Hello I'm relatively new to both programming and bioinformatics. 

Welcome. Happy to have you on board.

> I have a list of sequences that have evolved away from a given
> germline sequence. I was going to use biopython to iteratively map the
> closest mutant to the germline and pull it out of the list. I would
> then align these two sequences and give a score. I would then take the
> remaining sequences and find the one that is closest to the one that
> was just taken out of the list and do the same thing as the list is
> empty.

The BLASTing and finding the closest match makes good sense. Why
do you need to remove the hit sequences from the reference 
database? Could sequences in your input list not match to the same
gene or different sections of the same gene? This also biases the
search depending on your input order.

If you do need to BLAST without replacement, the approach I would
take is to keep a list of hit IDs you've already used, and avoid
using these alignments again when parsing subsequence searches.

Some specific thoughts on your pseudocode:


>     Database = [ seq_record for seq_record in seqIO.parse('Input.fasta', "fasta")]

Unless your database is small you probably don't want to do this as is
loads the entire file into memory. Instead you'd want to index this file
with formatdb:

formatdb -p T -i Input.fasta

The subprocess module is the way to go:

http://docs.python.org/library/subprocess.html

subprocess.call(["formatdb", "-p", "T", "-i", "Input.fasta"])

>     germline = seqIO.read('germline.fasta')
>     while Database:
>         cline = NcbiblastpCommandline(query=germline, db=Database out='tmp.xml')
>         blast_records  = NCBIXML.parse(run_blast(cline))
>         print blast_records.alignment[0]
>         print germline

Here what you could do for your removal search is to keep a list of
hits you've seen and pick the first one that is not in that list:

hits_seen = []
for query in to_search:
     blast_rec = _do_your_blast_and_parse_it()
     new_hit = None
     for align in blast_rec.alignments:
         if align.title not in hits_seen:
             new_hit = align
             break
     if new_hit:
         hits_seen.append(new_hit.title)
         _output_what_you_want(new_hit)
              

> I guess my first question is does this seem logical. Is blast the
> best algorithm to use for this scenario? 

Hopefully this helps move you forward.

> The other problem is
> creating my own database. I read the documentation, and it said
> you could create your own database to run local blast (which I
> have), I just have no idea how do to that. 

That's the formatdb command listed above. There's plenty to read on
the web about commandline options for DNA/protein databases.

> The second thing is in
> blast_records.alignment[0], will this always give me the best scoring
> sequence?

Yes, they are ordered by score just like in the raw BLAST file.

Hope this helps,
Brad



More information about the Biopython mailing list