[Biopython-dev] [Bug 2494] _retrieve_taxon in BioSQL.py needs urgent optimization

Wed May 7 15:36:43 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2494

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2008-05-07 11:36 EST -------
Created an attachment (id=917)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=917&action=view)
Patch to BioSQL/BioSeq.py

Hi Eric.

I've tried your script with MySQL 5.0 under Linux, and I see similar
timings, for example:

getTaxonSQLsimplex took 458.646 ms
getTaxonSQL took 8152.112 ms
getTaxonSQLall took 8565.304 ms
getTaxonLoop took 18.612 ms

However, your loop function doesn't return exactly the same list as the
original code.  In particular, you do not exclude taxonomy lineage entries with
a rank of "no rank".  Also, I didn't like the hard-coded assumption that
taxon_id 1 is the top node.  What do you think of this version?

def getTaxonLoopPeter(adaptor, taxon_id):
    # Climb up the hierarchy: bottom-up approach based on the child/parent
    # link with parent_taxon_id.
    taxonomy = []
    while taxon_id :
        name, rank, parent_taxon_id = adaptor.execute_one(
            "SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id"
            " FROM taxon, taxon_name"
            " WHERE taxon.taxon_id=taxon_name.taxon_id"
            " AND taxon_name.name_class='scientific name'"
            " AND taxon.taxon_id = %s", (taxon_id,))
        if taxon_id == parent_taxon_id :
            # If the taxon table has been populated by the BioSQL script
            # load_ncbi_taxonomy.pl this is how top parent nodes are stored.
            # Personally, I would have used a NULL parent_taxon_id here.
            break
        if rank != "no rank" :
            #For consistency with older versions of Biopython, we are only
            #interested in taxonomy entries with a stated rank.
            #Add this to the start of the lineage list.
            taxonomy.insert(0, name)
        taxon_id = parent_taxon_id
    return taxonomy

I'm attaching a patch to BioSQL/BioSeq.py that uses this code in place of the
current version, which depends on the left/right values.  While this does seem
to be much faster in your test script, I'm not sure how much difference it will
make in normal usage.
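
To try the parent-link loop without a MySQL server, here is a minimal sketch
using Python's sqlite3 module.  The two tables are cut down to the columns the
query touches, the taxon IDs and rows are made up for illustration, and
get_lineage and make_demo_db are hypothetical names, not part of Biopython or
BioSQL.  Note that sqlite3 uses ? rather than %s as its placeholder.

```python
import sqlite3


def get_lineage(conn, taxon_id):
    """Walk parent_taxon_id links upward; return ranked names, top node first."""
    lineage = []
    while taxon_id:
        row = conn.execute(
            "SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id"
            " FROM taxon, taxon_name"
            " WHERE taxon.taxon_id = taxon_name.taxon_id"
            " AND taxon_name.name_class = 'scientific name'"
            " AND taxon.taxon_id = ?", (taxon_id,)).fetchone()
        if row is None:
            # Assumption: treat a missing row as the end of the chain
            # instead of raising an error.
            break
        name, rank, parent_taxon_id = row
        if taxon_id == parent_taxon_id:
            # Self-parented top node, as stored by load_ncbi_taxonomy.pl.
            break
        if rank != "no rank":
            # Only keep entries with a stated rank, prepended so the
            # list runs from the top of the tree down to the leaf.
            lineage.insert(0, name)
        taxon_id = parent_taxon_id
    return lineage


def make_demo_db():
    """Build a tiny in-memory taxonomy (toy IDs, not real NCBI taxon IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxon (
            taxon_id INTEGER PRIMARY KEY,
            node_rank TEXT,
            parent_taxon_id INTEGER);
        CREATE TABLE taxon_name (
            taxon_id INTEGER,
            name TEXT,
            name_class TEXT);
    """)
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", [
        (1, "no rank", 1),        # top node is its own parent
        (2, "superkingdom", 1),
        (3, "no rank", 2),        # unranked clade, should be skipped
        (4, "genus", 3),
        (5, "species", 4),
    ])
    conn.executemany("INSERT INTO taxon_name VALUES (?, ?, ?)", [
        (1, "root", "scientific name"),
        (2, "Eukaryota", "scientific name"),
        (3, "Some clade", "scientific name"),
        (4, "Homo", "scientific name"),
        (5, "Homo sapiens", "scientific name"),
        (5, "human", "common name"),  # ignored by the name_class filter
    ])
    return conn


if __name__ == "__main__":
    print(get_lineage(make_demo_db(), 5))
```

Each step up the tree is one small indexed lookup, so the cost grows with the
depth of the lineage rather than with the size of the taxon table.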

Peter
