[Biopython-dev] [Wg-phyloinformatics] BioGeography update/BioPython tree module discussion

Nick Matzke matzke at berkeley.edu
Mon Aug 10 20:25:10 UTC 2009


PS: Evidence of interest in this GBIF functionality already, see fwd 
below...

PPS: Commit with updated names and deleted old files here:
http://github.com/nmatzke/biopython/commits/Geography


-------- Original Message --------
Subject: Re: biogeopython
Date: Fri, 07 Aug 2009 16:34:26 -0700
From: Nick Matzke <matzke at berkeley.edu>
Reply-To: matzke at berkeley.edu
Organization: Dept. Integ. Biology, UC Berkeley
To: James Pringle <jpringle at unh.edu>
References: 
<d8d18d8e0908050837r79d6282cg8e72b2c1f1928abd at mail.gmail.com>	 
<4A7C6DEE.1000305 at berkeley.edu> 
<d8d18d8e0908071125t7ed37900ld21e83a82df57b57 at mail.gmail.com>

Coolness, let me know how it works for you, feedback appreciated at this
stage.  Cheers!
Nick


James Pringle wrote:
 > Thanks!
 > Jamie
 >
 > On Fri, Aug 7, 2009 at 2:09 PM, Nick Matzke <matzke at berkeley.edu
 > <mailto:matzke at berkeley.edu>> wrote:
 >
 >     Hi Jamie!
 >
 >     It's still under development, eventually it will be a biopython
 >     module, but what I've got should do exactly what you need.
 >
 >     Just take the files from the most recent commit here:
 >     http://github.com/nmatzke/biopython/commits/Geography
 >
 >     ...and run test_gbif_xml.py to get the idea, it will search on a
 >     taxon name, count/download all hits, parse the xml to a set of
 >     record objects,  output each record to screen or tab-delimited file,
 >     etc.
 >
 >     Cheers!
 >     Nick
 >
 >
 >
 >
 >
 >     James Pringle wrote:
 >
 >         Dear Mr. Matzke--
 >
 >            I am an oceanographer at the University of New Hampshire, and
 >         with my colleagues John Wares and Jeb Byers am looking at the
 >         interaction of ocean circulation and species ranges.    As part
 >         of that effort, I am using GBIF data, and was looking at your
 >         Summer-of-Code project.    I want to start from a species name
 >         and get lat/long of occurrence data.   Is your toolbox in usable
 >         shape (I am an ok pythonista)?  What is the best way to download
 >         a tested version of it (I can figure out how to get code from
 >         CVS/GIT, etc, so I am just looking for a pointer to a stable-ish
 >         tree)?
 >
 >         Cheers,
 >         & Thanks
 >         Jamie Pringle
 >
 >
 >
 >



Nick Matzke wrote:
> Hi all...updates...
> 
> Summary: Major focus is getting the GBIF access/search/parse module into 
> "done"/submittable shape.  This primarily requires getting the 
> documentation and testing up to biopython specs.  I have a fair bit of 
> documentation and testing, need advice (see below) for specifics on what 
> it should look like.
> 
> 
> Brad Chapman wrote:
>> Hi Nick;
>> Thanks for the update -- great to see things moving along.
>>
>>> - removed any reliance on lagrange tree module, refactored all 
>>> phylogeny code to use the revised Bio.Nexus.Tree module
>>
>> Awesome -- glad this worked for you. Are the lagrange_* files in
>> Bio.Geography still necessary? If not, we should remove them from
>> the repository to clean things up.
> 
> 
> Ah, they had been deleted locally but it took an extra command to delete 
> on git.  Done.
> 
>>
>> More generally, it would be really helpful if we could do a bit of
>> housekeeping on the repository. The Geography namespace has a lot of
>> things in it which belong in different parts of the tree:
>>
>> - The test code should move to the 'Tests' directory as a set of
>>   test_Geography* files that we can use for unit testing the code.
> 
> OK, I will do this.  Should I try and figure out the unittest stuff?  I 
> could use a simple example of what this is supposed to look like.
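For what it's worth, a minimal unittest-style test file might look like the
sketch below (the tab-delimited species/lat/long line format is my
assumption here; swap in calls to the real Bio.Geography API):

```python
import unittest

class GeographyBasicTest(unittest.TestCase):
    """Minimal example of a Tests/test_Geography*-style unit test.

    The line format parsed here is assumed for illustration; adapt the
    assertions to the actual GbifObservationRecord behavior.
    """

    def test_parse_latlong_line(self):
        # A fake tab-delimited line: species, latitude, longitude
        line = "Genus_species\t37.87\t-122.26"
        species, lat, lon = line.split("\t")
        self.assertEqual(species, "Genus_species")
        self.assertAlmostEqual(float(lat), 37.87)
        self.assertAlmostEqual(float(lon), -122.26)
```

Running this through `python -m unittest` (or the Biopython test runner in
Tests/) would pick up any test_* method automatically.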
> 
> 
>> - Similarly there are a lot of data files in there which are
>>   appear to be test related; these could move to Tests/Geography
> 
> Will do.
> 
>> - What is happening with the Nodes_v2 and Treesv2 files? They look
>>   like duplicates of the Nexus Nodes and Trees with some changes.
>>   Could we roll those changes into the main Nexus code to avoid
>>   duplication?
> 
> Yeah, these were just copies with your bug fix, and with a few mods I 
> used to track crashes.  Presumably I don't need these after a fresh 
> download of biopython.
> 
> 
> 
>>> - Code dealing with GBIF xml output completely refactored into the 
>>> following classes:
>>>
>>> * ObsRecs (observation records & search results/summary)
>>> * ObsRec (an individual observation record)
>>> * XmlString (functions for cleaning xml returned by Gbif)
>>> * GbifXml (extension of capabilities for ElementTree xml trees, 
>>> parsed from GBIF xml returns)
>>
>> I'm agreed with Hilmar -- the user classes would probably benefit from 
>> expanded
>> naming. There is an art to naming to get them somewhere between the 
>> hideous RidiculouslyLongNamesWithEverythingSpecified names and short 
>> truncated names.
>> Specifically, you've got a lot of filler in the names -- dbfUtils,
>> geogUtils, shpUtils. The Utils probably doesn't tell the user much
>> and makes all of the names sort of blend together, just as the 
>> Rec/Recs pluralization hides a quite large difference in what the 
>> classes hold.
> 
> Will work on this; these should be made part of the 
> GbifObservationRecord() object or be accessed by it.  Basically they only 
> exist to classify lat/long points into user-specified areas.
> 
>> Something like Observation and ObservationSearchResult would make it
>> clear immediately what they do and the information they hold.
> 
> 
> Agreed, here is a new scheme for the names (changes already made):
> 
> =============
> class GbifSearchResults():   
> 
> GbifSearchResults is a class for holding a series of 
> GbifObservationRecord records, and processing them e.g. into classified 
> areas.
> 
> Also can hold a GbifDarwincoreXmlString record (the raw output returned 
> from a GBIF search) and a GbifXmlTree (a class for holding/processing 
> the ElementTree object returned by parsing the GbifDarwincoreXmlString).
> 
> 
> 
> class GbifObservationRecord():
> 
> GbifObservationRecord is a class for holding an individual observation 
> at an individual lat/long point.
> 
> 
> 
> class GbifDarwincoreXmlString(str):
> 
> GbifDarwincoreXmlString is a class for holding the xmlstring returned by 
> a GBIF search, & processing it to plain text, then an xmltree (an 
> ElementTree).
>     
> GbifDarwincoreXmlString inherits string methods from str (class String).
> 
> 
> 
> class GbifXmlTree():
> GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
> =============
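To make the division of labor concrete, the four classes might hang
together roughly like this (a stub sketch; the attribute names are my
guesses, not the committed code):

```python
class GbifObservationRecord(object):
    """One occurrence at one lat/long point."""
    def __init__(self, species=None, lat=None, long=None):
        self.species = species
        self.lat = lat
        self.long = long

class GbifDarwincoreXmlString(str):
    """The raw XML string returned by a GBIF search."""

class GbifXmlTree(object):
    """Holds/processes the ElementTree parsed from the raw XML string."""
    def __init__(self, xmltree=None):
        self.xmltree = xmltree

class GbifSearchResults(object):
    """Collects GbifObservationRecords, plus the raw string and tree."""
    def __init__(self):
        self.obs_recs_list = []
        self.gbif_xmlstring = None
        self.gbif_recs_xmltree = None
```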
> 
> ...description of methods below...
> 
> 
>>
>>> This week:
>>
>> What are your thoughts on documentation? As a naive user of these
>> tools without much experience with the formats, I could offer better
>> feedback if I had an idea of the public APIs and how they are
>> expected to be used. Moreover, cookbook and API documentation is 
>> something we will definitely need to integrate into Biopython. How 
>> does this fit in your timeline for the remaining weeks?
> 
> The API is really just the interface with GBIF.  I think developing a 
> cookbook entry is pretty easy, I assume you want something like one of 
> the entries in the official biopython cookbook?
> 
> Re: API documentation...are you just talking about the function 
> descriptions that are typically in """ """ strings beneath the function 
> definitions?  I've got that done.  Again, if there is more, an example 
> of what it should look like would be useful.
> 
> Documentation for the GBIF stuff below.
> 
> ============
> gbif_xml.py
> Functions for accessing GBIF, downloading records, processing them into 
> a class, and extracting information from the xmltree in that class.
> 
> 
> class GbifObservationRecordError(Exception): pass
> class GbifObservationRecord():
> GbifObservationRecord is a class for holding an individual observation 
> at an individual lat/long point.
> 
> 
> __init__(self):
> 
> This is an instantiation method for setting up new objects of this class.
> 
> 
> 
> latlong_to_obj(self, line):
> 
> Read in a string, read species/lat/long to GbifObservationRecord object
> This can be slow, e.g. 10 seconds for even just ~1000 records.
> 
> 
> parse_occurrence_element(self, element):
> 
> Parse a TaxonOccurrence element, store the results in this 
> GbifObservationRecord.
> 
> 
> fill_occ_attribute(self, element, el_tag, format='str'):
> 
> Return the text (matching_el.text) of the first subelement matching el_tag.
> 
> 
> 
> find_1st_matching_subelement(self, element, el_tag, return_element):
> 
> Burrow down into the XML tree, retrieve the first element with the 
> matching tag.
> 
> 
> record_to_string(self):
> 
> Return the attributes of a record as a string.
> 
> 
> 
> 
> 
> 
> 
> class GbifDarwincoreXmlStringError(Exception): pass
> 
> class GbifDarwincoreXmlString(str):
> GbifDarwincoreXmlString is a class for holding the xmlstring returned by 
> a GBIF search, & processing it to plain text, then an xmltree (an 
> ElementTree).
> 
> GbifDarwincoreXmlString inherits string methods from str (class String).
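One wrinkle worth noting: str is immutable, so a subclass that takes a
constructor argument has to do its setup in __new__, not __init__. A
sketch (the rawstring=None signature mirrors the description above):

```python
class GbifDarwincoreXmlString(str):
    """A str subclass: every ordinary string method comes along for free."""

    def __new__(cls, rawstring=None):
        # str is immutable, so the value must be fixed here in __new__;
        # by the time __init__ runs, the string contents are already set.
        return super(GbifDarwincoreXmlString, cls).__new__(cls, rawstring or "")
```

An instance then behaves like any other string (slicing, .replace(), etc.)
while still being able to carry the GBIF-specific processing methods.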
> 
> 
> 
> __init__(self, rawstring=None):
> 
> This is an instantiation method for setting up new objects of this class.
> 
> 
> 
> fix_ASCII_lines(self, endline=''):
> 
> Convert each line in an input string into pure ASCII
> (This avoids crashes when printing to screen, etc.)
> 
> 
> _fix_ASCII_line(self, line):
> 
> Convert a single string line into pure ASCII
> (This avoids crashes when printing to screen, etc.)
> 
> 
> _unescape(self, text):
> 
> Removes HTML or XML character references and entities from a text string.
> 
> @param text The HTML (or XML) source text.
> @return The plain text, as a Unicode string, if necessary.
> source: http://effbot.org/zone/re-sub.htm#unescape-html
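The effbot recipe boils down to a single re.sub with a fixup callback; a
sketch of the idea (details may differ from the module's _unescape):

```python
import re
from html.entities import name2codepoint  # htmlentitydefs on Python 2

def unescape(text):
    """Replace HTML/XML character references and entities with characters."""
    def fixup(match):
        ref = match.group(0)
        if ref.startswith("&#"):
            # Numeric character reference, decimal or hex
            try:
                if ref[2:3] in ("x", "X"):
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except ValueError:
                return ref  # malformed reference; leave it alone
        name = ref[1:-1]  # named entity, e.g. "&amp;" -> "amp"
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return ref
    return re.sub(r"&#?\w+;", fixup, text)
```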
> 
> 
> _fix_ampersand(self, line):
> 
> Replaces "&" with "&amp;" in a string; this is otherwise
> not caught by the unescape and unicodedata.normalize functions.
> 
> 
> 
> 
> 
> 
> 
> class GbifXmlTreeError(Exception): pass
> class GbifXmlTree():
> GbifXmlTree is a class for holding and processing xmltrees of GBIF records.
> 
> __init__(self, xmltree=None):
> 
> This is an instantiation method for setting up new objects of this class.
> 
> 
> print_xmltree(self):
> 
> Prints all the elements & subelements of the xmltree to screen (may require
> running fix_ASCII on the input file to succeed)
> 
> 
> print_subelements(self, element):
> 
> Takes an element from an XML tree and prints the subelements' tags & text, 
> and the within-tag items (key/value attributes).
> 
> 
> _element_items_to_dictionary(self, element_items):
> 
> If the XML tree element has items encoded in the tag, e.g. key/value or
> whatever, this function puts them in a python dictionary and returns
> them.
> 
> 
> extract_latlongs(self, element):
> 
> Create a temporary pseudofile, extract lat longs to it,
> return results as string.
> 
> Inspired by: http://www.skymind.com/~ocrow/python_string/
> (Method 5: Write to a pseudo file)
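The pseudo-file trick is just an in-memory file object: write pieces as
you walk the tree, then getvalue() at the end. A stripped-down sketch:

```python
from io import StringIO  # cStringIO.StringIO on Python 2

def build_latlong_string(records):
    """Write one tab-delimited line per (species, lat, long) tuple to an
    in-memory pseudo-file, then return the accumulated string."""
    file_str = StringIO()
    for species, lat, lon in records:
        file_str.write("%s\t%s\t%s\n" % (species, lat, lon))
    return file_str.getvalue()
```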
> 
> 
> 
> 
> _extract_latlong_datum(self, element, file_str):
> 
> Searches an element in an XML tree for lat/long information, and the
> complete name. Searches recursively, if there are subelements.
> 
> file_str is a string created by StringIO in extract_latlongs() (i.e., a 
> temp filestr)
> 
> 
> 
> extract_all_matching_elements(self, start_element, el_to_match):
> 
> Returns a list of the elements matching el_to_match (e.g. 
> TaxonOccurrence); the length of the list should equal the number of hits.
> 
> 
> 
> _recursive_el_match(self, element, el_to_match, output_list):
> 
> Search recursively through xmltree, starting with element, recording all 
> instances of el_to_match.
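The recursion here is the standard depth-first walk; for comparison,
ElementTree's built-in element.iter(tag) does the same traversal. A sketch:

```python
from xml.etree import ElementTree

def recursive_el_match(element, el_to_match, output_list):
    """Depth-first search, recording every element whose tag matches."""
    if element.tag == el_to_match:
        output_list.append(element)
    for child in element:
        recursive_el_match(child, el_to_match, output_list)
    return output_list

root = ElementTree.fromstring(
    "<resp><TaxonOccurrence key='1'/><set><TaxonOccurrence key='2'/></set></resp>")
hits = recursive_el_match(root, "TaxonOccurrence", [])
```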
> 
> 
> find_to_elements_w_ancs(self, el_tag, anc_el_tag):
> 
> Burrow into XML to get an element with tag el_tag, return only those 
> el_tags underneath a particular ancestor element anc_el_tag
> 
> 
> xml_recursive_search_w_anc(self, element, el_tag, anc_el_tag, 
> match_el_list):
> 
> Recursively burrows down to find whatever elements with el_tag exist 
> inside an anc_el_tag.
> 
> 
> 
> create_sub_xmltree(self, element):
> 
> Create a subset xmltree (to avoid going back to irrelevant parents)
> 
> 
> 
> _xml_burrow_up(self, element, anc_el_tag, found_anc):
> 
> Burrow up xml to find anc_el_tag
> 
> 
> 
> _xml_burrow_up_cousin(element, cousin_el_tag, found_cousin):
> 
> Burrow up from element of interest, until a cousin is found with 
> cousin_el_tag
> 
> 
> 
> 
> _return_parent_in_xmltree(self, child_to_search_for):
> 
> Search through an xmltree to get the parent of child_to_search_for
> 
> 
> 
> _return_parent_in_element(self, potential_parent, child_to_search_for, 
> returned_parent):
> 
> Search through an XML element to return parent of child_to_search_for
> 
> 
> find_1st_matching_element(self, element, el_tag, return_element):
> 
> Burrow down into the XML tree, retrieve the first element with the 
> matching tag
> 
> 
> 
> 
> extract_numhits(self, element):
> 
> Search an element of a parsed XML string and find the
> number of hits, if it exists.  Recursively searches,
> if there are subelements.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> class GbifSearchResultsError(Exception): pass
> 
> class GbifSearchResults():
> 
> GbifSearchResults is a class for holding a series of 
> GbifObservationRecord records, and processing them e.g. into classified 
> areas.
> 
> 
> 
> __init__(self, gbif_recs_xmltree=None):
> 
> This is an instantiation method for setting up new objects of this class.
> 
> 
> 
> print_records(self):
> 
> Print all records in tab-delimited format to screen.
> 
> 
> 
> 
> print_records_to_file(self, fn):
> 
> Print the attributes of a record to a file with filename fn
> 
> 
> 
> latlongs_to_obj(self):
> 
> Takes the string from extract_latlongs, puts each line into a
> GbifObservationRecord object.
> 
> Return a list of the objects
> 
> 
> Functions devoted to accessing/downloading GBIF records
> access_gbif(self, url, params):
> 
> Helper function to access various GBIF services
> 
> choose the URL ("url") from here:
> http://data.gbif.org/ws/rest/occurrence
> 
> params are a dictionary of key/value pairs
> 
> "self._open" is from Bio.Entrez.self._open, online here:
> http://www.biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#self._open
> 
> Get the handle of results
> (looks like e.g.: <addinfourl at 75575128 whose fp = <socket._fileobject 
> object at 0x48117f0>> )
> 
> (open with results_handle.read() )
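Stripped of the Bio.Entrez-style plumbing, the helper amounts to
urlencoding the params dict onto the base URL and opening a handle; a
sketch (the real code routes through _open, with its error checking and
rate limiting):

```python
from urllib.parse import urlencode   # urllib.urlencode on Python 2
from urllib.request import urlopen   # urllib2.urlopen on Python 2

def build_gbif_url(url, params):
    """Append a params dict to a GBIF REST base URL as a query string."""
    query = urlencode(params)
    return "%s?%s" % (url, query) if query else url

def access_gbif(url, params):
    """Open a handle to the service; read it with results_handle.read()."""
    return urlopen(build_gbif_url(url, params))
```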
> 
> 
> _get_hits(self, params):
> 
> Get the actual hits that will be returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
> 
> It will return the LAST non-none instance (in a standard search result 
> there
> should be only one, anyway).
> 
> 
> 
> 
> get_xml_hits(self, params):
> 
> Returns hits like _get_hits, but returns a parsed XML tree.
> 
> 
> 
> 
> get_record(self, key):
> 
> Given the key, get a single record, return xmltree for it.
> 
> 
> 
> get_numhits(self, params):
> 
> Get the number of hits that will be returned by a given search
> (this allows parsing & gradual downloading of searches larger
> than e.g. 1000 records)
> 
> It will return the LAST non-none instance (in a standard search result 
> there
> should be only one, anyway).
> 
> 
> xmlstring_to_xmltree(self, xmlstring):
> 
> Take the text string returned by GBIF and parse to an XML tree using 
> ElementTree.
> Requires the intermediate step of saving to a temporary file (apparently 
> needed to make ElementTree.parse work)
> 
> 
> 
> tempfn = 'tempxml.xml'
> fh = open(tempfn, 'w')
> fh.write(xmlstring)
> fh.close()
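For what it's worth, the temp file may be avoidable:
ElementTree.fromstring parses directly from a string. A sketch (I haven't
checked what originally forced the file-based workaround):

```python
from xml.etree import ElementTree

def xmlstring_to_xmltree(xmlstring):
    """Parse an XML string straight to an Element, skipping the temp file."""
    return ElementTree.fromstring(xmlstring)

root = xmlstring_to_xmltree(
    "<gbifResponse><summary totalMatched='42'/></gbifResponse>")
```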
> 
> 
> 
> 
> 
> get_all_records_by_increment(self, params, inc):
> 
> Download all of the records in stages, store in list of elements.
> Increments of e.g. 100 so as not to overload the server.
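The paging logic is a simple offset loop; sketched here with a
hypothetical fetch_page(start, maxresults) callable standing in for one
request to the GBIF occurrence service:

```python
def get_all_records_by_increment(numhits, inc, fetch_page):
    """Fetch numhits records in chunks of inc, collecting each page.

    fetch_page(start, maxresults) is a stand-in for a single GBIF
    request; the real method would issue the HTTP call here.
    """
    pages = []
    for start in range(0, numhits, inc):
        pages.append(fetch_page(start, inc))
    return pages
```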
> 
> 
> 
> extract_occurrences_from_gbif_xmltree_list(self, gbif_xmltree):
> 
> Extract all of the 'TaxonOccurrence' elements to a list, storing each as 
> a GbifObservationRecord.
> 
> 
> 
> _paramsdict_to_string(self, params):
> 
> Converts the python dictionary of search parameters into a text
> string for submission to GBIF
> 
> 
> 
> _open(self, cgi, params={}):
> 
> Function for accessing online databases.
> 
> Modified from:
> http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html
> 
> Helper function to build the URL and open a handle to it (PRIVATE).
> 
> Open a handle to GBIF.  cgi is the URL for the cgi script to access.
> params is a dictionary with the options to pass to it.  Does some
> simple error checking, and will raise an IOError if it encounters one.
> 
> This function also enforces the "three second rule" to avoid abusing
> the GBIF servers (modified after NCBI requirement).
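The "three second rule" can be enforced by remembering the wall-clock time
of the previous request and sleeping out the remainder of the interval; a
sketch (the real _open presumably keeps similar state):

```python
import time

class ThreeSecondRule(object):
    """Ensure at least min_interval seconds between successive requests."""

    def __init__(self, min_interval=3.0):
        self.min_interval = min_interval
        self._last_request = 0.0

    def wait(self):
        # Sleep for whatever is left of the interval since the last call
        remaining = self.min_interval - (time.time() - self._last_request)
        if remaining > 0:
            time.sleep(remaining)
        self._last_request = time.time()
```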
> ============
> 
> 
>>
>> Thanks again. Hope this helps,
>> Brad
> 
> Very much, thanks!!
> Nick
> 

-- 
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================


