[Bioperl-l] fetching exons in genomic coordinates from NCBI

Dave Messina David.Messina at sbc.su.se
Fri May 27 09:22:14 UTC 2011


On Fri, May 27, 2011 at 10:20, Reece Hart <reecehart at gmail.com> wrote:

> On Wed, May 25, 2011 at 5:13 AM, Dave Messina <David.Messina at sbc.su.se> wrote:
>
>> As far as I know, you're doing it the NCBI recommended way, byzantine
>> though it may be. Of course I too would be keen to hear of a better approach
>> if anyone's got one.
>>
>
> Is that really a "recommended" way? Aside from the NCBI eutils pages which
> describe how to submit queries, I didn't see anything about how to process
> the results.
>


When I said that, I was thinking about the esearch and efetch part, but now
that I look around, I believe that yes, the NCBI expects us to parse the XML
using XML libraries such as libxml2.
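
To make the "parse it with an XML library" route concrete, here is a minimal
sketch in Python using only the standard library. It is not an NCBI-endorsed
recipe. The embedded XML is a hand-written, heavily simplified fragment with
made-up coordinates, and the element names (Gene-commentary_genomic-coords,
Seq-interval, Seq-interval_from, Seq-interval_to) are assumptions that should
be checked against the Entrezgene DTD. The helper name genomic_intervals is
hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of an Entrez Gene efetch
# record. Element names are assumptions; the coordinates are invented.
SAMPLE_XML = """\
<Entrezgene-Set>
  <Entrezgene>
    <Gene-commentary>
      <Gene-commentary_genomic-coords>
        <Seq-loc><Seq-loc_int><Seq-interval>
          <Seq-interval_from>100</Seq-interval_from>
          <Seq-interval_to>250</Seq-interval_to>
        </Seq-interval></Seq-loc_int></Seq-loc>
        <Seq-loc><Seq-loc_int><Seq-interval>
          <Seq-interval_from>400</Seq-interval_from>
          <Seq-interval_to>520</Seq-interval_to>
        </Seq-interval></Seq-loc_int></Seq-loc>
      </Gene-commentary_genomic-coords>
    </Gene-commentary>
  </Entrezgene>
</Entrezgene-Set>
"""


def genomic_intervals(xml_text):
    """Return (from, to) pairs for every Seq-interval found under a
    Gene-commentary_genomic-coords element, in document order."""
    root = ET.fromstring(xml_text)
    intervals = []
    for coords in root.iter("Gene-commentary_genomic-coords"):
        for interval in coords.iter("Seq-interval"):
            start = interval.findtext("Seq-interval_from")
            end = interval.findtext("Seq-interval_to")
            # Skip intervals that lack either endpoint instead of guessing.
            if start is None or end is None:
                continue
            intervals.append((int(start), int(end)))
    return intervals


if __name__ == "__main__":
    for start, end in genomic_intervals(SAMPLE_XML):
        print(start, end)
```

A real efetch record is much larger and nests these elements more deeply, so
the paths would likely need narrowing (for example, to one particular
transcript) before the intervals can be read as exons.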

Or XmlWrapp. See this relatively current page which states that "the NCBI
C++ Toolkit has incorporated and enhanced the open source XmlWrapp package,
which provides a simplified way for developers to work with XML.":

    http://www.ncbi.nlm.nih.gov/books/NBK8829/

There is also Genome Workbench, which I have no experience with, but which
apparently does read NCBI's XML:

    http://www.ncbi.nlm.nih.gov/projects/gbench/


> So, I ended up reverse engineering the XML by comparing several efetch
> results with web pages.


If you haven't already, you might take a look at the DTDs and schemas:
http://www.ncbi.nlm.nih.gov/data_specs/dtd/
http://www.ncbi.nlm.nih.gov/data_specs/schema/

In particular, I think the ones you want are these:
http://www.ncbi.nlm.nih.gov/dtd/NCBI_Entrezgene.mod.dtd
http://www.ncbi.nlm.nih.gov/data_specs/schema/NCBI_Entrezgene.mod.xsd


I am certainly not an expert in this area, but yeah, it sure seems like
there should be some more human-readable guide to their XML formats than
just the above.
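
In the absence of such a guide, one stopgap is to boil a DTD down to a plain
list of element names and their content models. The sketch below does that
with a regular expression in Python. It assumes the DTD declares elements with
ordinary <!ELEMENT name (content)> declarations; the sample text is a
hand-written fragment, not a copy of the NCBI file, and the helper name
element_outline is hypothetical.

```python
import re

# Hand-written, simplified DTD fragment for illustration only.
SAMPLE_DTD = """\
<!-- simplified fragment, not the real NCBI DTD -->
<!ELEMENT Seq-interval (
        Seq-interval_from,
        Seq-interval_to)>
<!ELEMENT Seq-interval_from (#PCDATA)>
<!ELEMENT Seq-interval_to (#PCDATA)>
"""


def element_outline(dtd_text):
    """Map each declared element name to its content model, with runs of
    whitespace collapsed to single spaces so each model fits on one line."""
    # Drop comments first so commented-out declarations are not picked up.
    no_comments = re.sub(r"<!--.*?-->", "", dtd_text, flags=re.DOTALL)
    outline = {}
    pattern = r"<!ELEMENT\s+(\S+)\s+(.*?)>"
    for name, model in re.findall(pattern, no_comments, flags=re.DOTALL):
        outline[name] = " ".join(model.split())
    return outline


if __name__ == "__main__":
    for name, model in element_outline(SAMPLE_DTD).items():
        print(name, "->", model)
```

This is a text-level shortcut rather than a real DTD parser, so it ignores
attribute lists and parameter entities; it is only meant to give a quick
overview of which elements contain which.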


Dave


