[BioRuby] RDF Triples in BioRuby, a funding proposal to Google SoC

Rutger Vos rutgeraldo at gmail.com
Mon Mar 15 12:27:27 UTC 2010


To follow up along more practical lines, I've had to deal with similar
design issues in Bio::Phylo (Perl), TreeBASE and Mesquite (both Java).
I've learned it makes sense to have:

- a simple "annotation" object, with getters and setters for the
predicate namespace uri, the predicate string, and the value object
(either a literal or a uri),

- a get_annotations method for all (fundamental) data objects in the
toolkit that returns a collection of these annotation objects.

This way, when you serialize any bioruby object into rdf, you can add
as many other statements about that object as you want.
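
To make this concrete, here is a minimal sketch in plain Ruby. Nothing
in it is existing BioRuby API: the Annotation class, the Annotatable
mixin, add_annotation, to_ntriples and the stand-in Reference class are
all hypothetical names, and the N-Triples output is only one possible
serialization (literal escaping is limited to backslashes and quotes).

```ruby
require 'uri'

# Hypothetical annotation object: one predicate/value pair about a subject.
class Annotation
  attr_accessor :namespace_uri, :predicate, :value

  def initialize(namespace_uri, predicate, value)
    @namespace_uri = namespace_uri
    @predicate = predicate
    @value = value
  end

  # Full predicate URI: namespace followed by the predicate string.
  def predicate_uri
    "#{namespace_uri}#{predicate}"
  end

  # Render the value as a URI reference if it is a URI object,
  # otherwise as a plain string literal.
  def object_term
    if value.is_a?(URI::Generic)
      "<#{value}>"
    else
      escaped = value.to_s.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }
      "\"#{escaped}\""
    end
  end
end

# Hypothetical mixin that any fundamental data object could include.
module Annotatable
  def add_annotation(namespace_uri, predicate, value)
    (@annotations ||= []) << Annotation.new(namespace_uri, predicate, value)
    self
  end

  # Returns a copy of the collection of annotation objects.
  def get_annotations
    (@annotations ||= []).dup
  end

  # One N-Triples line per annotation, with the caller supplying the subject URI.
  def to_ntriples(subject_uri)
    get_annotations.map do |a|
      "<#{subject_uri}> <#{a.predicate_uri}> #{a.object_term} ."
    end.join("\n")
  end
end

# Stand-in class for illustration only (not Bio::Reference).
class Reference
  include Annotatable
end

ref = Reference.new
ref.add_annotation('http://purl.org/dc/elements/1.1/', 'title', 'An example title')
ref.add_annotation('http://www.w3.org/2000/01/rdf-schema#', 'seeAlso',
                   URI('http://www.nexml.org'))
puts ref.to_ntriples('http://togows.dbcls.jp/entry/ncbi-pubmed/16381885')
```

The point of the mixin is that the bookkeeping lives in one place, and
each subdomain can attach its own statements without the core classes
needing to know about any particular ontology.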

Would a refactoring along those lines have a chance of being
acceptable to the bioruby community (of course subsequent to a more
detailed RFC, testing, discussion, proof of concept, etc.)?

On Thursday, March 11, 2010, Rutger Vos <rutgeraldo at gmail.com> wrote:
> Hi Toshiaki,
>
> great to hear there's already been a lot of discussion over this.
> (Well, I'd be surprised if there hadn't been :))
>
> It looks to me like some fairly major bookkeeping would need to be
> implemented high up in the inheritance tree if *all* bioruby objects
> are to be serialized into RDF. It also would require all of bioruby to
> be ontologized in one fell swoop.
>
> It is perhaps more likely that subdomains are going to be ontologized
> more or less independently from one another (as you mention,
> references->RDF, or in my case phylogenetics->RDF) based implicitly on
> intermediate data formats (pubmed records and nexml, respectively).
>
> That is probably OK, we do things as needs arise.
>
> But it would be handy if the API were at least general enough that
> this is extensible and we can make additional statements *about*
> objects when we serialize them to RDF. For example, in your pubmed
> turtle file, the subject is always
> <http://togows.dbcls.jp/entry/ncbi-pubmed/16381885>. Is there a way,
> programmatically, where I can add additional statements about
> <http://togows.dbcls.jp/entry/ncbi-pubmed/16381885>?
>
> Rutger
>
> On Wed, Mar 10, 2010 at 2:21 PM, Toshiaki Katayama <ktym at hgc.jp> wrote:
>> Hi Rutger,
>>
>> Thank you for your inputs on GSoC 2010!
>>
>>> * is there a way to express triples in BioRuby?
>>> * if there is not, what would be a good design to express triples in
>>> BioRuby so that this would be more useful than just for NeXML?
>>
>> This is what we discussed during the pre-BioHackathon 2010.
>>
>> http://hackathon3.dbcls.jp/wiki/BioRuby
>>
>> My first idea was to make all BioRuby objects have a common output
>> method to render the object contents in various formats
>> (such as RDF/XML, Turtle, HTML, GFF, FASTA etc. if appropriate).
>>
>> Then, we tried to separate view from logic using erb, but as you
>> see in the above page, it still looks ugly. This is mainly because
>> view formatting itself requires some additional code, specific
>> to each format.
>>
>> Therefore, we don't have a solid conclusion on this yet, unfortunately.
>>
>> Anyway, we already have a PubMed to RDF converter written in Ruby as
>> the TogoWS REST API (http://togows.dbcls.jp/site/en/rest.html) at
>>
>> http://togows.dbcls.jp/entry/pubmed/16381885
>> --> http://togows.dbcls.jp/entry/pubmed/16381885.ttl
>>
>> and we are also trying to support KEGG to RDF conversion in this
>> framework as well. I think we can put the code in BioRuby when we have finished.
>>
>> Your suggestions are welcome. :)
>>
>> Regards,
>> Toshiaki
>>
>> On 2010/03/10, at 22:22, Rutger Vos wrote:
>>
>>> Dear BioRuby-ites,
>>>
>>> my apologies that my first email to this list is so long and
>>> tangential. I am trying to find out how to express RDF triples in
>>> BioRuby. In this email I'm explaining why I care enough to try to get
>>> funding for someone to work on this. If you don't care about any of
>>> this, you can stop reading now.
>>>
>>> The National Evolutionary Synthesis Center (NESCent.org) is planning
>>> to be a mentoring organization for the Google Summer of Code 2010. I
>>> have submitted a project idea to this: to develop NeXML I/O and -
>>> probably more importantly for you - RDF capabilities for BioRuby. If
>>> funded, a student/coder will work on this full time over the summer,
>>> under the shared supervision of Jan Aerts and myself. Here is the
>>> link: http://tinyurl.com/biorubynexml
>>>
>>> NeXML is a data format for phylogenetic data that can be read and
>>> written in perl, python, java and (to some extent) c++ and javascript.
>>> RDF is the cool "new" thing (as per BioHackathon2010), but as far as I
>>> can tell BioRuby isn't completely up to speed for it, yet.
>>>
>>> (As an aside: you might ask yourself why there is something like NeXML
>>> when there is PhyloXML for BioRuby. The answer is that NeXML solves a
>>> different problem: PhyloXML started essentially as a next generation
>>> of New Hampshire eXtended (NHX) to meet the annotation needs of
>>> comparative genomics, things such as gene duplications and other
>>> molecular evolution events, on phylogenetic trees; NeXML started as a
>>> complete XML representation of the NEXUS format, providing other
>>> comparative data types such as categorical and continuous character
>>> state matrices, restriction site matrices, and so on, in addition to
>>> trees, taxa, sequence alignments. There is obviously some overlap
>>> between the formats, but I guess that is not unique in bioinformatics
>>> :))
>>>
>>> NeXML has a semantic annotation facility that uses RDFa. This allows
>>> us to add additional metadata to a fundamental phylogenetic data
>>> object (a tree, taxon, character, etc.) to form a "triple": the
>>> fundamental data object is the triple Subject, and the Predicate and
>>> Object are added as RDFa attributes. Since NeXML can be transformed
>>> using a standard XSL stylesheet to RDF/XML, we can express a limitless
>>> number of statements about phylogenetics. H

-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com


