[DAS2] Mark Gibson on Apollo writeback to Chado

Mon Apr 3 16:29:55 UTC 2006

I've attached a PowerPoint presentation that is probably easier to
glance at than reading through this whole email. The first half of it
is about Apollo transactions.

Mark

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmod-sri-13.ppt
Type: application/vnd.ms-powerpoint
Size: 599552 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/das2/attachments/20060403/bcb3bd34/attachment-0001.ppt>
-------------- next part --------------

On Mar 27, 2006, at 2:42 PM, Nomi Harris wrote:

> mark gibson said that he plans to attend next monday's DAS/2
> teleconference.  he also gave me permission to forward this message that
> he wrote recently in response to a group that is adapting apollo and
> wondered what he thought about direct-to-chado writeback vs. the use of
> chadoxml as an intermediate storage format.  FlyBase Harvard prefers to
> use the latter approach because (we gather) they worry about possibly
> corrupting the database by having clients write directly to it.  if
> anyone from harvard is reading this and feels that mark has
> misrepresented their approach, please set us straight!
>
>                Nomi
>
> On 10 March 2006, Mark Gibson wrote:
>> I'm rather biased, as I wrote the Chado JDBC adapter [for Apollo], but let
>> me put forth my view of Chado JDBC vs. Chado XML.
>>
>> The Chado JDBC adapter is transactional; the Chado XML adapter is not. What
>> this means is that JDBC only makes changes in the database that reflect what
>> has actually been changed in the Apollo session, like updating a row in a
>> table; with Chado XML you just get the whole dump. So if a synonym has been
>> added, JDBC will add a row to the synonym table. For XML you will get the
>> whole dump of the region you were editing (probably a gene), no matter how
>> small the edit.
>>
>> What I believe Harvard/FlyBase then does (with Chado XML) is wipe out the
>> gene from the database and reinsert the gene from the Chado XML. The problem
>> with this approach is that if you have data in the db that's not associated
>> with Apollo (for FlyBase this would be phenotype data), then that will get
>> wiped out as well, and there has to be some way of reinstating non-Apollo
>> data. If you don't have non-Apollo data and don't intend to have it in the
>> future, this isn't a huge issue, I suppose. I think Harvard is integrating
>> non-Apollo data into their Chado database.
>>
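A minimal sketch of the difference between the two writeback styles described above, assuming a toy in-memory SQLite database. The three tables and the function names are hypothetical simplifications for illustration; they are not the real Chado schema or anything from the Apollo adapters.

```python
import sqlite3

# Hypothetical, heavily simplified tables; the real Chado schema is different.
SCHEMA = """
CREATE TABLE feature   (feature_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE synonym   (feature_id INTEGER, name TEXT);
CREATE TABLE phenotype (feature_id INTEGER, description TEXT);
"""

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO feature VALUES (1, 'geneA')")
    conn.execute("INSERT INTO synonym VALUES (1, 'gA')")
    # Data the editor never sees (the "non-Apollo" data of the email).
    conn.execute("INSERT INTO phenotype VALUES (1, 'short wings')")
    return conn

def write_transaction(conn, feature_id, new_synonym):
    """Transactional writeback: touch only the row that the edit changed."""
    with conn:
        conn.execute("INSERT INTO synonym VALUES (?, ?)",
                     (feature_id, new_synonym))

def wipe_and_reload(conn, dump):
    """Dump-style writeback: delete the gene everywhere, reinsert the dump."""
    fid = dump["feature_id"]
    with conn:
        for table in ("synonym", "phenotype", "feature"):
            conn.execute(f"DELETE FROM {table} WHERE feature_id = ?", (fid,))
        conn.execute("INSERT INTO feature VALUES (?, ?)", (fid, dump["name"]))
        conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                         [(fid, s) for s in dump["synonyms"]])

def count(conn, table):
    """Number of rows currently in the given table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

In this toy setup, adding one synonym with `write_transaction` leaves the phenotype row alone, while `wipe_and_reload` with a dump that only knows the gene name and its synonyms ends with an empty phenotype table, which is the data-loss concern raised above.
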
>> I think what they are going to do is actually figure out all of the
>> transactions by comparing the Chado XML with the Chado database, which is
>> what Apollo already does, but I'm not sure, as I'm not so in touch with them
>> these days (I'm not working with Apollo at the moment - waiting for a new
>> grant to kick in).
>>
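To illustrate what "figuring out the transactions by comparing" could look like, here is a small sketch that diffs two snapshots of synonym data and emits add/delete operations. The dictionary representation and the function name `diff_synonyms` are assumptions made for this example; it is not how FlyBase or Apollo actually compute transactions.

```python
def diff_synonyms(db_state, dump_state):
    """Return (op, feature, synonym) tuples that turn db_state into dump_state.

    Both arguments map a feature name to a set of synonyms (a hypothetical
    stand-in for the database contents and the parsed XML dump).
    """
    ops = []
    for feature in sorted(set(db_state) | set(dump_state)):
        old = db_state.get(feature, set())
        new = dump_state.get(feature, set())
        # Synonyms only in the database must be deleted ...
        ops += [("delete", feature, s) for s in sorted(old - new)]
        # ... and synonyms only in the dump must be added.
        ops += [("add", feature, s) for s in sorted(new - old)]
    return ops
```

Anything the dump does not describe at all (such as phenotype data) never shows up in the diff, so it is left untouched rather than wiped.
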
>> Since the paradigm with Chado XML is wipe out & reload, Apollo has to make
>> sure it preserves every bit of the Chado XML that came in. There's a bunch
>> of stuff in Chado/Chado XML that the Apollo datamodel is unconcerned with,
>> and has no need to be concerned with, as it's stuff that it doesn't
>> visualize. In other words, Apollo's data model is solely for Apollo's task
>> of visualizing data, not for roundtripping what we call non-Apollo data. In
>> writing the Chado XML adapter for FlyBase, Nomi Harris had a heck of a time
>> with these issues, and she can elaborate on this, I suppose.
>>
>> I'm personally not fond of Chado XML because it's basically a relational
>> database dump, so it's extremely verbose. It redundantly has information
>> for lots of joins to data in other tables - a cvterm entry can take 10 or
>> 20 lines of Chado XML, and a given cvterm may be used a zillion times in a
>> given Chado XML file (as every feature has a cvterm). So these files can
>> get rather large.
>>
>> The solution for this verbose output is to use what I call macros in Chado
>> XML. Macros are supported by XORT. They take the 15-line cvterm entry and
>> reduce it to a line or two, making the file size much more reasonable. The
>> Apollo Chado XML adapter does not support macros, so you have to use
>> unmacro'd Chado XML for Apollo purposes. Nomi Harris had a hard enough time
>> getting the Chado XML adapter working for FlyBase (and did a great job with
>> a harrowing task) that she did not have time to take on the macro issue. If
>> you wanted macros (and smaller file sizes) you would have to add this
>> functionality to the Chado XML adapter (are there Java programmers in your
>> group?).
>>
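The macro idea can be sketched as "define a cvterm once, refer to it by a short id, and expand the references when unmacro'd XML is needed". The element layout and the `id` convention below are assumptions for illustration only and should not be read as XORT's exact macro syntax.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical macro'd input: the cvterm is defined once and referenced by id.
MACROED = """
<chado>
  <cvterm id="exon_term"><name>exon</name><cv><name>SO</name></cv></cvterm>
  <feature><uniquename>e1</uniquename><type_id>exon_term</type_id></feature>
  <feature><uniquename>e2</uniquename><type_id>exon_term</type_id></feature>
</chado>
"""

def expand_macros(xml_text):
    """Replace each id reference in <type_id> with a full copy of its cvterm."""
    root = ET.fromstring(xml_text)
    # Top-level elements carrying an id attribute are treated as definitions.
    macros = {el.get("id"): el for el in root if el.get("id")}
    for el in macros.values():
        root.remove(el)
    for ref in list(root.iter("type_id")):
        target = macros.get((ref.text or "").strip())
        if target is not None:
            full = copy.deepcopy(target)
            del full.attrib["id"]   # the expanded copy is no longer a macro
            full.tail = None
            ref.text = None
            ref.append(full)
    return root
```

Calling `ET.tostring(expand_macros(MACROED), encoding="unicode")` gives the verbose form, with the whole cvterm repeated inside every feature, which is why un-macro'd files grow so quickly.
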
>> One of the arguments against the JDBC adapter is that it's dangerous
>> because it goes straight into the database, so if there are any bugs in the
>> data adapter then the database could get corrupted - some groups find this
>> a bit precarious. This is a valid argument. I think there are two solutions
>> here. One is to thoroughly test the adapter against a test database until
>> you are confident that the bugs are hammered out.
>>
>> Another solution is to not go straight from Apollo to the database. You can
>> use an interim format and actually use Apollo to get that interim format
>> into the database. Of course one choice for interim format is Chado XML, and
>> then you are at the Chado XML solution. The other choice for file format is
>> GAME XML. You can then use Apollo to load GAME into the Chado database, and
>> this can be done at the command line (with batching) so you don't have to
>> bring up the GUI to do it. Chado XML can also be loaded into Chado via
>> Apollo (of course XORT does this as well, but not with transactions).
>>
>> So then the question is: if I'm not going to go straight into the database,
>> why would I choose GAME over Chado XML? Or, if I'm using Chado XML, should I
>> use Apollo or XORT to load into Chado? I think if you are using Chado XML it
>> makes sense to use XORT, as it is the tried & true technology for Chado XML.
>> The advantage of going through Apollo is that it also uses the transactions
>> from Apollo (there's a transaction XML file) and thus writes back the edits
>> in a transactional way, as mentioned above, rather than in a wipe out &
>> reload fashion.
>>
>> Also, GAME is a tried & true technology that has been used with Apollo in
>> production at FlyBase (before Chado came along) for many years now. One
>> criticism of it has been that its DTD/XSD/schema has been a moving target
>> and has never been described. That is not as true anymore: Nomi Harris has
>> made an XSD for it as well as an RNG. But I must confess that I have
>> recently added the ability to have one-level annotations in GAME
>> (previously one-level annotations had to be hacked as three levels). Also,
>> GAME is a lot less verbose than un-macro'd Chado XML, as it more or less
>> fits with the Apollo datamodel. One advantage of Chado XML over GAME XML is
>> that it is more flexible in terms of taking on features of arbitrary depth.
>>
>> The Chado XML adapter was developed for FlyBase and, as far as I know, has
>> not been taken on by any other groups yet. Nomi can elaborate on this, but
>> I think what this might mean is that there are places where things are
>> FlyBase-specific. If you went with Chado XML, the adapter would have to be
>> generalized. It's a good exercise for the adapter to go through, but it will
>> take a bit of work. Nomi can probably comment on how hard generalizing might
>> be. I could be wrong about this, but I think the current status of the Chado
>> XML adapter is that Harvard has done a bunch of testing on it but they
>> haven't put it into production yet.
>>
>> The JDBC adapter is being used by several groups, so it has been forced to
>> be generalized. One thing I have found is that Chado databases vary all too
>> much from MOD to MOD (ontologies change). There is a configuration file for
>> the JDBC adapter that has settings for the differences that I encountered.
>> I initially wrote it for Cold Spring Harbor's rice database, which will be
>> used in classrooms. It's working for rice in theory, but they haven't
>> actually used it much in the classroom yet. For rice the model is to save to
>> GAME and use the Apollo command line to save GAME & transactions back to
>> Chado.
>>
>> Cyril Pommier, at INRA - URGI - Bioinformatique, has taken on the JDBC
>> adapter for his group. I have cc'd him on this email, as I think he will
>> have a lot to say about the JDBC adapter. Cyril has uncovered many bugs and
>> has fixed a lot of them (thank you, Cyril), as he's a very savvy Java
>> programmer. He has also forced the adapter to generalize and brought about
>> the evolution of the config file to adapt to Chado differences. But as Cyril
>> can attest (Cyril, feel free to elaborate), it has been a lot of work to get
>> JDBC working for him. There were a lot of bugs to fix that we both went
>> after. Hopefully it's now a bit more stable and the next db/MOD won't have
>> as many problems. I think Cyril is still at the test phase and hasn't gone
>> into production (Cyril?).
>>
>> Berkeley is using the JDBC adapter for an in-house project. They are using
>> the JDBC reader to load up GAME files (as the straight JDBC reader is slow,
>> because the Chado db is rather slow), which are then loaded by a curator.
>> They are saving GAME, and then I think Chris Mungall is XSLTing GAME to
>> Chado XML, which is then saved with XORT - or he is somehow writing GAME in
>> another way - not actually sure. The Berkeley group drove the need for
>> one-level annotations (in JDBC, GAME, & the Apollo datamodel).
>>
>> Jonathan Crabtree at TIGR wrote the JDBC read adapter, and they use it
>> there. I believe they are intending to use the write adapter but don't yet
>> do so (Jonathan?).
>>
>> I should mention that reading JDBC straight from Chado tends to be slow, as
>> I find that Chado is a slow database, at least for Berkeley. It really
>> depends on the db vendor and the amount of data. TIGR's reading is actually
>> really zippy. The workaround for slow Chados is to dump GAME files, which
>> read in pretty fast.
>>
>> In all fairness, you should probably email FlyBase (& Chris Mungall) and
>> get the pros of using Chado XML & XORT, which they can give a far better
>> answer on than I can.
>>
>> Hope this helps,
>> Mark