[DAS2] Mark Gibson on Apollo writeback to Chado

Mon Mar 27 19:42:59 UTC 2006

mark gibson said that he plans to attend next monday's DAS/2
teleconference.  he also gave me permission to forward this message that
he wrote recently in response to a group that is adapting apollo and
wondered what he thought about direct-to-chado writeback vs. the use of
chadoxml as an intermediate storage format.  FlyBase Harvard prefers to
use the latter approach because (we gather) they worry about possibly
corrupting the database by having clients write directly to it.  if
anyone from harvard is reading this and feels that mark has
misrepresented their approach, please set us straight!

               Nomi

On 10 March 2006, Mark Gibson wrote:
 > I'm rather biased as I wrote the chado jdbc adapter [for Apollo], but let me put forth my
 > view of chado jdbc vs chado xml.
 > 
 > The chado Jdbc adapter is transactional, the chado xml adapter is not. What this 
 > means is jdbc only makes changes in the database that reflect what has actually 
 > been changed in the apollo session, like updating a row in a table; with chado 
 > xml you just get the whole dump. So if a synonym has been added jdbc will add a 
 > row to the synonym table. For xml you will get the whole dump of the region you 
 > were editing (probably a gene) no matter how small the edit.
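
A minimal sketch of the difference described above, not Apollo code. It assumes a toy two-table schema (much simpler than the real Chado schema) and uses Python's sqlite3 in place of JDBC; every table and function name here is made up for illustration. The transactional path writes one row for one edit, while the dump path returns every row for the gene.

```python
import sqlite3

def make_db():
    """Toy database: one gene with two synonyms (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE synonym (
            synonym_id INTEGER PRIMARY KEY,
            feature_id INTEGER,
            name TEXT
        );
        INSERT INTO feature VALUES (1, 'geneA');
        INSERT INTO synonym (feature_id, name) VALUES (1, 'syn1'), (1, 'syn2');
    """)
    return conn

def add_synonym(conn, feature_id, name):
    """Transactional style: write only the edit made in the session."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO synonym (feature_id, name) VALUES (?, ?)",
            (feature_id, name),
        )
    return cur.rowcount  # number of rows written by this edit

def dump_gene(conn, feature_id):
    """Dump style: return every row for the gene, edited or not."""
    rows = [conn.execute(
        "SELECT feature_id, name FROM feature WHERE feature_id = ?",
        (feature_id,),
    ).fetchone()]
    rows += conn.execute(
        "SELECT synonym_id, name FROM synonym WHERE feature_id = ? "
        "ORDER BY synonym_id",
        (feature_id,),
    ).fetchall()
    return rows

conn = make_db()
rows_written = add_synonym(conn, 1, "syn3")   # the single edit
rows_dumped = len(dump_gene(conn, 1))         # the whole gene
print(rows_written, rows_dumped)
```

The gap between the two numbers grows with the size of the gene, while the transactional write stays proportional to the edit.
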
 > 
 > What I believe Harvard/Flybase then does (with chado xml) is wipe out the gene 
 > from the database and reinsert the gene from the chado xml. The problem with 
 > this approach is if you have data in the db that's not associated with apollo
 > (for flybase this would be phenotype data) then that will get wiped out as well,
 > and there has to be some way of reinstating non-apollo data. If you don't have
 > non-apollo data and don't intend on having it in the future this isn't a huge
 > issue I suppose. I think Harvard is integrating non-apollo data into their chado 
 > database.
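
A minimal sketch of why wipe-and-reload can lose non-apollo data. It assumes a toy schema in which dependent rows are deleted along with the gene (modelled here with ON DELETE CASCADE), and a dump that carries only what the editor knows about. The table names, the dump layout, and the function names are hypothetical, and this is not how FlyBase's actual loader is written.

```python
import sqlite3

def make_db():
    """Toy database: a gene with one synonym and one phenotype row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
    conn.executescript("""
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE synonym (
            gene_id INTEGER REFERENCES gene ON DELETE CASCADE,
            name TEXT
        );
        CREATE TABLE phenotype (
            gene_id INTEGER REFERENCES gene ON DELETE CASCADE,
            description TEXT
        );
        INSERT INTO gene VALUES (1, 'geneA');
        INSERT INTO synonym VALUES (1, 'syn1');
        INSERT INTO phenotype VALUES (1, 'curated outside the editor');
    """)
    return conn

def wipe_and_reload(conn, dump):
    """Delete the gene (and its dependents), then reinsert from the dump."""
    with conn:
        conn.execute("DELETE FROM gene WHERE gene_id = ?", (dump["gene_id"],))
        conn.execute("INSERT INTO gene VALUES (?, ?)",
                     (dump["gene_id"], dump["name"]))
        conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                         [(dump["gene_id"], s) for s in dump["synonyms"]])

def count(conn, table):
    """Row count for one of the toy tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = make_db()
# The dump knows about the gene and its synonyms, but not about phenotypes.
dump = {"gene_id": 1, "name": "geneA", "synonyms": ["syn1", "syn2"]}
wipe_and_reload(conn, dump)
synonyms_after = count(conn, "synonym")
phenotypes_after = count(conn, "phenotype")
print(synonyms_after, phenotypes_after)
```

The phenotype row is gone after the reload because nothing in the dump could put it back, which is the reinstatement problem described above.
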
 > 
 > I think what they are going to do is actually figure out all of the transactions 
 > by comparing the chado xml with the chado database, which is what apollo already 
 > does, but I'm not sure as I'm not so in touch with them these days (as I'm not
 > working with apollo these days - waiting for new grant to kick in).
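
A minimal sketch of deriving transactions by comparing an edited dump with the database. It assumes both sides can be reduced to sets of rows per table. The snapshot layout and function name are hypothetical, and neither Apollo's nor Harvard's actual comparison is claimed to work this way.

```python
def diff_to_transactions(db_rows, edited_rows):
    """List row-level changes needed to bring the db in line with the edit.

    Both arguments map a table name to a set of row tuples. Only tables
    present in the edited snapshot are compared, so tables the editor
    never saw are left alone.
    """
    transactions = []
    for table in sorted(edited_rows):
        before = db_rows.get(table, set())
        after = edited_rows[table]
        for row in sorted(after - before):
            transactions.append(("insert", table, row))
        for row in sorted(before - after):
            transactions.append(("delete", table, row))
    return transactions

db_rows = {
    "synonym": {("geneA", "syn1"), ("geneA", "syn2")},
    "phenotype": {("geneA", "curated outside the editor")},
}
edited_rows = {
    "synonym": {("geneA", "syn1"), ("geneA", "syn2"), ("geneA", "syn3")},
}
transactions = diff_to_transactions(db_rows, edited_rows)
print(transactions)
```

Restricting the comparison to tables that appear in the edited snapshot is a design choice made for this sketch: it keeps data the editor never loaded out of the delete list.
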
 > 
 > Since the paradigm with chado xml is wipe out & reload, then apollo has to make 
 > sure it preserves every bit of the chado xml that came in. There's a bunch of
 > stuff that's in chado/chado xml that the apollo datamodel is unconcerned with,
 > and has no need to be concerned with as it's stuff that it doesn't visualize. In
 > other words apollo's data model is solely for apollo's task of visualizing data,
 > not for roundtripping what we call non-apollo data. In writing the chado xml 
 > adapter for FlyBase, Nomi Harris had a heck of a time with these issues, and she 
 > can elaborate on this I suppose.
 > 
 > I'm personally not fond of chado xml because it's basically a relational database
 > dump, so it's extremely verbose. It redundantly has information for lots of joins
 > to data in other tables - like a cvterm entry can take 10 or 20 lines of chado 
 > xml, and a given cvterm may be used a zillion times in a given chado xml file 
 > (as every feature has a cvterm). So these files can get rather large.
 > 
 > The solution for this verbose output is to use what I call macros in chado xml. 
 > Macros are supported by xort. They take the 15 line cvterm entry and reduce it 
 > to a line or 2 making the file size much more reasonable. The apollo chado xml 
 > adapter does not support macros, so you have to use unmacro'd chado xml for 
 > apollo purposes. Nomi Harris had a hard enough time getting the chado xml 
 > adapter working for flybase (and did a great job with a harrowing task), that she
 > did not have time to take on the macro issue. If you wanted macros (and smaller 
 > file sizes) you would have to add this functionality to the chado xml adapter 
 > (are there java programmers in your group?).
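
A rough sketch of the size argument for macros. It assumes the saving comes from defining a cvterm once and referring to it by a short id afterwards. The element layout and the id-reference mechanism below are illustrative assumptions only and should not be read as XORT's actual macro syntax.

```python
# Fully expanded type reference, repeated inside every feature (10 lines).
VERBOSE_TYPE = """\
<type_id>
  <cvterm>
    <name>exon</name>
    <cv_id>
      <cv>
        <name>SO</name>
      </cv>
    </cv_id>
  </cvterm>
</type_id>"""

# One-time definition that later features refer to by id (8 lines).
DEFINITION = """\
<cvterm id="exon_term">
  <name>exon</name>
  <cv_id>
    <cv>
      <name>SO</name>
    </cv>
  </cv_id>
</cvterm>"""

def verbose_feature(name):
    """Feature carrying the full cvterm expansion."""
    return (f"<feature>\n<uniquename>{name}</uniquename>\n"
            f"{VERBOSE_TYPE}\n</feature>")

def compact_feature(name, ref):
    """Feature carrying only a one-line reference to the definition."""
    return (f"<feature>\n<uniquename>{name}</uniquename>\n"
            f"<type_id>{ref}</type_id>\n</feature>")

def line_count(text):
    return len(text.splitlines())

names = [f"exon{i}" for i in range(1000)]
verbose_doc = "\n".join(verbose_feature(n) for n in names)
compact_doc = DEFINITION + "\n" + "\n".join(
    compact_feature(n, "exon_term") for n in names)

verbose_lines = line_count(verbose_doc)
compact_lines = line_count(compact_doc)
print(verbose_lines, compact_lines)
```

The one-time definition costs a few lines up front, and every later use of the same term then costs one line instead of ten.
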
 > 
 > One of the arguments against the jdbc adapter is that it's dangerous because it
 > goes straight into the database so if there are any bugs in the data adapter
 > then the database could get corrupted - some groups find this a bit precarious.
 > This is a valid argument. I think there are 2 solutions here. One is to thoroughly
 > test the adapter out against a test database until you are confident that bugs 
 > are hammered out.
 > 
 > Another solution is to not go straight from apollo to the database. You can use 
 > an interim format and actually use apollo to get that interim format into the 
 > database. Of course one choice for interim format is chado xml and then you are 
 > at the chado xml solution. The other choice for file format is GAME xml. You
 > can then use apollo to load game into the chado database, and this can be done
 > at the command line (with batching) so you don't have to bring up the gui to do
 > it. Also chado xml can be loaded into chado via apollo as well (of course xort
 > does this as well but not with transactions).
 > 
 > So then the question is if I'm not going to go straight into the database, why
 > would I choose game over chado xml?  Or if I'm using chado xml should I use
 > apollo or xort to load into chado? I think if you are using chado xml it makes
 > sense to use xort as it is the tried & true technology for chado xml. The
 > advantage of going through apollo is that it also uses the transactions from
 > apollo (there's a transaction xml file) and thus writes back the edits in a
 > transactional way as mentioned above rather than in a wipe out & reload fashion.
 > 
 > Also Game is a tried & true technology that has been used with apollo in 
 > production at flybase (before chado came along) for many years now. One 
 > criticism of it has been that its DTD/XSD/schema has been a moving target and has
 > never been described. That is not as true anymore. Nomi Harris has made an xsd for
 > it as well as a rng. But I must confess that I have recently added the ability 
 > to have one level annotations in game (previously 1 levels had to be hacked as 3 
 > levels). Also game is a lot less verbose than un-macro'd chado xml, as it more 
 > or less fits with the apollo datamodel. One advantage of chado xml over game xml 
 > is that it is more flexible in terms of taking on features of arbitrary depth.
 > 
 > The chado xml adapter was developed for FlyBase and as far as I know has not 
 > been taken on by any other groups yet. Nomi can elaborate on this, but I think 
 > what this might mean is that there are places where things are FlyBase specific. 
 > If you went with chado xml the adapter would have to be generalized. It's a good
 > exercise for the adapter to go through, but it will take a bit of work. Nomi can
 > probably comment on how hard generalizing might be. I could be wrong about this
 > but I think the current status with the chado xml adapter is that Harvard has
 > done a bunch of testing on it but they haven't put it into production yet.
 > 
 > The jdbc adapter is being used by several groups so has been forced to be 
 > generalized. One thing I have found is that chado databases vary all too much 
 > from MOD to MOD (ontologies change). There is a configuration file for the jdbc
 > adapter that has settings for the differences that I encountered. I initially
 > wrote it for Cold Spring Harbor's rice database that will be used in classrooms.
 > It's working for rice in theory, but they haven't actually used it much in the
 > classroom yet. For rice the model is to save to game and use apollo command line 
 > to save game & transactions back to chado.
 > 
 > Cyril Pommier, at the INRA - URGI - Bioinformatique, has taken on the jdbc 
 > adapter for his group. I have cc'd him on this email as I think he will have a 
 > lot to say about the jdbc adapter. Cyril has uncovered many bugs and has fixed a 
 > lot of them (thank you Cyril) as he's a very savvy java programmer. And he has
 > also forced the adapter to generalize and brought about the evolution of the 
 > config file to adapt to chado differences. But as Cyril can attest (Cyril feel 
 > free to elaborate) it has been a lot of work to get jdbc working for him. There 
 > were a lot of bugs to fix that we both went after. Hopefully now it's a bit more
 > stable and the next db/MOD won't have as many problems. I think Cyril is still at
 > the test phase and hasn't gone into production (Cyril?)
 > 
 > Berkeley is using the jdbc adapter for an in house project. They are using the
 > jdbc reader to load up game files (as reading straight from jdbc is slow, since
 > the chado db is rather slow) which are then loaded by a curator. They are saving
 > game, and then I think Chris Mungall is xslting game to chado xml which is then
 > saved with xort - or he is somehow writing game in another way - not actually
 > sure. The Berkeley group drove the need for 1 level annotations (in jdbc, game, &
 > apollo datamodel).
 > 
 > Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they use it there. I 
 > believe they are intending to use the write adapter but don't yet do so (Jonathan?).
 > 
 > I should mention that reading jdbc straight from chado tends to be slow, as I 
 > find that chado is a slow database, at least for Berkeley. It really depends on 
 > the db vendor and the amount of data. TIGR's reading is actually really zippy.
 > The workaround for slow chados is to dump game files that read in pretty fast.
 > 
 > In all fairness, you should probably email with FlyBase (& Chris Mungall) and 
 > get the pros of using chado xml & xort, which they can give a far better answer 
 > on than I.
 > 
 > Hope this helps,
 > Mark