[DAS2] Mark Gibson on Apollo writeback to Chado
mark gibson
mgibson at bdgp.lbl.gov
Mon Apr 3 16:29:55 UTC 2006
I've attached a PowerPoint presentation that is probably easier to
glance at than reading through this whole email. The first half of it
is about Apollo transactions.
Mark
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmod-sri-13.ppt
Type: application/vnd.ms-powerpoint
Size: 599552 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/das2/attachments/20060403/bcb3bd34/attachment-0001.ppt>
-------------- next part --------------
On Mar 27, 2006, at 2:42 PM, Nomi Harris wrote:
> Mark Gibson said that he plans to attend next Monday's DAS/2
> teleconference. He also gave me permission to forward this message,
> which he wrote recently in response to a group that is adapting Apollo
> and wondered what he thought about direct-to-Chado writeback vs. the
> use of Chado XML as an intermediate storage format. FlyBase Harvard
> prefers the latter approach because (we gather) they worry about
> possibly corrupting the database by having clients write directly to
> it. If anyone from Harvard is reading this and feels that Mark has
> misrepresented their approach, please set us straight!
>
> Nomi
>
> On 10 March 2006, Mark Gibson wrote:
>> I'm rather biased as I wrote the Chado JDBC adapter [for Apollo], but
>> let me put forth my view of Chado JDBC vs. Chado XML.
>>
>> The Chado JDBC adapter is transactional; the Chado XML adapter is not.
>> What this means is that JDBC only makes changes in the database that
>> reflect what was actually changed in the Apollo session, like updating
>> a row in a table, whereas with Chado XML you just get the whole dump.
>> So if a synonym has been added, JDBC will add a row to the synonym
>> table. With XML you get a dump of the whole region you were editing
>> (probably a gene), no matter how small the edit.
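>>
>> Just to make "transactional" concrete, here is a rough Java/JDBC
>> sketch of the shape of a synonym edit. The table and column names are
>> approximate stock Chado, the RETURNING clause assumes PostgreSQL, and
>> nothing here is lifted from the adapter itself -- it is only meant to
>> show that one small edit becomes one small transaction:
>>
>>     import java.sql.*;
>>
>>     public class SynonymEdit {
>>         /** Add one synonym to an existing feature as a single transaction. */
>>         public static void addSynonym(Connection conn, int featureId,
>>                 int typeId, int pubId, String name) throws SQLException {
>>             conn.setAutoCommit(false);
>>             try {
>>                 // Insert the synonym row; RETURNING assumes PostgreSQL,
>>                 // which is what Chado usually runs on.
>>                 PreparedStatement ins = conn.prepareStatement(
>>                     "INSERT INTO synonym (name, type_id, synonym_sgml) "
>>                     + "VALUES (?, ?, ?) RETURNING synonym_id");
>>                 ins.setString(1, name);
>>                 ins.setInt(2, typeId);
>>                 ins.setString(3, name);
>>                 ResultSet rs = ins.executeQuery();
>>                 rs.next();
>>                 int synonymId = rs.getInt(1);
>>                 rs.close();
>>                 ins.close();
>>
>>                 // Link it to the feature that was edited in Apollo.
>>                 PreparedStatement link = conn.prepareStatement(
>>                     "INSERT INTO feature_synonym (synonym_id, feature_id, pub_id) "
>>                     + "VALUES (?, ?, ?)");
>>                 link.setInt(1, synonymId);
>>                 link.setInt(2, featureId);
>>                 link.setInt(3, pubId);
>>                 link.executeUpdate();
>>                 link.close();
>>
>>                 conn.commit();    // the edit lands completely or not at all
>>             } catch (SQLException e) {
>>                 conn.rollback();  // a failed edit leaves the database untouched
>>                 throw e;
>>             }
>>         }
>>     }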
>>
>> What I believe Harvard/FlyBase then does (with Chado XML) is wipe out
>> the gene from the database and reinsert it from the Chado XML. The
>> problem with this approach is that if you have data in the database
>> that isn't associated with Apollo (for FlyBase this would be phenotype
>> data), that will get wiped out as well, and there has to be some way
>> of reinstating the non-Apollo data. If you don't have non-Apollo data
>> and don't intend to have it in the future, this isn't a huge issue, I
>> suppose. I think Harvard is integrating non-Apollo data into their
>> Chado database.
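>>
>> For contrast, the wipe-and-reload path boils down to something like
>> the sketch below (the uniquename is made up, and the reload half is
>> then XORT's job). The point is the cascade: in the Chado builds I've
>> seen, most tables that reference feature_id are declared ON DELETE
>> CASCADE, so the delete also takes out rows Apollo never saw.
>>
>>     import java.sql.*;
>>
>>     public class WipeAndReload {
>>         /** Sketch of the "wipe out the gene" half of the Chado XML path. */
>>         public static void wipeGene(Connection conn, String geneUniquename)
>>                 throws SQLException {
>>             PreparedStatement del = conn.prepareStatement(
>>                 "DELETE FROM feature WHERE uniquename = ?");
>>             del.setString(1, geneUniquename);   // e.g. "FBgn0000001" (made up)
>>             del.executeUpdate();
>>             del.close();
>>             // Dependent rows (featureprop, feature_synonym, feature_cvterm,
>>             // ...) cascade away with the feature -- including the non-Apollo
>>             // data described above, which then has to be reinstated somehow.
>>         }
>>     }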
>>
>> I think what they are actually going to do is figure out all of the
>> transactions by comparing the Chado XML with the Chado database, which
>> is what Apollo already does, but I'm not sure, as I'm not so in touch
>> with them these days (I'm not working on Apollo at the moment; I'm
>> waiting for a new grant to kick in).
>>
>> Since the paradigm with Chado XML is wipe out & reload, Apollo has to
>> make sure it preserves every bit of the Chado XML that came in.
>> There's a bunch of stuff in Chado/Chado XML that the Apollo data model
>> is unconcerned with, and has no need to be concerned with, as it's
>> stuff that Apollo doesn't visualize. In other words, Apollo's data
>> model exists solely for Apollo's task of visualizing data, not for
>> round-tripping what we call non-Apollo data. In writing the Chado XML
>> adapter for FlyBase, Nomi Harris had a heck of a time with these
>> issues, and she can elaborate on this, I suppose.
>>
>> I'm personally not fond of Chado XML because it's basically a
>> relational database dump, so it's extremely verbose. It redundantly
>> carries information for lots of joins to data in other tables: a
>> cvterm entry can take 10 or 20 lines of Chado XML, and a given cvterm
>> may be used a zillion times in a given Chado XML file (as every
>> feature has a cvterm). So these files can get rather large.
>>
>> The solution for this verbose output is to use what I call macros in
>> Chado XML. Macros are supported by XORT. They take the 15-line cvterm
>> entry and reduce it to a line or two, making the file size much more
>> reasonable. The Apollo Chado XML adapter does not support macros, so
>> you have to use unmacro'd Chado XML for Apollo purposes. Nomi Harris
>> had a hard enough time getting the Chado XML adapter working for
>> FlyBase (and did a great job with a harrowing task) that she did not
>> have time to take on the macro issue. If you wanted macros (and
>> smaller file sizes) you would have to add this functionality to the
>> Chado XML adapter (are there Java programmers in your group?).
>>
>> One of the arguments against the JDBC adapter is that it's dangerous
>> because it goes straight into the database, so if there are any bugs
>> in the data adapter the database could get corrupted; some groups find
>> this a bit precarious. This is a valid argument. I think there are two
>> solutions here. One is to thoroughly test the adapter against a test
>> database until you are confident that the bugs are hammered out.
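>>
>> The kind of testing I mean is a round trip against a scratch copy of
>> Chado, checking that an edit touched only what it should have. A crude
>> sketch follows; runEditSession() is just a stand-in for however you
>> drive the adapter in your setup, and the connection details are made
>> up:
>>
>>     import java.sql.*;
>>
>>     public class AdapterSmokeTest {
>>         static long count(Connection c, String table) throws SQLException {
>>             Statement st = c.createStatement();
>>             ResultSet rs = st.executeQuery("SELECT count(*) FROM " + table);
>>             rs.next();
>>             long n = rs.getLong(1);
>>             rs.close();
>>             st.close();
>>             return n;
>>         }
>>
>>         public static void main(String[] args) throws Exception {
>>             // A throwaway database restored from the same dump every run.
>>             Class.forName("org.postgresql.Driver");
>>             Connection c = DriverManager.getConnection(
>>                 "jdbc:postgresql://localhost/chado_test", "tester", "tester");
>>
>>             long featuresBefore = count(c, "feature");
>>             long synonymsBefore = count(c, "synonym");
>>
>>             runEditSession(c);   // stand-in: performs one "add synonym" edit
>>
>>             // The edit should add exactly one synonym row and no features.
>>             long featuresAfter = count(c, "feature");
>>             long synonymsAfter = count(c, "synonym");
>>             if (featuresAfter != featuresBefore
>>                     || synonymsAfter != synonymsBefore + 1) {
>>                 throw new AssertionError("adapter touched more than it should");
>>             }
>>             c.close();
>>         }
>>
>>         static void runEditSession(Connection c) { /* stand-in */ }
>>     }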
>>
>> Another solution is to not go straight from Apollo to the database.
>> You can use an interim format and then use Apollo itself to get that
>> interim format into the database. Of course, one choice of interim
>> format is Chado XML, and then you are back at the Chado XML solution.
>> The other choice of file format is GAME XML. You can then use Apollo
>> to load GAME into the Chado database, and this can be done at the
>> command line (with batching), so you don't have to bring up the GUI to
>> do it. Chado XML can also be loaded into Chado via Apollo (of course
>> XORT does this as well, but not with transactions).
>>
>> So then the question is: if I'm not going to go straight into the
>> database, why would I choose GAME over Chado XML? Or, if I'm using
>> Chado XML, should I use Apollo or XORT to load into Chado? I think if
>> you are using Chado XML it makes sense to use XORT, as it is the tried
>> and true technology for Chado XML. The advantage of going through
>> Apollo is that it also uses the transactions from Apollo (there's a
>> transaction XML file) and thus writes the edits back in a
>> transactional way, as mentioned above, rather than in a wipe out &
>> reload fashion.
>>
>> GAME is also a tried and true technology that has been used with
>> Apollo in production at FlyBase (before Chado came along) for many
>> years now. One criticism of it has been that its DTD/XSD/schema has
>> been a moving target and was never well described. That is not as true
>> anymore: Nomi Harris has made an XSD for it as well as a RELAX NG
>> schema. But I must confess that I recently added the ability to have
>> one-level annotations in GAME (previously one-level annotations had to
>> be hacked in as three levels). GAME is also a lot less verbose than
>> unmacro'd Chado XML, as it more or less fits the Apollo data model.
>> One advantage of Chado XML over GAME XML is that it is more flexible
>> in taking on features of arbitrary depth.
>>
>> The Chado XML adapter was developed for FlyBase and, as far as I know,
>> has not been taken on by any other groups yet. Nomi can elaborate on
>> this, but I think what this might mean is that there are places where
>> things are FlyBase-specific. If you went with Chado XML, the adapter
>> would have to be generalized. It's a good exercise for the adapter to
>> go through, but it will take a bit of work. Nomi can probably comment
>> on how hard generalizing might be. I could be wrong about this, but I
>> think the current status of the Chado XML adapter is that Harvard has
>> done a bunch of testing on it but they haven't put it into production
>> yet.
>>
>> The JDBC adapter is being used by several groups, so it has been
>> forced to be generalized. One thing I have found is that Chado
>> databases vary all too much from MOD to MOD (ontologies change). There
>> is a configuration file for the JDBC adapter that has settings for the
>> differences that I encountered. I initially wrote it for Cold Spring
>> Harbor's rice database that will be used in classrooms. It's working
>> for rice in theory, but they haven't actually used it much in the
>> classroom yet. For rice the model is to save to GAME and use the
>> Apollo command line to save GAME & transactions back to Chado.
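>>
>> To give a feel for what that config has to absorb (this is not the
>> adapter's actual file format, just the shape of the problem): the same
>> Apollo-level concept can be spelled differently in different MODs'
>> Chado instances, so names get looked up per database rather than
>> hard-coded. A hypothetical sketch:
>>
>>     import java.io.*;
>>     import java.util.Properties;
>>
>>     public class ChadoNameMap {
>>         private final Properties props = new Properties();
>>
>>         public ChadoNameMap(File configFile) throws IOException {
>>             FileInputStream in = new FileInputStream(configFile);
>>             // e.g. one MOD might have gene.cvterm=gene, another
>>             // gene.cvterm=SO:0000704 -- the adapter shouldn't care.
>>             props.load(in);
>>             in.close();
>>         }
>>
>>         /** The cvterm name this particular Chado uses for an Apollo type. */
>>         public String cvtermFor(String apolloType) {
>>             return props.getProperty(apolloType + ".cvterm", apolloType);
>>         }
>>     }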
>>
>> Cyril Pommier, at INRA - URGI - Bioinformatique, has taken on the JDBC
>> adapter for his group. I have cc'd him on this email, as I think he
>> will have a lot to say about the JDBC adapter. Cyril has uncovered
>> many bugs and has fixed a lot of them (thank you, Cyril), as he's a
>> very savvy Java programmer. He has also forced the adapter to
>> generalize and brought about the evolution of the config file to adapt
>> to Chado differences. But as Cyril can attest (Cyril, feel free to
>> elaborate), it has been a lot of work to get JDBC working for him.
>> There were a lot of bugs to fix that we both went after. Hopefully it
>> is now a bit more stable and the next db/MOD won't have as many
>> problems. I think Cyril is still at the test phase and hasn't gone
>> into production (Cyril?).
>>
>> Berkeley is using the JDBC adapter for an in-house project. They are
>> using the JDBC reader to load up GAME files (as the straight JDBC
>> reader is slow, since the Chado db is rather slow), which are then
>> loaded by a curator. They are saving GAME, and then I think Chris
>> Mungall is XSLT-ing GAME to Chado XML, which is then saved with XORT;
>> or he is writing GAME out some other way, I'm not actually sure. The
>> Berkeley group drove the need for one-level annotations (in JDBC,
>> GAME, & the Apollo data model).
>>
>> Jonathan Crabtree at TIGR wrote the JDBC read adapter, and they use it
>> there. I believe they intend to use the write adapter but don't yet do
>> so (Jonathan?).
>>
>> I should mention that reading via JDBC straight from Chado tends to be
>> slow, as I find that Chado is a slow database, at least for Berkeley.
>> It really depends on the db vendor and the amount of data; TIGR's
>> reading is actually really zippy. The workaround for slow Chados is to
>> dump GAME files, which read in pretty fast.
>>
>> In all fairness, you should probably email FlyBase (& Chris Mungall)
>> and get the pros of using Chado XML & XORT, which they can give a far
>> better answer on than I can.
>>
>> Hope this helps,
>> Mark
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2