From aloraine at gmail.com Fri Sep 1 14:15:51 2006 From: aloraine at gmail.com (Ann Loraine) Date: Fri, 1 Sep 2006 13:15:51 -0500 Subject: [DAS2] dynamic das2 features In-Reply-To: <5c24dcc30608311911q38ac2520k24c166bb33c29e75@mail.gmail.com> References: <5c24dcc30608311911q38ac2520k24c166bb33c29e75@mail.gmail.com> Message-ID: <83722dde0609011115n20685623rcb3d971addfa4f67@mail.gmail.com> Hi Allen, This is great. Can I make a request? I'd vote for blastx, blastn, and blat to be top priority. (tblastx - not so much...) Best, Ann On 8/31/06, Allen Day wrote: > I have a prototype that will generate primer3 primers. Temporarily up here: > > http://jugular.ctrl.ucla.edu:3000/feature?type=primer3;seq=ATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGGCGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGCATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTGTAGAAGTTTCTTTCTTATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTATAATATTAATATACATTTATATAATCTACGGTATTTATATCATCAAAAAAAAGTAGTTTTTTTATTTTATTTTGTTCGTTAATTTTCAATTTCTATGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGATAGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCAG! 
> CAACTTCGACTGGGTAGGTTTCAGTTGGGTGGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACTCTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAGTTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAATCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGGTCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTTGTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATATCAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTGAACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTC! > TTAAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTT > TTCCAAACAATCTTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACTGCCAAATTTTTCTTGCTCATTTATAATGATTGATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGATGTTAAAAAATTATTATTTTCTTCATAAAGAAG > > I have all my ducks in a row to implement primer3 (done), blat, blastn, > tblastn, tblastx, genscan, and rePCR. I will do some reworking of the GET > params to allow specification of parameters (e.g. required primer size > range) using the property filter syntax. This server will require both a > type= and overlaps= filter for all requests so that it can do a backend GET > on the sequence from the main das server. > > Gregg, please take a look and let me know if this is roughly suitable for > Genoviz. 
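[Editorial sketch of the request pattern Allen describes: the server takes a type= filter naming the analysis and an overlaps= filter naming the region, then does its own backend GET for the sequence. The base URL and the exact overlaps region syntax below are illustrative assumptions, not taken from the prototype; only the parameter names come from the message.]

```python
# Client-side sketch of building a request for a dynamic-analysis
# DAS/2 feature server like the primer3 prototype above.  The host and
# the "segment/start:end" region syntax are assumptions for
# illustration; only the type= and overlaps= parameter names are from
# the message.
from urllib.parse import urlencode

def dynamic_feature_url(base, analysis_type, segment, start, end):
    # The server fetches the underlying sequence itself (a backend GET
    # against the reference server), so the client only names the
    # analysis type and the region of interest.
    params = urlencode(
        {"type": analysis_type,
         "overlaps": "%s/%d:%d" % (segment, start, end)},
        safe="/:")
    return "%s/feature?%s" % (base, params)

url = dynamic_feature_url("http://example.org/das2", "primer3",
                          "chr4", 22335, 23205)
```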
> > -Allen > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From dalke at dalkescientific.com Mon Sep 11 08:44:07 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 14:44:07 +0200 Subject: [DAS2] stylesheets today Message-ID: <37174817d074b52ff679706f10ba2cbd@dalkescientific.com> We're scheduled to talk about stylesheets today, right? Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Sep 11 09:35:12 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 11 Sep 2006 06:35:12 -0700 Subject: [DAS2] stylesheets today Message-ID: We're definitely having a DAS/2 teleconference at the usual time, 9:30 AM PST. I can't remember if stylesheets were already on the agenda for today, but if that's what's on your mind, sounds like a good topic to start with. Gregg > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Monday, September 11, 2006 5:44 AM > To: DAS/2 > Subject: [DAS2] stylesheets today > > We're scheduled to talk about stylesheets today, right? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Mon Sep 11 10:30:20 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 16:30:20 +0200 Subject: [DAS2] stylesheets today In-Reply-To: References: Message-ID: <4e7fc9cae0d8b1a83982023f600ea199@dalkescientific.com> Gregg: > I can't remember if stylesheets were already on the agenda for > today, but if that's what's on your mind, sounds like a good topic to > start with.
I actually don't want to talk about it. I don't know enough about the topic. I have ideas and comments. It was more I remembered at the wrap-up for the sprint where someone (you?) proposed that the next meeting be on stylesheets. If so, I want to be prepared to talk about it. Ahh, I misremembered. From Steve's notes gh: Something we need todo: come back to stylesheet issue. ad: we should have impl in place before making spec work. [A] Discuss stylesheets when we have an impl in place In other action items, has anyone developed some use guidelines for generating DAS2 data? Eg, recommended ways to do alignments, BLAST hits, complex parent/part relationships? Also, guidelines in how to convert GFF3 into DAS2 feature xml? I don't know where all of the fields are supposed to go in the free-form area. Examples from flybase ID=FBti0020396;Name=Rt1c{}1472;Dbxref=FlyBase+Annotation+IDs: TE20396,FlyBase:FBt i0020396;cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396; synonym_2nd=Rt1c {}1472 ID=FBgn0004859;Name=ci;Dbxref=FlyBase+Annotation+IDs:CG2125,FlyBase: FBan0002125,FlyBase:FBgn0004859;cyto_range=102A1-102A3; dbxref_2nd=FlyBase:FBgn0000314,FlyBase:FBgn0000315,FlyBase: FBgn0010154,FlyBase:FBgn0010155,FlyBase:FBgn0017411,FlyBase: FBgn0019831;gbunit=AE003845;synonym_2nd=Ce,Ci,CI,ci155,ciD,ci- D,CiD,CID,ciD,CiD,Cubitus+interruptus,cubitus- interruptus-Dominant,l(4)102ABc,l(4)13,l(4)17 What should I do with those? Dump them into the key/value property field? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Sep 11 13:52:35 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 19:52:35 +0200 Subject: [DAS2] best practices / DAS2 format examples Message-ID: das2-teleconf-2006-03-16.txt > [A] Lincoln will provide use cases/examples of these features > scenarios: > - three or greater hierarchy features > - multiple parents > - alignments I really would like some real-world examples of these. 
I don't know enough to make decent examples for the documentation and I think it would be very useful so others can see how to model existing data in DAS2 XML. I looked at GFF3 examples to find existing properties which must be storable in a DAS2 feature document. Here are two example lines:

ID=FBti0020396;Name=Rt1c{}1472;Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396;synonym_2nd=Rt1c{}1472

ID=FBgn0004859;Name=ci;Dbxref=FlyBase+Annotation+IDs:CG2125,FlyBase:FBan0002125,FlyBase:FBgn0004859;cyto_range=102A1-102A3;dbxref_2nd=FlyBase:FBgn0000314,FlyBase:FBgn0000315,FlyBase:FBgn0010154,FlyBase:FBgn0010155,FlyBase:FBgn0017411,FlyBase:FBgn0019831;gbunit=AE003845;synonym_2nd=Ce,Ci,CI,ci155,ciD,ci-D,CiD,CID,ciD,CiD,Cubitus+interruptus,cubitus-interruptus-Dominant,l(4)102ABc,l(4)13,l(4)17

I do not know this domain well enough. I do not know how "cyto_range" should be stored in DAS2 XML, nor gbunit. I don't know the difference between dbxref and dbxref_2nd. Nor can I find documentation on these properties. Looking around I came across the names cyto_range, Dbxref, dbxref_2nd, Name, Parent, species, gbunit, and Alias, but I don't know how those are best modeled in GFF3. For example, is species redundant given that we know that from the reference sequence? I want someone to be able to go to DAS and easily figure out how to convert existing data models into DAS's model. Here is an example of a real-world GFF3 complex annotation, which we're calling a "feature group" in DAS2. The top-level is a gene. It has one child which is an mRNA. The mRNA has children of CDS, exon, protein, and intron. I've added newlines for readability.

4 . gene 22335 23205 . - . ID=FBgn0052013; Name=CG32013;Dbxref=FlyBase+Annotation+IDs:CG32013,FlyBase:FBan0032013,FlyBase:FBgn0052013;cyto_range=101F1-101F1;gbunit=AE003845

4 . mRNA 22335 23205 . - .
ID=FBtr0089183; Name=CG32013-RA;Parent=FBgn0052013;Dbxref=FlyBase+Annotation+IDs: CG32013-RA, FlyBase:FBtr0089183;cyto_range=101F1-101F1 4 . CDS 22335 22528 . - . Parent=FBtr0089183; Name=CG32013-cds;Dbxref=FlyBase+Annotation+IDs:CG32013-RA 4 . exon 22335 22528 . - . Parent=FBtr0089183 4 . protein 22338 23205 . - . ID=FBpp0088247; Name=CG32013-PA;Parent=FBtr0089183;Dbxref=FlyBase+Annotation+IDs: CG32013-PA, FlyBase:FBpp0088247,GB_protein:AAN06536.1,FlyBase+Annotation+IDs: CG32013-RA 4 . intron 22529 22616 . - . Parent=FBtr0089183; Name=CG32013-in 4 . CDS 22617 23205 . - . Parent=FBtr0089183; Name=CG32013-cds;Dbxref=FlyBase+Annotation+IDs:CG32013-RA 4 . exon 22617 23205 . - . Parent=FBtr0089183 The direct conversion to DAS2 xml the way I've been doing it is first defining a TYPES document like this (the das-private: identifiers are created upon server upload). Note that I'm storing the GFF3 fields in a PROP element so I can easily figure out which DAS2 types correspond to the GFF3 types (unique gff3 types is the pair (type, source) ) Given the types, the features document looks like. Note the change in start position because GFF3 is a "start with 1" numbering system while DAS2 is a "start with 0". Note also that I've used the Name property from GFF3 to populate the title field in DAS2. While I have ideas on what to do with the rest (eg, populate the dbxref DAS2 element), I don't know what to do with all of the fields and would like advice. 
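[Editorial sketch of the mechanical half of the question above: splitting a GFF3 column-9 attribute string into key/value pairs that could be dumped into DAS2 PROP elements. The '+'-as-space decoding matches these FlyBase dumps; it is an assumption, not part of the GFF3 spec, which percent-encodes special characters.]

```python
# Illustrative GFF3 column-9 parser: unrecognized keys (cyto_range,
# gbunit, dbxref_2nd, ...) come out as plain key/value lists that a
# converter could put into DAS2 PROP elements.  Not an official
# mapping, just a sketch.
from urllib.parse import unquote_plus

def parse_gff3_attributes(column9):
    """Return {key: [values]} from a GFF3 attribute string."""
    attrs = {}
    for pair in column9.rstrip(";").split(";"):
        key, _, raw = pair.partition("=")
        # Values are comma-separated; '+' is treated as an encoded
        # space, as in the FlyBase dumps quoted above (an assumption).
        attrs[key] = [unquote_plus(v) for v in raw.split(",")]
    return attrs

attrs = parse_gff3_attributes(
    "ID=FBti0020396;Name=Rt1c{}1472;"
    "Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;"
    "gbunit=AE003845")
```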
Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Sep 11 14:11:04 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 11 Sep 2006 11:11:04 -0700 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 11 Sep 2006 Message-ID: Notes from the weekly DAS/2 teleconference, 11 Sep 2006 $Id: das2-teleconf-2006-09-11.txt,v 1.1 2006/09/11 18:10:11 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed Erwin, Gregg Helt Dalke Scientific: Andrew Dalke UCLA: Allen Day, Brian O'Connor (sc, aday, bo calling in from Seattle at MGED9 jamboree) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Agenda: -------- * grant update * status reports Topic: Grant update ------------------- gh: p good says funding outlook for getting funding for sep '06 to may '07. $250K. not completely official, but more so. no grant to be submitted in october. still major issues to resolve: rewriting, pi decision. size was a concern. decision about what to drop (6 sections). ad: new project starting dec/jan for 1 year. Can't work on das/2 past end of this year. product for chemical informatics. gh: can you put more time before then. full time? 2-3 mos. ad: need to look at my schedule. 
will get back to you [A] andrew talk with gregg re: increasing his das/2 time committment Topic: Status reports (and general discussion) --------------------- gh: client to do curation in igb, write back to test server. impl thing I drew on board back at last code sprint. editing curations. making sure undo/redo capabilities in igb works. will translate into what writeback needs are. turned off in igb by default. prefs -> turn on exptl curations. can edit things, but can't connect to server. must modify code, but don't ee: gff3 parser. trouble: gff3 files in wild don't follow spec. refseq website, repository, all three fails in different ways. ucsc mailing list helped, but it wasn't their files. aday: failed on validator? ee: yes gh: the only request we had ee: not trying to write a full gff3 parser. just need gene, exon, cds, mRNA. ignore other lines and it seems compliant. but a second problem: very flexible exon parent can be mRNA, gene, or nothing. jibes with igb data model. also worked on: released new igb version. graph support handing, parsing affy files. ad: flybase files are gff3 compliant, parent/part relationship requires full file parsing. 800mb file. had to insert marker mid-file to inform parser. ee: space reduction during parsing. they have a recommended canonical rep of gene, but not required to do it. haven't found an example that follows the rec. gh: the wormbase stuff should be canonical, since lincoln did gff3 and wormbase. ad: more people writing gff3 than reading ee: ucsc discussion: grant to support more mod orgs, to include gff3 parser support. gh: that's the kind of grant we'd like to fold das grant work into if we don't do a separate das/2 grant [A] gregg look into ucsc grant, possibly fold das stuff into it ad: gff3 -> das2xml converter. some things in gff3 i don't know how to handle. key-value. Need to figure out why things aren't passing validator. 
[A] andrew will write up questions, post to list, discuss there and/or with lincoln at the next das/2 teleconf. ad: modeling alignments. need a recommended way to model alignments. gh: when to use locations vs subfeatures. aday: why care about gff3? ee: igb ad: people need to convert data for das2xml. aday: need a model mapping doc. we can hash it out next week with lincoln. ad: working with berkeley xml database. liking it alot. gh: also cool: SOLR - java thing built on top of lucene and xml db stuff. cool thing is that it layers on top of that a rest-ful approach to retrieving and writing data to a db. thru http urls . queries are gets all writes/updates/delete are posts. ad: xQuery aday: generalization of xpath ad: xslt is another generalization. sc: there was a poster at MGED9 meeting from stanford group using Berkeley XML db to map between 'flavors' of MAGE-ML, since organizations use different ways to represent the same thing in MAGE-ML. Represented the transformation using pairs of xQueries, one targetting for format A, other for format B. All the smarts about the format was confined to the xqueries. nice. ad: I want to get feedback regarding modeling for das2, recommendation to store certain data (alignments, gff3). gh: gff3 - too open ended. lots of stuff can be in there ad: given flybase, what is the recommended way to post gff3 data. gh: i can answer your alignments issue, can't do gff3. [A] andrew will contact folks as needed regarding gff3/flybase modeling issues: suzi, chris mungall, lincoln, scott cain Other status: ------------- sc: no major progress given Netaffx update work, MGED travel. Plan is to update das/2 server code on affy server, load it with some exon array design data using gregg's new parser which is more memory efficient, and test it out. Then we'll need to migrate it off the das/1 server where the exon data hogs lots of memory, and then migrate Netaffx links to use das/2. gh: new box end of october with das grant money. 
have run das2 server on 64bit. on 32bit have gotten 8g in single java process. riva. should be able to get 16g in one process. or have 2x8g bo: allen updated assay portion, bringing igb ibjects upto date. mark carlson is updating hyrax client to retrieve microarry data back. he's taking das/2 client makeing it embedable. eg., into the MeV tool from John Quackenbush at Harvard (java). should be embedable in igb to browse celsius to d/l data. plan to have webstart for it. aday: updating assay portion of server. mage-ml to be inline with changes. adding/modifying element attribs, lowercase 'uri'. data loaders to get ncbi data into server for micoarray expts. client lib in R for talking to das server. requires parsing xml. extremely slow, uses lots of memory, so eg., viz bed files in R, genomic location. good plotting support in R. look at distribution. regarding writeback server: on hold until you report any problems. basic stuff is working. let me know. gh: read part: caching improvements? aday: no more work on that since jamboree. public server doesn't have these improvements. plan to rewrite controller and view part. junk on this end. want to integrate block mechanism into that as well. not sure when it will happen. time estimate: maybe 1-1.5 months with bo and i working half time. bo: thie rewrite will help a lot. aday: lots of little things changed, 'segment' etc. server domain source, capabilities, formats. huge mess. need more looking before i can get an accurate time estimate for patching vs. rewriting. think the rewrite wouldn't be that expensive. gh: machine? aday: dual core opteron, maybe 16g ram? load is increasing, may move off to a dedicated server. webserver is the issue, not db. Next teleconf: -------------- In two weeks. 25 Sep 2006 Special dedication: ------------------- To those who tragically lost their lives on this day five years ago... 
From Gregg_Helt at affymetrix.com Mon Sep 11 16:07:05 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 11 Sep 2006 13:07:05 -0700 Subject: [DAS2] best practices / DAS2 format examples Message-ID: > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Monday, September 11, 2006 10:53 AM > To: DAS/2 > Subject: [DAS2] best practices / DAS2 format examples > > das2-teleconf-2006-03-16.txt > > [A] Lincoln will provide use cases/examples of these features > > scenarios: > > - three or greater hierarchy features > > - multiple parents > > - alignments > > I really would like some real-world examples of these. I don't know > enough to make decent examples for the documentation and I think it > would be very useful so others can see how to model existing data > in DAS2 XML. I found a previous post from Lincoln with attached alignment examples: > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Lincoln Stein > Sent: Monday, June 05, 2006 7:32 AM > To: Andrew Dalke > Cc: DAS/2 > Subject: [DAS2] Example alignments > > Hi Andrew, > > I'm truly sorry at how long it has taken me to get these examples to you. > I hope that the example alignments in the enclosure makes sense to you. > > Unfortunately I found that I had to add a new "target" attribute to > in order to make the cigar string semantics unambiguous. Otherwise you > wouldn't be able to tell how to interpret the gaps. > > Lincoln > CASE #1. A SIMPLE PAIRWISE ALIGNMENT. A simple alignment is one in which the alignment is represented as a single feature with no subfeatures. This is the preferred representation to be used when the entire alignment shares the same set of properties. This is an alignment between Chr3 (the reference) and EST23 (the target). Both aligned sequences are in the forward (+) direction. 
We represent this as a single alignment Chr4 100 CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA 147 |||||||X||| ||||| ||||||| ||||X||| |||||||| EST23 1 CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA 41 This has a CIGAR gap string of M11 I1 M5 D1 M7 D7 M8 I1 M8: M11 match 11 bp I1 insert 1 gap into the reference sequence M5 match 5 bp D1 insert 1 gap into the target sequence M7 match 7 bp D7 insert 7 gaps into the target M8 match 8 bp I1 insert 1 gap into the reference M8 match 8 bp Content-Type: application/x-das-features+xml NOTE: I've had to introduce a new attribute named "target" in order to distinguish the reference sequence from the target sequence. This is necessary for the CIGAR string concepts to work. Perhaps it would be better to have a "role" attribute whose values are one of "ref" and "target?" From dalke at dalkescientific.com Mon Sep 11 16:14:07 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 22:14:07 +0200 Subject: [DAS2] best practices / DAS2 format examples In-Reply-To: References: Message-ID: Gregg: > I found a previous post from Lincoln with attached alignment examples: D'oh! My apologies for having forgotten that. Lincoln: > NOTE: I've had to introduce a new attribute named "target" in > order to distinguish the reference sequence from the target > sequence. This is necessary for the CIGAR string concepts to work. > Perhaps it would be better to have a "role" attribute whose values are > one of "ref" and "target? Anyone have comments on that? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Sep 11 21:44:09 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 12 Sep 2006 03:44:09 +0200 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> I've been working on a writeback server. 
It will verify that the feature groups have no cycles, that if X is a parent to Y then Y is a part of X, and that all groups have a single root. I'm having a hard time with that, and it's harder than I expected.

GFF3 had only one direction of relationship. As such it's impossible to assemble a feature group until the end of the file or the marker that no lookahead is needed. We changed that in DAS2. The feature xml is bidirectional so in theory it's possible to know when the feature group is complete. But it's tricky. It's tricky enough that I want to change things slightly so that people don't need to handle the trickiness.

The trickiness comes when parsing the list of features into feature groups. For example, consider

      [F3]
     /    \
  [F4]    [F2]
    |       |
  [F1]    [F5]

where the features are in the order F1, F2, ... F5. After F2 the system looks like there are two feature groups.

  [F3?]
    |
  [F4?]   [F2]
    |       |
  [F1]    [F5?]

Only after F3 can those be merged together. This requires some non-trivial bookkeeping, unless I've forgotten something simple from undergraduate data structures. Of course it's simple if you know that a feature is the last feature in a feature group either through reaching EOF or a special marker. But then what's the point of having bidirectional links if the result is no better than GFF3's only-list-parent solution. If there is a simple algorithm, please let me know.

=== Solution #1 ===

Another solution is to require that complex feature groups (groups with more than one feature) must also have a link to the root element of the feature group. I brought this up before but agreed with others that there wasn't a need for it. Now I think there is. Here's an example.
By using a 'root' attribute, detecting the end of a feature group is almost trivial:

    a FeatureGroup contains:
      - list of seen urls              # duplicates are not allowed
      - set of urls which must be seen # duplicates are ignored

    let feature_groups := mapping {root uri -> FeatureGroup}

    for feature in features:
        if feature does not have a @root attribute:
            make a new FeatureGroup
            add the feature to the FeatureGroup as being seen
            let feature_groups[feature's @uri attribute] := the new FeatureGroup
        else:
            if the feature's @root attribute does not exist in feature_groups:
                # first time this feature group was seen
                create a new FeatureGroup
                let feature_groups[feature's @root attribute] := the new FeatureGroup
            get feature_groups[feature's @root attribute]
            add this feature to the FeatureGroup as a seen url
            for each uri in (feature's @uri attribute, the parent uris, the part uris):
                add the uri to the FeatureGroup's "must be seen" set
            if count(seen urls) == count(must be seen urls):
                the feature group is complete / assemble the links

Assembly of a feature group occurs as soon as all the features are available, rather than waiting for the end. This makes life much simpler for the writeback, and I assume also for the client code. Assuming the client code doesn't just wait until the end of the input before it does anything.

Gregg? Do you wait until the end of the XML to assemble hierarchical features? If so, do you need parent/part or will parent suffice? Or do you do all the bookkeeping as you get the data? How complex is the code?

There are other solutions:

=== Solution #2 ===

- require that the features are ordered so that parents come before their parts

I think this is hard because relational databases aren't naturally ordered. The normal trick is to put an extra field and "sort by", but then the server has to maintain the correct ordering. It's fragile.

=== Solution #3 ===

- put all elements for a given feature group explicitly inside of a FEATURE_GROUP element.
Eg, (in this case simple features not part of a complex parent/part relationship need not be in a FEATURE_GROUP.) This is the easiest solution. I like it because it's the easiest to figure out. Even the algorithm above is hard by comparison. If I had my choice we would do this instead of determining the feature group by analysis of the parent/part linkages. Note that with this change there's no longer need for the PART element. We would only need PARENT. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Sep 14 12:08:52 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 14 Sep 2006 18:08:52 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> References: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> Message-ID: <43a8414f574e1bd8205bfb16b6adbe93@dalkescientific.com> Me [Andrew]: > I've been working on a writeback server. It will verify that the > feature groups have no cycles, that if X is a parent to Y then Y > is a part of X, and that all groups have a single root. > > I'm having a hard time with that, and it's harder than I expected. I listed three alternatives and am hoping for feedback by email rather than waiting another 10 days for the next phone conference. They are:

- FEATURE elements add a "root" attribute pointing to the top-level feature for the feature group
- FEATURE elements must be listed in top-down order
- Features in the same feature group are inside a new FEATURE_GROUP element

Or, someone can show me a simple O(n) algorithm for building up the feature group such that complete groups can be processed before reaching the end of the feature data set. Any comments?
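[Editorial sketch: the "root"-attribute bookkeeping proposed in the earlier message, rendered in Python. The (uri, root, parents, parts) tuple shape is made up for the sketch; a real parser would pull these from the FEATURE XML attributes.]

```python
# Rough Python rendering of the Solution #1 algorithm.  Each feature is
# modeled as (uri, root, parent_uris, part_uris); root is None when the
# feature has no @root attribute, i.e. it is its own group's root or a
# simple standalone feature.  Illustrative, not a spec implementation.
def assemble_groups(features):
    """Yield each feature group (a set of uris) as soon as it is complete."""
    groups = {}  # group key -> (uris seen so far, uris that must be seen)
    for uri, root, parents, parts in features:
        key = uri if root is None else root
        seen, must_see = groups.setdefault(key, (set(), set()))
        seen.add(uri)
        # Every uri this feature mentions must eventually be seen.
        must_see.update([uri], parents, parts)
        if seen == must_see:
            # All referenced features accounted for: assemble the links.
            del groups[key]
            yield seen

features = [
    ("F1", "F3", ["F4"], []),
    ("F2", "F3", ["F3"], ["F5"]),
    ("F3", None, [], ["F4", "F2"]),
    ("F4", "F3", ["F3"], ["F1"]),
    ("S1", None, [], []),          # a simple, standalone feature
    ("F5", "F3", ["F2"], []),
]
complete = list(assemble_groups(features))
```

The standalone S1 comes out the moment it is read; the five-feature group is emitted only when F5, the last referenced uri, arrives, with no end-of-stream marker needed.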
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sun Sep 17 04:20:49 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 17 Sep 2006 10:20:49 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <6dce9a0b0609151608m3b06881at79127b95d08cd40c@mail.gmail.com> References: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> <6dce9a0b0609151608m3b06881at79127b95d08cd40c@mail.gmail.com> Message-ID: <41498edc722a0faf292380b221733e55@dalkescientific.com> On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote: > Hi Andrew, > > Grouping them into a set is almost equivalent to the "end of > feature set" marker in GFF3, which is why I favor that solution. If > we do this, should we adopt the same convention for the GET requests > as well? If so, should we get rid of bidirectional references? (I did notice that the GFF3 data sets I found, like wormbase, don't have the "end of feature set" marker. My GFF3 parser has about 10x memory overhead so parsing an 80MB input file thrashed my 1GB laptop. Adding a single marker in the middle, by hand, made it much happier.) If we have a FEATURE_GROUP such that features in that group are all connected to each other and only to each other, then I have no problem getting rid of the child link. It adds no benefits in that case but does cause the verification overhead of checking that both directions are correct. Andrew dalke at dalkescientific.com From Ed_Erwin at affymetrix.com Mon Sep 18 12:59:55 2006 From: Ed_Erwin at affymetrix.com (Erwin, Ed) Date: Mon, 18 Sep 2006 09:59:55 -0700 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: I think the simplest solution for parsing is: while parsing the file, read objects F1,F2,F3,F4,F5 into memory but don't even try to hook up parents and children yet.
After finishing reading the file, and getting rid of all the memory overhead associated with XML parsing, loop through the objects that you've read and link parents to children. Their order no longer matters because they are all in memory, probably in a hashmap linking ID to object. -----Original Message----- From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke Sent: Monday, September 11, 2006 6:44 PM To: DAS/2 Subject: [DAS2] feature group assembly; proposals for simplification I've been working on a writeback server. It will verify that the feature groups have no cycles, that if X is a parent to Y then Y is a part of X, and that all groups have a single root. I'm having a hard time with that, and it's harder than I expected. GFF3 had only one direction of relationship. As such it's impossible to assemble a feature group until the end of the file or the marker that no lookahead is needed. We changed that in DAS2. The feature xml is bidirectional so in theory it's possible to know when the feature group is complete. But it's tricky. It's tricky enough that I want to change things slightly so that people don't need to handle the trickiness. The trickiness comes when parsing the list of features into feature groups. For example, consider

      [F3]
     /    \
  [F4]    [F2]
    |       |
  [F1]    [F5]

where the features are in the order F1, F2, ... F5. After F2 the system looks like there are two feature groups.

  [F3?]
    |
  [F4?]   [F2]
    |       |
  [F1]    [F5?]

Only after F3 can those be merged together. This requires some non-trivial bookkeeping, unless I've forgotten something simple from undergraduate data structures. Of course it's simple if you know that a feature is the last feature in a feature group either through reaching EOF or a special marker. But then what's the point of having bidirectional links if the result is no better than GFF3's only-list-parent solution. If there is a simple algorithm, please let me know.
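[Editorial sketch of the two-pass approach Ed describes: read every object first, then hook children to parents via an ID map. The Feature class is an illustrative stand-in for IGB's in-memory objects, not code from IGB.]

```python
# Minimal read-everything-then-link sketch.  Feature stands in for a
# parsed GFF3 line or DAS2 FEATURE element (illustrative names).
class Feature:
    def __init__(self, id=None, parent_ids=()):
        self.id = id                       # may be None (e.g. bare exons)
        self.parent_ids = list(parent_ids)
        self.children = []                 # filled in on the second pass

def link_features(features):
    """Second pass: attach children to parents; return the parentless roots."""
    by_id = {f.id: f for f in features if f.id is not None}
    roots = []
    for f in features:
        if not f.parent_ids:
            roots.append(f)
        for pid in f.parent_ids:
            by_id[pid].children.append(f)
    return roots

# Input order does not matter once everything is in memory.
exon = Feature(parent_ids=["mRNA1"])
mrna = Feature("mRNA1", parent_ids=["gene1"])
gene = Feature("gene1")
roots = link_features([exon, mrna, gene])
```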
From Ed_Erwin at affymetrix.com Mon Sep 18 12:54:59 2006 From: Ed_Erwin at affymetrix.com (Erwin, Ed) Date: Mon, 18 Sep 2006 09:54:59 -0700 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: Andrew, I'm having trouble understanding where all this memory overhead comes from in your parsing of GFF3 files. I've recently written a GFF3 parser for IGB. I've found that the presence or absence of the "end of feature set" marker "###" has little effect on the amount of memory required. The procedure is quite simple.

For each line in the GFF3 file:
- create an object in memory
- add that object to a list
- if the object has an ID, store the "ID to object" mapping in a hashmap

At the end of file (or each "###" mark):
- loop through the complete list of objects
- for each one claiming to have one or more Parent_IDs, find those parents in the hashmap, add it as a child of those parents, and remove it from the original list (which will then contain only parentless objects)

That is all. At the end you can throw away the hashmap. During processing you have to have one hashmap. But I don't see how that adds a whole lot to the memory overhead. In our model, each of the memory objects representing one feature keeps a list of pointers to its children. While first reading the file, those pointers are left null, then the lists are constructed on the second pass (after the "###" marks). (In IGB, the final destination of the data is some in-memory objects. If your final destination is a database, then you can be writing each line to the database as it is read and then check for consistency of parents and children later. You don't even need the in-memory hashmap then, because you can use a database table.) So basically, I just don't understand what problem you are trying to solve. I don't object to adding FEATURE_GROUP, and I don't much care whether there are bi-directional references.
Bi-directional references do not seem necessary to me, and really just seem like a likely place for the users to make mistakes, but I don't see any reason to change the spec now.

If there are bi-directional references, you can proceed exactly as above. The primary references are references to the parents. But when hooking a feature up to its parent, you can then check that the parent has listed this child as one of its expected children. (You in fact get a bit of a boost because since each parent knows how many children it expects, you can set up the child List objects with the correct size from the beginning.)

Ed

-----Original Message----- From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke Sent: Sunday, September 17, 2006 1:21 AM To: lincoln.stein at gmail.com Cc: DAS/2 Subject: Re: [DAS2] feature group assembly; proposals for simplification

On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote:
> Hi Andrew,
>
> Grouping them into a set is almost equivalent to the "end of feature set" marker in GFF3, which is why I favor that solution. If we do this, should we adopt the same convention for the GET requests as well? If so, should we get rid of bidirectional references?

(I did notice that the GFF3 data sets I found, like wormbase, don't have the "end of feature set" marker. My GFF3 parser has about 10x memory overhead, so parsing an 80MB input file thrashed my 1GB laptop. Adding a single marker in the middle, by hand, made it much happier.)

If we have a <FEATURE_GROUP> such that features in that group are all connected to each other and only to each other, then I have no problem getting rid of the child link. It adds no benefits in that case but does cause the verification overhead of checking that both directions are correct.
Andrew dalke at dalkescientific.com _______________________________________________ DAS2 mailing list DAS2 at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/das2 From lstein at cshl.edu Mon Sep 18 13:23:38 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 18 Sep 2006 17:23:38 +0000 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: References: Message-ID: <6dce9a0b0609181023rc9aebe9sa452c4b4f7cd8d7b@mail.gmail.com> Hi, My GFF3 parser works in a similar manner. As each feature comes in, it is parsed, turned into an object, and sent to a disk-based database. The parent link is kept in an in-memory data structure. At the end of the parse, the parent link data structure is traversed and then the table of parent/child relationships is written out to disk. Lincoln On 9/18/06, Erwin, Ed wrote: > > > Andrew, > > I'm having trouble understanding where all this memory overhead comes > from in your parsing of GFF3 files. I've recently written a GFF3 parser > for IGB. I've found that the presence or absence of the "end of feature > set" marker "###" has little effect on the amount of memory required. > > The procedure is quite simple. > > For each line in the GFF3 file, create an object in memory. > Add that object to a list. > If the object has an ID, store the "ID to object" mapping in a hashmap. > > At the end of file (or each "###" mark) > Loop through the complete list of objects. > For each one claiming to have one or more Parent_ID's, find those > parents in the hashmap, add it as a child of those parents and remove it > from the original list (which will then contain only parentless > objects). > > > That is all. At the end you can throw away the hashmap. > > During processing you have to have one hashmap. But I don't see how > that adds a whole lot to the memory overhead. In our model, each of the > memory objects representing one feature keeps a list of pointers to its > children. 
While first reading the file, those pointers are left null, > then the lists are constructed on the second pass (after the "###" > marks). > > (In IGB, the final destination of the data is some in-memory objects. > If your final destination is a database, then you can be writing each > line to the database as it is read and then check for consistency of > parents and children later. You don't even need the in-memory hashmap > then, because you can use a database table.) > > So basically, I just don't understand what problem you are trying to > solve. I don't object to adding , and I don't much care > whether there are bi-directional references. Bi-directional references > do not seem necessary to me, and really just seems like a likely place > for the users to make mistakes, but I don't see any reason to change the > spec now. > > If there are bi-directional references, you can proceed exactly as > above. The primary references are references to the parents. But when > hooking a feature up to its parent, you can then check that the parent > has listed this child as one of its expected children. (You in fact get > a bit of a boost because since each parent knows how many children it > expects, you can set-up the child List objects with the correct size > from the beginning.) > > Ed > > > -----Original Message----- > From: das2-bounces at lists.open-bio.org > [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke > Sent: Sunday, September 17, 2006 1:21 AM > To: lincoln.stein at gmail.com > Cc: DAS/2 > Subject: Re: [DAS2] feature group assembly; proposals for simplification > > On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote: > > Hi Andrew, > > > > Grouping them into a set is almost equivalent to the > > "end of > > feature set" marker in GFF3, which is why I favor that solution. If > > we do this, should we adopt the same convention for the GET requests > > as well? If so, should we get rid of bidirection references? 
> > (I did notice that the GFF3 data sets I found, like wormbase, don't have > the "end of feature set" marker. My GFF3 parser has about 10x memory > overhead > so parsing a 80MB input file thrashed my 1GB laptop. Adding a single > marker in the middle, by hand, made it much happier.) > > If we have a such that features in that group are all > connected to other and only to each other, then I have no problem > getting > rid of the child link. It adds no benefits in that case but does cause > the verification overhead of checking that both directions are correct. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Sep 18 15:11:03 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 18 Sep 2006 21:11:03 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: References: Message-ID: Ed: > I'm having trouble understanding where all this memory overhead comes > from in your parsing of GFF3 files. I've recently written a GFF3 > parser > for IGB. I've found that the presence or absence of the "end of > feature > set" marker "###" has little effect on the amount of memory required. How big was the data set? dmel-3R-r4.3.gff from flybase is 68,685,595 bytes. Strange though now that I look at it. I shouldn't have a 10x overhead. I'm looking at the memory use now. I estimate my data structures used roughly 340 bytes per feature. 
Each line averages 80 characters, so 4.25x overhead and not 10x. Very strange. I'll need to dig into that some more. I did find that I wasted a lot of space with small data structures.

    class Location(object):
        def __init__(self, id, start, end):
            self.id, self.start, self.end = id, start, end

takes 190 bytes per instance. When I change it to use slots instead of a dictionary for attribute storage (deep Python trickery)

    class Location(object):
        __slots__ = ["id", "start", "end"]
        def __init__(self, id, start, end):
            self.id, self.start, self.end = id, start, end

I use about 48 bytes per object. That'll save about 122MB and take me away from the edge of memory use. I had used that trick on my other data objects - I somehow missed Location.

I suspect the other big memory use is in the attribute table for things like

    ID=80799wgsext-hsp;Name=80799wgsext

Each string has 16 bytes of overhead, I think, so 32 bytes for each use of "ID" and "Name". By interning those two frequent strings I can save about 20 bytes per record (70% of flybase records have ID, 55% have Name), or 19MB.

> The procedure is quite simple.

That's the first step. For sanity checking you should do cycle detection, and likely check that the structure is single-rooted.

> During processing you have to have one hashmap. But I don't see how that adds a whole lot to the memory overhead.

It wasn't. It was the per-record overhead.

> So basically, I just don't understand what problem you are trying to solve.

The reason for bidirectional links was to allow processing while receiving data rather than waiting until the end. With bi-di you can in principle determine that a feature group is complete when the last feature in the group arrives.

> I don't object to adding <FEATURE_GROUP>, and I don't much care whether there are bi-directional references.
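Both savings Andrew mentions are easy to check in CPython; exact byte counts vary by interpreter version and platform, so his 190-vs-48 figures are indicative only:

```python
import sys

class LocDict:
    def __init__(self, id, start, end):
        self.id, self.start, self.end = id, start, end

class LocSlots:
    __slots__ = ["id", "start", "end"]
    def __init__(self, id, start, end):
        self.id, self.start, self.end = id, start, end

a = LocDict("x", 1, 2)
b = LocSlots("x", 1, 2)
# The slotted instance has no per-instance __dict__, which is where
# the savings come from.
assert not hasattr(b, "__dict__")
assert sys.getsizeof(b) < sys.getsizeof(a) + sys.getsizeof(a.__dict__)

# Interning frequent attribute tags makes repeated "ID"/"Name" keys
# share one string object instead of allocating a copy per record.
tag1 = sys.intern("ID")
tag2 = sys.intern("ID")
assert tag1 is tag2
```

For a parser, the intern call would sit where column-9 tags are split out, so millions of records share a handful of tag strings.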
> Bi-directional references do not seem necessary to me, and really just seem like a likely place for the users to make mistakes, but I don't see any reason to change the spec now.

If it's error prone (I agree that it is), and it's hard to use (which I now believe), and no one will use it for its intended goal (likely?), and it breaks no code to remove it, then I see little reason to keep it.

If processing while downloading is desirable, then the easiest solution to use is a <FEATURE_GROUP>, but the solution with the least change to the existing spec is a "root=" attribute. If processing while downloading is not sufficiently desirable, then there's no need for bi-di links and we can drop the <PART> element and have the data structure be closer to GFF3.

> (You in fact get a bit of a boost because since each parent knows how many children it expects, you can set up the child List objects with the correct size from the beginning.)

Only if the parents are listed first. Otherwise there's no hint for the correct size.

Andrew dalke at dalkescientific.com

From dalke at dalkescientific.com Mon Sep 18 15:20:51 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 18 Sep 2006 21:20:51 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <6dce9a0b0609181023rc9aebe9sa452c4b4f7cd8d7b@mail.gmail.com> References: <6dce9a0b0609181023rc9aebe9sa452c4b4f7cd8d7b@mail.gmail.com> Message-ID: <05439c737629728d1531fd57f5f1b195@dalkescientific.com>

Lincoln:
> My GFF3 parser works in a similar manner. As each feature comes in, it is parsed, turned into an object, and sent to a disk-based database.

I was writing a GFF3 to DAS2XML converter. With bi-di links each record needs link data for both directions before writing the record. I could do intermediate saves to the disk, but that's more work than I wanted to do.

I can change my converter to use less memory - quite a bit less with a bit more work. I've not optimized much for memory, mostly for speed.

Another solution is to get rid of bi-di links and have only parent links. In that case the conversion is trivial, excepting the steps to check for cycles and single-rooted groups.
But that's only if people don't sufficiently want the ability to process complete features while other features are being up/downloaded. Andrew dalke at dalkescientific.com From Ed_Erwin at affymetrix.com Mon Sep 18 18:01:19 2006 From: Ed_Erwin at affymetrix.com (Erwin, Ed) Date: Mon, 18 Sep 2006 15:01:19 -0700 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: I have mostly used smaller examples from NCBI, but I've downloaded that wormbase one to play with as a good test of a big file. I took file "3R.gff" from here ftp://flybase.net/genomes/Drosophila_melanogaster/current/gff/ I need something a little more than 2x the filesize to store that data and to store the graphical objects used to represent it. (I haven't looked at exactly how much is data vs. graphics.) Since IGB keeps everything in memory, we have optimized for memory rather than speed. One of the tricks here is that I don't create a hashmap for the attributes. I simply store the attributes string as a string. I then have to do some regex processing each time I want to extract a property value, but that isn't very often and I intentionally chose memory efficiency over speed. The bigger problem seems to be that every GFF3 file I've seen in the wild has violated the specification. Every file I've tried has failed the validator, and it isn't even a very strict validator. In this case, one of the big things is that almost every feature has "ID=-". If I interpret that literally, then all those lines should be joined into one big feature. (I assume what was intended in this case is that these are features without an ID, so I've added a special case to handle that.) This is getting off topic of DAS/2, but I'm trying to collect a list of questionable things I've seen in GFF3 files and I'll try to get Lincoln to rule on whether they are valid. 
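Ed's memory trick — store column 9 verbatim and pull a value out only on demand — might look like this in Python (illustrative only; IGB is Java, and the names here are invented). Note that this naive pattern is exactly what Andrew pushes back on in the next message: it misses percent-encoded tags like %49%44:

```python
import re

class Feature:
    __slots__ = ("attrs_raw",)

    def __init__(self, attrs_raw):
        # One raw string per record instead of a parsed dict per record.
        self.attrs_raw = attrs_raw

    def get(self, tag):
        # Lazy extraction: pay the regex cost only when a property is
        # actually asked for, which is rare in a genome browser.
        m = re.search("(?:^|;)" + re.escape(tag) + "=([^;]*)", self.attrs_raw)
        return m.group(1) if m else None

f = Feature("ID=gene1;Name=wg")
assert f.get("Name") == "wg"
assert f.get("Score") is None
# The "ID=-" convention seen in the wild can then be special-cased on access.
assert Feature("ID=-").get("ID") == "-"
```

The trade is deliberate: repeated regex scans cost CPU, but each record carries one string rather than a hashmap of tag/value pairs.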
Ed -----Original Message----- From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke Sent: Monday, September 18, 2006 12:11 PM To: DAS/2 Subject: Re: [DAS2] feature group assembly; proposals for simplification Ed: > I'm having trouble understanding where all this memory overhead comes > from in your parsing of GFF3 files. I've recently written a GFF3 > parser > for IGB. I've found that the presence or absence of the "end of > feature > set" marker "###" has little effect on the amount of memory required. How big was the data set? dmel-3R-r4.3.gff from flybase is 68,685,595 bytes. Strange though now that I look at it. I shouldn't have a 10x overhead. .... > The procedure is quite simple. That's the first step. For sanity checking you should do cycle detection, and likely check that the structure is single-rooted. .... > (You in fact get > a bit of a boost because since each parent knows how many children it > expects, you can set-up the child List objects with the correct size > from the beginning.) Only if the parents are listed first. Otherwise there's no hint for the correct size. From dalke at dalkescientific.com Mon Sep 18 18:59:51 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 19 Sep 2006 00:59:51 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: References: Message-ID: Ed: > I took file "3R.gff" from here > ftp://flybase.net/genomes/Drosophila_melanogaster/current/gff/ "current" is dmel_r4.3_20060303 . I'm also using "4.3" but I have different data. It has 3R sim4 na_transcript_dmel_r31 380 1913 . + . ID=-; where I have 3R sim4:na_transcript_dmel_r31 match 380 1913 . + . ID=:315834 > Since IGB keeps everything in memory, we have optimized for memory > rather than speed. One of the tricks here is that I don't create a > hashmap for the attributes. Hmmm. My parser doesn't handle that, at least not without a bit of monkey patching. 
Thinking about it some .. that defers errors until later .. what errors? .. ahh, if a field doesn't have a "=" in it then my code will raise an exception.

> I simply store the attributes string as a string. I then have to do some regex processing each time I want to extract a property value, but that isn't very often and I intentionally chose memory efficiency over speed.

I didn't think regexps were the right solution for that. Well, not unless you're using them for single-character search. For example, URL escaping rules are used for tags or values containing the characters ",=;", which means that you can't search for "ID=" attributes using the pattern "ID=([^;]+)", because "ID" could be written as "%49%44".

> The bigger problem seems to be that every GFF3 file I've seen in the wild has violated the specification. Every file I've tried has failed the validator, and it isn't even a very strict validator.

That's why I suspect GFF3 isn't used as input. Otherwise these would have been noticed and fixed.

> In this case, one of the big things is that almost every feature has "ID=-". If I interpret that literally, then all those lines should be joined into one big feature. (I assume what was intended in this case is that these are features without an ID, so I've added a special case to handle that.)

In my version of the data set there can be IDs. What I found from looking at other data sets is that the ID can be duplicated, but I don't complain until assembling the complex feature, and only when there is a "parent" which uses a duplicate id. A small part of my memory overhead (about 70 bytes per record) tracks those duplicates. I had forgotten about this in my previous calculations.

> This is getting off topic of DAS/2, but I'm trying to collect a list of questionable things I've seen in GFF3 files and I'll try to get Lincoln to rule on whether they are valid.

I sent others to him last spring and he replied to me. Here they are in summary.
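A decode-after-split attribute parser avoids the %49%44 trap Andrew raises: split on ';', '=', and ',' first, then unescape each piece. A sketch (the function name is invented); it also treats a repeated tag and a comma-separated list the same way, matching the ruling in the Q&A summary that follows:

```python
from urllib.parse import unquote

def parse_attrs(col9):
    """Parse a GFF3 column-9 string into {tag: [values]}."""
    attrs = {}
    for pair in col9.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Unescape only after splitting, so percent-encoded ';', '=',
        # and ',' inside tags or values survive intact.
        attrs.setdefault(unquote(tag), []).extend(
            unquote(v) for v in value.split(","))
    return attrs

# "%49%44" is just "ID" percent-encoded, so both spellings share a key:
assert parse_attrs("%49%44=gene1") == {"ID": ["gene1"]}
# A repeated tag and a comma list normalize to the same value list:
assert parse_attrs("Parent=AB123;Parent=XY987") == parse_attrs("Parent=AB123,XY987")
```

A stricter parser would also reject malformed percent escapes rather than pass them through, per the ruling below that bad hex after '%' should raise an error.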
Some were requests for clarification.

Q. Can the start and end position be '.'?
A. Yes, and it's allowed in the spec.

Q. Can the seqid be "."?
A. "This is allowed by the spec, but I hope it would never happen. It means there is a floating feature that has no location. It should probably be forbidden for seqid to be . and start and end to be defined. Shall I modify the GFF3 spec to state so?" I see now I didn't respond: "yes" is my answer.

Q. Can the 9th field be "."?
A. This is ok.

Q. Are zero-length tags allowed? Eg, an attribute field of "=5". [...] I use a dictionary key of "".
A. Allowed.

Q. Should parsers raise an exception if the two characters after the '%' are not hex characters?
A. Yes. (Note that my parser currently does not catch that error.)

Q. Are duplicate attribute tags allowed, as in Parent=AB123;Parent=XY987? If so, is it equivalent to Parent=AB123,XY987?
A. Absolutely! This is allowed and encouraged.

Andrew dalke at dalkescientific.com

From allenday at ucla.edu Tue Sep 19 13:06:24 2006 From: allenday at ucla.edu (Allen Day) Date: Tue, 19 Sep 2006 10:06:24 -0700 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> References: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> Message-ID: <5c24dcc30609191006p44b955d4kea78bd81c42b8c26@mail.gmail.com>

I wrote a writeback parser for the current XML style, and although I did not add code to reject multi-rooted groups (which may not be appropriate anyway), I didn't find the book-keeping to be particularly onerous. If I understand correctly, the complaint isn't the book-keeping itself, but rather the memory requirements imposed by the book-keeping. Why not just give HTTP 413 (request entity too large) if you don't like the size of the file being uploaded?
Gregg and I had a discussion about likely writeback document sizes during the last code sprint, and Genoviz is likely to be giving documents in the 1-50KB range -- nowhere near 80MB of GFF3 worth of features. -Allen On 9/11/06, Andrew Dalke wrote: > > I've been working on a writeback server. It will verify that the > feature groups have no cycles, that if X is a parent to Y then Y > is a part of X, and that all groups has a single root. > > I'm having a hard time with that, and harder than I expected. > > GFF3 had only one direction of relationship. As such it's impossible > to assemble a feature group until the end of the file or the > marker that no lookahead is needed. > > We changed that in DAS2. The feature xml is bidirectional so > in theory it's possible to know when the feature group is complete. > But it's tricky. It's tricky enough that I want to change things > slightly so that people don't need to handle the trickiness. > > The trickiness comes when parsing the list of features into > feature groups. For example, consider > > [F3] > / \ > [F4] [F2] > | | > [F1] [F5] > > where the features are in the order F1, F2, ... F5. After F2 > the system looks like there are two feature groups. > > [F3?] > | > [F4?] [F2] > | | > [F1] [F5?] > > Only after F3 can those be merged together. This requires some > non-trivial bookkeeping, unless I've forgotten something simple > from undergraduate data structures. > > Of course it's simple if you know that a feature is the last feature > in a feature group either through reaching EOF or a special marker. > But then what's the point of having bidirectional links if the > result is no better than GFF3's only-list-parent solution. > > If there is a simple algorithm, please let me know. > > === Solution #1 === > > Another solution is to require that complex feature groups (groups > with more than one feature) must also have a link to the root element > of the feature group. 
> I brought this up before but agreed with others that there wasn't a need for it. Now I think there is.
>
> Here's an example.
>
> By using a 'root' attribute, detecting the end of a feature group is almost trivial:
>
>   a FeatureGroup contains:
>   - list of seen urls               # duplicates are not allowed
>   - set of urls which must be seen  # duplicates are ignored
>
>   let feature_groups := mapping {root uri -> FeatureGroup}
>
>   for feature in features:
>     if feature does not have a @root attribute:
>       make a new FeatureGroup
>       add the feature to the FeatureGroup as being seen
>       let feature_groups[feature's @uri attribute] := the new FeatureGroup
>
>     else:
>       if the feature's @root attribute does not exist in feature_groups:
>         # first time this feature group was seen
>         create a new FeatureGroup
>         let feature_groups[feature's @root attribute] := the new FeatureGroup
>
>       get feature_groups[feature's @root attribute]
>       add this feature to the FeatureGroup as a seen url
>       for each uri in (feature's @uri attribute, the parent uris, the part uris):
>         add the uri to the FeatureGroup's "must be seen" set
>       if count(seen urls) == count(must be seen urls):
>         the feature group is complete / assemble the links
>
> Assembly of a feature group occurs as soon as all the features are available, rather than waiting for the end.
>
> This makes life much simpler for the writeback, and I assume also for the client code. Assuming the client code doesn't just wait until the end of the input before it does anything.
>
> Gregg? Do you wait until the end of the XML to assemble hierarchical features? If so, do you need parent/part or will parent suffice? Or do you do all the bookkeeping as you get the data? How complex is the code?
>
> There are other solutions:
>
> === Solution #2 ===
> - require that the features are ordered so that parents come before a part
>
> I think this is hard because relational databases aren't naturally ordered.
> The normal trick is to put an extra field and "sort by", but then the server has to maintain the correct ordering. It's fragile.
>
> === Solution #3 ===
>
> - put all elements for a given feature group explicitly inside of a <FEATURE_GROUP> element.
>
> Eg,
>
>   />
>
> (in this case simple features not part of a complex parent/part relationship need not be in a FEATURE_GROUP.)
>
> This is the easiest solution. I like it because it's the easiest to figure out. Even the algorithm above is hard by comparison.
>
> If I had my choice we would do this instead of determining the feature group by analysis of the parent/part linkages.
>
> Note that with this change there's no longer need for the PART element. We would only need PARENT.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

From allenday at ucla.edu Thu Sep 21 03:30:52 2006 From: allenday at ucla.edu (Allen Day) Date: Thu, 21 Sep 2006 00:30:52 -0700 Subject: [DAS2] das2 diagrams, questions Message-ID: <5c24dcc30609210030k5324378fy18990dc41a1f1b1e@mail.gmail.com>

Hi, I am getting ready to do a server-side rewrite, so I took some time to diagram out where we are from the current spec documents. See the attached file, particularly pages 5-6. I have a few questions, mostly targeted at Andrew, regarding the current HTML version of the spec on the biodas.org site. It hasn't been updated in about 5 months and looks pretty out of date.

* Is the HTML document in sync with the "new_spec.txt" document in CVS?

* There is mention of a "fasta" command, and its fragment is linked from the ToC of the genome retrievals document, but it does not appear in the document. Does this command exist? My understanding from conference calls is that the sequence/fasta/segment/dna stuff has all merged into the "segment" response. Is this correct?
* The "property" command seems to have disappeared. Is that correct? Are property keys no longer URIs? Also, the "prop-*" feature filters could be better described; it is not clear to me if they are meant as some sort of replacement for the property command.

This document also contains a few diagrams on pages 1-4 describing how the writeback, block caching/flushing, and dynamic feature generation (a.k.a. "analysis DAS") all fit together.

-Allen

-------------- next part -------------- A non-text attachment was scrubbed... Name: DAS2_overview.pdf Type: application/pdf Size: 362762 bytes Desc: not available URL:

From rowankuiper at hotmail.com Sun Sep 24 04:17:38 2006 From: rowankuiper at hotmail.com (Rowan Kuiper) Date: Sun, 24 Sep 2006 08:17:38 +0000 Subject: [DAS2] current status of DAS Message-ID:

I have a few questions. I'm a bioinformatics student, and for an internship at the Erasmus University in Holland I have to investigate the current status of DAS. I've been trying to work with DAS a couple of weeks now, and the impression I get is that it is a bit messy. Perhaps this is because I don't understand DAS very well; can you explain it to me?

- First of all, will DAS2 ever be finished? I saw on the biodas site that the two-year development started in 2004. But when I looked at sites that should propagate the development, DAS seems to be out of focus. Do you think DAS is still alive, or is there something else that took its place?

- Why don't all servers support all commands? Some reference servers, for example, don't support the entry_points command. How do I request features when I don't know which segments the server contains? I imagine that these great differences in how to use different servers could be very problematic when implementing a viewer.

- It seems that the only way to retrieve information from a server is to do a request for a certain region. Is there a way to ask for specific features?
- Is the Sanger Registry Server reliable, or is it something of the past? It would be very nice if all available sources were listed there, but just a small part of the sources I found were in the list.

- When I have to serve data that needs some extension on the XML structure, would it be a problem to just do it? How would clients handle these extensions? Ignore them, or somehow parse them?

- And last, one of the goals of DAS is to be able to integrate biological data. When I, for example, want to compare my data to EnsEMBL features, I will have to set up my own server that serves features referenced to the same genome as the EnsEMBL features. So I wonder if there exist reference servers that contain the current genomes of EnsEMBL, NCBI or UCSC. I found http://das.ensembl.org/das/ensembl_Homo_sapiens_core_38_36, which always replies with an out-of-memory error, even with the entry_points command, and http://das.ensembl.org/das/ensembl1834, which seems to work properly, but its transcript server ens1834trans also returns out-of-memory errors.

I think that there are some people here who can tell me their view on the subject. Thanks in advance, Rowan Kuiper

From ap3 at sanger.ac.uk Mon Sep 25 05:29:24 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 25 Sep 2006 10:29:24 +0100 Subject: [DAS2] current status of DAS In-Reply-To: References: Message-ID: <1be3231582c70efe24e848c94409b674@sanger.ac.uk>

Hi Rowan!

> - Is the Sanger Registry Server reliable or is it something of the past? It would be very nice if all available sources were listed there but just a small part of the sources I found were in the list.

I am the administrator of the DAS registration server. DAS is a collaborative approach to share biological data. It is actively being used by many institutions around the world. The DAS registry was developed in order to make it easier to discover DAS servers. DAS does not force anybody to get their servers registered.
That's why some servers might not be listed. Usually, if I learn about a server that is not there yet, I will contact the administrator and invite him/her to register. If you know of any servers that are not registered in the DAS registry, please let me know and I will take care of it. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From ak at ebi.ac.uk Mon Sep 25 06:07:23 2006 From: ak at ebi.ac.uk (Andreas Kahari) Date: Mon, 25 Sep 2006 11:07:23 +0100 Subject: [DAS2] current status of DAS In-Reply-To: References: Message-ID: <20060925100723.GE31706@ebi.ac.uk> On Sun, Sep 24, 2006 at 08:17:38AM +0000, Rowan Kuiper wrote: > I have a few questions. I'm a bioinformatics student and for an internship > at the Erasmus University in Holland I have to investigate the current > status of DAS. I've been trying to work with DAS a couple of weeks now and > the impression I get is that it is a bit messy. Perhaps this is because I > don't understand DAS very well and can you explain it to me. DAS is a specification of a communication protocol originally intended to provide a web service for serving GFF-like annotation data. > - First of all, will DAS2 ever be finished. I saw on the biodas site that > the 2 year development started in 2004. But when I looked at sites that > should propagate the development, DAS seems to be out of focus. You think > DAS is still alive or is there something else that took its place? The stagnation of the DAS/2 development that you refer to is outside of what I know very much about (the frequent telephone conference mailings on this list suggest it's not stagnated at all). I work full time with a large number of research groups who are using DAS/1 as a tool for data integration in various ways. So in Europe, at least, DAS/1 is very much alive.
Also, within the Ensembl Genome Browser (www.ensembl.org), more things are done through DAS than you might think. > - Why don't all servers support all commands. Some reference servers for > example don't support the entry_point command. How do I request features > when I don't know which segments the server contains? I imagine that these > great differences in how to use different servers could be very problematic > when implementing a viewer. Lazy maintainers, possibly? Could you please provide us with concrete examples of these reference servers? If any of them are within my control, this would give me a chance to fix them. > - It seems that the only way to retrieve information from a server is to do > a request for a certain region. Is there a way to ask for a specific > features. This is an artefact of the way genomic annotation viewers work. They provide the user with a view of one genomic region at a time. According to the specification (DAS/1), the 'features' request may be tailored to only return certain feature IDs on a given segment using the 'feature_id=ID' argument. Whether this capability is implemented by a particular server should be evident from the HTTP headers sent back by the server. Again, since people are lazy (me too), and since clients never, as far as I am aware, make use of this capability, it is seldom implemented. > - Is the Sanger Registry Server reliable or is it something of the past? It > would be very nice if all available sources were listed there but just a > small part of the sources I found were in the list. I'll leave this one for Andreas Prlic. > - When I have to serve data that needs some extension on the XML structure, > would it be a problem to just do it. How would clients handle these > extensions. Ignore them or somehow parse them? You're free to add whatever XML you feel a need to add. A well-behaved DAS client will ignore it.
If the response still contains the necessary bits and bobs, then it is in my opinion still DAS; otherwise you've broken the protocol and the response will be unusable by any existing client. There is no magic in clients that will tell them to look for XML structures that are not specified as being part of the DAS response. > - And last, one of the goals of DAS is to be able to integrate biological > data. When I for example want to compare my data to EnsEMBL features I will > have to set up my own server that serves features referenced to the same > genome as the EnsEMBL features. So I wonder if there exists reference > servers that contain the current genomes of EnsEMBL, NCBI or UCSC. I found > http://das.ensembl.org/das/ensembl_Homo_sapiens_core_38_36 which always > replies an out of memory error even with the entry_points command and > http://das.ensembl.org/das/ensembl1834 which seems to work properly but its > transcript server ens1834trans also returns out of memory errors. If you wish to do numerical (not visual) comparisons of data against Ensembl, I believe this would be easier with the help of the Ensembl Perl API. Ensembl nowadays serves reference sources from the www.ensembl.org/das server. See Eugene's reply for examples. > I think that there are some people here that can tell me their view on the > subject. > Thanks in advance, > Rowan Kuiper Regards, Andreas -- Andreas Kähäri Ensembl Software Developer European Bioinformatics Institute (EMBL-EBI) From ak at ebi.ac.uk Mon Sep 25 08:35:04 2006 From: ak at ebi.ac.uk (Andreas Kahari) Date: Mon, 25 Sep 2006 13:35:04 +0100 Subject: [DAS2] current status of DAS Message-ID: <20060925123504.GG31706@ebi.ac.uk> Sent to list on behalf of Eugene Kulesha (non-subscriber). - Andreas K.
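The 'feature_id' filter on the DAS/1 'features' request described above is exercised with a plain HTTP GET. A minimal sketch of composing such a request in Python follows; the server, source name, and feature ID are the Ensembl examples cited elsewhere in this thread and are assumptions here (the source may no longer be live), and the helper name is illustrative:

```python
# Build a DAS/1 'features' request restricted to a single feature ID.
# The source name and feature ID are examples from this thread, not
# guaranteed to exist on a running server.
from urllib.parse import urlencode

def das_features_url(server, source, **filters):
    """Compose a DAS/1 features request URL from filter arguments."""
    query = urlencode(filters)
    return f"{server}/das/{source}/features?{query}"

url = das_features_url(
    "http://www.ensembl.org",
    "Homo_sapiens.NCBI36.transcripts",
    feature_id="ENSE00001253754",
)
print(url)
# http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcripts/features?feature_id=ENSE00001253754
```

Whether a given server honors the filter (rather than ignoring it) is, as noted above, only discoverable from the capability headers it returns.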
----- Forwarded message from Eugene Kulesha ----- Subject: Re: [DAS2] current status of DAS Date: Mon, 25 Sep 2006 10:46:48 +0100 From: Eugene Kulesha To: Rowan Kuiper CC: das2 at lists.open-bio.org References: >- First of all, will DAS2 ever be finished. good question :) although it stopped worrying me a long time ago ;) >DAS is still alive or is there something else that took its place? it certainly is in Ensembl >- Why don't all servers support all commands. Some reference servers for >example don't support the entry_point command. How do I request features >when I don't know which segments the server contains? I imagine that these >great differences in how to use different servers could be very problematic >when implementing a viewer. I guess it was done in part so DAS could be adopted quicker, but I have to admit that I was very much frustrated by the fact that very few sources implement the 'entry_points' command >- It seems that the only way to retrieve information from a server is to do >a request for a certain region. Is there a way to ask for a specific >features. yes, features?feature_id=XXXX would give you the feature (if feature_id is implemented) http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcripts/features?feature_id=ENSE00001253754 >- When I have to serve data that needs some extension on the XML structure, >would it be a problem to just do it. How would clients handle these >extensions. Ignore them or somehow parse them? I'm not quite sure what Bio::DasLite (this is what we use to parse DAS responses) would do .. but even if it parses the extension OK, Ensembl will ignore the unknown properties .. >- And last, one of the goals of DAS is to be able to integrate biological >data. When I for example want to compare my data to EnsEMBL features I will >have to set up my own server that serves features referenced to the same >genome as the EnsEMBL features.
So I wonder if there exists reference >servers that contain the current genomes of EnsEMBL, NCBI or UCSC. I >found http://das.ensembl.org/das/ensembl_Homo_sapiens_core_38_36 which >always replies an out of memory error even with the entry_points command >and http://das.ensembl.org/das/ensembl1834 which seems to work properly but >its transcript server ens1834trans also returns out of memory errors. http://www.ensembl.org/das/dsn has the list of all the sources that we serve from internal Ensembl data; amongst them there are reference sources, e.g. http://www.ensembl.org/das/Homo_sapiens.NCBI36.reference http://www.ensembl.org/das/Mus_musculus.NCBIM36.reference Cheers Eugene Kulesha ----- End forwarded message ----- -- Andreas Kähäri Ensembl Software Developer European Bioinformatics Institute (EMBL-EBI) From Steve_Chervitz at affymetrix.com Mon Sep 25 13:39:41 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 25 Sep 2006 10:39:41 -0700 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 25 Sep 2006 Message-ID: Notes from the weekly DAS/2 teleconference, 25 Sep 2006 $Id: das2-teleconf-2006-09-25.txt,v 1.2 2006/09/25 17:38:57 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed Erwin, Gregg Helt UCLA: Allen Day Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.
Agenda ------- * Spec issues * Grant status * Status reports Topic: Spec issues ------------------ aday: what is the status of the currently posted spec w/r/t fasta format, seq segments? The fasta description in the spec seems not up to date. gh: as I recall from the last code sprint (aug 14-18 2006), we had decided to return a das2 segments document in which you could specify fasta as an available format for receiving seq data. sc: that's my recollection too. [A] andrew will work more on keeping the online spec up to date. Topic: Grant status ------------------- gh: We have received official word of approval of $250K for extending funding from now thru May 2007. Allen and Ed will be at same amt, steve down a bit - based on current billing, gregg up 40-50%. This will allow me to put more focus on grant. Funding will also be put towards equipment improvements for affy das/2 server on our colo. andrew will start a full time job in 2007, will ramp up till end of year. hoping he can get a lot of the spec issues put to rest before he goes. probably it will be me (gregg) taking up the spec when he leaves, and hoping I won't have much to do on the spec docs. Topic: status reports --------------------- gh: have worked on the das/2 budget last 2 weeks. now should be able to get back to coding. have allen and brian received their reimbursements from code sprint? aday: brian got his, not me yet. gh: should get yours soon. ee: putting out a new release of igb this week. minor release. sc: helped straighten out file/dir permissions at biodas.org; lincoln was posting an update to the Bio::Das section on the biodas.org ftp site. gh: his new das/2 client in perl? he's been working on that. sc: not sure, possibly. sc: also talked with gregg regarding my time commitment for the das/2 extension period. will be able to devote a solid block of time (~4wks) sometime in Dec or Jan. aday: diagramming to get the current state of the spec.
getting ready to do major server side rewrite, implemented block caching strategy, to allow same data source to do reads and writes. going with custom caching rather than apache mod proxy, gives us more control of operations. Performance improvements I did on the chado db can then be removed since everything will be cached. working on uml diags. gh: are you doing from scratch caching? aday: we have a mvc app. model layer talks to db, inherits from abstract db. that will stay the same. handles conversion of query string into sql. maybe trim it down and simplify based on spec cruft losses. for view and controller components, we use templates to generate xml. that will stay the same, will use the catalyst web framework, much like ruby on rails, executable scripts that generate code for v and c layers. will replace the current hand-written code with the catalyst generated stuff. sc: so this is like Ruby on Rails for perl? aday: yes aday: question on hw budget on this current round of funding? gh: yes. we originally discussed more hw for ucla or cshl. now looking more doubtful. would like to address towards end of year or jan. based on prev estimates, we never spend as much as budgeted for. if more left over, we can look into putting more hw. the affy hw is a sure thing. is need critical? aday: there is pressure on our hw to do upgrades, used by rest of lab as well now. maybe $5K would do it. dual or quad 4. gh: doable sooner rather than later. send some figures... [A] Allen will send gregg estimates on needed das-related hw upgrades at ucla. From allenday at ucla.edu Fri Sep 1 02:11:59 2006 From: allenday at ucla.edu (Allen Day) Date: Thu, 31 Aug 2006 19:11:59 -0700 Subject: [DAS2] dynamic das2 features Message-ID: <5c24dcc30608311911q38ac2520k24c166bb33c29e75@mail.gmail.com> I have a prototype that will generate primer3 primers.
Temporarily up here: http://jugular.ctrl.ucla.edu:3000/feature?type=primer3;seq=ATATCCTAAAAGCATAACTGATGCATCTTTAATCTTGTATGTGACACTACTCATACGAAGGGACTATATCTAGTCAAGACGATACTGTGATAGGTACGTTATTTAATAGGATCTATAACGAAATGTCAAATAATTTTACGGTAATATAACTTATCAGCGGCGTATACTAAAACGGACGTTACGATATTGTCTCACTTCATCTTACCACCCTCTATCTTATTGCTGATAGAACACTAACCCCTCAGCTTTATTTCTAGTTACAGTTACACAAAAAACTATGCCAACCCAGAAATCTTGATATTTTACGTGTCAAAAAATGAGGGTCTCTAAATGAGAGTTTGGTACCATGACTTGTAACTCGCACTGCCCTGATCTGCAATCTTGTTCTTAGAAGTGACGCATATTCTATACGGCCCGACGCGACGCGCCAAAAAATGAAAAACGAAGCAGCGACTCATTTTTATTTAAGGACAAAGGTTGCGAAGCCGCACATTTCCAATTTCATTGTTGTTTATTGGACATACACTGTTAGCTTTATTACCGTCCACGTTTTTTCTACAATAGTGTAGAAGTTTCTTTCTTATGTTCATCGTATTCATAAAATGCTTCACGAACACCGTCATTGATCAAATAGGTCTATAATATTAATATACATTTATATAATCTACGGTATTTATATCATCAAAAAAAAGTAGTTTTTTTATTTTATTTTGTTCGTTAATTTTCAATTTCTATGGAAACCCGTTCGTAAAATTGGCGTTTGTCTCTAGTTTGCGATAGTGTAGATACCGTCCTTGGATAGAGCACTGGAGATGGCTGGCTTTAATCTGCTGGAGTACCATGGAACACCGGTGATCATTCTGGTCACTTGGTCTGGAGCAATACCGGTCAACATGGTGGTGAAGTCACCGTAGTTGAAAACGGCTTCAGCAACTTCGACTGGGTAGGTTTCAGTTGGGTGGGCGGCTTGGAACATGTAGTATTGGGCTAAGTGAGCTCTGATATCAGAGACGTAGACACCCAATTCCACCAAGTTGACTCTTTCGTCAGATTGAGCTAGAGTGGTGGTTGCAGAAGCAGTAGCAGCGATGGCAGCGACACCAGCGGCGATTGAAGTTAATTTGACCATTGTATTTGTTTTGTTTGTTAGTGCTGATATAAGCTTAACAGGAAAGGAAAGAATAAAGACATATTCTCAAAGGCATATAGTTGAAGCAGCTCTATTTATACCCATTCCCTCATGGGTTGTTGCTATTTAAACGATCGCTGACTGGCACCAGTTCCTCATCAAATATTCTCTATATCTCATCTTTCACACAATCTCATTATCTCTATGGAGATGCTCTTGTTTCTGAACGAATCATAAATCTTTCATAGGTTTCGTATGTGGAGTACTGTTTTATGGCGCTTATGTGTATTCGTATGCGCAGAATGTGGGAATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGGTCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTGGGAGTCGTATACTGTTAGGGTCTGTAAACTTGTGAACTCTCGGCAAATGCCTTGGTGCAATTACGTAATTTTAGCCGCTGAGAAGCGGATGGTAATGAGACAAGTTGATATCAAACAGATACATATTTAAAAGAGGGTACCGCTAATTTAGCAGGGCAGTATTATTGTAGTTTGATATGTACGGCTAACTGAACCTAAGTAGGGATATGAGAGTAAGAACGTTCGGCTACTCTTCTTTCTAAGTGGGATTTTTCTTAATCCTTGGATTCTT
AAAAGGTTATTAAAGTTCCGCACAAAGAACGCTTGGAAATCGCATTCATCAAAGAACAACTCTTCGTTTTCCAAACAATCTTCCCGAAAAAGTAGCCGTTCATTTCCCTTCCGATTTCATTCCTAGACTGCCAAATTTTTCTTGCTCATTTATAATGATTGATAAGAATTGTATTTGTGTCCCATTCTCGTAGATAAAATTCTTGGATGTTAAAAAATTATTATTTTCTTCATAAAGAAG I have all my ducks in a row to implement primer3 (done), blat, blastn, tblastn, tblastx, genscan, and rePCR. I will do some reworking of the GET params to allow specification of parameters (e.g. required primer size range) using the property filter syntax. This server will require both a type= and overlaps= filter for all requests so that it can do a backend GET on the sequence from the main das server. Gregg, please take a look and let me know if this is roughly suitable for Genoviz. -Allen From dalke at dalkescientific.com Mon Sep 11 12:44:07 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 14:44:07 +0200 Subject: [DAS2] stylesheets today Message-ID: <37174817d074b52ff679706f10ba2cbd@dalkescientific.com> We're scheduled to talk about stylesheets today, right? Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Sep 11 13:35:12 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 11 Sep 2006 06:35:12 -0700 Subject: [DAS2] stylesheets today Message-ID: We're definitely having a DAS/2 teleconference at the usual time, 9:30 AM PST. I can't remember if stylesheets were already on the agenda for today, but if that's what's on your mind, sounds like a good topic to start with. Gregg > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Monday, September 11, 2006 5:44 AM > To: DAS/2 > Subject: [DAS2] stylesheets today > > We're scheduled to talk about stylesheets today, right? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Mon Sep 11 14:30:20 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 16:30:20 +0200 Subject: [DAS2] stylesheets today In-Reply-To: References: Message-ID: <4e7fc9cae0d8b1a83982023f600ea199@dalkescientific.com> Gregg: > I can't remember if stylesheets were already on the agenda for > today, but if that's what's on your mind, sounds like a good topic to > start with.
I actually don't want to talk about it. I don't know enough about the topic. I have ideas and comments. It was more that I remembered, at the wrap-up for the sprint, someone (you?) proposed that the next meeting be on stylesheets. If so, I want to be prepared to talk about it. Ahh, I misremembered. From Steve's notes gh: Something we need to do: come back to stylesheet issue. ad: we should have impl in place before making spec work. [A] Discuss stylesheets when we have an impl in place In other action items, has anyone developed some use guidelines for generating DAS2 data? Eg, recommended ways to do alignments, BLAST hits, complex parent/part relationships? Also, guidelines on how to convert GFF3 into DAS2 feature xml? I don't know where all of the fields are supposed to go in the free-form area. Examples from flybase: ID=FBti0020396;Name=Rt1c{}1472;Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396;synonym_2nd=Rt1c{}1472 ID=FBgn0004859;Name=ci;Dbxref=FlyBase+Annotation+IDs:CG2125,FlyBase:FBan0002125,FlyBase:FBgn0004859;cyto_range=102A1-102A3;dbxref_2nd=FlyBase:FBgn0000314,FlyBase:FBgn0000315,FlyBase:FBgn0010154,FlyBase:FBgn0010155,FlyBase:FBgn0017411,FlyBase:FBgn0019831;gbunit=AE003845;synonym_2nd=Ce,Ci,CI,ci155,ciD,ci-D,CiD,CID,ciD,CiD,Cubitus+interruptus,cubitus-interruptus-Dominant,l(4)102ABc,l(4)13,l(4)17 What should I do with those? Dump them into the key/value property field? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Sep 11 17:52:35 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 19:52:35 +0200 Subject: [DAS2] best practices / DAS2 format examples Message-ID: das2-teleconf-2006-03-16.txt > [A] Lincoln will provide use cases/examples of these features > scenarios: > - three or greater hierarchy features > - multiple parents > - alignments I really would like some real-world examples of these.
I don't know enough to make decent examples for the documentation, and I think it would be very useful so others can see how to model existing data in DAS2 XML. I looked at GFF3 examples to find existing properties which must be storable in a DAS2 feature document. Here are two example lines: ID=FBti0020396;Name=Rt1c{}1472;Dbxref=FlyBase+Annotation+IDs:TE20396,FlyBase:FBti0020396;cyto_range=102A1-102A1;gbunit=AE003845;synonym=TE20396;synonym_2nd=Rt1c{}1472 ID=FBgn0004859;Name=ci;Dbxref=FlyBase+Annotation+IDs:CG2125,FlyBase:FBan0002125,FlyBase:FBgn0004859;cyto_range=102A1-102A3;dbxref_2nd=FlyBase:FBgn0000314,FlyBase:FBgn0000315,FlyBase:FBgn0010154,FlyBase:FBgn0010155,FlyBase:FBgn0017411,FlyBase:FBgn0019831;gbunit=AE003845;synonym_2nd=Ce,Ci,CI,ci155,ciD,ci-D,CiD,CID,ciD,CiD,Cubitus+interruptus,cubitus-interruptus-Dominant,l(4)102ABc,l(4)13,l(4)17 I do not know this domain well enough. I do not know how "cyto_range" should be stored in DAS2 XML, nor gbunit. I don't know the difference between dbxref and dbxref_2nd. Nor can I find documentation on these properties. Looking around I came across the names cyto_range, Dbxref, dbxref_2nd, Name, Parent, species, gbunit, Alias, but I don't know how those are best modeled in GFF3. For example, is species redundant given that we know that from the reference sequence? I want someone to be able to go to DAS and easily figure out how to convert existing data models into DAS's model. Here is an example of a real-world GFF3 complex annotation, which we're calling a "feature group" in DAS2. The top level is a gene. It has one child, which is an mRNA. The mRNA has children of CDS, exon, protein, and intron. I've added newlines for readability. 4 . gene 22335 23205 . - . ID=FBgn0052013;Name=CG32013;Dbxref=FlyBase+Annotation+IDs:CG32013,FlyBase:FBan0032013,FlyBase:FBgn0052013;cyto_range=101F1-101F1;gbunit=AE003845 4 . mRNA 22335 23205 . - .
ID=FBtr0089183;Name=CG32013-RA;Parent=FBgn0052013;Dbxref=FlyBase+Annotation+IDs:CG32013-RA,FlyBase:FBtr0089183;cyto_range=101F1-101F1 4 . CDS 22335 22528 . - . Parent=FBtr0089183;Name=CG32013-cds;Dbxref=FlyBase+Annotation+IDs:CG32013-RA 4 . exon 22335 22528 . - . Parent=FBtr0089183 4 . protein 22338 23205 . - . ID=FBpp0088247;Name=CG32013-PA;Parent=FBtr0089183;Dbxref=FlyBase+Annotation+IDs:CG32013-PA,FlyBase:FBpp0088247,GB_protein:AAN06536.1,FlyBase+Annotation+IDs:CG32013-RA 4 . intron 22529 22616 . - . Parent=FBtr0089183;Name=CG32013-in 4 . CDS 22617 23205 . - . Parent=FBtr0089183;Name=CG32013-cds;Dbxref=FlyBase+Annotation+IDs:CG32013-RA 4 . exon 22617 23205 . - . Parent=FBtr0089183 The direct conversion to DAS2 xml the way I've been doing it is first defining a TYPES document like this (the das-private: identifiers are created upon server upload). Note that I'm storing the GFF3 fields in a PROP element so I can easily figure out which DAS2 types correspond to the GFF3 types (a unique gff3 type is the pair (type, source)). Given the types, the features document looks like. Note the change in start position because GFF3 is a "start with 1" numbering system while DAS2 is a "start with 0". Note also that I've used the Name property from GFF3 to populate the title field in DAS2. While I have ideas on what to do with the rest (eg, populate the dbxref DAS2 element), I don't know what to do with all of the fields and would like advice.
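The two mechanical steps in this conversion, unpacking the GFF3 column-9 attributes and shifting the 1-based start to DAS2's 0-based interbase convention, can be sketched as follows. This is a minimal sketch, not part of any DAS2 library: the helper names are illustrative, and treating "+" as an encoded space is an assumption made to match the FlyBase dumps quoted above.

```python
from urllib.parse import unquote_plus

def gff3_to_das2_range(start, end):
    """GFF3 is 1-based inclusive; DAS2 is 0-based interbase,
    so only the start coordinate shifts."""
    return start - 1, end

def parse_gff3_attributes(col9):
    """Split a GFF3 column-9 string into a key -> [values] dict.
    Assumes '+' encodes a space, as in the FlyBase examples above."""
    attrs = {}
    for pair in col9.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = [unquote_plus(v) for v in value.split(",")]
    return attrs

# The gene line from the example above: 4 . gene 22335 23205 . - .
print(gff3_to_das2_range(22335, 23205))  # (22334, 23205)
attrs = parse_gff3_attributes("ID=FBgn0052013;Name=CG32013;gbunit=AE003845")
print(attrs["Name"])  # ['CG32013']
```

Everything the converter cannot map onto a dedicated DAS2 field (cyto_range, gbunit, dbxref_2nd, ...) would then end up in the key/value PROP elements, which is exactly the open question raised here.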
Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Sep 11 18:11:04 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 11 Sep 2006 11:11:04 -0700 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 11 Sep 2006 Message-ID: Notes from the weekly DAS/2 teleconference, 11 Sep 2006 $Id: das2-teleconf-2006-09-11.txt,v 1.1 2006/09/11 18:10:11 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed Erwin, Gregg Helt Dalke Scientific: Andrew Dalke UCLA: Allen Day, Brian O'Connor (sc, aday, bo calling in from Seattle at MGED9 jamboree) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Agenda: -------- * grant update * status reports Topic: Grant update ------------------- gh: pretty good outlook for getting funding for sep '06 to may '07. $250K. not completely official, but more so. no grant to be submitted in october. still major issues to resolve: rewriting, pi decision. size was a concern. decision about what to drop (6 sections). ad: new project starting dec/jan for 1 year. Can't work on das/2 past end of this year. product for chemical informatics. gh: can you put more time in before then. full time? 2-3 mos. ad: need to look at my schedule.
will get back to you [A] andrew talk with gregg re: increasing his das/2 time commitment Topic: Status reports (and general discussion) --------------------- gh: client to do curation in igb, write back to test server. impl thing I drew on board back at last code sprint. editing curations. making sure undo/redo capabilities in igb work. will translate into what writeback needs are. turned off in igb by default. prefs -> turn on exptl curations. can edit things, but can't connect to server. must modify code, but don't ee: gff3 parser. trouble: gff3 files in wild don't follow spec. refseq website, repository, all three fail in different ways. ucsc mailing list helped, but it wasn't their files. aday: failed on validator? ee: yes gh: the only request we had ee: not trying to write a full gff3 parser. just need gene, exon, cds, mRNA. ignore other lines and it seems compliant. but a second problem: very flexible exon parent can be mRNA, gene, or nothing. jibes with igb data model. also worked on: released new igb version. graph support handling, parsing affy files. ad: flybase files are gff3 compliant, parent/part relationship requires full file parsing. 800mb file. had to insert marker mid-file to inform parser. ee: space reduction during parsing. they have a recommended canonical rep of gene, but not required to do it. haven't found an example that follows the rec. gh: the wormbase stuff should be canonical, since lincoln did gff3 and wormbase. ad: more people writing gff3 than reading ee: ucsc discussion: grant to support more mod orgs, to include gff3 parser support. gh: that's the kind of grant we'd like to fold das grant work into if we don't do a separate das/2 grant [A] gregg look into ucsc grant, possibly fold das stuff into it ad: gff3 -> das2xml converter. some things in gff3 i don't know how to handle. key-value. Need to figure out why things aren't passing validator.
[A] andrew will write up questions, post to list, discuss there and/or with lincoln at the next das/2 teleconf. ad: modeling alignments. need a recommended way to model alignments. gh: when to use locations vs subfeatures. aday: why care about gff3? ee: igb ad: people need to convert data for das2xml. aday: need a model mapping doc. we can hash it out next week with lincoln. ad: working with berkeley xml database. liking it a lot. gh: also cool: SOLR - java thing built on top of lucene and xml db stuff. cool thing is that it layers on top of that a rest-ful approach to retrieving and writing data to a db thru http urls. queries are GETs; all writes/updates/deletes are POSTs. ad: xQuery aday: generalization of xpath ad: xslt is another generalization. sc: there was a poster at the MGED9 meeting from a stanford group using Berkeley XML db to map between 'flavors' of MAGE-ML, since organizations use different ways to represent the same thing in MAGE-ML. Represented the transformation using pairs of xQueries, one targeting format A, the other format B. All the smarts about the format were confined to the xqueries. nice. ad: I want to get feedback regarding modeling for das2, recommendation to store certain data (alignments, gff3). gh: gff3 - too open ended. lots of stuff can be in there ad: given flybase, what is the recommended way to post gff3 data. gh: i can answer your alignments issue, can't do gff3. [A] andrew will contact folks as needed regarding gff3/flybase modeling issues: suzi, chris mungall, lincoln, scott cain Other status: ------------- sc: no major progress given Netaffx update work, MGED travel. Plan is to update das/2 server code on affy server, load it with some exon array design data using gregg's new parser which is more memory efficient, and test it out. Then we'll need to migrate it off the das/1 server where the exon data hogs lots of memory, and then migrate Netaffx links to use das/2. gh: new box end of october with das grant money.
have run das2 server on 64bit. on 32bit have gotten 8g in single java process. riva. should be able to get 16g in one process. or have 2x8g. bo: allen updated assay portion, bringing igb objects up to date. mark carlson is updating hyrax client to retrieve microarray data back. he's taking das/2 client making it embeddable. eg., into the MeV tool from John Quackenbush at Harvard (java). should be embeddable in igb to browse celsius to d/l data. plan to have webstart for it. aday: updating assay portion of server. mage-ml to be in line with changes. adding/modifying element attribs, lowercase 'uri'. data loaders to get ncbi data into server for microarray expts. client lib in R for talking to das server. requires parsing xml. extremely slow, uses lots of memory, so eg., viz bed files in R, genomic location. good plotting support in R. look at distribution. regarding writeback server: on hold until you report any problems. basic stuff is working. let me know. gh: read part: caching improvements? aday: no more work on that since jamboree. public server doesn't have these improvements. plan to rewrite controller and view part. junk on this end. want to integrate block mechanism into that as well. not sure when it will happen. time estimate: maybe 1-1.5 months with bo and i working half time. bo: this rewrite will help a lot. aday: lots of little things changed, 'segment' etc. server domain source, capabilities, formats. huge mess. need more looking before i can get an accurate time estimate for patching vs. rewriting. think the rewrite wouldn't be that expensive. gh: machine? aday: dual core opteron, maybe 16g ram? load is increasing, may move off to a dedicated server. webserver is the issue, not db. Next teleconf: -------------- In two weeks. 25 Sep 2006 Special dedication: ------------------- To those who tragically lost their lives on this day five years ago... 
From Gregg_Helt at affymetrix.com Mon Sep 11 20:07:05 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 11 Sep 2006 13:07:05 -0700 Subject: [DAS2] best practices / DAS2 format examples Message-ID: > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Monday, September 11, 2006 10:53 AM > To: DAS/2 > Subject: [DAS2] best practices / DAS2 format examples > > das2-teleconf-2006-03-16.txt > > [A] Lincoln will provide use cases/examples of these features > > scenarios: > > - three or greater hierarchy features > > - multiple parents > > - alignments > > I really would like some real-world examples of these. I don't know > enough to make decent examples for the documentation and I think it > would be very useful so others can see how to model existing data > in DAS2 XML. I found a previous post from Lincoln with attached alignment examples: > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Lincoln Stein > Sent: Monday, June 05, 2006 7:32 AM > To: Andrew Dalke > Cc: DAS/2 > Subject: [DAS2] Example alignments > > Hi Andrew, > > I'm truly sorry about how long it has taken me to get these examples to you. > I hope that the example alignments in the enclosure make sense to you. > > Unfortunately I found that I had to add a new "target" attribute to > in order to make the cigar string semantics unambiguous. Otherwise you > wouldn't be able to tell how to interpret the gaps. > > Lincoln > CASE #1. A SIMPLE PAIRWISE ALIGNMENT. A simple alignment is one in which the alignment is represented as a single feature with no subfeatures. This is the preferred representation to be used when the entire alignment shares the same set of properties. This is an alignment between Chr3 (the reference) and EST23 (the target). Both aligned sequences are in the forward (+) direction. 
We represent this as a single alignment:

Chr4  100 CAAGACCTAAA-CTGGAATTCCAATCGCAACTCCTGGACC-TATCTATA 147
          |||||||X||| ||||| |||||||       ||||X||| ||||||||
EST23   1 CAAGACCAAAATCTGGA-TTCCAAT-------CCTGCACCCTATCTATA 41

This has a CIGAR gap string of M11 I1 M5 D1 M7 D7 M8 I1 M8:

M11  match 11 bp
I1   insert 1 gap into the reference sequence
M5   match 5 bp
D1   insert 1 gap into the target sequence
M7   match 7 bp
D7   insert 7 gaps into the target
M8   match 8 bp
I1   insert 1 gap into the reference
M8   match 8 bp

Content-Type: application/x-das-features+xml

NOTE: I've had to introduce a new attribute named "target" in order to distinguish the reference sequence from the target sequence. This is necessary for the CIGAR string concepts to work. Perhaps it would be better to have a "role" attribute whose values are one of "ref" and "target"? From dalke at dalkescientific.com Mon Sep 11 20:14:07 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 11 Sep 2006 22:14:07 +0200 Subject: [DAS2] best practices / DAS2 format examples In-Reply-To: References: Message-ID: Gregg: > I found a previous post from Lincoln with attached alignment examples: D'oh! My apologies for having forgotten that. Lincoln: > NOTE: I've had to introduce a new attribute named "target" in > order to distinguish the reference sequence from the target > sequence. This is necessary for the CIGAR string concepts to work. > Perhaps it would be better to have a "role" attribute whose values are > one of "ref" and "target"? Anyone have comments on that? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Sep 12 01:44:09 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 12 Sep 2006 03:44:09 +0200 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> I've been working on a writeback server. 
It will verify that the feature groups have no cycles, that if X is a parent to Y then Y is a part of X, and that all groups have a single root. I'm having a hard time with that, and it's harder than I expected. GFF3 had only one direction of relationship. As such it's impossible to assemble a feature group until the end of the file or the marker that says no lookahead is needed. We changed that in DAS2. The feature xml is bidirectional so in theory it's possible to know when the feature group is complete. But it's tricky. It's tricky enough that I want to change things slightly so that people don't need to handle the trickiness. The trickiness comes when parsing the list of features into feature groups. For example, consider

      [F3]
      /  \
  [F4]    [F2]
    |       |
  [F1]    [F5]

where the features are in the order F1, F2, ... F5. After F2 the system looks like there are two feature groups.

  [F3?]
    |
  [F4?]   [F2]
    |       |
  [F1]    [F5?]

Only after F3 can those be merged together. This requires some non-trivial bookkeeping, unless I've forgotten something simple from undergraduate data structures. Of course it's simple if you know that a feature is the last feature in a feature group either through reaching EOF or a special marker. But then what's the point of having bidirectional links if the result is no better than GFF3's only-list-parent solution. If there is a simple algorithm, please let me know. === Solution #1 === Another solution is to require that complex feature groups (groups with more than one feature) must also have a link to the root element of the feature group. I brought this up before but agreed with others that there wasn't a need for it. Now I think there is. Here's an example. 
By using a 'root' attribute, detecting the end of a feature group is almost trivial:

a FeatureGroup contains:
    - list of seen urls               # duplicates are not allowed
    - set of urls which must be seen  # duplicates are ignored

let feature_groups := mapping {root uri -> FeatureGroup}

for feature in features:
    if feature does not have a @root attribute:
        make a new FeatureGroup
        add the feature to the FeatureGroup as being seen
        let feature_groups[feature's @uri attribute] := the new FeatureGroup
    else:
        if the feature's @root attribute does not exist in feature_groups:
            # first time this feature group was seen
            create a new FeatureGroup
            let feature_groups[feature's @root attribute] := the new FeatureGroup
        get feature_groups[feature's @root attribute]
        add this feature to the FeatureGroup as a seen url
        for each uri in (feature's @uri attribute, the parent uris, the part uris):
            add the uri to the FeatureGroup's "must be seen" set
    if count(seen urls) == count(must be seen urls):
        the feature group is complete / assemble the links

Assembly of a feature group occurs as soon as all the features are available, rather than waiting for the end. This makes life much simpler for the writeback, and I assume also for the client code. Assuming the client code doesn't just wait until the end of the input before it does anything. Gregg? Do you wait until the end of the XML to assemble hierarchical features? If so, do you need parent/part or will parent suffice? Or do you do all the bookkeeping as you get the data? How complex is the code? There are other solutions: === Solution #2 === - require that the features are ordered so that parents come before a part I think this is hard because relational databases aren't naturally ordered. The normal trick is to put an extra field and "sort by", but then the server has to maintain the correct ordering. It's fragile. === Solution #3 === - put all elements for a given feature group explicitly inside of a FEATURE_GROUP element. 
Eg, (in this case simple features not part of a complex parent/part relationship need not be in a FEATURE_GROUP.) This is the easiest solution. I like it because it's the easiest to figure out. Even the algorithm above is hard by comparison. If I had my choice we would do this instead of determining the feature group by analysis of the parent/part linkages. Note that with this change there's no longer a need for the PART element. We would only need PARENT. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Sep 14 16:08:52 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 14 Sep 2006 18:08:52 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> References: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> Message-ID: <43a8414f574e1bd8205bfb16b6adbe93@dalkescientific.com> Me [Andrew]: > I've been working on a writeback server. It will verify that the > feature groups have no cycles, that if X is a parent to Y then Y > is a part of X, and that all groups has a single root. > > I'm having a hard time with that, and harder than I expected. I listed three alternatives and am hoping for feedback by email rather than waiting another 10 days for the next phone conference. They are:

- FEATURE elements add a "root" attribute pointing to the top-level feature for the feature group
- FEATURE elements must be listed in top-down order
- Features in the same feature group are inside a new FEATURE_GROUP element

Or, someone can show me a simple O(n) algorithm for building up the feature group such that complete groups can be processed before reaching the end of the feature data set. Any comments? 
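The first alternative (the "root" attribute) can be sketched in Python. This is a hypothetical illustration of the algorithm, not DAS2 client code; the Feature class and its field names are invented stand-ins for a parsed FEATURE element:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a parsed FEATURE element.
@dataclass
class Feature:
    uri: str
    root: str = None     # value of the proposed @root attribute, if any
    parents: tuple = ()  # uris of PARENT links
    parts: tuple = ()    # uris of PART links

class FeatureGroup:
    def __init__(self):
        self.seen = {}         # uri -> feature; duplicates are an error
        self.must_see = set()  # every uri referenced anywhere in the group

    def add(self, feature):
        if feature.uri in self.seen:
            raise ValueError("duplicate feature " + feature.uri)
        self.seen[feature.uri] = feature
        self.must_see.add(feature.uri)
        self.must_see.update(feature.parents)
        self.must_see.update(feature.parts)

    def complete(self):
        # the group is done once every referenced uri has been seen
        return len(self.seen) == len(self.must_see)

def assemble(features):
    """Yield each feature group as soon as its last member arrives,
    without waiting for end-of-input."""
    groups = {}
    for f in features:
        # features without @root are keyed by their own uri
        key = f.root if f.root is not None else f.uri
        group = groups.setdefault(key, FeatureGroup())
        group.add(f)
        if group.complete():
            yield groups.pop(key)
```

With the F1..F5 example from the earlier message, the group is emitted the moment F5 arrives; a standalone simple feature is emitted immediately.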
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sun Sep 17 08:20:49 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 17 Sep 2006 10:20:49 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <6dce9a0b0609151608m3b06881at79127b95d08cd40c@mail.gmail.com> References: <795173f891edcdfead41a676f741d8b0@dalkescientific.com> <6dce9a0b0609151608m3b06881at79127b95d08cd40c@mail.gmail.com> Message-ID: <41498edc722a0faf292380b221733e55@dalkescientific.com> On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote: > Hi Andrew, > > Grouping them into a set is almost equivalent to the > "end of > feature set" marker in GFF3, which is why I favor that solution. If > we do this, should we adopt the same convention for the GET requests > as well? If so, should we get rid of bidirection references? (I did notice that the GFF3 data sets I found, like wormbase, don't have the "end of feature set" marker. My GFF3 parser has about 10x memory overhead so parsing an 80MB input file thrashed my 1GB laptop. Adding a single marker in the middle, by hand, made it much happier.) If we have a FEATURE_GROUP such that features in that group are all connected to each other and only to each other, then I have no problem getting rid of the child link. It adds no benefits in that case but does cause the verification overhead of checking that both directions are correct. Andrew dalke at dalkescientific.com From Ed_Erwin at affymetrix.com Mon Sep 18 16:59:55 2006 From: Ed_Erwin at affymetrix.com (Erwin, Ed) Date: Mon, 18 Sep 2006 09:59:55 -0700 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: I think the simplest solution for parsing is: while parsing the file, read objects F1, F2, F3, F4, F5 into memory but don't even try to hook up parents and children yet. 
After finished reading the file, and getting rid of all the memory overhead associated with XML-parsing, loop through the objects that you've read and link parents to children. Their order no longer matters because they are all in memory, probably in a hashmap linking ID to object. -----Original Message----- From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke Sent: Monday, September 11, 2006 6:44 PM To: DAS/2 Subject: [DAS2] feature group assembly; proposals for simplification I've been working on a writeback server. It will verify that the feature groups have no cycles, that if X is a parent to Y then Y is a part of X, and that all groups has a single root. I'm having a hard time with that, and harder than I expected. GFF3 had only one direction of relationship. As such it's impossible to assemble a feature group until the end of the file or the marker that no lookahead is needed. We changed that in DAS2. The feature xml is bidirectional so in theory it's possible to know when the feature group is complete. But it's tricky. It's tricky enough that I want to change things slightly so that people don't need to handle the trickiness. The trickiness comes when parsing the list of features into feature groups. For example, consider [F3] / \ [F4] [F2] | | [F1] [F5] where the features are in the order F1, F2, ... F5. After F2 the system looks like there are two feature groups. [F3?] | [F4?] [F2] | | [F1] [F5?] Only after F3 can those be merged together. This requires some non-trivial bookkeeping, unless I've forgotten something simple from undergraduate data structures. Of course it's simple if you know that a feature is the last feature in a feature group either through reaching EOF or a special marker. But then what's the point of having bidirectional links if the result is no better than GFF3's only-list-parent solution. If there is a simple algorithm, please let me know. 
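Ed's read-everything-then-link approach can be sketched as follows. This is a hypothetical illustration: the dict-based records stand in for IGB's feature objects, and the field names are invented.

```python
# Two-pass linking: first read all records, then hook up parents and
# children via an id -> record map, so input order never matters.

def link_features(records):
    """records: dicts with optional 'id' and 'parent_ids' keys.
    Returns the parentless (root) records, with 'children' filled in."""
    # pass 1: index every record that has an ID
    by_id = {}
    for rec in records:
        rec.setdefault("children", [])
        if rec.get("id"):
            by_id[rec["id"]] = rec

    # pass 2: attach each record to its parent(s); keep roots separate
    roots = []
    for rec in records:
        parent_ids = rec.get("parent_ids") or []
        if not parent_ids:
            roots.append(rec)
        for pid in parent_ids:
            by_id[pid]["children"].append(rec)
    return roots
```

The `by_id` map plays the role of the hashmap Ed mentions; it can be discarded once linking is done.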
From Ed_Erwin at affymetrix.com Mon Sep 18 16:54:59 2006 From: Ed_Erwin at affymetrix.com (Erwin, Ed) Date: Mon, 18 Sep 2006 09:54:59 -0700 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: Andrew, I'm having trouble understanding where all this memory overhead comes from in your parsing of GFF3 files. I've recently written a GFF3 parser for IGB. I've found that the presence or absence of the "end of feature set" marker "###" has little effect on the amount of memory required. The procedure is quite simple.

For each line in the GFF3 file, create an object in memory.
Add that object to a list.
If the object has an ID, store the "ID to object" mapping in a hashmap.

At the end of file (or each "###" mark):
Loop through the complete list of objects.
For each one claiming to have one or more Parent_IDs, find those parents in the hashmap, add it as a child of those parents and remove it from the original list (which will then contain only parentless objects).

That is all. At the end you can throw away the hashmap. During processing you have to have one hashmap. But I don't see how that adds a whole lot to the memory overhead. In our model, each of the memory objects representing one feature keeps a list of pointers to its children. While first reading the file, those pointers are left null, then the lists are constructed on the second pass (after the "###" marks). (In IGB, the final destination of the data is some in-memory objects. If your final destination is a database, then you can be writing each line to the database as it is read and then check for consistency of parents and children later. You don't even need the in-memory hashmap then, because you can use a database table.) So basically, I just don't understand what problem you are trying to solve. I don't object to adding a FEATURE_GROUP element, and I don't much care whether there are bi-directional references. 
Bi-directional references do not seem necessary to me, and really just seem like a likely place for the users to make mistakes, but I don't see any reason to change the spec now. If there are bi-directional references, you can proceed exactly as above. The primary references are references to the parents. But when hooking a feature up to its parent, you can then check that the parent has listed this child as one of its expected children. (You in fact get a bit of a boost because each parent knows how many children it expects, so you can set up the child List objects with the correct size from the beginning.) Ed -----Original Message----- From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke Sent: Sunday, September 17, 2006 1:21 AM To: lincoln.stein at gmail.com Cc: DAS/2 Subject: Re: [DAS2] feature group assembly; proposals for simplification On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote: > Hi Andrew, > > Grouping them into a set is almost equivalent to the > "end of > feature set" marker in GFF3, which is why I favor that solution. If > we do this, should we adopt the same convention for the GET requests > as well? If so, should we get rid of bidirection references? (I did notice that the GFF3 data sets I found, like wormbase, don't have the "end of feature set" marker. My GFF3 parser has about 10x memory overhead so parsing an 80MB input file thrashed my 1GB laptop. Adding a single marker in the middle, by hand, made it much happier.) If we have a FEATURE_GROUP such that features in that group are all connected to each other and only to each other, then I have no problem getting rid of the child link. It adds no benefits in that case but does cause the verification overhead of checking that both directions are correct. 
Andrew dalke at dalkescientific.com _______________________________________________ DAS2 mailing list DAS2 at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/das2 From lstein at cshl.edu Mon Sep 18 17:23:38 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 18 Sep 2006 17:23:38 +0000 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: References: Message-ID: <6dce9a0b0609181023rc9aebe9sa452c4b4f7cd8d7b@mail.gmail.com> Hi, My GFF3 parser works in a similar manner. As each feature comes in, it is parsed, turned into an object, and sent to a disk-based database. The parent link is kept in an in-memory data structure. At the end of the parse, the parent link data structure is traversed and then the table of parent/child relationships is written out to disk. Lincoln On 9/18/06, Erwin, Ed wrote: > > > Andrew, > > I'm having trouble understanding where all this memory overhead comes > from in your parsing of GFF3 files. I've recently written a GFF3 parser > for IGB. I've found that the presence or absence of the "end of feature > set" marker "###" has little effect on the amount of memory required. > > The procedure is quite simple. > > For each line in the GFF3 file, create an object in memory. > Add that object to a list. > If the object has an ID, store the "ID to object" mapping in a hashmap. > > At the end of file (or each "###" mark) > Loop through the complete list of objects. > For each one claiming to have one or more Parent_ID's, find those > parents in the hashmap, add it as a child of those parents and remove it > from the original list (which will then contain only parentless > objects). > > > That is all. At the end you can throw away the hashmap. > > During processing you have to have one hashmap. But I don't see how > that adds a whole lot to the memory overhead. In our model, each of the > memory objects representing one feature keeps a list of pointers to its > children. 
While first reading the file, those pointers are left null, > then the lists are constructed on the second pass (after the "###" > marks). > > (In IGB, the final destination of the data is some in-memory objects. > If your final destination is a database, then you can be writing each > line to the database as it is read and then check for consistency of > parents and children later. You don't even need the in-memory hashmap > then, because you can use a database table.) > > So basically, I just don't understand what problem you are trying to > solve. I don't object to adding , and I don't much care > whether there are bi-directional references. Bi-directional references > do not seem necessary to me, and really just seems like a likely place > for the users to make mistakes, but I don't see any reason to change the > spec now. > > If there are bi-directional references, you can proceed exactly as > above. The primary references are references to the parents. But when > hooking a feature up to its parent, you can then check that the parent > has listed this child as one of its expected children. (You in fact get > a bit of a boost because since each parent knows how many children it > expects, you can set-up the child List objects with the correct size > from the beginning.) > > Ed > > > -----Original Message----- > From: das2-bounces at lists.open-bio.org > [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke > Sent: Sunday, September 17, 2006 1:21 AM > To: lincoln.stein at gmail.com > Cc: DAS/2 > Subject: Re: [DAS2] feature group assembly; proposals for simplification > > On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote: > > Hi Andrew, > > > > Grouping them into a set is almost equivalent to the > > "end of > > feature set" marker in GFF3, which is why I favor that solution. If > > we do this, should we adopt the same convention for the GET requests > > as well? If so, should we get rid of bidirection references? 
> > (I did notice that the GFF3 data sets I found, like wormbase, don't have > the "end of feature set" marker. My GFF3 parser has about 10x memory > overhead > so parsing a 80MB input file thrashed my 1GB laptop. Adding a single > marker in the middle, by hand, made it much happier.) > > If we have a such that features in that group are all > connected to other and only to each other, then I have no problem > getting > rid of the child link. It adds no benefits in that case but does cause > the verification overhead of checking that both directions are correct. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Sep 18 19:11:03 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 18 Sep 2006 21:11:03 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: References: Message-ID: Ed: > I'm having trouble understanding where all this memory overhead comes > from in your parsing of GFF3 files. I've recently written a GFF3 > parser > for IGB. I've found that the presence or absence of the "end of > feature > set" marker "###" has little effect on the amount of memory required. How big was the data set? dmel-3R-r4.3.gff from flybase is 68,685,595 bytes. Strange though now that I look at it. I shouldn't have a 10x overhead. I'm looking at the memory use now. I estimate my data structures used roughly 340 bytes per feature. 
Each line averages 80 characters so 4.25x overhead and not 10x. Very strange. I'll need to dig into that some more. I did find that I wasted a lot of space with small data structures.

class Location(object):
    def __init__(self, id, start, end):
        self.id, self.start, self.end = id, start, end

takes 190 bytes per instance. When I change it to use slots instead of a dictionary for attribute storage (deep Python trickery)

class Location(object):
    __slots__ = ["id", "start", "end"]
    def __init__(self, id, start, end):
        self.id, self.start, self.end = id, start, end

I use about 48 bytes per object. That'll save about 122MB and take me away from the edge of memory use. I had used that trick on my other data objects - I somehow missed Location. I suspect the other big memory use is in the attribute table for things like ID=80799wgsext-hsp;Name=80799wgsext Each string has 16 bytes of overhead, I think, so 32 bytes for each use of "ID" and "Name". By interning those two frequent strings I can save about 20 bytes per record (70% of flybase records have ID, 55% have Name) or 19MB. > The procedure is quite simple. That's the first step. For sanity checking you should do cycle detection, and likely check that the structure is single-rooted. > During processing you have to have one hashmap. But I don't see how > that adds a whole lot to the memory overhead. It wasn't. It was the per-record overhead. > So basically, I just don't understand what problem you are trying to > solve. The reason for bidirectional links was to allow processing while receiving data rather than waiting until the end. With bi-di you can in principle determine that a feature group is complete when the last feature in the group arrives. > I don't object to adding , and I don't much care > whether there are bi-directional references. 
Bi-directional references > do not seem necessary to me, and really just seems like a likely place > for the users to make mistakes, but I don't see any reason to change > the > spec now. If it's error prone (I agree that it is) and it's hard to use (which I now believe) and no one will use it for its intended goal (likely?) and it breaks no code to remove it then I see little reason to keep it. If processing while downloading is desirable then the easiest solution to use is a FEATURE_GROUP, but the solution with the least change to the existing spec is a "root=" attribute. If processing while downloading is not sufficiently desirable then there's no need for bi-di links and we can drop the PART element and have the data structure be closer to GFF3. > (You in fact get > a bit of a boost because since each parent knows how many children it > expects, you can set-up the child List objects with the correct size > from the beginning.) Only if the parents are listed first. Otherwise there's no hint for the correct size. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Sep 18 19:20:51 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 18 Sep 2006 21:20:51 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: <6dce9a0b0609181023rc9aebe9sa452c4b4f7cd8d7b@mail.gmail.com> References: <6dce9a0b0609181023rc9aebe9sa452c4b4f7cd8d7b@mail.gmail.com> Message-ID: <05439c737629728d1531fd57f5f1b195@dalkescientific.com> Lincoln: > My GFF3 parser works in a similar manner. As each feature comes in, it > is > parsed, turned into an object, and sent to a disk-based database. I was writing a GFF3 to DAS2XML converter. With bi-di links each record needs link data for both directions before writing the record. I could do intermediate saves to the disk, but that's more work than I wanted to do. I can change my converter to use less memory - quite a bit less with a bit more work. I've not optimized much for memory, mostly for speed. Another solution is to get rid of bi-di links and have only parent links. In that case the conversion is trivial, excepting the steps to check for cycles and single-rooted groups. 
But that's only if people don't sufficiently want the ability to process complete features while other features are being up/downloaded. Andrew dalke at dalkescientific.com From Ed_Erwin at affymetrix.com Mon Sep 18 22:01:19 2006 From: Ed_Erwin at affymetrix.com (Erwin, Ed) Date: Mon, 18 Sep 2006 15:01:19 -0700 Subject: [DAS2] feature group assembly; proposals for simplification Message-ID: I have mostly used smaller examples from NCBI, but I've downloaded that wormbase one to play with as a good test of a big file. I took file "3R.gff" from here ftp://flybase.net/genomes/Drosophila_melanogaster/current/gff/ I need something a little more than 2x the filesize to store that data and to store the graphical objects used to represent it. (I haven't looked at exactly how much is data vs. graphics.) Since IGB keeps everything in memory, we have optimized for memory rather than speed. One of the tricks here is that I don't create a hashmap for the attributes. I simply store the attributes string as a string. I then have to do some regex processing each time I want to extract a property value, but that isn't very often and I intentionally chose memory efficiency over speed. The bigger problem seems to be that every GFF3 file I've seen in the wild has violated the specification. Every file I've tried has failed the validator, and it isn't even a very strict validator. In this case, one of the big things is that almost every feature has "ID=-". If I interpret that literally, then all those lines should be joined into one big feature. (I assume what was intended in this case is that these are features without an ID, so I've added a special case to handle that.) This is getting off topic of DAS/2, but I'm trying to collect a list of questionable things I've seen in GFF3 files and I'll try to get Lincoln to rule on whether they are valid. 
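A sketch of a GFF3 column-9 attribute parser that copes with the quirks discussed in this thread: the "ID=-" placeholder, percent-escaped tags, and duplicate tags. The `parse_attrs` helper and its exact behavior are illustrative assumptions, not part of any spec:

```python
from urllib.parse import unquote

def parse_attrs(column9):
    """Parse a GFF3 column-9 string into {tag: [values]}.

    Splitting on ';', '=' and ',' happens *before* unescaping, so a
    literal ';' or '=' inside a value must arrive percent-encoded.
    Duplicate tags accumulate, treating Parent=A;Parent=B the same
    as Parent=A,B.
    """
    attrs = {}
    for field in column9.split(";"):
        if not field or field == ".":
            continue
        tag, _, value = field.partition("=")
        tag = unquote(tag)
        values = [unquote(v) for v in value.split(",")]
        attrs.setdefault(tag, []).extend(values)
    # Treat the flybase-style placeholder "ID=-" as "no ID"
    # (the special case Ed describes adding).
    if attrs.get("ID") == ["-"]:
        del attrs["ID"]
    return attrs

attrs = parse_attrs("Parent=AB123;Parent=XY987;Name=80799wgsext")
assert attrs["Parent"] == ["AB123", "XY987"]
# An escaped tag still resolves: %49%44 is "ID".
assert parse_attrs("%49%44=x")["ID"] == ["x"]
assert "ID" not in parse_attrs("ID=-;Name=n")
```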
Ed -----Original Message----- From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke Sent: Monday, September 18, 2006 12:11 PM To: DAS/2 Subject: Re: [DAS2] feature group assembly; proposals for simplification Ed: > I'm having trouble understanding where all this memory overhead comes > from in your parsing of GFF3 files. I've recently written a GFF3 > parser > for IGB. I've found that the presence or absence of the "end of > feature > set" marker "###" has little effect on the amount of memory required. How big was the data set? dmel-3R-r4.3.gff from flybase is 68,685,595 bytes. Strange though now that I look at it. I shouldn't have a 10x overhead. .... > The procedure is quite simple. That's the first step. For sanity checking you should do cycle detection, and likely check that the structure is single-rooted. .... > (You in fact get > a bit of a boost because since each parent knows how many children it > expects, you can set-up the child List objects with the correct size > from the beginning.) Only if the parents are listed first. Otherwise there's no hint for the correct size. From dalke at dalkescientific.com Mon Sep 18 22:59:51 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 19 Sep 2006 00:59:51 +0200 Subject: [DAS2] feature group assembly; proposals for simplification In-Reply-To: References: Message-ID: Ed: > I took file "3R.gff" from here > ftp://flybase.net/genomes/Drosophila_melanogaster/current/gff/ "current" is dmel_r4.3_20060303 . I'm also using "4.3" but I have different data. It has 3R sim4 na_transcript_dmel_r31 380 1913 . + . ID=-; where I have 3R sim4:na_transcript_dmel_r31 match 380 1913 . + . ID=:315834 > Since IGB keeps everything in memory, we have optimized for memory > rather than speed. One of the tricks here is that I don't create a > hashmap for the attributes. Hmmm. My parser doesn't handle that, at least not without a bit of monkey patching. 
Thinking about it some .. that defers errors until later .. what errors? .. ahh, if a field doesn't have a "=" in it then my code will raise an exception.

> I simply store the attributes string as a
> string. I then have to do some regex processing each time I want to
> extract a property value, but that isn't very often and I intentionally
> chose memory efficiency over speed.

I didn't think regexps were the right solution for that. Well, not unless you're using them for single character search. For example, URL escaping rules are used for tags or values containing the following characters: ",=;". That means you can't search for "ID=" attributes using the pattern "ID=([^;]+)" because "ID" could be written as "%49%44".

> The bigger problem seems to be that every GFF3 file I've seen in the
> wild has violated the specification. Every file I've tried has failed
> the validator, and it isn't even a very strict validator.

That's why I suspect GFF3 isn't used as input. Otherwise these would have been noticed and fixed.

> In this case, one of the big things is that almost every feature has
> "ID=-". If I interpret that literally, then all those lines should be
> joined into one big feature. (I assume what was intended in this case
> is that these are features without an ID, so I've added a special case
> to handle that.)

In my version of the data set there can be IDs. What I found from looking at other data sets is that the ID can be duplicated, but I don't complain until assembling the complex feature, and only when there is a "parent" which uses a duplicate id. A small part of my memory overhead (about 70 bytes per record) tracks those duplicates. I had forgotten about this in my previous calculations.

> This is getting off topic of DAS/2, but I'm trying to collect a list of
> questionable things I've seen in GFF3 files and I'll try to get Lincoln
> to rule on whether they are valid.

I sent others to him last spring and he replied to me. Here they are in summary.
Some were requests for clarification.

Q. Can the start and end position be '.'?
A. Yes, it's allowed in the spec.

Q. Can the seqid be "."?
A. "This is allowed by the spec, but I hope it would never happen. It means there is a floating feature that has no location. It should probably be forbidden for seqid to be . and start and end to be defined. Shall I modify the GFF3 spec to state so?" I see now I didn't respond: "yes" is my answer.

Q. Can the 9th field be "."?
A. This is ok.

Q. Are zero length tags allowed? Eg, an attribute field of "=5". [...] I use a dictionary key of "".
A. Allowed.

Q. Should parsers raise an exception if the two characters after the '%' are not hex characters?
A. Yes. (Note that my parser currently does not catch that error.)

Q. Are duplicate attribute tags allowed, as in Parent=AB123;Parent=XY987? If so, is it equivalent to Parent=AB123,XY987?
A. Absolutely! This is allowed and encouraged.

Andrew
dalke at dalkescientific.com

From allenday at ucla.edu Tue Sep 19 17:06:24 2006
From: allenday at ucla.edu (Allen Day)
Date: Tue, 19 Sep 2006 10:06:24 -0700
Subject: [DAS2] feature group assembly; proposals for simplification
In-Reply-To: <795173f891edcdfead41a676f741d8b0@dalkescientific.com>
References: <795173f891edcdfead41a676f741d8b0@dalkescientific.com>
Message-ID: <5c24dcc30609191006p44b955d4kea78bd81c42b8c26@mail.gmail.com>

I wrote a writeback parser for the current XML style, and although I did not add code to reject multi-rooted groups (which may not be appropriate anyway), I didn't find the book-keeping to be particularly onerous. If I understand correctly, the complaint isn't the book-keeping itself, but rather the memory requirements imposed by the book-keeping. Why not just give HTTP 413 (request entity too large) if you don't like the size of the file being uploaded?
Gregg and I had a discussion about likely writeback document sizes during the last code sprint, and Genoviz is likely to be giving documents in the 1-50KB range -- nowhere near 80MB of GFF3 worth of features.

-Allen

On 9/11/06, Andrew Dalke wrote:
>
> I've been working on a writeback server. It will verify that the
> feature groups have no cycles, that if X is a parent to Y then Y
> is a part of X, and that all groups have a single root.
>
> I'm having a hard time with that, and harder than I expected.
>
> GFF3 had only one direction of relationship. As such it's impossible
> to assemble a feature group until the end of the file or the
> marker that no lookahead is needed.
>
> We changed that in DAS2. The feature xml is bidirectional so
> in theory it's possible to know when the feature group is complete.
> But it's tricky. It's tricky enough that I want to change things
> slightly so that people don't need to handle the trickiness.
>
> The trickiness comes when parsing the list of features into
> feature groups. For example, consider
>
>         [F3]
>        /    \
>     [F4]    [F2]
>      |        |
>     [F1]    [F5]
>
> where the features are in the order F1, F2, ... F5. After F2
> the system looks like there are two feature groups.
>
>     [F3?]
>      |
>     [F4?]   [F2]
>      |        |
>     [F1]    [F5?]
>
> Only after F3 can those be merged together. This requires some
> non-trivial bookkeeping, unless I've forgotten something simple
> from undergraduate data structures.
>
> Of course it's simple if you know that a feature is the last feature
> in a feature group either through reaching EOF or a special marker.
> But then what's the point of having bidirectional links if the
> result is no better than GFF3's only-list-parent solution?
>
> If there is a simple algorithm, please let me know.
>
> === Solution #1 ===
>
> Another solution is to require that complex feature groups (groups
> with more than one feature) must also have a link to the root element
> of the feature group.
> I brought this up before but agreed with others that there wasn't a
> need for it. Now I think there is.
>
> Here's an example.
>
>
>
> By using a 'root' attribute, detecting the end of a feature group is
> almost trivial:
>
> a FeatureGroup contains:
>   - list of seen urls               # duplicates are not allowed
>   - set of urls which must be seen  # duplicates are ignored
>
> let feature_groups := mapping {root uri -> FeatureGroup}
>
> for feature in features:
>     if feature does not have a @root attribute:
>         make a new FeatureGroup
>         add the feature to the FeatureGroup as being seen
>         let feature_groups[feature's @uri attribute] := the new FeatureGroup
>     else:
>         if the feature's @root attribute does not exist in feature_groups:
>             # first time this feature group was seen
>             create a new FeatureGroup
>             let feature_groups[feature's @root attribute] := the new FeatureGroup
>         get feature_groups[feature's @root attribute]
>         add this feature to the FeatureGroup as a seen url
>         for each uri in (feature's @uri attribute, the parent uris, the part uris):
>             add the uri to the FeatureGroup's "must be seen" set
>         if count(seen urls) == count(must be seen urls):
>             the feature group is complete / assemble the links
>
> Assembly of a feature group occurs as soon as all the features are
> available, rather than waiting for the end.
>
> This makes life much simpler for the writeback, and I assume also for
> the client code. Assuming the client code doesn't just wait until the
> end of the input before it does anything.
>
> Gregg? Do you wait until the end of the XML to assemble hierarchical
> features? If so, do you need parent/part or will parent suffice? Or do
> you do all the bookkeeping as you get the data? How complex is the code?
>
> There are other solutions:
>
> === Solution #2 ===
> - require that the features are ordered so that parents come before a part
>
> I think this is hard because relational databases aren't naturally
> ordered.
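Solution #1's pseudocode (quoted above) could be implemented roughly as follows. The field names `uri`, `root`, `parents` and `parts`, and the dict-per-feature shape, are assumptions for illustration, since the XML example in the original mail was lost:

```python
class FeatureGroup:
    def __init__(self):
        self.seen = []         # URIs of features already received
        self.must_see = set()  # URIs mentioned anywhere in the group

def assemble(features, on_complete):
    """Stream features; call on_complete(group) as soon as every URI
    a group mentions has actually been seen.

    Each feature is a dict with 'uri', an optional 'root', and
    optional 'parents'/'parts' URI lists (hypothetical field names).
    """
    groups = {}
    for f in features:
        root = f.get("root")
        if root is None:
            # A simple, single-feature group is complete immediately.
            g = FeatureGroup()
            g.seen.append(f["uri"])
            on_complete(g)
            continue
        g = groups.setdefault(root, FeatureGroup())
        g.seen.append(f["uri"])
        g.must_see.add(f["uri"])
        g.must_see.update(f.get("parents", []))
        g.must_see.update(f.get("parts", []))
        if len(g.seen) == len(g.must_see):
            on_complete(g)
            del groups[root]

# A two-feature group completes when its second feature arrives,
# without waiting for end-of-input.
done = []
assemble(
    [{"uri": "f1", "root": "f1", "parts": ["f2"]},
     {"uri": "f2", "root": "f1", "parents": ["f1"]}],
    done.append)
assert len(done) == 1 and sorted(done[0].seen) == ["f1", "f2"]
```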
> The normal trick is to put an extra field and "sort by", but then the > server > has to maintain the correct ordering. It's fragile. > > === Solution #3 === > > - put all elements for a given feature group explicitly > inside of a element. > > Eg, > > > > > > > > /> > > (in this case simple features not part of a complex parent/part > relationship > need not be in a FEATURE_GROUP.) > > This is the easiest solution. I like it because it's the easiest to > figure > out. Even the algorithm above is hard by comparison. > > If I had my choice we would do this instead of determining the feature > group > by analysis of the parent/part linkages. > > Note that with this change there's no longer need for the PART element. > We would only need PARENT. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 > From allenday at ucla.edu Thu Sep 21 07:30:52 2006 From: allenday at ucla.edu (Allen Day) Date: Thu, 21 Sep 2006 00:30:52 -0700 Subject: [DAS2] das2 diagrams, questions Message-ID: <5c24dcc30609210030k5324378fy18990dc41a1f1b1e@mail.gmail.com> Hi, I am getting ready to do a server-side rewrite, so I took some time to diagram out where we are from the current spec documents. See the attached file, particularly pages 5-6. I have a few questions, mostly targeted at Andrew, regarding the current HTML version of the spec on the biodas.org site. It hasn't been updated in about 5 months, and looks pretty out of date. * Is the HTML document in sync with the "new_spec.txt" document in CVS? * There is mention of a "fasta" command, and its fragment is linked from the ToC of the genome retrievals document, but it does not appear in the document. Does this command exist? My understanding from conference calls is that the sequence/fasta/segment/dna stuff has all merged into the "segment" response. Is this correct? 
* The "property" command seems to have disappeared. Is that correct? Are property keys no longer URIs? Also, the "prop-*" feature filters could be better described; it is not clear to me if they are meant as some sort of replacement for the property command.

This document also contains a few diagrams on pages 1-4 describing how the writeback, block caching/flushing, and dynamic feature generation (a.k.a. "analysis DAS") all fit together.

-Allen

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DAS2_overview.pdf
Type: application/pdf
Size: 362762 bytes
Desc: not available
URL:

From rowankuiper at hotmail.com Sun Sep 24 08:17:38 2006
From: rowankuiper at hotmail.com (Rowan Kuiper)
Date: Sun, 24 Sep 2006 08:17:38 +0000
Subject: [DAS2] current status of DAS
Message-ID:

I have a few questions. I'm a bioinformatics student, and for an internship at the Erasmus University in Holland I have to investigate the current status of DAS. I've been trying to work with DAS a couple of weeks now, and the impression I get is that it is a bit messy. Perhaps this is because I don't understand DAS very well; perhaps you can explain it to me.

- First of all, will DAS2 ever be finished? I saw on the biodas site that the 2-year development started in 2004. But when I looked at sites that should propagate the development, DAS seems to be out of focus. Do you think DAS is still alive, or is there something else that took its place?

- Why don't all servers support all commands? Some reference servers, for example, don't support the entry_point command. How do I request features when I don't know which segments the server contains? I imagine that these great differences in how to use different servers could be very problematic when implementing a viewer.

- It seems that the only way to retrieve information from a server is to do a request for a certain region. Is there a way to ask for specific features?
- Is the Sanger Registry Server reliable, or is it something of the past? It would be very nice if all available sources were listed there, but just a small part of the sources I found were in the list.

- When I have to serve data that needs some extension on the XML structure, would it be a problem to just do it? How would clients handle these extensions: ignore them, or somehow parse them?

- And last, one of the goals of DAS is to be able to integrate biological data. When I, for example, want to compare my data to EnsEMBL features, I will have to set up my own server that serves features referenced to the same genome as the EnsEMBL features. So I wonder if there exist reference servers that contain the current genomes of EnsEMBL, NCBI or UCSC. I found http://das.ensembl.org/das/ensembl_Homo_sapiens_core_38_36, which always replies with an out of memory error even with the entry_points command, and http://das.ensembl.org/das/ensembl1834, which seems to work properly, but its transcript server ens1834trans also returns out of memory errors.

I think that there are some people here that can tell me their view on the subject.

Thanks in advance,
Rowan Kuiper

From ap3 at sanger.ac.uk Mon Sep 25 09:29:24 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 25 Sep 2006 10:29:24 +0100
Subject: [DAS2] current status of DAS
In-Reply-To:
References:
Message-ID: <1be3231582c70efe24e848c94409b674@sanger.ac.uk>

Hi Rowan!

> - Is the Sanger Registry Server reliable or is it something of the past? It
> would be very nice if all available sources were listed there but just a
> small part of the sources I found were in the list.

I am the administrator of the DAS registration server. DAS is a collaborative approach to share biological data. It is actively being used by many institutions around the world. The DAS registry was developed in order to make it easier to discover DAS servers. DAS does not force anybody to get their servers registered.
That's why some servers might not be listed. Usually, if I learn about a server that is not there yet, I will contact the administrator and invite him/her to register. If you know of any servers that are not registered in the DAS registry, please let me know and I will take care of it.

Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

From ak at ebi.ac.uk Mon Sep 25 10:07:23 2006
From: ak at ebi.ac.uk (Andreas Kahari)
Date: Mon, 25 Sep 2006 11:07:23 +0100
Subject: [DAS2] current status of DAS
In-Reply-To:
References:
Message-ID: <20060925100723.GE31706@ebi.ac.uk>

On Sun, Sep 24, 2006 at 08:17:38AM +0000, Rowan Kuiper wrote:
> I have a few questions. I'm a bioinformatics student and for an internship
> at the Erasmus University in Holland I have to investigate the current
> status of DAS. I've been trying to work with DAS a couple of weeks now and
> the impression I get is that it is a bit messy. Perhaps this is because I
> don't understand DAS very well and can you explain it to me.

DAS is a specification of a communication protocol originally intended to provide a web service for serving GFF-like annotation data.

> - First of all, will DAS2 ever be finished. I saw on the biodas site that
> the 2 year development started in 2004. But when I looked at sites that
> should propagate the development, DAS seems to be out of focus. You think
> DAS is still alive or is there something else that took its place?

The stagnation of the DAS/2 developments that you refer to is outside of what I know very much about (the frequent telephone conference mailings on this list suggest it's not stagnated at all). I work full time with a large number of research groups who are using DAS/1 as a tool for data integration in various ways. So in Europe, at least, DAS/1 is very much alive.
Also, within the Ensembl Genome Browser (www.ensembl.org), more things are done through DAS than you might think.

> - Why don't all servers support all commands. Some reference servers for
> example don't support the entry_point command. How do I request features
> when I don't know which segments the server contains? I imagine that these
> great differences in how to use different servers could be very problematic
> when implementing a viewer.

Lazy maintainers, possibly? Could you please provide us with concrete examples of these reference servers? If any of them are within my control, this would give me a chance to fix them.

> - It seems that the only way to retrieve information from a server is to do
> a request for a certain region. Is there a way to ask for specific
> features.

This is an artefact of the way genomic annotation viewers work. They provide the user with a view of a genomic region at a time. According to the specification (DAS/1), the 'features' request may be tailored to only return certain feature IDs on a given segment using the 'feature_id=ID' argument. Whether this capability is implemented by a particular server or not should be evident from the HTTP headers sent back from the server. Again, since people are lazy (me too), and since clients never, as far as I am aware, make use of this capability, it is seldom implemented.

> - Is the Sanger Registry Server reliable or is it something of the past? It
> would be very nice if all available sources were listed there but just a
> small part of the sources I found were in the list.

I'll leave this one for Andreas Prlic.

> - When I have to serve data that needs some extension on the XML structure,
> would it be a problem to just do it. How would clients handle these
> extensions. Ignore them or somehow parse them?

You're free to add whatever XML you feel a need to add. A well behaved DAS client will ignore it.
If the response still contains the necessary bits and bobs, then it is in my opinion still DAS; otherwise you've broken the protocol and the response will be unusable by any existing client. There is no magic in clients that will tell them to look for XML structures that are not specified as being part of the DAS response.

> - And last, one of the goals of DAS is to be able to integrate biological
> data. When I for example want to compare my data to EnsEMBL features I will
> have to set up my own server that serves features referenced to the same
> genome as the EnsEMBL features. So I wonder if there exist reference
> servers that contain the current genomes of EnsEMBL, NCBI or UCSC. I found
> http://das.ensembl.org/das/ensembl_Homo_sapiens_core_38_36 which always
> replies an out of memory error even with the entry_points command and
> http://das.ensembl.org/das/ensembl1834 which seems to work properly but its
> transcript server ens1834trans also returns out of memory errors.

If you wish to do numerical (not visual) comparisons of data against Ensembl, I believe this would be easier with the help of the Ensembl Perl API. Ensembl nowadays serves reference sources from the www.ensembl.org/das server. See Eugene's reply for examples.

> I think that there are some people here that can tell me their view on the
> subject.
> Thanks in advance,
> Rowan Kuiper

Regards,
Andreas

--
Andreas Kähäri
Ensembl Software Developer
European Bioinformatics Institute (EMBL-EBI)

From ak at ebi.ac.uk Mon Sep 25 12:35:04 2006
From: ak at ebi.ac.uk (Andreas Kahari)
Date: Mon, 25 Sep 2006 13:35:04 +0100
Subject: [DAS2] current status of DAS
Message-ID: <20060925123504.GG31706@ebi.ac.uk>

Sent to list on behalf of Eugene Kulesha (non-subscriber).

- Andreas K.
----- Forwarded message from Eugene Kulesha -----

Subject: Re: [DAS2] current status of DAS
Date: Mon, 25 Sep 2006 10:46:48 +0100
From: Eugene Kulesha
To: Rowan Kuiper
CC: das2 at lists.open-bio.org
References:

> - First of all, will DAS2 ever be finished.

good question :) although it stopped worrying me a long time ago ;)

> DAS is still alive or is there something else that took its place?

it certainly is in Ensembl

> - Why don't all servers support all commands. Some reference servers for
> example don't support the entry_point command. How do I request features
> when I don't know which segments the server contains? I imagine that these
> great differences in how to use different servers could be very problematic
> when implementing a viewer.

I guess it was done in part so DAS could be adopted quicker, but I have to admit that I was very much frustrated by the fact that very few sources implement the 'entry_points' command

> - It seems that the only way to retrieve information from a server is to do
> a request for a certain region. Is there a way to ask for specific
> features.

yes, features?feature_id=XXXX would give you the feature (if feature_id is implemented), e.g.
http://www.ensembl.org/das/Homo_sapiens.NCBI36.transcripts/features?feature_id=ENSE00001253754

> - When I have to serve data that needs some extension on the XML structure,
> would it be a problem to just do it. How would clients handle these
> extensions. Ignore them or somehow parse them?

I'm not quite sure what Bio::DasLite (this is what we use to parse DAS responses) would do .. but even if it parses the extension OK, Ensembl will ignore the unknown properties ..

> - And last, one of the goals of DAS is to be able to integrate biological
> data. When I for example want to compare my data to EnsEMBL features I will
> have to set up my own server that serves features referenced to the same
> genome as the EnsEMBL features.
> So I wonder if there exist reference
> servers that contain the current genomes of EnsEMBL, NCBI or UCSC. I
> found http://das.ensembl.org/das/ensembl_Homo_sapiens_core_38_36 which
> always replies an out of memory error even with the entry_points command
> and http://das.ensembl.org/das/ensembl1834 which seems to work properly but
> its transcript server ens1834trans also returns out of memory errors.

http://www.ensembl.org/das/dsn has the list of all the sources that we serve from internal Ensembl data. Amongst them there are reference sources, e.g.

http://www.ensembl.org/das/Homo_sapiens.NCBI36.reference
http://www.ensembl.org/das/Mus_musculus.NCBIM36.reference

Cheers
Eugene Kulesha

----- End forwarded message -----

--
Andreas Kähäri
Ensembl Software Developer
European Bioinformatics Institute (EMBL-EBI)

From Steve_Chervitz at affymetrix.com Mon Sep 25 17:39:41 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 25 Sep 2006 10:39:41 -0700
Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 25 Sep 2006
Message-ID:

Notes from the weekly DAS/2 teleconference, 25 Sep 2006

$Id: das2-teleconf-2006-09-25.txt,v 1.2 2006/09/25 17:38:57 sac Exp $

Note taker: Steve Chervitz
Attendees:
  Affy: Steve Chervitz, Ed Erwin, Gregg Helt
  UCLA: Allen Day

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org

DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.
Agenda
-------
* Spec issues
* Grant status
* Status reports

Topic: Spec issues
------------------

aday: what is the status of the currently posted spec w/r/t fasta format, seq segments? The fasta description in the spec seems not up to date.

gh: as I recall from the last code sprint (aug 14-18 2006), we had decided to return a das2 segments document in which you could specify fasta as an available format for receiving seq data.

sc: that's my recollection too.

[A] andrew will work more on keeping the online spec up to date.

Topic: Grant status
-------------------

gh: We have received official word of approval of $250K for extending funding from now thru May 2007. Allen and Ed will be at same amt, steve down a bit - based on current billing, gregg up 40-50%. This will allow me to put more focus on grant. Funding will also be put towards equipment improvements for affy das/2 server on our colo. andrew will start a full time job in 2007, will ramp up till end of year. hoping he can get a lot of the spec issues put to rest before he goes. probably it will be me (gregg) taking up the spec when he leaves, and hoping I won't have much to do on the spec docs.

Topic: status reports
---------------------

gh: have worked on the das/2 budget last 2 weeks. now should be able to get back to coding. have allen and brian received their reimbursements from code sprint?

aday: brian got his, not me yet.

gh: should get yours soon.

ee: putting out a new release of igb this week. minor release.

sc: helped straighten out file/dir permissions at biodas.org, lincoln was posting an update to the Bio::Das section on the biodas.org ftp site.

gh: his new das/2 client in perl? he's been working on that.

sc: not sure, possibly.

sc: also talked with gregg regarding my time commitment for the das/2 extension period. will be able to devote a solid block of time (~4wks) sometime in Dec or Jan.

aday: diagramming to get the current state of the spec.
getting ready to do major server side rewrite, implemented block caching strategy, to allow same data source to do reads and writes. going with custom caching rather than apache mod proxy, gives us more control of operations. Performance improvements I did on the chado db can then be removed since everything will be cached. working on uml diags.

gh: are you doing from-scratch caching?

aday: we have a mvc app. model layer talks to db, inherits from abstract db. that will stay the same. handles conversion of query string into sql. maybe trim it down and simplify based on spec cruft losses. for view and controller components, we use templates to generate xml. that will stay same, will use the catalyst web framework, much like ruby on rails, executable scripts that generate code for v and c layers. will replace the current hand code with the catalyst generated stuff.

sc: so this is like Ruby on Rails for perl?

aday: yes

aday: question on hw budget on this current round of funding?

gh: yes. we originally discussed more hw for ucla or cshl. now looking more doubtful. would like to address towards end of year or jan. based on prev estimates, we never spend as much as budgeted for. if more left over, we can look into putting more hw. the affy hw is a sure thing. is need critical?

aday: there is pressure on our hw to do upgrades, used by rest of lab as well now. maybe $5K would do it. dual or quad 4.

gh: doable sooner rather than later. send some figures...

[A] Allen will send gregg estimates on needed das-related hw upgrades at ucla.