[DAS2] feature group assembly; proposals for simplification

Lincoln Stein lstein at cshl.edu
Mon Sep 18 17:23:38 UTC 2006


Hi,

My GFF3 parser works in a similar manner. As each feature comes in, it is
parsed, turned into an object, and sent to a disk-based database. The parent
link is kept in an in-memory data structure. At the end of the parse, the
parent link data structure is traversed and then the table of parent/child
relationships is written out to disk.
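
A minimal Python sketch of this approach, assuming SQLite as the disk-based
database. The table names, column names, and the load_gff3 function are
hypothetical and chosen only for illustration; this is not the actual parser.
Attribute values are not URL-decoded here.

```python
import sqlite3

def load_gff3(lines, conn):
    """Stream GFF3 lines into SQLite, keeping only parent links in memory."""
    conn.execute(
        "CREATE TABLE feature (feature_row INTEGER PRIMARY KEY, gff_id TEXT, "
        "seqid TEXT, type TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.execute("CREATE TABLE parent_child (parent_id TEXT, child_row INTEGER)")
    parent_links = []  # (parent ID, child row) pairs: the only in-memory state
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        # Each feature goes straight to the database as it is parsed.
        cur = conn.execute(
            "INSERT INTO feature (gff_id, seqid, type, start_pos, end_pos) "
            "VALUES (?, ?, ?, ?, ?)",
            (attrs.get("ID"), cols[0], cols[2], int(cols[3]), int(cols[4])),
        )
        for parent_id in attrs.get("Parent", "").split(","):
            if parent_id:
                parent_links.append((parent_id, cur.lastrowid))
    # End of parse: write the parent/child table in one pass.
    conn.executemany("INSERT INTO parent_child VALUES (?, ?)", parent_links)
    conn.commit()
```

In this sketch, memory use grows only with the number of parent links, not
with the number of feature objects.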

Lincoln

On 9/18/06, Erwin, Ed <Ed_Erwin at affymetrix.com> wrote:
>
>
> Andrew,
>
> I'm having trouble understanding where all this memory overhead comes
> from in your parsing of GFF3 files.  I've recently written a GFF3 parser
> for IGB.  I've found that the presence or absence of the "end of feature
> set" marker "###" has little effect on the amount of memory required.
>
> The procedure is quite simple.
>
> For each line in the GFF3 file, create an object in memory.
> Add that object to a list.
> If the object has an ID, store the "ID to object" mapping in a hashmap.
>
> At the end of file (or each "###" mark)
> Loop through the complete list of objects.
> For each one claiming to have one or more Parent_ID's, find those
> parents in the hashmap, add it as a child of those parents and remove it
> from the original list (which will then contain only parentless
> objects).
>
>
> That is all.  At the end you can throw away the hashmap.
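
The two-pass procedure Ed describes above could be sketched in Python roughly
as follows. The Feature class and link_features function are hypothetical
names, not IGB code, and the sketch builds a new list of parentless features
instead of removing entries from the original list.

```python
class Feature:
    """One parsed GFF3 line (only the fields needed for linking)."""
    def __init__(self, feature_id, parent_ids, ftype):
        self.id = feature_id          # may be None if the line has no ID
        self.parent_ids = parent_ids  # list of parent IDs, possibly empty
        self.type = ftype
        self.children = []            # filled in during the second pass

def link_features(features):
    """Attach each feature to its parents; return the parentless features."""
    # First pass result: the "ID to object" hashmap.
    by_id = {f.id: f for f in features if f.id is not None}
    roots = []
    # Second pass: run at end of file or at each "###" marker.
    for f in features:
        if not f.parent_ids:
            roots.append(f)
            continue
        for pid in f.parent_ids:
            if pid not in by_id:
                raise ValueError(f"unknown parent {pid!r}")
            by_id[pid].children.append(f)
    return roots  # by_id can be discarded once this returns
```

The hashmap holds one reference per feature with an ID, so its cost is small
next to the feature objects themselves.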
>
> During processing you have to have one hashmap.  But I don't see how
> that adds a whole lot to the memory overhead.  In our model, each of the
> memory objects representing one feature keeps a list of pointers to its
> children.  While first reading the file, those pointers are left null,
> then the lists are constructed on the second pass (after the "###"
> marks).
>
> (In IGB, the final destination of the data is some in-memory objects.
> If your final destination is a database, then you can be writing each
> line to the database as it is read and then check for consistency of
> parents and children later.  You don't even need the in-memory hashmap
> then, because you can use a database table.)
>
> So basically, I just don't understand what problem you are trying to
> solve.  I don't object to adding <FEATURE_GROUP>, and I don't much care
> whether there are bi-directional references.  Bi-directional references
> do not seem necessary to me, and really just seem like a likely place
> for users to make mistakes, but I don't see any reason to change the
> spec now.
>
> If there are bi-directional references, you can proceed exactly as
> above.  The primary references are references to the parents.  But when
> hooking a feature up to its parent, you can then check that the parent
> has listed this child as one of its expected children.  (You in fact get
> a bit of a boost: since each parent knows how many children it
> expects, you can set up the child List objects with the correct size
> from the beginning.)
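
As a rough illustration of the check described above, the following Python
sketch links children through their parent references and then verifies both
directions. The record layout (dicts with "parents" and "parts" lists) and the
link_and_check function are assumptions made for this example, not part of
the DAS2 spec or of IGB. Pre-sizing the child lists is omitted because it
brings no real benefit in Python.

```python
def link_and_check(records):
    """records maps feature ID -> {"parents": [...], "parts": [...]}.

    Returns (children, errors): children maps each ID to the child IDs
    that were attached to it; errors lists every inconsistency found.
    """
    children = {fid: [] for fid in records}
    errors = []
    # Parent references are primary: attach each feature to its parents.
    for fid, rec in records.items():
        for pid in rec["parents"]:
            parent = records.get(pid)
            if parent is None:
                errors.append(f"{fid}: unknown parent {pid}")
            elif fid not in parent["parts"]:
                errors.append(f"{pid} does not list {fid} as a part")
            else:
                children[pid].append(fid)
    # Reverse direction: every declared part must have claimed this parent.
    for pid, rec in records.items():
        for cid in sorted(set(rec["parts"]) - set(children[pid])):
            errors.append(f"{pid} expects part {cid}, which never claimed it")
    return children, errors
```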
>
> Ed
>
>
> -----Original Message-----
> From: das2-bounces at lists.open-bio.org
> [mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
> Sent: Sunday, September 17, 2006 1:21 AM
> To: lincoln.stein at gmail.com
> Cc: DAS/2
> Subject: Re: [DAS2] feature group assembly; proposals for simplification
>
> On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote:
> > Hi Andrew,
> >
> >  Grouping them into a <FEATURE_GROUP> set is almost equivalent to the
> > "end of feature set" marker in GFF3, which is why I favor that solution.
> > If we do this, should we adopt the same convention for the GET requests
> > as well? If so, should we get rid of bidirectional references?
>
> (I did notice that the GFF3 data sets I found, like WormBase, don't have
> the "end of feature set" marker.  My GFF3 parser has about 10x memory
> overhead, so parsing an 80MB input file thrashed my 1GB laptop.  Adding a
> single marker in the middle, by hand, made it much happier.)
>
> If we have a <FEATURE_GROUP> such that features in that group are all
> connected to each other and only to each other, then I have no problem
> getting rid of the child link.  It adds no benefit in that case but does
> cause the verification overhead of checking that both directions are correct.
>
>                                         Andrew
>                                         dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2
>



-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu


