[DAS2] feature group assembly; proposals for simplification

Mon Sep 18 16:54:59 UTC 2006

Andrew,

I'm having trouble understanding where all this memory overhead comes
from in your parsing of GFF3 files.  I've recently written a GFF3 parser
for IGB.  I've found that the presence or absence of the "end of feature
set" marker "###" has little effect on the amount of memory required.

The procedure is quite simple.  

  1. For each line in the GFF3 file, create an object in memory.
  2. Add that object to a list.
  3. If the object has an ID, store the "ID to object" mapping in a
     hashmap.

At the end of the file (or at each "###" mark):

  4. Loop through the complete list of objects.
  5. For each one that lists one or more parent IDs, find those parents
     in the hashmap, add the object as a child of each of them, and
     remove it from the original list (which will then contain only
     parentless objects).

That is all.  At the end you can throw away the hashmap.
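
The procedure above can be sketched roughly as follows. This is a minimal
illustration in Python, not IGB's actual code (IGB is written in Java); the
names Feature, parse_attributes, link_group and assemble are hypothetical,
and the sketch assumes tab-separated GFF3 lines with "ID" and "Parent"
attributes in the ninth column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    # Hypothetical minimal record; a real GFF3 feature keeps all nine columns.
    feature_id: Optional[str]
    parent_ids: list
    children: Optional[list] = None  # left as None during the first pass


def parse_attributes(column9):
    """Split a GFF3 attribute column into (ID, [Parent, ...])."""
    attrs = dict(
        pair.split("=", 1) for pair in column9.strip().split(";") if "=" in pair
    )
    parents = attrs["Parent"].split(",") if "Parent" in attrs else []
    return attrs.get("ID"), parents


def link_group(features, by_id):
    """Second pass: attach each child to its parents, return parentless features."""
    roots = []
    for feat in features:
        if not feat.parent_ids:
            roots.append(feat)
            continue
        for pid in feat.parent_ids:
            parent = by_id[pid]  # a KeyError here means a dangling Parent reference
            if parent.children is None:
                parent.children = []
            parent.children.append(feat)
    return roots


def assemble(lines):
    """First pass over the lines; link at each "###" mark and at end of input."""
    roots, features, by_id = [], [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("###"):
            roots.extend(link_group(features, by_id))
            features, by_id = [], {}  # the map can be thrown away here
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        fid, parents = parse_attributes(line.split("\t")[8])
        feat = Feature(fid, parents)
        features.append(feat)
        if fid is not None:
            by_id[fid] = feat
    roots.extend(link_group(features, by_id))
    return roots
```

Because linking is deferred until the "###" mark or the end of input, a
child may appear before its parent in the file.  With "###" marks the map
only ever holds one group at a time; without them it holds one entry per
feature that has an ID.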

During processing you have to have one hashmap.  But I don't see how
that adds a whole lot to the memory overhead.  In our model, each of the
memory objects representing one feature keeps a list of pointers to its
children.  While first reading the file, those pointers are left null,
then the lists are constructed on the second pass (after the "###"
marks).

(In IGB, the final destination of the data is some in-memory objects.
If your final destination is a database, then you can be writing each
line to the database as it is read and then check for consistency of
parents and children later.  You don't even need the in-memory hashmap
then, because you can use a database table.)
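
As an illustration of the database variant, here is a small sketch using
Python's built-in sqlite3 module.  The table layout (a single "feature"
table with one row per child/parent pair) and the function names are
assumptions made for the example, not part of any DAS/2 or IGB schema.

```python
import sqlite3


def load_features(conn, rows):
    """Write each (id, parent_id) pair as it is read; no in-memory map is kept."""
    conn.execute("CREATE TABLE IF NOT EXISTS feature (id TEXT, parent_id TEXT)")
    conn.executemany("INSERT INTO feature (id, parent_id) VALUES (?, ?)", rows)
    conn.commit()


def dangling_parents(conn):
    """Return (child_id, parent_id) pairs whose parent was never loaded."""
    query = """
        SELECT child.id, child.parent_id
        FROM feature AS child
        LEFT JOIN feature AS parent ON parent.id = child.parent_id
        WHERE child.parent_id IS NOT NULL AND parent.id IS NULL
        ORDER BY child.id
    """
    return conn.execute(query).fetchall()
```

The consistency check runs once, after loading, as a self-join on the
table; a feature with several parents would simply contribute several rows.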

So basically, I just don't understand what problem you are trying to
solve.  I don't object to adding <FEATURE_GROUP>, and I don't much care
whether there are bi-directional references.  Bi-directional references
do not seem necessary to me, and really just seem like a likely place
for users to make mistakes, but I don't see any reason to change the
spec now.

If there are bi-directional references, you can proceed exactly as
above.  The primary references are references to the parents.  But when
hooking a feature up to its parent, you can then check that the parent
has listed this child as one of its expected children.  (You in fact get
a bit of a boost: since each parent knows how many children it expects,
you can set up the child List objects with the correct size from the
beginning.)
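
A rough sketch of that check, again in Python and with hypothetical names
(Node, link_and_verify).  It assumes each parent carries the set of child
IDs it expects, and it reports mismatches in either direction instead of
stopping at the first one.

```python
class Node:
    def __init__(self, node_id, parent_ids=(), expected_children=()):
        self.node_id = node_id
        self.parent_ids = list(parent_ids)
        # Child IDs this node claims to have (the "downward" references).
        self.expected_children = set(expected_children)
        self.children = []


def link_and_verify(nodes):
    """Link children to parents via parent IDs; return a list of mismatches."""
    by_id = {n.node_id: n for n in nodes}
    errors = []
    for node in nodes:
        for pid in node.parent_ids:
            parent = by_id.get(pid)
            if parent is None:
                errors.append(f"{node.node_id}: unknown parent {pid}")
                continue
            if node.node_id not in parent.expected_children:
                errors.append(f"{pid} does not list child {node.node_id}")
            parent.children.append(node)
    # Reverse direction: every expected child must have claimed this parent.
    for node in nodes:
        found = {c.node_id for c in node.children}
        for missing in sorted(node.expected_children - found):
            errors.append(
                f"{node.node_id} lists child {missing}, which never claimed it"
            )
    return errors
```

Python lists grow on demand, so the pre-sizing benefit does not show up in
this sketch; in Java the expected count could be passed as the initial
capacity of an ArrayList.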

Ed

-----Original Message-----
From: das2-bounces at lists.open-bio.org
[mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
Sent: Sunday, September 17, 2006 1:21 AM
To: lincoln.stein at gmail.com
Cc: DAS/2
Subject: Re: [DAS2] feature group assembly; proposals for simplification

On Sep 16, 2006, at 1:08 AM, Lincoln Stein wrote:
> Hi Andrew,
>
>  Grouping them into a <FEATURE_GROUP> set is almost equivalent to the
> "end of feature set" marker in GFF3, which is why I favor that solution.
> If we do this, should we adopt the same convention for the GET requests
> as well? If so, should we get rid of bidirectional references?

(I did notice that the GFF3 data sets I found, like WormBase, don't have
the "end of feature set" marker.  My GFF3 parser has about 10x memory
overhead, so parsing an 80MB input file thrashed my 1GB laptop.  Adding a
single marker in the middle, by hand, made it much happier.)

If we have a <FEATURE_GROUP> such that features in that group are all
connected to each other and only to each other, then I have no problem
getting rid of the child link.  It adds no benefit in that case but does
cause the verification overhead of checking that both directions are
correct.

					Andrew
					dalke at dalkescientific.com

_______________________________________________
DAS2 mailing list
DAS2 at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/das2