[DAS2] feature group assembly; proposals for simplification

Mon Sep 18 19:11:03 UTC 2006

Ed:
> I'm having trouble understanding where all this memory overhead comes
> from in your parsing of GFF3 files.  I've recently written a GFF3 
> parser
> for IGB.  I've found that the presence or absence of the "end of 
> feature
> set" marker "###" has little effect on the amount of memory required.

How big was the data set?  dmel-3R-r4.3.gff from flybase is 68,685,595
bytes.  Strange though now that I look at it.  I shouldn't have a 10x
overhead.

I'm looking at the memory use now.  I estimate my data structures
used roughly 340 bytes per feature.  Each line averages 80 characters
so 4.25x overhead and not 10x.  Very strange.  I'll need to dig in
to that some more.

I did find that I wasted a lot of space with small data structures.

class Location(object):
     def __init__(self, id, start, end):
         self.id, self.start, self.end = id, start, end

takes 190 bytes per instance.  When I change it to use slots
instead of a dictionary for attribute storage (deep Python trickery)

class Location(object):
     __slots__ = ["id", "start", "end"]
     def __init__(self, id, start, end):
         self.id, self.start, self.end = id, start, end

I use about 48 bytes per object.  That'll save about 122MB and
take me away from the edge of memory use.  I had used that
trick on my other data objects - I somehow missed Location.

I suspect the other big memory use in in the attribute table
for things like

     ID=80799wgsext-hsp;Name=80799wgsext

Each string has 16 bytes of overhead, I think, so 32 bytes
for each use of "ID" and "Name".  By interning those two
frequent strings I can save about 20 bytes per record (70%
of flybase records have ID, 55% have Name) or 19MB.

> The procedure is quite simple.

That's the first step. For sanity checking you should do
cycle detection, and likely check that the structure is
single-rooted.

> During processing you have to have one hashmap.  But I don't see how
> that adds a whole lot to the memory overhead.

It wasn't.  It was the per-record overhead.

> So basically, I just don't understand what problem you are trying to
> solve.

The reason for bidirectional links was to allow processing while
receiving data rather than waiting until the end.  With bi-di
you can in principle determine that a feature group is complete
when the last feature in the group arrives.

>   I don't object to adding <FEATURE_GROUP>, and I don't much care
> whether there are bi-directional references.  Bi-directional references
> do not seem necessary to me, and really just seems like a likely place
> for the users to make mistakes, but I don't see any reason to change 
> the
> spec now.

If it's error prone (I agree that it is) and it's hard to
use (which I now believe) and no one will use it for it's
intended goal (likely?) and it breaks no code to remove it
then I see little reason to keep it.

If processing while downloading is desirable than the easiest
solution to use is a <FEATURE_GROUP>, but the solution with the
least change to the existing spec is a "root=" attribute.

If processing while downloading is not sufficiently desirable
then there's no need for bi-di links and we can drop the <PART>
element and have the data structure be closer to GFF3.

> (You in fact get
> a bit of a boost because since each parent knows how many children it
> expects, you can set-up the child List objects with the correct size
> from the beginning.)

Only if the parents are listed first.  Otherwise there's no hint
for the correct size.

					Andrew
					dalke at dalkescientific.com