[DAS2] feature group assembly; proposals for simplification

Erwin, Ed Ed_Erwin at affymetrix.com
Mon Sep 18 22:01:19 UTC 2006


 
I have mostly used smaller examples from NCBI, but I've downloaded that
FlyBase one to play with as a good test of a big file.

I took file "3R.gff" from here
ftp://flybase.net/genomes/Drosophila_melanogaster/current/gff/

I need a little more than 2x the file size in memory to store that data
and the graphical objects used to represent it.  (I haven't looked at
exactly how much is data vs. graphics.)

Since IGB keeps everything in memory, we have optimized for memory
rather than speed.  One of the tricks here is that I don't create a
hashmap for the attributes.  I simply store the attributes string as a
string.  I then have to do some regex processing each time I want to
extract a property value, but that isn't very often and I intentionally
chose memory efficiency over speed.
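
A minimal sketch of that trade-off, in Java.  This is not the IGB code;
the class name LazyAttributes and its get() method are hypothetical, and
percent-decoding of values is left out.  The raw column 9 string is the
only thing stored, and a regex pulls out one tag when it is asked for:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical record that keeps the GFF3 attributes column unparsed. */
final class LazyAttributes {
    private final String raw; // e.g. "ID=gene1;Name=abc"

    LazyAttributes(String raw) {
        this.raw = raw;
    }

    /** Returns the first value stored under tag, or null if the tag is absent. */
    String get(String tag) {
        // Anchor on start-of-string or ';' so that "D" does not match "ID".
        Pattern p = Pattern.compile("(?:^|;)" + Pattern.quote(tag) + "=([^;]*)");
        Matcher m = p.matcher(raw);
        return m.find() ? m.group(1) : null;
    }
}
```

The pattern is compiled on every call, which costs time but keeps no
per-feature map or cached Pattern in memory.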

The bigger problem seems to be that every GFF3 file I've seen in the
wild has violated the specification.  Every file I've tried has failed
the validator, and it isn't even a very strict validator.

In this case, one of the big things is that almost every feature has
"ID=-".  If I interpret that literally, then all those lines should be
joined into one big feature.  (I assume what was intended in this case
is that these are features without an ID, so I've added a special case
to handle that.)
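
Roughly, the special case looks like the sketch below.  The names
IdGrouping, hasUsableId and group are made up for illustration, and the
input is assumed to be just the ID value of each line (null if the line
has no ID attribute).  Lines that share a real ID are joined; lines
with "-" or no ID each stay a separate feature:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IdGrouping {
    /** Treats a missing ID, an empty ID, or the placeholder "-" as "no ID". */
    static boolean hasUsableId(String id) {
        return id != null && !id.isEmpty() && !id.equals("-");
    }

    /** Groups line indices by ID; lines without a usable ID form their own group. */
    static List<List<Integer>> group(List<String> ids) {
        Map<String, List<Integer>> byId = new LinkedHashMap<>();
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            String id = ids.get(i);
            if (hasUsableId(id)) {
                List<Integer> g = byId.get(id);
                if (g == null) {
                    g = new ArrayList<>();
                    byId.put(id, g);
                    groups.add(g);
                }
                g.add(i);
            } else {
                List<Integer> single = new ArrayList<>();
                single.add(i);
                groups.add(single);
            }
        }
        return groups;
    }
}
```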

This is getting off the topic of DAS/2, but I'm trying to collect a list
of questionable things I've seen in GFF3 files and I'll try to get
Lincoln to rule on whether they are valid.


Ed


-----Original Message-----
From: das2-bounces at lists.open-bio.org
[mailto:das2-bounces at lists.open-bio.org] On Behalf Of Andrew Dalke
Sent: Monday, September 18, 2006 12:11 PM
To: DAS/2
Subject: Re: [DAS2] feature group assembly; proposals for simplification

Ed:
> I'm having trouble understanding where all this memory overhead comes
> from in your parsing of GFF3 files.  I've recently written a GFF3
> parser for IGB.  I've found that the presence or absence of the "end
> of feature set" marker "###" has little effect on the amount of memory
> required.

How big was the data set?  dmel-3R-r4.3.gff from flybase is 68,685,595
bytes.  Strange, though, now that I look at it: I shouldn't have a 10x
overhead.

....

> The procedure is quite simple.

That's the first step. For sanity checking you should do
cycle detection, and likely check that the structure is
single-rooted.
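
A rough sketch of those two checks, in Java.  The class and method
names are hypothetical, and the input is assumed to be a map from each
feature ID to the IDs named in its Parent attribute (a feature may have
more than one parent).  hasCycle() is a depth-first search that reports
a cycle when it meets a node already on the current path; roots()
collects the parentless ancestors, so a result of size 1 means the
feature's structure is single-rooted:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ParentGraphCheck {
    /** Returns true if following Parent links ever leads back to the same feature. */
    static boolean hasCycle(Map<String, List<String>> parents) {
        Set<String> done = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String id : parents.keySet()) {
            if (visit(id, parents, done, onPath)) return true;
        }
        return false;
    }

    private static boolean visit(String id, Map<String, List<String>> parents,
                                 Set<String> done, Set<String> onPath) {
        if (done.contains(id)) return false;   // already fully explored
        if (!onPath.add(id)) return true;      // id is already on the current path
        for (String p : parents.getOrDefault(id, Collections.<String>emptyList())) {
            if (visit(p, parents, done, onPath)) return true;
        }
        onPath.remove(id);
        done.add(id);
        return false;
    }

    /** Collects the parentless ancestors of id; call only after hasCycle() is false. */
    static Set<String> roots(String id, Map<String, List<String>> parents) {
        Set<String> out = new TreeSet<>();
        collect(id, parents, out);
        return out;
    }

    private static void collect(String id, Map<String, List<String>> parents,
                                Set<String> out) {
        List<String> ps = parents.getOrDefault(id, Collections.<String>emptyList());
        if (ps.isEmpty()) {
            out.add(id);
            return;
        }
        for (String p : ps) collect(p, parents, out);
    }
}
```

The recursion depth equals the depth of the feature hierarchy, which is
assumed here to be small (gene, transcript, exon and the like).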

....

> (You in fact get
> a bit of a boost because since each parent knows how many children it
> expects, you can set-up the child List objects with the correct size
> from the beginning.)

Only if the parents are listed first.  Otherwise there's no hint
for the correct size.
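
To make the ordering dependence concrete, here is a small Java sketch
(ChildListBuilder and its methods are hypothetical names, and the
parent record is assumed to carry its expected child count).  When the
parent is seen first, the list is allocated at the right capacity; when
a child arrives first, the builder has to fall back to a default-sized
list and copy it once the parent shows up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChildListBuilder {
    private final Map<String, List<String>> children = new HashMap<>();

    /** Called when a parent record that declares its child count is read. */
    void declareParent(String parentId, int expectedChildren) {
        List<String> presized = new ArrayList<>(expectedChildren);
        List<String> early = children.put(parentId, presized);
        if (early != null) {
            presized.addAll(early); // children arrived before the parent
        }
    }

    /** Called for each child; no size hint exists if the parent is still unseen. */
    void addChild(String parentId, String childId) {
        children.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    List<String> childrenOf(String parentId) {
        return children.getOrDefault(parentId, Collections.<String>emptyList());
    }
}
```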



