[DAS2] feature group assembly; proposals for simplification

Andrew Dalke dalke at dalkescientific.com
Mon Sep 18 22:59:51 UTC 2006


Ed:
> I took file "3R.gff" from here
> ftp://flybase.net/genomes/Drosophila_melanogaster/current/gff/

"current" is dmel_r4.3_20060303 .  I'm also using "4.3" but I
have different data.  It has

3R      sim4    na_transcript_dmel_r31  380     1913    .       +       .       ID=-;

where I have

3R      sim4:na_transcript_dmel_r31     match   380     1913    .       +       .       ID=:315834


> Since IGB keeps everything in memory, we have optimized for memory
> rather than speed.  One of the tricks here is that I don't create a
> hashmap for the attributes.

Hmmm.  My parser doesn't handle that, at least not without a
bit of monkey patching.  Thinking about it some .. that defers
errors until later .. what errors? .. ahh, if a field doesn't
have a "=" in it then my code will raise an exception.

> I simply store the attributes string as a
> string.  I then have to do some regex processing each time I want to
> extract a property value, but that isn't very often and I intentionally
> chose memory efficiency over speed.

I didn't think regexps were the right solution for that.
Well, not unless you're using them for single character search.
For example,

     URL escaping rules are used for tags or values containing
     the following characters: ",=;".

means that you can't search for "ID=" attributes using the pattern
"ID=([^;]+)" because "ID" could be written as "%49%44".
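
A minimal Python sketch of the escaping problem.  The function name
parse_attributes is made up for illustration and is not either parser
discussed in this thread.  It assumes that a trailing ";" is tolerated
and that a field of "." means "no attributes".  The idea is to split on
the unescaped ";" and "=" first and only then unescape, so an escaped
tag is still found, while a regex search over the raw string misses it.

```python
import re
from urllib.parse import unquote


def parse_attributes(field):
    """Split a GFF3 attribute field into a dict, unescaping tags and values.

    Hypothetical helper: splitting happens before unescaping, so an
    escaped ';' or '=' inside a tag or value is never mistaken for a
    separator.
    """
    attrs = {}
    if field == ".":
        return attrs
    for part in field.split(";"):
        if not part:
            continue  # assumption: ignore empty parts such as a trailing ';'
        if "=" not in part:
            raise ValueError("attribute without '=': %r" % (part,))
        tag, value = part.split("=", 1)
        attrs[unquote(tag)] = unquote(value)
    return attrs


raw = "%49%44=gene%3B1;Name=abc"

# Searching the raw text finds nothing, because the tag is escaped.
naive = re.search(r"ID=([^;]+)", raw)

# Splitting and then unescaping recovers the tag and the ';' in the value.
parsed = parse_attributes(raw)
```

Here naive is None, while parsed["ID"] is "gene;1".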

> The bigger problem seems to be that every GFF3 file I've seen in the
> wild has violated the specification.  Every file I've tried has failed
> the validator, and it isn't even a very strict validator.

That's why I suspect GFF3 isn't used as input.  Otherwise these
would have been noticed and fixed.

> In this case, one of the big things is that almost every feature has
> "ID=-".  If I interpret that literally, then all those lines should be
> joined into one big feature.  (I assume what was intended in this case
> is that these are features without an ID, so I've added a special case
> to handle that.)

In my version of the data set there can be IDs.  What I found from
looking at other data sets is that an ID can be duplicated, so I don't
complain when I first see one.  I complain only while assembling the
complex feature, and only when a "Parent" refers to a duplicate id.

A small part of my memory overhead (about 70 bytes per record)
tracks those duplicates.  I had forgotten about this in my previous
calculations.
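
The deferred check could look roughly like the sketch below.  The record
shape (an id or None, plus a list of parent ids) and the function name
find_ambiguous_parents are assumptions made for this example, not the
data structures of the parser described above.  Duplicate IDs are only
counted while reading; a problem is reported only when some Parent
points at an ID that occurs more than once.

```python
from collections import Counter


def find_ambiguous_parents(records):
    """Return (child_id, parent_id) pairs whose parent id is duplicated.

    records: iterable of (feature_id_or_None, list_of_parent_ids).
    Duplicated ids that nobody uses as a parent are not reported.
    """
    records = list(records)
    # First pass: count how often each id occurs.
    id_counts = Counter(fid for fid, _ in records if fid is not None)
    # Second pass: complain only about parents that are ambiguous.
    problems = []
    for fid, parents in records:
        for parent in parents:
            if id_counts[parent] > 1:
                problems.append((fid, parent))
    return problems


records = [
    ("gene1", []),
    ("gene1", []),        # duplicate, and used as a parent below
    ("gene2", []),
    ("gene2", []),        # duplicate, but never used as a parent
    ("mrna1", ["gene1"]),
]
problems = find_ambiguous_parents(records)
```

With this input only the mrna1/gene1 pair is reported; the duplicated
gene2 passes silently because nothing refers to it.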


> This is getting off topic of DAS/2, but I'm trying to collect a list of
> questionable things I've seen in GFF3 files and I'll try to get Lincoln
> to rule on whether they are valid.

I sent others to him last spring and he replied to me.  Here
they are in summary.  Some were requests for clarification.

  Q. Can the start and end position be '.'
  A. Yes, and it's allowed in the spec

  Q. Can the seqid be "."?
  A. "This is allowed by the spec, but I hope it would never happen.
     It means there is a floating feature that has no location. It
     should probably be forbidden for seqid to be . and start and end
     to be defined. Shall I modify the GFF3 spec to state so?"

I see now I didn't respond:  "yes" is my answer.


   Q. Can the 9th field be "."?
   A. This is ok.

   Q. Are zero length tags allowed?  Eg, an attribute field
    of "=5".  [...] I use a dictionary key of "".
   A. Allowed.

   Q. Should parsers raise an exception if the two
     characters after the '%' are not hex characters?
   A. Yes

(Note that my parser currently does not catch that error.)
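
One way such a check could be added, as a hedged sketch: Python's
urllib.parse.unquote passes a malformed escape such as "%zz" through
unchanged, so the validation has to happen before calling it.  The name
strict_unescape is hypothetical.

```python
import re
from urllib.parse import unquote

# A '%' that is NOT followed by exactly two hex digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def strict_unescape(text):
    """Unescape %XX sequences, raising ValueError on a malformed escape."""
    bad = _BAD_ESCAPE.search(text)
    if bad is not None:
        raise ValueError(
            "malformed %%-escape at offset %d in %r" % (bad.start(), text)
        )
    return unquote(text)
```

For example, strict_unescape("a%3Bb") returns "a;b", while "%zz",
"abc%4" and "100%" each raise ValueError.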

   Q. Are duplicate attribute tags allowed, as in
         Parent=AB123;Parent=XY987
        If so, is it equivalent to
           Parent=AB123,XY987

   A. Absolutely! This is allowed and encouraged.
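
A small sketch of a parser that treats the two spellings as equivalent.
The function name parse_multi and the choice to store every value as a
list are assumptions for illustration.  Values are split on "," before
unescaping, so an escaped comma (%2C) stays inside a single value, and a
zero-length tag ends up under the dictionary key "".

```python
from urllib.parse import unquote


def parse_multi(field):
    """Parse a GFF3 attribute field into {tag: [value, ...]}.

    Repeated tags are merged, so 'Parent=A;Parent=B' and 'Parent=A,B'
    give the same result.
    """
    attrs = {}
    for part in field.split(";"):
        if not part:
            continue  # assumption: ignore empty parts such as a trailing ';'
        tag, value = part.split("=", 1)
        values = [unquote(v) for v in value.split(",")]
        attrs.setdefault(unquote(tag), []).extend(values)
    return attrs


repeated = parse_multi("Parent=AB123;Parent=XY987")
combined = parse_multi("Parent=AB123,XY987")
```

Both repeated and combined are {"Parent": ["AB123", "XY987"]}.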



					Andrew
					dalke at dalkescientific.com



