[DAS] 1.6 draft 7
Mitch Skinner
mitch_skinner at berkeley.edu
Fri Oct 1 19:57:04 UTC 2010
On 09/29/2010 08:57 AM, Andy Jenkinson wrote:
> In fact I did a little test: I randomly generated 100,000 features in a JSON format with 5 small string fields (indexed file size ~5 mb). It turns out that when uncompressed a keyed file is indeed much bigger (87.5%). After compression this went down to 25.8% though. When 10% of fields were empty (i.e. for the indexed style empty strings, and for the hashed style omitted pairs), this went down again to 16.8%. This last bit surprised me to be honest, I expected a much more modest effect. For 10% of fields in a dataset to be empty seems a reasonable expectation to me too, so if your data model has a variable "occupancy" of fields across rows it's worth considering I would suggest. In particular, using 'null' instead of empty strings would have an even bigger effect (I don't know what JBrowse does in such circumstances).
Hi, thanks for the thoughtful and detailed analysis; it's really nice to
get numbers on this stuff. 16.8% is certainly less than I would have
hoped for, but I think there may be some ways to win back a bit more space.
In JBrowse, the "schema" can vary by track; my assumption was that the
set of populated attributes in an individual track would be pretty
uniform. Some tracks might not use the "phase" field, for example, but
if a given track used phase information, then I figured that almost all
of the features in that track would populate that field.
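To make that concrete, the per-track layout I have in mind looks roughly like this (the field names and structure here are just an illustration, not JBrowse's actual on-disk format):

```javascript
// Sketch of a per-track "schema": the track header lists the attribute
// order once, and each feature is an array indexed by that order.
// Field names are hypothetical illustrations.
const track = {
  fields: ["start", "end", "strand", "phase", "name"],
  features: [
    [10000, 15000, 1, 0, "gene1"],
    [22000, 29500, -1, 2, "gene2"]
  ]
};

// A client can translate an indexed row back into a keyed object on demand.
function toObject(track, row) {
  const obj = {};
  track.fields.forEach((f, i) => { obj[f] = row[i]; });
  return obj;
}

console.log(toObject(track, track.features[0]).name); // "gene1"
```

The key-to-index mapping is paid for once per track rather than once per feature, which is where the space savings come from.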
For DAS, I was assuming that the "schema" could similarly vary on a
per-type or per-query basis. In other words, I think of it as
something that could be determined at server set-up time, or at
HTTP-request time, rather than being fixed in the DAS standard (the
standard could specify a controlled vocabulary of required/optional
fields, but it doesn't have to specify array indexes). I'm not sure how
easy that would be to implement; I imagine it would be relatively
straightforward for someone using something like the UCSC genome browser
database schema to determine the fields used in a given query result
based on the columns in the table being queried. But for other types of
DB schemas I imagine it could be harder to know what fields are going to
be used, unless you write code to go through the query result and check.
And I'm sure you're right that using null rather than an empty string
for unpopulated fields would use a bit more space. One possibility
would be to put the fields that are more likely to be empty at the end,
and just make the array shorter if a given record doesn't have values
for the trailing fields.
Also, javascript allows for array entries to be omitted entirely, like:
[10000, 15000, , "foo"]
JSON theoretically doesn't allow this; omitted entries become
"undefined" in javascript, and the "official" JSON spec disallows
"undefined" in an effort to facilitate interoperation with languages
that don't have separate notions of "null" and "undefined" the way
javascript does. But I'm not sure how often people obey that
restriction in practice; I personally think it was a silly choice. If
JSON doesn't want to include separate notions of "null" and "undefined",
then it can specify that they both map to the same thing in other
languages; e.g., in java they could both map to java's null. I'd feel
comfortable writing code that ignored that restriction and allowed for
omitted array entries. If you did that, then the cost of an empty field
is just one byte for the comma, or nothing if the field is at the end of
an array and you just send a shorter array.
Also, depending on the use case, I wonder if the difference in
(de)compression time between indexed and keyed JSON would matter. If
you have your generated data handy still, I'd be curious to know what
the difference is.
> Regarding memory, if you think about it there is nothing to stop you using arrays in code regardless of the file format, should memory be an issue.
True, although when I started I was paranoid about reducing the amount
of work done in javascript in the browser, so I wanted to pre-digest
things if I could rather than having the client translate. Over the
last few years, web browsers have gotten so much faster, but I think
it's still important to support IE; a significant number of people will
be using slow versions of IE (6-8) for a long time to come.
I recognize that DAS has a much broader variety of clients than just web
browsers, though.
Regards,
Mitch