[DAS] 1.6 draft 7
Mitch Skinner
mitch_skinner at berkeley.edu
Fri Oct 1 19:57:04 UTC 2010
On 09/29/2010 08:57 AM, Andy Jenkinson wrote:
> In fact I did a little test: I randomly generated 100,000 features in a JSON format with 5 small string fields (indexed file size ~5 mb). It turns out that when uncompressed a keyed file is indeed much bigger (87.5%). After compression this went down to 25.8% though. When 10% of fields were empty (i.e. for the indexed style empty strings, and for the hashed style omitted pairs), this went down again to 16.8%. This last bit surprised me to be honest, I expected a much more modest effect. For 10% of fields in a dataset to be empty seems a reasonable expectation to me too, so if your data model has a variable "occupancy" of fields across rows it's worth considering I would suggest. In particular, using 'null' instead of empty strings would have an even bigger effect (I don't know what JBrowse does in such circumstances).
Hi, thanks for the thoughtful and detailed analysis; it's really nice to
get numbers on this stuff. 16.8% is certainly less than I would have
hoped for, but I think there may be some ways to win back a bit more space.
In JBrowse, the "schema" can vary by track; my assumption was that the
set of populated attributes in an individual track would be pretty
uniform. Some tracks might not use the "phase" field, for example, but
if a given track used phase information, then I figured that almost all
of the features in that track would populate that field.
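To make that concrete, the per-track layout I have in mind looks roughly like this (the field names and structure here are just an illustration, not JBrowse's actual on-disk format):

```javascript
// Sketch of a per-track "schema": the track header lists the attribute
// order once, and each feature is an array indexed by that order.
// Field names are hypothetical illustrations.
const track = {
  fields: ["start", "end", "strand", "phase", "name"],
  features: [
    [10000, 15000, 1, 0, "gene1"],
    [22000, 29500, -1, 2, "gene2"]
  ]
};

// A client can translate an indexed row back into a keyed object on demand.
function toObject(track, row) {
  const obj = {};
  track.fields.forEach((f, i) => { obj[f] = row[i]; });
  return obj;
}

console.log(toObject(track, track.features[0]).name); // "gene1"
```

The key-to-index mapping is paid for once per track rather than once per feature, which is where the space savings come from.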
For DAS, I was assuming that the "schema" could similarly vary on a
per-type or per-query basis. In other words, I think of it as
something that could be determined at server set-up time, or at
HTTP-request time, rather than being fixed in the DAS standard (the
standard could specify a controlled vocabulary of required/optional
fields, but it doesn't have to specify array indexes). I'm not sure how
easy that would be to implement; I imagine it would be relatively
straightforward for someone using something like the UCSC genome browser
database schema to determine the fields used in a given query result
based on the columns in the table being queried. But for other types of
DB schemas I imagine it could be harder to know what fields are going to
be used, unless you write code to go through the query result and check.
And I'm sure you're right that using null rather than an empty string
for unpopulated fields would use a bit more space. One possibility
would be to put the fields that are more likely to be empty at the end,
and just make the array shorter if a given record doesn't have values
for the trailing fields.
Also, javascript allows for array entries to be omitted entirely, like:
[10000, 15000, , "foo"]
JSON theoretically doesn't allow this; omitted entries become
"undefined" in javascript, and the "official" JSON spec disallows
"undefined" in an effort to facilitate interoperation with languages
that don't have separate notions of "null" and "undefined" the way
javascript does. But I'm not sure how often people obey that
restriction in practice; I personally think it was a silly choice. If
JSON doesn't want to include separate notions of "null" and "undefined",
then it can specify that they both map to the same thing in other
languages; e.g., in java they could both map to java's null. I'd feel
comfortable writing code that ignored that restriction and allowed for
omitted array entries. If you did that, then the cost of an empty field
is just one byte for the comma, or nothing if the field is at the end of
an array and you just send a shorter array.
Also, depending on the use case, I wonder if the difference in
(de)compression time between indexed and keyed JSON would matter. If
you have your generated data handy still, I'd be curious to know what
the difference is.
> Regarding memory, if you think about it there is nothing to stop you using arrays in code regardless of the file format, should memory be an issue.
True, although when I started I was paranoid about reducing the amount
of work done in javascript in the browser, so I wanted to pre-digest
things if I could rather than having the client translate. Over the
last few years, web browsers have gotten so much faster, but I think
it's still important to support IE; a significant number of people will
be using slow versions of IE (6-8) for a long time to come.
I recognize that DAS has a much broader variety of clients than just web
browsers, though.
Regards,
Mitch