Overview

Fusing JSON decompression and parsing

Research project in the SYSTEMF@EPFL laboratory, by Ludovic Mermod.

The logs of the project are available here.

Abstract

Web browsers and web servers typically perform decompression and decoding separately. As a result, decoding does not benefit from redundancy in the input: a 10kB compressed JSON stream that decompresses into 10MB worth of JSON will cause 10MB worth of JSON to be fed into the JSON decoder, and 10MB worth of decoding work to be performed. This is a giant waste of time: there's really only 10kB worth of decoding work to do here; the rest is redundant.

Proposal: Explore opportunities to fuse decompression and decoding.

Here is a very simple example: suppose that “compression” is “run-length encoding” (0000000111111111100 → 0*7,1*10,0*2) and “decoding” is “counting the number of ones”.

Here is the decoder:

count = 0
for bit in decompressed_stream:  # one iteration per uncompressed bit
    if bit == 1:
        count += 1

This is hopelessly inefficient compared to the fused decompressor/decoder that operates directly on the compressed stream:

count = 0
for bit, repeats in compressed_stream:  # one iteration per run
    if bit == 1:
        count += repeats
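For reference, here is a minimal runnable sketch of both loops on the example string above. The helper names (`rle_encode`, `count_ones_decompressed`, `count_ones_fused`) are made up for this illustration, and the input is assumed to be a string of '0' and '1' characters.

```python
from itertools import groupby

def rle_encode(bits):
    """Run-length encode a string of '0'/'1' characters into (bit, repeats) pairs."""
    return [(int(bit), len(list(run))) for bit, run in groupby(bits)]

def count_ones_decompressed(bits):
    """Unfused decoder: does work proportional to the uncompressed length."""
    count = 0
    for bit in bits:
        if bit == "1":
            count += 1
    return count

def count_ones_fused(runs):
    """Fused decompressor/decoder: does work proportional to the number of runs."""
    count = 0
    for bit, repeats in runs:
        if bit == 1:
            count += repeats
    return count

bits = "0000000111111111100"
runs = rle_encode(bits)
print(runs)                           # [(0, 7), (1, 10), (0, 2)]
print(count_ones_decompressed(bits))  # 10, after 19 loop iterations
print(count_ones_fused(runs))         # 10, after 3 loop iterations
```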

We should be able to achieve similar savings with off-the-shelf formats and compression algorithms, such as gzip+JSON. For example, if a key is repeated 1000 times in a given JSON file, we ideally shouldn't have to re-decode it 1000 times. Instead, assuming (optimistically) that each repeated appearance of that key in the uncompressed stream is replaced in the compressed stream by a backpointer to the location of its first appearance, a fused decompressor-decoder should be able to decode the key only once (the first time it sees it) and then reuse the decoded version every time it sees a new backpointer to it.
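The idea can be sketched on a deliberately idealized token stream. This is a toy model, not the gzip format: real DEFLATE backreferences are (length, distance) pairs over raw bytes and need not line up with JSON tokens. Here we assume, optimistically, that every token is either a complete JSON string literal or a backpointer to an earlier token. The token layout and the name `decode_keys_fused` are assumptions made for this example.

```python
import json

def decode_keys_fused(tokens):
    """Decode a toy compressed stream of JSON keys.

    Each token is ("lit", text), a JSON string literal seen for the first
    time, or ("ref", i), a backpointer to the token at index i.
    Returns the decoded keys and the number of times the JSON decoder ran.
    """
    decoded = []      # decoded[i] holds the decoded value of token i
    parse_calls = 0
    for kind, payload in tokens:
        if kind == "lit":
            decoded.append(json.loads(payload))  # real decoding work
            parse_calls += 1
        else:
            decoded.append(decoded[payload])     # reuse the earlier result
    return decoded, parse_calls

tokens = [
    ("lit", '"user\\u005fname"'),  # escaped key, so decoding is not a no-op
    ("ref", 0),
    ("lit", '"id"'),
    ("ref", 0),
    ("ref", 2),
]
keys, parse_calls = decode_keys_fused(tokens)
print(keys)         # ['user_name', 'user_name', 'id', 'user_name', 'id']
print(parse_calls)  # 2
```

Five keys come out, but the JSON decoder only runs twice; each backpointer costs a list lookup.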

(Of course, compressed block boundaries won't nicely align with JSON object boundaries; that's partly what makes the problem so interesting. I suppose it would be reasonable to start by exploring the case where we hack together a gzip compressor that's polite enough to put block boundaries in semantically meaningful places from JSON's perspective.)

Context

Example [Clément Pit-Claudel]

I first ran into this issue in 2020 in Alectryon. The HTML and JSON that Alectryon generates for unfocused proofs with many open goals is highly redundant, with many literally repeated fragments throughout the output. The result is a web page that weighs 25MB but compresses down to 20kB… and still takes 3 seconds to render, because the HTML decoder has to go through 25MB of HTML after decompressing.

I fixed the issue in Alectryon by using a bespoke compression scheme that's trivial to fuse with decoding: I replace each redundant HTML node by a backpointer (using a custom HTML tag), let the browser decode the deduplicated HTML, and then use a three-line JavaScript program to replace each backpointer with a copy of the node that it points to (copying nodes is very cheap, much cheaper than decoding the corresponding HTML over and over).
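As a rough analogue of that expansion step (in Python rather than JavaScript, and on plain dictionaries rather than DOM nodes), the sketch below replaces each backpointer with a copy of the node it names. The node layout (`tag`, `id`, `children`, `ref`) and the function name `expand_backpointers` are assumptions for this illustration, not Alectryon's actual representation.

```python
import copy

def expand_backpointers(node, by_id):
    """Return a copy of `node` in which every {"ref": id} is replaced by a
    copy of the already-expanded node carrying that id.

    Assumes each backpointer refers to a node that appears earlier in
    document order, so it is already registered in `by_id`.
    """
    if "ref" in node:
        return copy.deepcopy(by_id[node["ref"]])  # copy, do not re-decode
    expanded = dict(node)
    expanded["children"] = [
        expand_backpointers(child, by_id) for child in node.get("children", [])
    ]
    if "id" in expanded:
        by_id[expanded["id"]] = expanded
    return expanded

doc = {
    "tag": "body",
    "children": [
        {"tag": "p", "id": "g1", "children": [{"tag": "b", "children": []}]},
        {"ref": "g1"},  # backpointer standing in for a repeated fragment
    ],
}
full = expand_backpointers(doc, {})
first, second = full["children"]
print(second == first)  # True: same content
print(second is first)  # False: an independent copy
```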

Amazingly, it turned out in that case that the same savings apply to the HTML encoding side as well: instead of repeatedly re-encoding the same objects as HTML, we can memoize the HTML corresponding to repeated objects and perform the encoding only once, shaving off over 90% of the encoding time in some cases. (Again, this is easy because of the bespoke compression scheme.)
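The encoding-side saving can be sketched as a memoized encoder. This is a minimal illustration under stated assumptions, not Alectryon's code: nodes are `(tag, children)` tuples, text leaves are plain strings, and a repeated fragment is assumed to be the same live Python object, so the cache can be keyed on `id(node)`.

```python
from html import escape

def encode(node, cache):
    """Encode a (tag, children) tree as HTML, encoding each shared node once."""
    if isinstance(node, str):
        return escape(node)
    key = id(node)  # assumption: repeated fragments are the same live object
    if key not in cache:
        tag, children = node
        inner = "".join(encode(child, cache) for child in children)
        cache[key] = f"<{tag}>{inner}</{tag}>"
    return cache[key]

goal = ("span", ["a < b"])
page = ("div", [goal] * 1000)  # the same fragment repeated 1000 times

cache = {}
html = encode(page, cache)
print(len(cache))  # 2: the page and the goal were each encoded once
print(html[:31])   # <div><span>a &lt; b</span><span
```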

The aim of this project is to try to replicate these savings with off-the-shelf compression algorithms.

References