Principles
json-packer is a JSON compression library whose core idea is to reduce storage space by analyzing the characteristics of JSON data and using multiple compression techniques. This document explains how it works from a principles perspective.
Why JSON Can Be Compressed
JSON data has several obvious characteristics that create opportunities for compression:
- Repeated key names: In complex JSON data, the same object keys appear repeatedly in large quantities
- Redundant type information: JSON's text format contains a lot of syntax characters like quotes, colons, commas, etc.
- Repeated string values: Certain string values appear repeatedly in the data
- Inefficient number representation: Numbers in JSON are stored in text form, taking up more space
Compression Strategies
1. Type Tags: Remove Syntax Redundancy
JSON's text format requires many syntax characters to distinguish different types:
{
"name": "Alice",
"age": 25,
"active": true,
"data": null
}
In this example, quotes, colons, commas, braces, and other characters take up a considerable proportion of space, but don't actually carry the information of the data itself.
json-packer's solution is to assign a 3-bit binary tag for each JSON type:
- null → 000
- false → 001
- true → 010
- integer → 011
- float → 100
- string → 101
- object → 110
- array → 111
This way, type information that originally required multiple characters now only needs 3 binary bits.
2. Huffman Encoding: Compress Repeated Key Names
Consider this JSON data:
[
{"name": "Alice", "age": 25, "city": "Beijing"},
{"name": "Bob", "age": 30, "city": "Shanghai"},
{"name": "Carol", "age": 28, "city": "Beijing"}
]
In this data, the three keys "name", "age", and "city" each appear 3 times. If stored conventionally, these strings would need to be saved completely each time.
The idea of Huffman encoding is: keys with high frequency use shorter codes, keys with low frequency use longer codes.
Specific process:
- Count frequency: Traverse the entire JSON and count the number of occurrences of each key
- Build Huffman tree: Build a binary tree based on frequency, with high-frequency keys closer to the root node
- Assign codes: The path of the tree is the code, left subtree is 0, right subtree is 1
For example, in the above case:
- "name", "age", "city" each appear 3 times and might be assigned 1-2 bit short codes
- If there are other keys that appear only once, they would be assigned longer codes
This way, keys that originally needed 4-5 bytes might now only need 1-2 binary bits.
3. Variable-Length Integer Encoding: Compress Numbers
Numbers stored in text form in JSON are very inefficient. For example, the number 123
requires 3 bytes in JSON, but in binary only needs 1 byte to represent the range 0-255.
json-packer uses LEB128 (Little Endian Base 128) encoding:
- Small numbers use fewer bytes: 0-127 only needs 1 byte, 128-16383 needs 2 bytes
- Large numbers automatically expand: Dynamically decide how many bytes to use based on number size
- Distinguish signed and unsigned: Positive integers and negative integers use different encoding strategies
4. String Pool: Deduplicate Values
In some JSON data, there are many repeated string values:
[
{"status": "connected", "message": "Successfully connected to server"},
{"status": "connected", "message": "Successfully connected to server"},
{"status": "disconnected", "message": "Connection lost"},
{"status": "connected", "message": "Successfully connected to server"}
]
Storing these repeated strings completely each time would cause great waste.
The idea of string pool is:
- Collect repeated strings: Find strings in the data that appear repeatedly and are long enough
- Establish index mapping: Extract these strings into a "pool" and assign each an ID
- References replace content: Store ID references in original positions instead of complete strings
For example, "Successfully connected to server", this 30-byte string, might only need 1 byte ID to reference.
Theoretical Optimum of Huffman Encoding
Theoretically, the benefit calculation of Huffman encoding should consider the length of the original string. The optimal Huffman encoding should be built based on the product of frequency and original length, because:
- Compression benefit of a key = original length × occurrence count - encoding length × occurrence count
- Therefore, we should prioritize assigning the shortest encoding to keys with the largest "original length × occurrence count" value
For example, if there is a key "extraLongKeyName" that appears 2 times, and "key" appears 3 times:
- "extraLongKeyName" has a theoretical benefit weight of 16 × 2 = 32
- "key" has a theoretical benefit weight of 3 × 3 = 9
According to theoretical optimum, we should prioritize assigning shorter encoding to "extraLongKeyName" because it can bring greater total benefit.
But in practice we found that for most JSON data scenarios, simple Huffman encoding based on frequency can already achieve good compression results. This is mainly due to the following reasons:
1. Characteristics of JSON Key Names
Key names are usually short and similar in length: Most JSON key names are 3-15 characters long, with little difference. Even with the theoretical optimal scheme, the difference in encoding lengths assigned to long and short keys is not significant.
High-frequency keys are often short keys themselves: Common keys like "id", "name", "type" are both high-frequency and short keys, so the simple frequency-based scheme is close to the theoretical optimal scheme.
2. Implementation Complexity and Performance Considerations
The theoretical optimal scheme requires additional calculation and storage of length information for each key, increasing algorithm complexity and encoding time. Since the difference is not significant in most scenarios, it was not introduced in the final design.
Why String Pool is Designed as Optional
String pool is not always beneficial, it depends on the characteristics of the data:
1. Data Characteristics Determine Benefits
Beneficial scenarios:
- Log data: Large amounts of repeated status codes and error messages
- Configuration data: Repeated paths, URLs, identifiers
- Database exports: Repeated categories, tags, descriptions
Unfavorable scenarios:
- User-generated content: Comments, messages, etc., with low repetition
- Time series data: Mainly numbers with fewer strings
- Small data volumes: The overhead of building a string pool may be greater than the benefits
2. Performance Trade-offs
Cost of enabling string pool:
- Need to traverse the entire data twice (first time for statistics, second time for compression)
- Need extra memory to store string pool and index mapping
- Compression and decompression speed will be slower
Advantages of disabling string pool:
- Only need to traverse data once
- Less memory usage
- Faster processing speed
- Simpler code logic
3. Flexibility Considerations
Different application scenarios have different requirements for compression ratio and performance:
- Real-time communication: Latency sensitive, might choose v1 format for faster processing speed
- Data storage: Space sensitive, might choose v2 format for higher compression ratio
- Batch processing: Can tolerate higher processing time, tends to choose v2 format
- Embedded devices: Memory constrained, might choose v1 format
4. Adaptive Strategy
json-packer allows users to configure based on data characteristics:
CompressOptions {
enable_value_pool: true, // Whether to enable string pool
pool_min_repeats: 3, // Must repeat at least 3 times to join pool
pool_min_string_len: 8, // Must be at least 8 bytes long to consider
}
This design allows users to tune based on actual data:
- If data contains mostly short strings, can increase
pool_min_string_len
- If memory is constrained, can increase
pool_min_repeats
to reduce pool size - If certain there are no repeated strings, directly disable
enable_value_pool
Bit-Level Optimization
To squeeze out the last compression space, json-packer performs optimization at the bit level:
1. Compact Bit Packing
- Type tags only use 3 bits, not wasting an entire byte
- Boolean values use 1 bit instead of an entire byte
- Multiple small pieces of information can be packed in the same byte
2. Byte Alignment Processing
- Perform byte alignment when needed to avoid read errors
- Smart bit buffer management to reduce memory allocation
Deterministic Design
To ensure the same input always produces the same output, json-packer uses deterministic strategies in all sorting operations:
- Key statistics sorting: Sort by lexicographical order instead of relying on hash table's random order
- Huffman tree construction: When frequencies are equal, decide priority by symbol order
- String pool sorting: First by frequency descending, then by lexicographical order ascending
Benefits of this design:
- Easy to debug and test
- Support incremental updates and difference comparison
- Improve cache hit rate