Implementation
json-packer is an efficient JSON binary compression library that achieves lossless compression through various optimization techniques while ensuring deterministic compression results. This document details its core principles and technical implementation.
Core Architecture
json-packer adopts a layered architecture design, including the following core modules:
- Encoder (encode.rs): Responsible for compressing JSON data into binary format
- Decoder (decode.rs): Responsible for restoring JSON from binary data
- Huffman Encoding (huffman.rs): Compresses JSON object keys
- String Pool (pool.rs): Deduplicates repeated string values
- Bitstream Processing (bitstream.rs): Efficient bit-level read/write operations
- Variable-Length Integer (varint.rs): Compact integer encoding
- Type System (types.rs): Defines JSON value type tags
Data Format Design
Overall Structure
The compressed binary data adopts the following format:
[Header] [Dictionary] [String Pool] [Data Section]
Header Structure
The header is located at the beginning of the data and contains format identification and version information:
// header.rs:3
pub const MAGIC: [u8; 4] = *b"JCPR"; // File identifier
pub const VERSION_V1: u8 = 0x01; // v1 format (no value pool)
pub const VERSION_V2: u8 = 0x02; // v2 format (enable value pool)
Header format: MAGIC(4 bytes) + VERSION(1 byte) + DICT_LEN(variable) + POOL_LEN(variable)
Dictionary Table
The dictionary table stores all JSON object keys and their frequency statistics for building Huffman encoding tree:
// dict.rs:37
pub fn write_dictionary(writer: &mut BitWriter, freq_map: &HashMap<String, u64>) {
varint::write_uleb128(writer, freq_map.len() as u64);
// Sort by lexicographical order to ensure deterministic output
let mut sorted_keys: Vec<_> = freq_map.iter().collect();
sorted_keys.sort_by(|a, b| a.0.cmp(b.0));
for (key, &freq) in sorted_keys {
// Key length + key content + key frequency
varint::write_uleb128(writer, key.len() as u64);
writer.write_bytes(key.as_bytes());
varint::write_uleb128(writer, freq);
}
}
String Pool (v2 format)
The string pool stores repeated string values for deduplication compression:
// pool.rs:55
pub fn write_string_pool(writer: &mut BitWriter, pool: &StringPool) {
for s in &pool.entries {
writer.write_bits(tag::STRING as u64, 3);
let bytes = s.as_bytes();
varint::write_uleb128(writer, bytes.len() as u64);
writer.write_bytes(bytes);
}
}
Data Section
The data section stores actual JSON values using type tags and corresponding encoding formats.
Type Encoding System
Type Tags
json-packer uses 3-bit type tags to identify JSON value types:
// types.rs:3-11
pub mod tag {
pub const NULL: u8 = 0b000; // null
pub const BOOL_FALSE: u8 = 0b001; // false
pub const BOOL_TRUE: u8 = 0b010; // true
pub const INT: u8 = 0b011; // integer
pub const FLOAT: u8 = 0b100; // float
pub const STRING: u8 = 0b101; // string
pub const OBJECT: u8 = 0b110; // object
pub const ARRAY: u8 = 0b111; // array
}
Numeric Encoding
Integer Encoding
Integer type adds 1-bit sign identifier after the type tag:
// encode.rs:12-29
Value::Number(n) => {
if let Some(i) = n.as_i64() {
writer.write_bits(tag::INT as u64, 3);
writer.write_bits(0, 1); // is_unsigned = 0
varint::write_sleb128(writer, i);
} else if let Some(u) = n.as_u64() {
writer.write_bits(tag::INT as u64, 3);
writer.write_bits(1, 1); // is_unsigned = 1
varint::write_uleb128(writer, u);
}
}
- Signed integer: Uses SLEB128 variable-length encoding
- Unsigned integer: Uses ULEB128 variable-length encoding
Float Encoding
Floats directly use IEEE 754 double precision format:
// encode.rs:23-27
else if let Some(f) = n.as_f64() {
if !f.is_finite() { return Err(Error::IllegalFloat); }
writer.write_bits(tag::FLOAT as u64, 3);
writer.write_bits(f.to_bits(), 64);
}
String Encoding
String encoding supports two modes:
v1 Format (Direct Encoding)
// encode.rs:31-36
Value::String(s) => {
writer.write_bits(tag::STRING as u64, 3);
let bytes = s.as_bytes();
varint::write_uleb128(writer, bytes.len() as u64);
writer.write_bytes(bytes);
}
v2 Format (Support Pool Reference)
// encode.rs:95-119
if let Some(pool) = string_pool {
if let Some(&id) = pool.index.get(s) {
writer.write_bits(tag::STRING as u64, 3);
writer.write_bits(1, 1); // is_pool_ref
varint::write_uleb128(writer, id);
return Ok(());
}
}
// Non-reference path
writer.write_bits(tag::STRING as u64, 3);
writer.write_bits(0, 1); // is_pool_ref
// ... write string content
Huffman Encoding Optimization
Canonical Huffman Encoding
json-packer uses Canonical Huffman encoding for JSON object keys to ensure the same input produces the same encoding result:
// huffman.rs:51-92
pub fn from_frequencies(freq_map: &HashMap<String, u64>) -> Result<Self, Error> {
// 1. Sort symbols by lexicographical order to ensure determinism
let mut symbols: Vec<(String, u64)> = freq_map.iter()
.map(|(k, &f)| (k.clone(), f))
.collect();
symbols.sort_by(|a, b| a.0.cmp(&b.0));
// 2. Build Huffman tree and calculate code lengths
let code_lengths = build_code_lengths(&symbols);
// 3. Canonical encoding: sort by (code length, lexicographical order)
let mut by_len: Vec<(usize, &str)> = symbols.iter()
.enumerate()
.map(|(i, (k, _))| (code_lengths[i], k.as_str()))
.collect();
by_len.sort_by(|a, b| a.0.cmp(&b.0).then(a.1.cmp(b.1)));
// 4. Assign codewords
let mut next_code = vec![0u32; max_len + 1];
let mut code: u32 = 0;
for bits in 1..=max_len {
code = (code + bl_count[bits - 1] as u32) << 1;
next_code[bits] = code;
}
}
Encoding Characteristics
- Determinism: The same key frequency distribution always produces the same encoding
- Optimality: Optimal prefix encoding based on frequency statistics
- LSB First: Encoding bit order uses least significant bit first for bitstream operations
String Pool Mechanism
Pooling Strategy
The string pool mechanism counts repetition frequency of string values and extracts qualified strings into the pool:
// pool.rs:25-53
pub fn collect_string_pool(root: &Value, cfg: PoolConfig) -> StringPool {
let mut counter: HashMap<String, u32> = HashMap::new();
// Traverse all string values
walk(root, &mut counter);
// Filter conditions: frequency >= min_repeats && length >= min_string_len
let mut candidates: Vec<(String, u32)> = counter.into_iter()
.filter(|(s, c)| *c >= cfg.min_repeats && s.len() >= cfg.min_string_len)
.collect();
// Sort: frequency descending, lexicographical ascending (ensure determinism)
candidates.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
}
Pooling Configuration
// pool.rs:15-23
#[derive(Debug, Clone, Copy)]
pub struct PoolConfig {
pub min_repeats: u32, // Minimum repetition count (default 3)
pub min_string_len: usize, // Minimum string length (default 8)
}
Bitstream Processing Mechanism
LSB First Bit Order
json-packer adopts LSB (Least Significant Bit first) bit order processing:
// bitstream.rs:15-36
pub fn write_bits(&mut self, mut value: u64, mut n_bits: u32) {
while n_bits > 0 {
let take = (64 - self.bit_len as u32).min(n_bits);
let mask = if take == 64 { u64::MAX } else { (1u64 << take) - 1 };
let chunk = value & mask;
self.bit_bucket |= chunk << self.bit_len; // LSB first filling
self.bit_len += take as u8;
value >>= take;
n_bits -= take;
// Write byte when 8 bits are full
while self.bit_len >= 8 {
self.buffer.push((self.bit_bucket & 0xFF) as u8);
self.bit_bucket >>= 8;
self.bit_len -= 8;
}
}
}
Advantages
- Efficient buffering: Use 64-bit buffer to reduce memory operations
- Bit alignment: Automatically handle bit boundary alignment
- Streaming processing: Support streaming read/write with controllable memory usage
Variable-Length Integer Encoding
ULEB128/SLEB128
json-packer uses LEB128 format to encode integers:
// varint.rs:3-12
pub fn write_uleb128(writer: &mut BitWriter, mut value: u64) {
loop {
let mut byte = (value & 0x7F) as u8;
value >>= 7;
if value != 0 { byte |= 0x80; } // Continuation bit
writer.write_byte(byte);
if value == 0 { break; }
}
}
Encoding Efficiency
- Compactness: Small integers occupy fewer bytes
- Unsigned/signed: Separate optimization handling
- Standard format: Compatible with universal LEB128 standard
Determinism Guarantee
Sorting Strategy
To ensure the same input produces the same output, json-packer adopts deterministic sorting in key operations:
- Dictionary key sorting: Sort by lexicographical order (
dict.rs:42
) - Huffman symbol sorting: Sort by lexicographical order (
huffman.rs:26
) - String pool sorting: Frequency descending + lexicographical ascending (
pool.rs:44
)
Build Process Determinism
Huffman tree construction uses deterministic merge strategy:
// huffman.rs:156-159
impl Ord for OrdNode {
fn cmp(&self, other: &Self) -> Ordering {
// Lower frequency first, then minimum symbol index first
other.0.freq.cmp(&self.0.freq).then(other.0.min_sym_idx.cmp(&self.0.min_sym_idx))
}
}
Performance Optimization
Memory Management
- Pre-allocate capacity: Pre-allocate
Vec
capacity based on element count - Zero-copy design: String processing avoids unnecessary copying
- Bit operation optimization: Use bitwise operations instead of division/modulo
Encoding Efficiency
- Type tag compression: Only use 3 bits to represent 8 JSON types
- Variable-length encoding: Integer length dynamically adjusts based on value size
- Huffman compression: Object keys optimize encoding length based on frequency
Error Handling
json-packer defines a complete error type system:
// error.rs
pub enum Error {
BadMagic, // File header identifier error
BadVersion, // Version not supported
BitstreamOutOfBounds, // Bitstream out of bounds
VarintError, // Variable-length integer error
IllegalFloat, // Illegal float (NaN/Inf)
HuffmanError, // Huffman encoding error
PoolMissing, // Value pool missing
PoolIdOutOfRange, // Pool ID out of range
// ...
}
Version Compatibility
v1 Format (Basic Compression)
- No string value pool
- Support Huffman key compression
- Best compatibility
v2 Format (Enhanced Compression)
- Enable string value pool
- Higher compression ratio
- Backward compatible with v1 decoding
Compression options control:
// encode.rs:55-64
#[derive(Debug, Clone)]
pub struct CompressOptions {
pub enable_value_pool: bool, // Enable value pool (automatically use v2 format)
pub pool_min_repeats: u32, // Pool minimum repetition count
pub pool_min_string_len: usize, // Pool minimum string length
}