Documentation

Architecture, API reference, and how each compression stage works.

Architecture Overview

OpenCompress uses a 5-stage compression pipeline that optimizes both input and output tokens. The core inference engine is LLMLingua-2 (XLM-RoBERTa) running on SageMaker GPU for ~100ms token-level pruning, combined with dictionary aliasing, code minification, and scenario-aware output shaping.

[Pipeline diagram] messages[] → S0 Code Minification (strip comments/whitespace, shorten identifiers; 15–55% on code-heavy content) → S1 Dictionary Compression (repeated substrings → §XX aliases, bidirectional) → S2 ★ Semantic Pruning (LLMLingua-2 / XLM-RoBERTa on SageMaker GPU, token-level keep/drop classifier, ~100ms; 40–50% input tokens saved) → S3 Output Shaping (concise inject + @XX output aliases + dynamic max_tokens; 50–80% output tokens saved) → upstream LLM → S4 Post-Process (streaming §XX/@XX alias restoration; handles partial aliases across stream chunks). Adaptive: smart_rate() maps content density to compression rate.
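The stage flow above can be sketched as a simple composition over the message list. This is an illustrative sketch only — the stage functions and `compress_pipeline` signature are assumptions, not the actual OpenCompress API:

```python
# Illustrative sketch of the S0-S3 input-side flow (names are assumptions,
# not the real OpenCompress API). Each stage transforms every message
# before it reaches the upstream LLM; S4 runs on the streamed response.

def compress_pipeline(messages, stages):
    """Apply the input-side stages in order to every message."""
    for stage in stages:
        messages = [stage(m) for m in messages]
    return messages

# A trivial stage standing in for S0 (whitespace normalization only):
strip_ws = lambda m: {**m, "content": " ".join(m["content"].split())}

msgs = [{"role": "system", "content": "You  are   a  helpful\n\nassistant."}]
out = compress_pipeline(msgs, [strip_ws])
# out[0]["content"] -> "You are a helpful assistant."
```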
S0 Code Minification
Strips comments, empty lines, type annotations, and redundant whitespace. High-frequency identifiers (≥3 occurrences) are automatically shortened. 15–55% savings on code-heavy content.
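A toy version of this stage can be written in a few lines. This is a hedged sketch, not the real minifier: the comment syntax, identifier pattern, and ≥3-occurrence threshold are illustrative assumptions:

```python
import re
from collections import Counter

def minify(source: str) -> str:
    """Toy code minifier (illustrative sketch, not the real S0 stage):
    drops '#' comments and blank lines, then shortens identifiers
    that occur at least 3 times."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()  # strip trailing comments
        if line.strip():
            lines.append(line)
    text = "\n".join(lines)
    # Rename high-frequency identifiers (>= 3 uses) to short aliases.
    idents = Counter(re.findall(r"\b[a-z_][a-z0-9_]{4,}\b", text))
    for i, (name, n) in enumerate(sorted(idents.items(), key=lambda kv: -kv[1])):
        if n >= 3:
            text = re.sub(rf"\b{name}\b", f"v{i}", text)
    return text
```

A real minifier would also need to respect string literals and scope; this sketch only shows the frequency-based renaming idea.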
S1 Dictionary Compression
Extracts repeating substrings (JSON keys, identifiers, repeated values) and replaces them with compact §XX aliases. Each alias is 1–2 tokens vs the original multi-token substring.
S2 Semantic Pruning
LLMLingua-2 (XLM-RoBERTa) runs on SageMaker GPU for ~100ms inference. Token-level binary classifier scores each token's importance and drops low-value tokens while preserving meaning. 40–50% input savings.
S3 Output Shaping
Scenario-aware concise instructions are injected into the system prompt (codegen, codefix, structured, default). @XX output aliases let the LLM produce shorter responses. All aliases restored during streaming.
S4 Post-Process
Restores both §XX (input) and @XX (output) aliases in real-time during streaming. Handles partial aliases split across chunks. Runs token accounting and quality sampling.
+ Adaptive Rate
Analyzes content density (structure ratio, technical term density, words per line) and adjusts the compression rate to avoid destroying dense/structured content.

Semantic Pruning (LLMLingua-2)

The core compression engine uses LLMLingua-2 — a token-level binary classifier based on XLM-RoBERTa. Each token is scored for importance, and low-value tokens are dropped while preserving semantic meaning. Running on SageMaker GPU (ml.g4dn.xlarge, NVIDIA T4), inference takes ~100ms per message.

How It Works
1. Input text is tokenized by the XLM-RoBERTa tokenizer
2. Each token gets an importance score (0–1)
3. Tokens below the rate threshold are dropped
4. Remaining tokens are reassembled into compressed text
Per-message compression
System and assistant messages are compressed at the target rate. User messages are kept at rate=1.0 (no compression) to preserve intent.
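The keep/drop step and the per-message rate rule can be sketched as follows. The scoring itself comes from LLMLingua-2 on SageMaker; here the scores are passed in, and the function names are illustrative assumptions:

```python
def prune(tokens, scores, rate):
    """Keep the top `rate` fraction of tokens by importance score,
    preserving original order (sketch of the S2 keep/drop step; real
    scores come from the LLMLingua-2 classifier, not shown here)."""
    if rate >= 1.0:
        return tokens
    k = max(1, round(len(tokens) * rate))
    keep = set(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [t for i, t in enumerate(tokens) if i in keep]

def compress_message(role, tokens, scores, target_rate=0.5):
    # User messages are never pruned (rate = 1.0) to preserve intent.
    rate = 1.0 if role == "user" else target_rate
    return prune(tokens, scores, rate)
```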
Performance
Model          XLM-RoBERTa (560M)
Inference      ~100ms (GPU)
Hardware       SageMaker ml.g4dn.xlarge
Input savings  40–50%
Fallback       CPU (3–16s)

How Dictionary Compression Works

The dictionary engine extracts repeating substrings from the input and assigns compact aliases. This is especially effective for structured content like JSON, API responses, and code. The dictionary is applied before semantic pruning, maximizing the combined compression ratio.

Input-side aliases (§XX)
Input:   "transaction_status_code": "declined"
         "transaction_status_code": "unauthorized"
         "transaction_status_code": "throttled"

Dictionary:
  §01 = transaction_status_code

Output:  "§01": "declined"
         "§01": "unauthorized"
         "§01": "throttled"

Savings: 3 × (23 − 3) = 60 characters saved
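The extract-and-replace step above can be sketched in a few lines of Python. This is an illustrative sketch, not the real engine: the word-level pattern and the length/frequency thresholds are assumptions (the actual engine works on arbitrary repeated substrings):

```python
import re
from collections import Counter

def build_dictionary(text, min_len=8, min_count=3):
    """Toy §XX dictionary builder (illustrative sketch): aliases any
    word-like substring that is long enough and repeats often enough."""
    counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
    repeats = [w for w, n in counts.items()
               if len(w) >= min_len and n >= min_count]
    # Longest substrings first so shorter ones don't shadow them.
    repeats.sort(key=len, reverse=True)
    return {w: f"§{i + 1:02d}" for i, w in enumerate(repeats)}

def apply_dictionary(text, dictionary):
    for word, alias in dictionary.items():
        text = text.replace(word, alias)
    return text
```

Run on the transaction example above, this yields the single alias §01 = transaction_status_code and rewrites all three occurrences.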
Output-side aliases (@XX)
The LLM is instructed to use @XX aliases in its response:

  @01 = "function"  @02 = "return"  @03 = "undefined"

LLM output:  "The @01 should @02 the value instead of @03"
Expanded:    "The function should return the value instead of undefined"

Output aliases save 2–10× more than input compression
since output tokens are typically 3–5× more expensive.

The dictionary is injected into the system prompt so the LLM understands both §XX and @XX aliases. The post-processor restores all aliases in real-time during streaming, handling partial aliases split across stream chunks.
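The tricky part of streaming restoration is an alias cut in half at a chunk boundary (e.g. a chunk ending in "@0" with the "1" arriving later). A minimal sketch of that buffering logic, assuming aliases are exactly one marker plus two digits (the class name and API are illustrative, not the real implementation):

```python
import re

class StreamRestorer:
    """Sketch of S4 streaming restoration: expands §XX/@XX aliases in
    streamed chunks, holding back a partial alias at a chunk boundary
    until the next chunk completes it."""

    ALIAS = re.compile(r"[§@]\d{2}")
    PARTIAL = re.compile(r"[§@]\d?$")  # alias cut off at the chunk end

    def __init__(self, mapping):
        self.mapping = mapping         # e.g. {"@01": "function"}
        self.buf = ""

    def feed(self, chunk):
        text = self.buf + chunk
        m = self.PARTIAL.search(text)
        if m:                          # hold back the incomplete alias
            text, self.buf = text[:m.start()], text[m.start():]
        else:
            self.buf = ""
        return self.ALIAS.sub(lambda a: self.mapping.get(a.group(), a.group()),
                              text)

    def flush(self):
        out, self.buf = self.buf, ""
        return out
```

Because aliases are fixed-width, a chunk ending in a complete alias like "@02" is safe to expand immediately; only a bare marker or marker-plus-one-digit needs to be buffered.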

Output Shaping

Output tokens cost 3–5× more than input tokens. OpenCompress reduces output through two mechanisms: concise inject and output aliases.

1. Concise Inject
Analyzes system + user messages to detect the scenario (codegen, codefix, structured, default) and appends a tailored concise instruction to the system prompt. 50–80% output token reduction with no quality loss on factual content.
2. Output Aliases (@XX)
Predicts which input terms the LLM will echo in its response. Assigns @XX aliases so the LLM outputs short references instead of full terms. Aliases are expanded transparently during streaming.
3. Adaptive Rate Control
Content density analysis adjusts compression rate per-message. Dense structured content (code, JSON) gets lighter compression. Sparse prose gets compressed harder. Prevents information loss on critical content.
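The density-to-rate mapping can be sketched like this. The specific heuristics, weights, and thresholds below are illustrative assumptions, not the real smart_rate() internals:

```python
def smart_rate(text, base_rate=0.5):
    """Sketch of adaptive rate control (heuristics and weights are
    illustrative): denser, more structured content gets a higher
    keep-rate so pruning is gentler."""
    lines = [l for l in text.splitlines() if l.strip()] or [""]
    # Fraction of lines that open with structural characters (JSON, XML).
    structural = sum(1 for l in lines if l.strip()[:1] in "{}[]<") / len(lines)
    words_per_line = sum(len(l.split()) for l in lines) / len(lines)
    density = structural + (1.0 if words_per_line < 6 else 0.0)
    # Raise the keep-rate (compress less) as density grows, capped at 0.95.
    return min(0.95, base_rate + 0.3 * density)
```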

Verified Results

Results from the playground A/B comparison (direct vs compressed, same model). Each example runs two parallel LLM calls — one with the original prompt, one compressed.

Example         Input saved   Output saved   Cost saved
Code Review     47%           82%            77%
API Reference   44%           76%            69%
Agent Trace     46%           83%            72%
Log Analysis    46%           55%            51%
Schema Review   44%           80%            74%

Try it yourself in the Playground — no signup required.