Documentation

Architecture, API reference, and how each compression stage works.

Architecture Overview

OpenCompress uses a 5-stage compression pipeline that optimizes both input and output tokens. The core inference engine is LLMLingua-2 (XLM-RoBERTa) running on SageMaker GPU for ~100ms token-level pruning, combined with dictionary aliasing, code minification, and scenario-aware output shaping.

[Pipeline diagram] messages[] → S0 Code Minification (strip comments/whitespace, shorten identifiers; 15–55% on code-heavy content) → S1 Dictionary Compression (repeated substrings → §XX aliases, bidirectional) → S2 ★ Semantic Pruning (LLMLingua-2 / XLM-RoBERTa on SageMaker GPU, token-level keep/drop classifier, ~100ms; 40–50% input tokens saved) → S3 Output Shaping (concise inject + @XX output aliases + dynamic max_tokens; 50–80% output tokens saved) → upstream LLM → S4 Post-Process (streaming §XX/@XX alias restoration; handles partial aliases across stream chunks). Adaptive: smart_rate() maps content density to compression rate.
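The stage flow above can be sketched as a simple composition over the message list. This is an illustrative sketch only — the stage functions and `compress_pipeline` signature are assumptions, not the actual OpenCompress API:

```python
# Illustrative sketch of the S0-S3 input-side flow (names are assumptions,
# not the real OpenCompress API). Each stage transforms every message
# before it reaches the upstream LLM; S4 runs on the streamed response.

def compress_pipeline(messages, stages):
    """Apply the input-side stages in order to every message."""
    for stage in stages:
        messages = [stage(m) for m in messages]
    return messages

# A trivial stage standing in for S0 (whitespace normalization only):
strip_ws = lambda m: {**m, "content": " ".join(m["content"].split())}

msgs = [{"role": "system", "content": "You  are   a  helpful\n\nassistant."}]
out = compress_pipeline(msgs, [strip_ws])
# out[0]["content"] -> "You are a helpful assistant."
```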
S0 Code Minification
Strips comments, empty lines, type annotations, and redundant whitespace. High-frequency identifiers (≥3 occurrences) are automatically shortened. 15–55% savings on code-heavy content.
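A toy version of this stage can be written in a few lines. This is a hedged sketch, not the real minifier: the comment syntax, identifier pattern, and ≥3-occurrence threshold are illustrative assumptions:

```python
import re
from collections import Counter

def minify(source: str) -> str:
    """Toy code minifier (illustrative sketch, not the real S0 stage):
    drops '#' comments and blank lines, then shortens identifiers
    that occur at least 3 times."""
    lines = []
    for line in source.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()  # strip trailing comments
        if line.strip():
            lines.append(line)
    text = "\n".join(lines)
    # Rename high-frequency identifiers (>= 3 uses) to short aliases.
    idents = Counter(re.findall(r"\b[a-z_][a-z0-9_]{4,}\b", text))
    for i, (name, n) in enumerate(sorted(idents.items(), key=lambda kv: -kv[1])):
        if n >= 3:
            text = re.sub(rf"\b{name}\b", f"v{i}", text)
    return text
```

A real minifier would also need to respect string literals and scope; this sketch only shows the frequency-based renaming idea.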
S1 Dictionary Compression
Extracts repeating substrings (JSON keys, identifiers, repeated values) and replaces them with compact §XX aliases. Each alias is 1–2 tokens vs the original multi-token substring.
S2 Semantic Pruning
LLMLingua-2 (XLM-RoBERTa) runs on SageMaker GPU for ~100ms inference. Token-level binary classifier scores each token's importance and drops low-value tokens while preserving meaning. 40–50% input savings.
S3 Output Shaping
Scenario-aware concise instructions are injected into the system prompt (codegen, codefix, structured, default). @XX output aliases let the LLM produce shorter responses. All aliases restored during streaming.
S4 Post-Process
Restores both §XX (input) and @XX (output) aliases in real-time during streaming. Handles partial aliases split across chunks. Runs token accounting and quality sampling.
+ Adaptive Rate
Analyzes content density (structure ratio, technical term density, words per line) and adjusts the compression rate to avoid destroying dense/structured content.

Semantic Pruning (LLMLingua-2)

The core compression engine uses LLMLingua-2 — a token-level binary classifier based on XLM-RoBERTa. Each token is scored for importance, and low-value tokens are dropped while preserving semantic meaning. Running on SageMaker GPU (ml.g4dn.xlarge, NVIDIA T4), inference takes ~100ms per message.

How It Works
1. Input text is tokenized by the XLM-RoBERTa tokenizer
2. Each token gets an importance score (0–1)
3. Tokens below the rate threshold are dropped
4. Remaining tokens are reassembled into compressed text
Per-message compression
System and assistant messages are compressed at the target rate. User messages are kept at rate=1.0 (no compression) to preserve intent.
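The keep/drop step and the per-message rate rule can be sketched as follows. The scoring itself comes from LLMLingua-2 on SageMaker; here the scores are passed in, and the function names are illustrative assumptions:

```python
def prune(tokens, scores, rate):
    """Keep the top `rate` fraction of tokens by importance score,
    preserving original order (sketch of the S2 keep/drop step; real
    scores come from the LLMLingua-2 classifier, not shown here)."""
    if rate >= 1.0:
        return tokens
    k = max(1, round(len(tokens) * rate))
    keep = set(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [t for i, t in enumerate(tokens) if i in keep]

def compress_message(role, tokens, scores, target_rate=0.5):
    # User messages are never pruned (rate = 1.0) to preserve intent.
    rate = 1.0 if role == "user" else target_rate
    return prune(tokens, scores, rate)
```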
Performance
Model          XLM-RoBERTa (560M)
Inference      ~100ms (GPU)
Hardware       SageMaker ml.g4dn.xlarge
Input savings  40–50%
Fallback       CPU (3–16s)

How Dictionary Compression Works

The dictionary engine extracts repeating substrings from the input and assigns compact aliases. This is especially effective for structured content like JSON, API responses, and code. The dictionary is applied before semantic pruning, maximizing the combined compression ratio.

Input-side aliases (§XX)
Input:   "transaction_status_code": "declined"
         "transaction_status_code": "unauthorized"
         "transaction_status_code": "throttled"

Dictionary:
  §01 = transaction_status_code

Output:  "§01": "declined"
         "§01": "unauthorized"
         "§01": "throttled"

Savings: 3 × (23 − 3) = 60 characters saved
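The extract-and-replace step above can be sketched in a few lines of Python. This is an illustrative sketch, not the real engine: the word-level pattern and the length/frequency thresholds are assumptions (the actual engine works on arbitrary repeated substrings):

```python
import re
from collections import Counter

def build_dictionary(text, min_len=8, min_count=3):
    """Toy §XX dictionary builder (illustrative sketch): aliases any
    word-like substring that is long enough and repeats often enough."""
    counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
    repeats = [w for w, n in counts.items()
               if len(w) >= min_len and n >= min_count]
    # Longest substrings first so shorter ones don't shadow them.
    repeats.sort(key=len, reverse=True)
    return {w: f"§{i + 1:02d}" for i, w in enumerate(repeats)}

def apply_dictionary(text, dictionary):
    for word, alias in dictionary.items():
        text = text.replace(word, alias)
    return text
```

Run on the transaction example above, this yields the single alias §01 = transaction_status_code and rewrites all three occurrences.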
Output-side aliases (@XX)
The LLM is instructed to use @XX aliases in its response:

  @01 = "function"  @02 = "return"  @03 = "undefined"

LLM output:  "The @01 should @02 the value instead of @03"
Expanded:    "The function should return the value instead of undefined"

Output aliases save 2–10× more than input compression
since output tokens are typically 3–5× more expensive.

The dictionary is injected into the system prompt so the LLM understands both §XX and @XX aliases. The post-processor restores all aliases in real-time during streaming, handling partial aliases split across stream chunks.
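The tricky part of streaming restoration is an alias cut in half at a chunk boundary (e.g. a chunk ending in "@0" with the "1" arriving later). A minimal sketch of that buffering logic, assuming aliases are exactly one marker plus two digits (the class name and API are illustrative, not the real implementation):

```python
import re

class StreamRestorer:
    """Sketch of S4 streaming restoration: expands §XX/@XX aliases in
    streamed chunks, holding back a partial alias at a chunk boundary
    until the next chunk completes it."""

    ALIAS = re.compile(r"[§@]\d{2}")
    PARTIAL = re.compile(r"[§@]\d?$")  # alias cut off at the chunk end

    def __init__(self, mapping):
        self.mapping = mapping         # e.g. {"@01": "function"}
        self.buf = ""

    def feed(self, chunk):
        text = self.buf + chunk
        m = self.PARTIAL.search(text)
        if m:                          # hold back the incomplete alias
            text, self.buf = text[:m.start()], text[m.start():]
        else:
            self.buf = ""
        return self.ALIAS.sub(lambda a: self.mapping.get(a.group(), a.group()),
                              text)

    def flush(self):
        out, self.buf = self.buf, ""
        return out
```

Because aliases are fixed-width, a chunk ending in a complete alias like "@02" is safe to expand immediately; only a bare marker or marker-plus-one-digit needs to be buffered.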

Output Shaping

Output tokens cost 3–5× more than input tokens. OpenCompress reduces output through two mechanisms: concise inject and output aliases.

1. Concise Inject
Analyzes system + user messages to detect the scenario (codegen, codefix, structured, default) and appends a tailored concise instruction to the system prompt. 50–80% output token reduction with no quality loss on factual content.
2. Output Aliases (@XX)
Predicts which input terms the LLM will echo in its response. Assigns @XX aliases so the LLM outputs short references instead of full terms. Aliases are expanded transparently during streaming.
3. Adaptive Rate Control
Content density analysis adjusts compression rate per-message. Dense structured content (code, JSON) gets lighter compression. Sparse prose gets compressed harder. Prevents information loss on critical content.
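The density-to-rate mapping can be sketched like this. The specific heuristics, weights, and thresholds below are illustrative assumptions, not the real smart_rate() internals:

```python
def smart_rate(text, base_rate=0.5):
    """Sketch of adaptive rate control (heuristics and weights are
    illustrative): denser, more structured content gets a higher
    keep-rate so pruning is gentler."""
    lines = [l for l in text.splitlines() if l.strip()] or [""]
    # Fraction of lines that open with structural characters (JSON, XML).
    structural = sum(1 for l in lines if l.strip()[:1] in "{}[]<") / len(lines)
    words_per_line = sum(len(l.split()) for l in lines) / len(lines)
    density = structural + (1.0 if words_per_line < 6 else 0.0)
    # Raise the keep-rate (compress less) as density grows, capped at 0.95.
    return min(0.95, base_rate + 0.3 * density)
```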

Verified Results

Results from the playground A/B comparison (direct vs compressed, same model). Each example runs two parallel LLM calls — one with the original prompt, one compressed.

Example         Input saved   Output saved   Cost saved
Code Review     47%           82%            77%
API Reference   44%           76%            69%
Agent Trace     46%           83%            72%
Log Analysis    46%           55%            51%
Schema Review   44%           80%            74%

Try it yourself in the Playground — no signup required.