Documentation
Architecture, API reference, and how each compression stage works.
Architecture Overview
OpenCompress uses a 5-stage compression pipeline that optimizes both input and output tokens. The core inference engine is LLMLingua-2 (an XLM-RoBERTa token classifier) running on SageMaker GPU, which performs token-level pruning in ~100ms per message. It is combined with dictionary aliasing, code minification, and scenario-aware output shaping.
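The stages can be pictured as a function chain over the prompt. The sketch below is illustrative only: the stage names and bodies are placeholders invented for this page, not OpenCompress's actual modules.

```python
# Illustrative sketch of the 5-stage flow described above. Stage names
# and bodies are placeholders, not OpenCompress's real modules.

def dictionary_alias(p: str) -> str:   # stage 1: §XX aliases for repeats
    return p                           # (no-op placeholder)

def minify_code(p: str) -> str:        # stage 2: code minification
    return " ".join(p.split())         # crude whitespace collapse

def semantic_prune(p: str) -> str:     # stage 3: LLMLingua-2 pruning
    return p                           # (no-op placeholder)

def shape_output(p: str) -> str:       # stage 4: output shaping
    return p + "\nBe concise."         # hypothetical "concise inject"

def compress(prompt: str) -> str:
    # Stage 5, alias restoration, runs on the LLM *response*, not here.
    for stage in (dictionary_alias, minify_code, semantic_prune, shape_output):
        prompt = stage(prompt)
    return prompt
```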
Semantic Pruning (LLMLingua-2)
The core compression engine uses LLMLingua-2 — a token-level binary classifier based on XLM-RoBERTa. Each token is scored for importance, and low-value tokens are dropped while preserving semantic meaning. Running on SageMaker GPU (ml.g4dn.xlarge, NVIDIA T4), inference takes ~100ms per message.
| Property | Value |
|---|---|
| Model | XLM-RoBERTa (560M) |
| Inference | ~100ms (GPU) |
| Hardware | SageMaker ml.g4dn.xlarge (NVIDIA T4) |
| Input savings | 40–50% |
| Fallback | CPU (3–16s) |
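The real engine is a trained binary classifier. To make the keep/drop mechanic concrete, here is a toy pruner with a hand-rolled importance score; the stopword list and threshold are invented for illustration and have nothing to do with the actual model's learned scores.

```python
import re

# Toy illustration of token-level pruning. The real engine scores tokens
# with a trained XLM-RoBERTa classifier (LLMLingua-2); this hand-rolled
# score just makes the keep/drop mechanic visible.

STOPWORDS = {"the", "a", "an", "of", "to", "is", "that", "please"}

def importance(token: str) -> float:
    if token.lower() in STOPWORDS:
        return 0.1                       # filler words score low
    return min(1.0, len(token) / 8)      # crude proxy: longer = more content

def prune(text: str, threshold: float = 0.3) -> str:
    tokens = re.findall(r"\S+", text)
    kept = [t for t in tokens if importance(t) >= threshold]
    return " ".join(kept)
```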
How Dictionary Compression Works
The dictionary engine extracts repeating substrings from the input and assigns compact aliases. This is especially effective for structured content like JSON, API responses, and code. The dictionary is applied before semantic pruning, maximizing the combined compression ratio.
Input:

    "transaction_status_code": "declined"
    "transaction_status_code": "unauthorized"
    "transaction_status_code": "throttled"

Dictionary:

    §01 = transaction_status_code

Output:

    "§01": "declined"
    "§01": "unauthorized"
    "§01": "throttled"
Savings: 3 × (23 − 3) = 60 characters.

The same mechanism works in reverse for output: the LLM is instructed to use @XX aliases in its response.

    Dictionary:  @01 = function    @02 = return    @03 = undefined

    LLM output:  "The @01 should @02 the value instead of @03"
    Expanded:    "The function should return the value instead of undefined"

Output aliases save 2–10× more than input compression, since output tokens are typically 3–5× more expensive.
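A minimal sketch of the input-side dictionary stage, assuming a simple heuristic (alias any identifier of 8+ characters that appears 3+ times); the real engine's substring extraction is presumably more sophisticated.

```python
import re
from collections import Counter

# Sketch of dictionary aliasing under a simple assumed heuristic: alias
# identifiers of >= min_len chars that repeat >= min_count times.

def build_dictionary(text: str, min_len: int = 8, min_count: int = 3) -> dict:
    words = re.findall(r"[A-Za-z_]{%d,}" % min_len, text)
    counts = Counter(words)
    repeats = [w for w, c in counts.items() if c >= min_count]
    return {w: f"§{i:02d}" for i, w in enumerate(repeats, 1)}

def apply_dictionary(text: str, dictionary: dict) -> str:
    for word, alias in dictionary.items():
        text = text.replace(word, alias)
    return text
```

On the three-line example above, this yields a single `§01` alias for `transaction_status_code`.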
The dictionary is injected into the system prompt so the LLM understands both §XX and @XX aliases. The post-processor restores all aliases in real-time during streaming, handling partial aliases split across stream chunks.
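One way to handle the split-alias case is to hold back any trailing characters that could be the start of an alias until the next chunk arrives. The sketch below assumes aliases are `@` plus exactly two digits; it is not OpenCompress's actual post-processor.

```python
import re

# Streaming alias restoration sketch: an alias like "@01" may be split
# across stream chunks ("...the @" / "01 should..."). Buffer a trailing
# partial match until the next chunk resolves it. Assumes the alias
# format is "@" plus exactly two digits.

ALIAS = re.compile(r"@\d{2}")
PARTIAL = re.compile(r"@\d?$")   # possibly-incomplete alias at chunk end

def stream_restore(chunks, table):
    def expand(text):
        return ALIAS.sub(lambda mo: table.get(mo.group(), mo.group()), text)
    buf = ""
    for chunk in chunks:
        buf += chunk
        m = PARTIAL.search(buf)
        if m:  # hold back the partial alias, emit the rest
            emit, buf = buf[:m.start()], buf[m.start():]
        else:
            emit, buf = buf, ""
        yield expand(emit)
    if buf:  # flush anything still buffered at end of stream
        yield expand(buf)
```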
Output Shaping
Output tokens typically cost 3–5× more than input tokens. OpenCompress reduces output through two mechanisms: concise inject and output aliases (@XX).
Verified Results
Results from the playground A/B comparison (direct vs compressed, same model). Each example runs two parallel LLM calls — one with the original prompt, one compressed.
| Example | Input saved | Output saved | Cost saved |
|---|---|---|---|
| Code Review | 47% | 82% | 77% |
| API Reference | 44% | 76% | 69% |
| Agent Trace | 46% | 83% | 72% |
| Log Analysis | 46% | 55% | 51% |
| Schema Review | 44% | 80% | 74% |
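The Cost saved column blends the two savings, weighted by token mix and price. A back-of-envelope sketch, assuming an illustrative 4× output/input price ratio (within the 3–5× range quoted above) and made-up token counts; actual per-example counts are not published here.

```python
# Back-of-envelope blend of input and output savings. The 4x price ratio
# and the token counts are illustrative assumptions, not published figures.

def cost_saved(input_saved: float, output_saved: float,
               in_tokens: int, out_tokens: int,
               price_ratio: float = 4.0) -> float:
    base = in_tokens + price_ratio * out_tokens
    saved = input_saved * in_tokens + output_saved * price_ratio * out_tokens
    return saved / base

# Code Review row (47% input, 82% output) with a hypothetical
# 1000-token prompt and 1500-token response:
print(round(cost_saved(0.47, 0.82, 1000, 1500), 2))  # -> 0.77
```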
Try it yourself in the Playground — no signup required.