Documentation
Architecture, API reference, and how each compression stage works.
Architecture Overview
OpenCompact uses a 5-stage compression pipeline that optimizes both input and output tokens. At its core is an Agent-Aware Distilled model — a compact token-level classifier (44–150M params) trained on 105K annotated agent samples, replacing the generic LLMLingua-2 approach with domain-matched compression that runs 4–12× faster.
Agent-Aware Distillation
The core innovation: instead of using a generic model trained on meeting transcripts (LLMLingua-2), we trained a purpose-built classifier on real agent workloads. The key insight is that data quality matters more than model size: a 44M-parameter model with domain-matched training data matches or exceeds the 560M-parameter generic model.
| | LLMLingua-2 | OpenCompact |
| --- | --- | --- |
| Training data | MeetingBank | AgentCompBench |
| Model size | 560M | 44–150M |
| Inference speed | ~5K tok/s | 15–25K tok/s |
| Rate control | One model per rate | Single model, any rate |
| Content routing | None | Auto-detect |
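The token-level classifier described above can be sketched as follows: the model assigns each token a keep-probability, and the pipeline retains the highest-scoring tokens until the target compression rate is met. This is an illustrative sketch only; the scores below are stand-ins for real classifier output, and the function name is hypothetical.

```python
def prune_tokens(tokens, keep_probs, rate):
    """Keep roughly `rate` fraction of tokens, favoring high keep-probabilities."""
    if len(tokens) != len(keep_probs):
        raise ValueError("one score per token required")
    n_keep = max(1, round(len(tokens) * rate))
    # Rank token indices by score, keep the top n, then restore original order.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]

tokens = ["the", "payment", "was", "declined", "due", "to", "fraud", "rules"]
scores = [0.1, 0.9, 0.2, 0.95, 0.3, 0.1, 0.9, 0.8]
print(prune_tokens(tokens, scores, 0.5))
# → ['payment', 'declined', 'fraud', 'rules']
```

Because a single classifier emits per-token scores, one model serves any compression rate: the rate only changes how many top-scoring tokens are kept.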
API Reference
/api/compare
Main endpoint. Compresses the prompt, runs both direct and compressed requests in parallel, and streams results via SSE.
{
"model": "deepseek/deepseek-v3.2",
"system_prompt": "Optional system prompt",
"user_prompt": "Your prompt to compress",
"compression_rate": 0.5,
"strategy": null,
"preset": null,
"dict_enabled": true
}

SSE events:
- dict_info — Dictionary compression metadata (aliases, mappings)
- compression_info — Token counts, rates, content type, density info
- direct_token / compact_token — Streaming tokens (parallel)
- direct_done / compact_done — Final stats (tokens, latency)
- dict_stats — Alias hit rate and usage in output
- quality — Cosine similarity, info retention, structural preservation
- output_guard — Warning if compact output exceeds direct
- stream_end — Sentinel event

/api/compress
Standalone compression without an LLM call. Returns compressed text and metrics. Supports batch mode for up to 50 documents.
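A client consuming the /api/compare stream has to split the SSE wire format into event/data pairs before it can dispatch on the event names listed above. A minimal parser sketch, assuming the standard SSE framing of `event:` and `data:` lines separated by blank lines:

```python
import json

def parse_sse(raw):
    """Yield (event, data) pairs from a raw SSE text buffer."""
    for frame in raw.strip().split("\n\n"):
        event, data = None, []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        if event:
            yield event, json.loads("\n".join(data)) if data else None

sample = (
    "event: compression_info\n"
    'data: {"orig_tokens": 120, "compact_tokens": 60}\n'
    "\n"
    "event: stream_end\n"
    "data: {}\n"
)
for name, payload in parse_sse(sample):
    print(name, payload)
```

The field names in `sample` are illustrative; the actual payload schema of each event is whatever the server emits for `compression_info` and friends.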
/api/config
Returns available models, presets, strategies, and defaults.

/api/health
Health check. Returns worker info, thread pool size, model status, and PID.
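For /api/compress batch mode, a client should enforce the 50-document cap before sending. The request-builder sketch below is a client-side assumption: the `documents` field name is modeled on the /api/compare payload above and is not confirmed by the docs; only the 50-document limit and the `compression_rate` field come from this page.

```python
import json

MAX_BATCH = 50  # documented batch-mode limit for /api/compress

def build_compress_request(documents, compression_rate=0.5):
    """Serialize a hypothetical /api/compress batch payload, validating limits."""
    if not 0 < compression_rate <= 1:
        raise ValueError("compression_rate must be in (0, 1]")
    if len(documents) > MAX_BATCH:
        raise ValueError(f"batch mode supports at most {MAX_BATCH} documents")
    return json.dumps({"documents": documents, "compression_rate": compression_rate})
```

POSTing the resulting JSON to `/api/compress` (with whatever auth the deployment requires) would return compressed text and metrics per document.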
Quick Start
git clone <repo-url>
cd opencompactor
pip install -r requirements.txt

cp config.example.yaml config.yaml   # Edit config.yaml with your settings
export OPENCOMPACTOR_API_KEY=your-openrouter-key

python demo_server.py                # Server starts on http://localhost:3000

docker compose up -d                 # Backend on :3000, frontend on :3001
How Dictionary Compression Works
The dictionary engine extracts repeating substrings from the input and assigns compact aliases. This is especially effective for structured content like JSON, API responses, and code. The dictionary is applied before the distilled model prunes tokens, maximizing the combined compression ratio.
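The extraction step can be sketched as a scoring pass over repeated strings: candidates that recur often enough to pay for their alias get one. This toy version only considers whitespace-free identifier-like words above a length and frequency threshold; a real engine would score arbitrary substrings by `(length - alias_len) * count`. The function names and thresholds are illustrative.

```python
import re
from collections import Counter

def build_dictionary(text, min_len=8, min_count=3):
    """Map long, frequently repeated identifiers to compact §NN aliases."""
    words = re.findall(r"[A-Za-z_]{%d,}" % min_len, text)
    repeats = [w for w, c in Counter(words).most_common() if c >= min_count]
    return {w: f"\u00a7{i:02d}" for i, w in enumerate(repeats, start=1)}

def apply_dictionary(text, mapping):
    for word, alias in mapping.items():
        text = text.replace(word, alias)
    return text

src = ('"transaction_status_code": "declined"\n'
       '"transaction_status_code": "unauthorized"\n'
       '"transaction_status_code": "throttled"\n')
d = build_dictionary(src)
print(d)                         # → {'transaction_status_code': '§01'}
print(apply_dictionary(src, d))
```

Running the dictionary pass first shrinks the text the distilled model sees, so the token pruning stage operates on an already-denser input.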
Input: "transaction_status_code": "declined"
"transaction_status_code": "unauthorized"
"transaction_status_code": "throttled"
Dictionary:
§01 = transaction_status_code
Output: "§01": "declined"
"§01": "unauthorized"
"§01": "throttled"
Savings: 3 × (23 − 3) = 60 characters saved

The LLM is also instructed to use @XX aliases in its response:

@01 = "function"
@02 = "return"
@03 = "undefined"

LLM output: "The @01 should @02 the value instead of @03"
Expanded: "The function should return the value instead of undefined"

Output aliases can save 2–10× more than input compression, since output tokens are typically more expensive.
The dictionary is injected into the system prompt so the LLM understands both §XX and @XX aliases. The post-processor restores all aliases in real-time during streaming, handling partial aliases split across stream chunks.
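The streaming restoration described above can be sketched as a small stateful expander: it replaces complete @NN aliases in each chunk and buffers a trailing partial alias (for example, a chunk ending in "@0") until the next chunk completes it. Class and method names here are illustrative, not the project's actual API.

```python
import re

ALIAS = re.compile(r"@\d{2}")
PARTIAL = re.compile(r"@\d?$")  # a possible alias cut off at the chunk boundary

class AliasRestorer:
    def __init__(self, mapping):
        self.mapping = mapping   # e.g. {"@01": "function"}
        self.tail = ""           # held-back partial alias from the previous chunk

    def feed(self, chunk):
        buf = self.tail + chunk
        m = PARTIAL.search(buf)
        if m:  # hold back the trailing "@", or "@" + one digit, for the next chunk
            buf, self.tail = buf[:m.start()], buf[m.start():]
        else:
            self.tail = ""
        return ALIAS.sub(lambda m: self.mapping.get(m.group(), m.group()), buf)

    def flush(self):
        """Emit any leftover text (a trailing "@" that never became an alias)."""
        out, self.tail = self.tail, ""
        return out

r = AliasRestorer({"@01": "function", "@02": "return", "@03": "undefined"})
chunks = ["The @01 should @0", "2 the value instead of @03"]
print("".join(r.feed(c) for c in chunks) + r.flush())
# → The function should return the value instead of undefined
```

Note the second chunk begins mid-alias ("2" completing "@02"); buffering the partial "@0" is what keeps the expansion correct across arbitrary stream splits.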