Documentation

Architecture, API reference, and how each compression stage works.

Architecture Overview

OpenCompact uses a 5-stage compression pipeline that optimizes both input and output tokens. At its core is an Agent-Aware Distilled model — a compact token-level classifier (44–150M params) trained on 105K token-level annotations of real agent samples, replacing the generic LLMLingua-2 approach with domain-matched compression that runs 4–12× faster.

[Pipeline diagram] messages[] → S0 Content Classifier (content-type routing via a heuristic classifier) → Path A (structured) or Path B (natural) → S1A/S1B Dict Compress (repeated-substring aliasing) → S2A compact-structured / S2B compact-natural (rate-conditioned token-level binary classifiers, ≤86M params; the core innovation: scenario-specific distillation) → S3 Output Shaping (system prompt injection + output budget control) → upstream LLM → S4 Post-Process (streaming decompression + adaptive rate feedback). Short inputs skip compression; quality feedback loops back into the rate.
Stage 0: Content Classifier
Auto-detects content type (json_api, code, prose, chat, mixed) and routes to specialized compression strategies optimized for each type.
Stage 1: Dictionary Compression
Extracts repeating substrings (JSON keys, identifiers, repeated values) and replaces them with compact §XX aliases. Each alias is 1–2 tokens vs the original multi-token substring.
Stage 2: Agent-Aware Distillation
A compact token-level keep/drop classifier (44–150M params) trained on AgentCompBench: 35K samples across 17 agent scenarios. Rate-conditioned: a single model handles all compression rates (0.2–0.8). Replaces the generic LLMLingua-2 approach with 4–12× faster inference.
Stage 3: Output Shaping
Injects a concise-output instruction into the system prompt, calculates a dynamic max_tokens budget, and prepares @XX output alias instructions so the LLM produces shorter responses using dictionary aliases.
Stage 4: Post-Process
Restores both §XX (input) and @XX (output) aliases in real time during streaming, handling partial aliases split across chunks. Runs token accounting and quality sampling.
Adaptive Control (cross-stage)
Analyzes content density (structure ratio, technical term density, words per line) and adjusts the compression rate to avoid destroying dense or structured content.
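The adaptive control stage can be sketched as a density heuristic that nudges the keep-rate. The feature names (structure ratio, technical term density, words per line) come from the description above; the weights, thresholds, and function names below are illustrative assumptions, not OpenCompact's actual values.

```python
import re

def content_density(text: str) -> float:
    """Score content density in [0, 1]: higher means more structured or
    technical content that should be compressed less aggressively.
    Weights and thresholds here are illustrative guesses."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = text.split()
    if not lines or not words:
        return 0.0
    # Structure ratio: share of characters that are brackets, quotes, colons.
    structure = sum(ch in '{}[]()<>:,"=' for ch in text) / len(text)
    # Technical term density: tokens with _, ., / or camelCase humps.
    technical = sum(bool(re.search(r"[_./]|[a-z][A-Z]", w)) for w in words) / len(words)
    # Few words per line suggests keys, code, or logs.
    shortness = 1.0 - min(len(words) / len(lines) / 15.0, 1.0)
    return min(1.0, 0.5 * min(structure * 5, 1.0) + 0.3 * technical + 0.2 * shortness)

def adaptive_rate(base_rate: float, text: str,
                  floor: float = 0.2, ceil: float = 0.8) -> float:
    """Nudge the keep-rate up for dense content, down for loose prose,
    clamped to the supported 0.2-0.8 range."""
    return min(ceil, max(floor, base_rate + 0.3 * (content_density(text) - 0.5)))
```

With a requested rate of 0.5, a dense JSON payload ends up keeping more tokens than loose prose does, which is the behavior the stage description calls for.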

Agent-Aware Distillation

The core innovation: instead of using a generic model trained on meeting transcripts (LLMLingua-2), we trained a purpose-built classifier on real agent workloads. The key insight is data quality > model size — a 44M model with domain-matched training data matches or exceeds a 560M generic model.
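The pruning step the classifier drives can be sketched as follows. This is not the model itself: `keep_probs` stands in for the per-token keep scores a real rate-conditioned classifier would emit, and the selection logic is an assumed top-k strategy.

```python
def prune_tokens(tokens: list[str], keep_probs: list[float], rate: float) -> list[str]:
    """Keep roughly `rate` of the tokens, preferring those the classifier
    scores highest, while preserving the original order."""
    assert len(tokens) == len(keep_probs)
    k = max(1, round(len(tokens) * rate))
    # Indices of the top-k scoring tokens, restored to document order.
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -keep_probs[i])[:k])
    return [tokens[i] for i in keep]
```

Because the rate is an input rather than a property of the checkpoint, one model serves the whole 0.2–0.8 range.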

AgentCompBench Dataset
35,000 samples × 3 compression rates = 105K annotations
17 agent scenarios across 2 categories:
Structured (14K)
code_review, tool_result, json_schema, code_debug, db_query, mcp_tool_call, git_diff, browser_a11y, cicd_log
Natural (21K)
email_digest, news_aggregate, agent_history, rag_context, meeting_notes, chat_history, devops_incident, browser_session
vs LLMLingua-2
  Training data:     MeetingBank → AgentCompBench
  Model size:        560M → 44–150M
  Inference speed:   ~5K → 15–25K tok/s
  Rate control:      per-rate models → single rate-conditioned model
  Content routing:   none → auto-detect

API Reference

POST /api/compare

Main endpoint. Compresses the prompt, runs both direct and compressed requests in parallel, and streams results via SSE.

Request Body
{
  "model": "deepseek/deepseek-v3.2",
  "system_prompt": "Optional system prompt",
  "user_prompt": "Your prompt to compress",
  "compression_rate": 0.5,
  "strategy": null,
  "preset": null,
  "dict_enabled": true
}
SSE Event Types
dict_info — Dictionary compression metadata (aliases, mappings)
compression_info — Token counts, rates, content type, density info
direct_token / compact_token — Streaming tokens (parallel)
direct_done / compact_done — Final stats (tokens, latency)
dict_stats — Alias hit rate and usage in output
quality — Cosine similarity, info retention, structural preservation
output_guard — Warning if compact output exceeds direct
stream_end — Sentinel event
POST /api/compress

Standalone compression without LLM call. Returns compressed text and metrics. Supports batch mode for up to 50 documents.

GET /api/config

Returns available models, presets, strategies, and defaults.

GET /api/health

Health check. Returns worker info, thread pool size, model status, and PID.

Quick Start

1. Clone & install
git clone <repo-url>
cd opencompactor
pip install -r requirements.txt
2. Configure
cp config.example.yaml config.yaml
# Edit config.yaml with your settings
export OPENCOMPACTOR_API_KEY=your-openrouter-key
3. Run
python demo_server.py
# Server starts on http://localhost:3000
4. Docker (optional)
docker compose up -d
# Backend on :3000, frontend on :3001

How Dictionary Compression Works

The dictionary engine extracts repeating substrings from the input and assigns compact aliases. This is especially effective for structured content like JSON, API responses, and code. The dictionary is applied before the distilled model prunes tokens, maximizing the combined compression ratio.

Input-side aliases (§XX)
Input:   "transaction_status_code": "declined"
         "transaction_status_code": "unauthorized"
         "transaction_status_code": "throttled"

Dictionary:
  §01 = transaction_status_code

Output:  "§01": "declined"
         "§01": "unauthorized"
         "§01": "throttled"

Savings: 3 × (23 - 3) = 60 characters saved
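A minimal sketch of building and applying such a dictionary. The real engine considers arbitrary repeated substrings; for brevity this version only looks at quoted JSON keys, and the thresholds (`min_len`, `min_count`) are illustrative assumptions.

```python
import re
from collections import Counter

def build_dictionary(text: str, min_len: int = 8, min_count: int = 2,
                     max_aliases: int = 99) -> dict[str, str]:
    """Assign §NN aliases to repeated quoted keys/identifiers."""
    candidates = Counter(
        m.group(1)
        for m in re.finditer(r'"([A-Za-z_][A-Za-z0-9_]*)"', text)
        if len(m.group(1)) >= min_len
    )
    # Longest, most frequent substrings first: biggest savings per alias.
    ranked = sorted(
        (s for s, c in candidates.items() if c >= min_count),
        key=lambda s: -(len(s) * candidates[s]),
    )
    return {s: f"§{i + 1:02d}" for i, s in enumerate(ranked[:max_aliases])}

def apply_dictionary(text: str, mapping: dict[str, str]) -> str:
    """Replace each repeated substring with its compact alias."""
    for original, alias in mapping.items():
        text = text.replace(original, alias)
    return text
```

Running this on the example above maps transaction_status_code to §01, reproducing the compressed form shown.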
Output-side aliases (@XX)
The LLM is instructed to use @XX aliases in its response:

  @01 = "function"  @02 = "return"  @03 = "undefined"

LLM output:  "The @01 should @02 the value instead of @03"
Expanded:    "The function should return the value instead of undefined"

Output aliases can save 2-10× more than input compression
since output tokens are typically more expensive.
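Expanding @XX aliases in a finished response reduces to a regex substitution. The alias table below is the illustrative one from the example above; in practice it would come from the dict_info event.

```python
import re

def expand_output_aliases(text: str, aliases: dict[str, str]) -> str:
    """Replace @NN aliases in an LLM response with their expansions.
    Unknown aliases are left untouched."""
    return re.sub(r"@\d{2}", lambda m: aliases.get(m.group(0), m.group(0)), text)
```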

The dictionary is injected into the system prompt so the LLM understands both §XX and @XX aliases. The post-processor restores all aliases in real-time during streaming, handling partial aliases split across stream chunks.
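The tricky part of streaming restoration is a chunk that ends mid-alias (e.g. "...@0" in one chunk, "1" in the next). One way to handle it, sketched here as an assumption about the approach rather than the actual implementation, is to hold back any chunk suffix that could be a partial alias until more text arrives:

```python
import re

class StreamRestorer:
    """Restore §XX / @XX aliases across streaming chunks, holding back a
    possibly-incomplete alias at a chunk boundary until it can be resolved."""
    PARTIAL = re.compile(r"[§@]\d?\d?$")   # alias prefix at end of chunk
    COMPLETE = re.compile(r"[§@]\d{2}")    # full two-digit alias

    def __init__(self, aliases: dict[str, str]):
        self.aliases = aliases  # e.g. {"@01": "function", "§02": "user_id"}
        self.buffer = ""

    def _expand(self, text: str) -> str:
        return self.COMPLETE.sub(
            lambda m: self.aliases.get(m.group(0), m.group(0)), text)

    def feed(self, chunk: str) -> str:
        """Emit restored text; retain any trailing partial alias."""
        text = self.buffer + chunk
        m = self.PARTIAL.search(text)
        if m:
            self.buffer, text = text[m.start():], text[:m.start()]
        else:
            self.buffer = ""
        return self._expand(text)

    def flush(self) -> str:
        """Emit whatever is still buffered once the stream ends."""
        out, self.buffer = self._expand(self.buffer), ""
        return out
```

Even when "@01" is split across two chunks, the concatenated output comes back fully restored.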

Core Innovations

1. Agent-Aware Distillation
First compression model trained specifically for agent scenarios: tool calls, code review, API responses, agent history. Outperforms generic models at 4–12× the speed.
2. Data Quality > Model Size
A 44M-parameter model with domain-matched training data matches or exceeds a 560M generic model (XLM-RoBERTa-large). Runs on consumer hardware.
3. Bidirectional Dictionary Compression
Input-side §XX aliases + output-side @XX aliases. Output alias compression is 2–10× more valuable since output tokens cost more.
4. Content-Type Routing + Rate Conditioning
Auto-detects content type and routes to specialized compression. A single rate-conditioned model handles the full 0.2–0.8 range; no separate checkpoints needed.
5. Closed-Loop Adaptive Control
Runtime rate-distortion curve fitting with quality feedback. Automatically adjusts the compression rate based on content density to prevent information loss.