Documentation
Architecture, API reference, and how each compression stage works.
Architecture Overview
OpenCompact uses a 5-stage compression pipeline that optimizes both input and output tokens. At its core is an Agent-Aware Distilled model — a compact token-level classifier (44–150M params) trained on 105K annotated agent samples, replacing the generic LLMLingua-2 approach with domain-matched compression that runs 4–12× faster.
Agent-Aware Distillation
The core innovation: instead of using a generic model trained on meeting transcripts (LLMLingua-2), we trained a purpose-built classifier on real agent workloads. The key insight is that data quality matters more than model size: a 44M-parameter model with domain-matched training data matches or exceeds the 560M-parameter generic model.
| | LLMLingua-2 | OpenCompact |
| --- | --- | --- |
| Training data | MeetingBank | AgentCompBench |
| Model size | 560M | 44–150M |
| Inference speed | ~5K tok/s | 15–25K tok/s |
| Rate control | One model per rate | Single model, any rate |
| Content routing | None | Auto-detect |
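The token-level classifier described above can be sketched as follows: the model assigns each token a keep-probability, and the pipeline retains the highest-scoring tokens until the target compression rate is met. This is an illustrative sketch only; the scores below are stand-ins for real classifier output, and the function name is hypothetical.

```python
def prune_tokens(tokens, keep_probs, rate):
    """Keep roughly `rate` fraction of tokens, favoring high keep-probabilities."""
    if len(tokens) != len(keep_probs):
        raise ValueError("one score per token required")
    n_keep = max(1, round(len(tokens) * rate))
    # Rank token indices by score, keep the top n, then restore original order.
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]

tokens = ["the", "payment", "was", "declined", "due", "to", "fraud", "rules"]
scores = [0.1, 0.9, 0.2, 0.95, 0.3, 0.1, 0.9, 0.8]
print(prune_tokens(tokens, scores, 0.5))
# → ['payment', 'declined', 'fraud', 'rules']
```

Because a single classifier emits per-token scores, one model serves any compression rate: the rate only changes how many top-scoring tokens are kept.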
API Reference
/api/compare
Main endpoint. Compresses the prompt, runs both direct and compressed requests in parallel, and streams results via SSE.
{
"model": "deepseek/deepseek-v3.2",
"system_prompt": "Optional system prompt",
"user_prompt": "Your prompt to compress",
"compression_rate": 0.5,
"strategy": null,
"preset": null,
"dict_enabled": true
}

SSE events:
- dict_info — Dictionary compression metadata (aliases, mappings)
- compression_info — Token counts, rates, content type, density info
- direct_token / compact_token — Streaming tokens (parallel)
- direct_done / compact_done — Final stats (tokens, latency)
- dict_stats — Alias hit rate and usage in output
- quality — Cosine similarity, info retention, structural preservation
- output_guard — Warning if compact output exceeds direct
- stream_end — Sentinel event

/api/compress
Standalone compression without an LLM call. Returns compressed text and metrics. Supports batch mode for up to 50 documents.
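A client consuming the /api/compare stream has to split the SSE wire format into event/data pairs before it can dispatch on the event names listed above. A minimal parser sketch, assuming the standard SSE framing of `event:` and `data:` lines separated by blank lines:

```python
import json

def parse_sse(raw):
    """Yield (event, data) pairs from a raw SSE text buffer."""
    for frame in raw.strip().split("\n\n"):
        event, data = None, []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        if event:
            yield event, json.loads("\n".join(data)) if data else None

sample = (
    "event: compression_info\n"
    'data: {"orig_tokens": 120, "compact_tokens": 60}\n'
    "\n"
    "event: stream_end\n"
    "data: {}\n"
)
for name, payload in parse_sse(sample):
    print(name, payload)
```

The field names in `sample` are illustrative; the actual payload schema of each event is whatever the server emits for `compression_info` and friends.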
/api/config
Returns available models, presets, strategies, and defaults.

/api/health
Health check. Returns worker info, thread pool size, model status, and PID.
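For /api/compress batch mode, a client should enforce the 50-document cap before sending. The request-builder sketch below is a client-side assumption: the `documents` field name is modeled on the /api/compare payload above and is not confirmed by the docs; only the 50-document limit and the `compression_rate` field come from this page.

```python
import json

MAX_BATCH = 50  # documented batch-mode limit for /api/compress

def build_compress_request(documents, compression_rate=0.5):
    """Serialize a hypothetical /api/compress batch payload, validating limits."""
    if not 0 < compression_rate <= 1:
        raise ValueError("compression_rate must be in (0, 1]")
    if len(documents) > MAX_BATCH:
        raise ValueError(f"batch mode supports at most {MAX_BATCH} documents")
    return json.dumps({"documents": documents, "compression_rate": compression_rate})
```

POSTing the resulting JSON to `/api/compress` (with whatever auth the deployment requires) would return compressed text and metrics per document.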
Quick Start
git clone <repo-url>
cd opencompactor
pip install -r requirements.txt

cp config.example.yaml config.yaml   # Edit config.yaml with your settings
export OPENCOMPACTOR_API_KEY=your-openrouter-key

python demo_server.py                # Server starts on http://localhost:3000

docker compose up -d                 # Backend on :3000, frontend on :3001
How Dictionary Compression Works
The dictionary engine extracts repeating substrings from the input and assigns compact aliases. This is especially effective for structured content like JSON, API responses, and code. The dictionary is applied before the distilled model prunes tokens, maximizing the combined compression ratio.
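The extraction step can be sketched as a scoring pass over repeated strings: candidates that recur often enough to pay for their alias get one. This toy version only considers whitespace-free identifier-like words above a length and frequency threshold; a real engine would score arbitrary substrings by `(length - alias_len) * count`. The function names and thresholds are illustrative.

```python
import re
from collections import Counter

def build_dictionary(text, min_len=8, min_count=3):
    """Map long, frequently repeated identifiers to compact §NN aliases."""
    words = re.findall(r"[A-Za-z_]{%d,}" % min_len, text)
    repeats = [w for w, c in Counter(words).most_common() if c >= min_count]
    return {w: f"\u00a7{i:02d}" for i, w in enumerate(repeats, start=1)}

def apply_dictionary(text, mapping):
    for word, alias in mapping.items():
        text = text.replace(word, alias)
    return text

src = ('"transaction_status_code": "declined"\n'
       '"transaction_status_code": "unauthorized"\n'
       '"transaction_status_code": "throttled"\n')
d = build_dictionary(src)
print(d)                         # → {'transaction_status_code': '§01'}
print(apply_dictionary(src, d))
```

Running the dictionary pass first shrinks the text the distilled model sees, so the token pruning stage operates on an already-denser input.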
Input: "transaction_status_code": "declined"
"transaction_status_code": "unauthorized"
"transaction_status_code": "throttled"
Dictionary:
§01 = transaction_status_code
Output: "§01": "declined"
"§01": "unauthorized"
"§01": "throttled"
Savings: 3 × (23 − 3) = 60 characters saved

The LLM is also instructed to use @XX aliases in its response:

@01 = "function"
@02 = "return"
@03 = "undefined"

LLM output: "The @01 should @02 the value instead of @03"
Expanded: "The function should return the value instead of undefined"

Output aliases can save 2–10× more than input compression, since output tokens are typically more expensive.
The dictionary is injected into the system prompt so the LLM understands both §XX and @XX aliases. The post-processor restores all aliases in real-time during streaming, handling partial aliases split across stream chunks.
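The streaming restoration described above can be sketched as a small stateful expander: it replaces complete @NN aliases in each chunk and buffers a trailing partial alias (for example, a chunk ending in "@0") until the next chunk completes it. Class and method names here are illustrative, not the project's actual API.

```python
import re

ALIAS = re.compile(r"@\d{2}")
PARTIAL = re.compile(r"@\d?$")  # a possible alias cut off at the chunk boundary

class AliasRestorer:
    def __init__(self, mapping):
        self.mapping = mapping   # e.g. {"@01": "function"}
        self.tail = ""           # held-back partial alias from the previous chunk

    def feed(self, chunk):
        buf = self.tail + chunk
        m = PARTIAL.search(buf)
        if m:  # hold back the trailing "@", or "@" + one digit, for the next chunk
            buf, self.tail = buf[:m.start()], buf[m.start():]
        else:
            self.tail = ""
        return ALIAS.sub(lambda m: self.mapping.get(m.group(), m.group()), buf)

    def flush(self):
        """Emit any leftover text (a trailing "@" that never became an alias)."""
        out, self.tail = self.tail, ""
        return out

r = AliasRestorer({"@01": "function", "@02": "return", "@03": "undefined"})
chunks = ["The @01 should @0", "2 the value instead of @03"]
print("".join(r.feed(c) for c in chunks) + r.flush())
# → The function should return the value instead of undefined
```

Note the second chunk begins mid-alias ("2" completing "@02"); buffering the partial "@0" is what keeps the expansion correct across arbitrary stream splits.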