Cut your LLM costs in real time

Compress prompts by 53.6%, cut latency by 62%

Dual-layer adaptive compression that optimizes both input and output tokens: dictionary aliases, semantic pruning, and an output guard.

53.6% tokens saved (input compression)
62% latency reduction (end-to-end)
0.80+ quality score (cosine similarity)

5-Stage Compression Pipeline

Each stage targets a different source of redundancy:

1. Content Classifier: auto-detects the content type (json_api, code, prose, chat, mixed) and routes it to a specialized strategy; a classifier sketch follows this list.
2. Dictionary Compression: extracts repeating substrings and assigns them §XX / @XX aliases, applied bidirectionally to input and output; a sketch of the alias scheme also follows the list.
3. Agent-Aware Distillation: a token-level keep/drop classifier trained on 105K agent samples, 4–12× faster than LLMLingua-2.
4. Output Shaping: concise-instruction injection, dynamic max_tokens, and alias restoration with real-time streaming decompression (sketched after the comparison table below).
5. Adaptive Control: closed-loop rate adjustment based on content density, applying a gentler rate to dense content (sketched under Adaptive Rate Control below).
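
The routing in stage 1 can be done with cheap lexical heuristics. A minimal sketch in Python; the signals, thresholds, and function name are illustrative assumptions, not OpenCompact's actual rules:

```python
import json
import re

def classify_content(text: str) -> str:
    """Return one of: json_api, code, prose, chat, mixed (hypothetical heuristics)."""
    stripped = text.strip()
    # Valid JSON payloads route straight to the json_api strategy.
    try:
        json.loads(stripped)
        return "json_api"
    except (ValueError, TypeError):
        pass
    lines = stripped.splitlines() or [""]
    # Code-like signals: braces, semicolons, def/class/import/return keywords.
    code_hits = sum(bool(re.search(r"[{};]|^\s*(def|class|import|return)\b", l))
                    for l in lines)
    # Chat-like signals: role prefixes such as "user:" / "assistant:".
    chat_hits = sum(bool(re.match(r"\s*(user|assistant|system)\s*:", l, re.I))
                    for l in lines)
    code_ratio, chat_ratio = code_hits / len(lines), chat_hits / len(lines)
    if code_ratio > 0.4:
        return "code"
    if chat_ratio > 0.2:
        return "chat"
    if code_ratio > 0.1:
        return "mixed"
    return "prose"
```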
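
Stage 2's alias scheme, sketched under similar caveats: this greedy version mines repeated word n-grams, assigns a §XX alias only when the replacement saves tokens net of the legend entry, and can restore the original text. The mining strategy, legend format, and names are assumptions for illustration:

```python
from collections import Counter

def mine_phrases(text: str, min_words: int = 3, max_words: int = 8,
                 min_count: int = 3) -> list[str]:
    words = text.split()
    counts = Counter()
    for n in range(max_words, min_words - 1, -1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # Most frequent phrases first, kept only if replacement saves net tokens:
    # c occurrences of an L-word phrase cost c*L words before, but only
    # c aliases plus one L-word legend entry after.
    return [p for p, c in counts.most_common()
            if c >= min_count and c * len(p.split()) > c + len(p.split())]

def compress(text: str) -> tuple[str, dict]:
    aliases = {}
    for idx, phrase in enumerate(mine_phrases(text)[:100]):
        alias = f"§{idx:02d}"
        if phrase in text:  # earlier replacements may have consumed it
            text = text.replace(phrase, alias)
            aliases[alias] = phrase
    legend = "\n".join(f"{a}={p}" for a, p in aliases.items())
    return f"[DICT]\n{legend}\n[/DICT]\n{text}", aliases

def restore(text: str, aliases: dict) -> str:
    for alias, phrase in aliases.items():
        text = text.replace(alias, phrase)
    return text
```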

What Makes It Different

Novel contributions beyond existing compression methods

Dual-Layer Compression

Combines structural dictionary compression with semantic token pruning for deeper reduction than either method alone.

Output Token Optimization

First system to compress output tokens via dictionary aliases. Output guard ensures expansions don't exceed direct output length.
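
A hedged sketch of the output-guard check, assuming a simple whitespace token count and a fallback flag; the real guard's accounting and fallback policy are not described beyond the sentence above:

```python
def output_guard(raw_output: str, aliases: dict,
                 count_tokens=lambda s: len(s.split())):
    """Expand aliases, but flag the case where aliasing saved nothing."""
    expanded = raw_output
    for alias, phrase in aliases.items():
        expanded = expanded.replace(alias, phrase)
    # Tokens the model actually generated (aliases included) ...
    generated = count_tokens(raw_output)
    # ... versus what a direct, alias-free generation of the same text costs.
    direct = count_tokens(expanded)
    if generated >= direct:
        # Aliasing did not shorten this response: disable output-side
        # aliases for the next call rather than paying legend overhead.
        return expanded, {"use_output_aliases": False}
    return expanded, {"use_output_aliases": True,
                      "tokens_saved": direct - generated}
```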

Multilingual Support

Native Chinese, English, and mixed-language handling with CJK-aware tokenization and language-specific rate adjustment.
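
CJK-aware tokenization typically means treating each ideograph as its own unit while keeping Latin words whole. A minimal sketch under that assumption (the regex and function are illustrative, not OpenCompact's tokenizer):

```python
import re

# One CJK ideograph, or one run of non-space, non-CJK characters.
_TOKEN = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]|[^\s\u4e00-\u9fff\u3400-\u4dbf]+")

def tokenize(text: str) -> list[str]:
    return _TOKEN.findall(text)

print(tokenize("压缩 your LLM 提示词 in real time"))
# ['压', '缩', 'your', 'LLM', '提', '示', '词', 'in', 'real', 'time']
```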

Adaptive Rate Control

Content density detection automatically adjusts compression rate. Dense structured content gets gentler compression to preserve information.
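
A sketch of how density-driven rate adjustment can feed the keep/drop pruner from stage 3. The density heuristic, thresholds, and scores are assumptions; in the real pipeline the keep probabilities would come from the distilled agent-aware classifier:

```python
import re

def content_density(tokens: list[str]) -> float:
    # Approximate density as the share of structural / information-bearing
    # tokens: digits, brackets, identifiers, aliases.
    dense = sum(bool(re.search(r"[\d{}\[\]():=/_.@§-]", t)) for t in tokens)
    return dense / max(len(tokens), 1)

def adaptive_rate(tokens: list[str], base_rate: float = 0.5) -> float:
    # Interpolate from the base keep-rate toward a gentle 0.8 as
    # density approaches 1, so dense content keeps more tokens.
    d = content_density(tokens)
    return base_rate + (0.8 - base_rate) * d

def prune(tokens: list[str], keep_probs: list[float]) -> list[str]:
    rate = adaptive_rate(tokens)
    k = max(1, round(len(tokens) * rate))
    # Keep the k highest-scoring tokens, preserving original order.
    keep = set(sorted(range(len(tokens)),
                      key=lambda i: keep_probs[i], reverse=True)[:k])
    return [t for i, t in enumerate(tokens) if i in keep]
```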

Compared to Existing Methods

| Feature | OpenCompact |
| --- | --- |
| Input token compression | ✓ |
| Output token optimization | ✓ |
| Dictionary alias compression | ✓ |
| Agent-aware distilled pruning | ✓ |
| Content-type routing | ✓ |
| Adaptive rate control | ✓ |
| Multilingual (CJK + Latin) | ✓ |
| Streaming decompression | ✓ |
| Quality evaluation | ✓ |
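
Streaming decompression is the subtle row in that table: an alias can be split across stream chunks, so the decoder must hold back an ambiguous tail until it is complete. A minimal sketch, assuming the §XX alias format from stage 2:

```python
import re

_ALIAS = re.compile(r"§\d{2}")

def stream_restore(chunks, aliases: dict):
    """Yield restored text as chunks arrive, expanding §XX aliases on the fly."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        buf = _ALIAS.sub(lambda m: aliases.get(m.group(0), m.group(0)), buf)
        # A trailing "§" or "§d" might be the start of an alias that
        # continues in the next chunk; keep it buffered.
        m = re.search(r"§\d?$", buf)
        hold = m.start() if m else len(buf)
        yield buf[:hold]
        buf = buf[hold:]
    if buf:
        yield buf  # flush whatever is left at end of stream

# Example: the alias "§07" arrives split across two chunks.
restored = "".join(stream_restore(["Hello §0", "7 world"],
                                  {"§07": "compressed output"}))
print(restored)  # Hello compressed output world
```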

Try it yourself

Paste your prompt, pick a compression rate, and see the A/B comparison in real time.