One proxy between your code and the LLM API. Optimize cache hits, compress context, route smart, track every dollar.
TokenPak sits between your code and the LLM API. It packs your tokens before they leave your machine: optimizing cache hits, compressing context, and routing requests. Then it tracks every dollar so you know exactly what you're spending.
Everything runs on your machine. No cloud. No accounts. No data leaves except the (optimized) API call.
LLM providers discount repeated prompt prefixes. But most SDKs serialize tool schemas and system prompts with non-deterministic key ordering: different bytes every request, a cache miss every time.
TokenPak's Tool Schema Registry normalizes everything into identical bytes. Same tools, same bytes, cache hit. On an agent with 20+ tools sending 10-20KB of schemas per request, this is the difference between paying full price and paying 10%.
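Why key ordering matters can be shown in a few lines of Python. This is a sketch of the idea, not TokenPak's actual registry code; the `canonical` function name is illustrative:

```python
import json

# Sorted keys plus fixed separators mean the same schema always
# becomes the same bytes, so the provider's prefix cache can hit.
def canonical(schema: dict) -> bytes:
    return json.dumps(schema, sort_keys=True, separators=(",", ":")).encode()

a = {"name": "search", "parameters": {"q": {"type": "string"}}}
b = {"parameters": {"q": {"type": "string"}}, "name": "search"}  # same schema, reordered

assert canonical(a) == canonical(b)    # identical bytes -> cache hit
assert json.dumps(a) != json.dumps(b)  # naive dumps keeps insertion order -> cache miss
```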
Not every request needs your most expensive model. TokenPak routes by pattern, token count, or intent, sending simple tasks to cheaper models while keeping complex work on the heavy hitters. Automatic fallback chains mean nothing breaks.
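The shape of a token-count routing decision can be sketched like this (illustrative only: TokenPak's real router is configured via `tokenpak route add`, and the threshold, heuristic, and cheap-model name here are made up):

```python
# Toy router: short prompts go to a cheap model, long ones to the heavy hitter.
def pick_model(prompt: str, threshold_tokens: int = 2000) -> str:
    approx_tokens = len(prompt) // 4  # rough 4-chars-per-token heuristic
    if approx_tokens < threshold_tokens:
        return "claude-haiku"        # hypothetical cheap-model target
    return "claude-sonnet-4-6"       # complex work stays on the big model

assert pick_model("Summarize this sentence.") == "claude-haiku"
assert pick_model("x" * 20000) == "claude-sonnet-4-6"
```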
A multi-stage pipeline strips redundancy from your context:
Content-aware: code gets AST-level compression (tree-sitter), docs get section-aware trimming, JSON/YAML gets schema extraction, logs get pattern dedup. 20-60% reduction depending on content type.
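TokenPak's pipeline is more sophisticated, but the log-dedup idea can be sketched in a few lines (function name and timestamp regex are illustrative): normalize away the parts that vary, then collapse consecutive repeats into one line with a count.

```python
import re

TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[T ][\d:.]+\s*")

def dedup_logs(log: str) -> str:
    """Collapse consecutive duplicate lines (timestamps normalized away),
    keeping one copy plus a repeat count -- a toy version of pattern dedup."""
    out, prev, count = [], None, 0
    for line in log.splitlines():
        norm = TIMESTAMP.sub("", line)
        if norm == prev:
            count += 1
        else:
            if count > 1:
                out[-1] += f"  (x{count})"
            out.append(norm)
            prev, count = norm, 1
    if count > 1:
        out[-1] += f"  (x{count})"
    return "\n".join(out)

log = """2025-01-01T10:00:01 retrying connection
2025-01-01T10:00:02 retrying connection
2025-01-01T10:00:03 retrying connection
2025-01-01T10:00:04 connected"""

# Collapses the three retry lines into one line with a repeat count.
compressed = dedup_logs(log)
```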
pip install tokenpak
tokenpak serve --port 8766
Change one line in your code:
import anthropic
client = anthropic.Anthropic(
base_url="http://localhost:8766",  # that's it
api_key="sk-ant-..." # passes straight through
)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# Cache optimized. Compressed. Cost tracked. Automatically.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8766", api_key="sk-...")
export ANTHROPIC_BASE_URL=http://localhost:8766
# done. every request now goes through TokenPak.
git clone https://github.com/tokenpak/tokenpak.git && cd tokenpak
cp .env.example .env # add your API key
docker compose -f docker/docker-compose.yml up -d
tokenpak savings
Shows cost breakdown by model, cache hit rates, and what TokenPak saved you, updated in real time from your actual usage.
Web dashboard at http://localhost:8766/dashboard:
All local. All your data. Nothing phones home.
| Feature | What | Why |
|---|---|---|
| Cache Optimization | Deterministic tool schema serialization | 86% of savings in production |
| Smart Routing | Route by model, pattern, intent, token count | Right model for the job, automatic failover |
| Compression | Content-aware pipeline (code, docs, data, logs) | 20-60% fewer tokens per request |
| Cost Tracking | Per-request, per-model, per-session pricing | Know exactly what you spend |
| Dashboard | 10-page web UI with FinOps/Engineering/Audit views | See everything, export anything |
| Vault Indexing | Semantic search over your codebase | Zero-token search; never calls an LLM |
| A/B Testing | Compare strategies with statistical significance | Data-driven optimization |
| Shadow Mode | Validate compression without affecting production | Safe to try, safe to ship |
| Budget Enforcement | Limits + alerts per session, model, or agent | Never blow your budget |
| DLP Scanning | Detect and redact sensitive data | PII stays on your machine |
| Data Connectors | Local, Git, Obsidian, GitHub, Google Drive, Notion | Index any knowledge source |
| +2ms Latency | Sub-millisecond compression, minimal proxy overhead | You won't notice it |
| Platform | How |
|---|---|
| Anthropic SDK | base_url="http://localhost:8766" |
| OpenAI SDK | base_url="http://localhost:8766" |
| Google AI (Gemini) | Proxy adapter |
| Claude Code | export ANTHROPIC_BASE_URL=http://localhost:8766 |
| OpenClaw | Set provider base_url to proxy |
| Cursor | Custom API endpoint in settings |
| LiteLLM | Drop-in middleware or proxy |
| LangChain | from tokenpak.adapters.langchain import LangChainAdapter |
| Ollama | Compression + routing (no cost tracking for local) |
| curl / httpx / requests | Standard REST API |
# Run
tokenpak serve --port 8766 # start local proxy
tokenpak status # health check
tokenpak doctor # diagnose issues
# Monitor
tokenpak cost --week # cost report by model
tokenpak savings # what you've saved
# Compress
tokenpak compress <file> # dry-run compression
tokenpak diff <file> # before/after comparison
tokenpak demo # see pipeline on sample data
# Search
tokenpak index <path> # index a directory
tokenpak vault search "query" # semantic search (zero tokens)
# Route
tokenpak route add --model 'gpt-4*' --target anthropic/claude-sonnet-4
tokenpak route list
# Debug
tokenpak trace --id <id> # inspect pipeline run
tokenpak replay <id> # replay past request
30+ commands. Full reference: docs/cli-reference.md
+--------------+     +--------------------------------------------+     +--------------+
|  Your Code   |---->|  TokenPak Proxy (:8766)                    |---->|   LLM API    |
|  (any SDK)   |<----|  Runs on YOUR machine                      |<----|              |
+--------------+     |                                            |     +--------------+
                     |  Cache Opt.    Routing       Telemetry     |
                     |  Compression   Budget        Dashboard     |
                     |  A/B Testing   Shadow        DLP Scan      |
                     |  Schema Reg.   Circuit Brk   Conn Pool     |
                     +--------------------------------------------+
                                          |
                     +--------------------------------------------+
                     |  Local Storage (never leaves)              |
                     |  SQLite telemetry · Vault index · Cache    |
                     +--------------------------------------------+
{
"proxy": {
"port": 8766,
"passthrough_url": "https://api.anthropic.com"
},
"compression": {
"enabled": true,
"level": "balanced"
},
"budget": {
"monthly_usd": null,
"alert_at_pct": 80
}
}
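To make the budget fields concrete, the arithmetic is straightforward (the dollar value below is hypothetical; `monthly_usd` defaults to `null`, meaning no cap):

```python
# Hypothetical budget -- the default config above sets monthly_usd to null (no cap).
monthly_usd = 50.0
alert_at_pct = 80

alert_threshold = monthly_usd * alert_at_pct / 100
assert alert_threshold == 40.0  # alert fires once tracked spend reaches $40
```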
Pre-built configs: anthropic-only · openai-only · cost-saving-max · local-ollama · privacy-first · mixed-routing · single-user · team-internal
| Platform | Guide |
|---|---|
| pip | pip install tokenpak && tokenpak serve |
| Docker Compose | docker/ |
| Kubernetes | deployments/k8s/ |
| AWS ECS | deployments/aws-ecs/ |
| GCP Cloud Run | deployments/gcp-cloud-run/ |
| systemd | tokenpak/agent/systemd/ |
pip install -e ".[dev]"
pytest tests/ -q # 316 tests, ~16s
Installation · Configuration · CLI Reference · Architecture · API Reference · Error Codes · Troubleshooting · Security
TokenPak implements the TokenPak Protocol v1.0: Block Schema · Compiled Artifact · Evidence Pack
See CONTRIBUTING.md.
git clone https://github.com/tokenpak/tokenpak && cd tokenpak
pip install -e ".[dev]" && pytest tests/ -q