
📦 TokenPak

One proxy between your code and the LLM API. Optimize cache hits, compress context, route smart, track every dollar.

Python 3.10+ · License: MIT · Tests

TokenPak sits between your code and the LLM API. It packs your tokens before they leave your machine — optimizing cache hits, compressing context, and routing requests — then tracks every dollar so you know exactly what you're spending.

Everything runs on your machine. No cloud. No accounts. No data leaves except the (optimized) API call.


Where The Savings Come From

🧊 Cache Optimization

LLM providers discount repeated prompt prefixes. But most SDKs serialize tool schemas and system prompts with non-deterministic key ordering — different bytes every request, cache miss every time.

TokenPak's Tool Schema Registry normalizes everything into identical bytes. Same tools, same bytes, cache hit. On an agent with 20+ tools sending 10-20KB of schemas per request, this is the difference between paying full price and paying 10%.
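The normalization trick can be sketched in a few lines (illustrative; TokenPak's actual registry is more involved): serialize with sorted keys and fixed separators so logically identical schemas always become identical bytes.

```python
import json

def canonical_bytes(tool_schemas: list[dict]) -> bytes:
    """Serialize tool schemas deterministically: sorted keys, fixed separators.

    Providers cache by prompt-prefix bytes, so byte-identical schemas on
    every request are what turns cache misses into cache hits.
    """
    return json.dumps(tool_schemas, sort_keys=True, separators=(",", ":")).encode()

# The same tool described with different key order...
a = [{"name": "search", "input_schema": {"type": "object"}}]
b = [{"input_schema": {"type": "object"}, "name": "search"}]

# ...serializes to different bytes naively, but to identical canonical bytes.
assert json.dumps(a).encode() != json.dumps(b).encode()
assert canonical_bytes(a) == canonical_bytes(b)
```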

🔀 Smart Routing

Not every request needs your most expensive model. TokenPak routes by pattern, token count, or intent — sending simple tasks to cheaper models while keeping complex work on the heavy hitters. Automatic fallback chains mean a failing provider degrades to the next model in line instead of breaking your app.
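A toy version of pattern-based routing, using only the standard library (the rule table and target names here are illustrative, not TokenPak's real configuration):

```python
import fnmatch

# First matching pattern wins; "passthrough" leaves the model untouched.
ROUTES = [
    ("gpt-4*", "anthropic/claude-sonnet-4"),  # redirect one family to a cheaper target
    ("*", "passthrough"),                     # default: send the request as-is
]

def route(model: str) -> str:
    """Return the model a request should actually be sent to."""
    for pattern, target in ROUTES:
        if fnmatch.fnmatch(model, pattern):
            return model if target == "passthrough" else target
    return model

assert route("gpt-4o") == "anthropic/claude-sonnet-4"
assert route("claude-haiku") == "claude-haiku"
```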

📦 Token Compression

Multi-stage pipeline strips redundancy from your context:

  1. Segment — split into semantic blocks
  2. Fingerprint — detect type (code, docs, config, logs)
  3. Compress — apply type-aware recipes
  4. Budget — allocate tokens by priority
  5. Assemble — rebuild with fewer tokens

Content-aware: code gets AST-level compression (tree-sitter), docs get section-aware trimming, JSON/YAML gets schema extraction, logs get pattern dedup. 20-60% reduction depending on content type.
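The five stages can be sketched as a toy pipeline (illustrative only: character counts stand in for tokens, and exact-duplicate removal stands in for the type-aware recipes):

```python
def compress_context(text: str, budget: int) -> str:
    """Toy five-stage pipeline: segment, fingerprint, compress, budget, assemble."""
    # 1. Segment: split into blocks on blank lines.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]

    # 2. Fingerprint: crude content-type detection.
    def kind(block: str) -> str:
        return "code" if block.startswith(("def ", "class ", "{")) else "prose"

    # 3. Compress: drop exact-duplicate blocks.
    seen, kept = set(), []
    for block in blocks:
        if block not in seen:
            seen.add(block)
            kept.append((kind(block), block))

    # 4. Budget: spend the budget on code blocks first, then prose.
    kept.sort(key=lambda kb: kb[0] != "code")  # stable sort: code before prose
    out, used = [], 0
    for _, block in kept:
        if used + len(block) <= budget:
            out.append(block)
            used += len(block)

    # 5. Assemble: rebuild the trimmed context.
    return "\n\n".join(out)

ctx = "Intro paragraph.\n\ndef f():\n    return 1\n\nIntro paragraph."
assert compress_context(ctx, 100) == "def f():\n    return 1\n\nIntro paragraph."
```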


Quick Start

pip install tokenpak
tokenpak serve --port 8766

Change one line in your code:

Anthropic (Claude)

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8766",  # 📦 that's it
    api_key="sk-ant-..."              # passes straight through
)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# → Cache optimized. Compressed. Cost tracked. Automatically.

OpenAI

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8766", api_key="sk-...")

Claude Code / OpenClaw

export ANTHROPIC_BASE_URL=http://localhost:8766
# done. every request now goes through TokenPak.

Docker

git clone https://github.com/tokenpak/tokenpak.git && cd tokenpak
cp .env.example .env        # add your API key
docker compose -f docker/docker-compose.yml up -d

See What You're Spending

tokenpak savings

Shows cost breakdown by model, cache hit rates, and what TokenPak saved you — updated in real time from your actual usage.
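The accounting behind the report is straightforward token math. A sketch with hypothetical per-million-token prices (illustrative numbers, not a provider price sheet); the steep `cached_in` discount is what makes cache optimization pay off:

```python
# Hypothetical prices in USD per million tokens; "cached_in" reflects the
# roughly 10x discount providers give for cached prompt-prefix reads.
PRICE = {"claude-sonnet": {"in": 3.00, "cached_in": 0.30, "out": 15.00}}

def request_cost(model: str, in_tok: int, cached_tok: int, out_tok: int) -> float:
    """Cost of one request, splitting input tokens into fresh vs cache-read."""
    p = PRICE[model]
    fresh = in_tok - cached_tok
    return (fresh * p["in"] + cached_tok * p["cached_in"] + out_tok * p["out"]) / 1e6

# A 10k-token prompt with an 8k cached prefix vs the same prompt cold:
cached = request_cost("claude-sonnet", 10_000, 8_000, 500)
cold = request_cost("claude-sonnet", 10_000, 0, 500)
assert cached < cold  # the cached request costs well under half as much
```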

Web dashboard at http://localhost:8766/dashboard.

All local. All your data. Nothing phones home.


Everything It Does

| Feature | What | Why |
|---------|------|-----|
| 🧊 Cache Optimization | Deterministic tool schema serialization | 86% of savings in production |
| 🔀 Smart Routing | Route by model, pattern, intent, token count | Right model for the job, automatic failover |
| 📦 Compression | Content-aware pipeline (code, docs, data, logs) | 20-60% fewer tokens per request |
| 💰 Cost Tracking | Per-request, per-model, per-session pricing | Know exactly what you spend |
| 📊 Dashboard | 10-page web UI with FinOps/Engineering/Audit views | See everything, export anything |
| 🔍 Vault Indexing | Semantic search over your codebase | Zero-token search — never calls an LLM |
| 🧪 A/B Testing | Compare strategies with statistical significance | Data-driven optimization |
| 👻 Shadow Mode | Validate compression without affecting production | Safe to try, safe to ship |
| 🚨 Budget Enforcement | Limits + alerts per session, model, or agent | Never blow your budget |
| 🛡️ DLP Scanning | Detect and redact sensitive data | PII stays on your machine |
| 🔌 Data Connectors | Local, Git, Obsidian, GitHub, Google Drive, Notion | Index any knowledge source |
| ⚡ +2ms Latency | Sub-millisecond compression, minimal proxy overhead | You won't notice it |

Compatibility

| Platform | How |
|----------|-----|
| Anthropic SDK | `base_url="http://localhost:8766"` |
| OpenAI SDK | `base_url="http://localhost:8766"` |
| Google AI (Gemini) | Proxy adapter |
| Claude Code | `export ANTHROPIC_BASE_URL=http://localhost:8766` |
| OpenClaw | Set provider `base_url` to the proxy |
| Cursor | Custom API endpoint in settings |
| LiteLLM | Drop-in middleware or proxy |
| LangChain | `from tokenpak.adapters.langchain import LangChainAdapter` |
| Ollama | Compression + routing (no cost tracking for local) |
| curl / httpx / requests | Standard REST API |

CLI

# Run
tokenpak serve --port 8766          # start local proxy
tokenpak status                     # health check
tokenpak doctor                     # diagnose issues

# Monitor
tokenpak cost --week                # cost report by model
tokenpak savings                    # what you've saved

# Compress
tokenpak compress <file>            # dry-run compression
tokenpak diff <file>                # before/after comparison
tokenpak demo                       # see pipeline on sample data

# Search
tokenpak index <path>               # index a directory
tokenpak vault search "query"       # semantic search (zero tokens)

# Route
tokenpak route add --model 'gpt-4*' --target anthropic/claude-sonnet-4
tokenpak route list

# Debug
tokenpak trace --id <id>            # inspect pipeline run
tokenpak replay <id>                # replay past request

30+ commands. Full reference: docs/cli-reference.md


Architecture

┌──────────────┐     ┌──────────────────────────────────────────┐     ┌──────────────┐
│  Your Code   │────▶│  📦 TokenPak Proxy (:8766)              │────▶│  LLM API     │
│  (any SDK)   │◀────│  Runs on YOUR machine                    │◀────│              │
└──────────────┘     │                                          │     └──────────────┘
                     │  Cache Opt.    Routing      Telemetry    │
                     │  Compression   Budget       Dashboard    │
                     │  A/B Testing   Shadow       DLP Scan     │
                     │  Schema Reg.   Circuit Brk  Conn Pool    │
                     └──────────────────────────────────────────┘
                                         │
                     ┌──────────────────────────────────────────┐
                     │  📁 Local Storage (never leaves)        │
                     │  SQLite telemetry · Vault index · Cache  │
                     └──────────────────────────────────────────┘

Configuration

{
  "proxy": {
    "port": 8766,
    "passthrough_url": "https://api.anthropic.com"
  },
  "compression": {
    "enabled": true,
    "level": "balanced"
  },
  "budget": {
    "monthly_usd": null,
    "alert_at_pct": 80
  }
}

Pre-built configs: anthropic-only · openai-only · cost-saving-max · local-ollama · privacy-first · mixed-routing · single-user · team-internal


Deployment

| Platform | Guide |
|----------|-------|
| pip | `pip install tokenpak && tokenpak serve` |
| Docker Compose | `docker/` |
| Kubernetes | `deployments/k8s/` |
| AWS ECS | `deployments/aws-ecs/` |
| GCP Cloud Run | `deployments/gcp-cloud-run/` |
| systemd | `tokenpak/agent/systemd/` |

Testing

pip install -e ".[dev]"
pytest tests/ -q           # 316 tests, ~16s

Docs

Installation · Configuration · CLI Reference · Architecture · API Reference · Error Codes · Troubleshooting · Security


Protocol

TokenPak implements the TokenPak Protocol v1.0: Block Schema · Compiled Artifact · Evidence Pack


Contributing

See CONTRIBUTING.md.

git clone https://github.com/tokenpak/tokenpak && cd tokenpak
pip install -e ".[dev]" && pytest tests/ -q

License

MIT