
TokenPak Architecture v2 — Universal Content Compiler

Vision

One command turns any directory into LLM-ready context.

tokenpak index ~/company-data --budget 8000

TokenPak is not a text compressor. It’s a universal content compiler that processes any file type — documents, code, images, audio, video, datasets — into a versioned, compressed, budget-aware knowledge index that any LLM can consume.


The Problem (Enterprise)

Companies have:

Current solutions are point tools:

TokenPak: One pipeline that orchestrates all of them.


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    INPUT: Any Directory                       │
│  /contracts  /recordings  /code  /images  /data  /docs       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              Layer 1: PROCESSORS (Multimodal)                │
│                                                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │   Text   │ │  Images  │ │  Audio   │ │  Video   │       │
│  │ Markdown │ │ Describe │ │ Whisper  │ │Keyframes │       │
│  │ PDF text │ │ Embed    │ │Transcribe│ │+Transcript│       │
│  │ Plaintext│ │ Dedupe   │ │ Diarize  │ │ Chapters │       │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘       │
│       │             │            │             │              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │   Code   │ │   PDFs   │ │ JSON/CSV │ │  Archives│       │
│  │Tree-sitter│ │OCR+Digital│ │ Schema  │ │ Recursive│       │
│  │Signatures│ │Multi-engine│ │+Sampling│ │ Extract  │       │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘       │
│       └──────┬─────┴──────┬─────┴──────┬─────┘              │
│              │            │            │                     │
│              ▼            ▼            ▼                     │
│  ┌─────────────────────────────────────────────────┐        │
│  │         QUALITY SCORER (per output)              │        │
│  │  Score 0.0-1.0 | Route: auto/review/reject       │        │
│  └─────────────────────────┬───────────────────────┘        │
└────────────────────────────┼────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│              Layer 2: BLOCK REGISTRY (Novel IP)              │
│                                                              │
│  ┌──────────────────────────────────────────────────┐       │
│  │ Content-hash each processed output (SHA-256)      │       │
│  │ Version tracking: v1, v2, v3...                   │       │
│  │ Change detection: only reprocess modified files   │       │
│  │ Wire format: compact binary/text blocks (.tokpak) │       │
│  └──────────────────────────────────────────────────┘       │
│                                                              │
│  Registry: { "contracts/Q4-report.pdf": {                    │
│    hash: "a3f8c...", version: 3, type: "pdf-digital",        │
│    quality: 0.94, tokens: 2340, compressed: 580,             │
│    lastProcessed: "2026-02-20T17:00:00Z" }}                  │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              Layer 3: RETRIEVAL (QMD Integration)             │
│                                                              │
│  ┌──────────────────────────────────────────────────┐       │
│  │ QMD indexes all text blocks (BM25 + vectors)      │       │
│  │ Hybrid search: keyword + semantic + LLM reranking │       │
│  │ Query expansion for better recall                  │       │
│  │ Context tree (hierarchical metadata)               │       │
│  └──────────────────────────────────────────────────┘       │
│                                                              │
│  tokenpak search "Q4 revenue projections"                    │
│  → Returns top-k relevant blocks from ANY file type          │
│  → PDF page 47, meeting transcript 14:32, Slack thread       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              Layer 4: COMPRESSION ENGINE                      │
│                                                              │
│  ┌──────────────────────────────────────────────────┐       │
│  │ LLMLingua-2: token-level filtering (MIT)          │       │
│  │ SelectiveContext: sentence-level filtering         │       │
│  │ Code: signature extraction (keep API, drop bodies)│       │
│  │ JSON: schema + sampling (keys+types, not raw data)│       │
│  │ Force tokens: never remove critical keywords       │       │
│  └──────────────────────────────────────────────────┘       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              Layer 5: BUDGET ALLOCATOR (Novel IP)             │
│                                                              │
│  ┌──────────────────────────────────────────────────┐       │
│  │ QMD-weighted importance scoring per block          │       │
│  │ Quadratic allocation (high-importance → more room) │       │
│  │ Minimum floor per category (prevent starvation)    │       │
│  │ Dynamic redistribution (adapts to what's present)  │       │
│  └──────────────────────────────────────────────────┘       │
│                                                              │
│  Budget: 8000 tokens                                         │
│  Block A (Q4 report, importance=10): 3,400 tokens            │
│  Block B (meeting notes, importance=8): 2,200 tokens         │
│  Block C (code ref, importance=5): 900 tokens                │
│  Block D (old email, importance=2): 150 tokens               │
│  Block E (tool schemas, importance=3): 350 tokens            │
│  Overhead (wire format, refs): 1,000 tokens                  │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              Layer 6: OUTPUT (Wire Format)                    │
│                                                              │
│  ┌──────────────────────────────────────────────────┐       │
│  │ TOKPAK:1                                          │       │
│  │ BUDGET: {max_in:8000, used:7950}                  │       │
│  │ BLOCKS: [                                         │       │
│  │   {ref:"Q4-report#v3", quality:0.94, tokens:3400},│       │
│  │   {ref:"meeting-jan15#v1", quality:0.91, ...},    │       │
│  │ ]                                                 │       │
│  │ PROVENANCE: every fact traceable to source file    │       │
│  │ OUTPUT_CONTRACT: PATCH | PLAN | ACTIONS | QUESTIONS│       │
│  └──────────────────────────────────────────────────┘       │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
                      Any LLM Provider
              (Anthropic, OpenAI, Google, local)
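
The Layer 2 change-detection step (content-hash each output, reprocess only modified files, bump the version counter) can be sketched as follows; the in-memory `registry` dict is a stand-in for the on-disk registry file, and the helper names are illustrative, not TokenPak's actual API:

```python
import hashlib

def content_hash(path):
    """SHA-256 of a file's bytes, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reprocess(path, registry):
    """True if the file is new or its content changed since last run."""
    entry = registry.get(path)
    return entry is None or entry["hash"] != content_hash(path)

def record(path, registry, **meta):
    """Store the new hash and bump the version counter."""
    prev = registry.get(path, {"version": 0})
    registry[path] = {"hash": content_hash(path),
                      "version": prev["version"] + 1, **meta}
```

On `tokenpak index`, only paths where `needs_reprocess` returns true re-enter the Layer 1 processors; everything else is served from the registry.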

Component Stack

Open Source (Integrated, not built by us)

| Component | Purpose | License | Why this one |
|---|---|---|---|
| QMD | Text retrieval (BM25 + vectors + reranking) | MIT | Best local hybrid search, OpenClaw-native |
| LLMLingua-2 | Token-level compression | MIT | SOTA 2-20x compression, <5% quality loss |
| SelectiveContext | Sentence-level filtering | MIT | Preserves readability |
| Whisper | Audio transcription | MIT | Industry standard, local |
| Tesseract | OCR (primary) | Apache 2.0 | Most mature, widest language support |
| EasyOCR | OCR (fallback) | Apache 2.0 | Better on handwriting |
| Surya | OCR (fallback) | GPL-3.0 | Best on complex layouts |
| Tree-sitter | Code parsing | MIT | Fast, accurate, 100+ languages |
| PaddleOCR | OCR (CJK specialist) | Apache 2.0 | Best for Chinese/Japanese/Korean |

Novel IP (Built by us)

| Component | Purpose | Status |
|---|---|---|
| Block Registry | Version, hash, track all processed content | Spec complete |
| Wire Format | Compact output format with provenance | Spec complete |
| Budget Allocator | Quadratic importance-weighted token distribution | Designed |
| Quality Scorer | Per-output confidence scoring with routing | Designed |
| Review Queue | Human-in-the-loop for low-confidence outputs | Designed |
| Monitoring Proxy | Real-time token tracking + cost estimation | Working ✅ |
| Hybrid Calibration | Static benchmark + bounded dynamic worker adjustment | Working ✅ |
| Multi-engine Fallback | Try Tesseract → EasyOCR → Surya → flag for review | Designed |

Processing Pipeline (Per File Type)

Text (Markdown, plaintext, HTML)

File → detect encoding → extract text → QMD index → block registry

Code (Python, JS, TS, Go, Rust, etc.)

File → Tree-sitter parse → extract:
  - imports/dependencies
  - function/class signatures + docstrings
  - type definitions
  - drop function bodies
→ block registry
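
The pipeline above names Tree-sitter for multi-language parsing. As a minimal single-language sketch of the same keep-signatures/drop-bodies idea, Python's stdlib `ast` module works for Python sources (the `extract_signatures` helper is hypothetical, not part of TokenPak):

```python
import ast

def extract_signatures(source: str) -> str:
    """Keep imports, def/class signatures, and docstrings; drop bodies."""
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            out.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            sig = f"{kind} {node.name}({ast.unparse(node.args)}):"
            if node.returns:
                sig = sig[:-1] + f" -> {ast.unparse(node.returns)}:"
            out.append(sig)
            doc = ast.get_docstring(node)
            if doc:
                out.append(f'    """{doc}"""')
        elif isinstance(node, ast.ClassDef):
            out.append(f"class {node.name}:")
    return "\n".join(out)
```

A 200-line module typically collapses to a few dozen tokens of API surface this way, which is why code compresses so well relative to prose.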

PDFs (Digital)

File → detect type (digital/scanned) → extract text → preserve tables/structure → QMD index → block registry

PDFs (Scanned)

File → page images → OCR (multi-engine):
  Engine 1 (Tesseract) → score
  Engine 2 (EasyOCR) → score
  Best score wins (or merge)
→ quality check:
  >0.90: auto-accept
  0.70-0.90: accept with warning
  0.50-0.70: review queue
  <0.50: reject / try next engine
→ QMD index → block registry
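
The quality-routing thresholds above can be expressed as a small function (the names here are illustrative, not TokenPak's actual API):

```python
def route_ocr(score: float, engines_left: bool = True) -> str:
    """Map an OCR confidence score (0.0-1.0) to a pipeline action,
    mirroring the thresholds in the scanned-PDF flow."""
    if score > 0.90:
        return "auto-accept"
    if score >= 0.70:
        return "accept-with-warning"
    if score >= 0.50:
        return "review-queue"
    # Below 0.50: fall through the engine chain, then reject.
    return "try-next-engine" if engines_left else "reject"
```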

Images

File → resize to max 1024px → LLM describe (cache result) → dedupe check (perceptual hash) → block registry
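
The perceptual-hash dedupe check can be sketched in pure Python on a small grayscale matrix; a real pipeline would likely use a library such as `imagehash` on 8x8 downscaled images, but the core idea (1 bit per pixel relative to the mean, then compare by Hamming distance) is the same:

```python
def average_hash(pixels):
    """Perceptual 'average hash' of a small grayscale image
    (list of rows of 0-255 ints): 1 bit per pixel vs the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p >= mean)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_duplicate(a, b, threshold=5):
    """Near-duplicate if the hashes differ in at most `threshold` bits."""
    return hamming(average_hash(a), average_hash(b)) <= threshold
```

Because small brightness shifts rarely flip a pixel across the mean, re-encoded or lightly edited copies of the same image hash close together and skip the (expensive, cached) LLM description step.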

Audio

File → Whisper transcribe → speaker diarization → timestamp segments → QMD index → block registry

Video

File → extract keyframes (1/scene) → Whisper on audio track → combine:
  - Keyframe descriptions
  - Timestamped transcript
  - Chapter markers
→ QMD index → block registry

JSON/CSV

File → detect schema → extract:
  - Column names + types
  - First 5 rows (sampling)
  - Row count, null rates
  - For nested: keys at max depth 3
→ block registry
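
The schema + sampling step for CSV can be sketched with the stdlib `csv` module (`csv_schema` is an illustrative helper, and the type inference is deliberately crude):

```python
import csv
import io

def csv_schema(text, sample_rows=5):
    """Schema + sampling: column names, inferred types, sample rows,
    row count, and null rates - instead of the full table."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cols = list(rows[0].keys()) if rows else []

    def infer(col):
        vals = [r[col] for r in rows if r[col]]
        if vals and all(v.lstrip("-").replace(".", "", 1).isdigit() for v in vals):
            return "number"
        return "string" if vals else "unknown"

    return {
        "columns": {c: infer(c) for c in cols},
        "sample": rows[:sample_rows],
        "row_count": len(rows),
        "null_rate": {c: sum(1 for r in rows if not r[c]) / max(len(rows), 1)
                      for c in cols},
    }
```

A million-row CSV then costs roughly the same number of tokens as a ten-row one: the LLM sees the shape of the data, not the data itself.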

QMD Integration Architecture

TokenPak CLI
    │
    ├── tokenpak index <dir>
    │   │
    │   ├── Walk directory
    │   ├── Process each file (multimodal pipeline)
    │   ├── Store processed blocks in registry
    │   └── QMD indexes all text outputs
    │       ├── qmd collection add <processed-output-dir>
    │       ├── qmd embed (generate vectors)
    │       └── qmd update (on file changes)
    │
    ├── tokenpak search "query" --budget 8000
    │   │
    │   ├── QMD hybrid search (BM25 + vectors + reranking)
    │   ├── Returns ranked blocks with scores
    │   ├── Budget allocator distributes tokens
    │   ├── Compression engine packs each block
    │   └── Wire format output ready for LLM
    │
    └── tokenpak serve --port 8766
        │
        ├── Transparent proxy mode (current, working)
        ├── Intercepts LLM requests
        ├── Injects relevant context from index
        ├── Tracks tokens, cost, latency
        └── Syncs stats to dashboard

Budget Allocation: Quadratic Importance Weighting

Inspired by Quadratic Mean Diameter (QMD) from forestry — high-importance blocks get quadratically more space.

from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    importance: float

def allocate_budget(blocks, total_budget):
    """
    Allocate token budget using quadratic importance weighting.
    High-importance blocks get disproportionately more space.
    """
    # Minimum floor: every block gets at least 3% of budget.
    # Clamp so the floors alone can never exceed the budget.
    floor = min(0.03 * total_budget, total_budget / len(blocks))
    remaining = total_budget - floor * len(blocks)

    # Square importance scores
    squared = {b: b.importance ** 2 for b in blocks}
    total_sq = sum(squared.values()) or 1  # avoid division by zero

    # Distribute remaining budget proportionally to squared importance
    return {b: floor + (sq / total_sq) * remaining
            for b, sq in squared.items()}

Importance Scoring

| Signal | Weight | Source |
|---|---|---|
| QMD relevance score | 0.40 | QMD search result |
| Recency | 0.20 | File modification date |
| Source quality | 0.15 | Quality scorer output |
| User preference | 0.15 | Explicit pins/priorities |
| File type authority | 0.10 | Contracts > chat logs |
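
Combining these signals is a plain weighted sum; the signal keys below are hypothetical identifiers for the table rows, and each signal is assumed pre-normalized to 0.0-1.0:

```python
# Weights from the importance-scoring table (sum to 1.0).
WEIGHTS = {
    "qmd_relevance": 0.40,
    "recency": 0.20,
    "source_quality": 0.15,
    "user_preference": 0.15,
    "type_authority": 0.10,
}

def importance(signals: dict) -> float:
    """Combine per-block signals (each in 0.0-1.0) into a single
    importance score; missing signals contribute 0."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
```

The resulting score is what `allocate_budget` squares, so a block that wins on both relevance and recency pulls quadratically more of the token budget.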

Competitive Positioning

What exists (point solutions)

| Tool | What it does | What it doesn’t do |
|---|---|---|
| QMD | Text search (BM25 + vectors) | Images, audio, video, code, PDFs, compression |
| LLMLingua | Token compression | Retrieval, multimodal, versioning, quality |
| Whisper | Audio transcription | Everything else |
| LlamaIndex | RAG framework | Local processing, compression, quality scoring |
| Anthropic cache | Provider-specific caching | Cross-provider, multimodal, retrieval |

What TokenPak does (unified pipeline)

All of the above, orchestrated in a single pipeline.

The pitch: “One CLI that turns any directory into optimized LLM context. Local, private, cross-provider.”


Revenue Model

| Tier | Price | Includes |
|---|---|---|
| Open Source | Free (MIT) | CLI, text processing, basic compression, QMD integration |
| Pro | $99/mo | Multimodal processing, monitoring dashboard, priority support |
| Enterprise | $999/mo+ | Quality review queue, SLA, custom integrations, audit trails |
| Usage-based | $0.10-0.50 per 1M tokens processed | Alternative for high-volume users |

90-Day Revised Roadmap

Month 1: Core Pipeline (Weeks 1-4)

| Week | Deliverable |
|---|---|
| 1 | tokenpak index — directory walker + text processing + block registry |
| 2 | QMD integration — auto-index, search, budget allocation |
| 3 | tokenpak search — query → retrieve → compress → output |
| 4 | Code processing (Tree-sitter) + JSON/CSV schema extraction |

Month 2: Multimodal + Quality (Weeks 5-8)

| Week | Deliverable |
|---|---|
| 5 | PDF processing (digital + OCR with multi-engine fallback) |
| 6 | Quality scoring + review queue |
| 7 | Audio (Whisper) + image (LLM describe + cache) |
| 8 | Video (keyframes + transcript) |

Month 3: Launch (Weeks 9-12)

| Week | Deliverable |
|---|---|
| 9 | Monitoring dashboard (web UI) |
| 10 | npm + PyPI package publishing |
| 11 | Documentation, examples, benchmarks |
| 12 | Open source launch (GitHub, Hacker News, Discord) |

Agent Support

| Agent | Role |
|---|---|
| Kevin | Architecture decisions, positioning, launch strategy |
| Cali | Core implementation, multimodal processors, QMD integration |
| Trix | Proxy/monitoring, benchmarks, documentation, DevOps |
| Sue | Task management, QA, community prep |

Success Metrics

Phase 1 (Month 1) — Validation

Phase 2 (Month 2) — Multimodal

Phase 3 (Month 3) — Launch


Benchmark Data (Collected)

| Test | Avg tokens/request | Reduction | Date |
|---|---|---|---|
| Baseline (no optimization) | 20,801 | n/a | 2026-02-20 |
| QMD only | 6,136 | 70.5% | 2026-02-20 |
| TokenPak compression only | TBD | Pending A/B run | Pending |
| QMD + TokenPak hybrid | 3,265 | 84.3% vs baseline (≈47% further vs QMD-only) | 2026-02-21 |

Runtime Performance (Phase 2)


Architecture v2: 2026-02-20 Authors: Trix 🐰 + Kevin