Middleware: PII filtering and intelligent routing
LocalAI ships a request-middleware layer that sits between the HTTP API and
the backend dispatcher. Two subsystems share that layer because they share
the same lifecycle hook: PII filtering scans the request body before it
reaches a backend (and the SSE stream on the way out), and the intelligent
router rewrites input.Model so a single client-facing model name fans
out across multiple downstream targets.
Both are inspected and configured from the same admin page
(/app/middleware), backed by the same REST surface (/api/middleware/*,
/api/pii/*, /api/router/*) and the same MCP tools.
Request lifecycle
The router runs first (it picks the target model so per-model PII has
something to gate on), per-model PII runs next (gated by the resolved
config), the backend executes, and the streaming PII filter rewrites the
SSE response in flight. Each subsystem writes to its own admin-visible
log: /api/router/decisions for routing, /api/pii/events for redaction
and block actions.
PII filtering
PII redaction is per-model and off by default. The default flips to
on for any backend whose name starts with proxy- because that traffic
crosses the network to a third-party provider. Explicit pii.enabled
in a model’s YAML always wins over the backend default.
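The precedence described above can be sketched in a few lines. This is an illustration only: the function name and the example backend name proxy-openai are hypothetical, and `explicit` stands in for whatever pii.enabled resolves to in the model YAML (None when unset).

```python
from typing import Optional


def pii_enabled(backend: str, explicit: Optional[bool] = None) -> bool:
    """Resolve whether PII filtering applies to a model.

    `explicit` mirrors pii.enabled from the model YAML (None when unset).
    """
    if explicit is not None:
        # An explicit setting always wins over the backend default.
        return explicit
    # Backend default: on for proxy-* backends, off for everything else.
    return backend.startswith("proxy-")
```

For example, `pii_enabled("proxy-openai")` is on by default, while `pii_enabled("proxy-openai", explicit=False)` opts that model out.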
Pattern catalog
The built-in regex tier ships six patterns. Each has a default action
(mask, block, or route_local) and a length cap that prevents
pathological inputs from blowing up scanning time:
| ID | Description | Default action | Max length |
|---|---|---|---|
| email | Email address | mask | 254 |
| phone | Phone number (international or US) | mask | 24 |
| ssn | US Social Security Number | mask | 11 |
| credit_card | Credit card number (Luhn-verified) | mask | 19 |
| ipv4 | IPv4 address | mask | 15 |
| api_key_prefix | sk-, pk-, xoxb-, ghp_, github_pat_ | block | 200 |
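The Luhn verification noted for credit_card is a standard checksum that filters out digit runs that merely look like card numbers. A minimal sketch of the checksum itself follows; the function name is hypothetical and this is not taken from the LocalAI source.

```python
def luhn_valid(number: str) -> bool:
    """Return True when the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```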
mask rewrites the match to [REDACTED:<id>] in the request body before
forwarding. block returns HTTP 400 with error.type=pii_blocked to the
client without forwarding. route_local is reserved for the routing
integration (see below) and falls back to mask when no local route is
available.
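The three actions can be illustrated with a small redactor. This is a sketch under stated assumptions: the regular expressions are simplified stand-ins (the shipped expressions are not shown here), the length cap is left out, and `PIIBlocked`, `Pattern`, and `redact` are hypothetical names.

```python
import re
from dataclasses import dataclass


class PIIBlocked(Exception):
    """Raised for a block action; a server would map this to HTTP 400."""


@dataclass
class Pattern:
    id: str
    regex: re.Pattern
    action: str  # "mask", "block", or "route_local"


# Simplified, illustrative expressions only.
PATTERNS = [
    Pattern("api_key_prefix",
            re.compile(r"\b(?:sk-|pk-|xoxb-|ghp_|github_pat_)[A-Za-z0-9_-]+"),
            "block"),
    Pattern("email",
            re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
            "mask"),
    Pattern("ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "mask"),
]


def redact(text: str, local_route_available: bool = False) -> str:
    """Apply each pattern's action; return the rewritten body or raise."""
    for p in PATTERNS:
        def handle(m: re.Match, p: Pattern = p) -> str:
            action = p.action
            if action == "route_local" and not local_route_available:
                action = "mask"  # documented fallback when no local route exists
            if action == "block":
                raise PIIBlocked(p.id)
            if action == "mask":
                return f"[REDACTED:{p.id}]"
            return m.group(0)  # route_local: leave the body for the router
        text = p.regex.sub(handle, text)
    return text
```

With these stand-in patterns, `redact("mail me at bob@example.com")` yields `mail me at [REDACTED:email]`, and any text containing an `sk-` style token raises `PIIBlocked`.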
Per-model configuration
Add a pii: block to a model YAML to opt in, opt out, or override
per-pattern actions.
The regex itself stays global — only the action is settable per-model.
Adding new patterns is a build-time concern (extend patternRegexps in
core/services/routing/pii/patterns.go).
NER tier (optional)
The regex matcher covers high-precision patterns. For natural-language
PII (proper names, addresses, organization names) LocalAI carries an
encoder NER tier that runs after the regex pass. It expects a
transformers token-classification model wired through the TokenClassify
gRPC primitive (e.g. dslim/bert-base-NER). The detector annotates
spans with an entity group (PER, LOC, ORG, MISC); per-group
actions are configurable through the same pii: block.
The NER tier ships as a contract (NERDetector, NERConfig in
core/services/routing/pii/ner.go); an operator-facing knob to load and
attach a detector is not plumbed yet. When no detector is configured the
regex tier still runs.
Streaming PII filter
Buffered (/v1/chat/completions without "stream": true) responses are
forwarded verbatim today — only the request-side scan runs. Streaming
responses run through pii.StreamFilter which buffers SSE chunks until
either a full pattern matches or the buffer’s max length is reached,
then emits the safe prefix. The streaming filter is what makes the
cloud-proxy backend and the MITM proxy safe to expose to clients that
issue streaming requests.
The streaming filter is wired automatically for any model with pii.enabled
true — there is no separate streaming toggle.
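The buffering idea can be sketched as follows. This is not the pii.StreamFilter implementation: it handles a single pattern, works on the text content of chunks rather than SSE framing, and the class and method names are hypothetical. The key point is that the filter holds back a tail of at most max length minus one characters, because an incomplete match no longer than the cap must start inside that tail.

```python
import re


class StreamFilter:
    """Masks one pattern across chunk boundaries (text only, no SSE framing)."""

    def __init__(self, regex: re.Pattern, pattern_id: str, max_len: int):
        self.regex = regex
        self.mask = f"[REDACTED:{pattern_id}]"
        self.max_len = max_len
        self.buf = ""

    def feed(self, chunk: str) -> str:
        """Add a chunk and return the prefix that is safe to emit."""
        self.buf += chunk
        return self._drain(final=False)

    def flush(self) -> str:
        """Emit whatever is left once the stream ends."""
        return self._drain(final=True)

    def _drain(self, final: bool) -> str:
        out, pos = [], 0
        # Hold back max_len - 1 characters unless the stream is finished.
        hold_from = len(self.buf) if final else max(0, len(self.buf) - (self.max_len - 1))
        for m in self.regex.finditer(self.buf):
            # A match touching the buffer end may still grow, up to max_len.
            pending = m.end() == len(self.buf) and len(m.group(0)) < self.max_len
            if pending and not final:
                hold_from = min(hold_from, m.start())
                break
            out.append(self.buf[pos:m.start()])
            out.append(self.mask)
            pos = m.end()
        hold_from = max(hold_from, pos)
        out.append(self.buf[pos:hold_from])
        self.buf = self.buf[hold_from:]
        return "".join(out)
```

Feeding the chunks `"my ssn is 123-"`, `"45-67"`, `"89, thanks"` through a filter built with `re.compile(r"\d{3}-\d{2}-\d{4}")` and a max length of 11, then calling `flush()`, emits `my ssn is [REDACTED:ssn], thanks` even though the number is split across three chunks.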
Admin page
The /app/middleware page (admin role only) has four tabs — Filtering,
Routing, MITM Proxy (see the MITM doc),
and Events. The Filtering tab shows:
- The pattern catalogue with live action dropdowns. Changing an action via the UI calls PUT /api/pii/patterns/:id and updates the live redactor in-process. Click Persist in the action header to write the current state into runtime_settings.json so the next process start re-applies it.
- A per-model resolved-state table: each model row reports enabled, the per-pattern overrides, and which patterns are effectively active.
- A live test panel that posts sample text to /api/pii/test and highlights matches with their resolved actions, without storing the text in the event log.
REST surface
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /api/pii/patterns | any | Live pattern list with current actions. Used by the UI catalogue. |
| POST | /api/pii/test | any | Dry-run the redactor on {"text":"..."}. Returns hits and the would-be-rewritten body. Does not write to the event log. |
| GET | /api/pii/events | admin | Recent middleware events — PII redactions, MITM connect/traffic, admission denials. Filterable by correlation_id, user_id, pattern_id, kind. |
| PUT | /api/pii/patterns/:id | admin | Update a pattern in-process. Body accepts {"action":"mask"|"block"|"route_local"} and/or {"disabled":true|false}. Transient — reverts on restart unless persisted. |
| POST | /api/pii/patterns/persist | admin | Snapshot the live per-pattern (action, disabled) state into runtime_settings.json. |
| GET | /api/middleware/status | admin | Aggregated dashboard data: patterns + per-model resolved state + router status + MITM status + admission status. One round-trip for the UI. |
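A client for two of the endpoints above might be built like this. The paths and request bodies come from the table; the base URL, the bearer-token header, and all function names are assumptions to adjust for your deployment. Response shapes are not parsed beyond JSON decoding because they are not specified here.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8080"  # assumption: adjust to your deployment


def _request(method: str, path: str, body: dict,
             token: Optional[str] = None) -> urllib.request.Request:
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the API key is passed as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def pii_test_request(text: str, token: Optional[str] = None) -> urllib.request.Request:
    """Dry-run the redactor; nothing is written to the event log."""
    return _request("POST", "/api/pii/test", {"text": text}, token)


def set_pattern_action_request(pattern_id: str, action: str,
                               token: str) -> urllib.request.Request:
    """Transient update of one pattern's action (admin only)."""
    if action not in ("mask", "block", "route_local"):
        raise ValueError(f"unknown action: {action}")
    return _request("PUT", f"/api/pii/patterns/{pattern_id}",
                    {"action": action}, token)


def send(req: urllib.request.Request) -> dict:
    """Execute a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send(pii_test_request("reach me at bob@example.com"))` against a running instance returns the hits and the would-be-rewritten body described in the table.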
MCP tools
The same surface is mirrored through the LocalAI Assistant MCP server so the in-process and stdio assistants can manage the filter conversationally:
| Tool | Read/Write | Purpose |
|---|---|---|
| list_pii_patterns | read | Returns the live pattern list. |
| get_pii_events | read | Recent redaction / block events with optional filters. |
| test_pii_redaction | read | Dry-run sample text without writing to the event log. |
| get_middleware_status | read | Aggregator: the same payload as GET /api/middleware/status. |
| set_pii_pattern_action | write | Update a pattern’s action. Admin-only. |
| persist_pii_patterns | write | Snapshot live state to runtime_settings.json. Admin-only. |
Intelligent routing
A router model is a model whose YAML carries a router: block. When
a client addresses it ("model": "smart-router"), the middleware
classifies the prompt, picks a downstream candidate model, rewrites
input.Model to the candidate, and the standard model-resolution path
runs against that resolved target. ACL checks, disabled-state, and
per-model PII all apply to the resolved model — the router does
model selection only.
Depth-1 invariant
Candidates must not themselves be router models. A
smart-router → claude-strict → cloud-proxy chain is fine
(claude-strict is a regular cloud-proxy model). A
smart-router → other-router → real-model chain is rejected at runtime
by the middleware (the dispatcher returns HTTP 500 with a
depth-1 invariant error). This keeps the dispatch graph acyclic and
predictable.
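The invariant amounts to a one-level check over a router's candidates. The sketch below assumes a hypothetical in-memory shape (a dict of model name to parsed config, where router models carry a "router" key with a "candidates" list); the real config structs may differ.

```python
def check_depth1(models: dict, router_name: str) -> None:
    """Raise if any candidate of `router_name` is itself a router model.

    `models` maps model name -> parsed config dict (hypothetical shape).
    """
    router = models[router_name]["router"]
    for cand in router["candidates"]:
        target = models.get(cand["model"], {})
        if "router" in target:
            raise ValueError(
                f"depth-1 invariant: candidate {cand['model']!r} "
                f"of {router_name!r} is itself a router"
            )
```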
Fallback
If no candidate’s label set covers the active label set from the classifier,
or the classifier errors out, the router uses cfg.Router.Fallback.
An empty fallback causes the dispatch to fail with HTTP 500 rather
than silently routing somewhere unintended — fail-fast, not
silent-bypass.
Available classifiers
LocalAI ships two classifier implementations. Pick one with classifier:
in the router YAML:
| Classifier | When to use | Underlying primitive |
|---|---|---|
| score (default) | Small classifier-tuned LM (Arch-Router-style). Best when label vocabulary is well-covered by next-token continuation. | Score gRPC primitive (llama-cpp, vLLM). |
| colbert | When label descriptions are abstract or short and a next-token classifier produces flat distributions. Robust on long-form policy descriptions. | rerankers backend in ColBERT mode (e.g. bge-m3-colbert from the gallery). |
Both classifiers share the same YAML shape: classifier_model,
policies, candidates, fallback, activation_threshold,
classifier_cache_size, and the optional embedding_cache block.
The Score classifier
The score classifier works like this:
- Build a Qwen/ChatML system prompt that lists every policy label with its description and primes the model to emit a label as the assistant turn.
- Ask the classifier model to score every policy label as the first-token(s) continuation. This uses the Score gRPC primitive (backend.proto::Score), which returns per-candidate log-probabilities, length-normalized so candidates of unequal token length stay comparable.
- Softmax the length-normalized log-probabilities into a probability distribution over labels.
- Threshold the distribution: every label whose probability passes activation_threshold joins the active label set.
- Pick the FIRST candidate whose Labels is a superset of the active set. Admins order candidates smallest → largest so a single-label query routes to the smallest capable model, while a query that activates multiple labels falls to a candidate that covers them all.
This is the Arch-Router approach extended for multi-label. The distribution carries more signal than the argmax — reading off the spread lets one prompt activate multiple policies and route to a model capable of all of them.
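The softmax, threshold, and first-superset steps can be sketched as below, starting from log-probabilities that a Score call would have returned. Function names, label names, and model names are hypothetical; treating "passes" as greater-than-or-equal is an assumption.

```python
import math


def softmax(logprobs: dict) -> dict:
    """Turn per-label log-probabilities into a distribution over labels."""
    m = max(logprobs.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logprobs.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def active_labels(probs: dict, threshold: float) -> set:
    """Labels whose probability reaches the activation threshold."""
    return {label for label, p in probs.items() if p >= threshold}


def pick_candidate(candidates: list, active: set, fallback: str) -> str:
    """First candidate (ordered smallest to largest) covering the active set.

    Note: under plain superset semantics an empty active set is covered by
    the first candidate; how the real router treats that case is not stated.
    """
    for name, labels in candidates:
        if set(labels) >= active:
            return name
    if not fallback:
        # Fail fast rather than routing somewhere unintended.
        raise RuntimeError("no candidate covers the active labels and no fallback is set")
    return fallback
```

With log-probabilities `{"code": -0.2, "chat": -2.5, "math": -1.0}` and candidates `[("small", {"code"}), ("large", {"code", "math", "chat"})]`, a threshold of 0.40 activates only `code` and selects `small`, while 0.15 also activates `math` and selects `large`.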
Recommended classifier model
Arch-Router-1.5B is the canonical choice. It’s a Qwen-2.5-1.5B-Instruct base trained specifically on routing-policy continuation, so the ChatML system-prompt + label-continuation pattern produces well-separated label probabilities without prompt tuning. The Q4_K_M GGUF runs on CPU, GPU, and Intel SYCL.
The classifier model must support the Score gRPC primitive (today: the
llama-cpp and vLLM backends) and use the ChatML chat template. Any small
ChatML instruct model works under those constraints, but expect flatter
probability distributions which translate to a higher
activation_threshold to keep noise out of the active label set.
On llama-cpp, declare known_usecases: [score] on the classifier
model — LocalAI rejects configs that combine score with
chat/completion/embeddings there, because the Score RPC races
the llama_context against slot-loop traffic.
The Colbert classifier
The colbert classifier reranks each policy description against the
prompt via the rerankers backend and activates the labels whose
relevance scores clear activation_threshold (default 0.5 for
reranker-style scores in [0, 1]).
The reranker scores the description (natural English) rather than
asking a small LM to score the label as a next-token continuation,
so it tends to be more robust when policy labels are abstract slugs
(compliance-review, tier-2-support). The trade-off is one
reranker round-trip per request — bge-m3 in ColBERT mode is fast
enough on GPU that this is comparable to the Score path for most
workloads. The embedding_cache block applies identically.
The reranker model’s type: (in the model YAML) selects which
underlying scoring head loads — colbert for late-interaction MaxSim,
cross-encoder for cross-attention scoring. The classifier itself is
indifferent; pick the head that fits your latency / quality budget.
YAML reference
Tuning activation_threshold
The threshold is the single knob you’ll want to tune per (classifier-model, policy-set) pair. On Arch-Router-1.5B with the three-policy setup above, sweeping the threshold over a hand-labeled 30-prompt corpus produced:
| Threshold | Label-set accuracy | End-to-end routing accuracy |
|---|---|---|
| 0.15 (package default) | 30% | 73% |
| 0.30 | 57% | 87% |
| 0.40 | 60% | 90% |
| 0.45 | 67% | 97% |
| 0.50 | 67% | 97% |
The classifier’s argmax matches the dominant label 93% of the time on this corpus — what the threshold controls is how much secondary-label noise leaks into the active label set. Low thresholds push single-label queries to multi-label-capable (larger) candidates unnecessarily; 0.40 keeps the dominant label dominant without losing genuine compound activations.
Re-tune per (classifier-model, policy-set) pair. The /api/score
endpoint (see below) is the convenient probe — it returns the raw
length-normalized log-probabilities so you can sweep thresholds offline
without driving real chat completions.
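An offline sweep over collected log-probabilities might look like the sketch below. It assumes you have already gathered, per prompt, a dict of label to length-normalized log-probability plus a hand-labeled expected label set; the function names are hypothetical and the response format of the probe endpoint is not modeled.

```python
import math


def label_set_accuracy(samples: list, threshold: float) -> float:
    """Fraction of samples whose active label set equals the expected set.

    `samples` is a list of (logprobs dict, expected label set) pairs.
    """
    hits = 0
    for logprobs, expected in samples:
        m = max(logprobs.values())
        exps = {k: math.exp(v - m) for k, v in logprobs.items()}
        total = sum(exps.values())
        active = {k for k, e in exps.items() if e / total >= threshold}
        hits += active == set(expected)
    return hits / len(samples)


def sweep(samples: list, thresholds: list) -> dict:
    """Label-set accuracy for each threshold in `thresholds`."""
    return {t: label_set_accuracy(samples, t) for t in thresholds}
```

Running `sweep(samples, [0.15, 0.30, 0.40, 0.45, 0.50])` over your own labeled corpus gives a label-set accuracy column comparable to the one in the table above; end-to-end routing accuracy additionally requires applying the candidate-selection step.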
Embedding cache (L2)
Classification is the most expensive thing the middleware does. The score classifier already memo-caches verbatim repeats (case- and whitespace-folded prompt → decision); the embedding cache is the L2 tier that catches semantically similar prompts — “How do I exit vim?” and “i need to quit vim” can share a decision instead of running the classifier twice.
Pairs naturally with a larger / slower classifier model: the steady-state
cost on cache hits collapses to one embedding round-trip plus a KNN
search, both well under 100ms with nomic-embed-text-v1.5 + local-store.
Configuration
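Add an embedding_cache: block to a router model. The fields referenced in this section are embedding_model, similarity_threshold, confidence_threshold, and store_name.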
Omit the block entirely to disable. The cache adds two new failure modes (embedder unavailable, store unavailable) — both fall through to the inner classifier so routing keeps working.
How it works
For each request:
- Embed the probe prompt via the configured embedding_model.
- KNN top-1 against the per-router local-store collection.
- If similarity ≥ similarity_threshold, return the cached decision (Cached=true, CacheSimilarity=<sim> in the decision log).
- Miss → run the inner classifier. If decision.score >= confidence_threshold, insert (embedding, decision) into the store. Low-confidence decisions are deliberately skipped so they can’t poison future paraphrases.
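The lookup-then-insert flow above can be sketched with an in-memory list standing in for the local-store collection. Everything here is illustrative: `embed` and `classify` are caller-supplied stand-ins for the embedding model and the inner classifier, and the class and field names are hypothetical. The default thresholds reuse the 0.80 and 0.60 values from the tuning notes.

```python
import math
from dataclasses import dataclass, replace


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Decision:
    model: str
    score: float
    cached: bool = False
    cache_similarity: float = 0.0


class EmbeddingCache:
    def __init__(self, embed, classify,
                 similarity_threshold: float = 0.80,
                 confidence_threshold: float = 0.60):
        self.embed = embed
        self.classify = classify
        self.similarity_threshold = similarity_threshold
        self.confidence_threshold = confidence_threshold
        self.store = []  # (embedding, Decision) pairs; stand-in for local-store

    def route(self, prompt: str) -> Decision:
        try:
            vec = self.embed(prompt)
        except Exception:
            # Embedder unavailable: fall through to the inner classifier.
            return self.classify(prompt)
        # KNN top-1 by cosine similarity (linear scan in this sketch).
        best = max(self.store, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None:
            sim = cosine(vec, best[0])
            if sim >= self.similarity_threshold:
                return replace(best[1], cached=True, cache_similarity=sim)
        decision = self.classify(prompt)
        if decision.score >= self.confidence_threshold:
            # Only confident decisions are cached.
            self.store.append((vec, decision))
        return decision
```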
The local-store collection is named router-cache-<router-model-name> by
default — each router gets its own collection so two routers can’t
cross-contaminate. Collections persist on disk (local-store is the
canonical persistent vector backend), so the cache survives restarts.
Tuning notes
- Similarity threshold: 0.80 is the package default — re-tune per (embedding model, corpus). The histogram on the Routing tab shows where the cosine distribution actually sits; pick a threshold above the cross-intent cluster and below the paraphrase cluster.
- Confidence threshold: 0.60 corresponds roughly to “the classifier is committed to a top label.” Don’t lower this — caching unsure decisions propagates the uncertainty.
- Cache flush: the classifier cache invalidates automatically when the router YAML changes (it is fingerprinted by yaml.Marshal), but the underlying local-store collection still holds the old payloads. Flush manually via local-store admin, or rename store_name, if you need a hard reset.
- Latency budget: an embedding round-trip (typically 30–80ms for small embedding models) plus KNN search (~5ms) is added to every miss on top of the classifier latency. Cache hits skip the classifier entirely. Break-even is around 7–10% hit rate; agent loops with repeated phrasing easily exceed this.
Admin page
The /app/middleware page has a Routing tab listing every router
model’s classifier, policies, candidates, and fallback. The Events
tab shows the decision log — one row per classified request with
correlation ID, requested model, served model, classifier name, active
labels, top-label score, and latency.
Routing decisions are stored in an in-process ring buffer (default
capacity 5,000). The decision log is for audit and tuning — the
canonical usage log lives in /api/usage and correlates by request ID.
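A bounded in-process log of this kind is straightforward to model with a fixed-size deque. The sketch below is not the LocalAI implementation; the class name and the dict-shaped rows are hypothetical, and only the default capacity of 5,000 comes from the text.

```python
from collections import deque
from typing import Optional


class DecisionLog:
    """In-memory ring buffer: the oldest rows drop off once capacity is reached."""

    def __init__(self, capacity: int = 5000):
        self._buf = deque(maxlen=capacity)

    def record(self, decision: dict) -> None:
        self._buf.append(decision)

    def recent(self, limit: Optional[int] = None, **filters) -> list:
        """Newest first, optionally filtered by exact field match."""
        rows = [d for d in reversed(self._buf)
                if all(d.get(k) == v for k, v in filters.items())]
        return rows if limit is None else rows[:limit]
```

Because nothing is persisted, a restart empties the log, which is why the usage log remains the source for long-horizon audit.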
REST surface
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /api/router/status | any | Router configuration: each router model’s classifier, policies, candidates. |
| GET | /api/router/decisions | admin | Decision log with optional filters (correlation_id, user_id, router_model, limit). |
| POST | /api/score | admin | Direct access to the Score gRPC primitive — useful for offline threshold tuning. Body: {"model": "<classifier-model>", "prompt": "<chatml-prompt>", "candidates": ["label-a", ...], "length_normalize": true}. The llama-cpp and vLLM backends implement Score; other backends return UNIMPLEMENTED. |
MCP tools
| Tool | Read/Write | Purpose |
|---|---|---|
| get_router_decisions | read | Recent decision log with optional filters. |
| get_middleware_status | read | Includes the router section listing configured router models. |
Mutating routing config — adding a candidate, changing the classifier
model — is YAML-only today; reload with POST /models/reload to pick
up edits without restarting.
Operational notes
- Reload after YAML edits. The router configs are loaded at startup and cached. POST /models/reload re-reads from disk; the next request rebuilds the classifier from the new config (the classifier cache is fingerprinted by yaml.Marshal(RouterConfig) so it invalidates automatically).
- Classifier latency on Arch-Router-1.5B Q4_K_M is ~500ms steady for 3 policies on Intel SYCL. The score primitive re-decodes the full prompt for every candidate today (the KV cache is cleared between candidates); the prompt-KV-sharing optimization is on the perf TODO list in backend/cpp/llama-cpp/grpc-server.cpp::Score. Until then, classifier_cache_size is the highest-leverage knob for repeat-query workloads (agent loops).
- Decision log size: 5,000-entry ring buffer per process. The log is in-process and not persisted; pair with the usage log for long-horizon audit.
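The fingerprint-based invalidation mentioned in the first note can be illustrated with a content hash of the serialized config. This is a Python analogue, not the Go code: it serializes with canonical JSON instead of yaml.Marshal, and `fingerprint`, `ClassifierCache`, and the `build` callback are hypothetical names.

```python
import hashlib
import json


def fingerprint(router_cfg: dict) -> str:
    """Stable hash of the router config (canonical JSON as the serialization)."""
    blob = json.dumps(router_cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class ClassifierCache:
    """Rebuilds the classifier only when the config fingerprint changes."""

    def __init__(self, build):
        self._build = build  # callable: router config -> classifier object
        self._key = None
        self._classifier = None

    def get(self, router_cfg: dict):
        key = fingerprint(router_cfg)
        if key != self._key:
            self._classifier = self._build(router_cfg)
            self._key = key
        return self._classifier
```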
Related features
- Cloud passthrough proxy: combine the router with proxy-* backends to send simple prompts to local models and complex ones to cloud providers.
- MITM proxy: apply the same PII filter to Claude Code, Codex CLI, and any HTTPS client without LocalAI holding their API keys.
- Authentication: admin role is required for mutating endpoints and the /app/middleware page; in no-auth single-user mode the synthetic local user has admin role automatically.