Keep the productivity of LLM document processing — without sensitive data ever leaving a boundary you control.
Across operations, staff now paste documents into Gemini, OpenAI and Claude to draft, extract and reconcile. The productivity gain is real — and so is a new, mostly invisible, exposure.
PII and regulated financial data inside the document text are transmitted to a third party — names, account numbers, tax IDs, balances.
Volume, timing and document mix become metadata that can be used to reconstruct how your business actually operates.
The question isn't whether the tool is useful — it's what leaves your control every time someone uses it.
Zero-retention, no-training and DPA clauses are contractual assurances, enforced by trust and audit rights. They lower the likelihood of misuse. They do not make exposure technically impossible — and they say nothing about the metadata trail the calls leave behind.
Policy and process lower the odds. Only a technical boundary changes what is possible — and makes every other layer enforceable.
So we move both the de-identification and the model that reasons over your data inside a boundary you own — and route every call through it.
Nothing sensitive crosses the boundary — the provider never sees raw PII, and no external usage statistics are generated.
A base detector library — regex · NER · local-LLM · hybrid — composed with regulatory profiles and your own custom entities.
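As a minimal sketch of how a base detector library might be composed with a regulatory profile and custom entities, the example below covers only the regex layer. The pattern names, the `build_detector` helper and the `CLIENT_ACCT` entity are hypothetical illustrations, not the product's actual API. A real deployment would add NER and local-LLM detectors behind the same interface.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity: str
    start: int
    end: int
    text: str


# Hypothetical base library: a couple of simple regex detectors.
BASE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def build_detector(profile, custom=None):
    """Compose base patterns named in `profile` with client-specific ones."""
    patterns = {name: BASE_PATTERNS[name] for name in profile}
    for name, expr in (custom or {}).items():
        patterns[name] = re.compile(expr)

    def detect(text):
        found = [
            Finding(name, m.start(), m.end(), m.group())
            for name, rx in patterns.items()
            for m in rx.finditer(text)
        ]
        # Return findings in document order.
        return sorted(found, key=lambda f: f.start)

    return detect


# Assumed profile and custom entity, for illustration only.
detect = build_detector(
    ["EMAIL", "US_SSN"],
    custom={"CLIENT_ACCT": r"\bACC-\d{6}\b"},
)
findings = detect("Contact jane@example.com about ACC-123456, SSN 123-45-6789.")
```

Keeping every detector behind the same `detect(text)` shape is one way to let regex, NER and LLM-based detectors be swapped or combined per client without changing downstream code.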
Consistent tokens preserve the relationships the model needs to reason — real values stay local and rehydrate only after processing.
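The idea of consistent tokens can be sketched as follows: the same real value always maps to the same placeholder, so the model can still tell that two mentions refer to one account, and the mapping is applied in reverse once the model's output comes back. The `Tokenizer` class, the `[ENTITY_n]` token format and the account pattern are assumptions made for illustration.

```python
import re


class Tokenizer:
    """Maps each distinct (entity, value) pair to one stable placeholder."""

    def __init__(self):
        self._forward = {}   # (entity, value) -> token
        self._reverse = {}   # token -> real value (stays local)
        self._counters = {}  # entity -> last index used

    def token_for(self, entity, value):
        key = (entity, value)
        if key not in self._forward:
            n = self._counters.get(entity, 0) + 1
            self._counters[entity] = n
            token = f"[{entity}_{n}]"
            self._forward[key] = token
            self._reverse[token] = value
        return self._forward[key]

    def deidentify(self, text, pattern, entity):
        # Replace every match with its consistent token.
        return pattern.sub(lambda m: self.token_for(entity, m.group()), text)

    def rehydrate(self, text):
        # Swap known tokens back; unknown tokens are left untouched.
        return re.sub(
            r"\[[A-Z_]+_\d+\]",
            lambda m: self._reverse.get(m.group(), m.group()),
            text,
        )


tok = Tokenizer()
acct = re.compile(r"\bACC-\d{6}\b")  # hypothetical account-number format
masked = tok.deidentify(
    "Move funds from ACC-111111 to ACC-222222, then close ACC-111111.",
    acct,
    "ACCT",
)
# Stand-in for what a model might return when given the masked text.
model_output = "Closed [ACCT_1]; [ACCT_2] received the balance."
restored = tok.rehydrate(model_output)
```

Here `masked` contains `[ACCT_1]` twice, which preserves the fact that the first and last account are the same, while the real numbers exist only in the local mapping.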
The token vault is the crown-jewel asset. It is persisted only when re-identification requires it, and it sits in an isolated zone with its own keys (BYOK / HSM); the core service never holds the map.
Behaviour is declarative config, not code — and every disposition is logged, so compliance is demonstrable.
| Config dimension | Options |
| --- | --- |
| Protection technique | mask · tokenize · format-preserving encrypt · synthesize · minimize |
| Deployment tier | self-hosted in-boundary (default) · optional external routing |
| Routing / policy | sensitivity classification → technique + tier, per jurisdiction |
| Metadata protection | batching · timing decoupling · prompt normalization |
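
The routing row above can be sketched as a declarative policy: a sensitivity class is looked up per jurisdiction and resolves to a protection technique plus a deployment tier. The specific class names, the `EU` override and the mappings are hypothetical; only the technique and tier vocabulary comes from the table. This sketch fails closed, sending anything unclassified to the strictest in-boundary handling.

```python
# Hypothetical per-client policy: plain data, no code changes to adjust it.
POLICY = {
    "default": {
        "public": {"technique": "minimize", "tier": "external"},
        "internal": {"technique": "mask", "tier": "in-boundary"},
        "regulated": {"technique": "tokenize", "tier": "in-boundary"},
    },
    # Jurisdiction overrides replace individual default rules.
    "EU": {
        "internal": {"technique": "tokenize", "tier": "in-boundary"},
    },
}

# Assumed safe fallback when a sensitivity class is not recognised.
FAIL_CLOSED = {"technique": "tokenize", "tier": "in-boundary"}


def route(sensitivity, jurisdiction="default"):
    """Resolve a sensitivity class to a technique and tier."""
    rules = {**POLICY["default"], **POLICY.get(jurisdiction, {})}
    return rules.get(sensitivity, FAIL_CLOSED)


eu_internal = route("internal", "EU")
default_internal = route("internal")
unknown = route("unclassified")
```

Because the policy is data, it can be versioned, diffed and validated like any other config artifact.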
Per stage and threshold: pass · flag · human-review queue. Invalid output triggers a repair-retry before falling through.
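A rough sketch of the threshold and repair-retry behaviour, under stated assumptions: confidence thresholds of 0.9 and 0.6 are placeholders, model output is assumed to be JSON, and output that is still invalid after the retry is assumed to fall through to the human-review queue. The `repair` callable stands in for whatever repair step is configured.

```python
import json


def disposition(confidence, pass_at=0.9, flag_at=0.6):
    """Map a stage's confidence score to pass, flag, or human-review."""
    if confidence >= pass_at:
        return "pass"
    if confidence >= flag_at:
        return "flag"
    return "human-review"


def parse_with_repair(raw, repair, max_retries=1):
    """Validate output as JSON; on failure, try `repair` before giving up."""
    candidate = raw
    for attempt in range(max_retries + 1):
        try:
            return json.loads(candidate), "pass"
        except json.JSONDecodeError:
            if attempt < max_retries:
                candidate = repair(candidate)
    # Still invalid after the allowed retries: hand off to a person.
    return None, "human-review"


# Toy repair step that closes a truncated JSON object.
fixed, status = parse_with_repair('{"total": 42', lambda s: s + "}")
failed, failed_status = parse_with_repair("not json", lambda s: s)
```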
Per-client config is versioned, validated and test-harnessed before go-live; every output captures {config + model + prompt} for reproducible audit.
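One way to capture {config + model + prompt} for each output is a content hash over a canonical serialization, so any change to one of the three yields a different audit identifier. This is a sketch only; the `audit_fingerprint` name, the field layout and the example model identifier are assumptions.

```python
import hashlib
import json


def audit_fingerprint(config, model, prompt):
    """Return a stable SHA-256 hex digest over config, model and prompt."""
    payload = json.dumps(
        {"config": config, "model": model, "prompt": prompt},
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),    # fixed whitespace for a canonical form
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical values for illustration.
fp_a = audit_fingerprint(
    {"version": 3, "technique": "tokenize"}, "local-model-v1", "Extract totals."
)
fp_b = audit_fingerprint(
    {"technique": "tokenize", "version": 3}, "local-model-v1", "Extract totals."
)
fp_c = audit_fingerprint(
    {"version": 3, "technique": "tokenize"}, "local-model-v1", "Extract dates."
)
```

Storing the fingerprint alongside each output lets an auditor confirm which exact configuration, model and prompt produced it, provided the referenced versions are themselves retained.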
Detect-only exposure report. No change to existing workflows; quantifies leak risk and detector recall.
Active masking, with low-confidence detections routed to human review. Builds trust.
Inline de-id + re-id per config. Runs autonomously — only exceptions surface.
As an outsourcer you are typically a Data Processor — security obligations apply directly and flow down by contract.
| Obligation | How it is met |
| --- | --- |
| Process only on instructions | Per-client config = documented instructions |
| Reasonable security safeguards | De-ID + isolated vault + in-boundary model |
| Logging & visibility | Lineage IDs + audit logs (≥ 1-yr retention) |
| Storage limitation / erasure | Config retention; vault purge by re-id mode |
| Demonstrable compliance | {config + model + prompt} version capture |