CHECKLLMAI.

Document Intelligence &
PII De-Identification Firewall

Keep the productivity of LLM document processing — without sensitive data ever leaving a boundary you control.

CHECKLLMAI.
DEPLOYMENT ARM · STRINGIFY AI
chk_worker_node_01
INIT hooking legacy_db_cluster
OK read-only sentinel active
DETECT entity [Col: Pat_ID]
MASK tokenize → [PII_07]
BLOCK raw egress denied
AWAIT 1-click approval _
The Reality

Your teams adopted LLMs. Output went up.

Across operations, staff now paste documents into Gemini, OpenAI and Claude to draft, extract and reconcile. The productivity gain is real — and so is a new, mostly invisible, exposure.

Content leaves the building

PII and regulated financial data inside the document text are transmitted to a third party — names, account numbers, tax IDs, balances.

Patterns leak even when content is clean

Volume, timing and document mix become metadata that can be used to reconstruct how your business actually operates.

The question isn't whether the tool is useful — it's what leaves your control every time someone uses it.

Understand the Risk · 01

What actually happens on a third-party call

Document + prompt: your sensitive content, in clear text
→ Crosses your edge: leaves your network & legal control
→ Provider API: OpenAI · Gemini · Claude
→ Held on their infra: retention window · abuse-monitoring · usage telemetry
What the enterprise license buys you
A promise — not a boundary.

Zero-retention, no-training and DPA clauses are contractual assurances, enforced by trust and audit rights. They lower the likelihood of misuse. They do not make exposure technically impossible — and they say nothing about the metadata trail the calls leave behind.

Understand the Risk · 02

Now multiply that by every team, every day

One promise, stretched across hundreds of users and thousands of daily calls.
a. Shadow usage you can't see: staff using personal or unsanctioned tools.
b. Redaction done inconsistently, or not at all, person to person.
c. Every call adds to a metadata footprint you can't recall.
d. No single point where the rule is actually enforced.
How you contain it — defense in depth
01 · Process · review & approval steps → lowers likelihood
02 · Team policy · who may use what → lowers likelihood
03 · AI policy · acceptable use, data classes → lowers likelihood
04 · Hardware / network · egress & DLP controls → reduces surface
05 · Technical boundary · in-boundary processing → removes the possibility
Policy and process lower the odds. Only a technical boundary changes what is possible — and makes every other layer enforceable.

The Approach

Control the boundary — not the contract

So we move both the de-identification and the model that reasons over your data inside a boundary you own — and route every call through it.

Unstructured docs (PDF · scans · DB · storage)
→ Your Trust Boundary · Client or Mutual Infra: De-ID firewall → Self-hosted model
→ Structured output (safe · schema-valid)

Nothing sensitive crosses the boundary — the provider never sees raw PII, and no external usage statistics are generated.

Architecture

The processing pipeline

Trust Boundary — begins at ingestion · self-hosted OCR & model
01 · Ingestion adapter: per-record · batch · source
02 · OCR / Pre-processing: in-boundary · offline
03 · Detection / De-ID: taxonomy + profiles
04 · LLM processing: self-hosted · structured
05 · Output mapping: client schema · field policy
06 · Re-identification: optional · metered
LOG: per-record lineage · de-identified by default · reproducible {config + model + prompt}
Cross-cutting
Token vault: isolated trust zone
Entitlement & metering: module licensing
Config registry: versioned policy
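
The stage order above can be sketched as a chain of functions with a per-record lineage log. This is a minimal illustration under stated assumptions, not the product's implementation: every stage body is a placeholder, the names (Record, run_pipeline, STAGES) are hypothetical, and ingestion and re-identification are omitted for brevity. The point it shows is that the log carries only the lineage ID and stage name, never the payload.

```python
import re
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    payload: str
    lineage_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    log: list = field(default_factory=list)


def ocr(text):
    # Placeholder: a real OCR step would run offline, inside the boundary.
    return text.strip()


def deidentify(text):
    # Placeholder detector: masks one hypothetical patient-ID format only.
    return re.sub(r"\bPAT-\d{6}\b", "[PII_07]", text)


def llm_process(text):
    # Placeholder for the self-hosted model call; returns structured output.
    return {"summary": text}


def map_output(result):
    # Placeholder for mapping model output onto a client schema.
    return {"note": result["summary"]}


STAGES = [("ocr", ocr), ("deid", deidentify), ("llm", llm_process), ("output", map_output)]


def run_pipeline(record):
    data = record.payload
    for name, stage in STAGES:
        data = stage(data)
        # Log lineage and stage only; payloads are never written to the log.
        record.log.append({"lineage_id": record.lineage_id, "stage": name})
    return data


record = Record("  Visit note for PAT-004211  ")
result = run_pipeline(record)
```

Because de-identification runs before the model stage, the placeholder model only ever receives the masked text.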
Capabilities · Data Protection

Detect, de-identify, tokenize

Composable taxonomy

A base detector library — regex · NER · local-LLM · hybrid — composed with regulatory profiles and your own custom entities.

DPDP · HIPAA · GLBA · PCI · GDPR · + custom
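
As a rough sketch of how a base detector library might compose with regulatory profiles and custom entities, the example below uses regex detectors only. The entity names, patterns, the PROFILES mapping and the PAT_ID format are all illustrative assumptions, not the actual taxonomy; NER or local-LLM detectors would sit behind the same interface.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity: str
    start: int
    end: int
    text: str


# Hypothetical base detector library: entity name -> compiled pattern.
BASE_DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IN_PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical regulatory profiles: the base entities each one switches on.
PROFILES = {
    "DPDP": {"EMAIL", "IN_PAN"},
    "HIPAA": {"EMAIL", "US_SSN"},
}


def build_detectors(profiles, custom=None):
    """Union the entities of the chosen profiles, then add custom detectors."""
    enabled = set()
    for name in profiles:
        enabled |= PROFILES[name]
    detectors = {entity: BASE_DETECTORS[entity] for entity in enabled}
    detectors.update(custom or {})
    return detectors


def detect(text, detectors):
    """Run every enabled detector and return findings in document order."""
    findings = [
        Finding(entity, m.start(), m.end(), m.group())
        for entity, pattern in detectors.items()
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


detectors = build_detectors(["DPDP"], custom={"PAT_ID": re.compile(r"\bPAT-\d{6}\b")})
findings = detect("Patient PAT-004211, PAN ABCDE1234F, mail a.rao@example.com", detectors)
```

Here the DPDP profile enables two base entities and the client adds one custom entity, so three findings come back; switching to the HIPAA profile would change the enabled set without touching the detectors themselves.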
Reversible tokenization

Consistent tokens preserve the relationships the model needs to reason — real values stay local and rehydrate only after processing.

PARTY_1 owes ACCT_7 ₹2.4L
0 · No re-identification: output stays tokenized; map ephemeral, never persisted.
1 · Inside the API: rehydrated per field in an isolated vault; paid, metered.
2 · Client-side: tokens + secure map returned; the API never persists it.
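
Consistent, reversible tokenization can be sketched in a few lines. This is a simplified in-memory illustration, assuming a token format of ENTITY_n; the Tokenizer class and its method names are hypothetical, and a real deployment would keep the map in the isolated vault rather than in process memory. Discarding the map corresponds to mode 0 above.

```python
import re
from collections import defaultdict

# Assumed token shape: upper-case entity name, underscore, running index.
TOKEN_RE = re.compile(r"\b[A-Z][A-Z_]*_\d+\b")


class Tokenizer:
    """Maps each distinct sensitive value to one stable token per entity type."""

    def __init__(self):
        self._forward = {}                 # (entity, value) -> token
        self._reverse = {}                 # token -> value
        self._counters = defaultdict(int)  # entity -> last index issued

    def tokenize(self, entity, value):
        key = (entity, value)
        if key not in self._forward:
            self._counters[entity] += 1
            token = f"{entity}_{self._counters[entity]}"
            self._forward[key] = token
            self._reverse[token] = value
        return self._forward[key]

    def rehydrate(self, text):
        # Unknown tokens are left as-is rather than guessed.
        return TOKEN_RE.sub(lambda m: self._reverse.get(m.group(), m.group()), text)

    def discard_map(self):
        # Mode 0: the map is ephemeral, so output can no longer be rehydrated.
        self._forward.clear()
        self._reverse.clear()


tok = Tokenizer()
party = tok.tokenize("PARTY", "Asha Rao")
account = tok.tokenize("ACCT", "0012-3344")
masked = f"{party} owes {account} ₹2.4L"
restored = tok.rehydrate(masked)
same_party = tok.tokenize("PARTY", "Asha Rao")
tok.discard_map()
after_discard = tok.rehydrate(masked)
```

The same value always yields the same token, which is what lets the model reason about "who owes whom" without seeing the real name or account number.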
The token vault

Crown-jewel asset. Persisted only when re-id requires it; isolated zone with its own keys (BYOK / HSM) — the core service never holds the map.

Capabilities · Configurability & Assurance

Configurable by design, provable by record

Behaviour is declarative config, not code — and every disposition is logged, so compliance is demonstrable.

Protection technique: mask · tokenize · format-preserving encrypt · synthesize · minimize
Deployment tier: self-hosted in-boundary (default) · optional external routing
Routing / policy: sensitivity classification → technique + tier, per jurisdiction
Metadata protection: batching · timing decoupling · prompt normalization
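
The routing rule (sensitivity classification → technique + tier, per jurisdiction) might look like the declarative table below. This is a sketch under assumptions: the jurisdiction codes, sensitivity labels, POLICY entries and the route function are invented for illustration, and the fallback to tokenize in-boundary reflects the stated default tier rather than a documented rule.

```python
from dataclasses import dataclass

TECHNIQUES = {"mask", "tokenize", "fpe", "synthesize", "minimize"}
TIERS = {"in_boundary", "external"}


@dataclass(frozen=True)
class Route:
    technique: str
    tier: str

    def __post_init__(self):
        # Validate config values early so a typo fails before go-live.
        if self.technique not in TECHNIQUES:
            raise ValueError(f"unknown technique: {self.technique}")
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")


# Hypothetical per-client policy: (jurisdiction, sensitivity) -> route.
POLICY = {
    ("IN", "high"): Route("tokenize", "in_boundary"),
    ("IN", "low"): Route("mask", "in_boundary"),
    ("US", "high"): Route("fpe", "in_boundary"),
    ("US", "low"): Route("minimize", "external"),
}

# Assumed fallback: anything unclassified stays in-boundary and tokenized.
DEFAULT_ROUTE = Route("tokenize", "in_boundary")


def route(jurisdiction, sensitivity):
    return POLICY.get((jurisdiction, sensitivity), DEFAULT_ROUTE)


us_low = route("US", "low")
unknown = route("EU", "high")
```

Keeping the policy as data rather than branching logic means a change of rule is a config change that can be versioned and reviewed.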
Failure & low-confidence contract

Per stage and threshold: pass · flag · human-review queue. Invalid output triggers a repair-retry before falling through.
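
A minimal sketch of that contract, assuming two confidence thresholds and JSON as the structured output format. The threshold values, the function names and the repair callback are illustrative assumptions; returning None here stands in for "falling through" to whatever disposition the config specifies.

```python
import json


def disposition(confidence, pass_at=0.90, flag_at=0.60):
    """Map a detector confidence to pass, flag or human review."""
    if confidence >= pass_at:
        return "pass"
    if confidence >= flag_at:
        return "flag"
    return "human_review"


def parse_with_repair(raw, repair, max_retries=1):
    """Parse model output as JSON, trying a repair step before giving up.

    Returns the parsed object, or None to signal fall-through.
    """
    for attempt in range(max_retries + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            if attempt == max_retries:
                return None
            raw = repair(raw)


# Hypothetical repair step: close a truncated JSON object.
def close_brace(text):
    return text + "}"


repaired = parse_with_repair('{"amount": 240000', close_brace)
unrepairable = parse_with_repair("not json at all", close_brace)
```

In practice the repair step might re-prompt the model with the validation error; a string fix is used here only to keep the example self-contained.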

Config is managed like code

Per-client config is versioned, validated and test-harnessed before go-live; every output captures {config + model + prompt} for reproducible audit.
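
One way to capture {config + model + prompt} for reproducible audit is to store a stable fingerprint of each alongside the lineage ID. The sketch below assumes SHA-256 over canonical JSON; the field names and the audit_record helper are hypothetical, not the product's log schema.

```python
import hashlib
import json


def fingerprint(obj):
    """Hash a JSON-serializable object in a key-order-independent way."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(lineage_id, config, model_id, prompt):
    # Only hashes and identifiers are stored; no document content is logged.
    return {
        "lineage_id": lineage_id,
        "config_sha256": fingerprint(config),
        "model": model_id,
        "prompt_sha256": fingerprint(prompt),
    }


config_a = {"profile": "DPDP", "technique": "tokenize"}
config_b = {"technique": "tokenize", "profile": "DPDP"}  # same content, new key order
config_c = {"profile": "DPDP", "technique": "mask"}

rec_a = audit_record("rec-001", config_a, "local-model-v1", "Extract the fields.")
rec_b = audit_record("rec-001", config_b, "local-model-v1", "Extract the fields.")
rec_c = audit_record("rec-001", config_c, "local-model-v1", "Extract the fields.")
```

Two records with the same three fingerprints were produced under the same conditions, so an auditor can re-run the input and expect the same disposition.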

Deployment & The Journey

In-boundary deployment, de-risked rollout

How we deploy
Self-hosted open-weight model on client or mutually-trusted infra.
OCR & pre-processing run in-boundary — the boundary begins at ingestion, not the model.
In-boundary deployment directly satisfies RBI-style data-localization.
Logging de-identified by default; retention & access set per client.
Observational-first journey
01 · Sentinel · Days 1–14

Detect-only exposure report. Zero change — proves leak risk & recall.

02 · Co-Pilot · Days 15–60

Active masking; low-confidence routed to human review. Builds trust.

03 · Autopilot · Days 60+

Inline de-id + re-id per config. Runs autonomously — only exceptions surface.

Compliance & Assurance

Audit-ready by design

The stake — DPDP, in force
₹250cr · maximum penalty for a security-safeguard failure
₹200cr · maximum penalty for a breach-notification failure
May 2027 · hard-enforcement date

As an outsourcer you are typically a Data Processor — security obligations apply directly and flow down by contract.

DPDP processor obligation → how it's satisfied
Process only on instructions → Per-client config = documented instructions
Reasonable security safeguards → De-ID + isolated vault + in-boundary model
Logging & visibility → Lineage IDs + audit logs (≥ 1-yr retention)
Storage limitation / erasure → Config retention; vault purge by re-id mode
Demonstrable compliance → {config + model + prompt} version capture
In Summary

Safe by architecture.
Configurable by design.

01
Boundary control — data and model stay in-boundary; zero egress, no usage-stat leakage.
02
Configurable — profiles, actions, output and routing as declarative per-client config.
03
Provable — lineage and reproducible outputs map directly to DPDP / HIPAA evidence.
04
De-risked — Sentinel → Co-Pilot → Autopilot proves value before any autonomy.

Keep the productivity of LLM document processing — without sensitive data ever leaving a boundary you control.