SYS.STATUS: ONLINE

Document Intelligence &
PII De-Identification Firewall

Keep the productivity of LLM document processing — without sensitive data ever leaving a boundary you control.

CHECKLLMAI.

DEPLOYMENT ARM · STRINGIFY AI

chk_worker_node_01

INIT hooking legacy_db_cluster

OK read-only sentinel active

DETECT entity [Col: Pat_ID]

MASK tokenize → [PII_07]

BLOCK raw egress denied

AWAIT 1-click approval _

The Reality

Your teams adopted LLMs. Output went up.

Across operations, staff now paste documents into Gemini, OpenAI and Claude to draft, extract and reconcile. The productivity gain is real — and so is a new, mostly invisible, exposure.

Content leaves the building

PII and regulated financial data inside the document text are transmitted to a third party — names, account numbers, tax IDs, balances.

Patterns leak even when content is clean

Volume, timing and document mix become metadata that can be used to reconstruct how your business actually operates.


The question isn't whether the tool is useful — it's what leaves your control every time someone uses it.

Understand the Risk · 01

What actually happens on a third-party call

Document + prompt

your sensitive content, in clear text

Crosses your edge

leaves your network & legal control

Provider API

OpenAI · Gemini · Claude

Held on their infra

retention window · abuse-monitoring · usage telemetry

What the enterprise license buys you

A promise — not a boundary.

Zero-retention, no-training and DPA clauses are contractual assurances, enforced by trust and audit rights. They lower the likelihood of misuse. They do not make exposure technically impossible — and they say nothing about the metadata trail the calls leave behind.

Understand the Risk · 02

Now multiply that by every team, every day

One promise, stretched across hundreds of users and thousands of daily calls.

a · Shadow usage you can't see: staff using personal or unsanctioned tools.

b · Redaction done inconsistently, or not at all, from person to person.

c · Every call adds to a metadata footprint you can't recall.

d · No single point where the rule is actually enforced.

How you contain it — defense in depth

01 · Process (review & approval steps): lowers likelihood

02 · Team policy (who may use what): lowers likelihood

03 · AI policy (acceptable use, data classes): lowers likelihood

04 · Hardware / network (egress & DLP controls): reduces surface

05 · Technical boundary (in-boundary processing): removes the possibility


Policy and process lower the odds. Only a technical boundary changes what is possible — and makes every other layer enforceable.

The Approach

Control the boundary — not the contract

So we move both the de-identification and the model that reasons over your data inside a boundary you own — and route every call through it.

Unstructured docs

PDF · scans · DB · storage

Your Trust Boundary · Client or Mutual Infra

De-ID firewall

Self-hosted model

Structured output

safe · schema-valid

› Nothing sensitive crosses the boundary: the provider never sees raw PII, and no external usage statistics are generated.

Architecture

The processing pipeline

Trust Boundary — begins at ingestion · self-hosted OCR & model

01 · Ingestion adapter: per-record · batch · source

02 · OCR / Pre-processing: in-boundary · offline

03 · Detection / De-ID: taxonomy + profiles

04 · LLM processing: self-hosted · structured

05 · Output mapping: client schema · field policy

06 · Re-identification: optional · metered

LOG · per-record lineage · de-identified by default · reproducible {config + model + prompt}
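As a rough illustration of how the six stages might chain together, the sketch below treats each stage as a plain function over a record and logs one lineage entry per stage. Everything in it is hypothetical: the stage bodies are stubs, and names such as `run_pipeline`, `STAGES` and `known_pii` are assumptions made for illustration, not the product's API.

```python
from typing import Callable

Record = dict
Stage = Callable[[Record], Record]

# Stub stages (assumptions): each stands in for a real in-boundary component.
def ingest(rec: Record) -> Record:
    return {**rec, "source": rec.get("source", "batch")}

def ocr(rec: Record) -> Record:
    # Placeholder for OCR / pre-processing: here it only trims whitespace.
    return {**rec, "text": rec["raw"].strip()}

def deid(rec: Record) -> Record:
    # Placeholder detector: replaces values listed in rec["known_pii"].
    text, token_map = rec["text"], {}
    for i, value in enumerate(rec["known_pii"], start=1):
        token = f"PARTY_{i}"
        text = text.replace(value, token)
        token_map[token] = value
    return {**rec, "text": text, "token_map": token_map}

def llm(rec: Record) -> Record:
    # Placeholder for the self-hosted model: it only ever sees tokenized text.
    return {**rec, "fields": {"debtor": rec["text"].split()[0]}}

def map_output(rec: Record) -> Record:
    return {**rec, "output": {"debtor_name": rec["fields"]["debtor"]}}

def reidentify(rec: Record) -> Record:
    # Optional stage: swap tokens back to real values in the mapped output.
    out = {k: rec["token_map"].get(v, v) for k, v in rec["output"].items()}
    return {**rec, "output": out}

STAGES: list[tuple[str, Stage]] = [
    ("ingestion", ingest),
    ("ocr", ocr),
    ("de-id", deid),
    ("llm", llm),
    ("output-mapping", map_output),
]

def run_pipeline(record: Record, reid: bool = False) -> Record:
    stages = list(STAGES)
    if reid:
        stages.append(("re-identification", reidentify))
    lineage = []
    for name, stage in stages:
        record = stage(record)
        lineage.append(name)  # per-record lineage, stage by stage
    return {**record, "lineage": lineage}

doc = {"raw": "  Ravi Kumar owes 2.4L ", "known_pii": ["Ravi Kumar"]}
tokenized = run_pipeline(doc)
rehydrated = run_pipeline(doc, reid=True)
```

Re-identification is modelled as an optional sixth stage, so the default run ends with tokenized output.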

Cross-cutting

Token vault

isolated trust zone

Entitlement & metering

module licensing

Config registry

versioned policy

Capabilities · Data Protection

Detect, de-identify, tokenize

Composable taxonomy

A base detector library — regex · NER · local-LLM · hybrid — composed with regulatory profiles and your own custom entities.

DPDP · HIPAA · GLBA · PCI · GDPR · + custom
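One way to picture the composition described above is a small regex-only sketch: a base detector library, profiles that select from it, and a custom entity added on top. The profile-to-entity mapping, the `PAT-` identifier format and all names here are assumptions for illustration; they are not a statement of what any regulation requires, and the NER, local-LLM and hybrid detectors are out of scope.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Detector:
    entity: str
    pattern: re.Pattern

# Hypothetical base library (regex detectors only).
BASE_LIBRARY = {
    "EMAIL": Detector("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # Indian PAN layout: five letters, four digits, one letter.
    "PAN": Detector("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
}

# Illustrative profile contents, chosen for the example only.
PROFILES = {
    "DPDP": ["EMAIL", "PAN"],
    "HIPAA": ["EMAIL"],
}

def compose(profiles, custom=()):
    """Merge the detectors named by each profile, then append custom ones."""
    names = []
    for profile in profiles:
        for name in PROFILES[profile]:
            if name not in names:  # a detector shared by profiles runs once
                names.append(name)
    return [BASE_LIBRARY[n] for n in names] + list(custom)

def detect(text, detectors):
    """Return sorted (start, end, entity) spans for every match."""
    hits = []
    for det in detectors:
        for m in det.pattern.finditer(text):
            hits.append((m.start(), m.end(), det.entity))
    return sorted(hits)

# A client-specific entity (hypothetical patient-ID format).
patient_id = Detector("PATIENT_ID", re.compile(r"\bPAT-\d{6}\b"))

detectors = compose(["DPDP", "HIPAA"], custom=[patient_id])
sample = "Contact asha@example.com, PAN ABCDE1234F, patient PAT-000123."
hits = detect(sample, detectors)
```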

Reversible tokenization

Consistent tokens preserve the relationships the model needs to reason — real values stay local and rehydrate only after processing.

PARTY_1 owes ACCT_7 ₹2.4L

0 · No re-identification: output stays tokenized; the map is ephemeral and never persisted.

1 · Inside the API: rehydrated per field in an isolated vault; paid, metered.

2 · Client-side: tokens + secure map returned; the API never persists it.
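The sketch below illustrates consistent, reversible tokenization and the three re-identification modes listed above. It is a minimal in-memory version under stated assumptions: `Tokenizer`, `find_all` and `finalize` are hypothetical names, the sample text is invented, and the map lives in the same process, whereas the design described here keeps it in an isolated vault.

```python
import re

class Tokenizer:
    """Maps each distinct (entity, value) pair to one stable token."""

    def __init__(self):
        self._forward = {}   # (entity, value) -> token
        self._reverse = {}   # token -> value
        self._counters = {}  # entity -> last index issued

    def token_for(self, entity, value):
        key = (entity, value)
        if key not in self._forward:
            n = self._counters.get(entity, 0) + 1
            self._counters[entity] = n
            token = f"{entity}_{n}"
            self._forward[key] = token
            self._reverse[token] = value
        return self._forward[key]

    def tokenize(self, text, spans):
        """spans: non-overlapping (start, end, entity) tuples."""
        ordered = sorted(spans)
        # Issue tokens in reading order so numbering is deterministic.
        tokens = [self.token_for(e, text[s:end]) for s, end, e in ordered]
        # Replace right to left so earlier offsets stay valid.
        for (s, end, _), tok in reversed(list(zip(ordered, tokens))):
            text = text[:s] + tok + text[end:]
        return text

    def rehydrate(self, text):
        swap = lambda m: self._reverse.get(m.group(0), m.group(0))
        return re.sub(r"\b[A-Z]+_\d+\b", swap, text)

    def export_map(self):
        return dict(self._reverse)

def find_all(text, value, entity):
    """Helper for the example: spans of every occurrence of a known value."""
    spans, i = [], text.find(value)
    while i != -1:
        spans.append((i, i + len(value), entity))
        i = text.find(value, i + len(value))
    return spans

def finalize(tok, model_output, mode):
    if mode == 0:  # output stays tokenized, map is discarded with `tok`
        return {"output": model_output}
    if mode == 1:  # rehydrated on the service side
        return {"output": tok.rehydrate(model_output)}
    if mode == 2:  # tokens plus map handed back to the client
        return {"output": model_output, "token_map": tok.export_map()}
    raise ValueError(f"unknown re-identification mode: {mode}")

tok = Tokenizer()
source = "Ravi Kumar owes account 4471-2290 ₹2.4L. Ravi Kumar disputes it."
spans = find_all(source, "Ravi Kumar", "PARTY") + find_all(source, "4471-2290", "ACCT")
safe = tok.tokenize(source, spans)
# safe == "PARTY_1 owes account ACCT_1 ₹2.4L. PARTY_1 disputes it."
```

Both mentions of the same person collapse to `PARTY_1`, which is what lets a model reason about relationships without seeing the real value.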

The token vault

Crown-jewel asset. Persisted only when re-id requires it; isolated zone with its own keys (BYOK / HSM) — the core service never holds the map.

Capabilities · Configurability & Assurance

Configurable by design, provable by record

Behaviour is declarative config, not code — and every disposition is logged, so compliance is demonstrable.

Protection technique	mask · tokenize · format-preserving encrypt · synthesize · minimize
Deployment tier	self-hosted in-boundary (default) · optional external routing
Routing / policy	sensitivity classification → technique + tier, per jurisdiction
Metadata protection	batching · timing decoupling · prompt normalization
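The routing row above (sensitivity classification → technique + tier, per jurisdiction) could be expressed as a small declarative rule table. The rules, jurisdiction codes and the `route` helper below are illustrative assumptions, not shipped defaults.

```python
# Hypothetical declarative routing rules; the first matching rule wins.
# "*" is a wildcard for any jurisdiction or sensitivity class.
ROUTING_RULES = [
    {"jurisdiction": "IN", "sensitivity": "high", "technique": "tokenize", "tier": "self_hosted"},
    {"jurisdiction": "IN", "sensitivity": "low", "technique": "mask", "tier": "self_hosted"},
    {"jurisdiction": "*", "sensitivity": "high", "technique": "tokenize", "tier": "self_hosted"},
    {"jurisdiction": "*", "sensitivity": "*", "technique": "minimize", "tier": "external"},
]

def route(jurisdiction: str, sensitivity: str) -> tuple[str, str]:
    """Return (technique, tier) for a classified record."""
    for rule in ROUTING_RULES:
        if (rule["jurisdiction"] in (jurisdiction, "*")
                and rule["sensitivity"] in (sensitivity, "*")):
            return rule["technique"], rule["tier"]
    raise LookupError(f"no routing rule for {jurisdiction}/{sensitivity}")

decision_in_high = route("IN", "high")
decision_eu_low = route("EU", "low")
```

Because the rules are data rather than code, changing a jurisdiction's handling is a config change that can be versioned and reviewed.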

Failure & low-confidence contract

Per stage and threshold: pass · flag · human-review queue. Invalid output triggers a repair-retry before falling through.
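A minimal sketch of that contract, assuming illustrative thresholds and assuming that "falling through" means landing in the human-review queue. The function names, the threshold values and the example repair step are all hypothetical.

```python
import json

def disposition(confidence: float, pass_at: float = 0.90, flag_at: float = 0.60) -> str:
    """Map a stage confidence score to pass / flag / human_review."""
    if confidence >= pass_at:
        return "pass"
    if confidence >= flag_at:
        return "flag"
    return "human_review"

def parse_with_repair(raw: str, required: set, repair, max_retries: int = 1):
    """Validate model output; on failure, repair and retry, then fall through."""
    attempt = raw
    for i in range(max_retries + 1):
        try:
            data = json.loads(attempt)
            if isinstance(data, dict) and set(required).issubset(data):
                return data, "pass"
        except json.JSONDecodeError:
            pass
        if i < max_retries:
            attempt = repair(attempt)
    return None, "human_review"

def strip_to_json(text: str) -> str:
    """Example repair step: keep only the outermost {...} span."""
    return text[text.find("{"): text.rfind("}") + 1]

noisy = 'Sure! {"amount": 240000, "party": "PARTY_1"} Hope that helps.'
fields, status = parse_with_repair(noisy, {"amount", "party"}, strip_to_json)
```

In a real deployment the repair step would more likely re-prompt the model with the validation error; a string trim is used here only to keep the example self-contained.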

Config managed like code

Per-client config is versioned, validated and test-harnessed before go-live; every output captures {config + model + prompt} for reproducible audit.
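One plausible way to capture {config + model + prompt} per output is to store a content hash of each alongside the record ID. The sketch below uses SHA-256 over canonical JSON; the field names and the `lineage_record` helper are assumptions for illustration.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def lineage_record(record_id: str, config: dict, model_id: str, prompt: str) -> dict:
    """Audit entry that pins exactly which config, model and prompt ran."""
    return {
        "record_id": record_id,
        "config_sha256": fingerprint(config),
        "model": model_id,
        "prompt_sha256": fingerprint(prompt),
    }

config_a = {"profile": "DPDP", "technique": "tokenize"}
config_b = {"technique": "tokenize", "profile": "DPDP"}  # same content, new order

entry = lineage_record("rec-0001", config_a, "local-model-v1", "Extract the debtor.")
```

An auditor can later re-hash the archived config and prompt and compare against the stored values to confirm which versions produced a given output.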

Deployment & The Journey

In-boundary deployment, de-risked rollout

How we deploy

› Self-hosted open-weight model on client or mutually trusted infra.

› OCR & pre-processing run in-boundary: the boundary begins at ingestion, not at the model.

› In-boundary deployment directly supports RBI-style data-localization requirements.

› Logging is de-identified by default; retention & access are set per client.

Observational-first journey

01 · Sentinel · Days 1–14

Detect-only exposure report. Zero change to existing workflows; quantifies leak risk & detection recall.

02 · Co-Pilot · Days 15–60

Active masking; low-confidence detections routed to human review. Builds trust.

03 · Autopilot · Days 61+

Inline de-id + re-id per config. Runs autonomously; only exceptions surface.
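The three phases can be read as one pipeline running in different modes. The sketch below is a simplified interpretation: the `Phase` enum, the 0.8 review threshold and the action labels are assumptions, and treating low confidence as the Autopilot "exception" is a guess at what surfaces in that phase.

```python
from enum import Enum

class Phase(Enum):
    SENTINEL = "sentinel"    # detect-only, nothing is changed
    COPILOT = "copilot"      # mask, with human review on low confidence
    AUTOPILOT = "autopilot"  # mask autonomously, surface exceptions only

def mask(text: str, spans) -> str:
    """Replace each (start, end, entity) span with an [ENTITY] placeholder."""
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity}]" + text[end:]
    return text

def process(phase: Phase, text: str, spans, confidence: float,
            review_below: float = 0.8) -> dict:
    if phase is Phase.SENTINEL:
        # Observational: report exposure, pass the document through untouched.
        return {"text": text, "action": "report_only", "exposures": len(spans)}
    masked = mask(text, spans)
    if confidence < review_below:
        action = "human_review" if phase is Phase.COPILOT else "exception"
    else:
        action = "auto"
    return {"text": masked, "action": action, "exposures": len(spans)}

note = "Patient PAT-000123 admitted"
found = [(8, 18, "PATIENT_ID")]
observed = process(Phase.SENTINEL, note, found, confidence=0.95)
reviewed = process(Phase.COPILOT, note, found, confidence=0.50)
automated = process(Phase.AUTOPILOT, note, found, confidence=0.95)
```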

Compliance & Assurance

Audit-ready by design

The stake — DPDP, in force

₹250cr

security-safeguard failure

₹200cr

breach-notification failure

May 2027

hard-enforcement date

As an outsourcer you are typically a Data Processor — security obligations apply directly and flow down by contract.

DPDP processor obligation → how it's satisfied

Process only on instructions	Per-client config = documented instructions
Reasonable security safeguards	De-ID + isolated vault + in-boundary model
Logging & visibility	Lineage IDs + audit logs (≥ 1-yr retention)
Storage limitation / erasure	Config retention; vault purge by re-id mode
Demonstrable compliance	{config + model + prompt} version capture

In Summary

Safe by architecture.
Configurable by design.

01 · Boundary control: data and model stay in-boundary; zero egress, no usage-stat leakage.

02 · Configurable: profiles, actions, output and routing as declarative per-client config.

03 · Provable: lineage and reproducible outputs map directly to DPDP / HIPAA evidence.

04 · De-risked: Sentinel → Co-Pilot → Autopilot proves value before any autonomy.

› Keep the productivity of LLM document processing, without sensitive data ever leaving a boundary you control.
