Developer Guide
This page explains how anonymization is implemented in Backend for different pipelines, and how caching of detected entities works.
For the high-level flow, see Architecture › Anonymization.
Services
Backend exposes two anonymization services with different scopes and dependencies:
Anonymization configuration from Admin (e.g., "ignore URLs", "ignore list") is enforced in the NER Service layer only. The NerModelAnonymizer path bypasses that layer and therefore does not apply ignore lists or the URL-ignore toggle.
1) Anonymizer (Ingestion + Analytics)
- Purpose: Remove PII from incoming conversation messages and analytics events during ingestion.
- Dependency: Calls the NER Service over RPC/HTTP.
- Flow:
  - Backend receives a message/event in the ingestion pipeline.
  - Backend calls NER Service with the raw text.
  - NER Service invokes the configured NER Model (e.g., Flair or spaCy) to detect entities, applies masking/removal, and returns anonymized text.
  - Backend stores/publishes the anonymized version (Cloud SQL, BigQuery).
- Notes:
  - Honors account settings (anonymize on/off, ignore URLs, ignore list).
  - Used for both message storage and analytics publications.
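The ingestion flow above can be sketched as follows. This is only an illustration under stated assumptions: `ingest_message`, `call_ner_service`, and `sinks` are hypothetical names, and the real transport (RPC/HTTP) and storage clients are replaced by plain callables so the example stays self-contained.

```python
from typing import Callable, Iterable


def ingest_message(
    raw_text: str,
    call_ner_service: Callable[[str], str],
    sinks: Iterable[Callable[[str], None]],
) -> str:
    """Anonymize raw text via the NER Service, then hand it to every sink.

    call_ner_service stands in for the RPC/HTTP call to the NER Service
    (hypothetical signature); sinks stand in for storage/analytics writers.
    """
    # The service detects entities, applies masking, and returns clean text.
    anonymized = call_ner_service(raw_text)
    # Only the anonymized version is stored or published.
    for publish in sinks:
        publish(anonymized)
    return anonymized


# Usage with a fake service and an in-memory sink.
stored: list[str] = []
fake_service = lambda text: text.replace("John Doe", "<PER>")
result = ingest_message("Hi, I am John Doe.", fake_service, [stored.append])
print(result)
```

Passing the service call in as a parameter keeps the raw text from reaching any sink directly, which mirrors the ordering of the flow steps.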
2) NerModelAnonymizer (Assistant threads and traces)
- Purpose: Remove PII from Assistant threads and traces (after sending to external LLMs).
- Dependency: Calls the NER Model directly (bypasses the NER Service layer).
- Flow:
  - Backend extracts assistant thread text segments.
  - Backend invokes the configured NER Model directly to detect entities.
  - Backend applies masking/removal using returned entity spans/types.
  - The cleaned content proceeds through the Assistant pipeline.
- Notes:
  - Designed for lower-latency, in-process use where the service hop is undesirable.
  - Shares the same model choice (Flair/spaCy) semantics as the NER Service.
  - Does not apply Admin anonymization exceptions (no ignore list, no "ignore URLs"); those options are only honored by the NER Service.
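The masking step in this path, applied in-process from entity spans, might look like the sketch below. The `Entity` shape, the `<LABEL>` placeholder format, and the assumption that spans do not overlap are all illustrative choices, not the actual Backend types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "PER"


def mask_entities(text: str, entities: list[Entity]) -> str:
    """Replace each span with a <LABEL> placeholder.

    Assumes spans do not overlap. Spans are processed right to left so
    that earlier offsets stay valid while the string changes length.
    """
    out = text
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        out = out[:ent.start] + f"<{ent.label}>" + out[ent.end:]
    return out


text = "Hi, I am John Doe."
start = text.index("John Doe")
masked = mask_entities(text, [Entity(start, start + len("John Doe"), "PER")])
print(masked)
```

No ignore list or URL toggle appears here, consistent with this path bypassing the NER Service layer.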
Relationship to Architecture
The architecture diagram shows Backend → NER Service → NER Model for the ingestion path. NerModelAnonymizer corresponds to a direct Backend → NER Model path used by the Assistant subsystem.
Caching
Entity detection results are cached centrally to reduce latency and model load.
- Store: Redis (shared cluster)
- What is cached: Detected entity spans/types for a given input text
- Timeout: CACHE_TIMEOUT, currently one hour
Actual cache key in use:
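cache_key = generate_cache_key(
    "ner_cache",
    "anonymize_with_ner_model",
    predictions_url,
    text,
)
# generate_cache_key implementation (conceptual; signature inferred from the call above):
import hashlib
import json

def generate_cache_key(cache_name, func_name, *args):
    # Serialize the arguments deterministically, then hash them.
    key_data = f"{func_name}:{json.dumps(args, sort_keys=True, default=str)}"
    # Keep the first 16 hex characters of the SHA-256 digest.
    key_hash = hashlib.sha256(key_data.encode()).hexdigest()[:16]
    return f"{cache_name}:{func_name}:{key_hash}"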
# Resulting Redis key format
"ner_cache:anonymize_with_ner_model:<16-hex-sha256>"
Inputs and implications:
- args = [predictions_url, text]
- predictions_url embeds the model selection (e.g., Flair vs spaCy). Changing the model changes this URL and therefore invalidates prior keys automatically.
- Other account options (e.g., ignore URLs, ignore list) are not part of the key. This is safe because the cache stores only detected entities; masking is applied after retrieval using the current options.
- No text normalization is applied before keying. Near-duplicates that differ in whitespace, casing, or punctuation will not hit the cache.
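To make these implications concrete, the sketch below rebuilds the key from the conceptual implementation and compares keys for slightly different inputs. The `*args` signature, the `ner_key` helper, and the spaCy URL are assumptions for illustration; only the Flair URL comes from the example on this page.

```python
import hashlib
import json


def generate_cache_key(cache_name: str, func_name: str, *args) -> str:
    # Mirrors the conceptual implementation; the exact signature is assumed.
    key_data = f"{func_name}:{json.dumps(args, sort_keys=True, default=str)}"
    key_hash = hashlib.sha256(key_data.encode()).hexdigest()[:16]
    return f"{cache_name}:{func_name}:{key_hash}"


def ner_key(predictions_url: str, text: str) -> str:
    # Hypothetical helper that fixes the cache and function names.
    return generate_cache_key(
        "ner_cache", "anonymize_with_ner_model", predictions_url, text
    )


flair_url = "https://ner.internal/predict?model=flair"
spacy_url = "https://ner.internal/predict?model=spacy"  # hypothetical URL
text = "Hi, I am John Doe."

base = ner_key(flair_url, text)
same_input = ner_key(flair_url, text)
trailing_space = ner_key(flair_url, text + " ")
lowercased = ner_key(flair_url, text.lower())
other_model = ner_key(spacy_url, text)

# Compare each variant against the base key.
for name, key in [
    ("same input", same_input),
    ("trailing space", trailing_space),
    ("lowercased", lowercased),
    ("other model", other_model),
]:
    print(f"{name}: {'hit' if key == base else 'miss'}")
```

Only the identical input reuses the key. If near-duplicate traffic is common, normalizing text before keying would raise the hit rate, but the cached spans would then refer to the normalized text, so offsets would need care.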
Behavior:
- On cache hit, entity spans/types are reused and masking is applied without calling the model.
- On miss, the model is invoked, results are stored with CACHE_TIMEOUT, and then applied.
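The hit/miss behavior can be sketched with an in-memory stand-in for Redis. `InMemoryCache`, `detect_entities_cached`, and the injectable clock are hypothetical; the production code uses the shared Redis cluster, whose client API is not shown here.

```python
import time
from typing import Any, Callable, Optional

CACHE_TIMEOUT = 3600  # seconds; "currently one hour" per this page


class InMemoryCache:
    """Minimal TTL cache standing in for Redis (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._data: dict[str, tuple[Any, float]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key: str, value: Any, timeout: float) -> None:
        self._data[key] = (value, self._clock() + timeout)


def detect_entities_cached(cache, key: str, text: str, call_model):
    """Return cached entities on a hit; otherwise call the model and store."""
    entities = cache.get(key)
    if entities is None:  # an empty list is a valid cached result
        entities = call_model(text)
        cache.set(key, entities, CACHE_TIMEOUT)
    return entities
```

Checking `is None` rather than truthiness matters: text with no entities yields an empty list, and that result should be cached too.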
Example:
predictions_url = "https://ner.internal/predict?model=flair"
text = "Hi, I am John Doe. Phone: +3161234567"
cache_key => "ner_cache:anonymize_with_ner_model:<hash16>"
Operational notes:
- Model/version changes: Ensure predictions_url uniquely identifies the deployed model (and version) so cache keys naturally rotate on upgrades.
- Option changes: Since entities are cached (not masked text), toggling "ignore URLs" or updating the ignore list affects only the masking step and does not require cache invalidation.
- Privacy: Cache values should contain only derived entities/spans, not raw sensitive text.
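The sketch below illustrates why option changes need no invalidation: the cached value holds only offsets and labels, and the options are consulted at masking time. The `Span` type, the `mask` function, the `"URL"` label, and the case-insensitive ignore-list match are assumptions made for this example, not the NER Service's actual rules.

```python
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive offset
    end: int    # exclusive offset
    label: str


def mask(
    text: str,
    spans: Iterable[Span],
    ignore_list: Iterable[str] = (),
    ignore_urls: bool = False,
) -> str:
    """Apply masking from cached spans using the options current at call time."""
    ignored = {term.lower() for term in ignore_list}
    out = text
    # Right-to-left replacement keeps earlier offsets valid (no overlaps assumed).
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        surface = text[span.start:span.end]
        if surface.lower() in ignored:
            continue  # term is on the ignore list
        if ignore_urls and span.label == "URL":  # "URL" label is hypothetical
            continue
        out = out[:span.start] + f"<{span.label}>" + out[span.end:]
    return out


text = "Ask John Doe via https://example.com/help"
name_start = text.index("John Doe")
url_start = text.index("https://")
# Cached value: offsets and labels only, no raw text.
cached = [
    Span(name_start, name_start + len("John Doe"), "PER"),
    Span(url_start, len(text), "URL"),
]

print(mask(text, cached))
print(mask(text, cached, ignore_urls=True))
print(mask(text, cached, ignore_list=["john doe"]))
```

The same cached spans produce three different outputs, so toggling an option changes the result without touching the cache.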
Usage notes
- Ingestion pipelines should use Anonymizer to ensure uniform behavior with analytics and storage.
- Assistant pipelines should use NerModelAnonymizer for in-process redaction with minimal overhead.
- Account-level configuration from Admin (model choice, toggles) only applies to the Anonymizer path.
Related
- Architecture: Anonymization, Data Ingestion
- Administration: Anonymization - User Guide