Catching Korean PII with Presidio Custom Recognizers — Implementing Triple Verification with Regex + Checksum + Context
Honestly, when I was first tasked with PII detection, I naively thought "a few regular expressions should do the trick" — and got burned badly. When I ran \d{6}-\d{7} as a resident registration number pattern, all sorts of number combinations from phone numbers to order numbers came flooding in. Running only regex on 80,000 customer data records produced over 2,600 false positives, but adding checksum verification brought that down to under 30. That was the moment I viscerally learned that regex alone doesn't cut it.
The amended Personal Information Protection Act (PIPA), set to take effect in September 2026, raised the penalty cap from 3% to 10% of revenue. PII detection is no longer something you can put off for later. Microsoft's open-source framework Presidio tackles this problem with a triple layer of regex + checksum verification + context analysis. By following this guide, you'll implement custom Recognizers that detect resident registration numbers, business registration numbers, and passport numbers, and verify them with test code.
⚠️ Before You Start: You'll need the following packages to run the code in this article.
The spaCy Korean model ko_core_news_lg is used for Presidio's context analysis. Pattern matching and checksums work without the model, but context-based score boosting will be unavailable.
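The article's package list did not survive extraction, so here is a typical setup sketch. The package names `presidio-analyzer` and `presidio-anonymizer` are the standard PyPI names, and `ko_core_news_lg` is downloaded via spaCy's usual model command — verify versions against your own environment.

```shell
# Assumed setup: standard PyPI packages for Presidio plus the spaCy Korean model
pip install presidio-analyzer presidio-anonymizer

# Korean NLP model used for context analysis (large pipeline)
python -m spacy download ko_core_news_lg
```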
Presidio Architecture — Separation of Analyzer and Anonymizer
Presidio is broadly divided into two engines. Presidio Analyzer detects PII in text, and Presidio Anonymizer masks, pseudonymizes, or encrypts the detected PII. The fact that these two are separated is quite important in practice — detection rules change frequently, but de-identification policies are relatively stable. This means each can be deployed and updated independently.
Within the Analyzer, the actual PII detection is handled by Recognizers. Presidio comes with built-in recognizers for default entities like PERSON, EMAIL, and CREDIT_CARD, and you can plug in custom recognizers like plugins. I've also tried AWS Macie and Google Cloud DLP, but Presidio was virtually the only open-source tool that allowed this level of fine-grained detection logic customization.
Three Types of Custom Recognizers
| Type | How It Works | Best Suited For |
| --- | --- | --- |
| PatternRecognizer | Regex + deny-list + `validate_result` checksum | PII with fixed formats (resident registration numbers, business registration numbers, etc.) |
| Remote Recognizer | External API calls (Azure AI Language, etc.) | Cloud-based detection service integration |
| LangExtract Recognizer | LLM-based (GPT-4, etc.) natural language detection | Unstructured PII (free-form addresses, etc.) that regex can't cover |
PatternRecognizer is the key to catching Korean PII. It's a two-stage structure where regex narrows down candidates and the validate_result method filters them with checksums — and it's more powerful than you might think.
Triple Verification — Regex, Checksum, Context
A single regex alone yields abysmal precision. It's a problem I've encountered countless times in practice. Presidio's PatternRecognizer stacks three layers to address this.
```text
[Stage 1] Regex Matching → Candidate Extraction (score: 0.3)
        ↓
[Stage 2] validate_result → Checksum Verification
        ├─ Pass → Score retained (0.3–0.4, initial value set in pattern)
        └─ Fail → Candidate removed (deleted from results)
        ↓
[Stage 3] Context Enhancement → Surrounding Text Analysis
        └─ Score boosted when context words found (final 0.75–1.0)
```
What is Context Enhancement? It's a mechanism that dynamically increases detection confidence by checking whether contextual words like "주민등록번호" (resident registration number) or "주민번호" appear near regex-matched text. The default implementation, LemmaContextAwareEnhancer, compares lemmas of surrounding tokens and adds up to 0.45 to the existing score upon a match. For example, a pattern with an initial score of 0.3 can be boosted to 0.75 upon a successful context match. This behavior can be fine-tuned with the context_similarity_factor and min_score_with_context_similarity parameters.
What is a lemma? It refers to the base form of a word. The lemma of the English word "running" is "run," and the lemma of the Korean "주민등록번호를" should be "주민등록번호." Since Presidio's context analysis relies on this lemma comparison, the quality of morphological analysis directly determines detection quality.
I initially wondered "do we really need context analysis too?" — but once you experience how often the business registration number pattern (\d{3}-\d{2}-\d{5}) overlaps with phone numbers and account numbers, your thinking completely changes.
Practical Implementation: Per-PII Type
Example 1: Resident Registration Number Recognizer — Checksum and Score Branching
Let's start with the trickiest part. Resident registration numbers have a strong checksum, but there's a tricky exception: numbers issued after October 2020 don't follow the existing checksum algorithm. Missing this creates a critical bug where all recent resident registration numbers become false negatives.
Here's the checksum algorithm: multiply the first 12 digits by the weights [2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5], sum the results, divide by 11 to get the remainder, and check if (11 - remainder) % 10 matches the last digit.
The problem is that PatternRecognizer's validate_result can only return a bool. Since it's a simple structure where True keeps the candidate and False removes it, you can't handle the branching logic of "high score on checksum pass, low score on fail" within validate_result. So we override the analyze method to adjust scores directly.
```python
from typing import List

from presidio_analyzer import Pattern, PatternRecognizer, RecognizerResult


class KrRrnRecognizer(PatternRecognizer):
    """Resident registration number recognizer: branches the score on checksum pass/fail."""

    PATTERNS = [
        Pattern(
            "KR_RRN",
            r"\b(\d{2})(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"
            r"\s?[-\u2013]?\s?([1-4]\d{6})\b",
            0.3,
        ),
    ]
    CONTEXT = [
        "주민등록번호", "주민번호", "주민",
        "resident registration", "RRN", "생년월일",
    ]

    def __init__(self):
        super().__init__(
            supported_entity="KR_RRN",
            patterns=self.PATTERNS,
            context=self.CONTEXT,
            supported_language="ko",
        )

    @staticmethod
    def _checksum_valid(digits: str) -> bool:
        weights = [2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5]
        total = sum(int(d) * w for d, w in zip(digits[:12], weights))
        check = (11 - (total % 11)) % 10
        return check == int(digits[12])

    def validate_result(self, pattern_text: str) -> bool:
        digits = "".join(filter(str.isdigit, pattern_text))
        if len(digits) != 13:
            return False
        # Format validation only; score branching happens in analyze
        return True

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts=None,
        regex_flags=None,
    ) -> List[RecognizerResult]:
        results = super().analyze(text, entities, nlp_artifacts, regex_flags)
        for result in results:
            matched = text[result.start : result.end]
            digits = "".join(filter(str.isdigit, matched))
            if len(digits) == 13 and self._checksum_valid(digits):
                result.score = max(result.score, 0.85)
            # On checksum failure the score stays at 0.3 and we rely on context analysis
        return results
```
| Code Section | Role | Key Point |
| --- | --- | --- |
| `Pattern(..., 0.3)` | Sets the initial match score low | Structure where score is boosted upon checksum/context pass |
| `(0[1-9]\|1[0-2])` | Restricts the month range (01–12) | Reduces false positives compared to a plain `\d{2}` |
| `[1-4]\d{6}` | Restricts the first digit of the last 7 digits to 1–4 | Gender code (1–4) verification |
| `_checksum_valid` | Weighted multiplication → mod 11 → compare last digit | Precise filtering for pre-2020 issuances |
| `analyze` override | 0.85 on checksum pass, retains 0.3 on fail | Keeps post-2020 issuances as candidates while distinguishing via score |
With this approach, when the downstream pipeline sets a threshold of 0.5, checksum-passing entries are reliably caught, while non-passing entries are decided based on context analysis results — creating a flexible structure.
Example 2: Business Registration Number Recognizer — The War Against Over-Matching
Business registration numbers have a reliable checksum, but the problem is that the 10-digit number pattern overlaps with too many other things. I was quite startled when I first implemented it and saw phone numbers matching left and right.
The core of the checksum algorithm is: multiply the first 9 digits by the authentication key [1, 3, 7, 1, 3, 7, 1, 3, 5], sum the results, add the quotient from the special handling of the 9th digit, `(int(digits[8]) * 5) // 10`, and check whether `(10 - sum % 10) % 10` matches the last digit. Surprisingly many implementations online miss this 9th-digit handling, so it's worth double-checking.
| Element | Role | Key Point |
| --- | --- | --- |
| Weights `[1, 3, 7, 1, 3, 7, 1, 3, 5]` | Official National Tax Service verification algorithm | Unique weights for business registration numbers |
| `(int(digits[8]) * 5) // 10` | Special handling for the 9th digit | Omitting this part completely breaks the verification |
Honestly, this recognizer is quite powerful with the checksum alone, but there are still cases where it overlaps with phone numbers (010-1234-5678) or postal codes, so I don't recommend using it standalone without context words.
Practical Tip: In production data, em dashes (—, U+2014) or fullwidth hyphens (－, U+FF0D) can sneak in. If you're dealing with OCR or web-crawled data, it's worth considering expanding the hyphen portion of the regex to something like [-\u2013\u2014\uff0d]?.
Example 3: Passport Number Recognizer — Survival Strategy for Patterns Without Checksums
Passport numbers have no checksum, so you must rely entirely on format matching + context. Honestly, this recognizer is the most unstable of the three, which is precisely why it requires a more careful approach.
Next-generation e-passports (2021 onward) use the M123A4567 format (1 letter + 3 digits + 1 letter + 4 digits), while old passports use 1–2 letters + 7–8 digits.
The reason the next-generation passport pattern ([A-Z]\d{3}[A-Z]\d{4}) has a higher score than the old one is that the alternating letter-number-letter-number pattern itself has higher specificity. The old pattern uses [A-Z]{1}\d{8} to narrow the match to exactly 1 letter + exactly 8 digits. In the initial draft, I had it broad as [A-Z]{1,2}\d{7,8}, but this ended up matching product codes and serial numbers, making it impractical.
Given the lack of a checksum, it's safest to keep passport number scores at 0.4 or below when detecting standalone without context words.
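Before wiring these into a Recognizer, the two formats can be sanity-checked with plain `re`. The patterns follow the formats described above; the constant names are mine.

```python
import re

# Next-generation e-passport (2021 onward): letter + 3 digits + letter + 4 digits
NEW_PASSPORT = re.compile(r"[A-Z]\d{3}[A-Z]\d{4}")
# Old passport, deliberately narrowed: exactly 1 letter + exactly 8 digits
OLD_PASSPORT = re.compile(r"[A-Z]\d{8}")

for sample in ["M123A4567", "M12345678", "AB1234567"]:
    kind = (
        "new" if NEW_PASSPORT.fullmatch(sample)
        else "old" if OLD_PASSPORT.fullmatch(sample)
        else "no match"
    )
    print(sample, "->", kind)
```

Note that `AB1234567` (2 letters + 7 digits) gets no match: that's the deliberate trade-off of narrowing the old-passport pattern to avoid matching product codes and serial numbers.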
Test Code — Verifying the Implementation Works Correctly
Once you've implemented the three recognizers, it's important to verify they actually work. The test code below lets you validate the key scenarios.
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(KrRrnRecognizer())
analyzer.registry.add_recognizer(KrBrnRecognizer())
analyzer.registry.add_recognizer(KrPassportRecognizer())


def test_rrn_with_valid_checksum():
    """An RRN with a valid checksum: score should be at least 0.85."""
    # 821011-1234560 is a synthetic number whose check digit satisfies the algorithm
    results = analyzer.analyze(
        text="주민등록번호: 821011-1234560",
        language="ko",
        entities=["KR_RRN"],
    )
    assert len(results) >= 1
    rrn_results = [r for r in results if r.entity_type == "KR_RRN"]
    # Context ("주민등록번호") plus a passing checksum yields a high score
    print(f"[RRN checksum pass] score: {rrn_results[0].score}")


def test_rrn_without_context():
    """Only the RRN pattern, no context words: score should stay lower."""
    results = analyzer.analyze(
        text="참고번호 821011-1234560",
        language="ko",
        entities=["KR_RRN"],
    )
    rrn_results = [r for r in results if r.entity_type == "KR_RRN"]
    if rrn_results:
        # Without context, the score stays around 0.85 even when the checksum passes
        print(f"[RRN no context] score: {rrn_results[0].score}")
    else:
        print("[RRN no context] not detected")


def test_brn_valid():
    """Detects a business registration number whose checksum passes."""
    text = "사업자등록번호: 123-45-67891"
    results = analyzer.analyze(text=text, language="ko", entities=["KR_BRN"])
    brn_results = [r for r in results if r.entity_type == "KR_BRN"]
    print(f"[BRN] detections: {len(brn_results)}")
    for r in brn_results:
        print(f"  matched: '{text[r.start:r.end]}', score: {r.score}")


def test_brn_false_positive():
    """A phone number must not be flagged as a business registration number."""
    results = analyzer.analyze(
        text="연락처: 010-1234-5678",
        language="ko",
        entities=["KR_BRN"],
    )
    brn_results = [r for r in results if r.entity_type == "KR_BRN"]
    print(f"[BRN false positive] detections: {len(brn_results)} (0 is correct)")


def test_passport():
    """Passport number detection: score should be higher when context is present."""
    results = analyzer.analyze(
        text="여권번호는 M123A4567입니다",
        language="ko",
        entities=["KR_PASSPORT"],
    )
    pp_results = [r for r in results if r.entity_type == "KR_PASSPORT"]
    print(f"[Passport] detections: {len(pp_results)}")
    for r in pp_results:
        print(f"  score: {r.score}")


if __name__ == "__main__":
    test_rrn_with_valid_checksum()
    test_rrn_without_context()
    test_brn_valid()
    test_brn_false_positive()
    test_passport()
```
Note: The numbers used in the tests above are format examples only and are not real personal information. In practice, it's important to create and verify cases where the checksum both passes and fails. Using the presidio-research tool allows you to quantitatively measure precision/recall, which helps objectively evaluate recognizer quality.
Practical Implementation: Operations Integration
Example 4: Defining Recognizers with YAML — No Code Required
When managing PII detection rules in a DevOps pipeline, a GitOps approach where you simply swap configuration files without code deployment is much more flexible. Presidio's RecognizerRegistryProvider provides the ability to read recognizer settings from YAML files.
However, the YAML approach has a clear limitation. You cannot include validate_result — that is, checksum verification logic. YAML works well for cases like passport numbers where regex and context alone are sufficient, but resident registration numbers and business registration numbers that rely on checksums must be implemented as Python code-based recognizers.
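For a checksum-free entity like passport numbers, a YAML definition might look like the sketch below. The exact schema keys depend on your Presidio version, so treat the field names as assumptions and check the RecognizerRegistryProvider documentation before relying on them.

```yaml
# Hypothetical recognizer definition; verify field names against your Presidio version
recognizers:
  - name: "KrPassportRecognizer"
    supported_entity: "KR_PASSPORT"
    supported_language: "ko"
    patterns:
      - name: "kr_passport_new"
        regex: "\\b[A-Z][0-9]{3}[A-Z][0-9]{4}\\b"
        score: 0.4
    context:
      - "여권번호"
      - "여권"
```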
Example 5: LLM Input/Output Guard — LiteLLM + Presidio Concept
If you're running a generative AI service, a pipeline that masks PII in prompts going to the LLM and restores them in responses is becoming nearly essential. Since the Personal Information Protection Commission (PIPC) published its generative AI guidelines in August 2025, which specified standards for personal information processing in AI training data, this Guard pattern has been spreading rapidly.
```text
User Input: "주민번호 900101-1234567"
  → Presidio Analyzer (PII detection)
  → "주민번호 <KR_RRN>"  (masked prompt)
  → GPT-4 (response generation)
  → Presidio Anonymizer (restore / keep masking)
  → Final response delivered to user
```
Connecting Presidio as a guardrail to LiteLLM Proxy can automate this flow. The specific LiteLLM configuration and implementation code are beyond the scope of this article — please refer to the LiteLLM official PII Masking guide for details. This topic will be covered more deeply in a separate article.
Pros and Cons Analysis
Pros
| Item | Details |
| --- | --- |
| Triple Verification Structure | The hierarchical filtering of regex → checksum → context analysis systematically reduces false positives at each stage |
| Plugin Architecture | New PII types can be added by inheriting `PatternRecognizer` or with a single YAML file, enabling rapid response to business requirements |
| LLM Hybrid Support | A layered strategy is possible where LangExtract Recognizer supplements unstructured PII (free-form addresses, etc.) that regex can't cover |
| Open Source + Enterprise | MIT-licensed while natively integrating with Azure AI Language and Azure OpenAI, making the transition from PoC to production smooth |
Cons and Caveats
| Item | Details | Mitigation |
| --- | --- | --- |
| Post-2020 Resident Registration Numbers | Newly issued numbers don't follow the existing checksum, making checksum-only verification impossible | Override `analyze` for score branching: 0.85 on checksum pass; on fail, retain 0.3 and rely on context |
| Korean Context Limitations | `LemmaContextAwareEnhancer` depends on spaCy, and Korean lemma quality is lower than English | Recommended to implement a custom `ContextAwareEnhancer` integrating Korean morphological analyzers like Mecab or Kiwi |
| Pattern Conflicts | The 10-digit business registration number overlaps with phone numbers, postal codes, and account numbers | Strictly apply checksum pass + context word presence as AND conditions |
| Passport Number Low Specificity | No checksum means false positives are frequent with format matching alone | Keep score at 0.4 or below for standalone detection; always use with context words |
| Linear Performance Scaling | Analysis time increases proportionally as custom recognizers are added | Set recognizer priorities; optimize with batch processing for large document volumes |
Implementation Checklist
Here's a summary of the most frequently occurring mistakes when implementing Korean PII recognizers in practice. It's worth checking each one before deployment.
- Immediately returning False on checksum failure — This causes all resident registration numbers issued after October 2020 to be missed. A checksum is a tool for confirming "definitely correct," not for determining "definitely incorrect." Branching scores via an analyze override is the safer approach.
- Omitting the 9th digit special handling in the business registration number checksum — Missing (int(digits[8]) * 5) // 10 completely breaks the verification. Surprisingly many implementations found online are missing this part.
- Assuming YAML-based recognizers can include checksums — YAML configuration only supports regex and context. PII requiring checksum verification must be implemented in Python code.
- Overlooking the spaCy Korean model dependency — Setting supported_language="ko" makes Presidio use the Korean NLP pipeline. If the ko_core_news_lg model isn't installed, context analysis won't work, and the expected score boosting won't occur.
- Not checking the validate_result signature — The parameters of validate_result may vary depending on the Presidio version. It's important to check the source code of the Presidio version you're using and match the signature. The code in this article is written based on Presidio v2.2.
Conclusion
The value of Presidio custom Recognizers lies in the structural approach that goes beyond simple regex — increasing precision with checksums and layering confidence with context. Especially for Korean PII where checksum algorithms are well-defined, the difference in false positives between using regex alone and applying triple verification can be several hundredfold.
- Copy and run the Recognizer code and test code from this article — Directly observe score changes based on checksum pass/fail and presence/absence of context
- Measure precision/recall with presidio-research — Quantitatively evaluate detection quality against real data and tune thresholds
Next Article: Customizing Presidio Anonymizer — implementing custom Operators using Faker with Korean pseudonymous data, and a k-anonymity-based approach for assessing re-identification risk of de-identified data.