Catching Korean PII with Presidio Custom Recognizers — Implementing Triple Verification with Regex + Checksum + Context
Honestly, when I was first tasked with PII detection, I naively thought "a few regular expressions should do the trick" — and got burned badly. When I ran \d{6}-\d{7} as a resident registration number pattern, all sorts of number combinations from phone numbers to order numbers came flooding in. Running only regex on 80,000 customer data records produced over 2,600 false positives, but adding checksum verification brought that down to under 30. That was the moment I viscerally learned that regex alone doesn't cut it.
The amended Personal Information Protection Act (PIPA), set to take effect in September 2026, raised the penalty cap from 3% to 10% of revenue. PII detection is no longer something you can put off for later. Microsoft's open-source framework Presidio tackles this problem with a triple layer of regex + checksum verification + context analysis. By following this guide, you'll implement custom Recognizers that detect resident registration numbers, business registration numbers, and passport numbers, and verify them with test code.
⚠️ Before You Start: You'll need the following packages to run the code in this article.
The spaCy Korean model ko_core_news_lg is used for Presidio's context analysis. Pattern matching and checksums work without the model, but context-based score boosting will be unavailable.
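The article's package list did not survive extraction, so here is a typical setup sketch. The package names `presidio-analyzer` and `presidio-anonymizer` are the standard PyPI names, and `ko_core_news_lg` is downloaded via spaCy's usual model command — verify versions against your own environment.

```shell
# Assumed setup: standard PyPI packages for Presidio plus the spaCy Korean model
pip install presidio-analyzer presidio-anonymizer

# Korean NLP model used for context analysis (large pipeline)
python -m spacy download ko_core_news_lg
```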
Presidio Architecture — Separation of Analyzer and Anonymizer
Presidio is broadly divided into two engines. Presidio Analyzer detects PII in text, and Presidio Anonymizer masks, pseudonymizes, or encrypts the detected PII. The fact that these two are separated is quite important in practice — detection rules change frequently, but de-identification policies are relatively stable. This means each can be deployed and updated independently.
Within the Analyzer, the actual PII detection is handled by Recognizers. Presidio comes with built-in recognizers for default entities like PERSON, EMAIL, and CREDIT_CARD, and you can plug in custom recognizers like plugins. I've also tried AWS Macie and Google Cloud DLP, but Presidio was virtually the only open-source tool that allowed this level of fine-grained detection logic customization.
Three Types of Custom Recognizers
| Type | How It Works | Best Suited For |
| --- | --- | --- |
| PatternRecognizer | Regex + deny-list + `validate_result` checksum | PII with fixed formats (resident registration numbers, business registration numbers, etc.) |
| Remote Recognizer | External API calls (Azure AI Language, etc.) | Cloud-based detection service integration |
| LangExtract Recognizer | LLM-based (GPT-4, etc.) natural language detection | Unstructured PII (free-form addresses, etc.) that regex can't cover |
PatternRecognizer is the key to catching Korean PII. It's a two-stage structure where regex narrows down candidates and the validate_result method filters them with checksums — and it's more powerful than you might think.
Triple Verification — Regex, Checksum, Context
A single regex alone yields abysmal precision. It's a problem I've encountered countless times in practice. Presidio's PatternRecognizer stacks three layers to address this.
```text
[Stage 1] Regex Matching → Candidate Extraction (score: 0.3)
        ↓
[Stage 2] validate_result → Checksum Verification
        ├─ Pass → Score retained (0.3–0.4, initial value set in pattern)
        └─ Fail → Candidate removed (deleted from results)
        ↓
[Stage 3] Context Enhancement → Surrounding Text Analysis
        └─ Score boosted when context words found (final 0.75–1.0)
```
What is Context Enhancement? It's a mechanism that dynamically increases detection confidence by checking whether contextual words like "주민등록번호" (resident registration number) or "주민번호" appear near regex-matched text. The default implementation, LemmaContextAwareEnhancer, compares lemmas of surrounding tokens and adds up to 0.45 to the existing score upon a match. For example, a pattern with an initial score of 0.3 can be boosted to 0.75 upon a successful context match. This behavior can be fine-tuned with the context_similarity_factor and min_score_with_context_similarity parameters.
What is a lemma? It refers to the base form of a word. The lemma of the English word "running" is "run," and the lemma of the Korean "주민등록번호를" should be "주민등록번호." Since Presidio's context analysis relies on this lemma comparison, the quality of morphological analysis directly determines detection quality.
I initially wondered "do we really need context analysis too?" — but once you experience how often the business registration number pattern (\d{3}-\d{2}-\d{5}) overlaps with phone numbers and account numbers, your thinking completely changes.
Practical Implementation: Per-PII Type
Example 1: Resident Registration Number Recognizer — Checksum and Score Branching
Let's start with the trickiest part. Resident registration numbers have a strong checksum, but there's a tricky exception: numbers issued after October 2020 don't follow the existing checksum algorithm. Missing this creates a critical bug where all recent resident registration numbers become false negatives.
Here's the checksum algorithm: multiply the first 12 digits by the weights [2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5], sum the results, divide by 11 to get the remainder, and check if (11 - remainder) % 10 matches the last digit.
The problem is that PatternRecognizer's validate_result can only return a bool. Since it's a simple structure where True keeps the candidate and False removes it, you can't handle the branching logic of "high score on checksum pass, low score on fail" within validate_result. So we override the analyze method to adjust scores directly.
```python
from typing import List

from presidio_analyzer import Pattern, PatternRecognizer, RecognizerResult


class KrRrnRecognizer(PatternRecognizer):
    """Resident registration number recognizer: branches the score on checksum pass/fail."""

    PATTERNS = [
        Pattern(
            "KR_RRN",
            r"\b(\d{2})(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"
            r"\s?[-\u2013]?\s?([1-4]\d{6})\b",
            0.3,
        ),
    ]
    CONTEXT = [
        "주민등록번호", "주민번호", "주민",
        "resident registration", "RRN", "생년월일",
    ]

    def __init__(self):
        super().__init__(
            supported_entity="KR_RRN",
            patterns=self.PATTERNS,
            context=self.CONTEXT,
            supported_language="ko",
        )

    @staticmethod
    def _checksum_valid(digits: str) -> bool:
        weights = [2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5]
        total = sum(int(d) * w for d, w in zip(digits[:12], weights))
        check = (11 - (total % 11)) % 10
        return check == int(digits[12])

    def validate_result(self, pattern_text: str) -> bool:
        digits = "".join(filter(str.isdigit, pattern_text))
        if len(digits) != 13:
            return False
        # Format validation only; score branching happens in analyze
        return True

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts=None,
        regex_flags=None,
    ) -> List[RecognizerResult]:
        results = super().analyze(text, entities, nlp_artifacts, regex_flags)
        for result in results:
            matched = text[result.start : result.end]
            digits = "".join(filter(str.isdigit, matched))
            if len(digits) == 13 and self._checksum_valid(digits):
                result.score = max(result.score, 0.85)
            # On checksum failure the score stays at 0.3 and we rely on context analysis
        return results
```
| Code Section | Role | Key Point |
| --- | --- | --- |
| `Pattern(..., 0.3)` | Sets the initial match score low | Structure where score is boosted upon checksum/context pass |
| `(0[1-9]\|1[0-2])` | Restricts the month range (01–12) | Reduces false positives compared to a plain `\d{2}` |
| `[1-4]\d{6}` | Restricts the first digit of the last 7 digits to 1–4 | Gender code (1–4) verification |
| `_checksum_valid` | Weighted multiplication → mod 11 → compare last digit | Precise filtering for pre-2020 issuances |
| `analyze` override | 0.85 on checksum pass, retains 0.3 on fail | Keeps post-2020 issuances as candidates while distinguishing via score |
With this approach, when the downstream pipeline sets a threshold of 0.5, checksum-passing entries are reliably caught, while non-passing entries are decided based on context analysis results — creating a flexible structure.
Example 2: Business Registration Number Recognizer — The War Against Over-Matching
Business registration numbers have a reliable checksum, but the problem is that the 10-digit number pattern overlaps with too many other things. I was quite startled when I first implemented it and saw phone numbers matching left and right.
The core of the checksum algorithm is: multiply the first 9 digits by the authentication key [1, 3, 7, 1, 3, 7, 1, 3, 5], sum the results, add the quotient from the special handling of the 9th digit, `(int(digits[8]) * 5) // 10`, and check whether `(10 - sum % 10) % 10` matches the last digit. Surprisingly many implementations online miss this 9th-digit handling, so it's worth double-checking.
| Element | Role | Key Point |
| --- | --- | --- |
| Weights `[1, 3, 7, 1, 3, 7, 1, 3, 5]` | Official National Tax Service verification algorithm | Unique weights for business registration numbers |
| `(int(digits[8]) * 5) // 10` | Special handling for the 9th digit | Omitting this part completely breaks the verification |
Honestly, this recognizer is quite powerful with the checksum alone, but there are still cases where it overlaps with phone numbers (010-1234-5678) or postal codes, so I don't recommend using it standalone without context words.
Practical Tip: In production data, em dashes (—, U+2014) or fullwidth hyphens (－, U+FF0D) can sneak in. If you're dealing with OCR or web-crawled data, it's worth considering expanding the hyphen portion of the regex to something like [-\u2013\u2014\uff0d]?.
Example 3: Passport Number Recognizer — Survival Strategy for Patterns Without Checksums
Passport numbers have no checksum, so you must rely entirely on format matching + context. Honestly, this recognizer is the most unstable of the three, which is precisely why it requires a more careful approach.
Next-generation e-passports (2021 onward) use the M123A4567 format (1 letter + 3 digits + 1 letter + 4 digits), while old passports use 1–2 letters + 7–8 digits.
The reason the next-generation passport pattern ([A-Z]\d{3}[A-Z]\d{4}) has a higher score than the old one is that the alternating letter-number-letter-number pattern itself has higher specificity. The old pattern uses [A-Z]{1}\d{8} to narrow the match to exactly 1 letter + exactly 8 digits. In the initial draft, I had it broad as [A-Z]{1,2}\d{7,8}, but this ended up matching product codes and serial numbers, making it impractical.
Given the lack of a checksum, it's safest to keep passport number scores at 0.4 or below when detecting standalone without context words.
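Before wiring these into a Recognizer, the two formats can be sanity-checked with plain `re`. The patterns follow the formats described above; the constant names are mine.

```python
import re

# Next-generation e-passport (2021 onward): letter + 3 digits + letter + 4 digits
NEW_PASSPORT = re.compile(r"[A-Z]\d{3}[A-Z]\d{4}")
# Old passport, deliberately narrowed: exactly 1 letter + exactly 8 digits
OLD_PASSPORT = re.compile(r"[A-Z]\d{8}")

for sample in ["M123A4567", "M12345678", "AB1234567"]:
    kind = (
        "new" if NEW_PASSPORT.fullmatch(sample)
        else "old" if OLD_PASSPORT.fullmatch(sample)
        else "no match"
    )
    print(sample, "->", kind)
```

Note that `AB1234567` (2 letters + 7 digits) gets no match: that's the deliberate trade-off of narrowing the old-passport pattern to avoid matching product codes and serial numbers.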
Test Code — Verifying the Implementation Works Correctly
Once you've implemented the three recognizers, it's important to verify they actually work. The test code below lets you validate the key scenarios.
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(KrRrnRecognizer())
analyzer.registry.add_recognizer(KrBrnRecognizer())
analyzer.registry.add_recognizer(KrPassportRecognizer())


def test_rrn_with_valid_checksum():
    """An RRN with a valid checksum: score should be at least 0.85."""
    # 821011-1234560 is a synthetic number whose check digit satisfies the algorithm
    results = analyzer.analyze(
        text="주민등록번호: 821011-1234560",
        language="ko",
        entities=["KR_RRN"],
    )
    assert len(results) >= 1
    rrn_results = [r for r in results if r.entity_type == "KR_RRN"]
    # Context ("주민등록번호") plus a passing checksum yields a high score
    print(f"[RRN checksum pass] score: {rrn_results[0].score}")


def test_rrn_without_context():
    """Only the RRN pattern, no context words: score should stay lower."""
    results = analyzer.analyze(
        text="참고번호 821011-1234560",
        language="ko",
        entities=["KR_RRN"],
    )
    rrn_results = [r for r in results if r.entity_type == "KR_RRN"]
    if rrn_results:
        # Without context, the score stays around 0.85 even when the checksum passes
        print(f"[RRN no context] score: {rrn_results[0].score}")
    else:
        print("[RRN no context] not detected")


def test_brn_valid():
    """Detects a business registration number whose checksum passes."""
    text = "사업자등록번호: 123-45-67891"
    results = analyzer.analyze(text=text, language="ko", entities=["KR_BRN"])
    brn_results = [r for r in results if r.entity_type == "KR_BRN"]
    print(f"[BRN] detections: {len(brn_results)}")
    for r in brn_results:
        print(f"  matched: '{text[r.start:r.end]}', score: {r.score}")


def test_brn_false_positive():
    """A phone number must not be flagged as a business registration number."""
    results = analyzer.analyze(
        text="연락처: 010-1234-5678",
        language="ko",
        entities=["KR_BRN"],
    )
    brn_results = [r for r in results if r.entity_type == "KR_BRN"]
    print(f"[BRN false positive] detections: {len(brn_results)} (0 is correct)")


def test_passport():
    """Passport number detection: score should be higher when context is present."""
    results = analyzer.analyze(
        text="여권번호는 M123A4567입니다",
        language="ko",
        entities=["KR_PASSPORT"],
    )
    pp_results = [r for r in results if r.entity_type == "KR_PASSPORT"]
    print(f"[Passport] detections: {len(pp_results)}")
    for r in pp_results:
        print(f"  score: {r.score}")


if __name__ == "__main__":
    test_rrn_with_valid_checksum()
    test_rrn_without_context()
    test_brn_valid()
    test_brn_false_positive()
    test_passport()
```
Note: The numbers used in the tests above are format examples only and are not real personal information. In practice, it's important to create and verify cases where the checksum both passes and fails. Using the presidio-research tool allows you to quantitatively measure precision/recall, which helps objectively evaluate recognizer quality.
Practical Implementation: Operations Integration
Example 4: Defining Recognizers with YAML — No Code Required
When managing PII detection rules in a DevOps pipeline, a GitOps approach where you simply swap configuration files without code deployment is much more flexible. Presidio's RecognizerRegistryProvider provides the ability to read recognizer settings from YAML files.
However, the YAML approach has a clear limitation. You cannot include validate_result — that is, checksum verification logic. YAML works well for cases like passport numbers where regex and context alone are sufficient, but resident registration numbers and business registration numbers that rely on checksums must be implemented as Python code-based recognizers.
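For a checksum-free entity like passport numbers, a YAML definition might look like the sketch below. The exact schema keys depend on your Presidio version, so treat the field names as assumptions and check the RecognizerRegistryProvider documentation before relying on them.

```yaml
# Hypothetical recognizer definition; verify field names against your Presidio version
recognizers:
  - name: "KrPassportRecognizer"
    supported_entity: "KR_PASSPORT"
    supported_language: "ko"
    patterns:
      - name: "kr_passport_new"
        regex: "\\b[A-Z][0-9]{3}[A-Z][0-9]{4}\\b"
        score: 0.4
    context:
      - "여권번호"
      - "여권"
```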
Example 5: LLM Input/Output Guard — LiteLLM + Presidio Concept
If you're running a generative AI service, a pipeline that masks PII in prompts going to the LLM and restores them in responses is becoming nearly essential. Since the Personal Information Protection Commission (PIPC) published its generative AI guidelines in August 2025, which specified standards for personal information processing in AI training data, this Guard pattern has been spreading rapidly.
```text
User Input: "주민번호 900101-1234567"
  → Presidio Analyzer (PII detection)
  → "주민번호 <KR_RRN>"  (masked prompt)
  → GPT-4 (response generation)
  → Presidio Anonymizer (restore / keep masking)
  → Final response delivered to user
```
Connecting Presidio as a guardrail to LiteLLM Proxy can automate this flow. The specific LiteLLM configuration and implementation code are beyond the scope of this article — please refer to the LiteLLM official PII Masking guide for details. This topic will be covered more deeply in a separate article.
Pros and Cons Analysis
Pros
| Item | Details |
| --- | --- |
| Triple Verification Structure | The hierarchical filtering of regex → checksum → context analysis systematically reduces false positives at each stage |
| Plugin Architecture | New PII types can be added by inheriting `PatternRecognizer` or with a single YAML file, enabling rapid response to business requirements |
| LLM Hybrid Support | A layered strategy is possible where LangExtract Recognizer supplements unstructured PII (free-form addresses, etc.) that regex can't cover |
| Open Source + Enterprise | MIT-licensed while natively integrating with Azure AI Language and Azure OpenAI, making the transition from PoC to production smooth |
Cons and Caveats
| Item | Details | Mitigation |
| --- | --- | --- |
| Post-2020 Resident Registration Numbers | Newly issued numbers don't follow the existing checksum, making checksum-only verification impossible | Override `analyze` for score branching: 0.85 on checksum pass; on fail, retain 0.3 and rely on context |
| Korean Context Limitations | `LemmaContextAwareEnhancer` depends on spaCy, and Korean lemma quality is lower than English | Recommended to implement a custom `ContextAwareEnhancer` integrating Korean morphological analyzers like Mecab or Kiwi |
| Pattern Conflicts | The 10-digit business registration number overlaps with phone numbers, postal codes, and account numbers | Strictly apply checksum pass + context word presence as AND conditions |
| Passport Number Low Specificity | No checksum means false positives are frequent with format matching alone | Keep score at 0.4 or below for standalone detection; always use with context words |
| Linear Performance Scaling | Analysis time increases proportionally as custom recognizers are added | Set recognizer priorities; optimize with batch processing for large document volumes |
Implementation Checklist
Here's a summary of the most frequently occurring mistakes when implementing Korean PII recognizers in practice. It's worth checking each one before deployment.
- Immediately returning False on checksum failure — This causes all resident registration numbers issued after October 2020 to be missed. A checksum is a tool for confirming "definitely correct," not for determining "definitely incorrect." Branching scores via an analyze override is the safer approach.
- Omitting the 9th digit special handling in the business registration number checksum — Missing (int(digits[8]) * 5) // 10 completely breaks the verification. Surprisingly many implementations found online are missing this part.
- Assuming YAML-based recognizers can include checksums — YAML configuration only supports regex and context. PII requiring checksum verification must be implemented in Python code.
- Overlooking the spaCy Korean model dependency — Setting supported_language="ko" makes Presidio use the Korean NLP pipeline. If the ko_core_news_lg model isn't installed, context analysis won't work, and the expected score boosting won't occur.
- Not checking the validate_result signature — The parameters of validate_result may vary depending on the Presidio version. It's important to check the source code of the Presidio version you're using and match the signature. The code in this article is written based on Presidio v2.2.
Conclusion
The value of Presidio custom Recognizers lies in the structural approach that goes beyond simple regex — increasing precision with checksums and layering confidence with context. Especially for Korean PII where checksum algorithms are well-defined, the difference in false positives between using regex alone and applying triple verification can be several hundredfold.
- Copy and run the Recognizer code and test code from this article — Directly observe score changes based on checksum pass/fail and presence/absence of context
- Measure precision/recall with presidio-research — Quantitatively evaluate detection quality against real data and tune thresholds
Next Article: Customizing Presidio Anonymizer — implementing custom Operators using Faker with Korean pseudonymous data, and a k-anonymity-based approach for assessing re-identification risk of de-identified data.