Enhancing Korean PII Detection with Presidio + KLUE-BERT — A Practical Guide Beyond the Limits of spaCy NER
Table of Contents
- Core Concepts
- Prerequisites
- Practical Implementation
- spaCy Alone vs. Transformer Replacement — Before/After Comparison
- Pros and Cons Analysis
- Most Common Mistakes in Practice
- Conclusion
- References
To be honest, I initially thought spaCy's ko_core_news model alone would be sufficient for Korean PII detection. I tossed in a sentence like "홍길동 residing in Gangnam-gu, Seoul" and felt reassured when the results looked decent. But things changed once we started feeding real service data through it. When someone wrote "서울에서" (in Seoul), the postposition "에서" was captured as part of the entity, and it was all too common for person names to be missed whenever the context got even slightly complex. When we tallied up false negatives during internal QA, we found that 23 out of 200 test documents had missed names or addresses. With the 2025 Personal Information Protection Act (PIPA) amendment now in full effect — where a single false negative translates directly into legal risk — this level of accuracy was unacceptable.
This article covers how to connect a KLUE-BERT-based NER model to Microsoft Presidio's TransformersNlpEngine to improve Korean PII detection accuracy. The key is a hybrid architecture that retains spaCy's preprocessing capabilities while replacing only the NER engine with a Transformer. After reading this article, you'll be able to set up a Transformer-based Korean PII detection pipeline with a single YAML configuration, and combine it with regex Recognizers to build a pipeline that covers both structured and unstructured PII. From changing a single configuration file to the pitfalls you'll encounter in practice, I'll walk you through the trial and error I've experienced.
Before we dive in, let's clarify some terminology. Presidio is an open-source PII detection and anonymization framework created by Microsoft. NER (Named Entity Recognition) is a technique for identifying proper nouns like person names, locations, and organization names in text. Hugging Face is a platform for sharing and using Transformer models. The goal of this article is to combine these three to automatically detect Korean personal information.
Core Concepts
Where Does spaCy Korean NER Hit Its Limits?
spaCy is an NLP library that provides a single pipeline for the entire process of splitting text into tokens (word units), analyzing the part of speech of each token, and identifying proper nouns within sentences. The ko_core_news model is the Korean version of this pipeline, and it's not a bad model. It's fast thanks to its lightweight pipeline. The problem arises from the combination of the Korean language's characteristics and the CNN-based architecture.
The first issue is token boundary misalignment. In Korean, postpositions attach directly to words without spaces, as in "서울에서" (in Seoul), "홍길동이" (Honggildong [subject marker]), and "삼성전자의" (Samsung Electronics'). This type of language is called an agglutinative language, and it's what makes tokenization tricky. spaCy's Korean pipeline frequently includes these postpositions as part of the entity, so a location entity that should only capture "서울" (Seoul) ends up as "서울에서" (in Seoul). This is such a well-known problem that a related issue (#13705) has been filed on spaCy's GitHub.
The second issue is limited contextual understanding. Since spaCy's CNN-based pipeline only captures context within a narrow window, it frequently misses or misclassifies entities in long sentences or complex structures. In PII detection, a false negative directly means a personal information leak risk, making this a fairly critical weakness.
TransformersNlpEngine — A Hybrid of spaCy and Transformers
This is where Presidio's TransformersNlpEngine becomes meaningful. It's an architecture that keeps spaCy's preprocessing features like tokenization and lemmatization intact while replacing only the NER component with a Hugging Face Transformers model. The advantage is that it preserves what spaCy does well (preprocessing) while swapping out what it doesn't (NER), so there's no need to overhaul the existing Presidio pipeline.
Here's how the data flows:
```
            Input text
                │
                ▼
┌─────────────────────────────────┐
│    spaCy (ko_core_news_sm)      │ ← tokenization, lemmatization
│     preprocessing pipeline      │
└──────────────┬──────────────────┘
               │
               ▼
┌─────────────────────────────────┐
│      Transformer NER model      │ ← KLUE-BERT or another Hugging Face model
│ (subword tokens → entity preds) │
└──────────────┬──────────────────┘
               │
               ▼
┌─────────────────────────────────┐
│  Presidio Recognizer Registry   │ ← NER results + regex Recognizers
│ + context analysis + score adj. │   + custom Recognizer integration
└──────────────┬──────────────────┘
               │
               ▼
         PII detection results
```

Key Insight: spaCy handles preprocessing, Transformers handle NER. It's a division of labor where each does what it does best. Presidio's existing regex Recognizers and custom Recognizers continue to work as-is.
Thanks to this architecture, model replacement is straightforward — just change the model name in the YAML configuration or Python code, and you can immediately apply KLUE-BERT, KoELECTRA, or any other model. Not having to discard regex patterns or custom Recognizers you've already invested in is an enormous advantage in practice.
KLUE-BERT — A Transformer That Truly Understands Korean
KLUE-BERT is a Korean pre-trained BERT model released during the development of the Korean Language Understanding Evaluation (KLUE) benchmark. Here's a summary of its key specifications:
| Item | Details |
|---|---|
| Parameters | 111M |
| Training Data | 62GB of Korean text |
| Tokenizer | Morpheme-aware BPE (vocabulary size: 32k) |
| NER Entity Types | PS (Person), LC (Location), OG (Organization), DT (Date), TI (Time), QT (Quantity) |
| KLUE-NER Entity F1 | 83.97% (per the KLUE benchmark paper) |
The tokenizer deserves special attention here. KLUE-BERT uses a combination of the Mecab-ko morphological analyzer and BPE tokenization during pre-training, which helps the model better learn the agglutinative characteristics of Korean. However, one thing needs to be clarified — Mecab-ko is not automatically invoked during inference through Presidio's TransformersNlpEngine. Hugging Face's klue/bert-base tokenizer performs subword segmentation internally based on its learned BPE vocabulary, while the morphological analysis was used for training data preprocessing during pre-training. Still, because the model was trained with a vocabulary that reflects morpheme boundaries, it achieves more natural segmentation than spaCy when processing phrases like "서울에서."
Meanwhile, the Entity F1 score of 83.97% is the result reported for the KLUE-BERT base model in the KLUE benchmark paper. If your production environment requires an F1 above 90%, it's more realistic to consider KoELECTRA-base-v3, which achieved an Entity F1 of 92.6% on the KLUE-NER benchmark (per the KLUE leaderboard), or domain-specific fine-tuned models.
Prerequisites
Before following along with the code, you need to set up your environment. Install the following packages in a Python 3.8+ environment.
```bash
# Presidio core packages
pip install presidio-analyzer presidio-anonymizer

# spaCy Korean model (for preprocessing)
python -m spacy download ko_core_news_sm

# Transformer runtime
pip install torch transformers
```

It works without a GPU, but if you're processing large volumes of documents in production, a CUDA-enabled GPU environment is recommended. On CPU, processing a single document can take several seconds.
Practical Implementation
Example 1: Setting Up in 5 Minutes with YAML Configuration
The fastest way to apply this in practice is to use Presidio's YAML configuration file. I remember the frustration of fiddling around with Python code at first, only to discover the YAML configuration existed.
```yaml
nlp_engine_name: transformers
models:
  - lang_code: ko
    model_name:
      spacy: ko_core_news_sm
      transformers: klue/bert-base
ner_model_configuration:
  labels_to_ignore:
    - O
  aggregation_strategy: max
  stride: 16
  alignment_mode: expand
  model_to_presidio_entity_mapping:
    PER: PERSON
    PS: PERSON
    LC: LOCATION
    OG: ORGANIZATION
    DT: DATE_TIME
    TI: DATE_TIME
    QT: QUANTITY
  low_confidence_score_multiplier: 0.4
  low_score_entity_names:
    - QUANTITY
```

⚠️ Important: `klue/bert-base` is a general-purpose pre-trained model, not an NER fine-tuned model. This YAML is meant to illustrate the pipeline structure. In production, you need to put an NER fine-tuned model (search for `klue ner` on Hugging Face) in the `transformers` field to get meaningful NER results.
Let's go through what each setting does.
| Setting | Role | Practical Tip |
|---|---|---|
| `spacy: ko_core_news_sm` | Handles tokenization and lemmatization preprocessing | `sm` is sufficient. Since the Transformer handles NER, the `lg` model is unnecessary overhead |
| `transformers: klue/bert-base` | Handles actual NER inference | Must be replaced with an NER fine-tuned model |
| `aggregation_strategy: max` | Aggregates subword token predictions into entity-level results | `max` is the most stable for Korean |
| `stride: 16` | Overlap size of the sliding window for long text processing | Too small and entities get cut off; too large and it gets slow |
| `alignment_mode: expand` | Alignment method between spaCy tokens and Transformer tokens | `expand` captures entities broadly, reducing false negatives |
| `model_to_presidio_entity_mapping` | Maps KLUE labels → Presidio entity types | Mapping both `PS` and `PER` is safer |
More on `aggregation_strategy`: Transformers process text by splitting it into subword units. For example, "홍길동" might be split into three tokens: "홍", "##길", "##동". When these three tokens each predict different labels, they need to be merged into a single entity. `max` selects the label with the highest probability across the subwords, while `first` follows the label of the first token. For Korean, `max` is generally more stable.
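To make the difference concrete, here is a minimal, self-contained sketch of the two strategies. It is an illustration of the idea only — the real logic lives inside Hugging Face's token-classification pipeline — and the labels and scores below are toy values, not actual model output:

```python
# Illustrative sketch of subword aggregation; toy scores, not real model output.
def aggregate(subword_preds, strategy="max"):
    """Merge per-subword (label, score) predictions into one entity label.

    subword_preds: list of (label, score) tuples for one word's subwords.
    """
    if strategy == "first":
        # Follow the label predicted for the first subword.
        return subword_preds[0][0]
    if strategy == "max":
        # Follow the label of the highest-scoring subword.
        return max(subword_preds, key=lambda p: p[1])[0]
    raise ValueError(f"unknown strategy: {strategy}")

# "홍길동" split into "홍", "##길", "##동" with hypothetical per-subword scores:
preds = [("B-LC", 0.55), ("B-PS", 0.97), ("B-PS", 0.93)]
print(aggregate(preds, "first"))  # → B-LC (misled by a weak first subword)
print(aggregate(preds, "max"))    # → B-PS (highest-confidence subword wins)
```

This is why `max` tends to be more robust for Korean: a single noisy first subword doesn't decide the whole entity's label.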
More on `alignment_mode`: When the entity span predicted by the Transformer doesn't perfectly align with the token boundaries segmented by spaCy, `expand` includes partially overlapping tokens in the entity, while `strict` only accepts tokens that are fully contained. In PII detection, it's better to cast a wider net than to miss something, so `expand` is the appropriate choice.
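The same idea can be sketched with character offsets. This is a toy model of the alignment decision, not the library's internals — the token offsets and predicted span below are hypothetical:

```python
# Illustrative sketch of expand vs. strict span alignment (toy offsets).
def align(entity_span, token_spans, mode="expand"):
    """Snap a predicted (start, end) character span onto token boundaries."""
    s, e = entity_span
    if mode == "strict":
        # Keep only tokens fully inside the predicted span.
        kept = [(ts, te) for ts, te in token_spans if ts >= s and te <= e]
    else:  # "expand"
        # Keep any token that overlaps the predicted span at all.
        kept = [(ts, te) for ts, te in token_spans if ts < e and te > s]
    if not kept:
        return None
    return (kept[0][0], kept[-1][1])

# Two tokens at character offsets (0, 3) and (4, 8); the model predicted (0, 6),
# cutting through the second token.
tokens = [(0, 3), (4, 8)]
print(align((0, 6), tokens, "strict"))  # → (0, 3): drops the partially covered token
print(align((0, 6), tokens, "expand"))  # → (0, 8): includes it, keeping the full span
```

For PII, the `expand` behavior is the safer failure mode: the detected span can only grow, never silently shrink past a token boundary.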
Example 2: Building the Pipeline with Python Code
When you need finer control than YAML offers, or when you need to register custom Recognizers alongside, it's better to configure things directly in Python code.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import TransformersNlpEngine

# 1. Initialize the Transformer NLP engine
transformer_engine = TransformersNlpEngine(
    models=[{
        "lang_code": "ko",
        "model_name": {
            "spacy": "ko_core_news_sm",
            "transformers": "klue/bert-base"  # ⚠️ see the caution below
        }
    }]
)

# 2. Configure the Analyzer engine
analyzer = AnalyzerEngine(
    nlp_engine=transformer_engine,
    supported_languages=["ko"]
)

# 3. Run PII detection
text = "홍길동의 전화번호는 010-1234-5678이고 서울시 강남구에 거주합니다."
results = analyzer.analyze(text=text, language="ko")

for result in results:
    print(f"  {result.entity_type}: '{text[result.start:result.end]}' "
          f"(confidence: {result.score:.2f})")
```

⚠️ Caution: This code is for illustrating the pipeline structure. Since `klue/bert-base` is a pre-trained model and not an NER fine-tuned model, running it as-is will likely produce no meaningful NER results. In production, search for `klue ner` on the Hugging Face Model Hub to find community fine-tuned models (e.g., a model fine-tuned on the KLUE-NER dataset from `klue/bert-base`) and insert that model ID in the `transformers` field.
Expected output when an NER fine-tuned model is applied:

```
PERSON: '홍길동' (confidence: 0.92)
PHONE_NUMBER: '010-1234-5678' (confidence: 0.95)
LOCATION: '서울시 강남구' (confidence: 0.88)
```

Example 3: A Hybrid Strategy Combining Regex Recognizers
This is a situation you'll frequently encounter in practice. Korean PII broadly falls into two categories: structured PII with clear patterns, such as resident registration numbers, phone numbers, and bank account numbers, and unstructured PII requiring contextual understanding, such as person names, addresses, and organization names. No matter how good Transformer NER is, regex is more accurate and faster for resident registration number patterns.
```python
from presidio_analyzer import PatternRecognizer, Pattern

# Resident registration number (RRN) pattern Recognizer
rrn_pattern = Pattern(
    name="korean_rrn",
    regex=r"(?<!\d)(\d{6}[-–]\d{7})(?!\d)",
    score=0.9
)

rrn_recognizer = PatternRecognizer(
    supported_entity="KR_RRN",
    supported_language="ko",
    patterns=[rrn_pattern],
    context=["주민등록번호", "주민번호", "생년월일"]
)

# Register the custom Recognizer with the Analyzer
analyzer.registry.add_recognizer(rrn_recognizer)
```

Note the use of `(?<!\d)` and `(?!\d)` (negative lookbehind and lookahead) instead of `\b` in the regex. `\b` is an anchor that recognizes word boundaries, but it often doesn't work as expected in Korean text. In cases like "주민번호는123456-1234567입니다" where there are no spaces, `\b` may fail to match, so it's safer to explicitly specify non-digit character boundaries.
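You can verify this behavior with nothing but the standard library. Python's `re` treats Hangul syllables as word characters, so there is no `\b` boundary between "는" and "1":

```python
import re

text = "주민번호는123456-1234567입니다"  # no space before the RRN

# \b fails: Hangul is a word character, so 는→1 is not a word boundary.
print(re.search(r"\b\d{6}-\d{7}\b", text))  # → None

# Lookarounds succeed: they only require the neighbors to be non-digits.
m = re.search(r"(?<!\d)(\d{6}-\d{7})(?!\d)", text)
print(m.group(1))  # → 123456-1234567
```

The lookaround version also correctly rejects longer digit runs (e.g., an RRN-shaped substring inside a longer number), which `\b` alone cannot guarantee.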
The context parameter is also important. When Presidio finds words specified in context (here: "주민등록번호", "주민번호", "생년월일") in the surrounding text of a pattern match, it boosts the confidence score of that result. In other words, for the text "주민번호는 123456-1234567," the context adjustment is added on top of the pattern match score (0.9), raising the final confidence. Simple regex alone makes it difficult to distinguish resident registration numbers from other number strings with similar formats, and this context analysis effectively reduces false positives.
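The mechanics of context boosting can be sketched in a few lines. This is an illustration of the concept only — Presidio implements it internally with its own search window and boost values — so the `window` and `boost` numbers here are hypothetical:

```python
# Toy illustration of context-based score boosting.
# The window and boost values are hypothetical, not Presidio's defaults.
def boost_score(text, match_start, base_score,
                context_words, window=30, boost=0.35):
    """Raise a pattern match's confidence if a context word appears shortly before it."""
    before = text[max(0, match_start - window):match_start]
    if any(word in before for word in context_words):
        return min(1.0, base_score + boost)
    return base_score

context = ["주민등록번호", "주민번호", "생년월일"]

text = "주민번호는 123456-1234567"       # context word present → boosted
print(boost_score(text, text.index("123456"), 0.9, context))   # → 1.0

text2 = "송장번호는 123456-1234567"      # no context word → base score kept
print(boost_score(text2, text2.index("123456"), 0.9, context))  # → 0.9
```

The point of the design: the regex alone says "this looks like an RRN," while the surrounding words say "this is probably actually an RRN" — combining both reduces false positives on look-alike number strings.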
This is exactly where Presidio's architecture shines. Transformer NER catches names, addresses, and organization names; regex Recognizers catch resident registration numbers, phone numbers, and bank account numbers; and context analysis adjusts confidence scores — these three layers naturally combine.
Key Insight: The key is not entrusting everything to a single model. A study published in Scientific Reports in 2025 also showed that a rule-based + machine learning hybrid approach outperformed single-model approaches for PII detection in financial documents. Presidio essentially supports this hybrid strategy at the framework level.
spaCy Alone vs. Transformer Replacement — Before/After Comparison
Just saying "accuracy improves" isn't very convincing, so let's compare with actual sentences. Below is a qualitative comparison of detection results between spaCy ko_core_news_sm alone and a Presidio pipeline with an NER fine-tuned Transformer model applied.
| Input Text | spaCy Alone Result | After Transformer Replacement |
|---|---|---|
| "서울에서 근무하는 홍길동입니다" | LOC: "서울에서" (includes postposition) / PER: missed | LOC: "서울" ✓ / PER: "홍길동" ✓ |
| "삼성전자의 이재용 부회장" | ORG: "삼성전자의" (includes postposition) / PER: missed | ORG: "삼성전자" ✓ / PER: "이재용" ✓ |
| "김철수 교수가 연구를 진행" | PER: missed | PER: "김철수" ✓ |
| "2025년 3월 15일 부산에서 개최" | DT: missed / LOC: "부산에서" | DT: "2025년 3월 15일" ✓ / LOC: "부산" ✓ |
Two patterns are clearly visible. First, spaCy's postposition inclusion problem ("서울에서", "삼성전자의") is cleanly resolved with the Transformer. Second, the Transformer catches person names and date entities that spaCy was missing. The reduction in person name false negatives is a decisive difference in PII detection.
Of course, this is a qualitative comparison, and in production, it's recommended to create your own test set and quantitatively measure Precision/Recall/F1. For reference, a Korean EMR de-identification study (npj Health Systems, 2025) reported that KLUE-BERT-based NER achieved an F1 of 94.30% on discharge summaries.
Pros and Cons Analysis
Pros
| Item | Details |
|---|---|
| Korean contextual understanding | KLUE-BERT, trained on 62GB of Korean corpora, significantly improves NER accuracy in complex sentence structures compared to spaCy's CNN |
| Improved postposition handling | Tokenization reflecting morpheme boundaries enables natural segmentation like "서울에서" → "서울" + "에서" |
| Reuse of existing infrastructure | Presidio's regex Recognizers, context analysis, and anonymization pipeline can be used as-is |
| Easy model replacement | Switch to other models like KoELECTRA or KLUE-RoBERTa from the Hugging Face Hub by simply changing the configuration |
| Proven in practice | KLUE-BERT-based NER achieved F1 94.30% on discharge summaries in a Korean EMR de-identification study |
Cons and Considerations
| Item | Details | Mitigation |
|---|---|---|
| Resource consumption | A 111M parameter model significantly increases GPU memory usage and inference time compared to spaCy | Batch processing, GPU instances, or ONNX conversion for inference optimization |
| Manual label mapping | Manual mapping required between KLUE tags (PS, LC, OG) and Presidio standard types (PERSON, LOCATION) | Define once using model_to_presidio_entity_mapping in the YAML configuration |
| Missing Korean context words | Presidio's default context words ("Mr.", "Mrs.") are English-centric | Define Korean context words like "씨", "님", "대표", "교수" in custom Recognizers |
| Base model limitations | KLUE-BERT base's Entity F1 of 83.97% (per the KLUE benchmark paper) may be insufficient for production | Consider domain-specific fine-tuning or KoELECTRA (Entity F1 92.6%, per the KLUE leaderboard) |
Most Common Mistakes in Practice
These are cases I've personally experienced or frequently seen around me. Bookmark this section — it'll come in handy later.
- Confusing a pre-trained model with an NER fine-tuned model — `klue/bert-base` is a general-purpose pre-trained model. If you use it directly for NER tasks, it will barely detect any entities. Our team initially didn't know this and plugged in `klue/bert-base` as-is, then spent a long time debugging why we kept getting empty results even though "홍길동" was clearly in the text. You need to either find a model fine-tuned on KLUE-NER on Hugging Face or fine-tune one yourself.
- Relying solely on Transformers for Korean-specific PII patterns — Korean-specific identifiers like resident registration numbers (XXXXXX-XXXXXXX) and business registration numbers (XXX-XX-XXXXX) require separate regex Recognizers. These patterns are often insufficiently represented in NER model training data, and since the patterns are well-structured, regex is far more accurate.
- Confusing Entity F1 with Character F1 during evaluation — In Korean NER, the difference between these two metrics can be significant depending on whether postpositions are included. For example, when "서울에서" is detected as "서울," it's treated as incorrect in Entity F1 but may receive partial credit in Character F1. In the context of PII detection, the most important question is "did we miss any text spans containing personal information?" so it's appropriate to set recall-oriented evaluation criteria.
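The gap between the two metrics is easy to see on the "서울에서" example above. Below is a minimal sketch of both computations — exact-span matching for entity-level, per-character overlap for character-level — using toy character offsets:

```python
# Toy comparison of entity-level vs. character-level F1 on one prediction.
def entity_f1(gold_spans, pred_spans):
    """Exact span match: a prediction counts only if (start, end) match exactly."""
    tp = len(set(gold_spans) & set(pred_spans))
    prec = tp / len(pred_spans) if pred_spans else 0.0
    rec = tp / len(gold_spans) if gold_spans else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def char_f1(gold_spans, pred_spans):
    """Character overlap: partial credit for every overlapping character."""
    gold = {i for s, e in gold_spans for i in range(s, e)}
    pred = {i for s, e in pred_spans for i in range(s, e)}
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Gold span "서울에서" (chars 0–4) vs. a model that predicted only "서울" (chars 0–2):
gold, pred = [(0, 4)], [(0, 2)]
print(entity_f1(gold, pred))           # → 0.0  (no exact match: counted fully wrong)
print(round(char_f1(gold, pred), 2))   # → 0.67 (2 of 4 gold characters recovered)
```

When comparing published scores, always check which of the two a paper or leaderboard reports — for recall-oriented PII evaluation, character-level recall is often the more honest number.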
Conclusion
spaCy's Korean NER limitations are a structural issue, and Presidio's TransformersNlpEngine provides a realistic path to replacing it with a Transformer model while maintaining the existing pipeline. With the 2025 PIPA amendment making automated PII detection more critical than ever, this combination is one of the most promising choices for Korean personal information protection pipelines.
Three steps to get started right now:
- Set up the environment: Install the base dependencies in your Python environment with `pip install presidio-analyzer presidio-anonymizer && python -m spacy download ko_core_news_sm && pip install torch transformers`.
- Quick validation: Replace the `transformers` field in the YAML configuration or Python code examples above with an NER fine-tuned model, then test with the sentence "홍길동의 전화번호는 010-1234-5678이고 서울시 강남구에 거주합니다."
- Production expansion: Add `model_to_presidio_entity_mapping` and register custom PatternRecognizers for resident registration numbers and phone numbers to complete the hybrid pipeline. Replacing just the model with KoELECTRA-base-v3 (`monologg/koelectra-base-v3-discriminator`) is also a good starting point for performance comparison.
Next article: A deep dive into Presidio custom Recognizers — covering how to perfectly detect Korean PII patterns including resident registration numbers, business registration numbers, and passport numbers using regex and context analysis.
References
Cited in this article
- KLUE: Korean Language Understanding Evaluation Paper | arXiv
- spaCy Korean NER Postposition Handling Issue #13705 | GitHub
- Korean EMR De-identification Study | npj Health Systems (2025)
- Hybrid PII Detection Study | Scientific Reports (2025)
- Microsoft Presidio - Transformers NLP Engine Official Documentation
- Microsoft Presidio - NER Model Configuration Guide
- KLUE-BERT base | Hugging Face
- KLUE Benchmark | GitHub
Further Reading
- Microsoft Presidio - Multilingual Support Documentation
- Microsoft Presidio | GitHub
- spaCy Korean Model Documentation
- NER Model PII Comparative Analysis | Protecto
- Comprehensive Korean NER Dataset Overview | Letr.ai
- Naver x Changwon University NER Data | Korpora
- KoBERT | GitHub
- KPF-BERT-NER | Hugging Face
- Presidio TransformersNlpEngine Source Code | GitHub