Enhancing Korean PII Detection with Presidio + KLUE-BERT — A Practical Guide Beyond the Limits of spaCy NER
Table of Contents
- Core Concepts
- Prerequisites
- Practical Implementation
- spaCy Alone vs. Transformer Replacement — Before/After Comparison
- Pros and Cons Analysis
- Most Common Mistakes in Practice
- Conclusion
- References
To be honest, I initially thought spaCy's ko_core_news model alone would be sufficient for Korean PII detection. I tossed in a sentence like "홍길동 residing in Gangnam-gu, Seoul" and felt reassured when the results looked decent. But things changed once we started feeding real service data through it. When someone wrote "서울에서" (in Seoul), the postposition "에서" was captured as part of the entity, and it was all too common for person names to be missed whenever the context got even slightly complex. When we tallied up false negatives during internal QA, we found that 23 out of 200 test documents had missed names or addresses. With the 2025 Personal Information Protection Act (PIPA) amendment now in full effect — where a single false negative translates directly into legal risk — this level of accuracy was unacceptable.
This article covers how to connect a KLUE-BERT-based NER model to Microsoft Presidio's TransformersNlpEngine to improve Korean PII detection accuracy. The key is a hybrid architecture that retains spaCy's preprocessing capabilities while replacing only the NER engine with a Transformer. After reading this article, you'll be able to set up a Transformer-based Korean PII detection pipeline with a single YAML configuration, and combine it with regex Recognizers to build a pipeline that covers both structured and unstructured PII. From changing a single configuration file to the pitfalls you'll encounter in practice, I'll walk you through the trial and error I've experienced.
Before we dive in, let's clarify some terminology. Presidio is an open-source PII detection and anonymization framework created by Microsoft. NER (Named Entity Recognition) is a technique for identifying proper nouns like person names, locations, and organization names in text. Hugging Face is a platform for sharing and using Transformer models. The goal of this article is to combine these three to automatically detect Korean personal information.
Core Concepts
Where Does spaCy Korean NER Hit Its Limits?
spaCy is an NLP library that provides a single pipeline for the entire process of splitting text into tokens (word units), analyzing the part of speech of each token, and identifying proper nouns within sentences. The ko_core_news model is the Korean version of this pipeline, and it's not a bad model. It's fast thanks to its lightweight pipeline. The problem arises from the combination of the Korean language's characteristics and the CNN-based architecture.
The first issue is token boundary misalignment. In Korean, postpositions attach directly to words without spaces, as in "서울에서" (in Seoul), "홍길동이" (Honggildong [subject marker]), and "삼성전자의" (Samsung Electronics'). This type of language is called an agglutinative language, and it's what makes tokenization tricky. spaCy's Korean pipeline frequently includes these postpositions as part of the entity, so a location entity that should only capture "서울" (Seoul) ends up as "서울에서" (in Seoul). This is such a well-known problem that a related issue (#13705) has been filed on spaCy's GitHub.
The second issue is limited contextual understanding. Since spaCy's CNN-based pipeline only captures context within a narrow window, it frequently misses or misclassifies entities in long sentences or complex structures. In PII detection, a false negative directly means a personal information leak risk, making this a fairly critical weakness.
TransformersNlpEngine — A Hybrid of spaCy and Transformers
This is where Presidio's TransformersNlpEngine becomes meaningful. It's an architecture that keeps spaCy's preprocessing features like tokenization and lemmatization intact while replacing only the NER component with a Hugging Face Transformers model. The advantage is that it preserves what spaCy does well (preprocessing) while swapping out what it doesn't (NER), so there's no need to overhaul the existing Presidio pipeline.
Here's how the data flows:
```
            Input text
                │
                ▼
┌─────────────────────────────────┐
│    spaCy (ko_core_news_sm)      │ ← tokenization, lemmatization
│     preprocessing pipeline      │
└──────────────┬──────────────────┘
               │
               ▼
┌─────────────────────────────────┐
│      Transformer NER model      │ ← KLUE-BERT or another Hugging Face model
│ (subword tokens → entity preds) │
└──────────────┬──────────────────┘
               │
               ▼
┌─────────────────────────────────┐
│  Presidio Recognizer Registry   │ ← NER results + regex Recognizers
│ + context analysis + score adj. │   + custom Recognizer integration
└──────────────┬──────────────────┘
               │
               ▼
         PII detection results
```

Key Insight: spaCy handles preprocessing, Transformers handle NER. It's a division of labor where each does what it does best. Presidio's existing regex Recognizers and custom Recognizers continue to work as-is.
Thanks to this architecture, model replacement is straightforward — just change the model name in the YAML configuration or Python code, and you can immediately apply KLUE-BERT, KoELECTRA, or any other model. Not having to discard regex patterns or custom Recognizers you've already invested in is an enormous advantage in practice.
KLUE-BERT — A Transformer That Truly Understands Korean
KLUE-BERT is a Korean pre-trained BERT model released during the development of the Korean Language Understanding Evaluation (KLUE) benchmark. Here's a summary of its key specifications:
| Item | Details |
|---|---|
| Parameters | 111M |
| Training Data | 62GB of Korean text |
| Tokenizer | Morpheme-aware BPE (vocabulary size: 32k) |
| NER Entity Types | PS (Person), LC (Location), OG (Organization), DT (Date), TI (Time), QT (Quantity) |
| KLUE-NER Entity F1 | 83.97% (per the KLUE benchmark paper) |
The tokenizer deserves special attention here. KLUE-BERT uses a combination of the Mecab-ko morphological analyzer and BPE tokenization during pre-training, which helps the model better learn the agglutinative characteristics of Korean. However, one thing needs to be clarified — Mecab-ko is not automatically invoked during inference through Presidio's TransformersNlpEngine. Hugging Face's klue/bert-base tokenizer performs subword segmentation internally based on its learned BPE vocabulary, while the morphological analysis was used for training data preprocessing during pre-training. Still, because the model was trained with a vocabulary that reflects morpheme boundaries, it achieves more natural segmentation than spaCy when processing phrases like "서울에서."
Meanwhile, the Entity F1 score of 83.97% is the result reported for the KLUE-BERT base model in the KLUE benchmark paper. If your production environment requires an F1 above 90%, it's more realistic to consider KoELECTRA-base-v3, which achieved an Entity F1 of 92.6% on the KLUE-NER benchmark (per the KLUE leaderboard), or domain-specific fine-tuned models.
Prerequisites
Before following along with the code, you need to set up your environment. Install the following packages in a Python 3.8+ environment.
```bash
# Presidio core packages
pip install presidio-analyzer presidio-anonymizer

# spaCy Korean model (for preprocessing)
python -m spacy download ko_core_news_sm

# Transformer runtime
pip install torch transformers
```

It works without a GPU, but if you're processing large volumes of documents in production, a CUDA-enabled GPU environment is recommended. On CPU, processing a single document can take several seconds.
Practical Implementation
Example 1: Setting Up in 5 Minutes with YAML Configuration
The fastest way to apply this in practice is to use Presidio's YAML configuration file. I remember the frustration of fiddling around with Python code at first, only to discover the YAML configuration existed.
```yaml
nlp_engine_name: transformers
models:
  - lang_code: ko
    model_name:
      spacy: ko_core_news_sm
      transformers: klue/bert-base
ner_model_configuration:
  labels_to_ignore:
    - O
  aggregation_strategy: max
  stride: 16
  alignment_mode: expand
  model_to_presidio_entity_mapping:
    PER: PERSON
    PS: PERSON
    LC: LOCATION
    OG: ORGANIZATION
    DT: DATE_TIME
    TI: DATE_TIME
    QT: QUANTITY
  low_confidence_score_multiplier: 0.4
  low_score_entity_names:
    - QUANTITY
```

⚠️ Important: `klue/bert-base` is a general-purpose pre-trained model, not an NER fine-tuned model. This YAML is meant to illustrate the pipeline structure. In production, you need to put an NER fine-tuned model (search for `klue ner` on Hugging Face) in the `transformers` field to get meaningful NER results.
Let's go through what each setting does.
| Setting | Role | Practical Tip |
|---|---|---|
| `spacy: ko_core_news_sm` | Handles tokenization and lemmatization preprocessing | `sm` is sufficient. Since the Transformer handles NER, the `lg` model is unnecessary overhead |
| `transformers: klue/bert-base` | Handles actual NER inference | Must be replaced with an NER fine-tuned model |
| `aggregation_strategy: max` | Aggregates subword token predictions into entity-level results | `max` is the most stable for Korean |
| `stride: 16` | Overlap size of the sliding window for long text processing | Too small and entities get cut off; too large and it gets slow |
| `alignment_mode: expand` | Alignment method between spaCy tokens and Transformer tokens | `expand` captures entities broadly, reducing false negatives |
| `model_to_presidio_entity_mapping` | Maps KLUE labels → Presidio entity types | Mapping both `PS` and `PER` is safer |
More on `aggregation_strategy`: Transformers process text by splitting it into subword units. For example, "홍길동" might be split into three tokens: "홍", "##길", "##동". When these three tokens each predict different labels, they need to be merged into a single entity. `max` selects the label with the highest probability across the subwords, while `first` follows the label of the first token. For Korean, `max` is generally more stable.
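To make the difference concrete, here is a minimal, self-contained sketch of the two strategies. It is an illustration of the idea only — the real logic lives inside Hugging Face's token-classification pipeline — and the labels and scores below are toy values, not actual model output:

```python
# Illustrative sketch of subword aggregation; toy scores, not real model output.
def aggregate(subword_preds, strategy="max"):
    """Merge per-subword (label, score) predictions into one entity label.

    subword_preds: list of (label, score) tuples for one word's subwords.
    """
    if strategy == "first":
        # Follow the label predicted for the first subword.
        return subword_preds[0][0]
    if strategy == "max":
        # Follow the label of the highest-scoring subword.
        return max(subword_preds, key=lambda p: p[1])[0]
    raise ValueError(f"unknown strategy: {strategy}")

# "홍길동" split into "홍", "##길", "##동" with hypothetical per-subword scores:
preds = [("B-LC", 0.55), ("B-PS", 0.97), ("B-PS", 0.93)]
print(aggregate(preds, "first"))  # → B-LC (misled by a weak first subword)
print(aggregate(preds, "max"))    # → B-PS (highest-confidence subword wins)
```

This is why `max` tends to be more robust for Korean: a single noisy first subword doesn't decide the whole entity's label.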
More on `alignment_mode`: When the entity span predicted by the Transformer doesn't perfectly align with the token boundaries segmented by spaCy, `expand` includes partially overlapping tokens in the entity, while `strict` only accepts tokens that are fully contained. In PII detection, it's better to cast a wider net than to miss something, so `expand` is the appropriate choice.
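The same idea can be sketched with character offsets. This is a toy model of the alignment decision, not the library's internals — the token offsets and predicted span below are hypothetical:

```python
# Illustrative sketch of expand vs. strict span alignment (toy offsets).
def align(entity_span, token_spans, mode="expand"):
    """Snap a predicted (start, end) character span onto token boundaries."""
    s, e = entity_span
    if mode == "strict":
        # Keep only tokens fully inside the predicted span.
        kept = [(ts, te) for ts, te in token_spans if ts >= s and te <= e]
    else:  # "expand"
        # Keep any token that overlaps the predicted span at all.
        kept = [(ts, te) for ts, te in token_spans if ts < e and te > s]
    if not kept:
        return None
    return (kept[0][0], kept[-1][1])

# Two tokens at character offsets (0, 3) and (4, 8); the model predicted (0, 6),
# cutting through the second token.
tokens = [(0, 3), (4, 8)]
print(align((0, 6), tokens, "strict"))  # → (0, 3): drops the partially covered token
print(align((0, 6), tokens, "expand"))  # → (0, 8): includes it, keeping the full span
```

For PII, the `expand` behavior is the safer failure mode: the detected span can only grow, never silently shrink past a token boundary.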
Example 2: Building the Pipeline with Python Code
When you need finer control than YAML offers, or when you need to register custom Recognizers alongside, it's better to configure things directly in Python code.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import TransformersNlpEngine

# 1. Initialize the Transformer NLP engine
transformer_engine = TransformersNlpEngine(
    models=[{
        "lang_code": "ko",
        "model_name": {
            "spacy": "ko_core_news_sm",
            "transformers": "klue/bert-base"  # ⚠️ see the caution below
        }
    }]
)

# 2. Configure the Analyzer engine
analyzer = AnalyzerEngine(
    nlp_engine=transformer_engine,
    supported_languages=["ko"]
)

# 3. Run PII detection
text = "홍길동의 전화번호는 010-1234-5678이고 서울시 강남구에 거주합니다."
results = analyzer.analyze(text=text, language="ko")

for result in results:
    print(f"  {result.entity_type}: '{text[result.start:result.end]}' "
          f"(confidence: {result.score:.2f})")
```

⚠️ Caution: This code is for illustrating the pipeline structure. Since `klue/bert-base` is a pre-trained model and not an NER fine-tuned model, running it as-is will likely produce no meaningful NER results. In production, search for `klue ner` on the Hugging Face Model Hub to find community fine-tuned models (e.g., a model fine-tuned on the KLUE-NER dataset from `klue/bert-base`) and insert that model ID in the `transformers` field.
Expected output when an NER fine-tuned model is applied:

```
PERSON: '홍길동' (confidence: 0.92)
PHONE_NUMBER: '010-1234-5678' (confidence: 0.95)
LOCATION: '서울시 강남구' (confidence: 0.88)
```

Example 3: A Hybrid Strategy Combining Regex Recognizers
This is a situation you'll frequently encounter in practice. Korean PII broadly falls into two categories: structured PII with clear patterns, such as resident registration numbers, phone numbers, and bank account numbers, and unstructured PII requiring contextual understanding, such as person names, addresses, and organization names. No matter how good Transformer NER is, regex is more accurate and faster for resident registration number patterns.
```python
from presidio_analyzer import PatternRecognizer, Pattern

# Resident registration number (RRN) pattern Recognizer
rrn_pattern = Pattern(
    name="korean_rrn",
    regex=r"(?<!\d)(\d{6}[-–]\d{7})(?!\d)",
    score=0.9
)

rrn_recognizer = PatternRecognizer(
    supported_entity="KR_RRN",
    supported_language="ko",
    patterns=[rrn_pattern],
    context=["주민등록번호", "주민번호", "생년월일"]
)

# Register the custom Recognizer with the Analyzer
analyzer.registry.add_recognizer(rrn_recognizer)
```

Note the use of `(?<!\d)` and `(?!\d)` (negative lookbehind and lookahead) instead of `\b` in the regex. `\b` is an anchor that recognizes word boundaries, but it often doesn't work as expected in Korean text. In cases like "주민번호는123456-1234567입니다" where there are no spaces, `\b` may fail to match, so it's safer to explicitly specify non-digit character boundaries.
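You can verify this behavior with nothing but the standard library. Python's `re` treats Hangul syllables as word characters, so there is no `\b` boundary between "는" and "1":

```python
import re

text = "주민번호는123456-1234567입니다"  # no space before the RRN

# \b fails: Hangul is a word character, so 는→1 is not a word boundary.
print(re.search(r"\b\d{6}-\d{7}\b", text))  # → None

# Lookarounds succeed: they only require the neighbors to be non-digits.
m = re.search(r"(?<!\d)(\d{6}-\d{7})(?!\d)", text)
print(m.group(1))  # → 123456-1234567
```

The lookaround version also correctly rejects longer digit runs (e.g., an RRN-shaped substring inside a longer number), which `\b` alone cannot guarantee.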
The context parameter is also important. When Presidio finds words specified in context (here: "주민등록번호", "주민번호", "생년월일") in the surrounding text of a pattern match, it boosts the confidence score of that result. In other words, for the text "주민번호는 123456-1234567," the context adjustment is added on top of the pattern match score (0.9), raising the final confidence. Simple regex alone makes it difficult to distinguish resident registration numbers from other number strings with similar formats, and this context analysis effectively reduces false positives.
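The mechanics of context boosting can be sketched in a few lines. This is an illustration of the concept only — Presidio implements it internally with its own search window and boost values — so the `window` and `boost` numbers here are hypothetical:

```python
# Toy illustration of context-based score boosting.
# The window and boost values are hypothetical, not Presidio's defaults.
def boost_score(text, match_start, base_score,
                context_words, window=30, boost=0.35):
    """Raise a pattern match's confidence if a context word appears shortly before it."""
    before = text[max(0, match_start - window):match_start]
    if any(word in before for word in context_words):
        return min(1.0, base_score + boost)
    return base_score

context = ["주민등록번호", "주민번호", "생년월일"]

text = "주민번호는 123456-1234567"       # context word present → boosted
print(boost_score(text, text.index("123456"), 0.9, context))   # → 1.0

text2 = "송장번호는 123456-1234567"      # no context word → base score kept
print(boost_score(text2, text2.index("123456"), 0.9, context))  # → 0.9
```

The point of the design: the regex alone says "this looks like an RRN," while the surrounding words say "this is probably actually an RRN" — combining both reduces false positives on look-alike number strings.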
This is exactly where Presidio's architecture shines. Transformer NER catches names, addresses, and organization names; regex Recognizers catch resident registration numbers, phone numbers, and bank account numbers; and context analysis adjusts confidence scores — these three layers naturally combine.
Key Insight: The key is not entrusting everything to a single model. A study published in Scientific Reports in 2025 also showed that a rule-based + machine learning hybrid approach outperformed single-model approaches for PII detection in financial documents. Presidio essentially supports this hybrid strategy at the framework level.
spaCy Alone vs. Transformer Replacement — Before/After Comparison
Just saying "accuracy improves" isn't very convincing, so let's compare with actual sentences. Below is a qualitative comparison of detection results between spaCy ko_core_news_sm alone and a Presidio pipeline with an NER fine-tuned Transformer model applied.
| Input Text | spaCy Alone Result | After Transformer Replacement |
|---|---|---|
| "서울에서 근무하는 홍길동입니다" | LOC: "서울에서" (includes postposition) / PER: missed | LOC: "서울" ✓ / PER: "홍길동" ✓ |
| "삼성전자의 이재용 부회장" | ORG: "삼성전자의" (includes postposition) / PER: missed | ORG: "삼성전자" ✓ / PER: "이재용" ✓ |
| "김철수 교수가 연구를 진행" | PER: missed | PER: "김철수" ✓ |
| "2025년 3월 15일 부산에서 개최" | DT: missed / LOC: "부산에서" | DT: "2025년 3월 15일" ✓ / LOC: "부산" ✓ |
Two patterns are clearly visible. First, spaCy's postposition inclusion problem ("서울에서", "삼성전자의") is cleanly resolved with the Transformer. Second, the Transformer catches person names and date entities that spaCy was missing. The reduction in person name false negatives is a decisive difference in PII detection.
Of course, this is a qualitative comparison, and in production, it's recommended to create your own test set and quantitatively measure Precision/Recall/F1. For reference, a Korean EMR de-identification study (npj Health Systems, 2025) reported that KLUE-BERT-based NER achieved an F1 of 94.30% on discharge summaries.
Pros and Cons Analysis
Pros
| Item | Details |
|---|---|
| Korean contextual understanding | KLUE-BERT, trained on 62GB of Korean corpora, significantly improves NER accuracy in complex sentence structures compared to spaCy's CNN |
| Improved postposition handling | Tokenization reflecting morpheme boundaries enables natural segmentation like "서울에서" → "서울" + "에서" |
| Reuse of existing infrastructure | Presidio's regex Recognizers, context analysis, and anonymization pipeline can be used as-is |
| Easy model replacement | Switch to other models like KoELECTRA or KLUE-RoBERTa from the Hugging Face Hub by simply changing the configuration |
| Proven in practice | KLUE-BERT-based NER achieved F1 94.30% on discharge summaries in a Korean EMR de-identification study |
Cons and Considerations
| Item | Details | Mitigation |
|---|---|---|
| Resource consumption | A 111M parameter model significantly increases GPU memory usage and inference time compared to spaCy | Batch processing, GPU instances, or ONNX conversion for inference optimization |
| Manual label mapping | Manual mapping required between KLUE tags (PS, LC, OG) and Presidio standard types (PERSON, LOCATION) | Define once using model_to_presidio_entity_mapping in the YAML configuration |
| Missing Korean context words | Presidio's default context words ("Mr.", "Mrs.") are English-centric | Define Korean context words like "씨", "님", "대표", "교수" in custom Recognizers |
| Base model limitations | KLUE-BERT base's Entity F1 of 83.97% (per the KLUE benchmark paper) may be insufficient for production | Consider domain-specific fine-tuning or KoELECTRA (Entity F1 92.6%, per the KLUE leaderboard) |
Most Common Mistakes in Practice
These are cases I've personally experienced or frequently seen around me. Bookmark this section — it'll come in handy later.
- Confusing a pre-trained model with an NER fine-tuned model — `klue/bert-base` is a general-purpose pre-trained model. If you use it directly for NER tasks, it will barely detect any entities. Our team initially didn't know this and plugged in `klue/bert-base` as-is, then spent a long time debugging why we kept getting empty results even though "홍길동" was clearly in the text. You need to either find a model fine-tuned on KLUE-NER on Hugging Face or fine-tune one yourself.
- Relying solely on Transformers for Korean-specific PII patterns — Korean-specific identifiers like resident registration numbers (XXXXXX-XXXXXXX) and business registration numbers (XXX-XX-XXXXX) require separate regex Recognizers. These patterns are often insufficiently represented in NER model training data, and since the patterns are well-structured, regex is far more accurate.
- Confusing Entity F1 with Character F1 during evaluation — In Korean NER, the difference between these two metrics can be significant depending on whether postpositions are included. For example, when "서울에서" is detected as "서울," it's treated as incorrect in Entity F1 but may receive partial credit in Character F1. In the context of PII detection, the most important question is "did we miss any text spans containing personal information?" so it's appropriate to set recall-oriented evaluation criteria.
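The gap between the two metrics is easy to see on the "서울에서" example above. Below is a minimal sketch of both computations — exact-span matching for entity-level, per-character overlap for character-level — using toy character offsets:

```python
# Toy comparison of entity-level vs. character-level F1 on one prediction.
def entity_f1(gold_spans, pred_spans):
    """Exact span match: a prediction counts only if (start, end) match exactly."""
    tp = len(set(gold_spans) & set(pred_spans))
    prec = tp / len(pred_spans) if pred_spans else 0.0
    rec = tp / len(gold_spans) if gold_spans else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def char_f1(gold_spans, pred_spans):
    """Character overlap: partial credit for every overlapping character."""
    gold = {i for s, e in gold_spans for i in range(s, e)}
    pred = {i for s, e in pred_spans for i in range(s, e)}
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Gold span "서울에서" (chars 0–4) vs. a model that predicted only "서울" (chars 0–2):
gold, pred = [(0, 4)], [(0, 2)]
print(entity_f1(gold, pred))           # → 0.0  (no exact match: counted fully wrong)
print(round(char_f1(gold, pred), 2))   # → 0.67 (2 of 4 gold characters recovered)
```

When comparing published scores, always check which of the two a paper or leaderboard reports — for recall-oriented PII evaluation, character-level recall is often the more honest number.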
Conclusion
spaCy's Korean NER limitations are a structural issue, and Presidio's TransformersNlpEngine provides a realistic path to replacing it with a Transformer model while maintaining the existing pipeline. With the 2025 PIPA amendment making automated PII detection more critical than ever, this combination is one of the most promising choices for Korean personal information protection pipelines.
Three steps to get started right now:
- Set up the environment: Install the base dependencies in your Python environment with `pip install presidio-analyzer presidio-anonymizer && python -m spacy download ko_core_news_sm && pip install torch transformers`.
- Quick validation: Replace the `transformers` field in the YAML configuration or Python code examples above with an NER fine-tuned model, then test with the sentence "홍길동의 전화번호는 010-1234-5678이고 서울시 강남구에 거주합니다."
- Production expansion: Add `model_to_presidio_entity_mapping` and register custom PatternRecognizers for resident registration numbers and phone numbers to complete the hybrid pipeline. Replacing just the model with KoELECTRA-base-v3 (`monologg/koelectra-base-v3-discriminator`) is also a good starting point for performance comparison.
Next article: A deep dive into Presidio custom Recognizers — covering how to perfectly detect Korean PII patterns including resident registration numbers, business registration numbers, and passport numbers using regex and context analysis.
References
Cited in this article
- KLUE: Korean Language Understanding Evaluation Paper | arXiv
- spaCy Korean NER Postposition Handling Issue #13705 | GitHub
- Korean EMR De-identification Study | npj Health Systems (2025)
- Hybrid PII Detection Study | Scientific Reports (2025)
- Microsoft Presidio - Transformers NLP Engine Official Documentation
- Microsoft Presidio - NER Model Configuration Guide
- KLUE-BERT base | Hugging Face
- KLUE Benchmark | GitHub
Further Reading
- Microsoft Presidio - Multilingual Support Documentation
- Microsoft Presidio | GitHub
- spaCy Korean Model Documentation
- NER Model PII Comparative Analysis | Protecto
- Comprehensive Korean NER Dataset Overview | Letr.ai
- Naver x Changwon University NER Data | Korpora
- KoBERT | GitHub
- KPF-BERT-NER | Hugging Face
- Presidio TransformersNlpEngine Source Code | GitHub