How to Safely Serve Production Data to Preview Environments
Neon Masking Branches vs Custom Anonymization Scripts: A Two-Month Side-by-Side Practical Comparison
"It worked fine in staging but broke in production." Most developers have heard this at least once, and a significant portion of the causes come from test data failing to reflect reality. With just 10 rows of seed data, it's hard to catch N+1 query problems or bugs where usernames with special characters break. But copying production data as-is runs into the massive wall of privacy protection.
Honestly, when I was working at an early-stage startup, I once loaded a production dump directly into staging, thinking "it's only used internally anyway." Looking back, that was terrifying, and now that GDPR and SOC 2 audits are routine, it's unimaginable. So I built and directly compared two approaches to the same goal: a pipeline that preserves the shape and relationships of production data while safely replacing only the sensitive parts for Preview environments. The candidates were Neon masking branches and custom anonymization scripts. After reading this article, you should be able to build a prototype pipeline that automatically generates a masked DB per PR within 30 minutes.
Core Concepts
To summarize the key difference between the two approaches in one line: Neon masking branches are an "infrastructure handles the masking" approach, while custom scripts are an "I handle the masking" approach. This difference impacts operational burden, flexibility, and the overall cost structure.
Why Data Masking Is Necessary
Data masking for Preview environments refers to a pipeline that replicates the production DB's schema and data to development/staging environments while safely transforming PII (Personally Identifiable Information). The reason you need to understand this is that it's an entirely different problem from simply deleting data. If you bulk-replace an email column with test@test.com, unique constraints break, and joins between user IDs and order histories produce nonsensical results. The key is replacing only sensitive information while preserving foreign key relationships and data types intact.
For reference, the cost of compliance violations is never light. In 2023, Meta was fined 1.2 billion euros (~$1.3 billion) for GDPR violations, and smaller companies are no exception. Whether data masking is applied in non-production environments is also an audit item in SOC 2 reviews.
Key Masking Techniques at a Glance
| Technique | Description | Best Suited For |
|---|---|---|
| Substitution | Replace actual values with fake values generated by Faker, etc. | Text fields like names, emails |
| Pseudonymization | Hash-based mapping ensuring same input → same output | Maintaining join relationships across multiple tables |
| Partial Scrambling | Mask only portions (user@email.com → u***@email.com) | Debugging where format verification is needed |
| Noise Addition | Add random deviation to numbers/dates | Statistical data like revenue, age |
| Generalization | Reduce precision (date of birth → birth year) | Analytical datasets |
> **Pseudonymization vs Anonymization:** Pseudonymization allows the original to be restored if the mapping key is available, while anonymization is irreversible. Under GDPR, only truly anonymized data falls outside the regulation, so using irreversible methods in non-production environments is the safer approach.
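The "same input → same output" property of pseudonymization can be sketched in a few lines of Python. A salted hash (the salt value and email format below are illustrative, not from any library) gives a deterministic, irreversible mapping, so the same source value masked in two different tables still joins correctly:

```python
import hashlib

SALT = "rotate-me-per-environment"  # illustrative; keep out of version control

def pseudonymize(value: str) -> str:
    """Deterministic, irreversible mapping: same input always yields same output."""
    digest = hashlib.sha256(f"{SALT}-{value}".encode()).hexdigest()[:16]
    return f"user-{digest}@example.com"

# The same source email maps identically in both tables,
# so joins between users and audit_logs still line up.
users_email = pseudonymize("jane@corp.com")
audit_email = pseudonymize("jane@corp.com")
assert users_email == audit_email
# A different input yields a different pseudonym.
assert pseudonymize("john@corp.com") != users_email
```

Because the salt is never stored alongside the data and the hash is truncated, the original value cannot be recovered from the output, which is what distinguishes this from reversible tokenization.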
Neon Masking Branches: Branching Databases Like Git
Neon provides database branching functionality based on a copy-on-write storage architecture. Since November 2025, you can select an "Anonymized data" option when creating a branch, which I found quite innovative when I first saw it. The moment you create a new branch from the production branch, the postgresql_anonymizer extension internally applies masking rules, creating a completely independent environment that never touches the original.
I also didn't initially grasp the concept of "branching a database," but think of Git branches. Just as you create feature/login from main, you create preview/feature-login from the production DB. Thanks to copy-on-write, only changed pages incur additional storage, making storage costs efficient. In practice, creating a Neon anonymized branch for a users table with 500,000 rows completed in about 8 seconds.
Custom Anonymization Scripts: Traditional but Powerful
This is the traditional pipeline of pg_dump (PostgreSQL's backup tool) → transformation script → pg_restore (restoration tool). You replace PII using the Faker library in a script written in Python or Node.js, or leverage open-source tools like postgresql_anonymizer. For teams using PostgreSQL hosting other than Neon, or other databases like MySQL/MongoDB, this approach remains the primary method. For the same 500,000-row table, the entire dump-transform-restore cycle took about 12 minutes, so the speed difference is definitely noticeable.
Practical Application
Example 1: Neon + Vercel — Automatically Creating a Masked DB for Each PR
This is the cleanest scenario. When a PR is opened, an anonymized DB branch is automatically created and connected to the Vercel Preview deployment.
First, define masking rules on the production branch using the Neon Console or SQL. The SECURITY LABEL FOR anon syntax below is not Neon-specific but rather the standard postgresql_anonymizer syntax. Since Neon uses this extension internally, the same syntax applies.
```sql
-- Declare masking rules on the production branch (standard postgresql_anonymizer syntax)
SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

SECURITY LABEL FOR anon ON COLUMN users.name
  IS 'MASKED WITH FUNCTION anon.fake_first_name()';

SECURITY LABEL FOR anon ON COLUMN users.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone, 3, $$***$$, 0)';

SECURITY LABEL FOR anon ON COLUMN orders.shipping_address
  IS 'MASKED WITH FUNCTION anon.fake_address()';
```

Next, create an anonymized branch using the Neon CLI in GitHub Actions. Since PR branch names can contain slashes (`/`), like `feature/login`, we added handling to replace them with dashes.
```yaml
# .github/workflows/preview.yml
name: Create Preview with Masked DB

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  create-preview-db:
    runs-on: ubuntu-latest
    steps:
      - name: Install Neon CLI
        run: npm i -g neonctl
      - name: Create Anonymized Branch
        id: create-branch
        run: |
          SAFE_BRANCH=$(echo "${GITHUB_HEAD_REF}" | sed 's/\//-/g')
          BRANCH_NAME="preview-${SAFE_BRANCH}"
          neonctl branches create \
            --project-id ${{ secrets.NEON_PROJECT_ID }} \
            --name "$BRANCH_NAME" \
            --parent main \
            --api-key ${{ secrets.NEON_API_KEY }}
          CONNECTION_STRING=$(neonctl connection-string "$BRANCH_NAME" \
            --project-id ${{ secrets.NEON_PROJECT_ID }} \
            --api-key ${{ secrets.NEON_API_KEY }})
          echo "database_url=$CONNECTION_STRING" >> "$GITHUB_OUTPUT"
      - name: Set Vercel Environment Variable
        run: |
          vercel env add DATABASE_URL preview \
            --token ${{ secrets.VERCEL_TOKEN }} \
            <<< "${{ steps.create-branch.outputs.database_url }}"
```

Add a workflow to clean up branches when PRs are merged or closed.
```yaml
# .github/workflows/cleanup-preview.yml
name: Cleanup Preview DB

on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete Anonymized Branch
        run: |
          npm i -g neonctl
          SAFE_BRANCH=$(echo "${GITHUB_HEAD_REF}" | sed 's/\//-/g')
          neonctl branches delete "preview-${SAFE_BRANCH}" \
            --project-id ${{ secrets.NEON_PROJECT_ID }} \
            --api-key ${{ secrets.NEON_API_KEY }}
```

The biggest advantage of this setup is that each PR has a completely isolated DB. Even if a colleague corrupts data while testing, it doesn't affect my Preview environment, and there are no version conflicts when testing schema migrations. Cost-wise, Neon bills for compute time plus storage, and copy-on-write means only changed data incurs storage charges; even with 10 branches running, storage costs stayed at about 1.1x the original.
One-line verdict: If you're using a Neon + PostgreSQL combination, this approach is overwhelmingly convenient. With 30 minutes of setup, you'll have automated per-PR masked DB generation.
Example 2: Building a Pipeline with GitHub Actions + Custom Scripts
If your team doesn't use Neon, you'll need to build the pipeline yourself. Below is a practical example using Python + Faker, with improvements to several issues from the initial draft.
There are three things to watch out for. First, inserting table/column names directly into SQL via f-strings poses SQL injection risks, so use Identifier from the psycopg2.sql module. Second, DB connection credentials must always be read from environment variables. Third, be aware that row-by-row UPDATE can be fatally slow on large datasets (potentially taking hours for 1 million rows).
```python
# scripts/anonymize.py
import hashlib
import os
import subprocess

import psycopg2
from psycopg2 import sql
from faker import Faker

fake = Faker('ko_KR')

MASKING_RULES = {
    'users': {
        'email': lambda val: fake.email(),
        'name': lambda val: fake.name(),
        'phone': lambda val: fake.phone_number(),
        'address': lambda val: fake.address(),
    },
    'orders': {
        'shipping_address': lambda val: fake.address(),
        'recipient_name': lambda val: fake.name(),
    },
    'payments': {
        'card_last_four': lambda val: fake.credit_card_number()[-4:],
    },
}

# Deterministic hash mapping: same input -> same output, preserving joins.
PSEUDONYMIZE_COLUMNS = {
    ('users', 'external_id'): lambda val: hashlib.sha256(
        f"salt-2026-{val}".encode()
    ).hexdigest()[:16],
}


def anonymize_table(conn, table: str, columns: dict):
    """Row-by-row approach -- suitable for tables under ~100K rows."""
    cur = conn.cursor()
    table_id = sql.Identifier(table)
    for col, transform in columns.items():
        col_id = sql.Identifier(col)
        # Compose identifiers safely instead of f-string interpolation.
        query = sql.SQL("SELECT id, {} FROM {}").format(col_id, table_id)
        cur.execute(query)
        rows = cur.fetchall()
        update_query = sql.SQL("UPDATE {} SET {} = %s WHERE id = %s").format(
            table_id, col_id
        )
        for row_id, original_value in rows:
            if original_value is None:
                continue
            masked = transform(original_value)
            cur.execute(update_query, (masked, row_id))
        conn.commit()
    cur.close()


def main():
    prod_url = os.environ['PROD_DATABASE_URL']
    staging_url = os.environ['STAGING_DATABASE_URL']

    # Dump production in custom format (-Fc), then restore into staging.
    subprocess.run([
        'pg_dump', '--no-owner', '--no-privileges',
        '-Fc', '-f', '/tmp/prod_dump.dump',
        prod_url,
    ], check=True)
    subprocess.run([
        'pg_restore', '--clean', '--if-exists',
        '--no-owner', '-d', staging_url,
        '/tmp/prod_dump.dump',
    ], check=True)

    conn = psycopg2.connect(staging_url)
    for table, columns in MASKING_RULES.items():
        print(f"Anonymizing {table}...")
        anonymize_table(conn, table, columns)
    for (table, col), transform in PSEUDONYMIZE_COLUMNS.items():
        print(f"Pseudonymizing {table}.{col}...")
        anonymize_table(conn, table, {col: transform})
    conn.close()
    print("Anonymization complete.")


if __name__ == '__main__':
    main()
```

> **Performance warning:** The code above uses a row-by-row UPDATE approach, which can take hours on a table with 1 million rows. For large tables, a batch UPDATE (restricting ranges with `WHERE id BETWEEN ... AND ...`) or a COPY-based approach (transforming while streaming the dump file and loading via COPY) is far more efficient.
> **Faker unique limitation:** `fake.unique.email()` can raise a `UniquenessException` on large datasets. If uniqueness is required, periodically calling `fake.unique.clear()` or using hash-based pseudonymization to guarantee uniqueness is safer. The example above uses plain `fake.email()` to avoid this issue.
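As a sketch of the batch-UPDATE idea from the performance warning above (the function name and batch size are illustrative): instead of issuing one UPDATE per row, split the id space into ranges and drive one set-based statement per range. The helper below only computes the ranges; the SQL it would parameterize is shown in the docstring.

```python
def batch_ranges(min_id: int, max_id: int, batch_size: int):
    """Yield inclusive (start, end) id ranges covering [min_id, max_id].

    Each range would back one set-based statement such as:
        UPDATE users SET email = md5(email) || '@example.com'
        WHERE id BETWEEN %(start)s AND %(end)s;
    committing after each batch keeps transactions short and lock
    contention low.
    """
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1

# 1,000,000 rows in 50K batches -> 20 UPDATE statements instead of 1,000,000.
ranges = list(batch_ranges(1, 1_000_000, 50_000))
assert len(ranges) == 20
assert ranges[0] == (1, 50_000)
assert ranges[-1] == (950_001, 1_000_000)
```

The win comes from PostgreSQL applying each transformation inside the server process rather than round-tripping every row through Python, though set-based SQL can only express transformations that exist as SQL functions.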
Run this in GitHub Actions.
```yaml
# .github/workflows/sync-staging-data.yml
name: Sync Masked Production Data

on:
  schedule:
    - cron: '0 3 * * 1'  # Every Monday at 3 AM
  workflow_dispatch:

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install psycopg2-binary faker
      - name: Run anonymization
        env:
          PROD_DATABASE_URL: ${{ secrets.PROD_READONLY_DATABASE_URL }}
          STAGING_DATABASE_URL: ${{ secrets.STAGING_DATABASE_URL }}
        run: python scripts/anonymize.py
```

This approach is powerful in that you have full control over the masking logic, but honestly, maintenance is no joke. Every time a new table is added or a column name changes, you need to update the script, and if you forget, PII gets passed through as-is. I once forgot to add a new `user_profiles` table to the masking rules, and it was only barely caught during code review. Cost-wise, you'll incur CI runner time (GitHub Actions free tier: 2,000 minutes/month) and staging DB instance costs (~$15/month for an RDS db.t3.micro).
One-line verdict: Maximum flexibility, but considering the operational burden and performance limitations, it's best suited for small-to-medium tables.
Improving with postgresql_anonymizer
If you want to reduce the maintenance burden of custom scripts, you can declare masking rules at the SQL DDL level using postgresql_anonymizer. This creates a clean structure where schema and masking rules are version-controlled together.
```sql
-- Install and initialize the extension
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

-- Declare masking rules (can live in a migration file)
SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

SECURITY LABEL FOR anon ON COLUMN users.name
  IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';

SECURITY LABEL FOR anon ON COLUMN users.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone, 3, $$***$$, 0)';

SECURITY LABEL FOR anon ON COLUMN users.birth_date
  IS 'MASKED WITH FUNCTION anon.generalize_daterange(birth_date, ''year'')';

SECURITY LABEL FOR anon ON COLUMN payments.card_number
  IS 'MASKED WITH FUNCTION anon.random_string(16)';

-- Generate an anonymized dump (run from the CLI)
-- pg_dump_anon --host=prod-host --dbname=myapp > anonymized_dump.sql
```

The advantages over a Python script are clear. Since masking rules live inside SQL migration files, schema changes and masking rule updates naturally happen in the same PR. A single `pg_dump_anon` command produces an anonymized dump directly, eliminating the need to maintain a separate script.
> **postgresql_anonymizer 2.0 (January 2025):** Completely rewritten in Rust (on the PGRX framework) with significantly improved performance. Masking speed on large tables is noticeably better than in previous versions, so if you're still on an older version, an upgrade is worth considering.
Most Common Mistakes in Practice
After running both approaches side by side for about two months, here are the three most frequent mistakes I experienced or witnessed around me. I believe this section is actually the most practically valuable content in this article.
- **Forgetting to update masking rules when adding new tables/columns** — If you don't manage schema migrations and masking rules in the same PR, you'll inevitably miss one at some point. An automated CI check that detects "PII candidate columns without defined masking rules" can prevent incidents proactively. Using postgresql_anonymizer's DDL approach naturally mitigates this problem.
- **Overlooking indirect identifiers** — Everyone masks direct identifiers like names and emails, but indirect identifiers like IP addresses, device fingerprints, and behavioral sequences (logs of specific views in a specific order at specific times) are easy to miss. It's worth remembering that combinations of these alone can identify an individual.
- **Applying only simple random substitution without pseudonymization** — If the email in the `users` table becomes `fake1@test.com` while the same user's email in `audit_logs` becomes `fake2@test.com`, functional testing that tracks one user's activity becomes impossible. Always apply a consistent mapping for identical values across tables.
> **Referential integrity:** Data consistency between tables connected by foreign keys. For example, when `orders.user_id` references `users.id`, if `users.id` changes during the masking process, order data becomes orphaned. Neon branches preserve this automatically, but with custom scripts you need to handle it manually.
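The "automated CI check" mentioned in the first mistake above can be a small pure function: compare the columns that actually exist (for example, the output of the `information_schema.columns` query shown later in this article) against your masking rules, and fail the build if a PII-looking column has no rule. The PII regex and rule structure below mirror the `MASKING_RULES` dict from the script example, but the helper itself is illustrative:

```python
import re

# Column names that look like PII -- same pattern idea as the
# information_schema query in the "Three steps" section.
PII_PATTERN = re.compile(
    r"(email|phone|name|address|birth|ssn|ip_addr|device)", re.IGNORECASE
)

def find_unmasked_pii(schema_columns, masking_rules):
    """Return (table, column) pairs that look like PII but have no masking rule.

    schema_columns: iterable of (table, column) tuples, e.g. fetched from
                    information_schema.columns.
    masking_rules:  dict shaped like {'users': {'email': ..., ...}, ...}
    """
    missing = []
    for table, column in schema_columns:
        if PII_PATTERN.search(column) and column not in masking_rules.get(table, {}):
            missing.append((table, column))
    return missing

# A newly added user_profiles table with no rules is caught immediately.
columns = [("users", "email"), ("users", "id"), ("user_profiles", "phone")]
rules = {"users": {"email": lambda v: v}}
assert find_unmasked_pii(columns, rules) == [("user_profiles", "phone")]
```

Run in CI, a non-empty result exits non-zero and blocks the merge, which turns the "I forgot the new table" incident into a failed check instead of a PII leak.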
Pros and Cons Analysis
Having run both approaches side by side for about two months, here's my summary.
Comprehensive Comparison
| Category | Neon Masking Branches | Custom Anonymization Scripts |
|---|---|---|
| Initial Setup Time | ~30 minutes | Half a day to a full day |
| Operational Burden | Extremely low (one branch creation and done) | High (script updates with every schema change) |
| Data Freshness | Always current (branch directly from production) | Depends on sync frequency (typically daily/weekly) |
| Processing Time for 500K Rows | ~8 seconds | ~12 minutes (row-by-row basis) |
| Per-PR Isolation | Native support | Requires additional infrastructure |
| DB Compatibility | PostgreSQL (Neon) only | Any DB |
| Customization | Within postgresql_anonymizer function scope | Unlimited (domain-specific logic freely) |
| Vendor Lock-in | Dependent on Neon | None |
| Monthly Cost (Small Scale) | Neon Free tier + minimal storage per branch | CI runner time + staging DB (~$15/month) |
Drawbacks and Considerations
| Category | Details | Mitigation |
|---|---|---|
| Neon Vendor Lock-in | Becomes dependent on Neon's service | Define masking rules in standard SQL (SECURITY LABEL) to ensure portability |
| Neon Masking Option Limitations | Limited functionality compared to full postgresql_anonymizer | January 2026 update enabled custom rule definition via API/SQL |
| Custom Script Maintenance | Missing script updates on schema changes → PII leak risk | Add automated checks in CI that verify coverage between column lists and masking rules |
| Custom Script Processing Time | Dump-transform-restore takes tens of minutes to hours on large DBs | Introduce subset extraction, incremental sync, or batch UPDATE |
| Custom Script Foreign Key Breakage | Must manually ensure consistent hash processing | Apply pseudonymization techniques (same input → same output) |
Conclusion
Safely providing production data to Preview environments is essential infrastructure for both compliance and developer productivity. To put it clearly: if you're using a PostgreSQL + Neon combination, starting with masking branches is overwhelmingly efficient, and if you're in a multi-DB environment or using hosting other than Neon, custom scripts (or standalone tools like postgresql_anonymizer) are the realistic path. Both approaches ultimately arrive at the same principle: "manage masking rules as code and execute them automatically in CI/CD."
Three steps you can start right now:
- **Identify and catalog PII columns in your current production DB** — The query below quickly surfaces PII candidate columns.

  ```sql
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE column_name ~* '(email|phone|name|address|birth|ssn|ip_addr|device)'
    AND table_schema = 'public'
  ORDER BY table_name, column_name;
  ```

- **Build a prototype with a single small table** — If you're using Neon, apply masking rules to the `users` table with `SECURITY LABEL FOR anon` and create an anonymized branch. If not using Neon, install `postgresql_anonymizer` locally, or start by applying the Python + Faker example above to just the `users` table.
- **Integrate into CI/CD and add coverage checks** — Once the prototype works well, move it into a GitHub Actions workflow and add automated checks that detect missing masking rules when new PII columns are added. This completes a pipeline you can operate with confidence.
Next article: How to run E2E tests on masked Preview databases — Building a test pipeline that reproduces real user scenarios with Playwright + masked DB
References
- Create Environments with Masked Production Data Using Neon Branches | Neon Blog
- Branching With or Without PII: The Future of Environments | Neon Blog
- How to Handle PII in Staging Databases Without Losing Realistic Data | Neon Blog
- Data Anonymization | Neon Docs
- Data Anonymization API Reference | Neon Docs
- The anon Extension | Neon Docs
- A Database for Every Preview Environment Using Neon, GitHub Actions, and Vercel | Neon Blog
- Automate Branching with GitHub Actions | Neon Docs
- PostgreSQL Anonymizer Documentation
- PostgreSQL Anonymizer 2.0: Better, Faster, Safer | PostgreSQL News
- Masking Functions | PostgreSQL Anonymizer
- Data Masking Best Practices: In-Place vs In-Flight | Synthesized
- What is Data Masking? | AWS
- Neon vs Supabase vs Xata: Postgres Branching Compared | Xata Blog