How to Safely Serve Production Data to Preview Environments
Neon Masking Branches vs Custom Anonymization Scripts: A Two-Month Side-by-Side Practical Comparison
"It worked fine in staging but broke in production." Most developers have heard this at least once, and a significant portion of the causes come from test data failing to reflect reality. With just 10 rows of seed data, it's hard to catch N+1 query problems or bugs where usernames with special characters break. But copying production data as-is runs into the massive wall of privacy protection.
Honestly, when I was working at an early-stage startup, I once loaded a production dump directly into staging, thinking "it's only used internally anyway." Looking back, that was terrifying, and now that GDPR and SOC 2 audits are routine, it's unimaginable. So I built and directly compared two approaches to the same goal: a pipeline that preserves the shape and relationships of production data while safely replacing only the sensitive parts for Preview environments. The candidates were Neon masking branches and custom anonymization scripts. After reading this article, you should be able to build a prototype pipeline that automatically generates a masked DB per PR within 30 minutes.
Core Concepts
To summarize the key difference between the two approaches in one line: Neon masking branches are an "infrastructure handles the masking" approach, while custom scripts are an "I handle the masking" approach. This difference impacts operational burden, flexibility, and the overall cost structure.
Why Data Masking Is Necessary
Data masking for Preview environments refers to a pipeline that replicates the production DB's schema and data to development/staging environments while safely transforming PII (Personally Identifiable Information). The reason you need to understand this is that it's an entirely different problem from simply deleting data. If you bulk-replace an email column with test@test.com, unique constraints break, and joins between user IDs and order histories produce nonsensical results. The key is replacing only sensitive information while preserving foreign key relationships and data types intact.
For reference, the cost of compliance violations is never light. In 2023, Meta was fined 1.2 billion euros (~$1.3 billion) for GDPR violations, and smaller companies are no exception. Whether data masking is applied in non-production environments is also an audit item in SOC 2 reviews.
Key Masking Techniques at a Glance
| Technique | Description | Best Suited For |
|---|---|---|
| Substitution | Replace actual values with fake values generated by Faker, etc. | Text fields like names, emails |
| Pseudonymization | Hash-based mapping ensuring same input → same output | Maintaining join relationships across multiple tables |
| Partial Scrambling | Mask only portions (user@email.com → u***@email.com) | Debugging where format verification is needed |
| Noise Addition | Add random deviation to numbers/dates | Statistical data like revenue, age |
| Generalization | Reduce precision (date of birth → birth year) | Analytical datasets |
> **Pseudonymization vs Anonymization:** Pseudonymization allows the original to be restored if the mapping key is available, while anonymization is irreversible. Under GDPR, only truly anonymized data falls outside the regulation, so using irreversible methods in non-production environments is the safer approach.
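The "same input → same output" property of pseudonymization can be sketched in a few lines of Python. A salted hash (the salt value and email format below are illustrative, not from any library) gives a deterministic, irreversible mapping, so the same source value masked in two different tables still joins correctly:

```python
import hashlib

SALT = "rotate-me-per-environment"  # illustrative; keep out of version control

def pseudonymize(value: str) -> str:
    """Deterministic, irreversible mapping: same input always yields same output."""
    digest = hashlib.sha256(f"{SALT}-{value}".encode()).hexdigest()[:16]
    return f"user-{digest}@example.com"

# The same source email maps identically in both tables,
# so joins between users and audit_logs still line up.
users_email = pseudonymize("jane@corp.com")
audit_email = pseudonymize("jane@corp.com")
assert users_email == audit_email
# A different input yields a different pseudonym.
assert pseudonymize("john@corp.com") != users_email
```

Because the salt is never stored alongside the data and the hash is truncated, the original value cannot be recovered from the output, which is what distinguishes this from reversible tokenization.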
Neon Masking Branches: Branching Databases Like Git
Neon provides database branching functionality based on a copy-on-write storage architecture. Since November 2025, you can select an "Anonymized data" option when creating a branch, which I found quite innovative when I first saw it. The moment you create a new branch from the production branch, the postgresql_anonymizer extension internally applies masking rules, creating a completely independent environment that never touches the original.
I also didn't initially grasp the concept of "branching a database," but think of Git branches. Just as you create feature/login from main, you create preview/feature-login from the production DB. Thanks to copy-on-write, only changed pages incur additional storage, making storage costs efficient. In practice, creating a Neon anonymized branch for a users table with 500,000 rows completed in about 8 seconds.
Custom Anonymization Scripts: Traditional but Powerful
This is the traditional pipeline of pg_dump (PostgreSQL's backup tool) → transformation script → pg_restore (restoration tool). You replace PII using the Faker library in a script written in Python or Node.js, or leverage open-source tools like postgresql_anonymizer. For teams using PostgreSQL hosting other than Neon, or other databases like MySQL/MongoDB, this approach remains the primary method. For the same 500,000-row table, the entire dump-transform-restore cycle took about 12 minutes, so the speed difference is definitely noticeable.
Practical Application
Example 1: Neon + Vercel — Automatically Creating a Masked DB for Each PR
This is the cleanest scenario. When a PR is opened, an anonymized DB branch is automatically created and connected to the Vercel Preview deployment.
First, define masking rules on the production branch using the Neon Console or SQL. The SECURITY LABEL FOR anon syntax below is not Neon-specific but rather the standard postgresql_anonymizer syntax. Since Neon uses this extension internally, the same syntax applies.
```sql
-- Declare masking rules on the production branch (standard postgresql_anonymizer syntax)
SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

SECURITY LABEL FOR anon ON COLUMN users.name
  IS 'MASKED WITH FUNCTION anon.fake_first_name()';

SECURITY LABEL FOR anon ON COLUMN users.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone, 3, $$***$$, 0)';

SECURITY LABEL FOR anon ON COLUMN orders.shipping_address
  IS 'MASKED WITH FUNCTION anon.fake_address()';
```

Next, create an anonymized branch using the Neon CLI in GitHub Actions. Since PR branch names can contain slashes (`/`), like `feature/login`, we added handling to replace them with dashes.
```yaml
# .github/workflows/preview.yml
name: Create Preview with Masked DB

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  create-preview-db:
    runs-on: ubuntu-latest
    steps:
      - name: Install Neon CLI
        run: npm i -g neonctl
      - name: Create Anonymized Branch
        id: create-branch
        run: |
          SAFE_BRANCH=$(echo "${GITHUB_HEAD_REF}" | sed 's/\//-/g')
          BRANCH_NAME="preview-${SAFE_BRANCH}"
          neonctl branches create \
            --project-id ${{ secrets.NEON_PROJECT_ID }} \
            --name "$BRANCH_NAME" \
            --parent main \
            --api-key ${{ secrets.NEON_API_KEY }}
          CONNECTION_STRING=$(neonctl connection-string "$BRANCH_NAME" \
            --project-id ${{ secrets.NEON_PROJECT_ID }} \
            --api-key ${{ secrets.NEON_API_KEY }})
          echo "database_url=$CONNECTION_STRING" >> "$GITHUB_OUTPUT"
      - name: Set Vercel Environment Variable
        run: |
          vercel env add DATABASE_URL preview \
            --token ${{ secrets.VERCEL_TOKEN }} \
            <<< "${{ steps.create-branch.outputs.database_url }}"
```

Add a workflow to clean up branches when PRs are merged or closed.
```yaml
# .github/workflows/cleanup-preview.yml
name: Cleanup Preview DB

on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete Anonymized Branch
        run: |
          npm i -g neonctl
          SAFE_BRANCH=$(echo "${GITHUB_HEAD_REF}" | sed 's/\//-/g')
          neonctl branches delete "preview-${SAFE_BRANCH}" \
            --project-id ${{ secrets.NEON_PROJECT_ID }} \
            --api-key ${{ secrets.NEON_API_KEY }}
```

The biggest advantage of this setup is that each PR has a completely isolated DB. Even if a colleague corrupts data while testing, it doesn't affect my Preview environment, and there are no version conflicts when testing schema migrations. Cost-wise, Neon bills for compute time plus storage, and copy-on-write means only changed data incurs storage charges; even with 10 branches running, storage costs stayed at about 1.1x the original.
One-line verdict: If you're using a Neon + PostgreSQL combination, this approach is overwhelmingly convenient. With 30 minutes of setup, you'll have automated per-PR masked DB generation.
Example 2: Building a Pipeline with GitHub Actions + Custom Scripts
If your team doesn't use Neon, you'll need to build the pipeline yourself. Below is a practical example using Python + Faker, with improvements to several issues from the initial draft.
There are three things to watch out for. First, inserting table/column names directly into SQL via f-strings poses SQL injection risks, so use Identifier from the psycopg2.sql module. Second, DB connection credentials must always be read from environment variables. Third, be aware that row-by-row UPDATE can be fatally slow on large datasets (potentially taking hours for 1 million rows).
```python
# scripts/anonymize.py
import hashlib
import os
import subprocess

import psycopg2
from psycopg2 import sql
from faker import Faker

fake = Faker('ko_KR')

MASKING_RULES = {
    'users': {
        'email': lambda val: fake.email(),
        'name': lambda val: fake.name(),
        'phone': lambda val: fake.phone_number(),
        'address': lambda val: fake.address(),
    },
    'orders': {
        'shipping_address': lambda val: fake.address(),
        'recipient_name': lambda val: fake.name(),
    },
    'payments': {
        'card_last_four': lambda val: fake.credit_card_number()[-4:],
    },
}

# Deterministic hash mapping: same input -> same output, preserving joins.
PSEUDONYMIZE_COLUMNS = {
    ('users', 'external_id'): lambda val: hashlib.sha256(
        f"salt-2026-{val}".encode()
    ).hexdigest()[:16],
}


def anonymize_table(conn, table: str, columns: dict):
    """Row-by-row approach -- suitable for tables under ~100K rows."""
    cur = conn.cursor()
    table_id = sql.Identifier(table)
    for col, transform in columns.items():
        col_id = sql.Identifier(col)
        # Compose identifiers safely instead of f-string interpolation.
        query = sql.SQL("SELECT id, {} FROM {}").format(col_id, table_id)
        cur.execute(query)
        rows = cur.fetchall()
        update_query = sql.SQL("UPDATE {} SET {} = %s WHERE id = %s").format(
            table_id, col_id
        )
        for row_id, original_value in rows:
            if original_value is None:
                continue
            masked = transform(original_value)
            cur.execute(update_query, (masked, row_id))
        conn.commit()
    cur.close()


def main():
    prod_url = os.environ['PROD_DATABASE_URL']
    staging_url = os.environ['STAGING_DATABASE_URL']

    # Dump production in custom format (-Fc), then restore into staging.
    subprocess.run([
        'pg_dump', '--no-owner', '--no-privileges',
        '-Fc', '-f', '/tmp/prod_dump.dump',
        prod_url,
    ], check=True)
    subprocess.run([
        'pg_restore', '--clean', '--if-exists',
        '--no-owner', '-d', staging_url,
        '/tmp/prod_dump.dump',
    ], check=True)

    conn = psycopg2.connect(staging_url)
    for table, columns in MASKING_RULES.items():
        print(f"Anonymizing {table}...")
        anonymize_table(conn, table, columns)
    for (table, col), transform in PSEUDONYMIZE_COLUMNS.items():
        print(f"Pseudonymizing {table}.{col}...")
        anonymize_table(conn, table, {col: transform})
    conn.close()
    print("Anonymization complete.")


if __name__ == '__main__':
    main()
```

> **Performance warning:** The code above uses a row-by-row UPDATE approach, which can take hours on a table with 1 million rows. For large tables, a batch UPDATE (restricting ranges with `WHERE id BETWEEN ... AND ...`) or a COPY-based approach (transforming while streaming the dump file and loading via COPY) is far more efficient.
> **Faker unique limitation:** `fake.unique.email()` can raise a `UniquenessException` on large datasets. If uniqueness is required, periodically calling `fake.unique.clear()` or using hash-based pseudonymization to guarantee uniqueness is safer. The example above uses plain `fake.email()` to avoid this issue.
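As a sketch of the batch-UPDATE idea from the performance warning above (the function name and batch size are illustrative): instead of issuing one UPDATE per row, split the id space into ranges and drive one set-based statement per range. The helper below only computes the ranges; the SQL it would parameterize is shown in the docstring.

```python
def batch_ranges(min_id: int, max_id: int, batch_size: int):
    """Yield inclusive (start, end) id ranges covering [min_id, max_id].

    Each range would back one set-based statement such as:
        UPDATE users SET email = md5(email) || '@example.com'
        WHERE id BETWEEN %(start)s AND %(end)s;
    committing after each batch keeps transactions short and lock
    contention low.
    """
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1

# 1,000,000 rows in 50K batches -> 20 UPDATE statements instead of 1,000,000.
ranges = list(batch_ranges(1, 1_000_000, 50_000))
assert len(ranges) == 20
assert ranges[0] == (1, 50_000)
assert ranges[-1] == (950_001, 1_000_000)
```

The win comes from PostgreSQL applying each transformation inside the server process rather than round-tripping every row through Python, though set-based SQL can only express transformations that exist as SQL functions.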
Run this in GitHub Actions.
```yaml
# .github/workflows/sync-staging-data.yml
name: Sync Masked Production Data

on:
  schedule:
    - cron: '0 3 * * 1'  # Every Monday at 3 AM
  workflow_dispatch:

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install psycopg2-binary faker
      - name: Run anonymization
        env:
          PROD_DATABASE_URL: ${{ secrets.PROD_READONLY_DATABASE_URL }}
          STAGING_DATABASE_URL: ${{ secrets.STAGING_DATABASE_URL }}
        run: python scripts/anonymize.py
```

This approach is powerful in that you have full control over the masking logic, but honestly, maintenance is no joke. Every time a new table is added or a column name changes, you need to update the script, and if you forget, PII gets passed through as-is. I once forgot to add a new `user_profiles` table to the masking rules, and it was only barely caught during code review. Cost-wise, you'll incur CI runner time (GitHub Actions free tier: 2,000 minutes/month) and staging DB instance costs (~$15/month for an RDS db.t3.micro).
One-line verdict: Maximum flexibility, but considering the operational burden and performance limitations, it's best suited for small-to-medium tables.
Improving with postgresql_anonymizer
If you want to reduce the maintenance burden of custom scripts, you can declare masking rules at the SQL DDL level using postgresql_anonymizer. This creates a clean structure where schema and masking rules are version-controlled together.
```sql
-- Install and initialize the extension
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

-- Declare masking rules (can live in a migration file)
SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

SECURITY LABEL FOR anon ON COLUMN users.name
  IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';

SECURITY LABEL FOR anon ON COLUMN users.phone
  IS 'MASKED WITH FUNCTION anon.partial(phone, 3, $$***$$, 0)';

SECURITY LABEL FOR anon ON COLUMN users.birth_date
  IS 'MASKED WITH FUNCTION anon.generalize_daterange(birth_date, ''year'')';

SECURITY LABEL FOR anon ON COLUMN payments.card_number
  IS 'MASKED WITH FUNCTION anon.random_string(16)';

-- Generate an anonymized dump (run from the CLI)
-- pg_dump_anon --host=prod-host --dbname=myapp > anonymized_dump.sql
```

The advantages over a Python script are clear. Since masking rules live inside SQL migration files, schema changes and masking rule updates naturally happen in the same PR. A single `pg_dump_anon` command produces an anonymized dump directly, eliminating the need to maintain a separate script.
> **postgresql_anonymizer 2.0 (January 2025):** Completely rewritten in Rust (on the PGRX framework) with significantly improved performance. Masking speed on large tables is noticeably better than in previous versions, so if you're still on an older version, an upgrade is worth considering.
Most Common Mistakes in Practice
After running both approaches side by side for about two months, here are the three most frequent mistakes I experienced or witnessed around me. I believe this section is actually the most practically valuable content in this article.
- **Forgetting to update masking rules when adding new tables/columns** — If you don't manage schema migrations and masking rules in the same PR, you'll inevitably miss one at some point. An automated CI check that detects "PII candidate columns without defined masking rules" can prevent incidents proactively. Using postgresql_anonymizer's DDL approach naturally mitigates this problem.
- **Overlooking indirect identifiers** — Everyone masks direct identifiers like names and emails, but indirect identifiers like IP addresses, device fingerprints, and behavioral sequences (logs of specific views in a specific order at specific times) are easy to miss. It's worth remembering that combinations of these alone can identify an individual.
- **Applying only simple random substitution without pseudonymization** — If the email in the `users` table becomes `fake1@test.com` while the same user's email in `audit_logs` becomes `fake2@test.com`, functional testing that tracks one user's activity becomes impossible. Always apply a consistent mapping for identical values across tables.
> **Referential integrity:** Data consistency between tables connected by foreign keys. For example, when `orders.user_id` references `users.id`, if `users.id` changes during the masking process, order data becomes orphaned. Neon branches preserve this automatically, but with custom scripts you need to handle it manually.
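The "automated CI check" mentioned in the first mistake above can be a small pure function: compare the columns that actually exist (for example, the output of the `information_schema.columns` query shown later in this article) against your masking rules, and fail the build if a PII-looking column has no rule. The PII regex and rule structure below mirror the `MASKING_RULES` dict from the script example, but the helper itself is illustrative:

```python
import re

# Column names that look like PII -- same pattern idea as the
# information_schema query in the "Three steps" section.
PII_PATTERN = re.compile(
    r"(email|phone|name|address|birth|ssn|ip_addr|device)", re.IGNORECASE
)

def find_unmasked_pii(schema_columns, masking_rules):
    """Return (table, column) pairs that look like PII but have no masking rule.

    schema_columns: iterable of (table, column) tuples, e.g. fetched from
                    information_schema.columns.
    masking_rules:  dict shaped like {'users': {'email': ..., ...}, ...}
    """
    missing = []
    for table, column in schema_columns:
        if PII_PATTERN.search(column) and column not in masking_rules.get(table, {}):
            missing.append((table, column))
    return missing

# A newly added user_profiles table with no rules is caught immediately.
columns = [("users", "email"), ("users", "id"), ("user_profiles", "phone")]
rules = {"users": {"email": lambda v: v}}
assert find_unmasked_pii(columns, rules) == [("user_profiles", "phone")]
```

Run in CI, a non-empty result exits non-zero and blocks the merge, which turns the "I forgot the new table" incident into a failed check instead of a PII leak.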
Pros and Cons Analysis
Having run both approaches side by side for about two months, here's my summary.
Comprehensive Comparison
| Category | Neon Masking Branches | Custom Anonymization Scripts |
|---|---|---|
| Initial Setup Time | ~30 minutes | Half a day to a full day |
| Operational Burden | Extremely low (one branch creation and done) | High (script updates with every schema change) |
| Data Freshness | Always current (branch directly from production) | Depends on sync frequency (typically daily/weekly) |
| Processing Time for 500K Rows | ~8 seconds | ~12 minutes (row-by-row basis) |
| Per-PR Isolation | Native support | Requires additional infrastructure |
| DB Compatibility | PostgreSQL (Neon) only | Any DB |
| Customization | Within postgresql_anonymizer function scope | Unlimited (domain-specific logic freely) |
| Vendor Lock-in | Dependent on Neon | None |
| Monthly Cost (Small Scale) | Neon Free tier + minimal storage per branch | CI runner time + staging DB (~$15/month) |
Drawbacks and Considerations
| Category | Details | Mitigation |
|---|---|---|
| Neon Vendor Lock-in | Becomes dependent on Neon's service | Define masking rules in standard SQL (SECURITY LABEL) to ensure portability |
| Neon Masking Option Limitations | Limited functionality compared to full postgresql_anonymizer | January 2026 update enabled custom rule definition via API/SQL |
| Custom Script Maintenance | Missing script updates on schema changes → PII leak risk | Add automated checks in CI that verify coverage between column lists and masking rules |
| Custom Script Processing Time | Dump-transform-restore takes tens of minutes to hours on large DBs | Introduce subset extraction, incremental sync, or batch UPDATE |
| Custom Script Foreign Key Breakage | Must manually ensure consistent hash processing | Apply pseudonymization techniques (same input → same output) |
Conclusion
Safely providing production data to Preview environments is essential infrastructure for both compliance and developer productivity. To put it clearly: if you're using a PostgreSQL + Neon combination, starting with masking branches is overwhelmingly efficient, and if you're in a multi-DB environment or using hosting other than Neon, custom scripts (or standalone tools like postgresql_anonymizer) are the realistic path. Both approaches ultimately arrive at the same principle: "manage masking rules as code and execute them automatically in CI/CD."
Three steps you can start right now:
- **Identify and catalog PII columns in your current production DB** — The query below quickly surfaces PII candidate columns.

  ```sql
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE column_name ~* '(email|phone|name|address|birth|ssn|ip_addr|device)'
    AND table_schema = 'public'
  ORDER BY table_name, column_name;
  ```

- **Build a prototype with a single small table** — If you're using Neon, apply masking rules to the `users` table with `SECURITY LABEL FOR anon` and create an anonymized branch. If not using Neon, install `postgresql_anonymizer` locally, or start by applying the Python + Faker example above to just the `users` table.
- **Integrate into CI/CD and add coverage checks** — Once the prototype works well, move it into a GitHub Actions workflow and add automated checks that detect missing masking rules when new PII columns are added. This completes a pipeline you can operate with confidence.
Next article: How to run E2E tests on masked Preview databases — Building a test pipeline that reproduces real user scenarios with Playwright + masked DB
References
- Create Environments with Masked Production Data Using Neon Branches | Neon Blog
- Branching With or Without PII: The Future of Environments | Neon Blog
- How to Handle PII in Staging Databases Without Losing Realistic Data | Neon Blog
- Data Anonymization | Neon Docs
- Data Anonymization API Reference | Neon Docs
- The anon Extension | Neon Docs
- A Database for Every Preview Environment Using Neon, GitHub Actions, and Vercel | Neon Blog
- Automate Branching with GitHub Actions | Neon Docs
- PostgreSQL Anonymizer Documentation
- PostgreSQL Anonymizer 2.0: Better, Faster, Safer | PostgreSQL News
- Masking Functions | PostgreSQL Anonymizer
- Data Masking Best Practices: In-Place vs In-Flight | Synthesized
- What is Data Masking? | AWS
- Neon vs Supabase vs Xata: Postgres Branching Compared | Xata Blog