MLOps Model Deployment Automation: Building a CI/CD/CT Pipeline with GitHub Actions + Kubeflow
Honestly, I used to think, "How hard can model deployment be? Just spin up an API server, right?" But when you actually do it, a completely different world unfolds. When the data pipeline changes, the model has to be retrained. A model that worked fine yesterday suddenly starts producing strange predictions today. Something that worked on server A inexplicably fails on server B — one mystery after another.
By the end of this post, you'll be able to: ① understand the 5-stage MLOps pipeline structure, ② work with real-world GitHub Actions YAML code, and ③ integrate drift monitoring with Evidently AI — all ready to follow along immediately. MLOps is the methodology that brings order to this chaos. It ties together the development (Dev) and operations (Ops) of machine learning models into a single system, so that a model from the experimental stage flows reliably and repeatably all the way to production. Whether you're a backend developer, a data engineer, or someone just getting started with ML, this post is packed with content you can apply directly in practice.
Core Concepts
MLOps Maturity: Where Is Your Team Right Now?
The MLOps maturity model defined by Google Cloud is divided into three levels. This framework works well as a self-diagnostic tool for figuring out "where our team currently stands."
| Level | Name | Characteristics |
|---|---|---|
| Level 0 | Manual Process | Experiments in notebooks, manual deployment. No reproducibility |
| Level 1 | ML Pipeline Automation | Training pipeline automated, continuous training (CT) possible |
| Level 2 | CI/CD Pipeline Automation | Code changes trigger automatic pipeline build and deployment |
Most teams start at Level 0. The moment you think "we need to automate this" is the inflection point into Level 1, and Level 2 — automating the pipeline that builds pipelines — is a stage you see only in fairly mature ML organizations.
The Five Stages of an End-to-End Pipeline
A model deployment automation pipeline flows through roughly five stages.
```
[1. Data Collection & Preprocessing]
          ↓
[2. Model Training & Experiment Tracking]   ← MLflow / W&B
          ↓
[3. Model Validation & Registration]        ← Quality gate (AUC ≥ 0.85, etc.)
          ↓
[4. Deployment (Serving)]                   ← REST/gRPC API
          ↓
[5. Monitoring & Feedback]                  ← Drift detection → automatic retraining trigger
```

The most frequently overlooked parts of this flow are step 3, the quality gate, and step 5, monitoring. In particular, if you automate deployment without monitoring, you end up silently running a model whose performance is quietly degrading — and nobody knows. This is a situation you encounter often in practice, and the later the discovery, the greater the damage.
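The halt-on-failure behavior of this chain can be sketched in a few lines: each stage runs in order, and any failing stage stops everything downstream. This is a toy illustration, not tied to any specific orchestrator; the stage names and the `PipelineHalted` exception are hypothetical.

```python
# Toy sketch of a five-stage pipeline: any failing stage halts the rest.
# Stage names and the PipelineHalted exception are illustrative only.

class PipelineHalted(Exception):
    """Raised when a stage fails, stopping all downstream stages."""

def run_pipeline(stages):
    completed = []
    for name, stage in stages:
        if not stage():  # each stage returns True on success
            raise PipelineHalted(f"halted at: {name}")
        completed.append(name)
    return completed

stages = [
    ("data_validation", lambda: True),
    ("train_and_track", lambda: True),
    ("quality_gate",    lambda: 0.91 >= 0.85),  # e.g. an AUC check
    ("deploy_canary",   lambda: True),
    ("monitor",         lambda: True),
]

print(run_pipeline(stages))  # all five stage names, in order
```

Real orchestrators (Kubeflow, Airflow, GitHub Actions `needs:`) encode the same dependency semantics declaratively, but the failure behavior is the same: a failed quality gate never reaches deployment.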
Data Drift vs. Concept Drift: Confusing Them Leads to the Wrong Response
Data Drift: The statistical distribution of the input data itself changes. Example: consumer patterns that shifted completely after COVID-19.
Concept Drift: The relationship between inputs and outputs (the pattern the model learned) changes. Example: fraud patterns evolve to the point where the existing model can no longer catch them.
The two types of drift have different causes and require different responses. Data Drift calls for inspecting the data pipeline first; Concept Drift is fundamentally about retraining. And one of the main causes of both is Training-Serving Skew, explained below. These concepts may seem unrelated, but in real-world incidents they tend to cascade and compound.
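To make data drift concrete: a two-sample Kolmogorov–Smirnov statistic, the maximum gap between the empirical CDFs of the reference and current samples, is one of the standard per-feature checks (tools like Evidently use KS tests for numerical features by default on small datasets). A minimal pure-Python sketch, with toy numbers chosen only for illustration:

```python
def ks_statistic(reference, current):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    values = sorted(set(reference) | set(current))

    def ecdf(sample, x):
        # Fraction of the sample that is <= x
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in values)

reference = [10, 12, 11, 13, 12, 11, 10, 12]  # training-time distribution
shifted   = [18, 20, 19, 21, 20, 19, 18, 20]  # a post-COVID-style shift
same      = [11, 12, 10, 13, 12, 11, 10, 12]  # same distribution, reshuffled

print(ks_statistic(reference, shifted))  # 1.0 (completely separated distributions)
print(ks_statistic(reference, same))     # 0.0 (no drift)
```

Note that a KS test only sees the inputs, so it catches data drift; concept drift, a changed input-to-output relationship, requires monitoring model performance against ground-truth labels.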
Training-Serving Skew: The Most Dangerous Trap in Practice
Training-Serving Skew: A mismatch between the data processing logic used during model training and the preprocessing logic applied at serving time. A significant share of cases where the model seems fine but predictions come out wrong originates here.
I missed this once and lost two days because of it. In the training code, missing values were filled with the mean; in the serving code, they were being filled with 0. Managing feature engineering code as a single source of truth is the most reliable way to prevent this problem.
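The single-source-of-truth idea can be sketched as follows: the imputation statistic is fitted once on training data, persisted with the model, and the one shared function runs in both the training and serving paths. Function names here are hypothetical, not from any particular library.

```python
# Single source of truth for preprocessing: one function, fitted once,
# used identically at training and serving time. Names are illustrative.

def fit_imputer(rows):
    """Learn the imputation value (here: the mean) from training data."""
    observed = [v for v in rows if v is not None]
    return sum(observed) / len(observed)

def impute(rows, fill_value):
    """The ONLY imputation function, shared by train and serve code paths."""
    return [fill_value if v is None else v for v in rows]

# Training time: fit on training data, persist fill_value with the model
train_feature = [3.0, None, 5.0, 4.0]
fill_value = fit_imputer(train_feature)          # 4.0
train_ready = impute(train_feature, fill_value)  # [3.0, 4.0, 5.0, 4.0]

# Serving time: reuse the SAME function and the SAME persisted statistic.
# The two-day bug above was serving code filling with 0 instead of the mean.
serving_request = [None, 6.0]
print(impute(serving_request, fill_value))  # [4.0, 6.0], consistent with training
```

The same principle is what feature stores and serialized sklearn `Pipeline` objects enforce at scale: the fitted transformation travels with the model instead of being reimplemented in the serving codebase.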
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Deployment Speed | 70%+ faster model deployment compared to manual processes (based on industry cases) |
| Reproducibility | Results can be reproduced at any time using the same pipeline |
| Collaboration | Clear role definition between data scientists ↔ infrastructure engineers |
| Stability | Safe, incremental rollouts via canary deployment and A/B testing |
| Experiment Tracking | 60% reduction in experiment tracking overhead when adopting MLflow (based on industry cases) |
| Business Impact | Financial industry case: time-to-production for models reduced from 6 months → 2 weeks |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Initial Build Cost | Platform setup can take anywhere from a few months to two years | Start with managed services (SageMaker, Vertex AI) |
| Organizational Culture Change | ML team and infra team collaboration structure needs restructuring | Designate an MLOps champion role; adopt incrementally |
| Lack of Standardization | 55% of companies cite lack of standardized practices as a major obstacle (ScienceDirect, 2025) | Start by documenting team conventions |
| Training-Serving Skew | Prediction errors caused by mismatched training and inference environments | Manage feature engineering code from a single source |
| Edge Deployment Complexity | Limited compute resources, unstable network environments | Lightweight models with TFLite or ONNX Runtime |
The Most Common Mistakes in Practice
- Implementing deployment automation without monitoring — Deployment is automated, but nobody notices when performance degrades. Drift detection and alerting should always be built alongside deployment automation. I once set up deployment automation first and later discovered a degraded model had been running for two weeks before anyone caught it.
- Thinking about rollback strategy after deployment — By the time something goes wrong and you're looking for a rollback plan, it's already too late. Decide on the right strategy for your team — blue-green, canary, or shadow deployment — during the deployment design phase, before anything is shipped. Anyone who has experienced a same-day rollback knows exactly how painful this lesson is.
- Not tracking data lineage — If you can't answer "what data was this model trained on?", auditing and regulatory compliance become extremely difficult. Using DVC or MLflow's dataset logging to keep records from the very beginning will save you a lot of pain down the road.
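If adopting DVC or MLflow dataset logging feels heavy on day one, even a content hash of the training data stored next to the model artifact answers "what data was this trained on?". A minimal stdlib-only sketch; the file name, model version string, and record fields are all hypothetical:

```python
# Minimal dataset-lineage record: fingerprint the training data and store
# the record alongside the model artifact. Field names are illustrative.
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(data: bytes) -> str:
    """Content hash of the raw dataset bytes; changes iff the data changes."""
    return hashlib.sha256(data).hexdigest()

def lineage_record(dataset_name: str, data: bytes, model_version: str) -> dict:
    return {
        "dataset": dataset_name,
        "sha256": dataset_fingerprint(data),
        "model_version": model_version,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = lineage_record(
    "reference_dataset.parquet", b"...raw file bytes...", "fraud-detector/7"
)
print(json.dumps(record, indent=2))
```

This is essentially what DVC does for you (plus remote storage and git integration), so treat it as a stopgap that makes the eventual migration to a real lineage tool painless.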
Practical Application
The three examples below are not independent code snippets — they represent a single flow. The data-validation step in Example 1 calls the drift_monitor.py in Example 3; once model quality passes the gate, the model is registered in the MLflow registry from Example 2, and Seldon adjusts traffic accordingly. Keeping the full picture in mind as you read will make things much easier to understand.
Example 1: CI/CD/CT Pipeline with GitHub Actions + Kubeflow
This is the most commonly used pattern. A code push serves as the trigger, and everything from training to deployment flows automatically.
```yaml
# .github/workflows/mlops-pipeline.yml
name: MLOps CI/CD/CT Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/models/**'
      - 'src/data/**'
  schedule:
    - cron: '0 2 * * 1'  # Scheduled retraining every Monday at 2 AM

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Validate data schema and drift
        # drift_monitor.py is the Evidently AI-based script implemented in Example 3
        run: |
          python src/monitoring/drift_monitor.py \
            --reference data/reference_dataset.parquet \
            --current data/current_dataset.parquet \
            --threshold 0.05

  train-and-evaluate:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train model
        run: python src/train.py --experiment-name ${{ github.sha }}
      - name: Evaluate quality gate
        id: eval
        run: |
          AUC=$(python src/evaluate.py --model-path outputs/model.pkl)
          echo "auc=$AUC" >> $GITHUB_OUTPUT
          # bc -l is Linux-specific, so compare in Python for portability
          python -c "import sys; sys.exit(0 if float('$AUC') >= 0.85 else 1)" || \
            (echo "Quality gate failed: AUC=$AUC" && exit 1)

  canary-deploy:
    needs: train-and-evaluate
    runs-on: ubuntu-latest
    steps:
      - name: Deploy canary (10% traffic)
        run: |
          kubectl apply -f k8s/seldon-canary-deployment.yaml
          # Running a 30-minute monitoring window as a blocking step inside an
          # Actions job causes cost and timeout problems. In practice, delegate
          # this to Argo Rollouts or a separate scheduler, or trigger
          # asynchronous monitoring as below.
          python scripts/trigger_canary_monitor.py --run-id ${{ github.run_id }}
```

| Stage | Role | Behavior on Failure |
|---|---|---|
| `data-validation` | Detect data drift | Halt the entire pipeline |
| `train-and-evaluate` | Training + AUC quality gate | Block deployment and send alert |
| `canary-deploy` | Canary deployment at 10% traffic | Trigger automatic rollback |
Canary Deployment: A deployment strategy that exposes a new version to only a portion of total traffic (e.g., 10%) first, verifies stability, then gradually increases the percentage. Because the blast radius is limited if something goes wrong, it is especially useful for ML model deployments.
Blue-Green Deployment: A strategy that keeps both the old version (Blue) and the new version (Green) environments running simultaneously, then switches traffic over all at once. Rollbacks are nearly instant, but infrastructure costs are doubled.
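The gradual ramp-up that defines a canary rollout can be sketched as a simple promotion loop: traffic increases step by step only while the canary's observed error rate stays under a budget, otherwise it rolls back. This is a toy model with illustrative numbers; real setups delegate this control loop to tools like Argo Rollouts or Seldon's traffic splitting.

```python
# Toy canary promotion loop: ramp traffic only while the canary's error
# rate stays under budget; otherwise roll back. All numbers illustrative.

TRAFFIC_STEPS = [10, 25, 50, 100]  # percent of traffic sent to the canary
ERROR_BUDGET = 0.02                # max tolerated error rate per step

def promote_canary(observe_error_rate):
    """observe_error_rate(traffic_pct) -> measured error rate at that step."""
    for pct in TRAFFIC_STEPS:
        if observe_error_rate(pct) > ERROR_BUDGET:
            return ("rolled_back", pct)  # stop the ramp at the failing step
    return ("promoted", 100)

# Healthy canary: errors stay below budget at every traffic step
print(promote_canary(lambda pct: 0.01))                          # ('promoted', 100)
# Unhealthy canary: error rate spikes once traffic reaches 50%
print(promote_canary(lambda pct: 0.05 if pct >= 50 else 0.01))   # ('rolled_back', 50)
```

The key property this models is the limited blast radius: a bad model that fails at the 10% step never touches the other 90% of traffic, which is exactly why canary suits ML deployments where offline metrics can look fine while live behavior does not.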
Example 2: Connecting Model Registry & Serving with MLflow + Seldon Core
This is the key link connecting experiment tracking to deployment. It's the pattern where Seldon Core automatically serves models registered in MLflow. The moment I first saw Seldon automatically adjust traffic right after transitioning an MLflow stage to Production — that was genuinely exciting.
```python
# src/train.py — MLflow experiment tracking and model registration
# See src/data/loader.py for the data-loading implementation
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score


def train_and_register(X_train, y_train, X_val, y_val, params: dict):
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train)

        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("auc", auc)

        if auc >= 0.85:
            mlflow.sklearn.log_model(
                model,
                artifact_path="model",
                registered_model_name="fraud-detector"
            )
            client = mlflow.tracking.MlflowClient()
            # Look up the version by run_id to avoid race conditions when
            # experiments run concurrently; the
            # get_latest_versions(stages=["None"]) pattern is unsafe there.
            versions = client.search_model_versions(
                f"run_id='{run.info.run_id}'"
            )
            client.transition_model_version_stage(
                name="fraud-detector",
                version=versions[0].version,
                stage="Production"
            )
            print(f"[OK] Model registered -- AUC: {auc:.4f}")
        else:
            print(f"[FAIL] Quality gate failed -- AUC: {auc:.4f} (threshold: 0.85)")
        return auc
```

```yaml
# k8s/seldon-deployment.yaml — wired to the MLflow model registry
# Apply to the cluster with: kubectl apply -f k8s/seldon-deployment.yaml
# Note: maintenance of Seldon Core has been scaled back since 2024.
# KServe is currently more actively maintained by the community, so keep
# that in mind for new adoptions.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: fraud-detector
spec:
  predictors:
    - name: default
      traffic: 90
      graph:
        name: classifier
        implementation: MLFLOW_SERVER
        modelUri: "models:/fraud-detector/Production"
    - name: canary
      traffic: 10
      graph:
        name: classifier-canary
        implementation: MLFLOW_SERVER
        modelUri: "models:/fraud-detector/Staging"
```

The key to this setup is that the model registry stages (Staging / Production) are directly linked to the canary traffic split. When you promote a stage in MLflow, Seldon automatically adjusts the traffic. SeldonDeployment is a Kubernetes CRD (Custom Resource Definition); you apply the YAML above to a cluster using `kubectl apply -f`.
Example 3: Automated Data Drift Detection with Evidently AI
The data-validation step in Example 1 calls this script. When drift is detected, it can be wired to trigger a Kubeflow pipeline retraining run or send a notification via a Slack webhook.
```python
# src/monitoring/drift_monitor.py
import os

import pandas as pd
import requests
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ModelPerformancePreset


def trigger_retraining(reason: str):
    """Trigger retraining via the Kubeflow pipeline API or a Slack webhook."""
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook_url:
        requests.post(webhook_url, json={"text": f"[MLOps] Retraining triggered: {reason}"})
    # Kubeflow pipeline integration example:
    # kfp.Client().create_run_from_pipeline_func(training_pipeline, arguments={})


def run_drift_report(reference_path: str, current_path: str) -> dict:
    reference = pd.read_parquet(reference_path)
    current = pd.read_parquet(current_path)

    report = Report(metrics=[
        DataDriftPreset(drift_share_threshold=0.3),  # warn when 30%+ of features drift
        ModelPerformancePreset(),
    ])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()

    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    if drift_detected:
        trigger_retraining(reason="data_drift_detected")

    os.makedirs("reports", exist_ok=True)
    report.save_html("reports/drift_report.html")
    return {"drift_detected": drift_detected, "report": "reports/drift_report.html"}


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--reference", required=True)
    parser.add_argument("--current", required=True)
    # Accepted to match the CI invocation in Example 1
    parser.add_argument("--threshold", type=float, default=0.05)
    args = parser.parse_args()

    outcome = run_drift_report(args.reference, args.current)
    # Exit non-zero on drift so the data-validation job halts the pipeline
    raise SystemExit(1 if outcome["drift_detected"] else 0)
```

Closing Thoughts
The essence of an MLOps pipeline is "operating good models quickly, safely, and repeatably." Rather than trying to build the perfect system from day one, it's better to start by automating the single most painful point for your team right now.
Here are three steps you can start on immediately. These steps are not independent — each one naturally creates the need for the next.
1. Start experiment tracking with MLflow — After `pip install mlflow`, just add `mlflow.start_run()` and `mlflow.log_metric()` to your existing training code. Run `mlflow ui` to bring up the dashboard, and your experiment history will start becoming visible at a glance. As you use it, you'll naturally start thinking "we need a clear criterion for which model to deploy" — that's the signal to move to step 2.
2. Connect a quality gate with GitHub Actions — Write a workflow in `.github/workflows/model-eval.yml` that automatically evaluates the model when a PR is opened and blocks the merge if metrics like AUC fall below the threshold. This alone will prevent a significant number of "bad model slips into production" incidents. Once the quality gate is in place, the next natural question becomes "how do we catch performance degradation after deployment?" — which leads right into step 3.
3. Add production monitoring with Evidently AI — After `pip install evidently`, run a drift report comparing one week of serving data against training data as a weekly batch job. Even starting with just a weekly report will let you catch data quality issues much faster than before.
I should also be honest about what this post didn't cover. Feature Store integration, model explainability (SHAP, LIME), on-premises environment setup, and LLMOps — managing LLM fine-tuning and RAG pipelines — are each topics substantial enough to warrant their own dedicated post. I hope this article serves as a starting point for getting your bearings in the broader MLOps landscape: figuring out where you are right now, and where to head next.
Next post: A practical LLMOps guide to managing LLM fine-tuning and RAG pipelines in production — from prompt version control to building guardrails
References
- MLOps: Continuous delivery and automation pipelines in machine learning | Google Cloud
- MLOps in 2026: Best Practices for Scalable ML Deployment | Kernshell
- MLOps in 2026: What You Need to Know to Stay Competitive | Hatchworks
- The Evolution of MLOps: Rise of Automation | Pragmatic AI Labs
- MLOps Best Practices 2025: CI/CD & Model Monitoring | TensorBlue
- Samsung Tech Blog — Building an AI Development System with Kubeflow and MLflow (Korean)
- Samsung Tech Blog — Streamlining AI Model Development and Deployment with MLOps (Korean)
- Kubeflow MLOps: Automatic pipeline deployment with CI/CD/CT | Towards Data Science
- MLOps Pipeline with MLflow, Seldon Core and Kubeflow | Ubuntu
- What is MLOps? Benefits, Challenges & Best Practices | LakeFS
- MLOps best practices, challenges and maturity models | ScienceDirect (2025)
- MLOps Principles | ml-ops.org
- What is MLOps? | AWS