DEV BAK - TECH BLOG
DevOps

Managing Kubernetes Multi-Cluster Operations with Rancher Fleet — A Pattern for Managing Dozens of Clusters from a Single Git Repo Without Drift

This article is written for DevOps/infrastructure engineers who have hands-on experience operating Kubernetes. It assumes familiarity with Helm, Kustomize, and kubectl.

When you only have one or two clusters, kubectl apply is enough. But once clusters start splitting by region and environment, a quiet anxiety sets in. "What's actually running in eu-west right now? Is the config different from us-east?" I started out deploying with shell scripts that looped over every cluster, but one day I discovered that slightly different configurations had accumulated on each cluster, and I only found out during an incident. That night's on-call shift ran pretty long.

This article covers GitOps multi-cluster operation patterns based on Rancher Fleet. It explains how the Hub & Spoke architecture works, how to structure your repository, practical fleet.yaml patterns you can apply immediately, and common mistakes — organized at a level where you can take this structure directly to your team.

Whether you're running 5 clusters or 50, if you feel like drift is accumulating in your current environment, this article should help.


Core Concepts

What Is a GitOps Fleet?

GitOps is an approach where a Git commit becomes the deployment intent. Instead of operators connecting directly to clusters and running commands, you declare the desired state in Git, and a controller continuously watches it and reconciles the actual cluster state to match. This loop is called Reconciliation — simply put, it's a continuous cycle of "if Git says so, make the cluster match."

Drift: The gap between the intent declared in Git (Desired State) and the actual cluster state (Actual State). It happens when someone quietly makes a change with kubectl edit, or when deployments are applied separately to each cluster. The classic symptom is "one cluster is running a different version than the others."
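
As a concrete illustration of closing that gap, here is a minimal sketch of a Fleet GitRepo (the resource type described in the components section below) with drift correction switched on. The repository URL and names are placeholders reused from this article, and the availability and exact behavior of correctDrift depend on your Fleet version, so treat this as an assumption to check against the Fleet documentation rather than a guaranteed default.

```yaml
# GitRepo with drift correction enabled (sketch; verify against your Fleet version)
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: fleet-app
  namespace: fleet-default
spec:
  repo: https://github.com/your-org/gitops-fleet   # placeholder repository
  branch: main
  paths:
    - clusters/production
  correctDrift:
    # Ask Fleet to revert manual changes (for example from kubectl edit)
    # back to the state rendered from Git.
    enabled: true
```

Without this option, Fleet reports resources that differ from Git as modified instead of reverting them, which is still useful for spotting drift early.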

When you have a single cluster, one ArgoCD instance is sufficient. The problem arises when clusters grow to "Fleet" scale. If you manage ArgoCD separately on each cluster, you have to repeat agent upgrades, policy changes, and shared config deployments for every cluster. That is the problem GitOps Fleet tooling addresses, and Rancher Fleet is one of the more mature implementations.

Hub & Spoke Architecture

Rancher Fleet operates with a Hub & Spoke structure. There is a central management cluster acting as the Hub, and Spoke clusters running the actual workloads are connected around it.

                    ┌─────────────────────┐
                    │  Management Cluster │
                    │        (Hub)        │
                    │                     │
                    │  GitRepo Controller │
                    │  Bundle Scheduler   │
                    └──────────┬──────────┘
                               │ Agents pull bundles from the Hub
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
    │  us-east     │  │  eu-west     │  │  ap-northeast│
    │  (Spoke)     │  │  (Spoke)     │  │  (Spoke)     │
    │              │  │              │  │              │
    │  Fleet Agent │  │  Fleet Agent │  │  Fleet Agent │
    └──────────────┘  └──────────────┘  └──────────────┘

A lightweight Fleet Agent runs on each spoke cluster and pulls configuration from the hub to apply it. Because the agent initiates the connection toward the hub, spoke clusters behind private networks do not need to open inbound ports. This is the key difference from ArgoCD's default Push model, where the control plane connects directly to each cluster.
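
To make the agent-initiated flow more concrete, the sketch below shows a ClusterRegistrationToken created on the hub. The token name and TTL are hypothetical, and the namespace simply reuses fleet-default from this article. Fleet stores the generated token in a Secret of the same name, and the agent installed on the spoke uses it to register itself with the hub; check the Fleet registration docs for the exact install steps in your version.

```yaml
# ClusterRegistrationToken: created on the hub, consumed by spoke agents (sketch)
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterRegistrationToken
metadata:
  name: spoke-onboarding        # hypothetical token name
  namespace: fleet-default      # namespace where the spoke clusters will be registered
spec:
  # How long the token stays valid; a short TTL limits exposure if it leaks.
  ttl: 240h
```

Because the spoke only needs outbound access to the hub's API endpoint, this works for clusters behind NAT or a firewall.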

4 Core Components

Grasp these four concepts and the rest of Fleet follows naturally. In practice, the most common source of confusion is the relationship between Bundle and ClusterGroup — think of Bundle as "what to deploy" and ClusterGroup as "where to deploy it."

| Component | Role | Analogy |
|---|---|---|
| GitRepo | Defines the Git repository to watch | "Subscribe to this Git address" |
| Bundle | The actual deployment unit (internally converted to a Helm release) | "A packaged deployment bundle" |
| ClusterGroup | A logical grouping of clusters based on labels | "All clusters tagged with the prod label" |
| Fleet Agent | The agent running on each spoke cluster | "The on-site operator" |

ClusterSelector: Selects clusters using the same mechanism as Kubernetes Label Selectors. Attach labels like region: us-east and env: production to your clusters, and Fleet automatically selects and deploys to those clusters.
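
The label-based grouping described above can be sketched as a ClusterGroup resource. The group name is hypothetical, and the label env: production is the same one used elsewhere in this article.

```yaml
# ClusterGroup: a named, label-based set of clusters (sketch)
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: production              # hypothetical group name
  namespace: fleet-default      # must match the namespace of the registered clusters
spec:
  selector:
    matchLabels:
      env: production           # every cluster carrying this label joins the group
```

A GitRepo target can then refer to the group by name (clusterGroup: production) instead of repeating the label selector, which keeps the "where" in one place when several GitRepos deploy to the same set of clusters.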

A GitRepo resource looks like this. Apply this single YAML to the hub cluster and Fleet will begin watching the specified repository.

yaml
# GitRepo — applied to the hub cluster
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: fleet-app
  namespace: fleet-default
spec:
  repo: https://github.com/your-org/gitops-fleet
  branch: main
  paths:
    - clusters/production
  targets:
    - name: production
      clusterSelector:
        matchLabels:
          env: production

Notable Changes in 2025–2026

Honestly, this space has been moving quite fast over the past year or two.

ArgoCD Agent Architecture: The ArgoCD Agent released by Red Hat as a Technology Preview in Q4 2025 distributes core components to each cluster to overcome the limitations of a single control plane. It's targeting GA with OpenShift GitOps 1.19 in 2026, so it's worth watching if you're in an OpenShift environment.

OCI Registry Adoption: The pattern of packaging YAML manifests as immutable artifacts like container images and pushing them to OCI registries is spreading among large enterprises. Pulling a compressed single artifact instead of cloning an entire Git history improves network efficiency, and signing the artifacts strengthens supply chain security. Flux CD has strong support for this pattern.
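
For reference, a Flux source that pulls manifests from an OCI registry might look like the sketch below. The registry URL and resource name are placeholders, the tag is arbitrary, and the API version served depends on your Flux release, so confirm it against your installation before applying.

```yaml
# Flux OCIRepository: pull a signed manifest artifact instead of cloning Git (sketch)
apiVersion: source.toolkit.fluxcd.io/v1beta2   # assumption: check the version your Flux serves
kind: OCIRepository
metadata:
  name: fleet-manifests                        # hypothetical name
  namespace: flux-system
spec:
  interval: 5m
  url: oci://ghcr.io/your-org/gitops-fleet-manifests   # placeholder registry path
  ref:
    tag: production                            # hypothetical tag
  verify:
    # Verify the artifact signature with Cosign before it is used.
    provider: cosign
```

A Flux Kustomization can reference this source through sourceRef with kind: OCIRepository, the same way it would reference a GitRepository.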

Multi-tool Combinations: Instead of "one tool for everything," there's a growing trend of combining tools by role. Combinations like ArgoCD (app deployment) + Rancher Fleet (cluster config management) and Flux (infrastructure) + Sveltos (add-on management) are appearing in production environments.


Practical Application

Example 1: Multi-Region Production — Region-Specific Overlay Configuration

This is the most common scenario you'll encounter. You have clusters in the us-east, eu-west, and ap-northeast regions and want to share a baseline app configuration while applying different DB endpoints or resource sizes per region.

Using targetCustomizations in Rancher Fleet's fleet.yaml, you can apply overlays based on cluster labels. One important caveat: helm.valuesFiles inside targetCustomizations is applied additively, on top of the helm.valuesFiles declared at the top level. There is therefore no need to re-specify values.yaml inside each target; doing so would only apply the same file a second time.

yaml
# fleet.yaml — Helm overlay per region
defaultNamespace: production
helm:
  chart: ./charts/app
  valuesFiles:
    - values.yaml          # Shared defaults (automatically applied to all clusters)
targetCustomizations:
  - name: us-east
    clusterSelector:
      matchLabels:
        region: us-east
    helm:
      valuesFiles:
        - values-us-east.yaml   # Overlay added on top of the shared values.yaml
  - name: eu-west
    clusterSelector:
      matchLabels:
        region: eu-west
    helm:
      valuesFiles:
        - values-eu-west.yaml
  - name: ap-northeast
    clusterSelector:
      matchLabels:
        region: ap-northeast
    helm:
      valuesFiles:
        - values-ap-northeast.yaml

| Field | Description |
|---|---|
| defaultNamespace | The default namespace for deployment |
| helm.valuesFiles | Shared values files (applied to all clusters) |
| targetCustomizations | Overlay definitions per label matcher |
| clusterSelector.matchLabels | Conditions for selecting clusters where this config applies |

Changing the shared configuration applies it across all regions at once, while only the necessary parts are overridden in per-region files. This cleanly separates shared config from differences, so when more clusters are added later, it's as simple as adding one more file.
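
To show what "only the necessary parts are overridden" can look like, here is a sketch of a shared values file and one regional overlay. Every key and hostname is hypothetical, since the real keys depend on your chart.

```yaml
# values.yaml: shared defaults (all keys are hypothetical)
replicaCount: 2
database:
  host: db.internal.example.com
  port: 5432
resources:
  requests:
    cpu: 250m
    memory: 256Mi
---
# values-us-east.yaml: overlay containing only the keys that differ
database:
  host: db.us-east.internal.example.com
resources:
  requests:
    cpu: 500m
```

Helm merges maps key by key, so a us-east cluster ends up with the regional database host and CPU request while keeping port 5432, the 256Mi memory request, and replicaCount 2 from the shared file. Lists are replaced wholesale rather than merged, which is worth remembering when an overlay touches a list-valued key.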

Example 2: Multi-Cluster Multi-Environment Repository Structure

Repository structure needs to be designed properly from the start; otherwise it becomes unmanageable as clusters grow. I started with a flat directory layout and had to completely rework it once the cluster count went past double digits. I had to recreate all the existing GitRepo resources, and that process was quite painful.

text
gitops-fleet/
├── clusters/
│   ├── production/
│   │   ├── us-east/
│   │   │   └── fleet.yaml        # us-east prod cluster config
│   │   └── eu-west/
│   │       └── fleet.yaml        # eu-west prod cluster config
│   └── staging/
│       └── us-east/
│           └── fleet.yaml        # staging cluster config
├── apps/
│   ├── frontend/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   └── service.yaml
│   │   ├── values.yaml               # Shared defaults
│   │   ├── values-us-east.yaml       # Region-specific overlay
│   │   ├── values-eu-west.yaml
│   │   └── overlays/
│   │       ├── production/
│   │       │   └── kustomization.yaml
│   │       └── staging/
│   │           └── kustomization.yaml
│   └── backend/
│       ├── base/
│       ├── values.yaml
│       └── overlays/
│           ├── production/
│           └── staging/
└── infra/
    ├── monitoring/
    └── ingress/

The clusters/ directory defines "what to deploy to which cluster," while apps/ contains the Kustomize definitions and Helm values for actual applications. The combinations of environment (production/staging) and region (us-east/eu-west) are expressed as file paths. In this layout, clusters/production/us-east/fleet.yaml is the entry point, and it references apps/frontend/values-us-east.yaml for the region-specific values.
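
As a rough sketch of how the base/ and overlays/ directories under apps/frontend could fit together, the two Kustomize files below are one possibility. The base kustomization.yaml is an assumption (the tree above lists only the manifests), and the Deployment name frontend and the replica count are hypothetical.

```yaml
# apps/frontend/base/kustomization.yaml (assumed; not shown in the tree above)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# apps/frontend/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                  # reuse the shared manifests unchanged
patches:
  - target:
      kind: Deployment
      name: frontend            # hypothetical Deployment name
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
```

The staging overlay would follow the same shape with its own patch, so the difference between environments stays visible as a small diff between two short files.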

Example 3: Advanced — AWS EKS + Flux + Crossplane Full-Stack GitOps

This example requires prior knowledge to follow along. If you're new to Flux CD and Crossplane, it's recommended to read each tool's official documentation first before returning here.

When you want to handle the entire process from infrastructure provisioning to app deployment with GitOps, the combination of Flux CD and Crossplane fits well. The flow works like this: Crossplane sees a Cluster resource declared in Git and creates an EKS cluster; Flux then uses the kubeconfig Secret generated for that cluster, referenced from the Kustomization's spec.kubeConfig, and starts deploying apps according to Kustomization resources. Once this wiring is in place, adding a cluster spec to Git is all it takes to create a new cluster and deploy apps onto it.

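yaml
# Crossplane: EKS cluster provisioning
# eks.aws.upbound.io/v1beta1 is an API provided by
# Upbound's AWS Provider (provider-aws-eks), not Crossplane core.
# A real Cluster also needs an IAM role and VPC/subnet settings;
# they are omitted here for brevity.
apiVersion: eks.aws.upbound.io/v1beta1
kind: Cluster
metadata:
  name: production-us-east
  labels:
    region: us-east
    env: production
spec:
  forProvider:
    region: us-east-1
    version: "1.29"
  writeConnectionSecretToRef:
    namespace: crossplane-system
    name: eks-production-us-east
---
# Flux: app deployment to the new cluster
# Flux only targets a remote cluster when spec.kubeConfig is set; without it,
# the Kustomization is applied to the cluster Flux itself runs on.
# Assumption: a Secret named eks-production-us-east-kubeconfig exists in
# flux-system and holds a kubeconfig under the key "kubeconfig" (with the
# Upbound provider this is typically produced by a ClusterAuth resource).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./clusters/production/us-east
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet-repo
  kubeConfig:
    secretRef:
      name: eks-production-us-east-kubeconfig
      key: kubeconfig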

| Stage | Tool | Role |
|---|---|---|
| Infrastructure Provisioning | Crossplane (Upbound AWS Provider) | Creates EKS cluster, configures VPC/node groups |
| GitOps Reconciliation | Flux CD | Synchronizes Git → cluster state |
| App Deployment | Flux Kustomization | Applies app manifests |
| Secret Management | External Secrets Operator | Integrates with AWS Secrets Manager |
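
The Kustomization in this example points at a GitRepository named fleet-repo, which is not shown. A minimal sketch of that source, assuming the same placeholder repository URL used earlier in this article, could look like this:

```yaml
# Flux GitRepository: the source referenced by the Kustomization's sourceRef (sketch)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-repo
  namespace: flux-system
spec:
  interval: 1m                                   # how often Flux checks for new commits
  url: https://github.com/your-org/gitops-fleet  # placeholder repository
  ref:
    branch: main
```

Private repositories additionally need a spec.secretRef pointing at credentials, which is left out here.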

Pros and Cons

Advantages

| Item | Details |
|---|---|
| Consistency | All clusters are reconciled against the same Git state, so configuration drift is detected and corrected instead of silently accumulating |
| Auditability | Git commit history is the change log; track who changed what and when with git log |
| Automated Rollback | Restore the previous state with just git revert |
| Large-Scale Expansion | Rancher Fleet can manage thousands of clusters from a single management cluster |
| Automation | Merging a PR triggers deployment, so no one needs to run kubectl commands manually |
| Declarative Operations | Just declare the desired state and the system converges automatically |

Disadvantages and Caveats

I've been burned by some of these personally, so I've added honest notes on the ones I have direct experience with.

| Item | Details | Mitigation |
|---|---|---|
| Initial Complexity | The architecture can feel heavy for 2–3 clusters | Consider adopting when you have 5+ clusters or expect rapid growth |
| Secret Management | Cannot store plaintext secrets in Git; migrating later after "moving fast" is quite painful | Use SOPS + Age/GPG, HashiCorp Vault, or External Secrets Operator |
| Blast Radius | A bad commit can be deployed to the entire Fleet simultaneously; experiencing this once makes you acutely aware of the need for Progressive Delivery | A staged pipeline through staging group → canary group → production group is essential |
| Learning Curve | Requires learning Hub & Spoke, ClusterSelector, Bundle, and other native concepts | Practice first with a local kind cluster |
| Tool Lock-in | Choosing Rancher Fleet increases dependency on the Rancher ecosystem | If you prefer a CNCF Graduated tool, consider Flux CD, optionally paired with Sveltos |
| GitRepo Structure Complexity | Overlay structure design becomes complex as environment × region combinations grow | Invest sufficient time in directory structure design from the start |

Progressive Delivery: A strategy of deploying a new version to a subset of traffic or clusters first, then gradually expanding if no issues are found, rather than deploying to all clusters at once. In GitOps Fleet, this can be implemented by dividing ClusterGroups into stages to control deployment order. Argo Rollouts or Flagger can be used alongside this approach.
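
One way to express that staging in Fleet is the rolloutStrategy section of fleet.yaml. The sketch below assumes a hypothetical rollout-stage label on each cluster and arbitrary thresholds; it also assumes the GitRepo targets already match all three sets of clusters. Treat the exact pacing behavior as something to confirm in the Fleet documentation for your version.

```yaml
# fleet.yaml: staged rollout sketch (label key and thresholds are assumptions)
defaultNamespace: production
helm:
  chart: ./charts/app
rolloutStrategy:
  # Number (or percentage) of partitions allowed to be unavailable during an update
  maxUnavailablePartitions: 0
  partitions:
    - name: staging
      # Number (or percentage) of clusters in this partition allowed to be unavailable
      maxUnavailable: 0
      clusterSelector:
        matchLabels:
          rollout-stage: staging
    - name: canary
      maxUnavailable: 0
      clusterSelector:
        matchLabels:
          rollout-stage: canary
    - name: production
      maxUnavailable: 10%
      clusterSelector:
        matchLabels:
          rollout-stage: production
```

Using a dedicated label with mutually exclusive values keeps each cluster in exactly one partition, which makes the rollout order easier to reason about than overlapping selectors.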


The Most Common Mistakes in Practice

1. Committing Secrets to Git in Plaintext
When setting things up initially, the temptation to hardcode secrets with a "just quickly" mindset is real. If you later need to migrate to External Secrets Operator or SOPS, you'll have to rewrite all existing secrets from scratch, and because the plaintext values remain in Git history, you'll need to rotate them as well. Getting familiar with the pattern of declaring ExternalSecret resources from the start is ultimately faster.

2. Deploying to the Entire Fleet at Once Without Progressive Delivery
The biggest risk with Fleet is Blast Radius. A single-line change can be deployed simultaneously to hundreds of clusters, so a pipeline that deploys in stages (staging group → canary group → production group) is absolutely necessary. Anyone who has experienced a fleet-wide failure understands viscerally why this matters.

3. Leaving Repository Structure as an Afterthought
Overhauling the directory structure after clusters have grown forces you to recreate all existing GitRepo resources. It's strongly recommended to establish the clusters/ + apps/ + infra/ separation structure when setting up initially.


Closing Thoughts

GitOps Fleet is a structural solution that absorbs the operational burden that scales proportionally with the number of clusters into a single Git workflow. Get secret management and Progressive Delivery right, and you can consistently operate dozens of clusters from a single repository.

Here are 3 steps you can take right now.

Step 1: Experience a local mini Fleet with kind — Create two clusters with kind create cluster --name hub and kind create cluster --name spoke-1, install Rancher Fleet or Flux CD on the hub, then create a simple GitRepo resource and deploy nginx to the spoke. Following the official Quick Start, you can see the Hub & Spoke flow in action with your own eyes within 30 minutes.

Step 2: Design your repository structure first — List your current clusters along with their environment (prod/staging) and region combinations, then sketch out the clusters/ + apps/ + infra/ separation structure on paper. This design determines how much pain you'll face later.

Step 3: Finalize your secret strategy on day one — It's strongly recommended to set up External Secrets Operator + AWS Secrets Manager (or Vault) from the beginning. Once you get comfortable declaring secrets as ExternalSecret resources instead of kubectl create secret, the rest of your GitOps setup will be much cleaner.
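
For Step 3, a declared secret could look like the sketch below. The store name, secret names, and remote key are all hypothetical, and it assumes a ClusterSecretStore for AWS Secrets Manager has already been configured; the API version served depends on your External Secrets Operator release.

```yaml
# ExternalSecret: declares which remote secret to sync, without storing its value in Git (sketch)
apiVersion: external-secrets.io/v1beta1   # assumption: check the version your ESO install serves
kind: ExternalSecret
metadata:
  name: app-db-credentials                # hypothetical name
  namespace: production
spec:
  refreshInterval: 1h                     # how often the value is re-read from the backend
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager             # hypothetical, pre-configured store
  target:
    name: app-db-credentials              # Kubernetes Secret created by the operator
  data:
    - secretKey: password                 # key inside the resulting Secret
      remoteRef:
        key: prod/app/db                  # hypothetical entry in AWS Secrets Manager
        property: password
```

Only this reference lives in Git; the actual value stays in the secret backend, so the manifest can be deployed across the fleet like any other resource.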


References

  • Rancher Fleet — Core Concepts — Official documentation covering Fleet's GitRepo, Bundle, and ClusterGroup concepts
  • GitHub: rancher/fleet — Source code and example YAML
  • AWS Prescriptive Guidance — Rancher Fleet — Fleet setup guide for AWS EKS environments
  • AWS Prescriptive Guidance — GitOps Tools Comparison — Comparative analysis of ArgoCD, Flux, and Fleet
  • Red Hat — Multi-cluster GitOps with Argo CD Agent (Technology Preview) — Introduction to the ArgoCD Agent architecture (Q4 2025 Technology Preview)
  • Platform Engineering — How to Scale GitOps in the Enterprise — Case studies on integrating GitOps Fleet into platform engineering at large organizations
  • AWS Blog — Multi-Cluster GitOps using Amazon EKS, Flux, and Crossplane — Detailed implementation of the Flux + Crossplane full-stack combination
  • Plural.sh — GitOps for Multiple Clusters: The Ultimate Guide — General overview of multi-cluster GitOps
  • ITNEXT — GitOps: Hub and Spoke Agent-Based Architecture with Sveltos — Sveltos agent-based Hub & Spoke implementation case study
  • Sveltos Official Documentation — Add-on management tool suited for edge and private network environments