Managing Kubernetes Multi-Cluster Operations with Rancher Fleet — A Pattern for Managing Dozens of Clusters from a Single Git Repo Without Drift
This article is written for DevOps/infrastructure engineers who have hands-on experience operating Kubernetes. It assumes familiarity with Helm, Kustomize, and kubectl.
When you only have one or two clusters, kubectl apply is enough. But once clusters start splitting by region and environment, a quiet anxiety sets in. "What's actually running in eu-west right now? Is the config different from us-east?" I started out deploying by looping over clusters in shell scripts, and one day I discovered that slightly different configurations had accumulated on each cluster. I only found out during an incident. That night's on-call shift ran pretty long.
This article covers GitOps multi-cluster operation patterns based on Rancher Fleet. It explains how the Hub & Spoke architecture works, how to structure your repository, practical fleet.yaml patterns you can apply immediately, and common mistakes — organized at a level where you can take this structure directly to your team.
Whether you're running 5 clusters or 50, if you feel like drift is accumulating in your current environment, this article should help.
Core Concepts
What Is a GitOps Fleet?
GitOps is an approach where a Git commit becomes the deployment intent. Instead of operators connecting directly to clusters and running commands, you declare the desired state in Git, and a controller continuously watches it and reconciles the actual cluster state to match. This loop is called Reconciliation — simply put, it's a continuous cycle of "if Git says so, make the cluster match."
Drift: The gap between the intent declared in Git (Desired State) and the actual cluster state (Actual State). It happens when someone quietly makes a change with kubectl edit, or when deployments are applied separately to each cluster. The classic symptom is "one cluster is running a different version than the others."
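To make the idea concrete, here is a minimal Python sketch of drift detection followed by one reconcile pass. It is an illustration only, not how Fleet or any other controller is implemented: the flat dictionaries, the `detect_drift` and `reconcile` names, and the example image tags are all hypothetical stand-ins for Git content and live cluster state.

```python
from typing import Dict, Optional, Tuple


def detect_drift(
    desired: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (desired_value, actual_value)} for every key that differs.

    None on either side means the key is missing there.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


def reconcile(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """One reconcile pass: whatever the cluster had, the result equals Git."""
    return dict(desired)


# Hypothetical example: eu-west was changed by hand with `kubectl edit`.
git_state = {"frontend/image": "app:1.4.2", "frontend/replicas": "3"}
eu_west = {
    "frontend/image": "app:1.4.1",
    "frontend/replicas": "3",
    "debug-pod/image": "busybox",
}

for key, (want, have) in detect_drift(git_state, eu_west).items():
    print(f"{key}: git={want!r} cluster={have!r}")

eu_west = reconcile(git_state, eu_west)
print(detect_drift(git_state, eu_west))  # no drift left after the pass
```

A real controller runs this comparison continuously rather than once, which is why a manual change gets reverted shortly after it is made.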
When you have a single cluster, one ArgoCD instance is sufficient. The problem arises when clusters grow to "Fleet" scale. If you manage ArgoCD separately on each cluster, you have to repeat agent upgrades, policy changes, and shared config deployments for every cluster. That is the problem a GitOps Fleet approach solves, and Rancher Fleet is one of the more mature implementations of it.
Hub & Spoke Architecture
Rancher Fleet operates with a Hub & Spoke structure. There is a central management cluster acting as the Hub, and Spoke clusters running the actual workloads are connected around it.
             ┌─────────────────────┐
             │ Management Cluster  │
             │        (Hub)        │
             │                     │
             │ GitRepo Controller  │
             │ Bundle Scheduler    │
             └──────────┬──────────┘
                        │  Watches Git state
       ┌────────────────┼────────────────┐
       ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   us-east    │ │   eu-west    │ │ ap-northeast │
│   (Spoke)    │ │   (Spoke)    │ │   (Spoke)    │
│              │ │              │ │              │
│ Fleet Agent  │ │ Fleet Agent  │ │ Fleet Agent  │
└──────────────┘ └──────────────┘ └──────────────┘
A lightweight Fleet Agent runs on each spoke cluster and pulls configuration from the hub to apply it. Because the agent initiates the connection toward the hub, spoke clusters behind private networks do not need to open inbound ports. This is the key difference from ArgoCD's default Push model, where the control plane connects directly to each cluster.
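The direction of that connection is easier to see in code. The following Python sketch is a toy model under stated assumptions, not Fleet's actual protocol: the `Hub` and `Agent` classes and their methods are hypothetical, and an in-process method call stands in for the network request the agent would make.

```python
from typing import Dict, List


class Hub:
    """Holds the bundles scheduled for each cluster; it never dials out to a spoke."""

    def __init__(self) -> None:
        self._assigned: Dict[str, List[str]] = {}

    def schedule(self, cluster: str, bundle: str) -> None:
        self._assigned.setdefault(cluster, []).append(bundle)

    def bundles_for(self, cluster: str) -> List[str]:
        # Served over a connection that the agent opened.
        return list(self._assigned.get(cluster, []))


class Agent:
    """Runs inside a spoke cluster and pulls its own work from the hub."""

    def __init__(self, cluster: str, hub: Hub) -> None:
        self.cluster = cluster
        self.hub = hub
        self.applied: List[str] = []

    def sync(self) -> List[str]:
        """Fetch assigned bundles and apply the ones not applied yet."""
        new = [b for b in self.hub.bundles_for(self.cluster) if b not in self.applied]
        self.applied.extend(new)
        return new


hub = Hub()
hub.schedule("eu-west", "frontend")
agent = Agent("eu-west", hub)
print(agent.sync())  # ['frontend']
print(agent.sync())  # []
```

Note that `Hub` holds no reference to any `Agent`. Every interaction starts on the spoke side, which is the property that lets spokes stay behind a firewall.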
4 Core Components
Grasp these four concepts and the rest of Fleet follows naturally. In practice, the most common source of confusion is the relationship between Bundle and ClusterGroup — think of Bundle as "what to deploy" and ClusterGroup as "where to deploy it."
| Component | Role | Analogy |
|---|---|---|
| GitRepo | Defines the Git repository to watch | "Subscribe to this Git address" |
| Bundle | The actual deployment unit (internally converted to a Helm release) | "A packaged deployment bundle" |
| ClusterGroup | A logical grouping of clusters based on labels | "All clusters tagged with the prod label" |
| Fleet Agent | The agent running on each spoke cluster | "The on-site operator" |
ClusterSelector: Selects clusters using the same mechanism as Kubernetes Label Selectors. Attach labels like region: us-east and env: production to your clusters, and Fleet automatically selects and deploys to those clusters.
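The matching rule behind matchLabels is small enough to sketch in Python. This is a minimal illustration of label-selector semantics, not Fleet's code; the `matches` and `select_clusters` helpers and the cluster names are hypothetical.

```python
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """matchLabels semantics: every selector pair must be present on the cluster."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_clusters(
    selector: Dict[str, str], clusters: Dict[str, Dict[str, str]]
) -> List[str]:
    """Return the names of all clusters whose labels satisfy the selector."""
    return sorted(name for name, labels in clusters.items() if matches(selector, labels))


# Hypothetical cluster inventory, keyed by cluster name.
clusters = {
    "prod-us-east": {"region": "us-east", "env": "production"},
    "prod-eu-west": {"region": "eu-west", "env": "production"},
    "stg-us-east": {"region": "us-east", "env": "staging"},
}

print(select_clusters({"env": "production"}, clusters))
# ['prod-eu-west', 'prod-us-east']
print(select_clusters({"region": "us-east", "env": "production"}, clusters))
# ['prod-us-east']
```

Extra labels on a cluster never prevent a match; only missing or different values do. That is why adding a new label to clusters is safe, while renaming an existing one can silently drop a cluster out of a target.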
A GitRepo resource looks like this. Apply this single YAML to the hub cluster and Fleet will begin watching the specified repository.
# GitRepo: applied to the hub cluster
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: fleet-app
  namespace: fleet-default
spec:
  repo: https://github.com/your-org/gitops-fleet
  branch: main
  paths:
    - clusters/production
  targets:
    - name: production
      clusterSelector:
        matchLabels:
          env: production
Notable Changes in 2025–2026
Honestly, this space has been moving quite fast over the past year or two.
ArgoCD Agent Architecture: The ArgoCD Agent released by Red Hat as a Technology Preview in Q4 2025 distributes core components to each cluster to overcome the limitations of a single control plane. It's targeting GA with OpenShift GitOps 1.19 in 2026, so it's worth watching if you're in an OpenShift environment.
OCI Registry Adoption: The pattern of packaging YAML manifests as immutable artifacts like container images and pushing them to OCI registries is spreading among large enterprises. Pulling a compressed single artifact instead of cloning an entire Git history improves network efficiency, and signing the artifacts strengthens supply chain security. Flux CD has strong support for this pattern.
Multi-tool Combinations: Instead of "one tool for everything," there's a growing trend of combining tools by role. Combinations like ArgoCD (app deployment) + Rancher Fleet (cluster config management) and Flux (infrastructure) + Sveltos (add-on management) are appearing in production environments.
Practical Application
Example 1: Multi-Region Production — Region-Specific Overlay Configuration
This is the most common scenario you'll encounter. You have clusters in the us-east, eu-west, and ap-northeast regions, you want to share a baseline app configuration, but apply different DB endpoints or resource sizes per region.
Using targetCustomizations in Rancher Fleet's fleet.yaml, you can apply overlays based on cluster labels. One important caveat: helm.valuesFiles inside targetCustomizations works additively on top of the helm.valuesFiles declared at the top level. So if you re-specify values.yaml inside each target, it will be applied twice.
# fleet.yaml: Helm overlay per region
defaultNamespace: production
helm:
  chart: ./charts/app
  valuesFiles:
    - values.yaml  # Shared defaults (automatically applied to all clusters)
targetCustomizations:
  - name: us-east
    clusterSelector:
      matchLabels:
        region: us-east
    helm:
      valuesFiles:
        - values-us-east.yaml  # Overlay added on top of the shared values.yaml
  - name: eu-west
    clusterSelector:
      matchLabels:
        region: eu-west
    helm:
      valuesFiles:
        - values-eu-west.yaml
  - name: ap-northeast
    clusterSelector:
      matchLabels:
        region: ap-northeast
    helm:
      valuesFiles:
        - values-ap-northeast.yaml
| Field | Description |
|---|---|
| defaultNamespace | The default namespace for deployment |
| helm.valuesFiles | Shared values files (applied to all clusters) |
| targetCustomizations | Overlay definitions per label matcher |
| clusterSelector.matchLabels | Conditions for selecting clusters where this config applies |
Changing the shared configuration applies it across all regions at once, while only the necessary parts are overridden in per-region files. This cleanly separates shared config from differences, so when more clusters are added later, it's as simple as adding one more file.
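To see what "shared values plus a regional overlay" produces, here is a minimal Python sketch of layered values merging. It assumes Helm-style behavior, where maps merge key by key and any other value, lists included, is replaced by the later layer. The `deep_merge` and `render_values` helpers and the sample values are hypothetical, and this is not Fleet's or Helm's own code.

```python
import copy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where overlay wins; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_values(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply values layers in order; later layers override earlier ones."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical contents of values.yaml and values-eu-west.yaml.
shared = {"replicas": 2, "db": {"host": "db.internal", "port": 5432}}
eu_west = {"db": {"host": "db.eu-west.internal"}, "resources": {"cpu": "500m"}}

print(render_values([shared, eu_west]))
# {'replicas': 2, 'db': {'host': 'db.eu-west.internal', 'port': 5432}, 'resources': {'cpu': '500m'}}
```

The regional file only needs to contain the keys that differ. Everything else, such as the database port here, flows through from the shared defaults.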
Example 2: Multi-Cluster Multi-Environment Repository Structure
Repository structure needs to be designed right from the start; otherwise it becomes unmanageable as clusters grow. I started with a flat directory layout and had to completely rework it once the cluster count went past double digits. I had to recreate all the existing GitRepo resources, and that process was quite painful.
gitops-fleet/
├── clusters/
│   ├── production/
│   │   ├── us-east/
│   │   │   └── fleet.yaml        # us-east prod cluster config
│   │   └── eu-west/
│   │       └── fleet.yaml        # eu-west prod cluster config
│   └── staging/
│       └── us-east/
│           └── fleet.yaml        # staging cluster config
├── apps/
│   ├── frontend/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   └── service.yaml
│   │   ├── values.yaml           # Shared defaults
│   │   ├── values-us-east.yaml   # Region-specific overlay
│   │   ├── values-eu-west.yaml
│   │   └── overlays/
│   │       ├── production/
│   │       │   └── kustomization.yaml
│   │       └── staging/
│   │           └── kustomization.yaml
│   └── backend/
│       ├── base/
│       ├── values.yaml
│       └── overlays/
│           ├── production/
│           └── staging/
└── infra/
    ├── monitoring/
    └── ingress/
The clusters/ directory defines "what to deploy to which cluster," while apps/ contains the Kustomize definitions and Helm values for actual applications. The combinations of environment (production/staging) and region (us-east/eu-west) are expressed as file paths. The two halves connect when clusters/production/us-east/fleet.yaml references apps/frontend/values-us-east.yaml.
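Because environment and region are encoded in the path, a small script can check the layout for gaps. The Python sketch below assumes the clusters/<env>/<region>/fleet.yaml convention shown above. The `labels_from_path` and `missing_combinations` helpers are hypothetical, and nothing like this ships with Fleet.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def labels_from_path(path: str) -> Dict[str, str]:
    """Derive env/region from a path shaped like clusters/<env>/<region>/fleet.yaml."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "clusters" or parts[3] != "fleet.yaml":
        raise ValueError(f"unexpected layout: {path}")
    return {"env": parts[1], "region": parts[2]}


def missing_combinations(paths: List[str], envs: List[str], regions: List[str]) -> List[str]:
    """List env/region pairs that have no fleet.yaml in the repository."""
    present = set()
    for path in paths:
        labels = labels_from_path(path)
        present.add((labels["env"], labels["region"]))
    return [
        f"{env}/{region}"
        for env in envs
        for region in regions
        if (env, region) not in present
    ]


# The three fleet.yaml files from the tree above.
paths = [
    "clusters/production/us-east/fleet.yaml",
    "clusters/production/eu-west/fleet.yaml",
    "clusters/staging/us-east/fleet.yaml",
]

print(labels_from_path(paths[0]))
# {'env': 'production', 'region': 'us-east'}
print(missing_combinations(paths, ["production", "staging"], ["us-east", "eu-west"]))
# ['staging/eu-west']
```

A check like this can run in CI so that a missing environment and region pair is an explicit decision, not an oversight.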
Example 3: Advanced — AWS EKS + Flux + Crossplane Full-Stack GitOps
This example requires prior knowledge to follow along. If you're new to Flux CD and Crossplane, it's recommended to read each tool's official documentation first before returning here.
When you want to handle the entire process from infrastructure provisioning to app deployment with GitOps, the combination of Flux CD and Crossplane fits well. The flow works like this: Crossplane sees a Cluster resource declared in Git and creates an EKS cluster; Flux automatically detects the kubeconfig for that cluster and starts deploying apps according to Kustomization resources. Adding a cluster spec to Git is all it takes to create a new cluster and deploy apps onto it.
# Crossplane: EKS cluster provisioning
# eks.aws.upbound.io/v1beta1 is an API provided by
# Upbound's AWS Provider (provider-aws-eks), not Crossplane core.
apiVersion: eks.aws.upbound.io/v1beta1
kind: Cluster
metadata:
  name: production-us-east
  labels:
    region: us-east
    env: production
spec:
  forProvider:
    region: us-east-1
    version: "1.29"
    # Other fields a real cluster needs (such as the IAM role and VPC settings) are omitted here.
  writeConnectionSecretToRef:
    namespace: crossplane-system
    name: eks-production-us-east
---
# Flux: automatic app deployment to the new cluster
# Once Crossplane creates the secret above, Flux references it to access the cluster.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-production
  namespace: flux-system
spec:
  interval: 5m
  path: ./clusters/production/us-east
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet-repo
  # Without kubeConfig, Flux applies these manifests to the cluster it runs in.
  # Assumption: a kubeconfig Secret with this name is available in this
  # Kustomization's namespace (flux-system), since Flux resolves secretRef locally.
  kubeConfig:
    secretRef:
      name: eks-production-us-east
| Stage | Tool | Role |
|---|---|---|
| Infrastructure Provisioning | Crossplane (Upbound AWS Provider) | Creates EKS cluster, configures VPC/node groups |
| GitOps Reconciliation | Flux CD | Synchronizes Git → cluster state |
| App Deployment | Flux Kustomization | Applies app manifests |
| Secret Management | External Secrets Operator | Integrates with AWS Secrets Manager |
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Consistency | All clusters reconcile against the same Git state, so configuration drift is detected and corrected instead of quietly accumulating |
| Auditability | Git commit history is the change log — track who changed what and when with git log |
| Automated Rollback | Restore previous state with just git revert |
| Large-Scale Expansion | Rancher Fleet can manage thousands of clusters from a single management cluster |
| Automation | Merging a PR triggers deployment — no need for anyone to run kubectl commands manually |
| Declarative Operations | Just declare the desired state and the system converges automatically |
Disadvantages and Caveats
I've been burned by some of these personally, so I've added honest notes on the ones I have direct experience with.
| Item | Details | Mitigation |
|---|---|---|
| Initial Complexity | The architecture can feel heavy for 2–3 clusters | Consider adopting when you have 5+ clusters or expect rapid growth |
| Secret Management | Cannot store plaintext secrets in Git — migrating later after "moving fast" is quite painful | Use SOPS + Age/GPG, HashiCorp Vault, or External Secrets Operator |
| Blast Radius | A bad commit can be deployed to the entire Fleet simultaneously — experiencing this once makes you acutely aware of the need for Progressive Delivery | A staged pipeline through staging group → canary group → production group is essential |
| Learning Curve | Requires learning Hub & Spoke, ClusterSelector, Bundle, and other native concepts | Practice first with a local kind cluster |
| Tool Lock-in | Choosing Rancher Fleet increases dependency on the Rancher ecosystem | If CNCF Graduated tools are preferred, consider Flux CD + Sveltos |
| GitRepo Structure Complexity | Overlay structure design becomes complex as environment × region combinations grow | Invest sufficient time in directory structure design from the start |
Progressive Delivery: A strategy of deploying a new version to a subset of traffic or clusters first, then gradually expanding if no issues are found, rather than deploying to all clusters at once. In GitOps Fleet, this can be implemented by dividing ClusterGroups into stages to control deployment order. Argo Rollouts or Flagger can be used alongside this approach.
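The staged ordering can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Fleet's rollout mechanism: the `staged_rollout` function, the group names, and the `deploy` callback (which stands in for "apply and wait for health checks") are all hypothetical.

```python
from typing import Callable, Dict, List


def staged_rollout(
    stages: List[str],
    groups: Dict[str, List[str]],
    deploy: Callable[[str], bool],
) -> List[str]:
    """Deploy stage by stage and stop before the next stage if any cluster fails.

    Every cluster in the current stage is attempted. Returns the clusters
    that were deployed successfully.
    """
    done: List[str] = []
    for stage in stages:
        results = {cluster: deploy(cluster) for cluster in groups[stage]}
        done.extend(cluster for cluster, ok in results.items() if ok)
        if not all(results.values()):
            print(f"halting after stage {stage!r}")
            break
    return done


# Hypothetical grouping: one staging cluster, one canary, then the rest of production.
groups = {
    "staging": ["stg-us-east"],
    "canary": ["prod-eu-west"],
    "production": ["prod-us-east", "prod-ap-northeast"],
}

# Simulate a bad commit that only breaks on the canary cluster.
healthy = staged_rollout(
    ["staging", "canary", "production"],
    groups,
    deploy=lambda cluster: cluster != "prod-eu-west",
)
print(healthy)  # ['stg-us-east']
```

The remaining production clusters are never touched in this run, which is the whole point: the blast radius of the bad commit is one canary cluster, not the fleet.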
The Most Common Mistakes in Practice
1. Committing Secrets to Git in Plaintext
When setting things up initially, the temptation to hardcode secrets with a "just quickly" mindset is real. If you later need to migrate to External Secrets Operator or SOPS, you'll have to rewrite all existing secrets from scratch. Getting familiar with the pattern of declaring ExternalSecret resources from the start is ultimately faster.
2. Deploying to the Entire Fleet at Once Without Progressive Delivery
The biggest risk with Fleet is Blast Radius. A single line of code can be deployed simultaneously to hundreds of clusters, so a pipeline that deploys in stages (staging group → canary group → production group) is absolutely necessary. Anyone who has lived through a fleet-wide failure understands why.
3. Leaving Repository Structure as an Afterthought
Overhauling the directory structure after clusters have grown forces you to recreate all existing GitRepo resources. It's strongly recommended to establish the clusters/ + apps/ + infra/ separation structure when setting up initially.
Closing Thoughts
GitOps Fleet is a structural solution that absorbs the operational burden that scales proportionally with the number of clusters into a single Git workflow. Get secret management and Progressive Delivery right, and you can consistently operate dozens of clusters from a single repository.
Here are 3 steps you can take right now.
Step 1: Experience a local mini Fleet with kind — Create two clusters with kind create cluster --name hub and kind create cluster --name spoke-1, install Rancher Fleet or Flux CD on the hub, then create a simple GitRepo resource and deploy nginx to the spoke. Following the official Quick Start, you can see the Hub & Spoke flow in action with your own eyes within 30 minutes.
Step 2: Design your repository structure first — List your current clusters along with their environment (prod/staging) and region combinations, then sketch out the clusters/ + apps/ + infra/ separation structure on paper. This design determines how much pain you'll face later.
Step 3: Finalize your secret strategy on day one — It's strongly recommended to set up External Secrets Operator + AWS Secrets Manager (or Vault) from the beginning. Once you get comfortable declaring secrets as ExternalSecret resources instead of kubectl create secret, the rest of your GitOps setup will be much cleaner.
References
- Rancher Fleet — Core Concepts — Official documentation covering Fleet's GitRepo, Bundle, and ClusterGroup concepts
- GitHub: rancher/fleet — Source code and example YAML
- AWS Prescriptive Guidance — Rancher Fleet — Fleet setup guide for AWS EKS environments
- AWS Prescriptive Guidance — GitOps Tools Comparison — Comparative analysis of ArgoCD, Flux, and Fleet
- Red Hat — Multi-cluster GitOps with Argo CD Agent (Technology Preview) — Introduction to the ArgoCD Agent architecture (Q4 2025 Technology Preview)
- Platform Engineering — How to Scale GitOps in the Enterprise — Case studies on integrating GitOps Fleet into platform engineering at large organizations
- AWS Blog — Multi-Cluster GitOps using Amazon EKS, Flux, and Crossplane — Detailed implementation of the Flux + Crossplane full-stack combination
- Plural.sh — GitOps for Multiple Clusters: The Ultimate Guide — General overview of multi-cluster GitOps
- ITNEXT — GitOps: Hub and Spoke Agent-Based Architecture with Sveltos — Sveltos agent-based Hub & Spoke implementation case study
- Sveltos Official Documentation — Add-on management tool suited for edge and private network environments