Engineering · November 1, 2025 · 7 min read

The Fleet Visibility Gap: Why Teams With 5+ Clusters Hit a Wall

Every tool that worked at 2 clusters breaks at 8. kubectl, Lens, k9s, and Headlamp are all single-cluster-at-a-time tools. Here's where the wall is and what it looks like.

Nadav Erell
CEO, Skyhook

It's the third time this week someone asked me which cluster our staging API actually runs in.

Not "is staging healthy." Not "what's the error rate." Just: where does it live. The answer turned out to be stg-eu-2, a cluster spun up four months ago for a GDPR workload that quietly became the default for anything EU-adjacent. Nobody wrote that down. The person who knew left the company in September.


If you're running one cluster, skip this post. If you're running two, bookmark it. If you're somewhere between five and fifty, you already know where this is going.

How you end up with twelve clusters without planning to

Nobody sits down and says "let's run twelve Kubernetes clusters." You start with one. Then you need a staging environment, because prod-shaped accidents are expensive. Two.

Then someone points out that running staging and prod on the same control plane defeats the point, so you split the cloud accounts. Three, if you count dev. Then a customer in Frankfurt needs data residency. Four. Then your biggest customer wants a dedicated tenant for compliance. Five. Then the platform team decides per-team clusters are cleaner than shared namespaces for the new ML workloads. Eight. Then you acquire a company and inherit their four-cluster GKE footprint. Twelve.

That's a real trajectory. I spoke to a platform lead at a 60-person company last month who has 14 clusters across 3 regions - EKS in us-east-1 and eu-west-1, GKE in asia-southeast1 for latency reasons, plus four "temporary" clusters that have existed for over a year. They have two engineers on the platform team.

The clusters aren't the problem. Kubernetes handles that part fine. The problem is that the tools you used to operate one cluster don't compose.

The single-cluster assumption baked into every tool

Walk through the standard debugging kit. Every single one of these assumes you're looking at one cluster at a time.

kubectl has contexts. That's the official answer. In practice you end up with a ~/.kube/config that's 400 lines long, a KUBECONFIG variable stitching together a handful of separate files, and a wall of shell aliases that looks something like this:

```shell
# ~/.zshrc, three months into the job
alias kprod-us="kubectl --context=arn:aws:eks:us-east-1:1234:cluster/prod-us"
alias kprod-eu="kubectl --context=arn:aws:eks:eu-west-1:1234:cluster/prod-eu"
alias kstg-us="kubectl --context=arn:aws:eks:us-east-1:1234:cluster/stg-us"
alias kstg-eu="kubectl --context=arn:aws:eks:eu-west-1:1234:cluster/stg-eu-2"
alias kdev="kubectl --context=kind-dev"
alias kacme="kubectl --context=gke_acme-prod_us-central1_acme"
# ... and six more

# the function everyone writes eventually
# (assumes you've run `kubectl config rename-context` to get short names)
kall() {
  for ctx in prod-us prod-eu stg-us stg-eu-2; do
    echo "=== $ctx ==="
    kubectl --context="$ctx" "$@"
  done
}
```

That kall function is a tell. It's what you write the first time you need to check "are any of our clusters seeing the same CrashLoopBackOff?" It works for trivial commands. It falls apart the moment you need to correlate anything, or the moment one context hangs for 30 seconds because the VPN dropped.
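That hung-context failure mode is fixable in shell, at the cost of more shell. A more defensive sketch, assuming coreutils `timeout` is installed and your contexts have been renamed to short names (the cluster names here are hypothetical):

```shell
# kall with a per-context timeout and parallel fan-out, so one dead VPN
# route doesn't stall the whole loop. Sketch only, not a fleet tool.
kall() {
  local ctx
  for ctx in prod-us prod-eu stg-us stg-eu-2; do
    (
      # each context gets at most 10 seconds before we give up on it
      out="$(timeout 10 kubectl --context="$ctx" "$@" 2>&1)"
      printf '=== %s ===\n%s\n' "$ctx" "$out"
    ) &
  done
  wait
}
```

It's better, and it's also the point: you're now maintaining concurrency and timeout logic in your dotfiles to paper over a tooling gap.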


k9s is excellent. I use it daily. But it shows one cluster at a time. You switch contexts with :ctx and the whole UI reloads - events, pods, the lot. There's no "show me every failing pod across my fleet." Not k9s's job, and it's honest about that.
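For the record, the fleet-wide failing-pods question can be approximated in shell. A sketch, with hypothetical short context names, that filters on the STATUS column rather than pod phase, so CrashLoopBackOff (which is still phase Running) actually shows up:

```shell
# Rough "every failing pod across the fleet" one-shot. Column 4 of
# `get pods -A --no-headers` output is STATUS.
kfailing() {
  local ctx
  for ctx in "$@"; do
    kubectl --context="$ctx" get pods --all-namespaces --no-headers 2>/dev/null \
      | awk -v c="$ctx" '$4 != "Running" && $4 != "Completed" { print c, $0 }'
  done
}

# usage: kfailing prod-us prod-eu stg-us stg-eu-2
```

It runs serially, has no history, and lies to you the moment a context hangs. Which is the shape of the problem.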

Lens and its forks (OpenLens, Freelens) technically let you add multiple kubeconfigs to the sidebar. Each one opens in a separate workspace pane. You can't see them all at once in any useful way, and switching between them triggers a full reload of the cluster state. On a machine with six clusters loaded, memory usage gets unpleasant. Lens itself carries the Mirantis cloud-login baggage we wrote about when we introduced Radar.

Headlamp supports multiple clusters better than most - it'll show them in the sidebar and you can click between them. But the views are per-cluster. There's no aggregated event feed, no cross-cluster search, no fleet dashboard. It's a competent single-cluster UI that tolerates being pointed at several kubeconfigs.

Per-cluster dashboards (Grafana, DataDog cluster view, the EKS console) work well for what they show, but they're separate dashboards. You end up with a bookmark folder of twelve URLs and a habit of opening them in sequence every morning.

The honest comparison

| Tool | Multiple clusters supported | Aggregate view | Persistent cross-cluster history | Who it's for |
| --- | --- | --- | --- | --- |
| kubectl | Via context switching | No | No | Everyone, always |
| k9s | One at a time (fast switch) | No | No | Terminal natives debugging one cluster |
| Lens / OpenLens | Multiple kubeconfigs loaded | No | No | GUI users on a single laptop |
| Headlamp | Sidebar-style multi-cluster | Partial | No | Teams who want a browser-based UI |
| Per-cluster dashboards | Yes, separately | No | Yes, per cluster | Ops teams with dedicated dashboards per env |

None of these are bad. They're solving the single-cluster problem well. The fleet problem is a different problem.

The hidden cost of not having fleet visibility

You don't notice the gap all at once. It accumulates.

Longer incidents. The on-call engineer gets paged, opens their laptop, and spends the first four minutes figuring out which cluster the alert came from. Was it prod-us or prod-eu-2? The alert says api-gateway but there are three of those across the fleet. By the time they've switched context, opened logs, and cross-referenced the deploy history, the customer impact is already on Twitter.

Missed signal. A ConfigMap change goes out to all production clusters via Argo CD. One of them rejects it because a CRD version is pinned to an older release. Nobody notices for six hours because the failure is buried in one cluster's events feed and nobody was looking at that tab.

Onboarding tax. The new engineer joins. They need to know how to connect to every cluster, which ones are safe to click around in, which ones have stricter RBAC, which ones have the weird Istio setup from 2023. The only documentation is a Notion page that's been out of date since February. They spend their first two weeks building a mental map that the rest of the team carries in their heads.

Context reconstruction. "What did we deploy to prod-eu on Tuesday?" is a question that should take five seconds. Without persistent cross-cluster history, it takes a Slack thread, a Git log, a CI run search, and someone eventually guessing.
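For contrast, here's roughly the best kubectl alone offers for that question: per-deployment rollout history, one cluster at a time. A sketch with hypothetical context and namespace names; note what's missing, namely timestamps, who triggered the change, and any cross-cluster view.

```shell
# Rollout history for every deployment in one namespace of one cluster.
# This is the raw material for "what did we deploy on Tuesday", and it
# still can't answer the "on Tuesday" part.
deploy_history() {
  local ctx="$1" ns="$2" d
  kubectl --context="$ctx" -n "$ns" get deployments -o name \
    | while read -r d; do
        echo "--- $d"
        kubectl --context="$ctx" -n "$ns" rollout history "$d"
      done
}

# usage: deploy_history prod-eu default
```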

These costs don't show up on a dashboard. They show up as fatigue, as longer MTTR, as "we should rebuild our platform" conversations that don't lead anywhere.

What a fleet tool needs to do

If you were designing for the multi-cluster case from the start, a few properties fall out naturally:

Aggregate view first, drill-down second. The default landing page should show every cluster you operate, with health, recent events, and workload counts. Clicking into a cluster gets you the familiar single-cluster view. Right now most tools flip that - you pick a cluster, then see its state.

Persistent timeline across restarts. Kubernetes events expire after an hour by default. If your tool only shows current events, you've lost the ability to answer "what happened overnight." Cross-cluster history needs to survive process restarts and be queryable by time range.
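The stopgap many teams reach for is a watch-and-append loop. A sketch with hypothetical context names; in practice you'd run it under a supervisor and rotate the files, which is exactly the kind of infrastructure a fleet tool should own for you:

```shell
# Crude event archiver: one watch per cluster, appended to dated JSONL
# files, so "what happened overnight" survives the default ~1h event TTL.
archive_events() {
  local ctx
  for ctx in "$@"; do
    kubectl --context="$ctx" get events --all-namespaces \
      --watch -o json \
      >> "events-${ctx}-$(date +%F).jsonl" &
  done
  wait
}

# usage: archive_events prod-us prod-eu stg-us stg-eu-2
```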

A shared context that isn't a kubeconfig. Kubeconfigs were designed for one user on one machine. For a team, you want shared enrollment - a cluster gets connected once, everyone authorized can see it, access is managed centrally. No more Slack DMs with kubeconfig.yaml attached.

Access control beyond kubeconfigs. RBAC at the cluster level is fine for services. Humans need something else: per-cluster and per-namespace scoping, roles that distinguish "can view" from "can exec into pods," and an audit trail of who did what.

Notifications that know about more than one cluster. A Slack message that says "pod X crashlooped in prod-eu-2" is useful. A Slack message that says "the same image hash just started crashlooping in three clusters, here's the correlation" is actually actionable.

Outbound-only connectivity. Nobody is opening an inbound port on their production cluster for a vendor dashboard. Any sane fleet tool has the agent dial out, not the reverse.

What we're working on

We've been using Radar internally at Skyhook for a while - a local, single-binary Kubernetes visibility tool. It's solved the single-cluster UX problem well enough that we're releasing it as open source in January 2026. You point it at your kubeconfig, it opens a dashboard in your browser, done. No account, no cloud.


A hosted version is next. Same UI, but the agent runs in each cluster and ships state to a shared backend, so the fleet view works the way it should - one page, every cluster, persistent history, shared access for the whole team. We're not ready to talk about that in detail yet. More when it's built.

In the meantime, if this post describes your week, you're not alone. The tooling gap is real. Recognizing it is the first step toward not papering over it with another wall of shell aliases.

kubernetes · multi-cluster · observability · platform-engineering · for-devops

Bring your first cluster online in 60 seconds.

Install the Helm chart, paste a token, see your cluster. No credit card required.

Apache 2.0 OSS · Unlimited clusters self-hosted · Hosted free tier for up to 3 clusters