Engineering · November 1, 2025 · 7 min read

The Fleet Visibility Gap: Why Teams With 5+ Clusters Hit a Wall

Every tool that worked at 2 clusters breaks at 8. kubectl, Lens, k9s, and Headlamp are all single-cluster-at-a-time tools. Here's where the wall is and what it looks like.

Nadav Erell
CEO, Skyhook

It's the third time this week someone asked me which cluster our staging API actually runs in.

Not "is staging healthy." Not "what's the error rate." Just: where does it live. The answer turned out to be stg-eu-2, a cluster spun up four months ago for a GDPR workload that quietly became the default for anything EU-adjacent. Nobody wrote that down. The person who knew left the company in September.


If you're running one cluster, skip this post. If you're running two, bookmark it. If you're somewhere between five and fifty, you already know where this is going.

How you end up with twelve clusters without planning to

Nobody sits down and says "let's run twelve Kubernetes clusters." You start with one. Then you need a staging environment, because prod-shaped accidents are expensive. Two.

Then someone points out that running staging and prod on the same control plane defeats the point, so you split the cloud accounts. Three, if you count dev. Then a customer in Frankfurt needs data residency. Four. Then your biggest customer wants a dedicated tenant for compliance. Five. Then the platform team decides per-team clusters are cleaner than shared namespaces for the new ML workloads. Eight. Then you acquire a company and inherit their four-cluster GKE footprint. Twelve.

That's a real trajectory. I spoke to a platform lead at a 60-person company last month who has 14 clusters across 3 regions - EKS in us-east-1 and eu-west-1, GKE in asia-southeast1 for latency reasons, plus four "temporary" clusters that have existed for over a year. They have two engineers on the platform team.

The clusters aren't the problem. Kubernetes handles that part fine. The problem is that the tools you used to operate one cluster don't compose.

The single-cluster assumption baked into every tool

Walk through the standard debugging kit. Every single one of these assumes you're looking at one cluster at a time.

kubectl has contexts. That's the official answer. In practice you end up with a ~/.kube/config that's 400 lines long, a KUBECONFIG variable stitching together a handful of separate files, and a wall of shell aliases that looks something like this:

```shell
# ~/.zshrc, three months into the job
alias kprod-us="kubectl --context=arn:aws:eks:us-east-1:1234:cluster/prod-us"
alias kprod-eu="kubectl --context=arn:aws:eks:eu-west-1:1234:cluster/prod-eu"
alias kstg-us="kubectl --context=arn:aws:eks:us-east-1:1234:cluster/stg-us"
alias kstg-eu="kubectl --context=arn:aws:eks:eu-west-1:1234:cluster/stg-eu-2"
alias kdev="kubectl --context=kind-dev"
alias kacme="kubectl --context=gke_acme-prod_us-central1_acme"
# ... and six more

# the function everyone writes eventually
# (assumes you've run `kubectl config rename-context` to get short names)
kall() {
  for ctx in prod-us prod-eu stg-us stg-eu-2; do
    echo "=== $ctx ==="
    kubectl --context="$ctx" "$@"
  done
}
```

That kall function is a tell. It's what you write the first time you need to check "are any of our clusters seeing the same CrashLoopBackOff?" It works for trivial commands. It falls apart the moment you need to correlate anything, or the moment one context hangs for 30 seconds because the VPN dropped.
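That hung-context failure mode is fixable in shell, at the cost of more shell. A more defensive sketch, assuming coreutils `timeout` is installed and your contexts have been renamed to short names (the cluster names here are hypothetical):

```shell
# kall with a per-context timeout and parallel fan-out, so one dead VPN
# route doesn't stall the whole loop. Sketch only, not a fleet tool.
kall() {
  local ctx
  for ctx in prod-us prod-eu stg-us stg-eu-2; do
    (
      # each context gets at most 10 seconds before we give up on it
      out="$(timeout 10 kubectl --context="$ctx" "$@" 2>&1)"
      printf '=== %s ===\n%s\n' "$ctx" "$out"
    ) &
  done
  wait
}
```

It's better, and it's also the point: you're now maintaining concurrency and timeout logic in your dotfiles to paper over a tooling gap.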


k9s is excellent. I use it daily. But it shows one cluster at a time. You switch contexts with :ctx and the whole UI reloads - events, pods, the lot. There's no "show me every failing pod across my fleet." Not k9s's job, and it's honest about that.
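For the record, the fleet-wide failing-pods question can be approximated in shell. A sketch, with hypothetical short context names, that filters on the STATUS column rather than pod phase, so CrashLoopBackOff (which is still phase Running) actually shows up:

```shell
# Rough "every failing pod across the fleet" one-shot. Column 4 of
# `get pods -A --no-headers` output is STATUS.
kfailing() {
  local ctx
  for ctx in "$@"; do
    kubectl --context="$ctx" get pods --all-namespaces --no-headers 2>/dev/null \
      | awk -v c="$ctx" '$4 != "Running" && $4 != "Completed" { print c, $0 }'
  done
}

# usage: kfailing prod-us prod-eu stg-us stg-eu-2
```

It runs serially, has no history, and lies to you the moment a context hangs. Which is the shape of the problem.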

Lens and its forks (OpenLens, Freelens) technically let you add multiple kubeconfigs to the sidebar. Each one opens in a separate workspace pane. You can't see them all at once in any useful way, and switching between them triggers a full reload of the cluster state. On a machine with six clusters loaded, memory usage gets unpleasant. Lens itself carries the Mirantis cloud-login baggage we wrote about when we introduced Radar.

Headlamp supports multiple clusters better than most - it'll show them in the sidebar and you can click between them. But the views are per-cluster. There's no aggregated event feed, no cross-cluster search, no fleet dashboard. It's a competent single-cluster UI that tolerates being pointed at several kubeconfigs.

Per-cluster dashboards (Grafana, DataDog cluster view, the EKS console) work well for what they show, but they're separate dashboards. You end up with a bookmark folder of twelve URLs and a habit of opening them in sequence every morning.

The honest comparison

| Tool | Multiple clusters supported | Aggregate view | Persistent cross-cluster history | Who it's for |
| --- | --- | --- | --- | --- |
| kubectl | Via context switching | No | No | Everyone, always |
| k9s | One at a time (fast switch) | No | No | Terminal natives debugging one cluster |
| Lens / OpenLens | Multiple kubeconfigs loaded | No | No | GUI users on a single laptop |
| Headlamp | Sidebar-style multi-cluster | Partial | No | Teams who want a browser-based UI |
| Per-cluster dashboards | Yes, separately | No | Yes, per cluster | Ops teams with dedicated dashboards per env |

None of these are bad. They're solving the single-cluster problem well. The fleet problem is a different problem.

The hidden cost of not having fleet visibility

You don't notice the gap all at once. It accumulates.

Longer incidents. The on-call engineer gets paged, opens their laptop, and spends the first four minutes figuring out which cluster the alert came from. Was it prod-us or prod-eu-2? The alert says api-gateway but there are three of those across the fleet. By the time they've switched context, opened logs, and cross-referenced the deploy history, the customer impact is already on Twitter.

Missed signal. A ConfigMap change goes out to all production clusters via Argo CD. One of them rejects it because a CRD version is pinned to an older release. Nobody notices for six hours because the failure is buried in one cluster's events feed and nobody was looking at that tab.

Onboarding tax. The new engineer joins. They need to know how to connect to every cluster, which ones are safe to click around in, which ones have stricter RBAC, which ones have the weird Istio setup from 2023. The only documentation is a Notion page that's been out of date since February. They spend their first two weeks building a mental map that the rest of the team carries in their heads.

Context reconstruction. "What did we deploy to prod-eu on Tuesday?" is a question that should take five seconds. Without persistent cross-cluster history, it takes a Slack thread, a Git log, a CI run search, and someone eventually guessing.
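For contrast, here's roughly the best kubectl alone offers for that question: per-deployment rollout history, one cluster at a time. A sketch with hypothetical context and namespace names; note what's missing, namely timestamps, who triggered the change, and any cross-cluster view.

```shell
# Rollout history for every deployment in one namespace of one cluster.
# This is the raw material for "what did we deploy on Tuesday", and it
# still can't answer the "on Tuesday" part.
deploy_history() {
  local ctx="$1" ns="$2" d
  kubectl --context="$ctx" -n "$ns" get deployments -o name \
    | while read -r d; do
        echo "--- $d"
        kubectl --context="$ctx" -n "$ns" rollout history "$d"
      done
}

# usage: deploy_history prod-eu default
```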

These costs don't show up on a dashboard. They show up as fatigue, as longer MTTR, as "we should rebuild our platform" conversations that don't lead anywhere.

What a fleet tool needs to do

If you were designing for the multi-cluster case from the start, a few properties fall out naturally:

Aggregate view first, drill-down second. The default landing page should show every cluster you operate, with health, recent events, and workload counts. Clicking into a cluster gets you the familiar single-cluster view. Right now most tools flip that - you pick a cluster, then see its state.

Persistent timeline across restarts. Kubernetes events expire after an hour by default. If your tool only shows current events, you've lost the ability to answer "what happened overnight." Cross-cluster history needs to survive process restarts and be queryable by time range.
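The stopgap many teams reach for is a watch-and-append loop. A sketch with hypothetical context names; in practice you'd run it under a supervisor and rotate the files, which is exactly the kind of infrastructure a fleet tool should own for you:

```shell
# Crude event archiver: one watch per cluster, appended to dated JSONL
# files, so "what happened overnight" survives the default ~1h event TTL.
archive_events() {
  local ctx
  for ctx in "$@"; do
    kubectl --context="$ctx" get events --all-namespaces \
      --watch -o json \
      >> "events-${ctx}-$(date +%F).jsonl" &
  done
  wait
}

# usage: archive_events prod-us prod-eu stg-us stg-eu-2
```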

A shared context that isn't a kubeconfig. Kubeconfigs were designed for one user on one machine. For a team, you want shared enrollment - a cluster gets connected once, everyone authorized can see it, access is managed centrally. No more Slack DMs with kubeconfig.yaml attached.

Access control beyond kubeconfigs. RBAC at the cluster level is fine for services. Humans need something else: per-cluster and per-namespace scoping, roles that distinguish "can view" from "can exec into pods," and an audit trail of who did what.

Notifications that know about more than one cluster. A Slack message that says "pod X crashlooped in prod-eu-2" is useful. A Slack message that says "the same image hash just started crashlooping in three clusters, here's the correlation" is actually actionable.

Outbound-only connectivity. Nobody is opening an inbound port on their production cluster for a vendor dashboard. Any sane fleet tool has the agent dial out, not the reverse.

What we're working on

We've been using Radar internally at Skyhook for a while - a local, single-binary Kubernetes visibility tool. It's solved the single-cluster UX problem well enough that we're releasing it as open source in January 2026. You point it at your kubeconfig, it opens a dashboard in your browser, done. No account, no cloud.


A hosted version is next. Same UI, but the agent runs in each cluster and ships state to a shared backend, so the fleet view works the way it should - one page, every cluster, persistent history, shared access for the whole team. We're not ready to talk about that in detail yet. More when it's built.

In the meantime, if this post describes your week, you're not alone. The tooling gap is real. Recognizing it is the first step toward not papering over it with another wall of shell aliases.

kubernetes · multi-cluster · observability · platform-engineering · for-devops

Bring your first cluster online in 60 seconds.

Install the Helm chart, paste a token, see your cluster. No credit card required.

Apache 2.0 OSS · Unlimited clusters self-hosted · Hosted free tier for up to 3 clusters