2025-11-27 Contra Kubernetes
It is so hard to argue against Kubernetes. It's popular, you can find people doing basically anything with it, and if someone decries it, it's easy to just say it's a skill issue.
And yet, after years of looking at, learning, and using Kubernetes, I still consider it the wrong solution to almost any problem. I constantly swing between "maybe it's not that bad" and "it's even worse than I thought". I do not, however, want my argument to rest on subjective impressions. So let's try to look at some facts.
Kubernetes is a good idea. Infrastructures are dynamic systems, where things often fail, and we have to be prepared for such failures. We want them to be handled automatically to the greatest extent possible, and approaching the problem as a cybernetic one makes a lot of sense.
Cybernetic systems are fundamentally both very simple and very generic: they are a graph of state machines, where each state machine has some inputs and an output that loops back into it. This loop is called a control loop, and it is used to bring the state machine to the desired state. Yay.
This description can be used for everything from infrastructure to the human mind. It also has a very inconvenient property: for the control loop to work at all, it has to be perfect, that is, it has to handle every state the system can end up in.
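To make that concrete, here's a minimal control loop sketched in Python. The "workers" list is a made-up stand-in for real resources; nothing here is any real system's API.

```python
import time

workers: list[str] = []  # toy stand-in for real resources

def converge(desired: int) -> None:
    actual = len(workers)  # observe the actual state
    if actual < desired:   # act to close the gap with the desired state
        workers.extend(f"worker-{i}" for i in range(actual, desired))
    elif actual > desired:
        del workers[desired:]
    # If reality can enter a state this function doesn't handle (say, a
    # worker that exists but is wedged), the loop silently stops
    # converging. That's what "it has to be perfect" means in practice.

while True:
    converge(3)  # constantly drive reality toward the spec
    time.sleep(10)
```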
Kubernetes: the good parts
But first, let's briefly discuss what's good about Kubernetes, so you won't think I'm just a guy who likes to hate on it.
First, it's popular, which is a very important thing for a bunch of reasons:
- If you want to find a job doing infrastructure, it's most likely going to require Kubernetes experience.
- There's a lot of tooling around it.
- Your knowledge somewhat transfers to other Kubernetes deployments.
But popularity isn't the only reason:
- It can help you quickly deploy a lot of software using Helm charts.
- It forces some good practices on both infrastructure engineers and developers.
- It offers a standardized API to manage resources.
- It makes it easier to shift the responsibility left (due to things above).
- It can automatically recover from some incorrect states.
Did I miss something? If so, do tell. I've asked a lot of people, done a lot of my own research, used it in production deployments, and can't think of other reasons to use Kubernetes. Not that these are unimportant. What we're going to consider, however, is the cost of having them.
Execution model
Consider a Pod. You tell Kubernetes that you have a Pod (which is now Pending). It says: okay, I don't see such a pod, I'll create one. So it creates a Pod, maybe downloads an image, then tries to run it. What happened to me, for example, was that it couldn't run the image, because the format was slightly invalid (for k8s, though not for e.g. podman). Of course, there was no way to learn that, as there was no error message. Maybe node logs would have helped, but then, what if it happens on a managed cluster?
That's just a bug, you'll say, and sure, it is, but it's a good example of how the abstraction can leak in nasty ways. And this is on top of imagePullPolicy, readiness probes, liveness probes, etc. Did you know you can run a command inside a container once it starts? It's called a lifecycle hook. Did you know that if it fails, you get no logs whatsoever and the container crashes? Have fun debugging that.
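For what it's worth, when this happens, about all you can do is scrape together whatever traces Kubernetes did leave. A sketch of that ritual, using standard kubectl commands (the pod name is made up):

```python
import subprocess

def debug_pod(name: str, namespace: str = "default") -> None:
    """Dump the places where Kubernetes might have left a clue about
    why a container failed to start."""
    def run(*args: str) -> None:
        subprocess.run(["kubectl", "-n", namespace, *args], check=False)

    # Events are often the only trace of a failed postStart hook
    # ("FailedPostStartHook") or an image that would not run.
    run("get", "events",
        "--field-selector", f"involvedObject.name={name}",
        "--sort-by", ".lastTimestamp")
    # The last terminated state sometimes carries an exit code and a
    # reason, even when there are no logs.
    run("get", "pod", name, "-o",
        "jsonpath={.status.containerStatuses[*].lastState}")
    # Logs of the previous container instance, if any survived the restart.
    run("logs", name, "--previous")

debug_pod("my-broken-pod")  # hypothetical pod name
```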
Kubernetes' model piles so much complexity onto containers that you unavoidably end up with hacks.
Now, while Pods are complex, they are also the simple case, for two reasons:
- Kubernetes can safely assume that it's the source of truth for all information about the containers
- All relevant information about the containers is managed by Kubernetes, so any container can be destroyed and recreated from its spec at will
Point 1 ensures that you don't have to care about most external factors, such as people manually creating containers on a node, changing their configuration in place, etc. I'll get back to it.
Point 2 starts being problematic with resources such as PersistentVolumeClaim, where you can't just delete and recreate the resource and expect things to be fine. Persistent volumes specifically are one of the major reasons Kubernetes is difficult to run on-prem, because they effectively require distributed storage, and there's no easy way to provide that. But more than anything else, persistent state just doesn't fit the cybernetic model all that well. You can't have a control loop that magically brings your data back, and so we now have to face difficult questions like "will this resource get recreated if I deploy this?".
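As an illustration, here's a sanity check I'd want before ever deleting a PVC: whether the data survives a delete-and-reapply depends on the bound PV's reclaim policy, and even with Retain, rebinding is a manual affair. A minimal sketch using real kubectl fields:

```python
import json
import subprocess

def volume_survives_recreate(pvc: str, namespace: str = "default") -> bool:
    """Deleting a PVC and re-applying the manifest only keeps the data
    if the bound PV's reclaim policy is Retain; otherwise the control
    loop happily hands you a brand new, empty volume."""
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pvc", pvc, "-o", "json"],
        check=True, capture_output=True, text=True)
    pv_name = json.loads(out.stdout)["spec"]["volumeName"]
    out = subprocess.run(
        ["kubectl", "get", "pv", pv_name, "-o",
         "jsonpath={.spec.persistentVolumeReclaimPolicy}"],
        check=True, capture_output=True, text=True)
    return out.stdout.strip() == "Retain"
```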
And now the fun part: operators. Consider a seemingly simple case: you want to manage databases with an operator. So you have a Database object, and a real database backing it. What could possibly go wrong?
- Someone creates a database manually, do we delete it?
- Someone deletes a Database object from Kubernetes, do we delete the backing database?
- The backing database is now gone and has to be restored from a backup, do we automatically do it? Do we automatically create a new empty one?
- The whole cluster has been recreated, what do we do with the Database objects that now point to the orphaned backing database? Is it definitely orphaned, though?
- Someone changed a configuration parameter for the backing database. What do we do with that?
- You've just recovered a backup, and the backing DBMS has one more database than your set of k8s objects would suggest. What do you do with it?
All of these questions are possible to answer. Every single one is probably easy to answer for a particular scenario. But the amount of information you have to keep in your head in order not to shoot yourself in the foot is staggering. And even if you do remember it all, does your operator pick the strategies you want? What if it doesn't?
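To see how much policy hides behind those questions, here's a sketch of what answering them explicitly would look like. This is not any real operator's API, and every helper function is hypothetical:

```python
from enum import Enum

class OrphanPolicy(Enum):
    ADOPT = "adopt"    # take ownership of databases we didn't create
    IGNORE = "ignore"  # pretend they don't exist
    DELETE = "delete"  # treat anything unmanaged as garbage

class DriftPolicy(Enum):
    REVERT = "revert"  # force the backing database back to the spec
    ACCEPT = "accept"  # fold the manual change back into the spec
    ALERT = "alert"    # do nothing and page a human

def reconcile(specs: dict, backing: dict, on_orphan: OrphanPolicy,
              on_drift: DriftPolicy, recreate_if_lost: bool) -> None:
    # Databases that exist in the DBMS but have no Database object.
    for name in backing.keys() - specs.keys():
        if on_orphan is OrphanPolicy.DELETE:
            drop_database(name)           # hypothetical; hope it was garbage
        elif on_orphan is OrphanPolicy.ADOPT:
            create_database_object(name)  # hypothetical helper
    for name, spec in specs.items():
        db = backing.get(name)
        if db is None:
            if recreate_if_lost:
                create_empty_database(spec)  # hypothetical: empty, not restored!
            else:
                raise RuntimeError(f"{name} lost; restore from backup")
        elif db["config"] != spec["config"]:
            if on_drift is DriftPolicy.REVERT:
                apply_config(name, spec)     # hypothetical helper
            elif on_drift is DriftPolicy.ALERT:
                page_a_human(name)           # hypothetical helper
```

Every parameter here is a default someone chose for you, usually without telling you.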
I regularly see people struggling to sidestep some automation that does the wrong thing. Even more often, people simply don't know what will happen, and testing it requires a lot of effort, because there's no way to tell Kubernetes to confirm its actions before taking them.
Somewhat related is the fact that there's no simple way to report unrecoverable errors in Kubernetes. While one might think that there shouldn't be all that many of them, consider for example that some resources might simply not be allowed to deploy due to some policy. Now, if you're deploying with a script, how are you going to deduce that the error isn't recoverable? Helm will simply time out with no feedback. Argo will show you which resource failed, but you have to look at it. And at any rate, it won't really know why.
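In practice you end up writing something like the sketch below: poll the rollout, and pattern-match event reasons against a hand-maintained list of things you've decided to treat as fatal, because Kubernetes itself won't classify them for you. The list here is guesswork, not an official taxonomy:

```python
import subprocess
import time

# Event reasons we *choose* to treat as fatal; Kubernetes doesn't
# classify them, so this list is guesswork that has to be maintained.
FATAL_REASONS = {"FailedCreate", "InvalidImageName", "FailedPostStartHook"}

def wait_for_rollout(deploy: str, namespace: str, timeout_s: int = 300) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        done = subprocess.run(
            ["kubectl", "-n", namespace, "rollout", "status",
             f"deployment/{deploy}", "--timeout=10s"],
            capture_output=True, text=True)
        if done.returncode == 0:
            return
        reasons = set(subprocess.run(
            ["kubectl", "-n", namespace, "get", "events", "-o",
             r'jsonpath={range .items[*]}{.reason}{"\n"}{end}'],
            capture_output=True, text=True).stdout.split())
        if fatal := reasons & FATAL_REASONS:
            raise SystemExit(f"unrecoverable: {fatal}")
    raise SystemExit("timed out; recoverable or not, who knows")
```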
There are three reasons for all of this mess:
- Kubernetes assumes that all errors are eventually recoverable, which is a pretty natural assumption for a cybernetic system. After all, it implements control loops for all possible states, right?
- Kubernetes does what it does automatically and asynchronously, stripping you both of control and feedback from the process. Again, a natural (and desirable) consequence of a cybernetic approach.
- Kubernetes, for all the flexibility and decoupling it tries to provide, is very rigid in a lot of places.
So in practice, people rely on a ton of workarounds for its limitations. None of this happens with non-cybernetic systems.
Culture
Kubernetes, being so complex, requires a large amount of software designed specifically to run it, from admin interfaces to network planes. This has two interesting consequences.
First, most of this software seeks to make money, and while open source, it's also prone to rug pulls. Remember Bitnami? They claim their charts are open source, and they are, but the charts also depend on proprietary, non-trivial Docker images. And so they stopped providing those images for free. Oh well. But that's the easy case.
What if your service mesh did that? Or some other component you simply cannot live without? Yes, you can migrate, but the effort to do so is enormous, and possibly includes setting up a new cluster from scratch and migrating everything to it.
That's one problem.
Another is the abysmal quality of this tooling. I mean, come on, Helm is the single worst idea anyone could have come up with, and yet, it's a major reason for people to like Kubernetes. After all, it makes it easy to deploy things! This is, of course, short-sighted. Something being easy to deploy doesn't necessarily make it easy to maintain. But saying "Helm makes things easy to deploy" isn't all that true either. It is easy, with default settings. But dare to change something the chart authors didn't think of. Yes, you can use a post-renderer, but it's a hack. Then again, the entirety of Helm is. And don't tell me I can use something else, it doesn't matter: everyone uses Helm. I could simply not use Kubernetes just as well.
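For the record, here's roughly what that hack looks like. Helm's post-renderer contract is just "manifests in on stdin, manifests out on stdout"; the script below (assuming PyYAML is installed, with a made-up team label standing in for whatever the chart didn't expose) would be invoked as `helm upgrade myapp ./chart --post-renderer ./patch.py`:

```python
#!/usr/bin/env python3
"""A Helm post-renderer: Helm pipes the fully rendered manifests to
stdin and deploys whatever we write to stdout."""
import sys
import yaml  # PyYAML

docs = [d for d in yaml.safe_load_all(sys.stdin) if d]
for doc in docs:
    # Bolt a label onto everything, because the chart never exposed one.
    doc.setdefault("metadata", {}).setdefault("labels", {})["team"] = "platform"
yaml.dump_all(docs, sys.stdout, explicit_start=True)
```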
And the same story goes for most other software. People who use Kubernetes have a penchant for choosing the absolute worst way of solving a problem. Well, that's par for the course for something originally built on Docker.
To reiterate: I know Kubernetes itself isn't the problem here. Alas, it's also unavoidable. Most of k8s' value comes from the ecosystem, and the ecosystem is what it is.
Operators
So if you want to actually make good use of Kubernetes, you'd better manage every resource with it. Many people, however, will tell you that operators are absolutely not a thing you should use. Why? Because of all the reasons above.
An operator is a piece of software that translates between the state of some external resource and an object Kubernetes understands (that is, an instance of a custom resource). It needs to react automatically to all state changes of both the k8s resource and the backing external one. While doing that, it needs to figure out how not to overload either side with API calls, how to make sure it doesn't generate invalid ones, and how to maintain security given that it probably runs with high privileges (in order to e.g. manage a DBMS).
That is to say, writing an operator is a very complex endeavor. You can't really go halfway, the framework you picked will probably get in the way more than it helps, and if you do manage to write a good one, it will either be very rigid, or will expose 2137 settings that make it difficult to configure.
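Here's a minimal sketch of what that looks like with kopf, a real Python operator framework. The example.com group, the Database kind, and all the helpers are hypothetical; note that even this skeleton already forces you to classify every error as retryable or terminal:

```python
import kopf

class TransientApiError(Exception):
    """Stand-in for a DBMS hiccup that's worth retrying."""

@kopf.on.create("example.com", "v1", "databases")
def on_create(spec, name, **_):
    # Must be idempotent: the event may be delivered more than once.
    if not backing_db_exists(name):    # hypothetical helper
        create_backing_db(name, spec)  # hypothetical helper

@kopf.on.update("example.com", "v1", "databases")
def on_update(spec, name, **_):
    try:
        apply_config(name, spec)       # hypothetical helper
    except TransientApiError as e:
        # Ask kopf to requeue the event and retry after a delay.
        raise kopf.TemporaryError(str(e), delay=30)
    except ValueError as e:
        # Unreconcilable change: give up instead of retrying forever.
        raise kopf.PermanentError(str(e))
```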
At any rate, your users will not be sure what it does, and when they are, they will likely want it to do something different occasionally.
Even if you solve all that, you've poured a lot of effort into it, for a relatively minor benefit, so now you'd like to get some money for your work. Of course, people are not exactly willing to pay, so you figure out a way to provide it for free for a time, and then pull the rug from under them. A real modern solution to open source funding!
So many people opt for simplicity and avoid operators altogether. But now they have to manage a heterogeneous infrastructure, which comes with its own can of worms.
Configuration management
Because see, once you have something external to your cluster, you'll probably have to connect to it somehow, or update it, or depend on it, or something. So now you have to coordinate your changes with the changes in the external world. Yes, you have to do the job of an operator anyway.
But! You probably thought GitOps was a good idea, so your whole cluster state is kind of set in stone. What's even worse, you've probably generated the relevant configuration file from a string template during build, so you'd have to go through the whole deploy pipeline just to change a parameter. So you invent ways for the configuration to remain constant, and make the changes in dynamic systems such as DNS instead. This is not automated configuration management. This is a cope.
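If that sounds abstract, here's the pattern in miniature, using Jinja2; the file names and parameters are made up:

```python
# Build-time templating in miniature: the value is frozen into an
# artifact, so changing it later means re-rendering, committing, and
# riding the whole pipeline again.
from jinja2 import Template

manifest = Template(open("deployment.yaml.j2").read()).render(
    db_host="db-3.internal.example.com",  # baked in at build time
    image_tag="v1.4.2",
)
open("deployment.yaml", "w").write(manifest)
```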
See, Kubernetes is terrible at configuration management, and for a good reason: it's a system designed for simple, stateless apps that can be configured via environment variables and whose external connections all point to cloud resources. In such a model, there's not much to worry about with regards to configuration management.
In most other contexts, however, the effort put into configuration management, regardless of how you do it, easily dwarfs that put into orchestration. If you have such a case and still tend to focus on Kubernetes' upsides, I'd say you're either lying to yourself, or have never seen a good infrastructure without Kubernetes.
Performance and debugging
Kubernetes' philosophy can be summed up as "let it crash". This is a deliberate choice, with tradeoffs that people responsible for it surely understand. After all, why debug a failure if you can just restart a container every ten minutes?
See, sometimes the problem doesn't cause a pod to crash. Instead, it makes the pod exhibit weird performance characteristics, and dynamically moving it around isn't going to help you track them down and fix them.
In general, dynamic systems are much harder to reason about than static ones, so you need a ton of metrics and other tooling to support you. That's, of course, in addition to the distributed storage you probably have to run. All of this means that a Kubernetes cluster adds a lot of overhead, both in terms of physical resources used and cognitive load.
And last but not least, consider the primary objective of an orchestrator: to pack workloads tightly and utilize all available resources. The thing is, though, this only makes sense if you don't care about the response times of your apps. Because if you do, you'll want your CPUs to have good single-thread performance, which means relatively few cores, and in turn, whatever per-node overhead you pay will hurt more. But you'll also want your CPUs underutilized, to minimize contention. Try that out yourself and see what happens.
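Here's a crude version of that experiment, runnable anywhere Python runs: time a fixed unit of work on an idle machine, then again while every core is packed with busy neighbors, the way a bin-packing scheduler would pack them.

```python
import multiprocessing as mp
import time

def busy(stop) -> None:
    # A noisy neighbor: burns CPU until told to stop.
    while not stop.is_set():
        sum(i * i for i in range(10_000))

def timed_work() -> float:
    start = time.perf_counter()
    sum(i * i for i in range(5_000_000))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"idle machine:   {timed_work():.3f}s")
    stop = mp.Event()
    hogs = [mp.Process(target=busy, args=(stop,)) for _ in range(mp.cpu_count())]
    for h in hogs:
        h.start()
    print(f"packed machine: {timed_work():.3f}s")
    stop.set()
    for h in hogs:
        h.join()
```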
What if not Kubernetes?
In order to shut me up, people often retort with a question: what should you use, if not Kubernetes? Don't get me wrong, it's a valid question, but the premise is wrong: they expect something that does the same thing but is not Kubernetes. That would be pointless.
My answer is simple though: for configuration management, use NixOS. Oh, you don't like Nix? Fine, use Debian and Ansible, you'll eventually grow to appreciate Nix. Or not.
For orchestration, use a Python script. Yes, write it yourself. Yes, it's fine. No, you do not want a generic solution that tries to cover all possible use cases; you'll end up badly reimplementing half of Kubernetes. Maybe someday I'll write more on designing infrastructure management software that doesn't hurt in the long term. For now, a few pointers, followed by a small sketch of what such a script can look like.
- Use containers if you want, but I'd advise against using application containers, and specifically anything built using a Dockerfile, unless you wrote it yourself. The quality problems start there.
- Use boring technology, and if something is very Kubernetes-centric, it's probably not worth paying attention to anyway.
- Prioritize good user experience over readily available software.
- Have a vision for your infrastructure. If something doesn't fit your vision, either do not use it, or realign your vision (along with everything else) so that it fits.
- And most of all: remember to solve the problem you have, not the problem someone might have someday. You're not planet-scale, and you'll have enough time to adjust to growth.
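And since I promised a sketch: here's the shape of such a script, with placeholder host names and hypothetical helper commands on the hosts. The point is not the code, it's the properties: explicit steps, synchronous execution, a confirmation prompt, and a failure stopping the world.

```python
#!/usr/bin/env python3
import subprocess
import sys

HOSTS = ["app1.example.com", "app2.example.com"]  # placeholders

def ssh(host: str, *cmd: str) -> None:
    print(f"[{host}] {' '.join(cmd)}")
    subprocess.run(["ssh", host, *cmd], check=True)  # fail loudly, stop the world

def deploy(version: str) -> None:
    if input(f"deploy {version} to {len(HOSTS)} hosts? [y/N] ") != "y":
        sys.exit("aborted")
    for host in HOSTS:  # one host at a time: slow, observable, stoppable
        ssh(host, "myapp-fetch", version)  # hypothetical command on the host
        ssh(host, "systemctl", "restart", "myapp")
        ssh(host, "myapp-healthcheck")     # hypothetical; non-zero exit aborts

if __name__ == "__main__":
    deploy(sys.argv[1])
```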
That being said, if you don't like programming, please stay away from platform engineering. There are plenty of other fields that need you, in this one, you'll only make things worse and won't even realize it.
Summary
TL;DR: Kubernetes is complex and rigid, and requires a lot of hacks to make it work. Few people know how to use it well, and even they don't really have a choice other than to use low-quality solutions, because that's what everyone else does. Even if you somehow manage to do everything right, you end up with otherwise unnecessary runtime overhead; that is, your apps will run slower.
In most cases, you'll do yourself a favor if you avoid general-purpose orchestrators altogether and roll your own tooling that's well aligned with your needs.
Disagree? Prove me wrong.