Over the last decade, the industry has widely adopted the cloud as the rightful home for runtime environments. In some cases, it has completely replaced on-premises or edge compute environments. However, in the midst of the AI boom, we’re witnessing a resurgence of on-prem computing environments. Many companies want to leverage all that AI has to offer but must maintain the confidentiality of their own or their customers’ data, so naturally they are gravitating towards running their agents in self-hosted environments (typically Kubernetes).
What many people don’t foresee is how many hurdles on-prem and edge environments introduce. While working at Liatrio, I’ve encountered and overcome many of them. This short read offers insight into one specific example, as well as broader themes in these computing environments.
Who’s This For?
If you’re thinking about deploying to bare-metal Kubernetes with a GitOps tool like Argo CD or Flux CD, you’re in the right place. And even if you’re running bare metal without Kubernetes or GitOps, you’ll still get something out of this article. Now let’s jump into the problem and how we solved it.
The Problem
It’s easy to take for granted how much the cloud provides for free. For example, cloud workloads typically have a guaranteed internet connection and incredibly low latency. While some geographic locations are more prone to outages than others, on-prem and edge runtimes simply can’t offer those guarantees.

Keep that in mind when considering a tool like Argo CD: on-prem and edge environments aren’t as stable as the cloud, which means Argo CD might not always operate as expected. Argo CD runs in a Kubernetes cluster and continuously reconciles the manifests in that cluster against their desired state defined in Git. If Argo can’t reach Git because the device is offline, it can no longer do its job. While disconnected, Argo can no longer read the desired state for its resources, and depending on how long the device has been offline, it may also be unable to perform a reliable rollback. Additionally, container images can’t be pulled while the device is offline, so any workloads added to the cluster just before it went down are dead in the water.
To put it simply: When the cluster is offline, Argo can’t reconcile drift. Any changes it does make might break, and if they do, it likely can’t even roll them back.
Why is this important? Is this really a big deal just because Argo CD gets some new failure modes when we can’t rely on the foundation of the cloud? No, of course not. This is a big deal because edge and on-prem computing environments are brittle in ways many people don’t consider until it’s too late. You can do everything right and still experience downtime of your critical services because a nasty storm cloud takes out your power or kills your internet connectivity. At the edge, the only guarantee we have is unpredictability, which demands more clever engineering.
Our Solution
The goal here isn’t to drag you through a bunch of architecture design decisions. But it also wouldn’t be fair to describe the problem without sharing what actually solved it.
In environments that aren’t always connected (which is common for on-prem and edge), reliability comes down to one thing: making sure the system has everything it needs locally. This makes caching the name of the game. Our services must be able to access all their assets, with or without an internet connection, to remain operational. Whether you’re trying to keep Argo CD functional or maintain a first-class observability experience, you’ll likely need to leverage caching quite a bit.
To keep deployments reliable, we needed Argo CD to stay fully functional even without a steady connection to upstream Git. We did this by running a local Git server in the cluster alongside Argo CD, which served as a pull-through cache for the upstream repository. The Git server obviously needs an internet connection to read from upstream, but Argo can access the local Git server at any point, even when the environment is air-gapped. This meant Argo could successfully reconcile cluster drift and perform rollbacks at any time.
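As a rough sketch of how this wiring looks, an Argo CD Application can simply point its `repoURL` at the in-cluster mirror instead of the upstream remote. The service name, namespace, and repo path below are hypothetical placeholders, not the exact names from our setup:

```yaml
# Sketch: an Argo CD Application whose source is a local pull-through
# Git cache running inside the cluster. Because the mirror is reachable
# over the cluster network, Argo can reconcile and roll back even when
# the device has no internet connectivity.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: edge-workloads
  namespace: argocd
spec:
  project: default
  source:
    # In-cluster Git mirror (placeholder address); it periodically
    # fetches from the upstream repository whenever connectivity exists.
    repoURL: http://git-mirror.git-system.svc.cluster.local/platform-manifests.git
    targetRevision: main
    path: clusters/edge
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The mirror itself stays fresh through periodic fetches from upstream while the device is online; when connectivity drops, Argo simply keeps reconciling against the last state the mirror saw.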
We also needed to ensure the correct container images were available before anything started. Instead of relying on last-minute pulls, we packaged a versioned set of images alongside each release of our manifests. During deployment, Argo CD downloads the bundle in advance as part of the sync process and only then proceeds with the rollout. It still takes connectivity to fetch the bundle when updates are available, but it prevents partial deployments and makes releases far more predictable. Once this change is applied, workloads can start cleanly and remain healthy even if the device goes offline unexpectedly.
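One way to express "fetch the bundle before the rollout" in Argo CD terms is a PreSync hook that mirrors the release's pinned images into a local registry before the sync proceeds. This is a sketch under assumed names (the registry addresses, image names, and tags are placeholders), not necessarily the exact mechanism we used:

```yaml
# Sketch: a PreSync hook Job that copies this release's pinned image
# set into an in-cluster registry before Argo CD applies the new
# manifests. If this Job fails (e.g. no connectivity), the sync halts
# rather than producing a partial deployment.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: prefetch-images-
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prefetch
          image: quay.io/skopeo/stable:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Mirror each pinned image for this release into the
              # local registry (all names below are placeholders).
              for img in app-api:1.4.2 app-worker:1.4.2; do
                skopeo copy --dest-tls-verify=false \
                  docker://registry.example.com/acme/$img \
                  docker://local-registry.registry.svc.cluster.local:5000/acme/$img
              done
```

With the node's container runtime pointed at the local registry, pod starts and restarts after an outage pull from the cache rather than the internet.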
Takeaway
If there’s one takeaway from this, it’s that edge and on-prem environments don’t reward optimism; they reward preparation. If you plan to run critical services in these environments, assume connectivity will be inconsistent, dependencies won’t always be reachable, and recovery will need to operate under less-than-ideal conditions. That shift in mindset changes how you design, deploy, and define “reliable.”
The good news is that the constraints are also what make this work interesting. Building for the edge forces practical, resilient solutions that hold up in the real world. As more workloads move closer to where data is generated and decisions are made, these lessons will continue to shape what modern infrastructure looks like outside the cloud.
For a much more in-depth look at what you just read, check out our ArgoCon talk!
Akuity also presented a similar topic in a talk at the larger KubeCon event.
