Restore Kubernetes Objects from etcd Without Downtime
Did you know you can recover deleted Kubernetes resources from etcd snapshots without downtime or a cluster rollback? Most people don't realize it, and it's surprisingly simple.
Over the years of maintaining Kubernetes infrastructure, I've encountered several situations where something small caused noticeable disruption: a deleted ConfigMap, a misapplied manifest, or a CI job wiping out a critical resource. One such case occurred in our QA environment, where a teammate accidentally deleted a ConfigMap needed by a test application. The app began failing silently, and our CI pipelines were blocked mid-run. A full etcd restore was technically an option, but restoring the entire cluster state just to recover one object didn't make sense.
That incident led me to adopt a more focused recovery strategy, one that addresses exactly what’s broken without touching anything else. This post outlines that approach: extracting and restoring individual Kubernetes resources from an etcd snapshot. If you're responsible for cluster stability and need minimal-impact recovery, this method is worth integrating into your operational toolkit.
etcd is the central datastore for every object in a Kubernetes cluster. A snapshot restore doesn't just replace a single resource; it rewinds the entire cluster to an earlier state. That rollback can have unintended consequences: orphaned pods, outdated secrets, or corrupted controller caches.
In high-availability environments, the goal is often to fix a single broken piece without disturbing the rest of the system. This is where surgical recovery comes in. Instead of resetting the entire state, we target exactly what was lost, recover it, and leave everything else untouched.
This guide explains how to extract and restore specific resources, such as ConfigMaps, Secrets, and Deployments, from an etcd snapshot. You'll learn how to:
Mount a snapshot locally and run a throwaway etcd instance
Navigate etcd's internal structure to locate the exact resource
Decode the binary etcd values into clean YAML
Reintroduce only the affected resource back into the live cluster
This workflow avoids the need for a full cluster rollback and reduces both downtime and risk.
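To make those steps concrete, here's a minimal sketch of the whole flow, assuming a snapshot at /backups/etcd-snapshot.db and a lost ConfigMap named app-config in the qa namespace. All paths, names, and ports are placeholders, and flags may differ slightly across etcd versions.

# 1. Materialize the snapshot into a scratch data directory.
etcdctl snapshot restore /backups/etcd-snapshot.db --data-dir /tmp/etcd-restore

# 2. Start a throwaway etcd against that directory on a non-default client port.
etcd --data-dir /tmp/etcd-restore --listen-client-urls http://127.0.0.1:2479 --advertise-client-urls http://127.0.0.1:2479 &

# 3. Locate the key for the lost object.
etcdctl --endpoints=http://127.0.0.1:2479 get /registry/configmaps/qa/ --prefix --keys-only

# 4. Dump the raw value and decode it to YAML with auger.
etcdctl --endpoints=http://127.0.0.1:2479 get /registry/configmaps/qa/app-config --print-value-only | auger decode > app-config.yaml

# 5. Reapply only that object to the live cluster.
kubectl apply -f app-config.yaml

The key layout follows /registry/<resource>/<namespace>/<name>, so the same pattern works for Secrets, Deployments, and most other namespaced objects.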
If your cluster uses encryption at rest (for example, with the KMS or aescbc providers), the values in the snapshot are stored as ciphertext; etcd itself never decrypts them. You'll need the cluster's EncryptionConfiguration and keys, or access to the KMS plugin, to decode the extracted data.
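A quick way to check whether a stored value is encrypted is to look at its prefix. A small sketch against the throwaway instance from above, with a hypothetical Secret key path:

# Peek at the first bytes of the stored value (the key path is a made-up example).
etcdctl --endpoints=http://127.0.0.1:2479 get /registry/secrets/qa/app-credentials --print-value-only | head -c 40

# Unencrypted objects start with the raw "k8s" protobuf magic and decode directly with auger;
# encrypted ones start with a provider prefix such as k8s:enc:aescbc:v1:<key-name>: or k8s:enc:kms:v1:<plugin>:
# and must be decrypted with the cluster's encryption keys before auger can read them.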
Precision recovery from etcd snapshots is a critical skill that’s often overlooked in production operations. Having the ability to restore exactly what’s broken, not more, not less, saves time, reduces blast radius, and builds confidence in incident response procedures.
I’ve used this method in real-world scenarios, and it’s proven to be one of the most effective tools for stabilizing clusters without introducing new risk. Whether you’re supporting a high-availability production cluster or simply want a safer recovery strategy, this is one workflow you should have on hand.
FAQs
Why shouldn't I restore the full etcd snapshot to recover a single resource?
A full etcd restore reverts the entire cluster state, which can lead to unintended side effects like outdated secrets, orphaned pods, or controller cache inconsistencies. It's better to surgically extract and reapply only the missing resource to avoid unnecessary disruption.
What tools are required to restore a specific Kubernetes object from an etcd snapshot?
You need:
etcdctl (v3.4 or higher)
A recent etcd snapshot
auger (to decode etcd entries)
kubectl (to reapply the resource)
How do I locate a Kubernetes object inside an etcd snapshot?
You must:
Launch a temporary local etcd instance from the snapshot
Use etcdctl to list keys (e.g., /registry/configmaps/<namespace>/<name>)
Extract the binary value and decode it using auger to produce valid YAML
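For example, against the temporary instance you can walk the key space by prefix until you find the object (endpoint, namespace, and resource are placeholders):

# List keys by prefix to see how objects are laid out; on a real cluster this can be long.
etcdctl --endpoints=http://127.0.0.1:2479 get /registry --prefix --keys-only | head -50

# Narrow down by resource type and namespace.
etcdctl --endpoints=http://127.0.0.1:2479 get /registry/configmaps/qa --prefix --keys-only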
Can I recover a resource to a different namespace or environment?
Yes. You can modify the decoded YAML to change the namespace field before applying it. This is useful for restoring production data into staging or test environments.
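One way to do this, assuming mikefarah's yq v4 is installed and reusing the app-config.yaml produced by auger in the earlier sketch (both are placeholders):

# Point the decoded object at a different namespace.
yq -i '.metadata.namespace = "staging"' app-config.yaml

# Dropping instance-specific metadata usually avoids API-server rejections when the object is recreated.
yq -i 'del(.metadata.uid) | del(.metadata.resourceVersion) | del(.metadata.creationTimestamp) | del(.metadata.managedFields)' app-config.yaml

kubectl apply -f app-config.yaml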
What precautions should I take before and after performing this recovery?
Before: Back up the current cluster state
During: Run kubectl apply --dry-run=client to validate the resource
After: Clean up the temporary etcd instance to avoid running stray services locally
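The same checklist as commands, reusing the placeholder names from the earlier sketches:

# Before: save what currently exists in the target namespace, in case the reapply needs to be undone.
kubectl get configmaps,secrets,deployments -n qa -o yaml > pre-restore-backup.yaml

# During: validate the decoded manifest without touching the cluster.
kubectl apply --dry-run=client -f app-config.yaml

# After: stop the throwaway etcd and remove its scratch data directory.
pkill -f '/tmp/etcd-restore' || true
rm -rf /tmp/etcd-restore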