Restore Kubernetes Objects from etcd Without Downtime
Over the years of maintaining Kubernetes infrastructure, I’ve encountered several situations where something small caused noticeable disruption: a deleted ConfigMap, a misapplied manifest, or a CI job wiping out a critical resource. One such case occurred in our QA environment, where a teammate accidentally deleted a ConfigMap needed by a test application. The app began failing silently, and our CI pipelines were blocked mid-run. A full etcd restore was technically an option, but restoring the entire cluster state just to recover one object didn’t make sense.
That incident led me to adopt a more focused recovery strategy, one that addresses exactly what’s broken without touching anything else. This post outlines that approach: extracting and restoring individual Kubernetes resources from an etcd snapshot. If you're responsible for cluster stability and need minimal-impact recovery, this method is worth integrating into your operational toolkit.
Why a Full Restore Should Be the Last Resort¶
etcd is the central datastore for every object in a Kubernetes cluster. A snapshot restore doesn’t just replace a single resource; it rewinds the entire cluster to an earlier state. That rollback can have unintended consequences: orphaned pods, outdated secrets, or corrupted controller caches.
In high-availability environments, the goal is often to fix a single broken piece without disturbing the rest of the system. This is where surgical recovery comes in. Instead of resetting the entire state, we target exactly what was lost, recover it, and leave everything else untouched.
What This Method Enables¶
This guide explains how to extract and restore specific resources, such as ConfigMaps, Secrets, and Deployments, from an etcd snapshot. You’ll learn how to:
- Mount a snapshot locally and run a throwaway etcd instance
- Navigate etcd's internal structure to locate the exact resource
- Decode the binary etcd values into clean YAML
- Reintroduce only the affected resource back into the live cluster
This workflow avoids the need for a full cluster rollback and reduces both downtime and risk.
Prerequisites¶
Before starting, make sure you have access to the following tools:
- etcd v3.4 or higher
- etcdctl (etcd’s CLI interface)
- auger (for decoding binary values from etcd snapshots into readable YAML)
- kubectl (for applying Kubernetes objects)
- A recent etcd snapshot file (live-cluster-snapshot.db)
Create a backup of your current state before making any changes:
etcdctl snapshot save live-cluster-snapshot.db
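On most production control planes, etcdctl needs explicit endpoints and TLS flags to reach the live etcd. A typical invocation looks like this (a sketch: the certificate paths assume a kubeadm layout and will differ on other distributions):
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save live-cluster-snapshot.db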
If you’re starting with a compressed snapshot:
gunzip live-cluster-snapshot.db.gz
Step 1: Restore the Snapshot to a Local Directory¶
Use etcdctl to unpack the snapshot into a temporary directory:
etcdctl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd
This creates a local copy of the cluster state as it existed at the time of the snapshot.
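If your etcd tooling is v3.5 or newer, the same offline restore is also available through etcdutl, which the etcd project now recommends for snapshot operations that don’t involve a live cluster:
etcdutl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd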
Step 2: Start a Temporary Local etcd Instance¶
This step launches etcd in standalone mode, purely for data inspection:
etcd --data-dir=recovery-etcd --listen-client-urls=http://localhost:2379 --advertise-client-urls=http://localhost:2379
Ensure it’s listening properly:
etcdctl --endpoints=localhost:2379 endpoint status
No need to join this instance to your live cluster. We’re only using it for read access to the snapshot data.
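If the machine you’re working on already has an etcd bound to port 2379 (a control-plane node, for example), give the throwaway instance a free port instead. The port below is just an example:
etcd --name recovery \
  --data-dir=recovery-etcd \
  --listen-client-urls=http://localhost:2479 \
  --advertise-client-urls=http://localhost:2479 &
etcdctl --endpoints=localhost:2479 endpoint status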
Step 3: Locate and Extract the Resource¶
To recover a specific ConfigMap, you need to know its etcd key path. For a ConfigMap named app-config in the production namespace:
etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only
This will return entries like:
/registry/configmaps/production/app-config
To fetch and decode it:
etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config --print-value-only | auger decode > app-config.yaml
The resulting file should look like a standard Kubernetes manifest:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  api-url: "https://api.example.com"
  log-level: "debug"
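The same pattern works for other resource types. For example, a Secret decodes the same way, with its data fields still base64-encoded as in any Secret manifest (the name app-credentials here is just a placeholder):
etcdctl --endpoints=localhost:2379 get /registry/secrets/production/app-credentials --print-value-only | auger decode > app-credentials.yaml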
Step 4: Apply It to the Live Cluster¶
First, run a dry test to ensure the YAML is valid:
kubectl apply -f app-config.yaml --dry-run=server
Then apply:
kubectl apply -f app-config.yaml
If the object was truly missing, you should see:
configmap/app-config created
In some cases, if the object exists in a broken state, you may need to delete and reapply.
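A minimal sketch of that delete-and-reapply path, assuming you’ve confirmed the decoded copy is the version you want back:
kubectl delete configmap app-config -n production
kubectl apply -f app-config.yaml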
Step 5: Clean Up¶
Once recovery is complete, tear down the temporary setup:
pkill etcd
rm -rf recovery-etcd app-config.yaml
Leaving stray etcd processes running on your system is never a good idea, even if they’re running locally.
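If the host also runs a real etcd (again, a control-plane node is the obvious case), a blanket pkill etcd is risky. Matching on the recovery data directory keeps the kill targeted:
pkill -f 'data-dir=recovery-etcd'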
etcd Key Patterns for Common Kubernetes Resources¶
Here are some quick references for locating other resource types in etcd:
| Resource Type | etcd Key Path Format |
|---|---|
| ConfigMap | /registry/configmaps/{namespace}/{name} |
| Secret | /registry/secrets/{namespace}/{name} |
| Deployment | /registry/deployments/{namespace}/{name} |
| Pod | /registry/pods/{namespace}/{name} |
| ServiceAccount | /registry/serviceaccounts/{namespace}/{name} |
| Custom resources | /registry/{group}/{resource}/{namespace}/{name} |
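If you’re not sure of the exact path for a resource, listing keys under a broader prefix shows what the snapshot actually contains; for example:
etcdctl --endpoints=localhost:2379 get --prefix "/registry/deployments/production" --keys-only
etcdctl --endpoints=localhost:2379 get --prefix "/registry/" --keys-only | head -50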
Handling More Complex Scenarios¶
Reapplying to a Different Namespace¶
You can easily repurpose a resource for a different namespace:
cat app-config.yaml | yq eval '.metadata.namespace = "dev"' | kubectl apply -f -
This is helpful in testing environments where you want to restore production data in isolation.
Encrypted etcd¶
If your cluster uses encryption at rest (for example with an aescbc or KMS provider), the values stored in etcd are ciphertext. The encryption is applied by the kube-apiserver through its EncryptionConfiguration, not by etcd itself, so auger cannot decode those values straight from the snapshot; you’ll need access to the encryption configuration and keys to decrypt them before they can be decoded.
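A quick way to tell whether a value is ciphertext is to peek at its prefix: encrypted values begin with a marker such as k8s:enc:aescbc:v1: or k8s:enc:kms:v2:, while unencrypted values are raw protobuf. The Secret path below is a placeholder:
etcdctl --endpoints=localhost:2379 get /registry/secrets/production/app-credentials --print-value-only | head -c 30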
Bulk Recovery¶
To extract every ConfigMap in a namespace into a single file:
etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --print-value-only | auger decode > all-configmaps.yaml
You can then review and selectively apply the relevant ones.
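If you prefer one file per object for easier review, a short loop over the keys works too (a sketch, assuming none of the key names contain whitespace):
etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only \
  | grep -v '^$' \
  | while read -r key; do
      name=$(basename "$key")
      etcdctl --endpoints=localhost:2379 get "$key" --print-value-only | auger decode > "configmap-${name}.yaml"
    done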
Final Thoughts¶
Precision recovery from etcd snapshots is a critical skill that’s often overlooked in production operations. The ability to restore exactly what’s broken, and nothing more, saves time, reduces blast radius, and builds confidence in incident response procedures.
I’ve used this method in real-world scenarios, and it’s proven to be one of the most effective tools for stabilizing clusters without introducing new risk. Whether you’re supporting a high-availability production cluster or simply want a safer recovery strategy, this is one workflow you should have on hand.
FAQs
Why shouldn't I restore the full etcd snapshot to recover a single resource?
A full etcd restore reverts the entire cluster state, which can lead to unintended side effects like outdated secrets, orphaned pods, or controller cache inconsistencies. It's better to surgically extract and reapply only the missing resource to avoid unnecessary disruption.
What tools are required to restore a specific Kubernetes object from an etcd snapshot?
You need:
- etcdctl (v3.4 or higher)
- A recent etcd snapshot
- auger (to decode etcd entries)
- kubectl (to reapply the resource)
How do I locate a Kubernetes object inside an etcd snapshot?
You must:
- Launch a temporary local etcd instance from the snapshot
- Use etcdctl to list keys (e.g., /registry/configmaps/<namespace>/<name>)
- Extract the binary value and decode it using auger to produce valid YAML
Can I recover a resource to a different namespace or environment?
Yes. You can modify the decoded YAML to change the namespace field before applying it. This is useful for restoring production data into staging or test environments.
What precautions should I take before and after performing this recovery?
- Before: Back up the current cluster state
- During: Run kubectl apply --dry-run=server to validate the resource
- After: Clean up the temporary etcd instance to avoid leaving stray services running locally