Most Kubernetes outages I've seen weren't infrastructure failures. They were graceful shutdown done wrong.

If you've ever seen dropped requests during a deployment, a pod that wouldn't die, or health checks failing during a rollout - this post is for you.

SIGTERM sounds simple. It's not.

What you think happens

The mental model most engineers carry:

mental model
kubectl delete pod  →  pod gets SIGTERM  →  pod shuts down  →  done

┌─────────────┐    SIGTERM     ┌─────────────┐    exit 0     ┌──────────┐
│   kubectl   │ ─────────────► │     Pod     │ ────────────► │  Gone    │
└─────────────┘                └─────────────┘               └──────────┘

                    Simple. Clean. Wrong.

What actually happens is messier, has a race condition baked in, and will bite you in production if you don't account for it.

The full termination sequence

When a pod is deleted (rolling update, node drain, or kubectl delete pod), Kubernetes kicks off a sequence that most engineers have never seen written down completely.

termination flow
                       kubectl delete pod
                              │
                              ▼
                        API Server
                        marks pod
                       "Terminating"
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
          kubelet        Endpoints        ReplicaSet
         sends SIGTERM   controller      controller
         to container    removes pod     detects count
              │          from Service    below desired
              │               │               │
              ▼               ▼               ▼
         App starts      kube-proxy       Scheduler
         shutdown        updates          places new
                        iptables         pod on node
                       (takes seconds)

timeline
t=0s   API Server marks pod Terminating in etcd

t=0s   THREE things happen simultaneously:

       1. kubelet sends SIGTERM to PID 1 in the container
          → your application's shutdown handler fires (if you wrote one)

       2. Endpoints controller removes pod IP from Service endpoints
          → kube-proxy starts updating iptables rules on every node
          → new connections stop being routed to this pod... eventually

       3. ReplicaSet controller detects pod count below desired
          → creates a new pod → Scheduler assigns it to a node

t=0–?  kube-proxy propagates iptables changes across all nodes
       (this takes seconds - it is NOT instant)

t=Xs   Your application finishes in-flight work and exits cleanly
       kubelet sees process exit → container removed

t=30s  If process hasn't exited by terminationGracePeriodSeconds (default 30s)
       kubelet sends SIGKILL → container forcefully terminated

Notice what happens at t=0. SIGTERM fires at the exact same moment the Endpoints controller starts removing the pod. The key word: starts.

The race condition nobody talks about

kube-proxy doesn't update iptables instantly. The kube-proxy instance on each node watches the Endpoints (or EndpointSlice) objects, detects the change, and rewrites the iptables rules on its own node. This takes seconds - sometimes 5-10 seconds on a large cluster.

During those seconds:

the race condition
t=0s  ┌─────────────────────────────────────────────────────────┐
      │ SIGTERM sent to pod         iptables update starts      │
      │ Pod: "I'm shutting down"    kube-proxy: "working on it" │
      └─────────────────────────────────────────────────────────┘

t=3s  ┌─────────────────────────────────────────────────────────┐
      │ Pod: refuses new connections                            │
      │ iptables: STILL routing traffic to this pod  ← BUG      │
      │ User: gets 502 / connection refused                    │
      └─────────────────────────────────────────────────────────┘

t=8s  ┌─────────────────────────────────────────────────────────┐
      │ kube-proxy: iptables updated, traffic stops             │
      │ Pod: already refusing connections (too late)            │
      └─────────────────────────────────────────────────────────┘

Your pod got SIGTERM and started refusing connections. But traffic is still being sent to it because kube-proxy hasn't caught up yet. This is the bug.

You can reproduce it yourself:

bash
# Watch your error rate during a rolling deploy
kubectl rollout restart deployment/your-app -n your-namespace

# In another terminal, hammer the service
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://your-service/health; sleep 0.1; done

# You'll see 502s and connection refused errors during the rollout
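If you would rather have counts than a scrolling list of status codes, the same probe can be sketched in Python. This is a minimal sketch using only the standard library. The URL, the probe count, and the function names probe_once and tally are placeholders for illustration, not part of any existing tool.

```python
import http.client
import time
import urllib.error
import urllib.request
from collections import Counter


def probe_once(url, timeout=2.0):
    """Return the HTTP status code as a string, or a label if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as err:
        return str(err.code)      # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return "conn_error"       # refused, reset, timed out, DNS failure


def tally(url, count=300, interval=0.1, fetch=probe_once):
    """Send `count` probes, `interval` seconds apart, and count the outcomes."""
    counts = Counter()
    for _ in range(count):
        counts[fetch(url)] += 1
        time.sleep(interval)
    return counts


# Example with a hypothetical service URL, run while the rollout is in progress:
# print(tally("http://your-service/health"))
```

Anything other than 200 in the result is a request a real user would have lost.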

The fix: preStop hook

The solution is elegant once you understand the problem. You need to delay your application's shutdown long enough for kube-proxy to finish propagating the iptables changes.

yaml
containers:
  - name: your-app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]   # wait for kube-proxy to catch up

before vs after
Without preStop:                    With preStop (sleep 15):

t=0s  SIGTERM → app shuts down      t=0s  preStop starts (sleep 15)
t=0s  iptables still routing   ←bug  t=0s  iptables update starts
t=3s  app refuses connections       t=8s  iptables update complete
t=3s  users get 502s           ←bug  t=15s preStop done → SIGTERM fires
t=8s  iptables finally updated      t=15s app shuts down gracefully
                                    t=15s no traffic arriving → no 502s ✅

By the time your application gets SIGTERM, no new requests are arriving. The race condition is gone.
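The arithmetic behind the before/after comparison is simple enough to write down. The sketch below is a toy model, not a measurement: error_window is a hypothetical helper, and it assumes every request routed to the pod after it stops accepting connections fails.

```python
def error_window(stops_accepting_at, routing_updated_at):
    """Seconds during which traffic is still routed to a pod that refuses it."""
    return max(0, routing_updated_at - stops_accepting_at)


# Numbers from the comparison above (seconds after the pod is marked Terminating).
without_prestop = error_window(stops_accepting_at=3, routing_updated_at=8)
with_prestop = error_window(stops_accepting_at=15, routing_updated_at=8)

print(without_prestop)  # 5
print(with_prestop)     # 0
```

The preStop sleep works by pushing the first argument past the second. The sleep only has to outlast propagation, not match it exactly.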

15 seconds works for most clusters. On larger clusters with many nodes, kube-proxy propagation can take longer. Measure it: watch your iptables update latency during a test rollout and tune the sleep value to match your actual environment.

Important
preStop time counts against terminationGracePeriodSeconds. If your grace period is 30s and preStop sleeps 15s, your application only has 15s to drain. Set your grace period accordingly: terminationGracePeriodSeconds: 60 gives you preStop(15s) + app drain(45s).
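The budget is easy to get wrong when the two values live in different parts of the manifest, so it can help to compute it explicitly. A minimal sketch, with hypothetical helper names and an optional safety margin that is my assumption rather than anything Kubernetes requires:

```python
def drain_budget(grace_period_s, prestop_s):
    """Seconds left for the app to drain after the preStop hook has finished."""
    return max(0, grace_period_s - prestop_s)


def required_grace_period(prestop_s, worst_case_drain_s, margin_s=0):
    """Grace period that covers the preStop sleep plus the worst-case drain."""
    return prestop_s + worst_case_drain_s + margin_s


print(drain_budget(30, 15))           # 15
print(drain_budget(60, 15))           # 45
print(required_grace_period(15, 45))  # 60
```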

Setting terminationGracePeriodSeconds

The default is 30 seconds. That's fine for a simple web server. It's not fine if your application has long-running operations.

Ask yourself: what is the longest operation my application can be in the middle of when it receives SIGTERM?

examples
Simple web server:     30s is plenty (requests are fast)
Database:              60-120s (transactions need to commit)
Video transcoding:     300s+ (can't restart mid-transcode)
LLM inference:         depends on your workload
                       (model loading + inference × max concurrent requests)

Set it based on your worst case, not your average case.

yaml
spec:
  terminationGracePeriodSeconds: 300

What your application needs to do

Kubernetes sends the signal. Your application has to handle it.

python — minimum viable shutdown handler
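import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # tell main loop to stop accepting work
    # do NOT call sys.exit here; let the main loop drain and clean up

# Placeholders: replace these three with your application's real logic
def handle_next_request():
    time.sleep(0.1)  # stand-in for one short, bounded unit of work

def flush_buffers():
    pass

def close_db_connections():
    pass

signal.signal(signal.SIGTERM, handle_sigterm)

# Main loop checks the flag between units of work, so each unit must
# return promptly (use timeouts on blocking calls)
while not shutdown_requested:
    handle_next_request()

# Cleanup runs after the loop drains
flush_buffers()
close_db_connections()
sys.exit(0)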

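What your handler should do: record the shutdown request and return. The main loop then stops accepting new work, finishes what is in flight, flushes buffers, closes connections, and exits with code 0.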

If PID 1 in your container doesn't handle SIGTERM, the signal is ignored. Kubernetes waits for terminationGracePeriodSeconds, then sends SIGKILL. Your application is force-killed with no chance to clean up - data loss, corrupted state, dropped requests.
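One detail a flag-based handler glosses over is blocking calls: if the main loop is parked in a blocking call when SIGTERM arrives, the flag gets set but nothing checks it until that call returns. Here is a sketch of one way around that, assuming work arrives on an in-process queue.Queue. The names stop, run_worker, process, and install_handler are illustrative, not from any framework.

```python
import queue
import signal
import threading

stop = threading.Event()


def handle_sigterm(signum, frame):
    stop.set()  # only record the request; run_worker does the draining


def install_handler():
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(jobs, process, poll_s=0.5):
    """Process jobs until shutdown is requested, then drain what is queued."""
    while not stop.is_set():
        try:
            job = jobs.get(timeout=poll_s)  # wake up regularly to check the flag
        except queue.Empty:
            continue
        process(job)
    # Drain: finish work accepted before shutdown, then return.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        process(job)
```

The timeout on jobs.get() bounds how long the loop can go without noticing the signal. Keep poll_s small relative to your drain budget.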

The PID 1 trap

This catches a lot of people.

dockerfile
# Wrong — runs your app as a child of sh, not as PID 1
CMD ["sh", "-c", "python app.py"]

# Right — your app IS PID 1, receives signals directly
ENTRYPOINT ["python", "-m", "your.app"]

When you use shell form (CMD python app.py) or wrap the command in sh -c yourself, the shell becomes PID 1. SIGTERM goes to the shell. The shell may or may not pass it on to your app - often it doesn't. In that case your app never gets the signal, never shuts down cleanly, and gets SIGKILL'd after the grace period.

Always use exec form in your Dockerfile ENTRYPOINT, with your application (not a shell) as the first element.

If you can't restructure the Dockerfile (third-party image, complex entrypoint script), use tini or dumb-init as a lightweight init process. They run as PID 1, forward signals correctly to child processes, and reap zombie processes. One line in your Dockerfile: ENTRYPOINT ["/tini", "--", "your-entrypoint.sh"].
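A cheap way to catch this early is to have the app report its own situation at startup. The sketch below is one assumption about how you might do that, not a standard tool: signal_setup_warnings is a hypothetical helper built on os.getpid() and signal.getsignal(). A non-1 PID is not automatically a problem (behind tini it is expected), so treat the output as a hint rather than an error.

```python
import os
import signal


def signal_setup_warnings(pid=None):
    """Return human-readable warnings about how SIGTERM will reach this process."""
    pid = os.getpid() if pid is None else pid
    warnings = []
    if pid != 1:
        warnings.append(
            f"running as PID {pid}, not PID 1: SIGTERM reaches this process "
            "only if the parent forwards it"
        )
    if signal.getsignal(signal.SIGTERM) == signal.SIG_DFL:
        warnings.append("no SIGTERM handler installed")
    return warnings


# Example: log the warnings once, after installing your handlers.
# for line in signal_setup_warnings():
#     print("shutdown check:", line)
```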

Readiness probe: your last line of defense

Even with preStop and a proper SIGTERM handler, you want your readiness probe to fail immediately on shutdown:

python
is_ready = True             # readiness endpoint returns 200 while this is True
shutdown_requested = False

def handle_sigterm(signum, frame):
    global is_ready, shutdown_requested
    is_ready = False           # readiness endpoint returns 503 from now on
    shutdown_requested = True  # tell main loop to stop accepting work
    # do NOT exit here; let the main loop drain first

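Once the probe has failed failureThreshold times in a row, the kubelet marks the pod NotReady and it is removed from Service endpoints. For a pod that is being deleted this is not a faster path: the endpoints controller already drops Terminating pods as soon as the deletion is recorded, and with a preStop sleep the handler above only runs after the sleep ends. What the failing probe buys you is coverage for anything else that watches readiness or calls the same health endpoint, for example a load balancer running its own health checks. Belt and suspenders.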
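To make the is_ready flag concrete, here is a minimal sketch of a readiness endpoint that flips to 503 on SIGTERM. It assumes the standard library HTTP server, a hypothetical /ready path and port 8081, and a readinessProbe pointed at them. None of these names come from Kubernetes itself.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

is_ready = True
shutdown_requested = threading.Event()


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":  # hypothetical probe path
            self.send_error(404)
            return
        ready = is_ready           # read the flag once per request
        body = b"ready\n" if ready else b"shutting down\n"
        self.send_response(200 if ready else 503)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def handle_sigterm(signum, frame):
    global is_ready
    is_ready = False          # the next probe gets 503
    shutdown_requested.set()  # main loop drains and exits


def install_handler():
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def start_probe_server(port=8081):
    """Serve /ready on a background thread and return the server object."""
    server = ThreadingHTTPServer(("0.0.0.0", port), ReadinessHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Running the probe endpoint on its own thread keeps it responsive while the main loop is busy draining.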

The complete picture

rolling update — zero dropped requests
Rolling update starts
│
├── t=0s  ────────────────────────────────────────────────────────
│         OLD POD                          NEW POD
│         preStop fires (sleep 15s)        starts booting
│         marked Terminating               readiness → 503
│         removed from endpoints           not yet in endpoints
│
├── t=10s ────────────────────────────────────────────────────────
│         OLD POD                          NEW POD
│         preStop still sleeping           passes readiness ✅
│         no new traffic arriving          added to endpoints
│                                          starts receiving traffic
│
├── t=15s ────────────────────────────────────────────────────────
│         OLD POD                          NEW POD
│         preStop done → SIGTERM fires     serving 100% of traffic
│         finishes in-flight requests
│         flushes state to storage
│         exits cleanly (code 0)
│
└── t=20s ────────────────────────────────────────────────────────
          OLD POD                          NEW POD
          container removed                serving 100% of traffic

          Zero dropped requests. Zero data loss. ✅

The checklist

Before your next deploy
  • ✅ Use exec form in Dockerfile ENTRYPOINT (not shell form)
  • ✅ Handle SIGTERM in your application
  • ✅ preStop sleep 15s (measure on large clusters) - let kube-proxy propagate
  • ✅ terminationGracePeriodSeconds = preStop + worst-case drain time
  • ✅ Readiness probe fails immediately on shutdown
  • ✅ Finish in-flight work before exiting
  • ✅ Exit with code 0

Miss any of these and you're relying on luck during your next deployment.

Most K8s outages I've seen weren't infrastructure failures. They were graceful shutdown done wrong. The good news: once you understand the sequence, it's completely preventable.