
Troubleshooting

Pod Not Hibernating

# Check idle timeout
kubectl get pod <pod-name> \
  -o jsonpath='{.metadata.annotations.architect\.loopholelabs\.io/scaledown-durations}'

# Verify container is managed
kubectl get pod <pod-name> \
  -o jsonpath='{.metadata.annotations.architect\.loopholelabs\.io/managed-containers}'

# Check status label
kubectl get pod <pod-name> \
  -o jsonpath='{.metadata.labels.status\.architect\.loopholelabs\.io/<container-name>}'

# Review daemon logs
kubectl logs -n architect -l app.kubernetes.io/name=architectd | grep <pod-name>

The Architect Console also shows per-pod events, timings, and detailed debugging info.

Pod Not Waking

# Test wake via exec
kubectl exec -it <pod-name> -- /bin/sh -c "echo test"

# Test wake via network
kubectl port-forward <pod-name> <port>:<port>
curl localhost:<port>

# Check events
kubectl describe pod <pod-name>

# Verify daemon is running on the pod's node
kubectl get pod <pod-name> -o wide
kubectl get pods -n architect -o wide | grep <node-name>
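
The last two commands can be combined so the node name does not have to be copied by hand. A sketch, assuming the daemon pods run in the architect namespace as in the log commands above; the function name architect_daemon_on_node is hypothetical.

```shell
# Hypothetical helper: lists pods in the architect namespace that are scheduled
# on the same node as the given pod.
# Usage: architect_daemon_on_node <pod-name>
architect_daemon_on_node() {
  node="$(kubectl get pod "$1" -o jsonpath='{.spec.nodeName}')"
  if [ -z "$node" ]; then
    echo "pod $1 is not scheduled on a node" >&2
    return 1
  fi
  # A field selector filters server-side, so no grep on the node name is needed.
  kubectl get pods -n architect -o wide --field-selector "spec.nodeName=$node"
}
```

If the output lists no pod in the Running state, the daemon is not available on that node to handle the wake.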

Scale Down and Wake

Health probes wake the container

If a managed container with health-check-proxy configured still wakes whenever the kubelet probes it:

# Confirm the sidecar was added
kubectl get pod <pod-name> \
  -o jsonpath='{.spec.containers[*].name}'

# Confirm probe ports target the shadow port, not the app port
kubectl get pod <pod-name> \
  -o jsonpath='{.spec.containers[?(@.name=="<container>")].livenessProbe}'

# Check the admission controller didn't skip the sidecar
kubectl logs -n architect -l app=architect-admission-controller \
  | grep -i 'health check proxy'

The first command lists every container in the pod; you should see architect-health-check-proxy alongside your application container, e.g.:

my-app architect-health-check-proxy

Checklist:

  • The probe's port field on each managed container must reference the shadowPort, not the appPort. Probes that still target the application port bypass the sidecar entirely.
  • Both the managed-containers and network-monitor annotations must be present. If either is missing, the admission controller logs a warning and skips sidecar injection.
  • The sidecar (architect-health-check-proxy) must be present in spec.containers. If it isn't, check admission controller logs.

Scrape traffic wakes the container

If a Prometheus scrape (or other external poller) wakes a managed container that has shadow-ports configured:

# Confirm the shadow port is on the container spec
kubectl get pod <pod-name> \
  -o jsonpath='{.spec.containers[?(@.name=="<container>")].ports}'

# Check the admission controller didn't skip the shadow ports
kubectl logs -n architect -l app=architect-admission-controller \
  | grep -i 'shadow ports'

The first command lists the container's ports; the shadow port appears with a shadow- name prefix, e.g.:

[{"containerPort":9090} {"containerPort":29090,"name":"shadow-29090","protocol":"TCP"}]

Checklist:

  • The scraper must target the shadowPort, not the appPort. Verify your ServiceMonitor, PodMonitor, or scrape_configs references the shadow port (named shadow-<port> on the container spec).
  • Both the managed-containers and network-monitor annotations must be present. If either is missing, the admission controller logs a warning and skips injection.
  • If you can't move the scraper to a new port, swap shadow-ports for ignore-activity-ports so the existing app port is exempted from activity tracking.

Sidecar fails to inject

If health-check-proxy is set but no sidecar appears on the pod:

kubectl logs -n architect -l app=architect-admission-controller \
  | grep -iE 'health check proxy|shadow ports'

Checklist:

  • The annotation JSON must parse — invalid JSON is logged and the feature is skipped.
  • managed-containers must list the container referenced in each mapping.
  • network-monitor must be set on the pod.
  • All ports must be in the 1–65535 range; mappings outside the range are dropped with a warning.
  • Each shadowPort value must be unique across mappings. When several mappings share a shadowPort, only the first is used; the others are dropped with a warning.
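
The port range and duplicate rules in the checklist can be pre-checked locally before the annotation is applied. A minimal sketch: the annotation's JSON schema is not shown in this section, so these hypothetical helpers take plain port numbers that you extract yourself.

```shell
# Succeeds only if the argument is an integer in the 1-65535 range.
# Usage: valid_port <port>
valid_port() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;  # empty or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Prints every port that appears more than once in the argument list.
# Usage: duplicate_ports <port>...
duplicate_ports() {
  printf '%s\n' "$@" | sort -n | uniq -d
}

# Example:
#   valid_port 70000 || echo "70000 is out of range"
#   duplicate_ports 29090 28080 29090
```

Any port printed by duplicate_ports corresponds to a mapping that would be dropped after the first one.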

High Wake Times

If wake times exceed 50 ms:

  • Check node CPU and memory availability; resource contention on the node slows restore
  • Large memory footprints produce larger checkpoints, which take longer to restore
  • Check daemon logs or the Architect Console for per-pod restore timings:
kubectl logs -n architect -l app.kubernetes.io/name=architectd --tail=500 \
  | grep -E "checkpoint|restore|error"
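
The node-level checks above can be sketched as a single helper. It reads the node's reported conditions (such as MemoryPressure) and current usage. The function name node_contention is hypothetical, and kubectl top assumes metrics-server is installed in the cluster.

```shell
# Hypothetical helper: shows node conditions and current CPU/memory usage for
# the node that hosts the given pod.
# Usage: node_contention <pod-name>
node_contention() {
  node="$(kubectl get pod "$1" -o jsonpath='{.spec.nodeName}')"
  [ -n "$node" ] || { echo "pod $1 is not scheduled on a node" >&2; return 1; }

  # One "Type=Status" line per node condition, e.g. MemoryPressure=False.
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

  # Assumption: metrics-server is available; this call fails without it.
  kubectl top node "$node"
}
```

A pressure condition reporting True, or usage close to the node's capacity, points to contention as a likely cause of slow restores.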

Checkpoint Failures

  • GPU workloads are not supported yet
  • Checkpoints use 50–200 MB per pod; check node disk space:
kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage
  • Verify runtimeClassName is set and the node has the architect.loopholelabs.io/node=true label
  • Check the Architect Console for checkpoint error details
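
The runtimeClassName and node label check can be done from the command line. A minimal sketch, with a hypothetical function name; the label key is the one named in the list above.

```shell
# Hypothetical helper: prints the pod's runtimeClassName and the value of the
# architect.loopholelabs.io/node label on the node that hosts it.
# Usage: checkpoint_prereqs <pod-name>
checkpoint_prereqs() {
  runtime="$(kubectl get pod "$1" -o jsonpath='{.spec.runtimeClassName}')"
  node="$(kubectl get pod "$1" -o jsonpath='{.spec.nodeName}')"
  label="$(kubectl get node "$node" \
    -o jsonpath='{.metadata.labels.architect\.loopholelabs\.io/node}')"

  printf 'runtimeClassName: %s\n' "${runtime:-<unset>}"
  printf 'node %s label:    %s\n' "$node" "${label:-<unset>}"
}
```

Both lines should show a value: a runtime class on the pod and true for the node label. <unset> on either line matches one of the failure causes listed above.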

Runtime Class Errors After Uninstall

Pods still referencing runc-architect or runsc-architect will error. Remove runtimeClassName from affected workloads. See Uninstalling.
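
To find the affected pods, list every pod with its runtime class and keep the two Architect classes named above. A sketch with hypothetical function names:

```shell
# Reads "namespace<TAB>name<TAB>runtimeClassName" lines on stdin and keeps those
# whose runtime class is runc-architect or runsc-architect.
filter_architect_runtime() {
  awk -F '\t' '$3 == "runc-architect" || $3 == "runsc-architect"'
}

# Hypothetical helper: lists pods in all namespaces that still use those classes.
pods_using_architect_runtime() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.runtimeClassName}{"\n"}{end}' \
    | filter_architect_runtime
}

# Example (needs cluster access):
#   pods_using_architect_runtime
```

Pods created by a Deployment, StatefulSet, or similar controller inherit the field from the pod template, so remove runtimeClassName there rather than on the pod itself.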