# Troubleshooting

Diagnose common issues with self-hosted Dreadnode installations.

Start here when something isn’t working. Sections are organized by what you see, not what’s broken — pick the symptom that matches.

The commands below are useful regardless of the problem. They assume dreadnode as the release name throughout; substitute yours if it differs.

```shell
# All pods for the release
kubectl -n <namespace> get pods -l app.kubernetes.io/instance=dreadnode

# Events (scheduling failures, image pull errors, probe failures)
kubectl -n <namespace> get events --sort-by='.lastTimestamp'

# API logs
kubectl -n <namespace> logs deploy/dreadnode-api

# API init container logs (migrations run here)
kubectl -n <namespace> logs deploy/dreadnode-api -c migrations

# Health check
curl http(s)://<your-domain>/api/v1/health
```

A pod stuck in Pending can’t be scheduled. Check its events:

```shell
kubectl -n <namespace> describe pod <pod-name>
```

“no nodes available to schedule pods” or “Insufficient cpu/memory” — Your cluster doesn’t have enough allocatable resources. The small preset totals roughly 4 vCPU and 8 Gi across all components. Free up resources or add nodes.

“pod has unbound immediate PersistentVolumeClaims” — No StorageClass can provision the requested PVC. Check that a StorageClass exists:

```shell
kubectl get storageclass
```

If empty, install a storage provisioner (local-path, EBS CSI, Rook, etc.) before deploying Dreadnode. The preflight checks catch this, but only if you ran them.
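
If you just need a provisioner for a single-node or test cluster, Rancher’s local-path provisioner is a common choice. A sketch — the manifest URL is the project’s documented install path as of writing, so verify it against the local-path-provisioner releases before applying:

```shell
# Install the local-path provisioner (single-node / test clusters only)
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

# Mark it as the default StorageClass so PVCs without an explicit
# storageClassName can still bind
kubectl annotate storageclass local-path \
  storageclass.kubernetes.io/is-default-class=true
```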

A pod in CrashLoopBackOff means the container starts and immediately exits. Check the logs for the crashing container.

The migrations init container runs alembic upgrade head before the API starts. If it fails, the pod shows Init:CrashLoopBackOff and the API never boots.

```shell
kubectl -n <namespace> logs deploy/dreadnode-api -c migrations
```

connection refused or could not translate host name — The API can’t reach PostgreSQL. If using in-cluster Postgres, check that the dreadnode-postgresql StatefulSet has a Ready pod. If using an external database, verify the host, port, and network connectivity from inside the cluster.

password authentication failed or FATAL: role "..." does not exist — Wrong credentials. For in-cluster Postgres, the password lives in the dreadnode-postgresql Secret. If you deleted and recreated the Secret without deleting the PVC, the password on disk no longer matches. Delete both the PVC and the Secret and let the chart regenerate them together.
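
To see what the Secret currently holds, decode it directly. A sketch — the key name `password` is an assumption, so list the actual keys first and substitute one of them:

```shell
# List the keys the Secret actually contains (values are base64-encoded)
kubectl -n <namespace> get secret dreadnode-postgresql -o jsonpath='{.data}'

# Decode one key ("password" is an assumption; use a key from the output above)
kubectl -n <namespace> get secret dreadnode-postgresql \
  -o jsonpath='{.data.password}' | base64 -d
```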

ValidationError or missing required env — A required environment variable is missing or malformed. The API validates its config with Pydantic on startup. The error message names the exact field. Check the ConfigMap and Secrets for the API pod.
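
To see the environment the API is actually configured with, list it from the Deployment. Since the pod is crash-looping at this point, prefer the declarative view over exec:

```shell
# Show env vars declared on the Deployment, including valueFrom references
# to ConfigMaps and Secrets
kubectl -n <namespace> set env deploy/dreadnode-api --list

# If the main container is running, dump its resolved environment instead
kubectl -n <namespace> exec deploy/dreadnode-api -- env | sort
```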

If the init container succeeds but the main container crashes:

```shell
kubectl -n <namespace> logs deploy/dreadnode-api
```

Look for Python tracebacks. The most common cause is a config value that passes validation but fails at runtime — a ClickHouse host that resolves but rejects connections, an S3 endpoint that times out, etc.
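
To test reachability of a backing service from inside the cluster, a throwaway pod is often quicker than reading tracebacks. A sketch — the service name `dreadnode-clickhouse` and port 8123 (ClickHouse’s default HTTP port, which answers `Ok.` on `/ping`) are assumptions; substitute your actual host and port:

```shell
# One-off pod probing ClickHouse's HTTP interface from inside the cluster
kubectl -n <namespace> run net-test --rm -it --restart=Never \
  --image=busybox -- wget -qO- http://dreadnode-clickhouse:8123/ping
```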

## StatefulSet pods (PostgreSQL, ClickHouse, MinIO)
```shell
kubectl -n <namespace> logs sts/dreadnode-postgresql
kubectl -n <namespace> logs sts/dreadnode-clickhouse
kubectl -n <namespace> logs sts/dreadnode-minio
```

If a stateful pod crashes after a reinstall, the most likely cause is a password mismatch: the Secret was regenerated but the PVC still holds data encrypted with the old password. Delete both the PVC and the Secret, then let the chart recreate them:

```shell
kubectl -n <namespace> delete pvc data-dreadnode-postgresql-0
kubectl -n <namespace> delete secret dreadnode-postgresql
# Then: helm upgrade (or redeploy via Admin Console)
```

A pod in ImagePullBackOff or ErrImagePull means the container runtime can’t pull the image.

```shell
kubectl -n <namespace> describe pod <pod-name>
```

“unauthorized” or “authentication required” — The Replicated pull secret is missing or invalid. Check that the enterprise-pull-secret Secret exists in the namespace:

```shell
kubectl -n <namespace> get secret enterprise-pull-secret
```

If missing, the license may not have been applied correctly. For Helm CLI installs, verify you logged in to the registry (helm registry login registry.replicated.com). For Embedded Cluster / KOTS, the license is injected automatically — check the Admin Console for license status.
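
To confirm the pull secret actually contains credentials for the expected registry, decode its dockerconfig. A sketch, assuming `jq` is available:

```shell
# List the registries the pull secret has credentials for (requires jq)
kubectl -n <namespace> get secret enterprise-pull-secret \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
```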

“manifest unknown” or “not found” — The image tag doesn’t exist in the registry. This usually means the chart version and the published images are out of sync. Verify you’re installing a version that was promoted to your channel.

You can see the Dreadnode login page, but interactions fail (login doesn’t work, pages show errors, network tab shows 404 or 502 on /api/* requests).

Check ingress routing. The frontend and API share a single hostname (<your-domain>). The ingress must route /api/* to the API service and / to the frontend service. If you see 404s on /api/*, the ingress isn’t routing correctly.

```shell
kubectl -n <namespace> get ingress
```

Verify the API ingress has the correct host and paths configured.
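
Two direct requests can separate routing problems from backend problems: a 404 on `/api/*` points at ingress routing, while a 502 or 503 points at an unready API pod. A sketch, using the health endpoint shown earlier:

```shell
# Should return 200 if routing and the API pod are both healthy
curl -si http(s)://<your-domain>/api/v1/health | head -1

# Should return the frontend's response
curl -si http(s)://<your-domain>/ | head -1
```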

Check the API pod is Ready. If the API pod isn’t passing health checks, the ingress controller won’t route traffic to it:

```shell
kubectl -n <namespace> get pods -l app.kubernetes.io/name=dreadnode-api
```

You enter credentials, the page reloads, but you’re not logged in. No error message.

Scheme mismatch. This is almost always caused by global.scheme being set to https while you’re connecting over plain HTTP. The API sets Secure on authentication cookies when scheme is https. Browsers silently refuse to store Secure cookies over HTTP connections.

Fix: either connect over HTTPS, or set global.scheme: http and redeploy.

CORS mismatch. If you’re accessing the platform on a URL that doesn’t match global.domain (e.g., via IP address or a different hostname), the browser blocks cross-origin cookie writes. Access the platform on the exact domain you configured.
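
For Helm installs, you can confirm what scheme and domain the release was actually deployed with by checking the values Helm recorded (for Embedded Cluster / KOTS, check the Admin Console config instead):

```shell
# Show user-supplied values for the release; add -a to include chart defaults
helm -n <namespace> get values dreadnode | grep -iE 'scheme|domain'
```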

## Signup says “invite required” on a fresh install

A previous install left PostgreSQL data behind. The platform sees existing users and enforces invite-only signups. If this is supposed to be a fresh install, delete the PostgreSQL PVC and redeploy:

```shell
kubectl -n <namespace> delete pvc data-dreadnode-postgresql-0
kubectl -n <namespace> delete secret dreadnode-postgresql
```

Certificate warnings in the browser usually mean the TLS Secret exists but the certificate doesn’t cover the hostname you’re visiting. The cert must cover both <your-domain> and storage.<your-domain>. Check the certificate’s SANs:

```shell
kubectl -n <namespace> get secret dreadnode-tls -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
```

Verify the TLS Secret is in the correct namespace and the ingress references it:

```shell
kubectl -n <namespace> get ingress -o yaml | grep -A3 tls
```

If the ingress shows no TLS block, check that global.tls.secretName is set in your values overlay and you redeployed after setting it.

## TLS terminates upstream (load balancer, service mesh)

If a cloud load balancer or service mesh handles TLS before traffic reaches the cluster, set global.scheme: https and global.tls.skipCheck: true. This tells the chart to emit https:// URLs without requiring a TLS Secret in the namespace.

The platform generates presigned S3 URLs for file downloads. If these fail, check that storage.<your-domain> resolves and is reachable from the user’s browser — presigned URLs point at the external S3 endpoint, not the internal one.

For in-cluster MinIO, verify the MinIO ingress exists and routes correctly:

```shell
kubectl -n <namespace> get ingress dreadnode-minio
```
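
You can also probe the external path a browser would take. A sketch, assuming in-cluster MinIO is exposed at storage.<your-domain>; `/minio/health/live` is MinIO’s standard liveness endpoint:

```shell
# Probe the external S3 endpoint that presigned URLs point at
curl -si https://storage.<your-domain>/minio/health/live | head -1
```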

The API creates buckets (python-packages, org-data, user-data-logs) on startup. If the MinIO pod was unhealthy when the API started, the buckets may not exist. Restart the API pod after MinIO is Ready:

```shell
kubectl -n <namespace> rollout restart deploy/dreadnode-api
```
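
One quick way to check whether the buckets were created, assuming the chart’s MinIO runs in filesystem mode with its data volume mounted at /data (adjust the path for your deployment):

```shell
# In filesystem mode, buckets appear as top-level directories on the data volume
kubectl -n <namespace> exec sts/dreadnode-minio -- ls /data
```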

Support bundles collect logs, cluster state, and diagnostic information into a single archive you can share with us for debugging.

From the Admin Console (Embedded Cluster / KOTS): Go to Troubleshoot and click Generate a support bundle.

From the CLI (Helm installs):

```shell
kubectl support-bundle --load-cluster-specs -n <namespace>
```

This requires the troubleshoot kubectl plugin. The bundle spec is baked into the chart as a Secret with the troubleshoot.sh/kind: support-bundle label — the plugin discovers it automatically.
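
If the plugin isn’t installed, it’s distributed through krew (the Troubleshoot project also publishes standalone binaries):

```shell
# Install the support-bundle plugin via krew, then generate the bundle
kubectl krew install support-bundle
kubectl support-bundle --load-cluster-specs -n <namespace>
```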

The bundle includes pod logs (up to 720 hours, 10,000 lines per pod), Helm release history, cluster resource state, and reachability probes for in-cluster data stores. Credentials are automatically redacted.