After fixing the communication outage in Level 1, the Intergalactic Union welcomed a new species: the Zephyrians.

Intermediate Solution: The Silent Canary

We'll approach this exactly as you would: start with the objectives, break them down one by one, and systematically fix what's broken.

This walkthrough contains the full solution. Try solving the challenge yourself first, then come back if you get stuck or want to compare approaches.

Understanding the Setup

All files are in adventures/01-echoes-lost-in-orbit/intermediate/manifests. The structure looks like this:

  • appset.yaml: An Argo CD ApplicationSet that generates Applications for staging and prod using the Git directory generator.
  • base/: The base Kustomize configuration for the echo-server app, containing the AnalysisTemplate, Rollout, Service, and kustomization.yaml.
  • overlays/: Environment-specific overlays that adjust replica counts for staging and prod.

Info

All steps in this guide use the staging environment. Since staging and production are identical except for replica count, the same fixes apply to both. We'll only mention staging throughout.

  1. Pod Info Version 6.9.3 Deployed

    The rollout already targets image 6.9.3, but the stable replica set is still running 6.8.0. Something is blocking progression.

    Check which version is running and why the rollout is stuck:

    kubectl -n echo-staging get rollout echo-server -o yaml

    The spec.template.spec.containers[0].image field shows stefanprodan/podinfo:6.9.3 as desired. But scrolling down to status.conditions reveals an error:

    - message: 'Rollout aborted update to revision 2: Metric "container-restarts" assessed
        Error due to consecutiveErrors (1) > consecutiveErrorLimit (0): "Error Message:
        Post "http://prom-server.prometheus.svc.cluster.local/api/v1/query": dial tcp:
        lookup prom-server.prometheus.svc.cluster.local on 10.96.0.10:53: no such host"'
      reason: RolloutAborted
      status: "False"
      type: Progressing
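Rather than scrolling through the full YAML, the conditions can be pulled out on their own. The following is a minimal sketch, assuming kubectl's JSONPath output; the helper name rollout_conditions is hypothetical and not part of the challenge repository.

```shell
# Print one line per rollout condition: "<type>=<status> <reason>: <message>".
# Arguments: $1 = namespace, $2 = rollout name.
rollout_conditions() {
  kubectl -n "$1" get rollout "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}: {.message}{"\n"}{end}'
}

# Example (needs cluster access):
#   rollout_conditions echo-staging echo-server
```

An aborted rollout should show up here as a Progressing condition with reason RolloutAborted, matching the YAML excerpt above.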

    To confirm what is actually serving traffic, look up the status.stableRS hash and inspect that replica set directly:

    # status.stableRS: 6fdd67656d
    kubectl -n echo-staging get replicaset echo-server-6fdd67656d -o yaml

    spec:
      template:
        spec:
          containers:
            - name: echo-server
              image: stefanprodan/podinfo:6.8.0

    Figure: The Argo Rollouts UI confirms the rollout is aborted (failed release with an aborted rollout status).
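The two lookups above can be chained so the stable image is printed in one step. This is a sketch only: it assumes the replica set is named <rollout-name>-<stableRS hash>, as in the example above, and the helper name stable_image is hypothetical.

```shell
# Print the image of the first container in the rollout's stable replica set.
# Arguments: $1 = namespace, $2 = rollout name.
stable_image() {
  si_ns="$1"
  si_name="$2"
  # Read the pod-template hash of the stable replica set from the rollout status.
  si_hash=$(kubectl -n "$si_ns" get rollout "$si_name" -o jsonpath='{.status.stableRS}')
  # Assumption: the replica set is named "<rollout-name>-<hash>".
  kubectl -n "$si_ns" get replicaset "${si_name}-${si_hash}" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Example (needs cluster access):
#   stable_image echo-staging echo-server
```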

    The rollout aborted due to an error in the AnalysisTemplate. Fixing the analysis is what will unblock the version upgrade. Move on to the next objective to investigate.

    Key Takeaways

    • Argo Rollouts aborts an update when an analysis metric errors or fails beyond its configured limit; the stable replica set keeps serving the old version.
    • Check rollout status conditions first when a rollout appears stuck.
  2. Automatic Canary Progression Based on Health Metrics

    The rollout is aborted because the AnalysisTemplate is referencing a Prometheus service that does not exist. There are also two bugs in the metric definitions themselves.

    Inspect the AnalysisTemplate directly:

    kubectl -n echo-staging get analysistemplate echo-analysis -o yaml

    Or open the file at adventures/01-echoes-lost-in-orbit/intermediate/manifests/base/analysis-template.yaml. You will find two metrics: container-restarts and ready-containers.

    Bug 1: wrong Prometheus service name. The error says it cannot reach prom-server.prometheus.svc.cluster.local. Let's check if this service exists. The URL follows the structure http://<service-name>.<namespace>.svc.cluster.local, so we run:

    kubectl -n prometheus get service

    This outputs one service called prometheus-server, not prom-server. There's a typo in the manifest. Fix the address in container-restarts, and make sure ready-containers points at the same service:

    address: http://prometheus-server.prometheus.svc.cluster.local
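To make the naming rule concrete, here is a small sketch that builds the in-cluster URL from a service name and namespace and checks that the service exists before it goes into a manifest. Both helper names (service_url, service_exists) are hypothetical.

```shell
# Build the in-cluster URL for a service.
# Arguments: $1 = service name, $2 = namespace.
service_url() {
  printf 'http://%s.%s.svc.cluster.local\n' "$1" "$2"
}

# Succeed only if the service exists in the namespace.
# Arguments: $1 = service name, $2 = namespace.
service_exists() {
  kubectl -n "$2" get service "$1" >/dev/null 2>&1
}

# Example (needs cluster access):
#   service_exists prometheus-server prometheus && service_url prometheus-server prometheus
```

With the names from this walkthrough, service_url prometheus-server prometheus prints the corrected address shown above.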

    Commit, push, then retry the rollout:

    kubectl argo rollouts retry rollout echo-server -n echo-staging
    kubectl argo rollouts -n echo-staging status echo-server
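A retry only picks up the fix once Argo CD has synced the pushed commit into the cluster. As a rough check, the sketch below prints the Prometheus address each metric currently uses in the live AnalysisTemplate; the helper name template_addresses is hypothetical.

```shell
# Print "<metric name>: <prometheus address>" for each metric in the
# live echo-analysis AnalysisTemplate.
# Arguments: $1 = namespace.
template_addresses() {
  kubectl -n "$1" get analysistemplate echo-analysis \
    -o jsonpath='{range .spec.metrics[*]}{.name}: {.provider.prometheus.address}{"\n"}{end}'
}

# Example (needs cluster access):
#   template_addresses echo-staging
```

If the output still shows prom-server, the change has not reached the cluster yet and a retry would most likely hit the same DNS error.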

    The DNS error is gone, so Prometheus is now reachable, but a new error appears:

    Degraded - RolloutAborted: Rollout aborted update to revision 2: Metric "container-restarts" assessed Failed due to failed (1) > failureLimit (0)

    Bug 2: inverted success condition. The container-restarts query returns 0 when there are no restarts, which is the healthy state. But the current condition result[0] > 0 only passes when restarts exist. Change it to require zero restarts:

    successCondition: result[0] == 0

    Commit, push, and retry again. Open the Argo Rollouts UI, select the echo-staging namespace, and click the rollout card. You will see multiple AnalysisRuns accumulate as each retry creates a new one:

    Figure: Argo Rollouts UI showing three AnalysisRuns for the same rollout revision. Each retry creates a new AnalysisRun; earlier ones show the previous errors.

    Click the latest AnalysisRun. The container-restarts metric is now passing, but ready-containers is failing:

    Figure: AnalysisRun detail showing container-restarts passing and ready-containers failing.
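The same per-metric view is available from the command line. This is a sketch, assuming the namespace holds AnalysisRuns for this one rollout only, so the newest run is the one of interest; the helper names latest_analysisrun and metric_phases are hypothetical.

```shell
# Print the resource name of the most recently created AnalysisRun.
# Arguments: $1 = namespace.
latest_analysisrun() {
  kubectl -n "$1" get analysisrun \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1
}

# Print "<metric name>=<phase>" for each metric of an AnalysisRun.
# Arguments: $1 = namespace, $2 = resource name as printed by latest_analysisrun.
metric_phases() {
  kubectl -n "$1" get "$2" \
    -o jsonpath='{range .status.metricResults[*]}{.name}={.phase}{"\n"}{end}'
}

# Example (needs cluster access):
#   metric_phases echo-staging "$(latest_analysisrun echo-staging)"
```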

    Clicking the ready-containers metric on the left reveals the query body is empty:

    Figure: AnalysisTemplate ready-containers metric showing an empty query body.

    Bug 3: missing query implementation. The ready-containers metric has no query. Open the Prometheus UI (port 30102 in VS Code's Ports tab) and explore metrics beginning with kube_pod_container_status_. Choose kube_pod_container_status_ready:

    Figure: Prometheus UI autocomplete showing kube_pod_container_status_ready selected.

    Filter to your namespace and pods, then run:

    kube_pod_container_status_ready{
      namespace="echo-staging",
      pod=~"echo-server-.*"
    }

    Figure: Prometheus query returning a list of ready container metrics per pod.

    This returns one series per matching container. Wrap it in sum() to get a single count:

    sum(
      kube_pod_container_status_ready{
        namespace="echo-staging",
        pod=~"echo-server-.*"
      }
    )

    Figure: Prometheus query result showing a single value of 1 for ready containers. The aggregated query returns 1, which satisfies successCondition: result[0] >= 1.

    Replace the hardcoded namespace with {{args.namespace}} so it works for both environments, and append or vector(0) so the query still returns a value (0) when no series match yet, instead of an empty result:

    query: |-
      sum(kube_pod_container_status_ready{
        namespace="{{args.namespace}}",
        pod=~"echo-server-.*"
      }) or vector(0)
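The {{args.namespace}} placeholder is only filled in by Argo Rollouts, so to try the final query outside an AnalysisRun the namespace has to be substituted by hand. The sketch below does that and sends the query to the Prometheus HTTP API with curl. The helper names (ready_query, prom_instant_query) are hypothetical, and http://localhost:30102 is an assumption based on the forwarded port mentioned earlier.

```shell
# Print the ready-containers query with the namespace filled in.
# Arguments: $1 = namespace.
ready_query() {
  printf 'sum(kube_pod_container_status_ready{namespace="%s",pod=~"echo-server-.*"}) or vector(0)\n' "$1"
}

# Run an instant query against the Prometheus HTTP API and print the JSON reply.
# Arguments: $1 = Prometheus base URL, $2 = PromQL query.
prom_instant_query() {
  curl -sG --data-urlencode "query=$2" "$1/api/v1/query"
}

# Example (assumes Prometheus is reachable on the forwarded port):
#   prom_instant_query http://localhost:30102 "$(ready_query echo-staging)"
```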

    Commit, push, and retry once more. This time the rollout progresses through all canary stages and completes successfully.

    Key Takeaways

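    • You can debug Argo Rollouts with plain kubectl, the kubectl argo rollouts plugin, or the Argo Rollouts UI. Try each and use whichever you prefer.
    • Service references follow the format service-name.namespace.svc.cluster.local. A typo here is not caught when the manifest is applied; it only surfaces as a DNS lookup failure when the analysis runs.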
    • Prometheus queries are a simple and effective way to validate application health during rollouts.
    • The Prometheus UI is a great way to test and build your queries before adding them to an AnalysisTemplate.
  3. Two Working PromQL Queries in the AnalysisTemplate

    This objective is resolved by the previous step. Both metrics are now implemented and validated.

    With the fixes in place, the AnalysisTemplate contains two working health checks:

    • container-restarts: confirms zero container restarts over the query's one-minute window ([1m]).
    • ready-containers: confirms at least one container is ready before the rollout progresses.
  4. All Rollouts Complete Successfully

    Once both metrics pass, Argo Rollouts advances through the canary stages automatically and marks the rollout complete in both staging and production.

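    The AnalysisTemplate lives in base/, so the fix reaches both overlays from a single commit, and Argo CD syncs it to production automatically. If the production rollout was already aborted, retry it the same way as in staging. Once the analysis passes, both environments complete the rollout to podinfo 6.9.3 without manual promotion.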
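To check both environments at a glance, the rollout phase can be read per namespace. This is a sketch: the helper name rollout_phases is hypothetical, and the production namespace echo-prod in the example is an assumption, since this guide only names echo-staging.

```shell
# Print "<namespace>: <phase>" for the echo-server rollout in each namespace.
# Arguments: one or more namespaces.
rollout_phases() {
  for rp_ns in "$@"; do
    rp_phase=$(kubectl -n "$rp_ns" get rollout echo-server -o jsonpath='{.status.phase}')
    printf '%s: %s\n' "$rp_ns" "$rp_phase"
  done
}

# Example (needs cluster access; echo-prod is an assumed namespace name):
#   rollout_phases echo-staging echo-prod
```

A completed rollout is expected to report Healthy, while an aborted one reports Degraded, as in the status output earlier.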

    Tip

    Run the smoke test to confirm all objectives are met: adventures/01-echoes-lost-in-orbit/intermediate/smoke-test.sh

Final Result

Complete AnalysisTemplate

All three fixes applied: correct Prometheus address, corrected success condition, and implemented ready-containers query.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: echo-analysis
spec:
  args:
    - name: namespace
  metrics:
    - name: container-restarts
      successCondition: result[0] == 0
      failureLimit: 0
      inconclusiveLimit: 0
      consecutiveErrorLimit: 0
      count: 1
      provider:
        prometheus:
          address: http://prometheus-server.prometheus.svc.cluster.local
          query: |
            # There should be no restarts
            sum(increase(kube_pod_container_status_restarts_total{
              namespace="{{args.namespace}}",
              pod=~"echo-server-.*"
            }[1m])) or vector(0)
    - name: ready-containers
      successCondition: result[0] >= 1
      failureLimit: 0
      inconclusiveLimit: 0
      consecutiveErrorLimit: 0
      count: 1
      provider:
        prometheus:
          address: http://prometheus-server.prometheus.svc.cluster.local
          query: |-
            # Check how many containers are ready (should be at least 1)
            sum(kube_pod_container_status_ready{
              namespace="{{args.namespace}}",
              pod=~"echo-server-.*"
            }) or vector(0)

The Canary Sings

The deployment pipeline is no longer silent. Both environments advanced to podinfo 6.9.3 guided by health checks that now actually check health. The Zephyrian relay network is back online, and the canary has earned its promotion.
