
The Noise Filter

ART is flooding Jaeger with 40,000 non-standard spans an hour. Fix the chat span to follow OpenTelemetry GenAI semantic conventions with proper token usage attributes, then configure tail sampling in the Collector to keep only traces that contain errors or exceed 5 seconds.

Mission Objective

  • Fix ART's chat span to follow OpenTelemetry GenAI semantic conventions, including token usage attributes
  • Configure tail sampling in the OpenTelemetry Collector to keep only traces that contain errors or take longer than 5 seconds
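
The sampling rule in the second objective can be illustrated outside the Collector. The sketch below is plain Python, not Collector configuration: it assumes a simplified, hypothetical `SpanSummary` record and measures trace duration from the earliest span start to the latest span end. In the Collector itself, the contrib `tail_sampling` processor offers `status_code` and `latency` policy types for this kind of decision; the exact settings are left to the challenge.

```python
from dataclasses import dataclass
from typing import Iterable

# Threshold taken from the objective: keep traces longer than 5 seconds.
LATENCY_THRESHOLD_S = 5.0


@dataclass
class SpanSummary:
    """Hypothetical, simplified view of one finished span."""
    start_s: float   # start time in seconds
    end_s: float     # end time in seconds
    is_error: bool   # True if the span status is ERROR


def keep_trace(spans: Iterable[SpanSummary]) -> bool:
    """Decide after the whole trace has arrived whether to keep it."""
    spans = list(spans)
    if not spans:
        return False
    # Rule 1: any span with an error keeps the whole trace.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: keep slow traces, measured from earliest start to latest end.
    duration = max(s.end_s for s in spans) - min(s.start_s for s in spans)
    return duration > LATENCY_THRESHOLD_S
```

The decision is made per trace rather than per span, which is why it has to happen in the Collector after the spans are buffered, not in the application at span creation time.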

Key Learnings

  • OpenTelemetry GenAI semantic conventions
  • Tail sampling in the OTel Collector

The Story

You made it to RaviHyral. The Perihelion docked at Outpost Verada, a small independent research station run by a loose collective of academics who agreed to look the other way. In exchange, ART offered to share its observability data with the station's monitoring team.

That was three hours ago. Now the station's lead engineer is at your docking port, looking annoyed.

Engineer: "Your ship's AI is flooding our Jaeger instance. Do you have any idea how many spans it's generating? We can't find anything in there."

SecUnit: "ART."

ART: "Comprehensive telemetry is a feature."

Engineer: "It's 40,000 spans an hour. Every healthy query. Every token. It doesn't even follow conventions. We only care about failures and anomalies, the things that actually need attention."

SecUnit: "ART. Fix it."

ART: "...Fine."

The engineer hands you access to the collector config and the application code, then walks away. Two problems to fix. ART's spans don't follow OTel GenAI semantic conventions, and the collector is forwarding everything.

Credits: The characters of this adventure are borrowed from the Murderbot Diaries series by Martha Wells, a brilliant series that is funny, action-packed, and surprisingly heartwarming. It follows a security unit that hacked its own governor module and now just wants to be left alone to watch media, but keeps getting pulled into human nonsense.

Architecture

Same setup as the intermediate level: the ART Pilot System runs as a local Python application outside Kubernetes with a RAG architecture. AI infrastructure (Ollama, Qdrant) and observability tools (OpenTelemetry Collector, Jaeger) run inside Kubernetes.

Ready to start? Launch the challenge in a preconfigured devcontainer with GitHub Codespaces (free GitHub account required).

Walkthrough

  1. Open in GitHub Codespaces. The devcontainer is pre-configured and starts automatically. When you push from Codespaces, GitHub forks the repository to your account automatically.

    Prefer working locally? Clone the repo and open it in any editor that supports the Dev Containers specification (VS Code, JetBrains IDEs, and others). The devcontainer config will be detected automatically.

  2. Wait ~15 minutes for all infrastructure to initialize.

  3. Open the Ports tab and open the forwarded service:

    • Port 30103: Jaeger. Use it to verify that your spans look correct and that tail sampling works as expected.

  4. Fix two things:

    1. The application code in ./art.py: update the chat span to follow OpenTelemetry GenAI semantic conventions, including token usage attributes.
    2. The collector config in ./manifests/otel-collector-config.yaml: configure tail sampling to keep only traces that contain errors or take longer than 5 seconds.

  5. After changing art.py, restart traffic generation so the new instrumentation takes effect. After changing the collector config, apply it and restart the Collector:

    kubectl apply -f manifests/otel-collector-config.yaml -n otel
    kubectl rollout restart deployment/collector -n otel
    

    Then generate traces:

    make traffic
    

    Verify in Jaeger that the spans follow the GenAI semantic conventions and that only traces with errors or a duration over 5 seconds appear.
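
As a starting point for the art.py change, here is a minimal sketch of a chat span shaped after the GenAI semantic conventions. Several things are assumptions, since the original code is not shown here: that ART calls Ollama's chat API directly and receives its `prompt_eval_count` and `eval_count` counters, that `call_model` is a hypothetical function wrapping that call, and that the provider attribute is `gen_ai.provider.name` (older releases of the conventions use `gen_ai.system`, so check which one your setup expects). The conventions are still evolving, so treat the attribute set as illustrative rather than complete.

```python
from typing import Any, Callable, Dict, Tuple


def chat_span_start(request_model: str, provider: str = "ollama") -> Tuple[str, Dict[str, Any]]:
    """Span name and request-time attributes for a GenAI chat call."""
    operation = "chat"
    attributes = {
        "gen_ai.operation.name": operation,
        # Older convention releases name this attribute "gen_ai.system".
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
    }
    # Conventional span name: "{operation name} {request model}".
    return f"{operation} {request_model}", attributes


def usage_attributes(response: Dict[str, Any]) -> Dict[str, int]:
    """Map Ollama-style token counters to GenAI usage attributes.

    Assumption: `response` is the JSON body of an Ollama chat call.
    Counters that are absent are skipped instead of reported as zero.
    """
    mapping = {
        "prompt_eval_count": "gen_ai.usage.input_tokens",
        "eval_count": "gen_ai.usage.output_tokens",
    }
    return {attr: int(response[key]) for key, attr in mapping.items() if key in response}


def traced_chat(tracer: Any, call_model: Callable[[str], Dict[str, Any]],
                request_model: str, prompt: str) -> str:
    """Run a hypothetical call_model(prompt) inside a convention-shaped span."""
    from opentelemetry.trace import SpanKind  # needs the opentelemetry-api package

    name, attributes = chat_span_start(request_model)
    with tracer.start_as_current_span(name, kind=SpanKind.CLIENT, attributes=attributes) as span:
        # By default, an exception raised here is recorded and marks the span as an error.
        response = call_model(prompt)
        # Token usage is only known once the model has answered.
        span.set_attributes(usage_attributes(response))
        return response["message"]["content"]
```

Token counts are attached after the call returns because they are only available in the response. Letting exceptions propagate through the `with` block sets an error status on the span, which is what an error-based tail sampling policy looks at.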

Complete Your Challenge

  • When you push from Codespaces, GitHub forks the repository to your account automatically. If you are working locally, fork the repository on GitHub before pushing.
  • Verify your solution:
    ./verify.sh
    If it passes, it generates a Certificate of Completion you can paste into the discussion.
  • Share your solutions in the challenge thread on community.offon.dev.
