feat: Measure system errors using a counter#127

Open
morgan-wowk wants to merge 1 commit into setup-metric-provider from system-error-metric

morgan-wowk (Collaborator) commented Feb 25, 2026

TL;DR

Added OpenTelemetry metrics instrumentation to track execution system errors in the orchestrator.

Screenshot 2026-02-25 at 4.39.17 AM.png

Screenshot 2026-02-25 at 4.41.10 AM.png

Business value

We will be able to track the rate of system errors and respond to high or increasing rates.

Future iterations

In the future we will emit metrics for state transitions in general and use measurement attributes to create a dimension on status. At that point we will have the option to deprecate this measurement, which is specific to system errors.
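As a rough illustration of that direction, a single counter could carry a status attribute so that system errors become one value of a dimension rather than a separate instrument. This is only a sketch: the instrument name `execution_state_transitions`, the meter name, the attribute key `status`, the helper `count_state_transition`, and the status value used in the usage line are all hypothetical and not taken from this PR. The import is guarded so the sketch still runs where `opentelemetry-api` is not installed.

```python
# Sketch only: every name below is an assumption, not part of this PR.
try:
    from opentelemetry import metrics

    _meter = metrics.get_meter("orchestrator")  # assumed meter name
    _state_transitions = _meter.create_counter(
        name="execution_state_transitions",  # hypothetical instrument name
        unit="1",
        description="Number of execution state transitions",
    )
except ImportError:
    # opentelemetry-api is not installed; recording becomes a no-op.
    _state_transitions = None


def count_state_transition(status: str) -> dict:
    """Hypothetical helper: record one transition, tagged with its status."""
    attributes = {"status": status}
    if _state_transitions is not None:
        # Each distinct attribute set produces its own time series.
        _state_transitions.add(1, attributes=attributes)
    return attributes


# Usage: a system error is then just one status value among others.
count_state_transition("SYSTEM_ERROR")  # hypothetical status value
```

With this shape, a query can filter or group by `status`, which is what would make a dedicated system-error counter redundant.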

What changed?

  • Created a new metrics module (cloud_pipelines_backend/instrumentation/metrics.py) that defines an orchestrator meter and an execution_system_errors counter instrument
  • Integrated the metrics counter into the record_system_error_exception function in orchestrator_sql.py to increment the counter when system errors occur
  • Added comprehensive OpenTelemetry strategy documentation (otel_strategy.md) covering best practices for meters, instruments, temporality, and aggregation
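The first two changes can be pictured roughly as follows. This is a minimal sketch and not the code in this PR: the meter name string, the instrument name string, the unit, the description, and the helper `count_system_error` are assumptions, and the real `record_system_error_exception` signature is not shown here. The import is guarded so the sketch runs even without `opentelemetry-api` installed.

```python
# Sketch of a metrics module; the strings and helper name are assumptions.
try:
    from opentelemetry import metrics

    # A meter groups the instruments that belong to one component.
    orchestrator_meter = metrics.get_meter("orchestrator")  # assumed name
    execution_system_errors = orchestrator_meter.create_counter(
        name="execution_system_errors",  # assumed instrument name string
        unit="1",
        description="Number of execution system errors",
    )
except ImportError:
    # opentelemetry-api is not installed; counting becomes a no-op.
    execution_system_errors = None


def count_system_error() -> None:
    """Hypothetical helper: increment the counter by one.

    In the PR, the increment is described as happening inside
    record_system_error_exception in orchestrator_sql.py.
    """
    if execution_system_errors is not None:
        execution_system_errors.add(1)


# Usage: call once per recorded system error.
count_system_error()
```

Without a configured meter provider, the OpenTelemetry API returns no-op instruments, so the call is safe in environments that do not export metrics.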

How to test?

Why make this change?

This enables monitoring and alerting on system errors in pipeline executions, providing better observability into the health and reliability of the orchestrator component. The metrics follow OpenTelemetry semantic conventions and provide a foundation for expanding observability coverage across the application.

morgan-wowk (Collaborator, Author) commented Feb 25, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.
