feat: Measure system errors using a counter #127
Open
morgan-wowk wants to merge 1 commit into setup-metric-provider
yuechao-qin requested changes on Feb 26, 2026
yuechao-qin approved these changes on Feb 26, 2026

TL;DR
Added OpenTelemetry metrics instrumentation to track execution system errors in the orchestrator.
Business value
We will be able to track the rate of system errors and respond to high or increasing rates.
Future iterations
In the future we will emit metrics for state transitions in general and use measurement attributes to create dimensions on status. At that point we will have the option to deprecate this measurement, which is specific to system errors.
What changed?
- A new module (cloud_pipelines_backend/instrumentation/metrics.py) that defines an orchestrator meter and an execution_system_errors counter instrument
- A record_system_error_exception function in orchestrator_sql.py to increment the counter when system errors occur
- Documentation (otel_strategy.md) covering best practices for meters, instruments, temporality, and aggregation
How to test?
1. Start the Docker Compose services (docker-compose up -d)
2. Set the exporter environment variables:
   export TANGLE_OTEL_TRACE_EXPORTER_ENDPOINT="http://localhost:4317" && export TANGLE_OTEL_TRACE_EXPORTER_PROTOCOL="grpc" && export TANGLE_OTEL_METRIC_EXPORTER_ENDPOINT="http://localhost:4317" && export TANGLE_OTEL_METRIC_EXPORTER_PROTOCOL="grpc"
3. Verify that the execution.system_errors metric is incremented in your OpenTelemetry metrics backend. To trigger a system error, add the line raise RuntimeError("Temporary") to the start of https://github.com/TangleML/tangle/blob/system-error-metric/cloud_pipelines_backend/orchestrator_sql.py#L209
Why make this change?
This enables monitoring and alerting on system errors in pipeline executions, providing better observability into the health and reliability of the orchestrator component. The metrics follow OpenTelemetry semantic conventions and provide a foundation for expanding observability coverage across the application.
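
For readers without the diff open, here is a minimal sketch of the pattern described under "What changed?": one meter, one counter, and a helper that increments it. It is not the code in this PR. The meter name "orchestrator", the unit, the description text, the error.type attribute, and the helper's signature are all assumptions made for illustration. Only the metric name execution.system_errors and the function name record_system_error_exception come from the description above. The OpenTelemetry import is kept inside the factory so the helper can be exercised with any object that has an add() method.

```python
from typing import Mapping, Optional, Protocol


class CounterLike(Protocol):
    """Anything with the same add() shape as an OpenTelemetry Counter."""

    def add(self, amount: int, attributes: Optional[Mapping[str, str]] = None) -> None:
        ...


def create_system_error_counter() -> CounterLike:
    """Create the counter from the global meter provider.

    Requires the opentelemetry-api package. The meter name, unit, and
    description below are assumptions, not values taken from the PR.
    """
    from opentelemetry import metrics

    meter = metrics.get_meter("orchestrator")
    return meter.create_counter(
        name="execution.system_errors",
        unit="1",
        description="Number of execution system errors",
    )


def record_system_error_exception(counter: CounterLike, exception: BaseException) -> None:
    """Increment the counter once for a system error.

    Attaching the exception class name as an attribute is an assumption;
    the PR may record the measurement with no attributes at all.
    """
    counter.add(1, {"error.type": type(exception).__name__})
```

In the orchestrator, the helper would presumably be called from the except block that handles system errors, with the counter created once at import time rather than per call.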