feat: CI phase timing instrumentation #20706

Draft
ludamad wants to merge 21 commits into next from ad/feat/ci-phase-timing


ludamad commented Feb 19, 2026

Summary

  • Instruments every major CI phase (build, test, bench, cache downloads/uploads, per-project builds) with timing data
  • Wraps cache_download/cache_upload as shell functions that shadow the scripts on $PATH, automatically publishing timing to Redis
  • Provides ci_phase() wrapper for arbitrary commands in bootstrap.sh
  • Adds ci_phases SQLite table + Redis listener + /api/ci/phases API endpoint
  • Adds stacked phase breakdown chart to ci-insights dashboard

- Fix subprocess race condition with fcntl file lock
- Warm billing caches on startup with --preload
- Add test timings link to all dashboard nav bars
- Reduce gunicorn workers from 100 to 50
- Add METRICS_DB_PATH env var for SQLite location
- Fix Content-Encoding stripping for proxied responses
- Kill stale ci-metrics process before restart
- Track test successes via daily aggregate table (test_daily_stats) without
  persisting individual passed events; backfill from existing test_events
- Fix instance type detection in log_ci_run to prefer EC2_INSTANCE_TYPE
  env var over metadata endpoint (which fails in Docker)
- Add CloudTrail backfill to resolve unknown instance types for historical
  CI runs and recalculate costs
- Add test success counts to CI Insights chart (stacked bar: successes,
  flakes, failures)
- Add time period metadata to all API responses and display in dashboard
  headers (ci-insights, cost-overview, test-timings)
- Use test_daily_stats for CI performance endpoint counts (proper
  aggregation across weekly/monthly granularity)
- Increase proxy timeout to 180s for slow BigQuery fetches
- Reduce ci-metrics to 1 worker to avoid redundant cache warmups
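
The instance type fix in the list above amounts to checking the environment before touching the metadata endpoint. A minimal sketch, assuming a hypothetical `detect_instance_type` helper, an injected `fetch_metadata` callable standing in for the real metadata request, and an `"unknown"` fallback value (none of these names are confirmed by the PR):

```python
import os


def detect_instance_type(fetch_metadata, environ=None):
    """Prefer EC2_INSTANCE_TYPE; fall back to the metadata endpoint only if unset."""
    env = os.environ if environ is None else environ
    from_env = env.get("EC2_INSTANCE_TYPE", "").strip()
    if from_env:
        return from_env
    try:
        # Hypothetical callable wrapping the instance metadata request,
        # which the PR notes fails inside Docker.
        return fetch_metadata() or "unknown"
    except Exception:
        # Assumption: a failed lookup is recorded as "unknown" so a later
        # backfill can resolve it.
        return "unknown"
```

Passing the metadata lookup in as a callable keeps the environment check testable without network access.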
…ange

- CloudTrail resolver now joins RunInstances + CreateTags events by
  instance ID, then matches to ci_runs via Dashboard and Name tags
  instead of bare timestamp proximity
- Restore merge_train_failure_slack_notify to match base branch

The previous CloudTrail resolver had four issues causing near-zero
match rates:

1. Single-pass event fetching hit the 5000-event pagination limit,
   missing most RunInstances events beyond ~16 days. Now fetches in
   daily chunks.

2. CreateTags filter discarded Name-only events (line 126 of
   aws_request_instance_type), losing the Name tag for ~90% of
   instances. Now accumulates all tags first, then filters by
   Group=build-instance.

3. Name tag parsing couldn't handle INSTANCE_POSTFIX suffixes
   (e.g. pr-123_arm64_a1-fast). Now uses regex to extract branch
   name regardless of postfix format.

4. Matching window was 10 minutes (only matched first CI step).
   Now allows 90 minutes to match all steps on an instance.
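
The daily-chunk fetching from item 1 can be sketched as below. This is only an illustration: `fetch_window` is a hypothetical callable standing in for the actual CloudTrail lookup, and the sketch assumes a single day of events stays under the pagination limit.

```python
from datetime import datetime, timedelta, timezone


def daily_windows(start, end):
    """Yield consecutive (window_start, window_end) pairs of at most one day."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(days=1), end)
        yield cursor, nxt
        cursor = nxt


def fetch_all(fetch_window, start, end):
    """Call fetch_window once per day so no single query spans the whole range."""
    events = []
    for lo, hi in daily_windows(start, end):
        # Hypothetical callable: returns the events between lo and hi.
        events.extend(fetch_window(lo, hi))
    return events
```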

Tested against real data: resolves 4187/4638 (90%) unknown instance
types across 90 days of CloudTrail history.
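
Items 2 and 3 above (accumulate all tags first, then filter by Group and parse the Name tag) might look roughly like this. The event shape, the function names, and the assumed `<branch>_<arch>[_<postfix>]` name layout with `amd64`/`arm64` as the arch tokens are illustrative guesses, not the resolver's actual code.

```python
import re
from collections import defaultdict

# Assumption: Name tags look like "<branch>_<arch>[_<postfix>]",
# e.g. "pr-123_arm64_a1-fast". The arch tokens are a guess.
NAME_RE = re.compile(r"^(?P<branch>.+?)_(?:amd64|arm64)(?:_(?P<postfix>.+))?$")


def accumulate_tags(events):
    """Merge tags from every event per instance before any filtering."""
    tags = defaultdict(dict)
    for ev in events:
        tags[ev["instance_id"]].update(ev["tags"])
    return tags


def build_instance_branches(events):
    """Map instance ID to branch name for Group=build-instance instances."""
    result = {}
    for instance_id, tags in accumulate_tags(events).items():
        if tags.get("Group") != "build-instance":
            continue
        match = NAME_RE.match(tags.get("Name", ""))
        result[instance_id] = match.group("branch") if match else None
    return result
```

Filtering only after the merge is what keeps a Name-only CreateTags event from being discarded before its Group tag arrives.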

The API was reading CI runs from a Redis+SQLite hybrid, but the hourly
Redis sync used INSERT OR REPLACE which overwrote CloudTrail-enriched
instance_type and cost_usd back to empty values. Now:

- get_ci_runs() reads exclusively from SQLite
- sync_ci_runs_to_sqlite() uses ON CONFLICT DO UPDATE that preserves
  enriched fields (only overwrites if Redis has non-empty values)
- app.py calls updated to drop unused Redis connection argument
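
For illustration, an upsert that preserves enriched fields could be written as follows. The `ci_runs` schema here is hypothetical (only `instance_type` and `cost_usd` are named in the commit message), and `sync_run` is a stand-in for whatever `sync_ci_runs_to_sqlite()` does per row.

```python
import sqlite3

# Hypothetical schema: only instance_type and cost_usd come from the PR text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ci_runs (
    run_id TEXT PRIMARY KEY,
    status TEXT,
    instance_type TEXT,
    cost_usd REAL
)
"""

# Unlike INSERT OR REPLACE, the row is updated in place, and an empty
# incoming value never overwrites an already enriched column.
UPSERT_SQL = """
INSERT INTO ci_runs (run_id, status, instance_type, cost_usd)
VALUES (:run_id, :status, :instance_type, :cost_usd)
ON CONFLICT(run_id) DO UPDATE SET
    status = excluded.status,
    instance_type = CASE
        WHEN excluded.instance_type IS NOT NULL AND excluded.instance_type != ''
        THEN excluded.instance_type
        ELSE ci_runs.instance_type
    END,
    cost_usd = COALESCE(excluded.cost_usd, ci_runs.cost_usd)
"""


def sync_run(conn, run):
    """Insert or update one run dict without clobbering enriched fields."""
    conn.execute(UPSERT_SQL, run)
    conn.commit()
```

`ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer.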
- Add hardcoded rates for m6a.xlarge/4xlarge/8xlarge/24xlarge that were
  missing, causing 192-vCPU fallback ($100+ instead of ~$8 for 8xlarge)
- Make pricing discovery dynamic: query DB for distinct instance types
  so newly resolved types get live pricing automatically
- Add recalculate_all_costs() to fix historical cost data

Instead of guessing 192 vCPUs for an unrecognized instance type (which
massively overestimates the cost), return None so the cost shows as
unknown rather than a fabricated number.
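
A minimal sketch of the "return None instead of guessing" behavior. The function name is hypothetical, and the rate table holds a made-up type and price purely for illustration, not real pricing.

```python
# Placeholder rate table: the type name and hourly price are made up.
HOURLY_RATES_USD = {"example.large": 2.0}


def estimate_cost_usd(instance_type, duration_secs, rates=HOURLY_RATES_USD):
    """Return the run cost in USD, or None when the instance type has no rate."""
    rate = rates.get(instance_type)
    if rate is None:
        # No fallback vCPU guess: an unknown type yields an unknown cost.
        return None
    return round(rate * duration_secs / 3600, 4)
```

Callers then need to treat None as "unknown" when summing or charting costs.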
… page

Merge CI Insights + Test Timings + CI Attribution into a single 3-tab
CI Insights page (Overview, Test Details, Attribution). Remove redundant
cost chart and KPIs from CI Insights. Remove attribution tab from Cost
Overview. Replace test-timings page with redirect. Update nav across all
dashboard pages to 3 links.

gunicorn 25.x introduced a control socket that deadlocks when combined
with --preload. The worker process gets stuck after fork and never
serves requests. Removing --preload fixes the issue.
- Fix _backfill_daily_stats to be incremental (fills gaps instead of
  skipping when table is non-empty)
- Merge test_events into by_date chart so historical failed/flaked data
  appears even without daily_stats rows
- Call _upsert_daily_stats from sync_failed_tests_to_sqlite so synced
  events populate daily stats
- Stop persisting 'started' events to test_events (no duration, bloats DB)
- Remove ci:test:started from pub/sub channels (not used for stats)
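
The incremental backfill (fill gaps rather than skip when the table is non-empty) could be sketched like this. The column layout of `test_events` and `test_daily_stats` is assumed for the example, and `backfill_daily_stats` is a stand-in for `_backfill_daily_stats`, not its actual implementation.

```python
import sqlite3

# Hypothetical column layout for both tables.
SCHEMA = """
CREATE TABLE test_events (test TEXT, status TEXT, day TEXT);
CREATE TABLE test_daily_stats (
    day TEXT, status TEXT, count INTEGER,
    PRIMARY KEY (day, status)
);
"""


def backfill_daily_stats(conn):
    """Aggregate only the days that have events but no stats rows yet."""
    conn.execute(
        """
        INSERT INTO test_daily_stats (day, status, count)
        SELECT day, status, COUNT(*)
        FROM test_events
        WHERE day NOT IN (SELECT DISTINCT day FROM test_daily_stats)
        GROUP BY day, status
        """
    )
    conn.commit()
```

Because days already present are excluded, rerunning the backfill is a no-op rather than a source of duplicates.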

Send Accept-Encoding: identity to ci-metrics so it returns uncompressed
responses. rk.py's Flask-Compress then handles browser compression in
one clean step, avoiding the deflate encoding issue that caused garbled
output in browsers.

Instead of double-compression (ci-metrics compresses, requests
decompresses, Flask-Compress re-compresses), pass the browser's
Accept-Encoding to ci-metrics and stream raw compressed bytes back.
This avoids the deflate encoding issue that caused garbled output.
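
The header handling behind this pass-through can be sketched with the standard library alone. This is not the code in rk.py: both function names are hypothetical, the hop-by-hop header list is the conventional one for proxies, and falling back to `identity` when the browser sends no Accept-Encoding is an assumption.

```python
# Headers that describe a single connection and must not be forwarded.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}


def upstream_request_headers(browser_headers):
    """Forward the browser's Accept-Encoding so the backend picks a codec it accepts."""
    accept = browser_headers.get("Accept-Encoding")
    # Assumption: with no browser preference, ask for an uncompressed body.
    return {"Accept-Encoding": accept or "identity"}


def passthrough_response(upstream_headers, raw_body):
    """Relay the backend's bytes untouched, keeping Content-Encoding as sent."""
    headers = {
        name: value
        for name, value in upstream_headers.items()
        if name.lower() not in HOP_BY_HOP and name.lower() != "content-length"
    }
    headers["Content-Length"] = str(len(raw_body))
    return headers, raw_body
```

The body is never decoded in the proxy, so Content-Encoding always describes the bytes the browser actually receives.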

Adds two test scripts hooked into ci3/bootstrap.sh test_cmds:

- test_proxy: spins up a backend Flask-Compress server and a proxy
  using the same raw-stream pattern as rk.py, verifying that gzip
  content passes through without double-compression (regression test
  for the garbled binary output bug)

- test_views: static checks that ci-insights.html has 3 tabs,
  cost-overview.html has 2 tabs (no attribution), test-timings.html
  redirects, and nav links are consistent across pages

Add per-pipeline duration chart to ci-insights overview showing avg
duration trends for each pipeline type (merge-queue, prs, next, etc.).
Add "avg mq duration" KPI card with sparkline.

Backend: add p50/p95/max percentiles to /api/ci/performance by_date
entries, add duration_by_dashboard field backed by new ci_run_daily_stats
table that materializes duration aggregates during sync cycle.
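
As a rough illustration of the p50/p95/max fields, here is a nearest-rank percentile over one day's run durations. The PR does not say which percentile method the endpoint uses, so nearest-rank and both function names are assumptions.

```python
def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    n = len(sorted_values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * n + 99) // 100)
    return sorted_values[rank - 1]


def summarize_durations(durations):
    """Return p50/p95/max for one date bucket, or None if it has no runs."""
    values = sorted(durations)
    if not values:
        return None
    return {
        "p50": nearest_rank(values, 50),
        "p95": nearest_rank(values, 95),
        "max": values[-1],
    }
```

Integer arithmetic for the rank avoids floating-point rounding at bucket boundaries.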

Also includes prior unstaged work: author normalization, merge queue
polling, attribution improvements, test history API, proxy fixes.

Instrument every major CI phase (build, test, bench, cache ops, per-project
builds) with timing data published to Redis and stored in SQLite.

- Add ci3/source_phases: wraps cache_download/cache_upload as functions
  that shadow the scripts on $PATH, auto-publishing timing to Redis.
  Provides ci_phase() for wrapping arbitrary commands.
- Source it from source_bootstrap so all bootstrap scripts get it.
- Wrap build/test/bench and serial project calls in bootstrap.sh.
- Add ci_phases table to metrics DB.
- Add phase listener subscribing to ci:phase:complete channel.
- Add /api/ci/phases endpoint with by_phase, by_date, recent_runs views.
- Add stacked phase breakdown chart to ci-insights dashboard.
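
On the receiving side, the phase listener's job is roughly to turn each `ci:phase:complete` message into a `ci_phases` row. A sketch under assumptions: the JSON payload fields and table columns below are hypothetical, and the Redis subscription itself is left out so the handler stays self-contained.

```python
import json
import sqlite3

# Hypothetical columns: the PR names the table but not its schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ci_phases (
    run_id TEXT, dashboard TEXT, phase TEXT,
    duration_secs REAL, finished_at TEXT
)
"""


def handle_phase_message(conn, raw):
    """Store one ci:phase:complete payload; return False for malformed messages."""
    try:
        msg = json.loads(raw)
        row = (
            msg["run_id"],
            msg.get("dashboard", ""),
            msg["phase"],
            float(msg["duration_secs"]),
            msg.get("finished_at", ""),
        )
    except (ValueError, KeyError, TypeError):
        # Bad JSON, missing field, or wrong type: drop rather than crash the listener.
        return False
    conn.execute("INSERT INTO ci_phases VALUES (?, ?, ?, ?, ?)", row)
    conn.commit()
    return True
```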
ludamad requested a review from charlielye as a code owner February 19, 2026 22:38
ludamad added the ci-draft label (Run CI on draft PRs.) Feb 19, 2026
ludamad marked this pull request as draft February 19, 2026 22:40
ludamad and others added 5 commits February 19, 2026 22:48
Show avg CI time per pipeline (next, prs, nightly, etc.) as stacked bars
where each segment is a phase (build, test, barretenberg, cache ops).
Total time displayed above each bar. API returns by_dashboard breakdown.
- API returns total_secs (not avg) per phase per dashboard
- Filter out cache-download/cache-upload noise from chart (project
  ci_phase wrappers capture meaningful build time)
- Add per-circuit timing in noir-protocol-circuits (circuit:{name})
- Add per-contract timing in noir-contracts (contract:{name})
- Skip publishing phases with 0s duration (cached no-ops)
- Add stacked "total CI time by pipeline" chart (hours/day)
- API returns total_duration_mins per pipeline per date
- Trivial bb change to invalidate cache and trigger real builds
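
The by_dashboard aggregation described in these commits (total seconds per phase per pipeline, with cache noise filtered out) might be computed along these lines. The row shape and function name are assumptions, and the sketch folds the zero-duration skip into the aggregation even though the PR applies it at publish time.

```python
from collections import defaultdict

# Phase names the chart filters out, per the commit message above.
NOISE_PHASES = {"cache-download", "cache-upload"}


def total_secs_by_dashboard(rows):
    """Sum duration per phase per dashboard from (dashboard, phase, secs) rows."""
    totals = defaultdict(lambda: defaultdict(float))
    for dashboard, phase, secs in rows:
        if phase in NOISE_PHASES or secs <= 0:
            continue
        totals[dashboard][phase] += secs
    # Convert to plain dicts so the result serializes cleanly to JSON.
    return {dashboard: dict(phases) for dashboard, phases in totals.items()}
```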