Skip to content

Conversation

@suvodeep-pyne
Copy link
Contributor

@suvodeep-pyne suvodeep-pyne commented Jan 27, 2026

Summary

  • Add TenantTagReadinessCallback to verify broker has valid tenant tags (not just broker_untagged)
  • Add RoutingReadinessCallback to verify routing exists for all online tables before reporting healthy
  • Register both callbacks in BaseBrokerStarter.registerServiceStatusHandler() with documented ordering dependency
  • Use ExternalView (actual state) instead of IdealState (desired state) in RoutingReadinessCallback, following the Helix layer separation principle

Background

This addresses a production issue where queries could be routed to brokers that were not yet ready to serve traffic, causing query failures. The root cause is that the existing health check only validates Helix state convergence but doesn't verify:

  1. The broker has valid tenant tags assigned
  2. Routing tables are built for assigned tables

Test plan

  • Unit tests for TenantTagReadinessCallback (4 tests)
  • Unit tests for RoutingReadinessCallback (6 tests including ExternalView API verification)
  • Integration test in HelixBrokerStarterTest

@suvodeep-pyne suvodeep-pyne changed the title Add health check callbacks for tenant tags and routing readiness [broker] Add health check callbacks for tenant tags and routing readiness Jan 27, 2026
@codecov-commenter
Copy link

codecov-commenter commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 75.70093% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.12%. Comparing base (7305eec) to head (d8d03a9).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
.../api/resources/PinotBrokerHealthCheckResource.java 0.00% 14 Missing ⚠️
...roker/broker/helix/TenantTagReadinessCallback.java 80.00% 3 Missing and 4 partials ⚠️
.../broker/broker/helix/RoutingReadinessCallback.java 89.58% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17580      +/-   ##
============================================
- Coverage     63.16%   63.12%   -0.05%     
- Complexity     1479     1495      +16     
============================================
  Files          3173     3175       +2     
  Lines        189917   190210     +293     
  Branches      29064    29091      +27     
============================================
+ Hits         119970   120072     +102     
- Misses        60621    60792     +171     
- Partials       9326     9346      +20     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.12% <75.70%> (-0.02%) ⬇️
java-21 34.04% <75.70%> (-29.09%) ⬇️
temurin 63.12% <75.70%> (-0.05%) ⬇️
unittests 63.12% <75.70%> (-0.05%) ⬇️
unittests1 55.50% <100.00%> (-0.04%) ⬇️
unittests2 34.04% <75.70%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@suvodeep-pyne suvodeep-pyne marked this pull request as draft January 29, 2026 01:43
This change ensures that broker health checks take tenant tags and routing
readiness into account before reporting the broker as healthy.

Two new ServiceStatusCallback implementations are added:
- TenantTagReadinessCallback: Verifies the broker has valid tenant tags
  (not just broker_untagged)
- RoutingReadinessCallback: Verifies routing exists for all assigned tables

This addresses a production issue where queries could be routed to brokers
that were not yet ready to serve traffic, causing query failures.

The callbacks are registered before the IdealState/ExternalView match
callbacks, with TenantTagReadinessCallback preceding RoutingReadinessCallback
since an untagged broker has no tables assigned.
…lState

ExternalView represents the actual cluster state (what Helix participants
have reported), while IdealState represents the desired state. Participants
should use ExternalView to determine their responsibilities, following the
Helix layer separation principle.
Introduce `/health/readiness` to check broker readiness status. Add shutdown detection mechanism to handle graceful service degradation during broker shutdown. Update service status callbacks to include readiness checks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants