-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add pluggable table samplers with precomputed broker routing entries and tableSampler query option #17532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
cd4e1c6 to
19f856b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17532 +/- ##
============================================
+ Coverage 63.18% 63.21% +0.03%
- Complexity 1477 1479 +2
============================================
Files 3172 3178 +6
Lines 189773 190276 +503
Branches 29041 29141 +100
============================================
+ Hits 119913 120288 +375
- Misses 60547 60630 +83
- Partials 9313 9358 +45
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
19f856b to
758dda4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a pluggable table sampling feature that enables deterministic sampling of segments at the broker routing layer to reduce query-time overhead for large tables. The implementation precomputes sampler-specific routing entries and allows query-time selection via a tableSampler query option.
Changes:
- Introduced
TableSamplerConfigin table configuration with two built-in sampler types:firstN(lexicographic selection) andnPerDay(temporal bucketing) - Extended broker routing manager to build and cache sampler-specific routing entries alongside default routing
- Added MSQ support by propagating query options to leaf routing requests
- Included ZooKeeper serialization/deserialization for table sampler configurations
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
pinot-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json |
Added sample tableSamplers configuration to quickstart example |
pinot-spi/src/main/java/org/apache/pinot/spi/utils/builder/TableConfigBuilder.java |
Added builder support for table samplers |
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java |
Registered tableSampler query option constant |
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/sampler/TableSamplerConfig.java |
New configuration class for table sampler definitions |
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/TableConfig.java |
Extended table config to include table samplers list |
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java |
Updated test constructors with new table sampler parameter |
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/CLPForwardIndexCreatorTest.java |
Updated test constructor with new table sampler parameter |
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java |
Propagated query options to MSQ leaf routing for sampler support |
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/TableSamplerIntegrationTest.java |
Integration test validating nPerDay sampler behavior |
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataWriter.scala |
Updated constructor with new table sampler parameter |
pinot-common/src/main/java/org/apache/pinot/common/utils/config/TableConfigSerDeUtils.java |
Added ZK serialization/deserialization for table samplers |
pinot-broker/src/test/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSamplerTest.java |
Unit tests for nPerDay sampler including timezone handling |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSamplerFactory.java |
Factory for creating table sampler instances |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSampler.java |
Interface defining table sampler contract |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java |
Implementation selecting N segments per day using ZK metadata |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/FirstNSegmentsTableSampler.java |
Implementation selecting first N segments lexicographically |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/segmentpreselector/TableSamplerSegmentPreSelector.java |
Wrapper applying table sampler to pre-selected segments |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java |
Core routing logic to build, cache, and select sampler-specific routing entries |
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java
Outdated
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Show resolved
Hide resolved
758dda4 to
adfc8ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java
Outdated
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java
Show resolved
Hide resolved
adfc8ba to
bed4ff3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java:1
- Line 158 uses the incorrect constant
Segment.TIME_TIME_UNITinstead ofSegment.TIME_UNIT. This will cause the code to fail to retrieve the time unit field from segment metadata, preventing epoch-zero segments from being correctly sampled.
/**
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/TableConfig.java
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
bed4ff3 to
72a336d
Compare
0301e99 to
6241ba7
Compare
6241ba7 to
02b4512
Compare
02b4512 to
9480a27
Compare
Motivation
Large-table sampling needs to be deterministic and avoid query-time segment selection overhead. This adds a pluggable “table sampler” definition in table config and precomputes sampler-specific routing entries at the broker.
Key changes
Config: add
tableSamplerstoTableConfig(with ZK SerDe + builder support) and newTableSamplerConfig.Query option: add
tableSampler=<name>to select a sampler at query time.Broker routing:
tableSampler(fallback to default routing entry when absent/unknown).Built-in samplers:
firstN: select first N segments (lexicographic)timeBucket: select up to N segments per time bucket (days or hours), deterministically.tableSamplerworks with multi-stage engine.MSQ support: propagate query options into MSQ leaf routing, so
tableSamplerworks with the multi‑stage engine.Quickstart: add a sample
tableSamplersconfig to batch airlineStats.Tests:
timeBucketQuickstart: add sample
tableSamplersconfig to batch airlineStats table config.How to use
1. Add samplers to your table config
Example (offline table):
2. Query with a sampler (via query option)
Use the query option: tableSampler=
queryOptions: "tableSampler=perDay1"3. Sampler details
firstN
properties.numSegments(required, positive)timeBucket
properties.numSegmentsPerDay(required, positive)properties.bucketDays(optional, default 1)properties.numSegmentsPerHour(required, positive)properties.bucketHours(optional, default 1)4. Default behavior (no sampler selected)
If you don’t set
tableSampler, Pinot uses the default routing entry (full table, no sampling).Compatibility