Contributor

@xiangfu0 xiangfu0 commented Jan 19, 2026

Motivation

Large-table sampling needs to be deterministic and avoid query-time segment selection overhead. This adds a pluggable “table sampler” definition in table config and precomputes sampler-specific routing entries at the broker.

Key changes

  • Config: add tableSamplers to TableConfig (with ZK SerDe + builder support) and new TableSamplerConfig.

  • Query option: add tableSampler=<name> to select a sampler at query time.

  • Broker routing:

    • Build and cache sampler-specific routing entries per table sampler name.
    • Select sampler routing entry based on tableSampler (fallback to default routing entry when absent/unknown).
    • Keep sampler routing in sync on assignment changes, instance include/exclude, and segment refresh.
  • Built-in samplers:

    • firstN: select first N segments (lexicographic)
    • timeBucket: select up to N segments per time bucket (days or hours), deterministically.
      • For OFFLINE: bucket by segment end time from ZK metadata.
      • For REALTIME: derive timestamp from segment name (LLC / uploaded realtime), avoiding ZK reads.
      • Optional partition-aware sampling when a single segment partition column is configured.
  • MSQ support: propagate query options into MSQ leaf routing, so tableSampler works with the multi‑stage engine.

  • Quickstart: add a sample tableSamplers config to the batch airlineStats table config.

  • Tests:

    • Unit test for timeBucket
    • Integration test (shared cluster) validating 10 segments/day × 7 days → sampler returns 1 segment/day and group-by results reflect that

How to use

1. Add samplers to your table config

Example (offline table):

"tableSamplers": [
  {
    "name": "small",
    "type": "firstN",
    "properties": {
      "numSegments": "10"
    }
  },
  {
    "name": "perDay1",
    "type": "timeBucket",
    "properties": {
      "numSegmentsPerDay": "1",
      "bucketDays": "1"
    }
  },
  {
    "name": "perHour2",
    "type": "timeBucket",
    "properties": {
      "numSegmentsPerHour": "2",
      "bucketHours": "1"
    }
  }
]

2. Query with a sampler (via query option)
Use the query option tableSampler=<name>, where <name> matches a sampler name defined in the table config.

  • Pinot SQL:
SET tableSampler=small;
SELECT COUNT(*) FROM myTable;

SET tableSampler=perDay1;
SELECT DaysSinceEpoch, COUNT(*) FROM myTable GROUP BY DaysSinceEpoch;
  • QueryOptions:
    • queryOptions: "tableSampler=perDay1"
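For clients that submit queries over HTTP, the option can travel in the request body alongside the SQL. The sketch below only builds that body. It assumes the broker accepts a JSON payload with `sql` and `queryOptions` fields; the helper name `build_query_payload` is hypothetical and not part of this PR.

```python
import json
from typing import Optional


def build_query_payload(sql: str, table_sampler: Optional[str] = None) -> str:
    """Build a JSON request body for a broker SQL query (assumed payload shape)."""
    payload = {"sql": sql}
    if table_sampler:
        # Same key=value form as the SET statement shown above.
        payload["queryOptions"] = f"tableSampler={table_sampler}"
    return json.dumps(payload)


body = build_query_payload("SELECT COUNT(*) FROM myTable", "perDay1")
```

Omitting `table_sampler` leaves `queryOptions` out entirely, which corresponds to querying the full table.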

3. Sampler details

firstN

  • Purpose: Deterministic, small subset of segments.
  • Config
    • properties.numSegments (required, positive)
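As a rough illustration of the firstN behavior described above, the following sketch sorts segment names and keeps the first N. It is not the code in FirstNSegmentsTableSampler; the function name is hypothetical, and it assumes plain string ordering of ASCII segment names.

```python
from typing import Iterable, List


def first_n_segments(segment_names: Iterable[str], num_segments: int) -> List[str]:
    """Return the lexicographically first `num_segments` segment names."""
    if num_segments <= 0:
        # Mirrors the "required, positive" constraint on properties.numSegments.
        raise ValueError("numSegments must be positive")
    return sorted(segment_names)[:num_segments]


sample = first_n_segments(["seg_10", "seg_2", "seg_1"], 2)
```

Because the ordering is on names rather than numeric suffixes, "seg_10" sorts before "seg_2". The same input always yields the same subset, which is what makes the sampler deterministic.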

timeBucket

  • Purpose: Select up to N segments per time bucket.
  • Config (choose day or hour mode)
    • Day mode:
      • properties.numSegmentsPerDay (required, positive)
      • properties.bucketDays (optional, default 1)
    • Hour mode:
      • properties.numSegmentsPerHour (required, positive)
      • properties.bucketHours (optional, default 1)
  • Notes
    • Buckets are computed in UTC.
    • If a single partition column is configured, selection is per‑partition within each bucket.
    • Segments without valid timestamps (or invalid partition metadata when partition‑aware) are skipped.
    • Selection is deterministic: lexicographically first N segment names per bucket/partition.
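The day-mode rules above can be sketched roughly as follows. This is a simplified illustration, not the TimeBucketSegmentsTableSampler implementation: it assumes each segment comes with an end time in epoch milliseconds (UTC), treats `None` or negative values as invalid timestamps, and leaves out partition-aware selection. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def time_bucket_sample(
    segment_end_times: Dict[str, Optional[int]],
    num_segments_per_day: int,
    bucket_days: int = 1,
) -> List[str]:
    """Pick up to N segment names per UTC day bucket, lexicographically first."""
    bucket_width = bucket_days * MILLIS_PER_DAY
    buckets = defaultdict(list)
    for name, end_ms in segment_end_times.items():
        if end_ms is None or end_ms < 0:
            continue  # assumed rule: skip segments without a valid timestamp
        buckets[end_ms // bucket_width].append(name)
    selected = []
    for bucket_id in sorted(buckets):
        selected.extend(sorted(buckets[bucket_id])[:num_segments_per_day])
    return selected


picked = time_bucket_sample(
    {"s_b": 1000, "s_a": 2000, "s_c": MILLIS_PER_DAY + 5, "s_d": None},
    num_segments_per_day=1,
)
```

Here "s_a" and "s_b" fall in the same day bucket, so only "s_a" is kept; "s_c" is the only segment in the next bucket, and "s_d" is skipped for lacking a timestamp. Hour mode would follow the same shape with an hour-sized bucket width.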

4. Default behavior (no sampler selected)
If you don’t set tableSampler, Pinot uses the default routing entry (full table, no sampling).
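The selection and fallback behavior can be pictured as a lookup keyed by sampler name. The sketch below is a simplified model under stated assumptions, not the BaseBrokerRoutingManager code: routing entries are represented as plain lists of segment names, and the function and parameter names are hypothetical.

```python
from typing import Dict, List


def select_routing_entry(
    default_entry: List[str],
    sampler_entries: Dict[str, List[str]],
    query_options: Dict[str, str],
) -> List[str]:
    """Return the precomputed entry for the requested sampler, else the default."""
    sampler_name = query_options.get("tableSampler")
    if not sampler_name:
        return default_entry  # no sampler selected: full table
    # An unknown sampler name also falls back to the default entry.
    return sampler_entries.get(sampler_name, default_entry)


full = ["seg_1", "seg_2", "seg_3"]
entries = {"small": ["seg_1"]}
chosen = select_routing_entry(full, entries, {"tableSampler": "small"})
```

Because the sampler entries are built ahead of time, the per-query cost in this model is a single dictionary lookup rather than a pass over all segments.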

Compatibility

  • Fully backward compatible: if no sampler is configured or selected, routing behavior is unchanged.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from cd4e1c6 to 19f856b Compare January 19, 2026 14:30
@xiangfu0 xiangfu0 marked this pull request as draft January 19, 2026 14:35
@codecov-commenter

codecov-commenter commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 50.65963% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.21%. Comparing base (d54ec21) to head (9480a27).
⚠️ Report is 27 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...oker/routing/manager/BaseBrokerRoutingManager.java | 22.05% | 98 Missing and 8 partials ⚠️ |
| ...g/tablesampler/TimeBucketSegmentsTableSampler.java | 88.31% | 2 Missing and 16 partials ⚠️ |
| ...uting/tablesampler/FirstNSegmentsTableSampler.java | 0.00% | 15 Missing ⚠️ |
| ...org/apache/pinot/spi/config/table/TableConfig.java | 31.25% | 10 Missing and 1 partial ⚠️ |
| .../org/apache/pinot/query/routing/WorkerManager.java | 64.28% | 8 Missing and 2 partials ⚠️ |
| ...t/spi/config/table/sampler/TableSamplerConfig.java | 0.00% | 8 Missing ⚠️ |
| ...oker/routing/tablesampler/TableSamplerFactory.java | 0.00% | 7 Missing ⚠️ |
| ...entpreselector/TableSamplerSegmentPreSelector.java | 0.00% | 6 Missing ⚠️ |
| ...not/common/utils/config/TableConfigSerDeUtils.java | 42.85% | 2 Missing and 2 partials ⚠️ |
| ...he/pinot/spi/utils/builder/TableConfigBuilder.java | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17532      +/-   ##
============================================
+ Coverage     63.18%   63.21%   +0.03%     
- Complexity     1477     1479       +2     
============================================
  Files          3172     3178       +6     
  Lines        189773   190276     +503     
  Branches      29041    29141     +100     
============================================
+ Hits         119913   120288     +375     
- Misses        60547    60630      +83     
- Partials       9313     9358      +45     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.16% <50.65%> (+0.05%) ⬆️
java-21 63.12% <50.65%> (-0.05%) ⬇️
temurin 63.21% <50.65%> (+0.03%) ⬆️
unittests 63.21% <50.65%> (+0.03%) ⬆️
unittests1 55.51% <42.62%> (-0.02%) ⬇️
unittests2 34.12% <45.91%> (+0.09%) ⬆️


@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 19f856b to 758dda4 Compare January 20, 2026 08:23
@xiangfu0 xiangfu0 requested a review from Copilot January 20, 2026 08:33
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a pluggable table sampling feature that enables deterministic sampling of segments at the broker routing layer to reduce query-time overhead for large tables. The implementation precomputes sampler-specific routing entries and allows query-time selection via a tableSampler query option.

Changes:

  • Introduced TableSamplerConfig in table configuration with two built-in sampler types: firstN (lexicographic selection) and nPerDay (temporal bucketing)
  • Extended broker routing manager to build and cache sampler-specific routing entries alongside default routing
  • Added MSQ support by propagating query options to leaf routing requests
  • Included ZooKeeper serialization/deserialization for table sampler configurations

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pinot-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json Added sample tableSamplers configuration to quickstart example
pinot-spi/src/main/java/org/apache/pinot/spi/utils/builder/TableConfigBuilder.java Added builder support for table samplers
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java Registered tableSampler query option constant
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/sampler/TableSamplerConfig.java New configuration class for table sampler definitions
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/TableConfig.java Extended table config to include table samplers list
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java Updated test constructors with new table sampler parameter
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/CLPForwardIndexCreatorTest.java Updated test constructor with new table sampler parameter
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java Propagated query options to MSQ leaf routing for sampler support
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/TableSamplerIntegrationTest.java Integration test validating nPerDay sampler behavior
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataWriter.scala Updated constructor with new table sampler parameter
pinot-common/src/main/java/org/apache/pinot/common/utils/config/TableConfigSerDeUtils.java Added ZK serialization/deserialization for table samplers
pinot-broker/src/test/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSamplerTest.java Unit tests for nPerDay sampler including timezone handling
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSamplerFactory.java Factory for creating table sampler instances
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSampler.java Interface defining table sampler contract
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java Implementation selecting N segments per day using ZK metadata
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/FirstNSegmentsTableSampler.java Implementation selecting first N segments lexicographically
pinot-broker/src/main/java/org/apache/pinot/broker/routing/segmentpreselector/TableSamplerSegmentPreSelector.java Wrapper applying table sampler to pre-selected segments
pinot-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java Core routing logic to build, cache, and select sampler-specific routing entries

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 758dda4 to adfc8ba Compare January 20, 2026 11:49
@xiangfu0 xiangfu0 requested a review from Copilot January 20, 2026 15:06
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from adfc8ba to bed4ff3 Compare January 21, 2026 18:25
@xiangfu0 xiangfu0 requested a review from Copilot January 21, 2026 18:25
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java:1

  • Line 158 uses the incorrect constant Segment.TIME_TIME_UNIT instead of Segment.TIME_UNIT. This will cause the code to fail to retrieve the time unit field from segment metadata, preventing epoch-zero segments from being correctly sampled.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from bed4ff3 to 72a336d Compare January 21, 2026 18:42
@xiangfu0 xiangfu0 marked this pull request as ready for review January 21, 2026 18:42
@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch 5 times, most recently from 0301e99 to 6241ba7 Compare January 22, 2026 19:02
@xiangfu0 xiangfu0 changed the title from "Add table sampler routing entries (precomputed segment subsets) with nPerDay sampler and tableSampler query option" to "Add table sampler routing entries (precomputed segment subsets) with timeBucket sampler and tableSampler query option" Jan 22, 2026
@xiangfu0 xiangfu0 changed the title from "Add table sampler routing entries (precomputed segment subsets) with timeBucket sampler and tableSampler query option" to "Add pluggable table samplers with precomputed broker routing entries and tableSampler query option" Jan 22, 2026
@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 6241ba7 to 02b4512 Compare January 23, 2026 10:29
@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 02b4512 to 9480a27 Compare January 30, 2026 03:59
2 participants