Implement preimage for floor function to enable predicate pushdown #20059
comphead merged 9 commits into apache:main
This adds a `preimage` implementation for the `floor()` function that transforms `floor(x) = N` into `x >= N AND x < N+1`. This enables statistics-based predicate pushdown for queries using `floor()`.

For example, a query like:

`SELECT * FROM t WHERE floor(price) = 100`

is rewritten to:

`SELECT * FROM t WHERE price >= 100 AND price < 101`

This allows the query engine to leverage min/max statistics from Parquet row groups, significantly reducing the amount of data scanned.

Benchmarks on the ClickBench hits dataset show:
- 80% file pruning (89 out of 111 files skipped)
- 70x fewer rows scanned (1.4M vs 100M)
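The rewrite described above can be sketched in a few lines of plain Rust. This is only an illustration under stated assumptions, not the code merged in this PR: `floor_preimage`, `Stats`, and `may_match` are hypothetical names, and the sketch handles `f64` only. It shows how an equality on `floor(x)` becomes a half-open range, and how that range can be checked against per-row-group min/max statistics.

```rust
/// Half-open interval [lower, upper) of inputs that `floor` maps to `n`.
/// Hypothetical helper for illustration; not the DataFusion preimage API.
fn floor_preimage(n: f64) -> Option<(f64, f64)> {
    // floor() only returns integral values, so a non-integral or
    // non-finite literal has no usable preimage interval.
    if !n.is_finite() || n.fract() != 0.0 {
        return None;
    }
    let upper = n + 1.0;
    // For large magnitudes an f64 cannot represent n + 1 exactly;
    // give up instead of producing an empty or wrong range.
    if upper - n != 1.0 {
        return None;
    }
    Some((n, upper))
}

/// Column min/max for one row group (an assumed, simplified stand-in
/// for Parquet row-group statistics).
struct Stats {
    min: f64,
    max: f64,
}

/// True if some value in [stats.min, stats.max] could fall in [lower, upper).
/// A row group for which this returns false can be skipped entirely.
fn may_match(stats: &Stats, lower: f64, upper: f64) -> bool {
    stats.max >= lower && stats.min < upper
}
```

With `floor(price) = 100`, the sketch yields the range `[100, 101)`, so a row group whose `price` statistics are entirely below 100 or at or above 101 never needs to be read. The original `floor(price) = 100` form gives the pruner nothing to compare against the statistics, which is why the rewrite is what makes pruning possible.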
comphead
left a comment
The PR looks good to me. Thanks @devanshu0987, and thanks @masonh22 for the review. Decimal type support makes sense. WDYT, can decimals be addressed in this PR or in a follow-up?
Can you advise on what I should do?
@devanshu0987 that's a great point that I didn't consider. My first thought would be to always rescale the min to whatever scale is used by the max using something like |
If decimal support for the floor preimage is not trivial and makes this PR much more complex, we should address it in a follow-up IMO. I created #20080
@devanshu0987 please address changes for |
This feels like it will work.
Hi @comphead, I have taken care of the comments. Please take another look.
Hi @devanshu0987, I didn't have a chance to take a close look at everything, but I saw you're using unit tests instead of slt tests. I'm working on a similar issue on |
comphead
left a comment
Thanks everyone for helping this PR 👍
Correctness is proved by the existing floor slt tests.
FYI @sdf-jkl |
Which issue does this PR close?
Rationale for this change
#19946
This epic introduced the pre-image API. This PR uses that API to implement preimage for the `floor` function where it is applicable.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
No