refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238
Open · andygrove wants to merge 8 commits into apache:main from andygrove:unsigned-small-int-safety-check
Conversation
refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck

This change renames `spark.comet.scan.allowIncompatible` to `spark.comet.scan.unsignedSmallIntSafetyCheck` and changes its default to `true` (enabled by default). The key change is that ByteType is removed from the safety check entirely, leaving only ShortType subject to fallback behavior.

## Why ByteType is Safe

ByteType columns are always safe for native execution because:

1. **Parquet type mapping**: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType.
2. **UINT_8 maps to ShortType**: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127).
3. **Truncation preserves signed values**: When storing signed INT8 in 8 bits, truncation from any wider representation preserves the correct signed value due to two's complement representation.

## Why ShortType Needs the Safety Check

ShortType columns may be problematic because:

1. **Ambiguous origin**: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible).
2. **Different reader behavior**: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read the data as unsigned, while Spark ignores the logical type and reads it as signed. This can produce different results for values 128-255.
3. **No metadata available**: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns.

Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
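The signed/unsigned divergence for values 128-255 can be sketched in a few lines of Rust (illustrative only; the function names below are placeholders, not Comet's actual code):

```rust
// Sketch: how the same stored byte diverges depending on whether a reader
// honors or ignores the Parquet UINT_8 logical type.
fn read_as_unsigned(raw: u8) -> i16 {
    // An Arrow-based reader honors the unsigned logical type: values 0..=255.
    raw as i16
}

fn read_as_signed(raw: u8) -> i16 {
    // A reader that ignores the logical type reinterprets the bits as signed INT8.
    (raw as i8) as i16
}

fn main() {
    // Values 0-127 agree under both interpretations...
    assert_eq!(read_as_unsigned(100), read_as_signed(100));
    // ...but 128-255 diverge: the byte 0xC8 is 200 unsigned, -56 signed.
    assert_eq!(read_as_unsigned(200), 200);
    assert_eq!(read_as_signed(200), -56);
    println!(
        "200 as unsigned: {}, as signed: {}",
        read_as_unsigned(200),
        read_as_signed(200)
    );
}
```

This is also why the commit argues ByteType is safe: a signed INT8 round-trips through its 8-bit two's complement representation unchanged, whereas a ShortType column may hide an unsigned UINT_8 origin where the two interpretations differ.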
- Use local `root_op` variable instead of unwrapping `exec_context.root_op`
- Replace `is_some()` + `unwrap()` pattern with `if let Some(...)`

Co-Authored-By: Claude Opus 4.5 <[email protected]>
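The refactor this commit describes can be sketched as follows (the struct and field names are illustrative placeholders modeled on the commit message, not Comet's actual types):

```rust
// Hypothetical stand-ins for the types the commit message refers to.
struct Plan {
    name: String,
}

struct ExecContext {
    root_op: Option<Plan>,
}

// Before: check is_some(), then unwrap() — two lookups, and a panic risk
// if the check and the unwrap ever drift apart.
fn root_name_before(ctx: &ExecContext) -> Option<String> {
    if ctx.root_op.is_some() {
        let root_op = ctx.root_op.as_ref().unwrap();
        Some(root_op.name.clone())
    } else {
        None
    }
}

// After: bind a local `root_op` with `if let Some(...)` — one pattern match,
// no unwrap() to get out of sync with the check.
fn root_name_after(ctx: &ExecContext) -> Option<String> {
    if let Some(root_op) = &ctx.root_op {
        Some(root_op.name.clone())
    } else {
        None
    }
}

fn main() {
    let ctx = ExecContext {
        root_op: Some(Plan { name: "scan".to_string() }),
    };
    assert_eq!(root_name_before(&ctx), root_name_after(&ctx));

    let empty = ExecContext { root_op: None };
    assert_eq!(root_name_after(&empty), None);
}
```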
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## main #3238 +/- ##
============================================
+ Coverage 56.12% 60.07% +3.95%
- Complexity 976 1437 +461
============================================
Files 119 172 +53
Lines 11743 15929 +4186
Branches 2251 2631 +380
============================================
+ Hits 6591 9570 +2979
- Misses 4012 5031 +1019
- Partials 1140 1328 +188