Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Jan 22, 2026

Summary

  • Renames spark.comet.scan.allowIncompatible to spark.comet.scan.unsignedSmallIntSafetyCheck
  • Changes default from false to true (safety check enabled by default)
  • Removes ByteType from the safety check, leaving only ShortType subject to fallback

Why ByteType is Safe

ByteType columns are always safe for native execution because:

  1. Parquet type mapping: Spark's ByteType can only originate from signed INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps to ByteType.

  2. UINT_8 maps to ShortType: When Parquet files contain unsigned UINT_8 columns, Spark maps them to ShortType (16-bit), not ByteType. This is because UINT_8 values (0-255) exceed the signed byte range (-128 to 127).

  3. Truncation preserves signed values: When storing signed INT8 in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation.

Why ShortType Needs the Safety Check

ShortType columns may be problematic because:

  1. Ambiguous origin: ShortType can come from either signed INT16 (safe) or unsigned UINT_8 (potentially incompatible).

  2. Different reader behavior: Arrow-based readers like DataFusion respect the unsigned UINT_8 logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255.

  3. No metadata available: At scan time, Comet cannot determine whether a ShortType column originated from INT16 or UINT_8, so the safety check conservatively falls back to Spark for all ShortType columns.

Users who know their data does not contain unsigned UINT_8 columns can disable the safety check with spark.comet.scan.unsignedSmallIntSafetyCheck=false.

🤖 Generated with Claude Code

…yCheck

This change renames `spark.comet.scan.allowIncompatible` to
`spark.comet.scan.unsignedSmallIntSafetyCheck` and changes its default
to `true` (enabled by default).

The key change is that ByteType is removed from the safety check entirely,
leaving only ShortType subject to fallback behavior.

## Why ByteType is Safe

ByteType columns are always safe for native execution because:

1. **Parquet type mapping**: Spark's ByteType can only originate from signed
   INT8 in Parquet. There is no unsigned 8-bit Parquet type (UINT_8) that maps
   to ByteType.

2. **UINT_8 maps to ShortType**: When Parquet files contain unsigned UINT_8
   columns, Spark maps them to ShortType (16-bit), not ByteType. This is
   because UINT_8 values (0-255) exceed the signed byte range (-128 to 127).

3. **Truncation preserves signed values**: When storing signed INT8 in 8 bits,
   the truncation from any wider representation preserves the correct signed
   value due to two's complement representation.

## Why ShortType Needs the Safety Check

ShortType columns may be problematic because:

1. **Ambiguous origin**: ShortType can come from either signed INT16 (safe) or
   unsigned UINT_8 (potentially incompatible).

2. **Different reader behavior**: Arrow-based readers like DataFusion respect
   the unsigned UINT_8 logical type and read data as unsigned, while Spark
   ignores the logical type and reads as signed. This can produce different
   results for values 128-255.

3. **No metadata available**: At scan time, Comet cannot determine whether a
   ShortType column originated from INT16 or UINT_8, so the safety check
   conservatively falls back to Spark for all ShortType columns.

Users who know their data does not contain unsigned UINT_8 columns can disable
the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@andygrove andygrove marked this pull request as ready for review January 22, 2026 01:55
@andygrove andygrove marked this pull request as draft January 22, 2026 02:18
andygrove and others added 5 commits January 22, 2026 07:34
- Use local `root_op` variable instead of unwrapping `exec_context.root_op`
- Replace `is_some()` + `unwrap()` pattern with `if let Some(...)`

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@andygrove andygrove modified the milestone: 0.13.0 Jan 22, 2026
@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.07%. Comparing base (f09f8af) to head (66e3186).
⚠️ Report is 878 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3238      +/-   ##
============================================
+ Coverage     56.12%   60.07%   +3.95%     
- Complexity      976     1437     +461     
============================================
  Files           119      172      +53     
  Lines         11743    15929    +4186     
  Branches       2251     2631     +380     
============================================
+ Hits           6591     9570    +2979     
- Misses         4012     5031    +1019     
- Partials       1140     1328     +188     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove marked this pull request as ready for review January 24, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants