-
Notifications
You must be signed in to change notification settings - Fork 66
Open
Labels
Description
What happens?
Filtering on a string column with WHERE length(col) = <integer> via iceberg_scan() returns 0 rows starting in DuckDB 1.4.2. The same query returns the correct results in DuckDB 1.4.1.
The bug does not affect read_parquet() on the same underlying files. It also does not affect equivalent predicates written as WHERE length(col) > N-1 AND length(col) < N+1 or WHERE length(col) IN (N) — both of those return the correct rows via iceberg_scan() in all versions.
To Reproduce
import duckdb
import pyarrow as pa
import pyiceberg
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType
print(f"DuckDB version: {duckdb.__version__}")
print(f"PyIceberg version: {pyiceberg.__version__}")
print(f"PyArrow version: {pa.__version__}")
WAREHOUSE_PATH = "/tmp/iceberg_debug"
NAMESPACE = "ns"
TABLE_NAME = "items"
catalog = SqlCatalog(
"local",
uri="sqlite:///:memory:",
warehouse=f"file://{WAREHOUSE_PATH}",
)
catalog.create_namespace_if_not_exists(NAMESPACE)
schema = Schema(NestedField(1, "item_id", StringType()))
full_table_name = f"{NAMESPACE}.{TABLE_NAME}"
if catalog.table_exists(full_table_name):
catalog.drop_table(full_table_name)
iceberg_table = catalog.create_table(full_table_name, schema=schema)
# Two rows with length=36 (should match), one with 37 and one with 35 (should not match)
iceberg_table.append(pa.table({"item_id": [
"10000000-0000-0000-0000-000000000001", # length=36 ← should match
"20000000-0000-0000-0000-000000000002", # length=36 ← should match
"30000000-0000-0000-0000-0000000000033", # length=37 ← should NOT match
"40000000-0000-0000-0000-00000000004", # length=35 ← should NOT match
]}))
con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg; SET unsafe_enable_version_guessing = true;")
SCAN = f"iceberg_scan('{iceberg_table.location()}')"
print("\n-- Query 1 (broken): equality predicate on length()")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) = 36").pl())
print("\n-- Query 2 (workaround): range predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) > 35 AND length(item_id) < 37").pl())
print("\n-- Query 3 (workaround): IN predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) IN (36)").pl())
parquet_glob = f"{WAREHOUSE_PATH}/{NAMESPACE}/{TABLE_NAME}/data/*.parquet"
print("\n-- Query 4 (workaround): read_parquet with same equality predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM read_parquet('{parquet_glob}') WHERE length(item_id) = 36").pl())
catalog.purge_table(full_table_name)
catalog.drop_namespace(NAMESPACE)Output for DuckDB 1.4.1:
-- Query 1 (broken): equality predicate on length()
shape: (2, 2)
┌──────────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000000001 ┆ 36 │
│ 20000000-0000-0000-0000-000000000002 ┆ 36 │
└──────────────────────────────────────┴─────┘
-- Query 2 (workaround): range predicate
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 3 (workaround): IN predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 4 (workaround): read_parquet with same equality predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
Output for DuckDB 1.4.2+ (Tested for each version up to and including 1.5.0):
-- Query 1 (broken): equality predicate on length()
shape: (0, 2)
┌─────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════╪═════╡
└─────────┴─────┘
-- Query 2 (workaround): range predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 3 (workaround): IN predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
-- Query 4 (workaround): read_parquet with same equality predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id ┆ len │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36 │
│ 20000000-0000-0000-0000-000000… ┆ 36 │
└─────────────────────────────────┴─────┘
OS:
Ubuntu 24.04 x86_64
DuckDB Package Version:
1.4.2
Python Version:
3.11
Full Name:
Lasse Schier Christiansen
Affiliation:
TV2 Danmark
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration to reproduce the issue?
- Yes, I have
Reactions are currently unavailable