Skip to content

iceberg_scan: WHERE length(col) = <literal> returns no rows (regression in 1.4.2) #377

@lasse-schier

Description

@lasse-schier

What happens?

Filtering on a string column with WHERE length(col) = <integer> via iceberg_scan() returns 0 rows starting in DuckDB 1.4.2. The same query returns the correct results in DuckDB 1.4.1.

The bug does not affect read_parquet() on the same underlying files. It also does not affect equivalent predicates written as WHERE length(col) > N-1 AND length(col) < N+1 or WHERE length(col) IN (N) — both of those return the correct rows via iceberg_scan() in all versions.

To Reproduce

import duckdb
import pyarrow as pa
import pyiceberg
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

print(f"DuckDB version:    {duckdb.__version__}")
print(f"PyIceberg version: {pyiceberg.__version__}")
print(f"PyArrow version:   {pa.__version__}")

WAREHOUSE_PATH = "/tmp/iceberg_debug"
NAMESPACE = "ns"
TABLE_NAME = "items"

catalog = SqlCatalog(
    "local",
    uri="sqlite:///:memory:",
    warehouse=f"file://{WAREHOUSE_PATH}",
)
catalog.create_namespace_if_not_exists(NAMESPACE)

schema = Schema(NestedField(1, "item_id", StringType()))
full_table_name = f"{NAMESPACE}.{TABLE_NAME}"
if catalog.table_exists(full_table_name):
    catalog.drop_table(full_table_name)
iceberg_table = catalog.create_table(full_table_name, schema=schema)

# Two rows with length=36 (should match), one with 37 and one with 35 (should not match)
iceberg_table.append(pa.table({"item_id": [
    "10000000-0000-0000-0000-000000000001",   # length=36 ← should match
    "20000000-0000-0000-0000-000000000002",   # length=36 ← should match
    "30000000-0000-0000-0000-0000000000033",  # length=37 ← should NOT match
    "40000000-0000-0000-0000-00000000004",    # length=35 ← should NOT match
]}))

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg; SET unsafe_enable_version_guessing = true;")
SCAN = f"iceberg_scan('{iceberg_table.location()}')"

print("\n-- Query 1 (broken): equality predicate on length()")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) = 36").pl())

print("\n-- Query 2 (workaround): range predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) > 35 AND length(item_id) < 37").pl())

print("\n-- Query 3 (workaround): IN predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM {SCAN} WHERE length(item_id) IN (36)").pl())

parquet_glob = f"{WAREHOUSE_PATH}/{NAMESPACE}/{TABLE_NAME}/data/*.parquet"
print("\n-- Query 4 (workaround): read_parquet with same equality predicate")
print(con.execute(f"SELECT item_id, length(item_id) AS len FROM read_parquet('{parquet_glob}') WHERE length(item_id) = 36").pl())

catalog.purge_table(full_table_name)
catalog.drop_namespace(NAMESPACE)

Output for DuckDB 1.4.1:

-- Query 1 (broken): equality predicate on length()
shape: (2, 2)
┌──────────────────────────────────────┬─────┐
│ item_id                              ┆ len │
│ ---                                  ┆ --- │
│ str                                  ┆ i64 │
╞══════════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000000001 ┆ 36  │
│ 20000000-0000-0000-0000-000000000002 ┆ 36  │
└──────────────────────────────────────┴─────┘

-- Query 2 (workaround): range predicate
┌─────────────────────────────────┬─────┐
│ item_id                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36  │
│ 20000000-0000-0000-0000-000000… ┆ 36  │
└─────────────────────────────────┴─────┘

-- Query 3 (workaround): IN predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36  │
│ 20000000-0000-0000-0000-000000… ┆ 36  │
└─────────────────────────────────┴─────┘

-- Query 4 (workaround): read_parquet with same equality predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36  │
│ 20000000-0000-0000-0000-000000… ┆ 36  │
└─────────────────────────────────┴─────┘

Output for DuckDB 1.4.2+ (Tested for each version up to and including 1.5.0):

-- Query 1 (broken): equality predicate on length()
shape: (0, 2)
┌─────────┬─────┐
│ item_id ┆ len │
│ ---     ┆ --- │
│ str     ┆ i64 │
╞═════════╪═════╡
└─────────┴─────┘

-- Query 2 (workaround): range predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36  │
│ 20000000-0000-0000-0000-000000… ┆ 36  │
└─────────────────────────────────┴─────┘

-- Query 3 (workaround): IN predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36  │
│ 20000000-0000-0000-0000-000000… ┆ 36  │
└─────────────────────────────────┴─────┘

-- Query 4 (workaround): read_parquet with same equality predicate
shape: (2, 2)
┌─────────────────────────────────┬─────┐
│ item_id                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ i64 │
╞═════════════════════════════════╪═════╡
│ 10000000-0000-0000-0000-000000… ┆ 36  │
│ 20000000-0000-0000-0000-000000… ┆ 36  │
└─────────────────────────────────┴─────┘

OS:

Ubuntu 24.04 x86_64

DuckDB Package Version:

1.4.2

Python Version:

3.11

Full Name:

Lasse Schier Christiansen

Affiliation:

TV2 Danmark

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions