@boneanxs

Rationale for this change

WriteArrowSerialize could unconditionally read values from the Arrow array, even for null rows. Because a caller may provide a zero-sized dummy buffer for an all-null array, this caused an ASAN heap-buffer-overflow.

What changes are included in this PR?

Check early whether the array consists entirely of null values, and skip reading its value buffer in that case, before serializing it.
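
The guard described above can be sketched in isolation. This is a minimal stand-alone model, not the actual WriteArrowSerialize code: `ColumnView`, `IsValid`, and `SerializeInt8ToInt32` are hypothetical names, plain pointers stand in for Arrow buffers, and the model assumes only non-null rows need to be read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an Arrow array: a validity bitmap, a value
// pointer that may be nullptr when every row is null, and row counts.
struct ColumnView {
  const uint8_t* validity;  // bit i set => row i is non-null
  const int8_t* values;     // may be nullptr if null_count == length
  int64_t length;
  int64_t null_count;
};

// Reads bit i of the validity bitmap (least-significant bit first).
inline bool IsValid(const ColumnView& col, int64_t i) {
  return ((col.validity[i / 8] >> (i % 8)) & 1) != 0;
}

// Widens int8 values to int32, emitting only the non-null rows.
// Returns early for all-null input so `values` is never dereferenced.
std::vector<int32_t> SerializeInt8ToInt32(const ColumnView& col) {
  std::vector<int32_t> out;
  if (col.null_count == col.length) {
    return out;  // nothing to read; the value buffer may be missing
  }
  assert(col.values != nullptr);
  out.reserve(static_cast<std::size_t>(col.length - col.null_count));
  for (int64_t i = 0; i < col.length; ++i) {
    if (IsValid(col, i)) {
      out.push_back(static_cast<int32_t>(col.values[i]));
    }
  }
  return out;
}
```

With this shape, an all-null column whose value pointer is nullptr (or points at a zero-sized allocation) never reaches the read loop, which is the property the fix is after.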

Are these changes tested?

Added tests.

Are there any user-facing changes?

No

@boneanxs boneanxs requested a review from wgtmac as a code owner December 30, 2025 04:25
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@boneanxs
Author

boneanxs commented Jan 5, 2026

@wgtmac Hi, could you please help review this, thanks!

// Set all bits to 0 (null)
::arrow::bit_util::SetBitsTo(null_bitmap->mutable_data(), 0, 100, false);

std::shared_ptr<::arrow::Buffer> data_buffer = nullptr;
Member

Please correct me if I am wrong. I think the Arrow spec is vague on whether the value buffer can be null if all values are null. A null buffer also escapes the Array::Validate check, as in:

if (buffer == nullptr) {
  continue;
}

If this violates the spec, would it be better to fix Array::Validate() and call it before calling functor.Serialize()?

Member

@wgtmac wgtmac Jan 7, 2026

cc @pitrou, as this is somewhat related to #48560, though we don't have a fuzz writer yet.

Author

Arrow spec is vague on whether the value buffer can be null

Yes, that confuses me too. I think we don't support a null value buffer, but we do accept an empty value buffer if the batch is all nulls? It seems we're avoiding null value buffers: https://github.com/apache/arrow/pull/2243/changes

Member

IIRC it's deliberate that a null buffer pointer is accepted there. I would rather not have this but it could break compatibility with existing usage.

In any case, feel free to open a separate issue about it.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jan 7, 2026
Comment on lines 2426 to 2428
ASSERT_OK_AND_ASSIGN(null_bitmap, ::arrow::AllocateBitmap(100));
// Set all bits to 0 (null)
::arrow::bit_util::SetBitsTo(null_bitmap->mutable_data(), 0, 100, false);
Member

You can just use AllocateEmptyBitmap which will zero-initialize the bitmap.
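
To illustrate why the explicit SetBitsTo call is redundant: in an Arrow validity bitmap a cleared bit marks a null row, so a zero-filled bitmap already describes an all-null array. The sketch below models this with a plain byte vector. `MakeZeroedBitmap` and `CountSetBits` are hypothetical helpers written for this example, not Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: allocates ceil(length / 8) bytes, all zero.
// It models a zero-initialized bitmap; it is not an Arrow function.
std::vector<uint8_t> MakeZeroedBitmap(int64_t length) {
  return std::vector<uint8_t>(static_cast<std::size_t>((length + 7) / 8), 0);
}

// Counts the set (valid) bits among the first `length` bits,
// reading each byte least-significant bit first.
int64_t CountSetBits(const std::vector<uint8_t>& bitmap, int64_t length) {
  int64_t count = 0;
  for (int64_t i = 0; i < length; ++i) {
    count += (bitmap[static_cast<std::size_t>(i / 8)] >> (i % 8)) & 1;
  }
  return count;
}
```

For the 100-row test above, such a bitmap occupies 13 bytes and has no set bits, so every row reads as null without any further initialization.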

Author

done

@boneanxs boneanxs force-pushed the fix_empty_data_buffer branch from c5eff39 to 684b7b7 Compare January 21, 2026 06:08
@boneanxs boneanxs requested review from pitrou and wgtmac January 21, 2026 06:08
TEST_F(TestArrowWriteSerialize, AllNulls) {
std::shared_ptr<::arrow::Buffer> null_bitmap, data_buffer;
ASSERT_OK_AND_ASSIGN(null_bitmap, ::arrow::AllocateEmptyBitmap(100));
ASSERT_OK_AND_ASSIGN(data_buffer, ::arrow::AllocateBuffer(0));
Member

I don't think this is ok. The data buffer should either remain unallocated (i.e. nullptr) or have the right length.

Author

Hi @pitrou, please CMIIW, so for an all-null batch, the data buffer should either be nullptr (accepted for compatibility, even if not ideal) or have the correct full length. The empty buffer I used in the test is technically invalid.

I will update the test to use nullptr.

Member

Hi @pitrou, please CMIIW, so for an all-null batch, the data buffer should either be nullptr (accepted for compatibility, even if not ideal) or have the correct full length. The empty buffer I used in the test is technically invalid.

Yes, and if you call Validate on the array, you should get an error.
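
The rule agreed on in this thread can be written down as a small check. This is a sketch under stated assumptions, not the logic of Array::Validate: `BufferView`, `BufferCheck`, and `CheckValueBuffer` are hypothetical names, a missing buffer is modeled as a null pointer, and values are assumed to be fixed-width.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of an allocated buffer: only its size matters here.
struct BufferView {
  int64_t size_bytes;
};

enum class BufferCheck { kOk, kTooSmall };

// Models the rule discussed above for a fixed-width value buffer:
//  - a missing buffer (nullptr) is tolerated for compatibility;
//  - an allocated buffer must hold at least length * byte_width bytes.
BufferCheck CheckValueBuffer(const BufferView* buffer, int64_t length,
                             int64_t byte_width) {
  if (buffer == nullptr) {
    return BufferCheck::kOk;  // skipped, mirroring the `continue` quoted earlier
  }
  if (buffer->size_bytes < length * byte_width) {
    return BufferCheck::kTooSmall;  // e.g. a 0-byte buffer for 100 rows
  }
  return BufferCheck::kOk;
}
```

Under this model, the zero-sized buffer from the earlier version of the test is rejected for a 100-row array, while a nullptr buffer passes.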

Member

@wgtmac wgtmac left a comment

The code change looks good to me. I have just one comment: the test case should be made as simple as possible.

}
}

class TestArrowWriteSerialize : public ::testing::Test {
Member

I'm not sure this is the right place to add this test. It seems that column_writer_test.cc does not deal with writing Arrow arrays. Would it be better to move it to arrow_reader_writer_test.cc and manually create an Arrow array with all null values and a null value buffer? With that approach, we don't need to deal with boilerplate like the ColumnChunkMetaDataBuilder, PageWriter, and ColumnWriter below. BTW, we can also use TEST instead of TEST_F with a class definition, because we only have one case here.

Member

@wgtmac wgtmac Jan 23, 2026

I wrote a simple test in arrow_reader_writer_test.cc:

TEST(TestArrowReadWrite, AllNulls) {
  auto schema = ::arrow::schema({::arrow::field("all_nulls", ::arrow::int8())});

  constexpr int64_t length = 3;
  ASSERT_OK_AND_ASSIGN(auto null_bitmap, ::arrow::AllocateEmptyBitmap(length));
  auto array_data = ::arrow::ArrayData::Make(
      ::arrow::int8(), length, {null_bitmap, /*values=*/nullptr}, /*null_count=*/length);
  auto array = ::arrow::MakeArray(array_data);
  auto record_batch = ::arrow::RecordBatch::Make(schema, length, {array});

  auto sink = CreateOutputStream();
  ASSERT_OK_AND_ASSIGN(auto writer, parquet::arrow::FileWriter::Open(
                                        *schema, ::arrow::default_memory_pool(), sink,
                                        parquet::default_writer_properties(),
                                        parquet::default_arrow_writer_properties()));
  ASSERT_OK(writer->WriteRecordBatch(*record_batch));
  ASSERT_OK(writer->Close());
  ASSERT_OK_AND_ASSIGN(auto buffer, sink->Finish());

  std::shared_ptr<::arrow::Table> read_table;
  ASSERT_OK_AND_ASSIGN(auto reader,
                       parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer),
                                                ::arrow::default_memory_pool()));
  ASSERT_OK(reader->ReadTable(&read_table));
  auto expected_table = ::arrow::Table::Make(
      schema, {::arrow::ArrayFromJSON(::arrow::int8(), R"([null, null, null])")});
  ASSERT_TRUE(expected_table->Equals(*read_table));
}
