[GH-48691][C++] Write serializer could be crash if the value buffer is empty #48692
@wgtmac Hi, could you please help review this, thanks!
// Set all bits to 0 (null)
::arrow::bit_util::SetBitsTo(null_bitmap->mutable_data(), 0, 100, false);

std::shared_ptr<::arrow::Buffer> data_buffer = nullptr;
Please correct me if I'm wrong. I think the Arrow spec is vague on whether the value buffer can be null when all values are null. A null value buffer also escapes the Array::Validate check, as in
arrow/cpp/src/arrow/array/validate.cc
Lines 505 to 507 in abbcd53
if (buffer == nullptr) {
  continue;
}
If this violates the spec, would it be better to fix Array::Validate() and call it before calling functor.Serialize()?
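To make the concern concrete, here is a minimal standard-library sketch of a size check that skips absent buffers, in the spirit of the `continue` quoted above. It is not the code in validate.cc: `Buffer` and `BuffersLargeEnough` are hypothetical names, and the required sizes are supplied by the caller purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a buffer: just a vector of bytes.
using Buffer = std::vector<uint8_t>;

// Returns true if every buffer that is present holds at least the required
// number of bytes. A nullptr entry is skipped, so it is never size-checked.
bool BuffersLargeEnough(const std::vector<std::shared_ptr<Buffer>>& buffers,
                        const std::vector<std::size_t>& min_sizes) {
  assert(buffers.size() == min_sizes.size());
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto& buffer = buffers[i];
    if (buffer == nullptr) {
      continue;  // absent buffer: nothing to compare against min_sizes[i]
    }
    if (buffer->size() < min_sizes[i]) {
      return false;  // present but too small
    }
  }
  return true;
}
```

Under these assumptions, a 100-slot array with a 13-byte bitmap and a null value buffer passes the check, while the same array with a present but empty value buffer fails it.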
Arrow spec is vague on whether the value buffer can be null
Yes, that confuses me too. I think we don't support a null value buffer but do accept an empty value buffer when the batch is all nulls? It seems we're avoiding null value buffers: https://github.com/apache/arrow/pull/2243/changes
IIRC it's deliberate that a null buffer pointer is accepted there. I would rather not have this but it could break compatibility with existing usage.
In any case, feel free to open a separate issue about it.
ASSERT_OK_AND_ASSIGN(null_bitmap, ::arrow::AllocateBitmap(100));
// Set all bits to 0 (null)
::arrow::bit_util::SetBitsTo(null_bitmap->mutable_data(), 0, 100, false);
You can just use AllocateEmptyBitmap which will zero-initialize the bitmap.
done
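As a rough illustration of why a zero-initialized bitmap is enough here, the sketch below models a validity bitmap with a plain byte vector. It assumes least-significant-bit numbering with 1 meaning valid and 0 meaning null. `MakeZeroedBitmap`, `IsValid` and `CountNulls` are hypothetical helpers, not Arrow functions, and real Arrow allocations may be padded beyond the 13 bytes computed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a zero-filled bitmap with one bit per slot, rounded up to whole bytes.
std::vector<uint8_t> MakeZeroedBitmap(std::size_t length) {
  return std::vector<uint8_t>((length + 7) / 8, 0);
}

// Reads bit i with least-significant-bit numbering: 1 is valid, 0 is null.
bool IsValid(const std::vector<uint8_t>& bitmap, std::size_t i) {
  assert(i / 8 < bitmap.size());
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Counts the null slots among the first `length` bits.
std::size_t CountNulls(const std::vector<uint8_t>& bitmap, std::size_t length) {
  std::size_t nulls = 0;
  for (std::size_t i = 0; i < length; ++i) {
    if (!IsValid(bitmap, i)) {
      ++nulls;
    }
  }
  return nulls;
}
```

With every byte zero, all 100 slots count as null, so no separate pass that clears the bits is needed.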
c5eff39 to 684b7b7
TEST_F(TestArrowWriteSerialize, AllNulls) {
  std::shared_ptr<::arrow::Buffer> null_bitmap, data_buffer;
  ASSERT_OK_AND_ASSIGN(null_bitmap, ::arrow::AllocateEmptyBitmap(100));
  ASSERT_OK_AND_ASSIGN(data_buffer, ::arrow::AllocateBuffer(0));
I don't think this is ok. The data buffer should either remain unallocated (i.e. nullptr) or have the right length.
Hi @pitrou, please CMIIW, so for an all-null batch, the data buffer should either be nullptr (accepted for compatibility, even if not ideal) or have the correct full length. The empty buffer I used in the test is technically invalid.
I will update the test to use nullptr.
Hi @pitrou, please CMIIW, so for an all-null batch, the data buffer should either be nullptr (accepted for compatibility, even if not ideal) or have the correct full length. The empty buffer I used in the test is technically invalid.
Yes, and if you call Validate on the array, you should get an error.
wgtmac left a comment
The code change looks good to me. I have just one comment: make the test case as simple as possible.
  }
}

class TestArrowWriteSerialize : public ::testing::Test {
I'm not sure this is the right place for this test. It seems that column_writer_test.cc does not deal with writing Arrow arrays. Would it be better to move it to arrow_reader_writer_test.cc and manually create an Arrow array with all null values and a null value buffer? With that approach, we don't need the boilerplate below, such as ColumnChunkMetaDataBuilder, PageWriter and ColumnWriter. BTW, we can also use TEST instead of TEST_F with a class definition, because we only have one case here.
I wrote a simple test in arrow_reader_writer_test.cc:
TEST(TestArrowReadWrite, AllNulls) {
  auto schema = ::arrow::schema({::arrow::field("all_nulls", ::arrow::int8())});

  constexpr int64_t length = 3;
  ASSERT_OK_AND_ASSIGN(auto null_bitmap, ::arrow::AllocateEmptyBitmap(length));
  auto array_data = ::arrow::ArrayData::Make(
      ::arrow::int8(), length, {null_bitmap, /*values=*/nullptr}, /*null_count=*/length);
  auto array = ::arrow::MakeArray(array_data);
  auto record_batch = ::arrow::RecordBatch::Make(schema, length, {array});

  auto sink = CreateOutputStream();
  ASSERT_OK_AND_ASSIGN(auto writer, parquet::arrow::FileWriter::Open(
                                        *schema, ::arrow::default_memory_pool(), sink,
                                        parquet::default_writer_properties(),
                                        parquet::default_arrow_writer_properties()));
  ASSERT_OK(writer->WriteRecordBatch(*record_batch));
  ASSERT_OK(writer->Close());
  ASSERT_OK_AND_ASSIGN(auto buffer, sink->Finish());

  std::shared_ptr<::arrow::Table> read_table;
  ASSERT_OK_AND_ASSIGN(auto reader,
                       parquet::arrow::OpenFile(std::make_shared<BufferReader>(buffer),
                                                ::arrow::default_memory_pool()));
  ASSERT_OK(reader->ReadTable(&read_table));
  auto expected_table = ::arrow::Table::Make(
      schema, {::arrow::ArrayFromJSON(::arrow::int8(), R"([null, null, null])")});
  ASSERT_TRUE(expected_table->Equals(*read_table));
}
Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Because a caller can provide a zero-sized dummy buffer for an all-null array, this caused an ASAN heap-buffer-overflow.
What changes are included in this PR?
Check early that the array is not all nulls before serializing it.
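The shape of such an early check can be sketched with the standard library alone. This is not the PR's actual diff: `Int8ArrayView` and `WidenToInt32` are hypothetical names, and widening int8 to int32 only stands in for a serialize step that reads the value buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a nullable int8 array (not an Arrow type).
struct Int8ArrayView {
  const int8_t* values;    // may be nullptr when every slot is null
  std::size_t length;      // number of slots
  std::size_t null_count;  // number of null slots
};

// Widens each int8 value to int32. Null slots keep whatever the value buffer
// holds; the caller is assumed to skip them when writing.
std::vector<int32_t> WidenToInt32(const Int8ArrayView& array) {
  assert(array.null_count <= array.length);
  std::vector<int32_t> out(array.length, 0);
  if (array.null_count == array.length) {
    return out;  // all null: the value buffer is never dereferenced
  }
  for (std::size_t i = 0; i < array.length; ++i) {
    out[i] = array.values[i];
  }
  return out;
}
```

In this sketch the guard comes before the loop, so an all-null array with a null or zero-sized value buffer never triggers a read, while arrays that contain at least one valid slot still need a value buffer of the full length.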
Are these changes tested?
Added tests.
Are there any user-facing changes?
No