-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: fix DataFrame.__setitem__ with 2D object arrays
#63184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
can I get some eyes on this when you get a chance @rhshadrach 🙌 |
rhshadrach
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
| arr = value | ||
|
|
||
| # np.matrix is always 2D; gonna convert to regular ndarray | ||
| if isinstance(arr, np.matrix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case do we get a matrix here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_sanitize_column(...) can see an np.matrix when the user assigns one directly. for example: df["col"] = np.matrix([[1], [2], [3]]).
Since, np.matrix is always 2D and preserves its 2D shape under the slicing operation, calling arr[:, 0] (which occurs on line 5517) on a matrix still gives the shape (n, 1) rather than (n,). Essentially, this would mean that we wouldn't actually end up producing a 1D array for matrices in that case.
Hence, I thought converting matrics to a regular ndarray first will ensure that the upcoming blocks behave consistently for both np.ndarray and np.matrix.
| elif arr.dtype == object: | ||
| # single-column setitem with a 2D object array is not allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only object dtype here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dtype == object guard is there to keep this bugfix scoped tightly to the case that regressed in issue #61026.
The problematic behaviour (ValueError: Buffer has wrong number of dimensions (expected 1, got 2)) only arose when assigning a 2D dtype=object array to a single column. For other dtypes, assigning a 2D array either already behaves correctly or raises a clearer, existing error, so this change leaves those paths alone to avoid altering semantics outside this issue.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.AGENTS.md.Fixed
DataFrame.__setitem__so that assigning a 2D NumPy array withdtype=objectand shape(n, 1)to a single column works the same way as the non-object case, and raise clearer, high-level errors for unsupported shapes. More detail below:Before this change:
dtype=objectarray with shape(n, 1)to a single DataFrame column (e.g.,df["c1"] = t2) raised a low-levelValueError: Buffer has wrong number of dimensions (expected 1, got 2). This was coming fromlib.maybe_convert_objects, instead of behaving like the non-object case.(n, 1)already worked just fine, and assigning a 2D array with multiple columns to multiple columns (e.g.,df[["c1", "c2"]] = t3) also worked, butndim > 2arrays could surface confusing internal errors.After this change:
dtype=objectarray with shape(n, 1)to a single column now works by flattening(n, 1)to a 1D(n,)array, matching the behaviour of non-object arrays.ValueErrorexplaining that only(n, 1)is supported and suggesting multi-column assignment (e.g.,df[["c1", "c2"]] = some_values) for wider arrays.ndim >= 3to a single column is now raises an explicitValueErrorindicating that setting a column with that spec is not supported. The existing multi-column assignment with 2D arrays remains unchanged.