Align cudf indexing/setitem validation and alignment with pandas#22912
Align cudf indexing/setitem validation and alignment with pandas#22912galipremsagar wants to merge 7 commits into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR improves cuDF's pandas compatibility across multiple indexing and assignment areas: temporal dtype values are rejected when assigned into numeric columns; boolean Series/DataFrame masks are reindexed by label instead of applied positionally; dict/set objects are rejected as indexers; MultiIndex ChangesPandas Compatibility Fixes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cudf/cudf/core/dataframe.py (1)
436-447:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHIGH: Apply dict/set indexer validation to
.iloc.__getitem__too.
_DataFrameLocIndexer.__getitem__now rejects dict/set keys, but_DataFrameIlocIndexer.__getitem__still sendsargdirectly into iloc parsing, leaving.iloc[{...}]/.iloc[{...}, :]on the old path.As per coding guidelines, cuDF Python reviews should flag pandas API compatibility regressions in public indexing behavior.
Suggested fix
class _DataFrameIlocIndexer(_DataFrameIndexer): @@ def __getitem__(self, arg): + indexing_utils.check_dict_or_set_indexers(arg) ( row_key, (🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf/cudf/core/dataframe.py` around lines 436 - 447, The __getitem__ method in _DataFrameIlocIndexer needs to validate that dict/set keys are rejected, just like _DataFrameLocIndexer.__getitem__ already does. Before calling indexing_utils.destructure_dataframe_iloc_indexer with the arg parameter, add validation logic to check if arg is a dict or set and raise an appropriate error to prevent these invalid index types from being processed through the iloc path, ensuring consistent pandas API compatibility between .loc and .iloc indexing behavior.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cudf/cudf/core/dataframe.py`:
- Around line 1643-1654: The fillna(False) and astype("bool") normalization for
boolean Series masks only runs in the conditional reindex branch, allowing
same-index nullable Series masks to reach as_column(arg) with null values still
present. Move the normalization steps (fillna(False) and astype("bool")) outside
of the conditional block so they apply to all Series masks regardless of whether
reindexing is needed. Restructure the logic so that after converting arg to a
cudf.Series, the fillna and type conversion are applied unconditionally, and
reindexing only occurs when the indexes do not match, ensuring all nullable
boolean Series masks are normalized before being converted to a column.
- Around line 349-365: The boolean Series row mask in `row_key_col` is not being
aligned with the target index before being used as `scatter_map`, which can
cause data to be written to incorrect rows. After reindexing `value` to align
with `target_index`, the `row_key_col` boolean Series must also be reindexed to
match the same `target_index` before it is assigned to `scatter_map`. This
ensures that the scatter operation writes values from the reindexed RHS
dataframe to the correct rows in the target dataframe, preventing silent data
corruption from misaligned boolean masks.
- Around line 405-410: The issue is that when key[0] represents a single row
(scalar key), assigning the entire Series to self._frame[col].loc[key[0]] treats
the Series object as a single cell value instead of extracting the appropriate
row value from the Series. To fix this, check whether key[0] is a scalar that
selects a single row: if it is, extract the scalar value from the Series at that
index position (using value.loc[key[0]] or value[key[0]]) and assign that
extracted value instead of the entire Series object to
self._frame[col].loc[key[0]]. This ensures proper pandas API compatibility and
correct row-based assignment semantics.
In `@python/cudf/cudf/core/indexing_utils.py`:
- Around line 685-699: The validation logic in the boolean Series indexer block
around line 693 incorrectly uses has_nulls() on the aligned column to detect
unalignable masks, which treats pre-existing nulls as errors when they should be
allowed. Instead of checking if the aligned column has any nulls, you need to
distinguish between pre-existing nulls in the original key Series and nulls
introduced by the reindexing process. Modify the check to only raise the error
if new nulls were introduced during reindexing (indicating missing labels in the
original index), while allowing pre-existing nulls from the original Series to
pass through. Compare the null patterns before and after the reindex call to
identify only the newly introduced nulls.
In `@python/cudf/cudf/core/series.py`:
- Around line 383-397: The issue is that the null check at line 391 using
`aligned._column.has_nulls()` cannot distinguish between pre-existing nulls in
the original `arg` Series and new nulls introduced by the reindex operation due
to missing labels. To fix this, compare the null status before and after
alignment: check if `arg._column.has_nulls()` is already true, and only raise
the "Unalignable boolean Series" error if the alignment operation introduced new
nulls that weren't present in the original `arg` Series. This ensures the
validation only catches cases where the alignment actually created mismatches,
not cases where the user intentionally provided a boolean Series with null
values.
---
Outside diff comments:
In `@python/cudf/cudf/core/dataframe.py`:
- Around line 436-447: The __getitem__ method in _DataFrameIlocIndexer needs to
validate that dict/set keys are rejected, just like
_DataFrameLocIndexer.__getitem__ already does. Before calling
indexing_utils.destructure_dataframe_iloc_indexer with the arg parameter, add
validation logic to check if arg is a dict or set and raise an appropriate error
to prevent these invalid index types from being processed through the iloc path,
ensuring consistent pandas API compatibility between .loc and .iloc indexing
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4dc99a95-5f9e-4273-b0c6-98af31f3535b
📒 Files selected for processing (14)
python/cudf/cudf/core/column/numerical.pypython/cudf/cudf/core/dataframe.pypython/cudf/cudf/core/indexed_frame.pypython/cudf/cudf/core/indexing_utils.pypython/cudf/cudf/core/multiindex.pypython/cudf/cudf/core/series.pypython/cudf/cudf/core/single_column_frame.pypython/cudf/cudf/pandas/scripts/pandas-testing-plugin.pypython/cudf/cudf/tests/dataframe/indexing/test_getitem.pypython/cudf/cudf/tests/dataframe/indexing/test_loc.pypython/cudf/cudf/tests/dataframe/indexing/test_setitem.pypython/cudf/cudf/tests/dataframe/methods/test_reindex.pypython/cudf/cudf/tests/series/indexing/test_setitem.pypython/cudf/cudf/tests/series/test_attributes.py
💤 Files with no reviewable changes (1)
- python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
Update SPDX copyright lines to the canonical 'NVIDIA CORPORATION & AFFILIATES. All rights reserved.' form expected by the verify-copyright pre-commit hook.
Aligns cudf's classic indexing/setitem behavior with pandas (applied unconditionally — no
mode.pandas_compatiblegating), fixing a batch offrame/indexing/test_indexing.pyfailures undercudf.pandas:datetime64/timedelta64(incl.NaT) assigned into a numeric columnset/dict(or tuples nesting them) as.loc/.ilocindexersSeries/DataFramemasks by label, and aSeriesRHS inlocsetitemInvalidIndexErrorfor a slice inside adf[tuple]getitemreindex; keep aDataFramefor 1-level MultiIndex.loc; fix list-of-tuples MultiIndex.loc.valuesfor a masked extension dtype with nullsAlso drops the now-passing entries from the pandas-testing plugin and adds cudf unit tests comparing against pandas. Verified the full cudf unit suite (101k tests): 0 regressions.