Align cudf indexing/setitem validation and alignment with pandas by galipremsagar · Pull Request #22912 · rapidsai/cudf

galipremsagar · 2026-06-17T00:46:15Z

Aligns cudf's classic indexing/setitem behavior with pandas (applied unconditionally — no mode.pandas_compatible gating), fixing a batch of frame/indexing/test_indexing.py failures under cudf.pandas:

Reject datetime64/timedelta64 (incl. NaT) assigned into a numeric column
Reject set/dict (or tuples nesting them) as .loc/.iloc indexers
Align boolean Series/DataFrame masks by label, and a Series RHS in loc setitem
Raise InvalidIndexError for a slice inside a df[tuple] getitem
Match int/float labels in reindex; keep a DataFrame for 1-level MultiIndex .loc; fix list-of-tuples MultiIndex .loc
Raise on .values for a masked extension dtype with nulls

Also drops the now-passing entries from the pandas-testing plugin and adds cudf unit tests comparing against pandas. Verified the full cudf unit suite (101k tests): 0 regressions.

copy-pr-bot · 2026-06-17T00:46:18Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-17T00:59:54Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44ed6827-4b29-4165-9116-74cd50c76fef

📥 Commits

Reviewing files that changed from the base of the PR and between d8f5f9c and 04c72c5.

📒 Files selected for processing (1)

python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py

💤 Files with no reviewable changes (1)

python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Prevent assigning temporal (datetime/timedelta, including NaT) scalars/arrays into numeric columns; now raises TypeError instead of silently casting.
- Reject dict/set indexers (including nested in tuples) with pandas-like TypeError guidance to use a list.
- Improved boolean-mask alignment by label, filling nulls as False, with stricter unalignable boolean indexer errors.
- Refined .loc/.iloc tuple and MultiIndex selection/assignment semantics; one-level MultiIndex .loc stays a DataFrame.
- DataFrame.__getitem__ rejects tuple keys containing inner slices.
- Series.values/single-column values now raise ValueError for nullable/masked dtypes with nulls.
- reindex treats numeric int/uint/float dtypes as compatible for label matching.
Tests
- Added regression tests covering the updated indexing, assignment, temporal/type errors, and exception parity with pandas (including operator-based cases).

Walkthrough

This PR improves cuDF's pandas compatibility across multiple indexing and assignment areas: temporal dtype values are rejected when assigned into numeric columns; boolean Series/DataFrame masks are reindexed by label instead of applied positionally; dict/set objects are rejected as indexers; MultiIndex .loc semantics are corrected for list-of-tuple keys and 1-level indices; numeric dtype reindex label matching is widened; and masked-dtype .values now raises instead of silently converting.

Changes

Pandas Compatibility Fixes

Layer / File(s)	Summary
Temporal dtype rejection in numeric column setitem `python/cudf/cudf/core/column/numerical.py`, `python/cudf/cudf/tests/dataframe/indexing/test_setitem.py`, `python/cudf/cudf/tests/series/indexing/test_setitem.py`	`NumericalColumn._cast_setitem_value` raises `TypeError` for temporal PyArrow scalars and temporal-kind columns assigned into numeric columns. Tests verify matching exception behavior with pandas for scalar NaT, datetime64, and timedelta64 values via `iloc` and `loc`.
Dict/set indexer rejection across getitem and setitem `python/cudf/cudf/core/indexing_utils.py`, `python/cudf/cudf/core/dataframe.py`, `python/cudf/cudf/tests/dataframe/indexing/test_loc.py`, `python/cudf/cudf/tests/dataframe/indexing/test_setitem.py`	New `check_dict_or_set_indexers` helper raises `TypeError` for set/dict (including nested in tuples) as indexing keys. `_DataFrameIndexer.__setitem__`, `_DataFrameLocIndexer.__getitem__`, and `_DataFrameIlocIndexer.__getitem__` call this validator. Tests verify pandas/cuDF exception parity.
Boolean Series label-alignment in indexing `python/cudf/cudf/core/indexing_utils.py`, `python/cudf/cudf/core/series.py`, `python/cudf/cudf/tests/dataframe/indexing/test_loc.py`, `python/cudf/cudf/tests/dataframe/indexing/test_setitem.py`	`parse_single_row_loc_key` and `_SeriesLocIndexer._loc_to_iloc` now reindex boolean Series indexers by label when index differs, raising `ValueError` for unalignable masks. Tests verify boolean Series masks are applied by label alignment, not positional application.
DataFrame loc getitem/setitem: MultiIndex tuples, slice rejection, RHS alignment, boolean masks `python/cudf/cudf/core/dataframe.py`, `python/cudf/cudf/core/multiindex.py`, `python/cudf/cudf/tests/dataframe/indexing/test_getitem.py`, `python/cudf/cudf/tests/dataframe/indexing/test_loc.py`, `python/cudf/cudf/tests/dataframe/indexing/test_setitem.py`	`DataFrame.__getitem__` raises `InvalidIndexError` for tuple keys with slices. `_getitem_tuple_arg` treats list-of-full-tuples as explicit MultiIndex keys. `_setitem_tuple_arg` distinguishes boolean masks from label keys for DataFrame RHS and aligns RHS by index. Series RHS in multi-column scatter is assigned whole for label alignment. Boolean DataFrame/Series masks in `__setitem__` are reindexed to `self.index`. `MultiIndex._index_and_downcast` skips downcasting for 1-level indices.
Numeric dtype reindex compatibility `python/cudf/cudf/core/indexed_frame.py`, `python/cudf/cudf/tests/dataframe/methods/test_reindex.py`	`_is_same_dtype` returns `True` when both dtype kinds are numeric (`i`/`u`/`f`), enabling reindex to match int/float/uint label equality. A parametrized regression test validates reindexing a float index with int or float labels.
SingleColumnFrame.values error for masked dtypes with nulls `python/cudf/cudf/core/single_column_frame.py`, `python/cudf/cudf/tests/series/test_attributes.py`	`SingleColumnFrame.values` imports `numpy` and raises `ValueError` when nulls are present and the dtype is a non-`np.dtype` nullable extension type, preventing silent float-NaN downcasting. Test verifies this raises for cuDF while pandas returns an `ExtensionArray`.
pandas-testing-plugin xfail map updates `python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py`	Removes ~70 xfail/skip entries to reflect now-fixed behaviors, including test_setitem_validation_scalar_int (temporal into numeric), test_loc_getitem_nested_indexer (set-typed indexers), test_multi_assign_broadcasting_rhs, test_loc_set_series_to_multiple_columns, test_iloc_ea_series_indexer_with_na, test_type_error_multiindex, and test_interpolate methods. Reorganizes DataFrame indexing test mapping.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

improvement

Suggested reviewers

rjzamora
Matt711
mroeschke

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main objective: aligning cudf indexing/setitem validation and alignment with pandas behavior across multiple aspects.
Description check	✅ Passed	The description is directly related to the changeset, providing context for the indexing/setitem alignment changes and listing the key behavioral modifications made.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/cudf/cudf/core/dataframe.py (1)
436-447: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Apply dict/set indexer validation to .iloc.__getitem__ too.

_DataFrameLocIndexer.__getitem__ now rejects dict/set keys, but _DataFrameIlocIndexer.__getitem__ still sends arg directly into iloc parsing, leaving .iloc[{...}]/.iloc[{...}, :] on the old path.

As per coding guidelines, cuDF Python reviews should flag pandas API compatibility regressions in public indexing behavior.
Suggested fix
 class _DataFrameIlocIndexer(_DataFrameIndexer):
@@
     def __getitem__(self, arg):
+        indexing_utils.check_dict_or_set_indexers(arg)
         (
             row_key,
             (
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf/cudf/core/dataframe.py` around lines 436 - 447, The __getitem__
method in _DataFrameIlocIndexer needs to validate that dict/set keys are
rejected, just like _DataFrameLocIndexer.__getitem__ already does. Before
calling indexing_utils.destructure_dataframe_iloc_indexer with the arg
parameter, add validation logic to check if arg is a dict or set and raise an
appropriate error to prevent these invalid index types from being processed
through the iloc path, ensuring consistent pandas API compatibility between .loc
and .iloc indexing behavior.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf/cudf/core/dataframe.py`:
- Around line 1643-1654: The fillna(False) and astype("bool") normalization for
boolean Series masks only runs in the conditional reindex branch, allowing
same-index nullable Series masks to reach as_column(arg) with null values still
present. Move the normalization steps (fillna(False) and astype("bool")) outside
of the conditional block so they apply to all Series masks regardless of whether
reindexing is needed. Restructure the logic so that after converting arg to a
cudf.Series, the fillna and type conversion are applied unconditionally, and
reindexing only occurs when the indexes do not match, ensuring all nullable
boolean Series masks are normalized before being converted to a column.
- Around line 349-365: The boolean Series row mask in `row_key_col` is not being
aligned with the target index before being used as `scatter_map`, which can
cause data to be written to incorrect rows. After reindexing `value` to align
with `target_index`, the `row_key_col` boolean Series must also be reindexed to
match the same `target_index` before it is assigned to `scatter_map`. This
ensures that the scatter operation writes values from the reindexed RHS
dataframe to the correct rows in the target dataframe, preventing silent data
corruption from misaligned boolean masks.
- Around line 405-410: The issue is that when key[0] represents a single row
(scalar key), assigning the entire Series to self._frame[col].loc[key[0]] treats
the Series object as a single cell value instead of extracting the appropriate
row value from the Series. To fix this, check whether key[0] is a scalar that
selects a single row: if it is, extract the scalar value from the Series at that
index position (using value.loc[key[0]] or value[key[0]]) and assign that
extracted value instead of the entire Series object to
self._frame[col].loc[key[0]]. This ensures proper pandas API compatibility and
correct row-based assignment semantics.

In `@python/cudf/cudf/core/indexing_utils.py`:
- Around line 685-699: The validation logic in the boolean Series indexer block
around line 693 incorrectly uses has_nulls() on the aligned column to detect
unalignable masks, which treats pre-existing nulls as errors when they should be
allowed. Instead of checking if the aligned column has any nulls, you need to
distinguish between pre-existing nulls in the original key Series and nulls
introduced by the reindexing process. Modify the check to only raise the error
if new nulls were introduced during reindexing (indicating missing labels in the
original index), while allowing pre-existing nulls from the original Series to
pass through. Compare the null patterns before and after the reindex call to
identify only the newly introduced nulls.

In `@python/cudf/cudf/core/series.py`:
- Around line 383-397: The issue is that the null check at line 391 using
`aligned._column.has_nulls()` cannot distinguish between pre-existing nulls in
the original `arg` Series and new nulls introduced by the reindex operation due
to missing labels. To fix this, compare the null status before and after
alignment: check if `arg._column.has_nulls()` is already true, and only raise
the "Unalignable boolean Series" error if the alignment operation introduced new
nulls that weren't present in the original `arg` Series. This ensures the
validation only catches cases where the alignment actually created mismatches,
not cases where the user intentionally provided a boolean Series with null
values.

---

Outside diff comments:
In `@python/cudf/cudf/core/dataframe.py`:
- Around line 436-447: The __getitem__ method in _DataFrameIlocIndexer needs to
validate that dict/set keys are rejected, just like
_DataFrameLocIndexer.__getitem__ already does. Before calling
indexing_utils.destructure_dataframe_iloc_indexer with the arg parameter, add
validation logic to check if arg is a dict or set and raise an appropriate error
to prevent these invalid index types from being processed through the iloc path,
ensuring consistent pandas API compatibility between .loc and .iloc indexing
behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4dc99a95-5f9e-4273-b0c6-98af31f3535b

📥 Commits

Reviewing files that changed from the base of the PR and between fe332f5 and 2a7fa96.

📒 Files selected for processing (14)

python/cudf/cudf/core/column/numerical.py
python/cudf/cudf/core/dataframe.py
python/cudf/cudf/core/indexed_frame.py
python/cudf/cudf/core/indexing_utils.py
python/cudf/cudf/core/multiindex.py
python/cudf/cudf/core/series.py
python/cudf/cudf/core/single_column_frame.py
python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
python/cudf/cudf/tests/dataframe/indexing/test_getitem.py
python/cudf/cudf/tests/dataframe/indexing/test_loc.py
python/cudf/cudf/tests/dataframe/indexing/test_setitem.py
python/cudf/cudf/tests/dataframe/methods/test_reindex.py
python/cudf/cudf/tests/series/indexing/test_setitem.py
python/cudf/cudf/tests/series/test_attributes.py

💤 Files with no reviewable changes (1)

python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py

fix

526c16c

github-actions Bot assigned galipremsagar Jun 17, 2026

github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Jun 17, 2026

github-project-automation Bot added this to cuDF Python Jun 17, 2026

galipremsagar changed the title ~~fix~~ Align cudf indexing/setitem validation and alignment with pandas Jun 17, 2026

GPUtester moved this to In Progress in cuDF Python Jun 17, 2026

Merge branch 'main' into raise_setitem

2a7fa96

galipremsagar added bug Something isn't working breaking Breaking change labels Jun 17, 2026

galipremsagar marked this pull request as ready for review June 17, 2026 00:49

galipremsagar requested a review from a team as a code owner June 17, 2026 00:49

galipremsagar requested review from TomAugspurger and brandon-b-miller June 17, 2026 00:49

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread python/cudf/cudf/core/dataframe.py

Comment thread python/cudf/cudf/core/dataframe.py Outdated

Comment thread python/cudf/cudf/core/dataframe.py Outdated

Comment thread python/cudf/cudf/core/indexing_utils.py Outdated

Comment thread python/cudf/cudf/core/series.py

galipremsagar added 5 commits June 16, 2026 21:27

Merge branch 'main' into raise_setitem

3c2f8aa

Merge remote-tracking branch 'upstream/main' into raise_setitem

520e65b

address reviews

4f9c23d

Use canonical copyright notice to fix style check

d8f5f9c

Update SPDX copyright lines to the canonical 'NVIDIA CORPORATION & AFFILIATES. All rights reserved.' form expected by the verify-copyright pre-commit hook.

fix

04c72c5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Align cudf indexing/setitem validation and alignment with pandas#22912

Align cudf indexing/setitem validation and alignment with pandas#22912
galipremsagar wants to merge 7 commits into
rapidsai:mainfrom
galipremsagar:raise_setitem

galipremsagar commented Jun 17, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

galipremsagar commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

galipremsagar commented Jun 17, 2026 •

edited

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading