Skip to content

Align cudf indexing/setitem validation and alignment with pandas#22912

Open
galipremsagar wants to merge 7 commits into
rapidsai:mainfrom
galipremsagar:raise_setitem
Open

Align cudf indexing/setitem validation and alignment with pandas#22912
galipremsagar wants to merge 7 commits into
rapidsai:mainfrom
galipremsagar:raise_setitem

Conversation

@galipremsagar

@galipremsagar galipremsagar commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Aligns cudf's classic indexing/setitem behavior with pandas (applied unconditionally — no mode.pandas_compatible gating), fixing a batch of frame/indexing/test_indexing.py failures under cudf.pandas:

  • Reject datetime64/timedelta64 (incl. NaT) assigned into a numeric column
  • Reject set/dict (or tuples nesting them) as .loc/.iloc indexers
  • Align boolean Series/DataFrame masks by label, and a Series RHS in loc setitem
  • Raise InvalidIndexError for a slice inside a df[tuple] getitem
  • Match int/float labels in reindex; keep a DataFrame for 1-level MultiIndex .loc; fix list-of-tuples MultiIndex .loc
  • Raise on .values for a masked extension dtype with nulls

Also drops the now-passing entries from the pandas-testing plugin and adds cudf unit tests comparing against pandas. Verified the full cudf unit suite (101k tests): 0 regressions.

@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Jun 17, 2026
@galipremsagar galipremsagar changed the title fix Align cudf indexing/setitem validation and alignment with pandas Jun 17, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 17, 2026
@galipremsagar galipremsagar added bug Something isn't working breaking Breaking change labels Jun 17, 2026
@galipremsagar galipremsagar marked this pull request as ready for review June 17, 2026 00:49
@galipremsagar galipremsagar requested a review from a team as a code owner June 17, 2026 00:49
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44ed6827-4b29-4165-9116-74cd50c76fef

📥 Commits

Reviewing files that changed from the base of the PR and between d8f5f9c and 04c72c5.

📒 Files selected for processing (1)
  • python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
💤 Files with no reviewable changes (1)
  • python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Prevent assigning temporal (datetime/timedelta, including NaT) scalars/arrays into numeric columns; now raises TypeError instead of silently casting.
    • Reject dict/set indexers (including nested in tuples) with pandas-like TypeError guidance to use a list.
    • Improved boolean-mask alignment by label, filling nulls as False, with stricter unalignable boolean indexer errors.
    • Refined .loc/.iloc tuple and MultiIndex selection/assignment semantics; one-level MultiIndex .loc stays a DataFrame.
    • DataFrame.__getitem__ rejects tuple keys containing inner slices.
    • Series.values/single-column values now raise ValueError for nullable/masked dtypes with nulls.
    • reindex treats numeric int/uint/float dtypes as compatible for label matching.
  • Tests

    • Added regression tests covering the updated indexing, assignment, temporal/type errors, and exception parity with pandas (including operator-based cases).

Walkthrough

This PR improves cuDF's pandas compatibility across multiple indexing and assignment areas: temporal dtype values are rejected when assigned into numeric columns; boolean Series/DataFrame masks are reindexed by label instead of applied positionally; dict/set objects are rejected as indexers; MultiIndex .loc semantics are corrected for list-of-tuple keys and 1-level indices; numeric dtype reindex label matching is widened; and masked-dtype .values now raises instead of silently converting.

Changes

Pandas Compatibility Fixes

Layer / File(s) Summary
Temporal dtype rejection in numeric column setitem
python/cudf/cudf/core/column/numerical.py, python/cudf/cudf/tests/dataframe/indexing/test_setitem.py, python/cudf/cudf/tests/series/indexing/test_setitem.py
NumericalColumn._cast_setitem_value raises TypeError for temporal PyArrow scalars and temporal-kind columns assigned into numeric columns. Tests verify matching exception behavior with pandas for scalar NaT, datetime64, and timedelta64 values via iloc and loc.
Dict/set indexer rejection across getitem and setitem
python/cudf/cudf/core/indexing_utils.py, python/cudf/cudf/core/dataframe.py, python/cudf/cudf/tests/dataframe/indexing/test_loc.py, python/cudf/cudf/tests/dataframe/indexing/test_setitem.py
New check_dict_or_set_indexers helper raises TypeError for set/dict (including nested in tuples) as indexing keys. _DataFrameIndexer.__setitem__, _DataFrameLocIndexer.__getitem__, and _DataFrameIlocIndexer.__getitem__ call this validator. Tests verify pandas/cuDF exception parity.
Boolean Series label-alignment in indexing
python/cudf/cudf/core/indexing_utils.py, python/cudf/cudf/core/series.py, python/cudf/cudf/tests/dataframe/indexing/test_loc.py, python/cudf/cudf/tests/dataframe/indexing/test_setitem.py
parse_single_row_loc_key and _SeriesLocIndexer._loc_to_iloc now reindex boolean Series indexers by label when index differs, raising ValueError for unalignable masks. Tests verify boolean Series masks are applied by label alignment, not positional application.
DataFrame loc getitem/setitem: MultiIndex tuples, slice rejection, RHS alignment, boolean masks
python/cudf/cudf/core/dataframe.py, python/cudf/cudf/core/multiindex.py, python/cudf/cudf/tests/dataframe/indexing/test_getitem.py, python/cudf/cudf/tests/dataframe/indexing/test_loc.py, python/cudf/cudf/tests/dataframe/indexing/test_setitem.py
DataFrame.__getitem__ raises InvalidIndexError for tuple keys with slices. _getitem_tuple_arg treats list-of-full-tuples as explicit MultiIndex keys. _setitem_tuple_arg distinguishes boolean masks from label keys for DataFrame RHS and aligns RHS by index. Series RHS in multi-column scatter is assigned whole for label alignment. Boolean DataFrame/Series masks in __setitem__ are reindexed to self.index. MultiIndex._index_and_downcast skips downcasting for 1-level indices.
Numeric dtype reindex compatibility
python/cudf/cudf/core/indexed_frame.py, python/cudf/cudf/tests/dataframe/methods/test_reindex.py
_is_same_dtype returns True when both dtype kinds are numeric (i/u/f), enabling reindex to match int/float/uint label equality. A parametrized regression test validates reindexing a float index with int or float labels.
SingleColumnFrame.values error for masked dtypes with nulls
python/cudf/cudf/core/single_column_frame.py, python/cudf/cudf/tests/series/test_attributes.py
SingleColumnFrame.values imports numpy and raises ValueError when nulls are present and the dtype is a non-np.dtype nullable extension type, preventing silent float-NaN downcasting. Test verifies this raises for cuDF while pandas returns an ExtensionArray.
pandas-testing-plugin xfail map updates
python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
Removes ~70 xfail/skip entries to reflect now-fixed behaviors, including test_setitem_validation_scalar_int (temporal into numeric), test_loc_getitem_nested_indexer (set-typed indexers), test_multi_assign_broadcasting_rhs, test_loc_set_series_to_multiple_columns, test_iloc_ea_series_indexer_with_na, test_type_error_multiindex, and test_interpolate methods. Reorganizes DataFrame indexing test mapping.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

improvement

Suggested reviewers

  • rjzamora
  • Matt711
  • mroeschke
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main objective: aligning cudf indexing/setitem validation and alignment with pandas behavior across multiple aspects.
Description check ✅ Passed The description is directly related to the changeset, providing context for the indexing/setitem alignment changes and listing the key behavioral modifications made.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cudf/cudf/core/dataframe.py (1)

436-447: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Apply dict/set indexer validation to .iloc.__getitem__ too.

_DataFrameLocIndexer.__getitem__ now rejects dict/set keys, but _DataFrameIlocIndexer.__getitem__ still sends arg directly into iloc parsing, leaving .iloc[{...}]/.iloc[{...}, :] on the old path.

As per coding guidelines, cuDF Python reviews should flag pandas API compatibility regressions in public indexing behavior.

Suggested fix
 class _DataFrameIlocIndexer(_DataFrameIndexer):
@@
     def __getitem__(self, arg):
+        indexing_utils.check_dict_or_set_indexers(arg)
         (
             row_key,
             (
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf/cudf/core/dataframe.py` around lines 436 - 447, The __getitem__
method in _DataFrameIlocIndexer needs to validate that dict/set keys are
rejected, just like _DataFrameLocIndexer.__getitem__ already does. Before
calling indexing_utils.destructure_dataframe_iloc_indexer with the arg
parameter, add validation logic to check if arg is a dict or set and raise an
appropriate error to prevent these invalid index types from being processed
through the iloc path, ensuring consistent pandas API compatibility between .loc
and .iloc indexing behavior.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf/cudf/core/dataframe.py`:
- Around line 1643-1654: The fillna(False) and astype("bool") normalization for
boolean Series masks only runs in the conditional reindex branch, allowing
same-index nullable Series masks to reach as_column(arg) with null values still
present. Move the normalization steps (fillna(False) and astype("bool")) outside
of the conditional block so they apply to all Series masks regardless of whether
reindexing is needed. Restructure the logic so that after converting arg to a
cudf.Series, the fillna and type conversion are applied unconditionally, and
reindexing only occurs when the indexes do not match, ensuring all nullable
boolean Series masks are normalized before being converted to a column.
- Around line 349-365: The boolean Series row mask in `row_key_col` is not being
aligned with the target index before being used as `scatter_map`, which can
cause data to be written to incorrect rows. After reindexing `value` to align
with `target_index`, the `row_key_col` boolean Series must also be reindexed to
match the same `target_index` before it is assigned to `scatter_map`. This
ensures that the scatter operation writes values from the reindexed RHS
dataframe to the correct rows in the target dataframe, preventing silent data
corruption from misaligned boolean masks.
- Around line 405-410: The issue is that when key[0] represents a single row
(scalar key), assigning the entire Series to self._frame[col].loc[key[0]] treats
the Series object as a single cell value instead of extracting the appropriate
row value from the Series. To fix this, check whether key[0] is a scalar that
selects a single row: if it is, extract the scalar value from the Series at that
index position (using value.loc[key[0]] or value[key[0]]) and assign that
extracted value instead of the entire Series object to
self._frame[col].loc[key[0]]. This ensures proper pandas API compatibility and
correct row-based assignment semantics.

In `@python/cudf/cudf/core/indexing_utils.py`:
- Around line 685-699: The validation logic in the boolean Series indexer block
around line 693 incorrectly uses has_nulls() on the aligned column to detect
unalignable masks, which treats pre-existing nulls as errors when they should be
allowed. Instead of checking if the aligned column has any nulls, you need to
distinguish between pre-existing nulls in the original key Series and nulls
introduced by the reindexing process. Modify the check to only raise the error
if new nulls were introduced during reindexing (indicating missing labels in the
original index), while allowing pre-existing nulls from the original Series to
pass through. Compare the null patterns before and after the reindex call to
identify only the newly introduced nulls.

In `@python/cudf/cudf/core/series.py`:
- Around line 383-397: The issue is that the null check at line 391 using
`aligned._column.has_nulls()` cannot distinguish between pre-existing nulls in
the original `arg` Series and new nulls introduced by the reindex operation due
to missing labels. To fix this, compare the null status before and after
alignment: check if `arg._column.has_nulls()` is already true, and only raise
the "Unalignable boolean Series" error if the alignment operation introduced new
nulls that weren't present in the original `arg` Series. This ensures the
validation only catches cases where the alignment actually created mismatches,
not cases where the user intentionally provided a boolean Series with null
values.

---

Outside diff comments:
In `@python/cudf/cudf/core/dataframe.py`:
- Around line 436-447: The __getitem__ method in _DataFrameIlocIndexer needs to
validate that dict/set keys are rejected, just like
_DataFrameLocIndexer.__getitem__ already does. Before calling
indexing_utils.destructure_dataframe_iloc_indexer with the arg parameter, add
validation logic to check if arg is a dict or set and raise an appropriate error
to prevent these invalid index types from being processed through the iloc path,
ensuring consistent pandas API compatibility between .loc and .iloc indexing
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4dc99a95-5f9e-4273-b0c6-98af31f3535b

📥 Commits

Reviewing files that changed from the base of the PR and between fe332f5 and 2a7fa96.

📒 Files selected for processing (14)
  • python/cudf/cudf/core/column/numerical.py
  • python/cudf/cudf/core/dataframe.py
  • python/cudf/cudf/core/indexed_frame.py
  • python/cudf/cudf/core/indexing_utils.py
  • python/cudf/cudf/core/multiindex.py
  • python/cudf/cudf/core/series.py
  • python/cudf/cudf/core/single_column_frame.py
  • python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
  • python/cudf/cudf/tests/dataframe/indexing/test_getitem.py
  • python/cudf/cudf/tests/dataframe/indexing/test_loc.py
  • python/cudf/cudf/tests/dataframe/indexing/test_setitem.py
  • python/cudf/cudf/tests/dataframe/methods/test_reindex.py
  • python/cudf/cudf/tests/series/indexing/test_setitem.py
  • python/cudf/cudf/tests/series/test_attributes.py
💤 Files with no reviewable changes (1)
  • python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py

Comment thread python/cudf/cudf/core/dataframe.py
Comment thread python/cudf/cudf/core/dataframe.py Outdated
Comment thread python/cudf/cudf/core/dataframe.py Outdated
Comment thread python/cudf/cudf/core/indexing_utils.py Outdated
Comment thread python/cudf/cudf/core/series.py
Update SPDX copyright lines to the canonical 'NVIDIA CORPORATION & AFFILIATES. All rights reserved.' form expected by the verify-copyright pre-commit hook.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking change bug Something isn't working cudf.pandas Issues specific to cudf.pandas Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants