Skip to content

Fix groupby any/all on null-containing string columns#22926

Open
galipremsagar wants to merge 2 commits into
rapidsai:mainfrom
galipremsagar:groupby_string_bool_nulls
Open

Fix groupby any/all on null-containing string columns#22926
galipremsagar wants to merge 2 commits into
rapidsai:mainfrom
galipremsagar:groupby_string_bool_nulls

Conversation

@galipremsagar

@galipremsagar galipremsagar commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Description

GroupBy.all / GroupBy.any returned wrong results for string columns containing nulls. Two issues in GroupBy._bool_reduce's string→bool coercion:

  1. Lost missingness for na_value=np.nan string dtypes. count_characters() > 0 does not propagate the source null (a NaN-flavored missing compares as False), so a missing value was treated as a real empty/falsy string. The source column's null positions are now restored on the bool column.
  2. astype(np.bool_) on a null-containing extension bool. Casting a null-containing pandas-extension bool to numpy bool is (intentionally) rejected in pandas-compatible mode. The bool column is now kept nullable through the min/max aggregation and normalized to np.bool_ only after empty/skipna groups are filled.

As a result, all over an all-null group is now vacuously True and any is False (with nulls dropped under skipna=True), matching pandas across all four StringDtype variants (python/pyarrow × na_value=pd.NA/np.nan).

This fixes tests/groupby/test_reductions.py::test_string_dtype_all_na (40 parametrizations) in the cudf.pandas pandas test suite, and incidentally test_any and test_groupby_bool_aggs[True-any-vals12]; their xfail entries are dropped.

Out of scope

test_masked_kleene_logic[any-False-...] remains xfailed: a mixed True/False/NA group under skipna=False requires full three-valued (Kleene) NA propagation (e.g. False OR NA -> NA), which is a larger change than the all-null / all-valued cases addressed here.

Tests

Added test_groupby_any_all_string_nulls (cuDF unit test) covering any/all × skipna over null and valued string groups across all four StringDtype variants. It fails without this change.

Verification

  • pandas-testing/.../test_reductions.py under cudf.pandas (with plugin): 1243 passed, 121 xfailed, 0 failed, 0 XPASS.
  • Full cuDF groupby unit suite: no regressions.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar requested a review from a team as a code owner June 18, 2026 22:27
@galipremsagar galipremsagar requested review from Matt711 and wence- June 18, 2026 22:27
@github-actions github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Jun 18, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 18, 2026
@galipremsagar galipremsagar added bug Something isn't working breaking Breaking change labels Jun 18, 2026
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c297dead-9885-4f6e-9c83-21911832379b

📥 Commits

Reviewing files that changed from the base of the PR and between 0e24684 and 92d6582.

📒 Files selected for processing (2)
  • python/cudf/cudf/core/groupby/groupby.py
  • python/cudf/cudf/tests/groupby/test_reductions.py

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • GroupBy any()/all() operations now correctly preserve null values and match pandas output data types across nullable and Arrow-backed data formats.
  • Tests

    • Added validation tests for any()/all() groupby operations with null values and various data types to ensure pandas compatibility.

Walkthrough

GroupBy._bool_reduce's _to_bool_col helper is updated to restore the source column's null mask after boolean coercion when null counts diverge, fill missing values with True under skipna=False, and defer casting to np.bool_ until nulls are absent. A new _bool_result_dtype function selects output dtypes matching input flavor (Arrow-backed to pd.ArrowDtype(pa.bool_()), pandas nullable to pd.BooleanDtype(), others to np.bool_), applied to both Series and DataFrame results. Two parametrized tests validate any/all on grouped nullable strings and confirm output dtype matching across dtype flavors. Pandas-testing failure expectations are updated accordingly.

Changes

GroupBy boolean reduction null mask and dtype fix

Layer / File(s) Summary
Import and _to_bool_col null mask fix
python/cudf/cudf/core/groupby/groupby.py
Adds is_pandas_nullable_extension_dtype import and refines _to_bool_col to reapply the source null mask when null counts diverge after boolean coercion, fill remaining nulls with True for skipna=False, and defer np.bool_ cast until no nulls remain.
Output dtype selection and application
python/cudf/cudf/core/groupby/groupby.py
Introduces _bool_result_dtype to map input dtype flavor (Arrow-backed, pandas nullable/masked integer-like, other) to corresponding output boolean dtypes, applies dtype selection when filling empty/all-null groups for Series results, and extends dtype-aware fillna and casting to DataFrame value columns.
New parametrized tests for string nulls and result dtype matching
python/cudf/cudf/tests/groupby/test_reductions.py
Updates copyright header and adds test_groupby_any_all_string_nulls covering grouped any/all over multiple StringDtype variants with skipna handling, and test_groupby_any_all_result_dtype validating output dtype matches pandas across nullable/pyarrow dtype flavors under pandas_compatible mode.
Pandas-testing failure expectation updates
python/cudf/cudf/pandas/scripts/pandas-testing-plugin.py
Removes now-passing test_any and test_groupby_bool_aggs[True-any-vals12] from failure map, removes large block of test_string_dtype_all_na failures, adds TODO placeholders for test_first_last_skipna and test_sum_skipna cases, and updates expectations for multiple other reduction test cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • rapidsai/cudf#22813: Modifies the same GroupBy._bool_reduce boolean-coercion and casting path, skipping grouping-key columns to prevent incorrect casting — directly adjacent to this PR's null-mask and dtype fix in the same function.

Suggested labels

bug, Python, non-breaking, cudf.pandas

Suggested reviewers

  • vyasr
  • mroeschke
  • bdice
  • brandon-b-miller
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main fix: addressing incorrect results in GroupBy any/all operations on null-containing string columns.
Description check ✅ Passed The description is comprehensive and clearly related to the changeset, detailing the two root causes, the fix approach, test coverage, and verification results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking change bug Something isn't working cudf.pandas Issues specific to cudf.pandas Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants