Fix groupby any/all on null-containing string columns#22926
Fix groupby any/all on null-containing string columns#22926galipremsagar wants to merge 2 commits into
any/all on null-containing string columns#22926Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughSummary by CodeRabbit
Walkthrough
ChangesGroupBy boolean reduction null mask and dtype fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Description
GroupBy.all/GroupBy.anyreturned wrong results for string columns containing nulls. Two issues inGroupBy._bool_reduce's string→bool coercion:na_value=np.nanstring dtypes.count_characters() > 0does not propagate the source null (a NaN-flavored missing compares asFalse), so a missing value was treated as a real empty/falsy string. The source column's null positions are now restored on the bool column.astype(np.bool_)on a null-containing extension bool. Casting a null-containing pandas-extension bool to numpy bool is (intentionally) rejected in pandas-compatible mode. The bool column is now kept nullable through themin/maxaggregation and normalized tonp.bool_only after empty/skipnagroups are filled.As a result,
allover an all-null group is now vacuouslyTrueandanyisFalse(with nulls dropped underskipna=True), matching pandas across all fourStringDtypevariants (python/pyarrow×na_value=pd.NA/np.nan).This fixes
tests/groupby/test_reductions.py::test_string_dtype_all_na(40 parametrizations) in thecudf.pandaspandas test suite, and incidentallytest_anyandtest_groupby_bool_aggs[True-any-vals12]; their xfail entries are dropped.Out of scope
test_masked_kleene_logic[any-False-...]remains xfailed: a mixed True/False/NA group underskipna=Falserequires full three-valued (Kleene) NA propagation (e.g.False OR NA -> NA), which is a larger change than the all-null / all-valued cases addressed here.Tests
Added
test_groupby_any_all_string_nulls(cuDF unit test) coveringany/all×skipnaover null and valued string groups across all fourStringDtypevariants. It fails without this change.Verification
pandas-testing/.../test_reductions.pyundercudf.pandas(with plugin): 1243 passed, 121 xfailed, 0 failed, 0 XPASS.Checklist