Fix calc_stats dataset mutation causing incorrect diff_std computation by RajdeepKushwaha5 · Pull Request #103 · mllam/mllam-data-prep

RajdeepKushwaha5 · 2026-03-20T02:55:21Z

Describe your changes

The calc_stats function in mllam_data_prep/ops/statistics.py mutates the ds variable inside the loop over statistics_config.ops. When a diff_* operation is encountered, the original dataset is overwritten:

ds = ds[vars_to_keep].diff(dim=splitting_dim)

This means any subsequent diff_* operation in the same loop applies .diff() on an already-diffed dataset instead of the original.

With the default config shipped in all example YAML files (ops: [mean, std, diff_mean, diff_std]):

mean → ds.mean(...) ✅
std → ds.std(...) ✅
diff_mean → mutates ds to ds.diff(...), then mean(...) ✅ (but corrupts ds)
diff_std → applies .diff() again on already-diffed data → computes std(diff(diff(ds))) ❌

Expected: diff_std should compute std(diff(ds)).
Actual: diff_std computes std(diff(diff(ds))).

Fix: Use a per-iteration local variable ds_op instead of mutating ds, so each iteration always starts from the original dataset.

No dependencies required for this change.

Issue Link

Closes #102

Type of change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
I have performed a self-review of my code
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
I have updated the documentation to cover introduced code changes
I have added tests that prove my fix is effective or that my feature works
I have given the PR a name that clearly describes the change, written in imperative form (context).
I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

the code is readable
the code is well tested
the code is documented (including return types and parameters)
the code is easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change, in a section
reflecting type of change (add section where missing):
- added: when you have added new functionality
- changed: when default behaviour of the code has been changed
- fixes: when your contribution fixes a bug

Checklist for assignee

PR is up to date with the base branch
the tests pass

The calc_stats function mutated the ds variable inside the loop over statistics_config.ops. When a diff_* operation was encountered, the original dataset was overwritten with ds.diff(), causing subsequent diff_* operations to apply diff() on already-diffed data. With ops=[mean, std, diff_mean, diff_std], diff_std incorrectly computed std(diff(diff(ds))) instead of std(diff(ds)). Fix: use a per-iteration local variable ds_op instead of mutating ds. Closes mllam#102

Copilot

Pull request overview

Fixes a logic error in calc_stats where encountering a diff_* op would overwrite the dataset used for subsequent ops, leading to incorrect results (e.g., diff_std computing a double diff when diff_mean appears earlier in the same ops list). This aligns behavior with the default YAML configs that compute both base and diff-based statistics.

Changes:

Avoid reassigning/mutating the loop dataset by introducing a per-iteration ds_op variable in calc_stats.
Add regression tests covering multiple diff_* ops in a single ops list (verifies diff_std uses a single diff of the original dataset).
Document the fix in the unreleased CHANGELOG (links to issue #102).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`mllam_data_prep/ops/statistics.py`	Ensures each statistic op runs against the intended (original) dataset by using a per-iteration dataset variable (`ds_op`).
`tests/test_calc_stats_fix.py`	Adds regression coverage to prevent reintroducing the incorrect `diff_std` behavior.
`CHANGELOG.md`	Records the bugfix under “Fixes” in the unreleased section.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings March 20, 2026 02:55

Copilot started reviewing on behalf of RajdeepKushwaha5 March 20, 2026 02:55 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix calc_stats dataset mutation causing incorrect diff_std computation#103

Fix calc_stats dataset mutation causing incorrect diff_std computation#103
RajdeepKushwaha5 wants to merge 1 commit into
mllam:mainfrom
RajdeepKushwaha5:fix/calc-stats-mutation

RajdeepKushwaha5 commented Mar 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

RajdeepKushwaha5 commented Mar 20, 2026

Describe your changes

Issue Link

Type of change

Checklist before requesting a review

Checklist for reviewers

Author checklist after completed review

Checklist for assignee

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants