Skip to content

Fix calc_stats dataset mutation causing incorrect diff_std computation#103

Open
RajdeepKushwaha5 wants to merge 1 commit into
mllam:mainfrom
RajdeepKushwaha5:fix/calc-stats-mutation
Open

Fix calc_stats dataset mutation causing incorrect diff_std computation#103
RajdeepKushwaha5 wants to merge 1 commit into
mllam:mainfrom
RajdeepKushwaha5:fix/calc-stats-mutation

Conversation

@RajdeepKushwaha5

Copy link
Copy Markdown

Describe your changes

The calc_stats function in mllam_data_prep/ops/statistics.py mutates the ds variable inside the loop over statistics_config.ops. When a diff_* operation is encountered, the original dataset is overwritten:

ds = ds[vars_to_keep].diff(dim=splitting_dim)

This means any subsequent diff_* operation in the same loop applies .diff() on an already-diffed dataset instead of the original.

With the default config shipped in all example YAML files (ops: [mean, std, diff_mean, diff_std]):

  1. meands.mean(...)
  2. stdds.std(...)
  3. diff_mean → mutates ds to ds.diff(...), then mean(...) ✅ (but corrupts ds)
  4. diff_std → applies .diff() again on already-diffed data → computes std(diff(diff(ds)))

Expected: diff_std should compute std(diff(ds)).
Actual: diff_std computes std(diff(diff(ds))).

Fix: Use a per-iteration local variable ds_op instead of mutating ds, so each iteration always starts from the original dataset.

No dependencies required for this change.

Issue Link

Closes #102

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the documentation to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass

The calc_stats function mutated the ds variable inside the loop over
statistics_config.ops. When a diff_* operation was encountered, the
original dataset was overwritten with ds.diff(), causing subsequent
diff_* operations to apply diff() on already-diffed data.

With ops=[mean, std, diff_mean, diff_std], diff_std incorrectly
computed std(diff(diff(ds))) instead of std(diff(ds)).

Fix: use a per-iteration local variable ds_op instead of mutating ds.

Closes mllam#102
Copilot AI review requested due to automatic review settings March 20, 2026 02:55

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a logic error in calc_stats where encountering a diff_* op would overwrite the dataset used for subsequent ops, leading to incorrect results (e.g., diff_std computing a double diff when diff_mean appears earlier in the same ops list). This aligns behavior with the default YAML configs that compute both base and diff-based statistics.

Changes:

  • Avoid reassigning/mutating the loop dataset by introducing a per-iteration ds_op variable in calc_stats.
  • Add regression tests covering multiple diff_* ops in a single ops list (verifies diff_std uses a single diff of the original dataset).
  • Document the fix in the unreleased CHANGELOG (links to issue #102).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
mllam_data_prep/ops/statistics.py Ensures each statistic op runs against the intended (original) dataset by using a per-iteration dataset variable (ds_op).
tests/test_calc_stats_fix.py Adds regression coverage to prevent reintroducing the incorrect diff_std behavior.
CHANGELOG.md Records the bugfix under “Fixes” in the unreleased section.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: calc_stats mutates dataset in loop, causing incorrect diff_std computation

2 participants