Skip to content

[Perf] Enable forcing perf dispatch use specific implementation#401

Open
hughperkins wants to merge 15 commits intomainfrom
hp/perf-dispatch-force-implementation
Open

[Perf] Enable forcing perf dispatch use specific implementation#401
hughperkins wants to merge 15 commits intomainfrom
hp/perf-dispatch-force-implementation

Conversation

@hughperkins
Copy link
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Document the @qd.perf_dispatch auto-tuning decorator: basic usage,
decorator order, compatibility filtering, geometry hashing, tuning
parameters, and benchmarking internals.
Two env vars allow overriding the auto-tuning:

- QD_PERFDISPATCH_FORCE=dispatcher:impl[,dispatcher2:impl2,...]
  forces specific implementations for named dispatchers.
- QD_PERFDISPATCH_FORCE_INDEX=N forces the Nth registered
  implementation for all dispatchers (per-dispatcher override
  takes priority).

When either env var is set, all dispatchers print their name and
registered implementations on first call, making it easy to
discover valid values.
Only keep the per-dispatcher QD_PERFDISPATCH_FORCE env var.
Add section on forcing specific implementations and discovering
available dispatcher/implementation names.
@hughperkins hughperkins changed the title [Perf] Enable perf dispatch force implementation via env var [Perf] Enable forcing perf dispatch use specific implementation Mar 6, 2026
Use larger arrays, marker array c with IntEnum, do_work_py, and
.to_numpy() for assertions — matching the patterns that work
reliably on metal/vulkan GPU backends.
@hughperkins hughperkins marked this pull request as draft March 6, 2026 02:39
Comment on lines +92 to +94
self._dispatch_impl_set: set[DispatchImpl] = set()
self._dispatch_impl_list: list[DispatchImpl] = []
self._forced_impl: DispatchImpl | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you storing list and set? This is weird.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1,9 +1,11 @@
from enum import IntEnum
from typing import cast
from unittest import mock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use this. Use pytest.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +32 to +35
if ":" not in pair:
print(
f"[perf_dispatch] WARNING: ignoring malformed QD_PERFDISPATCH_FORCE entry '{pair}' (expected 'dispatcher:impl')"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a simple print is ok. It should be an error. Why would you tolerate this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

f"[perf_dispatch] WARNING: ignoring malformed QD_PERFDISPATCH_FORCE entry '{pair}' (expected 'dispatcher:impl')"
)
continue
dispatcher_name, impl_name = pair.split(":", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dispatcher_name, impl_name = pair.split(":") would guard against invalid characters for free.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

)
continue
dispatcher_name, impl_name = pair.split(":", 1)
result[dispatcher_name.strip()] = impl_name.strip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just do dispatcher_name, impl_name = map(str.strip, pair.split(":")) to avoid all these strip if you want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return result


_FORCE_MAP: dict[str, str] = _parse_force_map(_QD_PERFDISPATCH_FORCE_RAW)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a frozendict? (if it is already a deps, otherwise not worth it)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…e-implementation

# Conflicts:
#	docs/source/user_guide/index.md
@hughperkins
Copy link
Collaborator Author

hughperkins commented Mar 12, 2026

Opus 4.6 review:

Summary

The branch adds two things:

  1. A QD_PERFDISPATCH_FORCE env var that lets users bypass auto-tuning and pin a specific implementation for a given dispatcher (e.g. QD_PERFDISPATCH_FORCE=my_op:my_op_v2). Useful for debugging/profiling.
  2. A comprehensive user guide for perf_dispatch (docs/source/user_guide/user_guide/perf_dispatch.md).

What's done well

  • Clean parsing/resolution pattern. _parse_force_map() handles malformed input gracefully with warnings. The lazy _resolve_force() pattern — deferring resolution to first call rather than registration time — is the right approach since all @register calls may not have happened yet.
  • Good test coverage. Three tests cover the key cases: force by name, unmatched name fallback, and multiple dispatchers. Using mock.patch.object to inject _FORCE_MAP / _ANY_FORCE_ACTIVE is clean and avoids environment variable side effects.
  • Excellent documentation. The user guide is thorough — covers basic usage, decorator order, compatibility filtering, geometry hash, tuning parameters, the forcing mechanism, and a complete example. Well-structured.
  • Improved log messages. The switch from perf dispatch chose X to perf_dispatch 'my_op': chose 'my_op_v2' adds the dispatcher name, making multi-dispatcher logs actually debuggable.

Issues to address

1. _ANY_FORCE_ACTIVE triggers verbose prints for all dispatchers

When QD_PERFDISPATCH_FORCE=op_a:v1 is set, every dispatcher — even unrelated ones — prints registration and available-implementations messages. This is intentional for discoverability (via QD_PERFDISPATCH_FORCE=?) but will be noisy in production scripts with many dispatchers.

Consider gating the extra prints on the dispatcher actually being in the force map, and having a separate QD_PERFDISPATCH_LIST=1 mode for discovery.

2. Parallel _dispatch_impl_set and _dispatch_impl_list

The branch adds _dispatch_impl_list alongside the existing _dispatch_impl_set — both are appended to during registration. The list is needed for ordered name lookup during force resolution, but maintaining two parallel collections is a maintenance risk.

Consider whether the set can be replaced by the list (using set() views where needed), or at least add a comment explaining why both exist.

3. _resolve_force() called on every __call__

After the first call, _force_resolved is True and it short-circuits, but the branch check still runs on every invocation. This is cheap (a boolean check), but in a hot path it's an extra branch per call. Could be eliminated by swapping __call__ to a different method after resolution, though that's arguably over-engineering.

4. Stray SPIRV-Cross submodule bump

The diff includes a SPIRV-Cross submodule pointer change (2dcbe8d → d8e3e2b), likely pulled in from the merge commit. Probably unintentional — should be excluded or called out.

5. No negative test for _parse_force_map

The parser warns on malformed entries (no colon), but there's no test exercising that warning path directly, or testing edge cases like trailing commas, empty entries, or whitespace-only entries.

@hughperkins
Copy link
Collaborator Author

additional thougths from reviewer opus on concern 1:

This one I'd actually leave as-is. A few reasons:

  1. The noise is bounded and useful. In practice, a project probably has a handful of perf_dispatch dispatchers, not hundreds. Seeing all of them when you set the env var gives
    you immediate orientation — you know what's available without having to grep the codebase.
  2. The discovery and forcing use cases are intertwined. If you only printed info for the targeted dispatcher, you'd need to already know the exact dispatcher name to use the
    env var — but the whole point of the verbose output is to help you discover names in the first place. A separate QD_PERFDISPATCH_LIST=1 env var would work but adds another
    knob for marginal benefit.
  3. It's already documented. The user guide explicitly describes this as the discovery mechanism (QD_PERFDISPATCH_FORCE=?). Changing the behavior would mean the docs and the
    code diverge.

If it ever does become noisy (many dispatchers), it could be refined later by only printing the full listing once at process exit or by adding a filter. But that's a problem for
when it actually materializes. I'd drop this concern from the review or relabel it as "future consideration."
Want me to update the review doc?

@hughperkins
Copy link
Collaborator Author

update from reviewer opus on concern 5:

Honestly, this one is borderline. _parse_force_map is a simple string parser that's only called once at module load with an env var value. The happy paths (valid entries,
empty string) are already exercised indirectly by the force tests. The edge cases (no colon, trailing commas, whitespace) are trivial enough that you can verify correctness by
reading the code — split(","), strip(), and the ":" not in pair check handle them obviously.
Adding a dedicated unit test for a 15-line parser that's unlikely to be modified feels like test bloat. If the parser were more complex or had tricky corner cases, it'd be
different. I'd drop this concern.

@hughperkins
Copy link
Collaborator Author

reviewer opus update on concern 3:

This is technically correct — there's a self._resolve_force() call plus a self._forced_impl is not None check on every call — but it's not worth fixing. Two boolean checks
(self._force_resolved early-return, then self._forced_impl is not None) are essentially free compared to the actual kernel execution that follows. Python function call
overhead on _resolve_force() is in the tens of nanoseconds.
The alternative (swapping call to a different method after resolution, or using a flag to replace the method) adds real complexity for an unmeasurable performance gain.
I'd drop this one too.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 02:35
Rejects entries with multiple colons via unpacking ValueError
rather than silently accepting them.
…tead of warning

Fail loudly so users don't silently get normal benchmarking when
they intended to force an implementation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants