[MISC] Speedup rigid body mass matrix cholesky decomposition for large dof count. #2915

Merged
duburcqa merged 1 commit into Genesis-Embodied-AI:main from Kashu7100:feat/register-tiled-mass on Jun 22, 2026
@Kashu7100 Kashu7100 commented Jun 8, 2026

Collaborator

Summary

Follow-up to #2869. That PR made the constraint-Hessian Cholesky register-tiled and fast. This PR does the same for the other dense factorization in the rigid step — the per-entity mass-matrix factor (func_factor_mass).

For high-DOF articulated bodies (SMPL/SMPL-X, dexterous hands), once an entity's mass submatrix exceeds GPU shared memory, the factor falls back to the cooperative-global LDLᵀ path, which becomes the dominant cost. Profiling two SMPL-X humanoids (159 DOFs each, 318 total) at 1024 envs on an RTX 3080 shows the mass factor taking ~89% of the whole substep.

This adds a register-streaming tiled factor for blocks that exceed the shared-memory cap, reusing the same qd.simt.TileNxN Cholesky primitive as the constraint Hessian. The mass matrix is block-diagonal across entities, so each entity's n_dofs×n_dofs block factors independently in one warp of T lanes.
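To make the two ideas in this paragraph concrete, here is a minimal NumPy sketch of a tiled (blocked, right-looking) Cholesky applied independently to each diagonal block. It is only an illustration of the technique: `tiled_cholesky` and `factor_block_diagonal` are hypothetical names, the tile loop runs sequentially on the CPU, and nothing here reflects how `qd.simt.TileNxN` schedules work across a warp.

```python
import numpy as np


def tiled_cholesky(A, T=16):
    """Blocked right-looking Cholesky. Returns lower L with A = L @ L.T."""
    n = A.shape[0]
    W = np.array(A, dtype=np.float64)  # working copy, updated tile by tile
    for k in range(0, n, T):
        ke = min(k + T, n)
        # 1) Factor the T x T diagonal tile.
        Lkk = np.linalg.cholesky(W[k:ke, k:ke])
        W[k:ke, k:ke] = Lkk
        if ke < n:
            # 2) Panel solve: L_panel = A_panel @ inv(Lkk).T
            panel = np.linalg.solve(Lkk, W[ke:, k:ke].T).T
            W[ke:, k:ke] = panel
            # 3) Symmetric update of the trailing submatrix.
            W[ke:, ke:] -= panel @ panel.T
    return np.tril(W)


def factor_block_diagonal(M, block_sizes, T=16):
    """Factor each diagonal block of M on its own (one block per entity)."""
    L = np.zeros_like(M, dtype=np.float64)
    start = 0
    for n in block_sizes:
        sl = slice(start, start + n)
        L[sl, sl] = tiled_cholesky(M[sl, sl], T)
        start += n
    return L
```

Because the Cholesky factor of a block-diagonal matrix is itself block-diagonal, factoring the blocks separately gives the same result as factoring the full matrix, which is what allows one warp per entity.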

Results — measured on this branch @ main (quadrants 1.0.2, RTX 3080, 2×SMPL-X = 318 DOFs)

| metric | cooperative-global (current) | register-tiled (this PR) | speedup |
| --- | --- | --- | --- |
| mass-factor kernel kernel_step_1 @ 1024 envs | 462 ms | 29 ms | ~15.7× |
| full substep @ 1024 envs | 521 ms | 89 ms | ~5.9× |
| mass-factor kernel kernel_step_1 @ 512 envs | 237 ms | 16 ms | ~14.8× |
| full substep @ 512 envs | 285 ms | 66 ms | ~4.3× |

After the fix, the constraint solve (the remaining ~58 ms of the substep, unchanged) dominates, and the mass factor is no longer the bottleneck. Also confirmed inert, with no regression, on a low-DOF robot (G1, 29 DOFs ≤ cap, so it stays on the unchanged shared-memory path) and on CPU.
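As a quick sanity check on the bottleneck claim, the share of the substep spent in the mass factor can be recomputed from the rounded timings in the results table. The helper name `mass_share` is just for illustration.

```python
# Rounded timings in ms, copied from the results table: (mass factor, full substep).
timings = {
    (1024, "cooperative-global"): (462, 521),
    (1024, "register-tiled"): (29, 89),
    (512, "cooperative-global"): (237, 285),
    (512, "register-tiled"): (16, 66),
}


def mass_share(mass_ms, substep_ms):
    """Fraction of the substep spent in the mass-factor kernel."""
    return mass_ms / substep_ms


for (envs, path), (mass_ms, substep_ms) in timings.items():
    rest_ms = substep_ms - mass_ms  # everything other than the mass factor
    print(f"{envs} envs, {path}: mass share {mass_share(mass_ms, substep_ms):.0%}, rest {rest_ms} ms")
```

At 1024 envs the mass factor drops from roughly 89% of the substep to about a third, while the remainder stays near 60 ms in both cases.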

Correctness (verified on main)

func_solve_mass_entity consumes the LTDL form M = Lᵀ D L (L unit-lower, reverse-pivot — same as the stock scalar/cooperative/shared factors), not standard L D Lᵀ. The tile primitive does forward Cholesky, so we factor the reverse-indexed block M_rev[a,b]=M[n-1-a,n-1-b] and map its factor back to M's LTDL factor.
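One way to see why the reverse-indexing trick works is the short derivation below. It assumes the tile primitive returns a standard lower Cholesky factor $C$; the symbols $J$, $C$ and $U$ are introduced here for illustration and are not names from the code.

$$
\begin{aligned}
M_{\mathrm{rev}} &= J M J, \qquad J_{ab} = \delta_{a,\,n-1-b}, \quad J^2 = I, \\
M_{\mathrm{rev}} &= C C^{\top} \quad (C \text{ lower triangular, positive diagonal}), \\
M &= J C C^{\top} J = (J C J)(J C J)^{\top} = U U^{\top}, \qquad U = J C J \ \text{upper triangular}, \\
U &= L^{\top} D^{1/2}, \qquad D = \operatorname{diag}(U_{ii}^{2}), \quad L_{ji} = U_{ij} / U_{jj}, \\
M &= L^{\top} D L \quad (L \text{ unit lower triangular}).
\end{aligned}
$$

Reversing both index axes turns the lower factor of $M_{\mathrm{rev}}$ into an upper factor of $M$, and pulling the diagonal out of $U$ yields the unit-lower $L$ and diagonal $D$ that the LTDL solve expects.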

Verified on a main env (quadrants==1.0.2): the factor reconstructs M in the solve's convention to ‖M − Lᵀ D L‖/‖M‖ ≈ 2.6e-7 on both the 159-DOF and 29-DOF blocks (matches the stock factor's 4e-7), and trajectories match the cooperative-global baseline (rel 2.5e-7).
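The reconstruction check described here can be reproduced in miniature with NumPy. The sketch below builds the LTDL factor from a forward Cholesky of the reverse-indexed matrix and evaluates the same relative residual; `ltdl_via_reversed_cholesky` and `ltdl_residual` are hypothetical helper names, and the float64 arithmetic here will not reproduce the residual magnitudes quoted above.

```python
import numpy as np


def ltdl_via_reversed_cholesky(M):
    """Return (L, D) with M = L.T @ diag(D) @ L and L unit lower triangular."""
    # Reverse both axes: M_rev[a, b] = M[n-1-a, n-1-b].
    M_rev = np.ascontiguousarray(M[::-1, ::-1])
    C = np.linalg.cholesky(M_rev)  # forward Cholesky: M_rev = C @ C.T
    U = C[::-1, ::-1]              # upper triangular, M = U @ U.T
    d = np.diag(U)
    L = (U / d).T                  # divide column j by d[j], then transpose
    return L, d**2


def ltdl_residual(M, L, D):
    """Relative residual ||M - L^T D L|| / ||M|| (Frobenius norm)."""
    return np.linalg.norm(M - L.T @ (D[:, None] * L)) / np.linalg.norm(M)


rng = np.random.default_rng(0)
X = rng.standard_normal((29, 29))
M = X @ X.T + 29 * np.eye(29)  # random symmetric positive definite test matrix
L, D = ltdl_via_reversed_cholesky(M)
print(ltdl_residual(M, L, D))
```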

Design / safety

  • Gated off by default behind RigidOptions.register_tiled_mass (None = auto → the GS_REGISTER_TILED_MASS env var, default off; True/False overrides). No-op until opted in.
  • CUDA forward only. CPU, backward/AD, and the under-cap shared-memory tiled path are untouched (the AD factor keeps its dedicated safe-access branch). The dispatch flag is only set in the gs.backend != gs.cpu block, so the path is fully inert on CPU.
  • The only edit to existing control flow is one if/elif branch in func_factor_mass; everything else is additive.
  • Scratch buffer / memory: #2860 moved the register-tile Cholesky into quadrants, whose tile slice ops are batch-first (arr[batch, rows, cols]), while mass_mat_L is canonical batch-last (n_dofs, n_dofs, _B). So the factorization can't run in place; it uses a dedicated batch-first scratch mass_mat_tiled of shape (n_entities·_B, P, P) (P = tiled_n_dofs_per_entity), allocated only when the path is enabled (e.g. ~335 MB at 1024 envs / 2×SMPL-X). Open to alternatives, e.g. transiently reusing the Hessian's nt_H (also batch-first, free at factor time) for zero extra memory, if maintainers prefer.
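The layout mismatch in the last bullet can be illustrated with a small NumPy sketch that copies per-entity blocks from a batch-last array into a batch-first scratch. Several details are assumptions made for the example and are not confirmed by the PR: the helper name `pack_batch_first`, the entity-major ordering of the first scratch axis, and padding the unused diagonal with ones so that each P×P block stays positive definite.

```python
import numpy as np


def pack_batch_first(mass_mat, entity_dof_ranges, P):
    """Copy per-entity blocks from (n_dofs, n_dofs, B) into (n_entities * B, P, P)."""
    _, _, B = mass_mat.shape
    n_entities = len(entity_dof_ranges)
    scratch = np.zeros((n_entities * B, P, P), dtype=mass_mat.dtype)
    for e, (start, stop) in enumerate(entity_dof_ranges):
        n = stop - start
        block = mass_mat[start:stop, start:stop, :]  # (n, n, B), batch-last
        rows = slice(e * B, (e + 1) * B)             # assumed entity-major ordering
        scratch[rows, :n, :n] = np.moveaxis(block, -1, 0)  # (B, n, n), batch-first
        pad = np.arange(n, P)
        scratch[rows, pad, pad] = 1.0                # assumed identity padding
    return scratch
```

With this layout, `scratch[e * B + b]` is a contiguous P×P matrix for entity `e` in environment `b`, which is the shape a batch-first tile operation would slice.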

Files (4, all additive)

  • engine/solvers/rigid/abd/forward_dynamics.py: the tiled factor _factor_mass_tiled_impl + dispatcher (qd.simt.Tile16x16/32x32); new if branch in func_factor_mass.
  • utils/array_class.py: enable_register_tiled_mass + cholesky_tile_size_mass static config, and the mass_mat_tiled scratch allocation.
  • options/solvers.py: user-facing RigidOptions.register_tiled_mass.
  • engine/solvers/rigid/rigid_solver.py: tile-size selection + gating.
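The gating rules from the Design / safety section can be summarized in a small standalone sketch. This is not the code in rigid_solver.py: both function names are hypothetical, the env var is assumed to enable the path only when set to "1", `shared_cap` is a placeholder for the shared-memory DOF limit (whose real value is not given in this PR), and tile-size selection is not modeled.

```python
import os


def resolve_register_tiled_mass(option, environ=None):
    """Tri-state option: True/False override, None defers to the env var."""
    if option is not None:
        return bool(option)
    environ = os.environ if environ is None else environ
    # Assumption: "1" enables the path; unset or anything else leaves it off.
    return environ.get("GS_REGISTER_TILED_MASS", "0") == "1"


def use_register_tiled_mass(option, backend, n_dofs, shared_cap,
                            requires_grad=False, environ=None):
    """True only if opted in, not on CPU, forward-only, and above the cap."""
    enabled = resolve_register_tiled_mass(option, environ)
    return (
        enabled
        and backend != "cpu"      # dispatch flag is only set off-CPU
        and not requires_grad     # backward/AD keeps its own factor
        and n_dofs > shared_cap   # under-cap blocks stay on the shared-memory path
    )
```

With a placeholder cap of 64, a 159-DOF SMPL-X entity would take the tiled path once the option is set, while a 29-DOF G1 would not, matching the behavior described above.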

A separate small follow-up documents that RigidSolver.get_mass_mat(decompose=True) returns the LTDL factor (#2916).

🤖 Generated with Claude Code

@Kashu7100 Kashu7100 force-pushed the feat/register-tiled-mass branch 2 times, most recently from 149cc54 to 00b7052 Compare June 9, 2026 09:03
@Kashu7100

Collaborator Author

The three red checks on the previous push were pre-existing CI flakes, not from this change:

  • windows-2025-3.12-cpu-ndarray — 10 errors, all huggingface_hub 429 Too Many Requests / LocalEntryNotFound while fetching test assets from the Hub (network rate-limit).
  • production-unit_tests-ndarray (exit 3) — 747 passed; the non-zero exit is a pytest-timeout (>1200 s) on test_sensor_camera.py::test_batch_renderer[2-cuda] crashing an xdist worker.
  • ubuntu-24.04-3.12-cpu-field (exit 3) — 122 passed; same pattern, a >1200 s timeout on test_render.py::test_render_api_advanced[0-RASTERIZER] (which also hit the HF 429 above).

tests/test_rigid_physics.py::test_mass_mat passed on every job. No CPU code path is affected (the feature is gated behind gs.backend != gs.cpu and off by default). I've re-pushed to re-trigger CI.

@Kashu7100 Kashu7100 force-pushed the feat/register-tiled-mass branch 2 times, most recently from 00c758b to 6a28583 Compare June 9, 2026 09:51
@Kashu7100 Kashu7100 marked this pull request as ready for review June 9, 2026 09:56
@Kashu7100 Kashu7100 force-pushed the feat/register-tiled-mass branch from 6a28583 to 82f389c Compare June 11, 2026 21:10
@duburcqa duburcqa changed the title [PERF] Register-tiled per-entity mass-matrix factorization (follow-up to #2869) [MISC] Register-tiled per-entity mass-matrix factorization. Jun 22, 2026
@duburcqa duburcqa changed the title [MISC] Register-tiled per-entity mass-matrix factorization. [MISC] Speedup rigid body mass matrix cholesky decomposition for large dof count. Jun 22, 2026
@duburcqa duburcqa force-pushed the feat/register-tiled-mass branch from 82f389c to a79f89f Compare June 22, 2026 15:22
@duburcqa duburcqa force-pushed the feat/register-tiled-mass branch from a79f89f to 224cd3a Compare June 22, 2026 15:28
@duburcqa duburcqa merged commit 4a5be7f into Genesis-Embodied-AI:main Jun 22, 2026
22 of 23 checks passed