[MISC] Speedup rigid body mass matrix cholesky decomposition for large dof count. by Kashu7100 · Pull Request #2915 · Genesis-Embodied-AI/genesis-world

Kashu7100 · 2026-06-08T10:02:39Z

Summary

Follow-up to #2869. That PR made the constraint-Hessian Cholesky register-tiled and fast. This PR does the same for the other dense factorization in the rigid step — the per-entity mass-matrix factor (func_factor_mass).

For high-DOF articulated bodies (SMPL/SMPL-X, dexterous hands), once an entity's mass submatrix exceeds GPU shared memory, the factor falls back to the cooperative-global LDLᵀ path, which becomes the dominant cost. Profiling two SMPL-X humanoids (159 DOFs each, 318 total) at 1024 envs on an RTX 3080 shows the mass factor taking ~89% of the whole substep.

This adds a register-streaming tiled factor for that >shared-cap path, reusing the same qd.simt.TileNxN Cholesky primitive as the constraint Hessian. The mass matrix is block-diagonal across entities, so each entity's n_dofs×n_dofs block factors independently in one warp of T lanes.
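The block-diagonal argument can be checked numerically. The sketch below is plain NumPy, not the PR's kernel: it builds a block-diagonal SPD matrix from two hypothetical entity blocks (the sizes 5 and 3 are arbitrary), factors each block on its own, and compares the assembled result with a Cholesky factorization of the full matrix.

```python
import numpy as np

def random_spd(n, rng):
    """Return a random symmetric positive-definite n x n matrix."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

def block_diag(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    start = 0
    for b in blocks:
        end = start + b.shape[0]
        out[start:end, start:end] = b
        start = end
    return out

rng = np.random.default_rng(0)
blocks = [random_spd(5, rng), random_spd(3, rng)]  # hypothetical entity sizes
M = block_diag(blocks)

# Factor each entity block independently, as separate workers could.
per_block = block_diag([np.linalg.cholesky(b) for b in blocks])

# Factor the full matrix in one go for comparison.
full = np.linalg.cholesky(M)
```

Because the Cholesky factor of an SPD matrix is unique, `per_block` and `full` agree up to rounding, which is why no cross-entity communication is needed.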

Results — measured on this branch @ `main` (quadrants 1.0.2, RTX 3080, 2×SMPL-X = 318 DOFs)

| metric | cooperative-global (current) | register-tiled (this PR) | speedup |
| --- | --- | --- | --- |
| mass-factor kernel `kernel_step_1` @ 1024 envs | 462 ms | 29 ms | ~15.7× |
| full substep @ 1024 envs | 521 ms | 89 ms | ~5.9× |
| mass-factor kernel `kernel_step_1` @ 512 envs | 237 ms | 16 ms | ~14.8× |
| full substep @ 512 envs | 285 ms | 66 ms | ~4.3× |

After the fix, the constraint solve (the rest of the substep, ~58 ms, unchanged) dominates; the mass factor is no longer the bottleneck. Also confirmed inert, with no regression, on a low-DOF robot (G1, 29 DOFs ≤ cap, so it stays on the unchanged shared-memory path) and on CPU.

Correctness (verified on `main`)

func_solve_mass_entity consumes the LTDL form M = Lᵀ D L (L unit-lower, reverse-pivot — same as the stock scalar/cooperative/shared factors), not standard L D Lᵀ. The tile primitive does forward Cholesky, so we factor the reverse-indexed block M_rev[a,b]=M[n-1-a,n-1-b] and map its factor back to M's LTDL factor.
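The mapping can be illustrated in NumPy. This is a minimal sketch of the algebra only, not the tile kernel, and the function name is hypothetical. If M_rev = C Cᵀ with C lower-triangular, then reversing C in both axes gives an upper-triangular U with M = U Uᵀ. Scaling column j of U by 1/U[j,j] and transposing yields the unit-lower L, with D = diag(U)².

```python
import numpy as np

def ltdl_from_reversed_cholesky(M):
    """Return (L, d) with M = L.T @ diag(d) @ L and L unit lower-triangular."""
    M_rev = M[::-1, ::-1]            # M_rev[a, b] = M[n-1-a, n-1-b]
    C = np.linalg.cholesky(M_rev)    # forward Cholesky: M_rev = C @ C.T, C lower
    U = C[::-1, ::-1]                # reverse back: M = U @ U.T, U upper
    s = np.diag(U).copy()            # s[j] = sqrt(d[j])
    L = (U / s).T                    # divide column j by s[j], then transpose
    return L, s ** 2

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
M = A @ A.T + 6.0 * np.eye(6)        # arbitrary SPD test matrix
L, d = ltdl_from_reversed_cholesky(M)
```

The check `L.T @ np.diag(d) @ L ≈ M` is the same reconstruction used to validate the factor in the solve's convention.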

Verified on a main env (quadrants==1.0.2): the factor reconstructs M in the solve's convention to ‖M − Lᵀ D L‖/‖M‖ ≈ 2.6e-7 on both the 159-DOF and 29-DOF blocks (matches the stock factor's 4e-7), and trajectories match the cooperative-global baseline (rel 2.5e-7).
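For reference, the reconstruction metric can be reproduced against a textbook dense reverse-pivot LTDL elimination. The sketch below is not the stock Genesis factor. The function names are hypothetical, the Frobenius norm is an assumption (the text does not say which norm is used), and it runs in float64 rather than the GPU precision.

```python
import numpy as np

def ltdl_scalar(M):
    """Dense reverse-pivot LTDL: returns (L, d) with M = L.T @ diag(d) @ L."""
    H = np.array(M, dtype=np.float64)    # working copy; only the lower triangle is used
    n = H.shape[0]
    for k in range(n - 1, -1, -1):       # pivot from the last DOF to the first
        for i in range(k - 1, -1, -1):   # descending i keeps H[k, :i] unscaled
            a = H[k, i] / H[k, k]
            H[i, : i + 1] -= a * H[k, : i + 1]
            H[k, i] = a                  # store the L entry in place
    L = np.tril(H, -1) + np.eye(n)
    return L, np.diag(H).copy()

def ltdl_rel_error(M, L, d):
    """Relative Frobenius error ||M - L^T D L|| / ||M||."""
    return np.linalg.norm(M - L.T @ np.diag(d) @ L) / np.linalg.norm(M)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
M = A @ A.T + 8.0 * np.eye(8)            # arbitrary SPD test matrix
L, d = ltdl_scalar(M)
err = ltdl_rel_error(M, L, d)
```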

Design / safety

Gated off by default behind RigidOptions.register_tiled_mass (None = auto → the GS_REGISTER_TILED_MASS env var, default off; True/False overrides). No-op until opted in.
CUDA forward only. CPU, backward/AD, and the under-cap shared-memory tiled path are untouched (the AD factor keeps its dedicated safe-access branch). The dispatch flag is only set in the gs.backend != gs.cpu block, so the path is fully inert on CPU.
The only edit to existing control flow is one if → elif in func_factor_mass; everything else is additive.
Scratch buffer / memory: #2860 ([MISC] Move register-tile Cholesky into quadrants) moved the register-tile Cholesky into quadrants, whose tile slice ops are batch-first (arr[batch, rows, cols]), while mass_mat_L is canonical batch-last (n_dofs, n_dofs, _B). So the factorization can't run in place; it uses a dedicated batch-first scratch mass_mat_tiled of shape (n_entities·_B, P, P) (P = tiled_n_dofs_per_entity), allocated only when the path is enabled (e.g. ~335 MB at 1024 envs / 2×SMPL-X). Open to alternatives, e.g. transiently reusing the Hessian's nt_H (also batch-first, free at factor time) for zero extra memory, if maintainers prefer.
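The layout mismatch in the scratch-buffer item can be pictured with a small NumPy sketch. Everything here is illustrative: `gather_entity_blocks`, the entity-major ordering of the leading axis, rounding P up to a tile multiple, and the identity padding are assumptions, not details confirmed by the PR.

```python
import numpy as np

def round_up(n, tile):
    """Smallest multiple of tile that is >= n."""
    return ((n + tile - 1) // tile) * tile

def gather_entity_blocks(mass_mat, dof_ranges, tile):
    """Copy per-entity blocks from batch-last (n_dofs, n_dofs, B) into a
    batch-first scratch of shape (n_entities * B, P, P)."""
    B = mass_mat.shape[2]
    P = max(round_up(end - start, tile) for start, end in dof_ranges)
    scratch = np.zeros((len(dof_ranges) * B, P, P), dtype=mass_mat.dtype)
    idx = np.arange(P)
    scratch[:, idx, idx] = 1.0  # identity padding keeps padded blocks SPD (assumption)
    for e, (start, end) in enumerate(dof_ranges):
        n = end - start
        block = mass_mat[start:end, start:end, :]            # (n, n, B)
        scratch[e * B:(e + 1) * B, :n, :n] = np.moveaxis(block, -1, 0)  # (B, n, n)
    return scratch

B, tile = 4, 16
dof_ranges = [(0, 20), (20, 29)]   # two hypothetical entities, 20 and 9 DOFs
rng = np.random.default_rng(0)
mass_mat = np.zeros((29, 29, B), dtype=np.float32)
for b in range(B):
    for start, end in dof_ranges:
        n = end - start
        A = rng.standard_normal((n, n)).astype(np.float32)
        mass_mat[start:end, start:end, b] = A @ A.T + n * np.eye(n, dtype=np.float32)

scratch = gather_entity_blocks(mass_mat, dof_ranges, tile)
factors = np.linalg.cholesky(scratch)   # batched over the leading axis
scratch_bytes = scratch.nbytes          # n_entities * B * P * P * itemsize
```

Under these assumptions the scratch cost grows as n_entities · _B · P² · itemsize, which is why the allocation is made only when the path is enabled.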

Files (4, all additive)

engine/solvers/rigid/abd/forward_dynamics.py — the tiled factor _factor_mass_tiled_impl + dispatcher (qd.simt.Tile16x16/32x32); new if branch in func_factor_mass.
utils/array_class.py — enable_register_tiled_mass + cholesky_tile_size_mass static config, and the mass_mat_tiled scratch allocation.
options/solvers.py — user-facing RigidOptions.register_tiled_mass.
engine/solvers/rigid/rigid_solver.py — tile-size selection + gating.
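The gating described under Design / safety (None defers to the env var, True/False overrides, never active on CPU) could be resolved along these lines. This is a sketch, not the code in rigid_solver.py: the function name, its signature, and the accepted truthy spellings of the env var are assumptions. Tile-size selection is not attempted here.

```python
import os

def resolve_register_tiled_mass(option, backend_is_cpu, environ=None):
    """Resolve the tri-state option: explicit True/False wins, None defers to the env var."""
    if backend_is_cpu:
        return False  # the path is described as fully inert on CPU
    if option is not None:
        return bool(option)
    env = os.environ if environ is None else environ
    # Which spellings count as "on" is an assumption, not taken from the PR.
    value = env.get("GS_REGISTER_TILED_MASS", "0")
    return value.strip().lower() in ("1", "true", "on")

# Default: off unless the user opts in.
default_off = resolve_register_tiled_mass(None, backend_is_cpu=False, environ={})
# Opt in through the environment variable.
env_on = resolve_register_tiled_mass(
    None, backend_is_cpu=False, environ={"GS_REGISTER_TILED_MASS": "1"}
)
# An explicit False overrides the environment variable.
forced_off = resolve_register_tiled_mass(
    False, backend_is_cpu=False, environ={"GS_REGISTER_TILED_MASS": "1"}
)
```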

A separate small follow-up documents that RigidSolver.get_mass_mat(decompose=True) returns the LTDL factor (#2916).

🤖 Generated with Claude Code

Kashu7100 · 2026-06-09T09:04:02Z

The three red checks on the previous push were pre-existing CI flakes, not from this change:

windows-2025-3.12-cpu-ndarray — 10 errors, all huggingface_hub 429 Too Many Requests / LocalEntryNotFound while fetching test assets from the Hub (network rate-limit).
production-unit_tests-ndarray (exit 3) — 747 passed; the non-zero exit is a pytest-timeout (>1200 s) on test_sensor_camera.py::test_batch_renderer[2-cuda] crashing an xdist worker.
ubuntu-24.04-3.12-cpu-field (exit 3) — 122 passed; same pattern, a >1200 s timeout on test_render.py::test_render_api_advanced[0-RASTERIZER] (which also hit the HF 429 above).

tests/test_rigid_physics.py::test_mass_mat passed on every job. No CPU code path is affected (the feature is gated behind gs.backend != gs.cpu and off by default). I've re-pushed to re-trigger CI.

Kashu7100 force-pushed the feat/register-tiled-mass branch 2 times, most recently from 149cc54 to 00b7052 June 9, 2026 09:03

Kashu7100 force-pushed the feat/register-tiled-mass branch 2 times, most recently from 00c758b to 6a28583 June 9, 2026 09:51

Kashu7100 marked this pull request as ready for review June 9, 2026 09:56

Kashu7100 requested review from YilingQiao and duburcqa as code owners June 9, 2026 09:56

Kashu7100 force-pushed the feat/register-tiled-mass branch from 6a28583 to 82f389c June 11, 2026 21:10

duburcqa changed the title ~~[PERF] Register-tiled per-entity mass-matrix factorization (follow-up to #2869)~~ [MISC] Register-tiled per-entity mass-matrix factorization. Jun 22, 2026

duburcqa changed the title ~~[MISC] Register-tiled per-entity mass-matrix factorization.~~ [MISC] Speedup rigid body mass matrix cholesky decomposition for large dof count. Jun 22, 2026

duburcqa force-pushed the feat/register-tiled-mass branch from 82f389c to a79f89f June 22, 2026 15:22

[PERF] Register-tiled per-entity mass-matrix factorization for the >shared-cap path (224cd3a)

duburcqa force-pushed the feat/register-tiled-mass branch from a79f89f to 224cd3a June 22, 2026 15:28

duburcqa approved these changes Jun 22, 2026

duburcqa merged commit 4a5be7f into Genesis-Embodied-AI:main Jun 22, 2026
22 of 23 checks passed

duburcqa merged 1 commit into Genesis-Embodied-AI:main from Kashu7100:feat/register-tiled-mass
