[MISC] Speedup rigid body mass matrix cholesky decomposition for large dof count. #2915
Merged
duburcqa merged 1 commit on Jun 22, 2026
The three red checks on the previous push were pre-existing CI flakes, not from this change:
duburcqa approved these changes on Jun 22, 2026
Summary
Follow-up to #2869. That PR made the constraint-Hessian Cholesky register-tiled and fast. This PR does the same for the other dense factorization in the rigid step: the per-entity mass-matrix factor (`func_factor_mass`).

For high-DOF articulated bodies (SMPL/SMPL-X, dexterous hands), once an entity's mass submatrix exceeds GPU shared memory the factor falls back to the cooperative-global LDLᵀ path, which becomes the dominant cost. Profiling two SMPL-X humanoids (159 DOFs each, 318 total) at 1024 envs on an RTX 3080: the mass factor is ~89% of the whole substep.
This adds a register-streaming tiled factor for that `>shared-cap` path, reusing the same `qd.simt.TileNxN` Cholesky primitive as the constraint Hessian. The mass matrix is block-diagonal across entities, so each entity's `n_dofs×n_dofs` block factors independently in one warp of `T` lanes.

Results: measured on this branch @ `main` (quadrants 1.0.2, RTX 3080, 2×SMPL-X = 318 DOFs)

(Table columns: `kernel_step_1` @ 1024 envs, `kernel_step_1` @ 512 envs.)

After the fix the constraint solve (`rest`, ~58 ms, unchanged) dominates the substep; the mass factor is no longer the bottleneck. Also confirmed inert / no regression on a low-DOF robot (G1, 29 DOFs ≤ cap → unchanged shared-memory path) and on CPU.

Correctness (verified on `main`)

`func_solve_mass_entity` consumes the LTDL form `M = Lᵀ D L` (L unit-lower, reverse-pivot, same as the stock scalar/cooperative/shared factors), not the standard `L D Lᵀ`. The tile primitive does forward Cholesky, so we factor the reverse-indexed block `M_rev[a,b] = M[n-1-a, n-1-b]` and map its factor back to M's LTDL factor.

Verified on a `main` env (`quadrants==1.0.2`): the factor reconstructs M in the solve's convention to `‖M − Lᵀ D L‖/‖M‖ ≈ 2.6e-7` on both the 159-DOF and 29-DOF blocks (matches the stock factor's 4e-7), and trajectories match the cooperative-global baseline (rel 2.5e-7).

Design / safety

- Opt-in via `RigidOptions.register_tiled_mass` (`None` = auto → the `GS_REGISTER_TILED_MASS` env var, default off; `True`/`False` overrides). No-op until opted in.
- Gated inside the `gs.backend != gs.cpu` block, so the path is fully inert on CPU.
- The only change to existing code is an `if` → `elif` in `func_factor_mass`; everything else is additive.
- The tile primitive works on batch-first arrays (`arr[batch, rows, cols]`), while `mass_mat_L` is canonical batch-last `(n_dofs, n_dofs, _B)`. So the factorization can't run in place; it uses a dedicated batch-first scratch `mass_mat_tiled` of shape `(n_entities·_B, P, P)` (P = `tiled_n_dofs_per_entity`), allocated only when the path is enabled (e.g. ~335 MB at 1024 envs / 2×SMPL-X). Open to alternatives, e.g. transiently reusing the Hessian's `nt_H` (also batch-first, free at factor time) for zero extra memory, if maintainers prefer.

Files (4, all additive)

- `engine/solvers/rigid/abd/forward_dynamics.py`: the tiled factor `_factor_mass_tiled_impl` + dispatcher (`qd.simt.Tile16x16`/`32x32`); new `if` branch in `func_factor_mass`.
- `utils/array_class.py`: `enable_register_tiled_mass` + `cholesky_tile_size_mass` static config, and the `mass_mat_tiled` scratch allocation.
- `options/solvers.py`: user-facing `RigidOptions.register_tiled_mass`.
- `engine/solvers/rigid/rigid_solver.py`: tile-size selection + gating.

A separate small follow-up documents that `RigidSolver.get_mass_mat(decompose=True)` returns the LTDL factor (#2916).

🤖 Generated with Claude Code
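
For reviewers who want to check the reverse-indexing argument from the Correctness section by hand, here is a minimal NumPy sketch of the math only. It is not the tiled GPU kernel: it runs in float64 on a random SPD matrix standing in for one entity's mass block, and the helper name `ltdl_via_reversed_cholesky` is hypothetical. The idea is that a forward Cholesky of `M_rev` gives a lower factor `C`; reversing both axes of `C` gives an upper-triangular `U` with `M = U Uᵀ`, and pulling the diagonal out of `U` yields the unit-lower `L` and diagonal `D` with `M = Lᵀ D L`.

```python
import numpy as np


def ltdl_via_reversed_cholesky(M):
    """Return (L, D) with M = L.T @ diag(D) @ L and L unit lower-triangular.

    Hypothetical helper for illustration; not part of the PR.
    """
    # Reverse-indexed block: M_rev[a, b] = M[n-1-a, n-1-b].
    M_rev = M[::-1, ::-1]
    # Ordinary forward Cholesky: M_rev = C @ C.T with C lower-triangular.
    C = np.linalg.cholesky(M_rev)
    # Reversing both axes again gives an upper-triangular U with M = U @ U.T.
    U = C[::-1, ::-1]
    s = np.diag(U)        # strictly positive diagonal of U
    L = (U / s).T         # divide column j by s[j], then transpose -> unit lower
    D = s ** 2
    return L, D


# Random SPD stand-in for a 29-DOF mass block (assumption: any SPD matrix works).
rng = np.random.default_rng(0)
A = rng.standard_normal((29, 29))
M = A @ A.T + 29 * np.eye(29)

L, D = ltdl_via_reversed_cholesky(M)
rel_err = np.linalg.norm(M - L.T @ np.diag(D) @ L) / np.linalg.norm(M)
print(rel_err)
```

The printed value is the same relative residual `‖M − Lᵀ D L‖/‖M‖` quoted above, here in double precision, so it should come out far smaller than the float32 figures reported for the kernel.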