Not Muon
Use single-sided whitening that is dynamic and learned, instead of instantaneous like Muon's. This means we don't have to recompute it every iteration. Think of the savings.
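To make the contrast concrete, here is a minimal NumPy sketch. It is not Nuon's actual update rule. `instantaneous_whiten` is an exact SVD version of the orthogonalization that Muon approximates every step. `LearnedLeftWhitener` is a hypothetical stand-in that swaps in a running average of G G^T plus an eigendecomposition, only to show how a persistent one-sided whitener can be refreshed every few steps and reused in between. All names and defaults (`beta`, `update_every`, `eps`) are assumptions for illustration.

```python
import numpy as np


def instantaneous_whiten(G):
    """Whiten G using only G itself: returns the orthogonal factor U @ Vt."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


class LearnedLeftWhitener:
    """Hypothetical stand-in (not Nuon): caches P ~ (E[G G^T])^(-1/2)."""

    def __init__(self, rows, beta=0.9, update_every=10, eps=1e-8):
        self.C = np.eye(rows)   # running estimate of E[G G^T]
        self.P = np.eye(rows)   # cached left whitener, persists across steps
        self.beta = beta
        self.update_every = update_every
        self.eps = eps
        self.step = 0
        self.refreshes = 0      # counts how often the expensive part ran

    def __call__(self, G):
        # The expensive refresh runs only every `update_every` steps.
        if self.step % self.update_every == 0:
            self.C = self.beta * self.C + (1 - self.beta) * (G @ G.T)
            evals, evecs = np.linalg.eigh(self.C)
            inv_sqrt = 1.0 / np.sqrt(np.maximum(evals, self.eps))
            self.P = (evecs * inv_sqrt) @ evecs.T
            self.refreshes += 1
        self.step += 1
        # Every step: a single matmul on the left side only.
        return self.P @ G
```

With `update_every=10`, nine out of ten steps cost one matrix multiply, whereas the instantaneous version pays for a decomposition (or several Newton-Schulz iterations) on every step.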
SIREN example: Nuon beats tuned Muon
With the hyper-params below, Muon (Keller) reaches a loss of 0.000982; PSGD Nuon reaches a loss of 0.000898.
# Assuming Muon is defined elsewhere
optimizer = Muon(
    muon_params,
    lr=0.005,
    momentum=0.9,
    adamw_params=adamw_params,
    adamw_lr=3e-4,
    adamw_betas=(0.90, 0.95),
    adamw_wd=0,
)
