fix: support Apple Silicon Metal API and resolve numerical underflow #9
zhang-zidong wants to merge 1 commit into
zhang-zidong commented on Mar 4, 2026
- Implement smart OS detection for Taichi backend (metal+f32 on Mac, gpu+f64 on Linux).
- Refactor kernel probability calculations to log-space (Log-Sum-Exp) to prevent f32 underflow.
- Add strict numpy dtype casting to prevent f64 leakage into Metal kernels.
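To illustrate the second bullet, here is a minimal sketch of why moving to log-space helps at f32 precision. It is not the PR's kernel code: the helper names (`to_f32`, `naive_log_sum`, `log_sum_exp`) and the sample log-probabilities are assumptions made for this example, and f32 storage is only approximated by rounding through `struct` in plain Python rather than running a Taichi kernel.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value (approximates f32 storage)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_log_sum(log_terms):
    """Exponentiate each term, round it to f32, then sum and take the log."""
    total = sum(to_f32(math.exp(t)) for t in log_terms)
    return math.log(total) if total > 0.0 else float("-inf")


def log_sum_exp(log_terms):
    """Stable form: subtract the largest term first, so the biggest exponent is exp(0) = 1."""
    m = max(log_terms)
    if m == float("-inf"):
        return m  # every term is log(0); the sum is also log(0)
    total = sum(to_f32(math.exp(t - m)) for t in log_terms)
    return m + math.log(total)


# Hypothetical log-probabilities; exp(-120) is far below the smallest positive f32 value.
log_probs = [-120.0, -121.0, -123.0]
print(naive_log_sum(log_probs))  # every term rounds to 0 in f32, so the result is -inf
print(log_sum_exp(log_probs))    # finite, close to -119.65
```

The two functions compute the same quantity, log(sum(exp(x_i))), in exact arithmetic. They differ only in whether the intermediate values stay representable in f32.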
chengl7 left a comment
Dear Zidong,
Thank you very much for taking the time to submit this PR and for working on improving SCAPE. We really appreciate the effort to add Apple Silicon / Metal support and improve the numerical stability of the kernels.
Unfortunately, our team is currently understaffed and we do not have the capacity to properly test and validate these changes across the different environments that SCAPE supports. Because this code touches core Taichi kernels and backend initialization, we want to be careful before merging changes that may affect existing workflows.
In particular, some parts of the PR change how the Taichi backend and floating-point precision are selected (e.g., switching between Metal and GPU backends and modifying default dtypes). Without testing, there is a risk that these changes could affect behavior on other platforms such as Linux/CUDA systems, CPU-only environments, or existing pipelines that rely on the current float64 behavior.
For that reason, we’re not able to merge the PR right now. If you (or others in the community) are able to run and validate the changes on different systems (e.g., CUDA/Linux, CPU-only, Apple Silicon/Metal) and confirm that the behavior remains correct, please feel free to report the results here — that would greatly help us move this forward.
Thanks again for the contribution and for supporting the project!
Lu
Hi Lu,

Thank you for the detailed and transparent feedback! I completely understand your concerns: touching the core Taichi configuration and precision settings does carry risks, and it is entirely reasonable to hold off on merging without full validation.

To be transparent about my current testing status: I have verified that the code runs on Apple Silicon (Metal, f32) without crashing. However, I have not yet performed a rigorous numerical comparison between the new Mac (f32) output and the original CPU/Linux (f64) output to confirm whether the results are statistically consistent and close enough.

To provide some reassurance on the code structure, the backend and precision changes are strictly encapsulated within a `platform.system() == "Darwin"` condition. On Linux or CPU-only setups, the script automatically defaults back to `ti.gpu` and `ti.f64`. This ensures the same precision and execution logic as the original code for non-Mac users. The Log-Sum-Exp adjustments are mathematically equivalent and serve only to prevent floating-point underflow.

To help move this forward, I will run a comparative benchmark on my end (comparing the Metal/f32 results against the original CPU/f64 results on the toy example) to check for numerical consistency, and I will post the findings here. In the meantime, if anyone in the community has a Linux/CUDA setup and could help run a quick test using this branch, it would be greatly appreciated!

Thank you again for your time and for maintaining this great project.

Best regards,
Zidong
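For readers following the thread, the platform guard described above might look roughly like the sketch below. This is an assumption-laden illustration, not the code in the PR: `select_backend` and `init_taichi` are hypothetical names, and the only details taken from the discussion are the `platform.system() == "Darwin"` check, Metal with f32 on Mac, and `ti.gpu` with `ti.f64` otherwise.

```python
import platform


def select_backend(system_name=None):
    """Return (arch_name, fp_name) for the given OS name.

    Hypothetical helper: the PR's actual function and variable names are not shown here.
    """
    if system_name is None:
        system_name = platform.system()
    if system_name == "Darwin":
        # macOS: Metal backend with 32-bit floats, as described in the PR.
        return "metal", "f32"
    # Every other platform keeps the original gpu + f64 configuration.
    return "gpu", "f64"


def init_taichi():
    """Initialise Taichi with the selected backend; requires `taichi` to be installed."""
    import taichi as ti  # imported lazily so select_backend() works without Taichi

    arch_name, fp_name = select_backend()
    ti.init(arch=getattr(ti, arch_name), default_fp=getattr(ti, fp_name))
    return arch_name, fp_name


print(select_backend("Darwin"))
print(select_backend("Linux"))
```

Keeping the decision in a pure function makes the non-Mac path easy to check on any machine. Note that `platform.system()` also returns `"Darwin"` on Intel Macs, so a guard of this form does not by itself distinguish Apple Silicon.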