Summary
EXO 1.0.71 (macOS app, 2-node M3 Ultra cluster) has DownloadPending entries in /state.downloads for two retired models whose node IDs no longer exist in the topology. The coordinator's _delete_download loops indefinitely — surviving event-log clears, model-card removal, and clean simultaneous restarts of both nodes. This causes unbounded event-log growth and sustained CPU waste on the peer node.
Environment
- EXO version: 1.0.71 (macOS app, build 1000071999, Apr 23 2026)
- Nodes: 2× M3 Ultra Mac Studio, connected via Thunderbolt 5 (RDMA)
- EXO_HOME: /Volumes/logic/00_AI_Models_Cluster/exo (Node A), /Volumes/omni/... (Node B)
- Python 3.12, MLX backend
Symptoms
INFO exo.download.coordinator:_delete_download:321 Deleting model files for mlx-community/Kimi-K2.5
WARN exo.download.coordinator:_delete_download:327 Model mlx-community/Kimi-K2.5 was not found on disk
INFO exo.download.coordinator:_delete_download:321 Deleting model files for mlx-community/Kimi-K2.5
...
Loop fires ~3–19 times/second (rate varies), continuously appending events. lastEventAppliedIdx climbs ~190 per 20 seconds with no real work happening. Peer node replays the events at >100% CPU.
Root cause (confirmed via investigation)
/state.downloads shows DownloadPending entries keyed to dead node IDs — instances that were retired months ago and no longer appear in /state.topology.nodes:
{
"nodeId": "12D3KooWS2vLVdZM46hHFdV5mjffvjNKLri3d8bvNggmFmtpnmZU", // not in topology
"shardMetadata": { "PipelineShardMetadata": { "modelCard": { "modelId": "mlx-community/Kimi-K2.5", ... }}},
"modelDirectory": "/Volumes/logic/.../models/mlx-community--Kimi-K2.5",
"downloaded": { "inBytes": 0 },
"total": { "inBytes": 0 }
}
The coordinator correctly tries to clean up stale download entries, but:
1. modelDirectory doesn't exist (models were retired and removed)
2. rmtree fails with "not found on disk"
3. Coordinator then fetches a file list from HuggingFace to reconcile → recreates caches/ entry
4. Re-derives the DownloadPending from the cache
5. Loop repeats from step 1
The node IDs in these entries (12D3KooWS2vLVdZM..., 12D3KooWLtdo3u21...) belong to old EXO instances that have since been restarted and received new peer IDs. There is no garbage-collection mechanism to remove DownloadPending entries for node IDs that are no longer in the mesh.
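For anyone trying to confirm the same condition on their own cluster, the orphan check can be sketched as below. This is only an illustration: `find_orphaned_downloads` is a hypothetical helper, and it assumes the download entries have already been flattened into a list of dicts carrying a `nodeId` field (as in the entry quoted above) and that the live node IDs have been extracted from /state.topology.nodes. The actual nesting of the state JSON may differ.

```python
from typing import Any, Iterable


def find_orphaned_downloads(
    entries: Iterable[dict[str, Any]],
    live_node_ids: Iterable[str],
) -> list[dict[str, Any]]:
    """Return the download entries whose nodeId is not in the live topology."""
    live = set(live_node_ids)
    return [entry for entry in entries if entry.get("nodeId") not in live]


# Hypothetical sample data, loosely shaped like the entry quoted above.
entries = [
    {"nodeId": "node-dead", "modelDirectory": "/tmp/models/a"},
    {"nodeId": "node-live", "modelDirectory": "/tmp/models/b"},
]
orphans = find_orphaned_downloads(entries, live_node_ids=["node-live"])
print([entry["nodeId"] for entry in orphans])
```

Any entry this returns is a candidate for the delete loop described above, since no live node can ever complete or cancel it.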
What does NOT fix it
We exhaustively ruled out every filesystem approach:
| Attempt | Result |
|---|---|
| Single-node restart (events.bin quarantined) | Loop re-derived from peer within seconds |
| Both-node simultaneous restart (both events.bin cleared) | Loop returned within 15s of both nodes coming up |
| Remove model card TOML | No effect (one model had no card and still looped) |
| Clear entire $EXO_HOME/event_log/{master,api} on both nodes while both down | Loop returned on next start |
| Create empty placeholder dirs at modelDirectory so rmtree succeeds | EXO deletes them, immediately refetches file list from HF, re-creates DownloadPending |
| Clear ~/Library/Caches/exolabs.EXO/fsCachedData | No effect (UI cache, not coordinator) |
The state appears to be reconstructed from EXO's internal coordinator logic on startup, not from any clearable on-disk file.
Impact
- Constant event-log growth (observed 160 MB single events file on one node)
- Peer node sustained at >100% CPU replaying events
- Same mechanism caused a production crash: when the caches/ directory for a different model existed but was non-empty, rmtree threw OSError: [Errno 66] Directory not empty, crashing the EXO process. This left the peer node unreachable for 6 days (no auto-revive mechanism).
- The models involved have a combined declared size of ~1.1 TB (662 GB + 470 GB). If the HF file-list fetch ever succeeds, EXO would attempt a download of that size — causing OOM and evicting all loaded models.
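Regarding the crash in particular, a deletion path that tolerates both failure modes seen here (directory already missing, directory not removable) might look like the sketch below. `delete_model_dir` is a hypothetical name, not EXO's function; the only library call used is the standard `shutil.rmtree`. The point is that a cleanup failure gets reported to the caller instead of propagating as an unhandled exception.

```python
import shutil
from pathlib import Path


def delete_model_dir(path: Path) -> bool:
    """Delete a directory tree; return True if it is gone afterwards."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # Already gone: treat as success so the caller can drop the entry.
        return True
    except OSError as exc:
        # e.g. "Directory not empty" or a permissions error: log and carry on.
        print(f"could not delete {path}: {exc}")
        return False
    return True
```

`FileNotFoundError` is a subclass of `OSError`, so it has to be caught first. Returning a boolean lets the caller decide whether to retry later or mark the entry as abandoned, rather than re-entering the reconcile path.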
Expected behavior
DownloadPending entries for node IDs that are no longer present in the mesh topology should be garbage-collected. If a node ID hasn't been seen in the topology for N minutes/hours, its pending download entries should be cleaned up or marked as abandoned.
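A minimal sketch of what such a garbage collector could look like, assuming the coordinator can observe the current topology and hold pending entries as dicts with a `nodeId` field. `PendingDownloadGC`, `observe_topology`, `sweep`, and `grace_seconds` are all hypothetical names for illustration, not existing EXO APIs.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


@dataclass
class PendingDownloadGC:
    grace_seconds: float
    # node ID -> monotonic timestamp of the last time it appeared in the topology
    last_seen: dict[str, float] = field(default_factory=dict)

    def observe_topology(self, node_ids: Iterable[str], now: Optional[float] = None) -> None:
        """Record that these node IDs are currently present in the mesh."""
        now = time.monotonic() if now is None else now
        for node_id in node_ids:
            self.last_seen[node_id] = now

    def sweep(
        self, pending: Iterable[dict[str, Any]], now: Optional[float] = None
    ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
        """Split pending entries into (kept, abandoned)."""
        now = time.monotonic() if now is None else now
        kept, abandoned = [], []
        for entry in pending:
            # A node never observed starts its clock at the first sweep, so
            # entries are not dropped during startup before peers reconnect.
            seen = self.last_seen.setdefault(entry["nodeId"], now)
            if now - seen > self.grace_seconds:
                abandoned.append(entry)
            else:
                kept.append(entry)
        return kept, abandoned


gc = PendingDownloadGC(grace_seconds=600)
pending = [{"nodeId": "live"}, {"nodeId": "dead"}]
gc.observe_topology(["live"], now=0)
kept, abandoned = gc.sweep(pending, now=10)    # both still within the grace period
gc.observe_topology(["live"], now=700)
kept, abandoned = gc.sweep(pending, now=700)   # "dead" has exceeded the grace period
```

Abandoned entries would be removed from state without attempting a file-list fetch or a delete, which is what breaks the loop described above. The grace period exists so that a node that is briefly unreachable does not lose its pending downloads.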
Workaround
None available without an EXO version change. The failing HF file-list fetch is currently acting as an accidental guard against the 1.1 TB download — but this is fragile and depends on network conditions.
Related: #1973 (RDMA regression), #1975 (JACCL placement), #1568 (similar delete loop, different trigger)