
[BUG] DownloadPending entries for retired node IDs never GC'd — infinite _delete_download loop survives all restarts #2136

@kkakkung2000

Summary

EXO 1.0.71 (macOS app, 2-node M3 Ultra cluster) has DownloadPending entries in /state.downloads for two retired models whose node IDs no longer exist in the topology. The coordinator's _delete_download loops indefinitely on these entries and survives event-log clears, model-card removal, and clean simultaneous restarts of both nodes. The result is unbounded event-log growth and sustained CPU waste on the peer node.

Environment

  • EXO version: 1.0.71 (macOS app, build 1000071999, Apr 23 2026)
  • Nodes: 2× M3 Ultra Mac Studio, connected via Thunderbolt 5 (RDMA)
  • EXO_HOME: /Volumes/logic/00_AI_Models_Cluster/exo (Node A), /Volumes/omni/... (Node B)
  • Python 3.12, MLX backend

Symptoms

INFO  exo.download.coordinator:_delete_download:321  Deleting model files for mlx-community/Kimi-K2.5
WARN  exo.download.coordinator:_delete_download:327  Model mlx-community/Kimi-K2.5 was not found on disk
INFO  exo.download.coordinator:_delete_download:321  Deleting model files for mlx-community/Kimi-K2.5
...

The loop fires roughly 3–19 times per second (the rate varies) and continuously appends events. lastEventAppliedIdx climbs by about 190 every 20 seconds with no real work happening, and the peer node replays those events at >100% CPU.
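
For anyone trying to confirm the same symptom, the growth rate can be estimated from two readings of lastEventAppliedIdx. This is only a small arithmetic sketch; the function name is made up for illustration, and the sample numbers are the ones observed above.

```python
def events_per_second(idx_start: int, idx_end: int, elapsed_s: float) -> float:
    """Average number of events applied per second between two readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (idx_end - idx_start) / elapsed_s


# Observed: lastEventAppliedIdx climbs ~190 in 20 seconds.
rate = events_per_second(idx_start=0, idx_end=190, elapsed_s=20.0)
per_day = rate * 86_400  # seconds in a day

print(f"{rate:.1f} events/s")    # 9.5 events/s
print(f"{per_day:.0f} events/day")  # 820800 events/day
```

At that pace an idle cluster would append on the order of 800,000 events per day, which is in line with the event-log growth described under Impact.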

Root cause (confirmed via investigation)

/state.downloads shows DownloadPending entries keyed to dead node IDs (instances that were retired months ago and no longer appear in /state.topology.nodes):

{
  "nodeId": "12D3KooWS2vLVdZM46hHFdV5mjffvjNKLri3d8bvNggmFmtpnmZU",  // not in topology
  "shardMetadata": { "PipelineShardMetadata": { "modelCard": { "modelId": "mlx-community/Kimi-K2.5", ... }}},
  "modelDirectory": "/Volumes/logic/.../models/mlx-community--Kimi-K2.5",
  "downloaded": { "inBytes": 0 },
  "total": { "inBytes": 0 }
}

The coordinator correctly tries to clean up stale download entries, but:

  1. modelDirectory doesn't exist (models were retired and removed)
  2. rmtree fails with "not found on disk"
  3. Coordinator then fetches a file list from HuggingFace to reconcile → recreates caches/ entry
  4. Re-derives the DownloadPending from the cache
  5. Loop repeats from step 1

The node IDs in these entries (12D3KooWS2vLVdZM..., 12D3KooWLtdo3u21...) belong to old EXO instances that have since been restarted and received new peer IDs. There is no garbage-collection mechanism to remove DownloadPending entries for node IDs that are no longer in the mesh.
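
The check we did by hand (comparing node IDs in the download entries against the topology) can be sketched as below. This assumes the download entries are available as a flat list of dicts with a "nodeId" field, as in the JSON above, and that the live node IDs have already been collected into a set. The function name and the sample data are hypothetical, and the real shape of the state may differ.

```python
from typing import Any, Iterable


def find_orphaned_downloads(
    downloads: Iterable[dict[str, Any]],
    live_node_ids: set[str],
) -> list[dict[str, Any]]:
    """Return download entries whose nodeId is not in the current topology."""
    return [entry for entry in downloads if entry.get("nodeId") not in live_node_ids]


# Placeholder data; real entries carry shardMetadata, modelDirectory, etc.
downloads = [
    {"nodeId": "node-retired", "modelId": "example/model-a"},
    {"nodeId": "node-live", "modelId": "example/model-b"},
]
orphans = find_orphaned_downloads(downloads, live_node_ids={"node-live"})
print([entry["nodeId"] for entry in orphans])  # ['node-retired']
```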

What does NOT fix it

We ruled out every filesystem-level approach we could think of:

| Attempt | Result |
| --- | --- |
| Single-node restart (events.bin quarantined) | Loop re-derived from peer within seconds |
| Both-node simultaneous restart (both events.bin cleared) | Loop returned within 15s of both nodes coming up |
| Remove model card TOML | No effect (one model had no card and still looped) |
| Clear entire $EXO_HOME/event_log/{master,api} on both nodes while both down | Loop returned on next start |
| Create empty placeholder dirs at modelDirectory so rmtree succeeds | EXO deletes them, immediately refetches file list from HF, re-creates DownloadPending |
| Clear ~/Library/Caches/exolabs.EXO/fsCachedData | No effect (UI cache, not coordinator) |

The state appears to be reconstructed from EXO's internal coordinator logic on startup, not from any clearable on-disk file.

Impact

  • Constant event-log growth (observed 160 MB single events file on one node)
  • Peer node sustained at >100% CPU replaying events
  • The same mechanism caused a production crash: when the caches/ directory for a different model existed and was non-empty, rmtree raised OSError: [Errno 66] Directory not empty, crashing the EXO process. This left the peer node unreachable for 6 days (no auto-revive mechanism).
  • The models involved have a combined declared size of ~1.1 TB (662 GB + 470 GB). If the HF file-list fetch ever succeeds, EXO would attempt a download of that size — causing OOM and evicting all loaded models.
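
Independently of the GC problem, the crash suggests the delete path should not let filesystem errors take down the process. A minimal sketch of a more tolerant delete follows. It is not EXO's actual code; remove_model_dir and its string return values are hypothetical. It treats a missing directory as "already done", retries a few times on ENOTEMPTY (which can occur if something writes into the directory while it is being removed), and reports any other failure to the caller instead of raising.

```python
import errno
import shutil
import tempfile
import time
from pathlib import Path


def remove_model_dir(path: Path, retries: int = 3, delay_s: float = 0.1) -> str:
    """Remove a directory tree; return 'removed', 'absent', or 'failed'."""
    for attempt in range(retries + 1):
        try:
            shutil.rmtree(path)
            return "removed"
        except FileNotFoundError:
            # Nothing to delete: treat as success so callers do not retry forever.
            return "absent"
        except OSError as exc:
            # Retry only the "directory not empty" case, a bounded number of times.
            if exc.errno != errno.ENOTEMPTY or attempt == retries:
                return "failed"
            time.sleep(delay_s)
    return "failed"


# Usage on a throwaway directory.
root = Path(tempfile.mkdtemp())
target = root / "models" / "example--model"
target.mkdir(parents=True)
(target / "weights.bin").write_text("x")

first = remove_model_dir(target)   # directory exists
second = remove_model_dir(target)  # directory already gone
print(first, second)
shutil.rmtree(root)
```

The important property is that "absent" is a terminal outcome, so a missing modelDirectory can be used as a signal to drop the entry rather than to re-reconcile it.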

Expected behavior

DownloadPending entries for node IDs that are no longer present in the mesh topology should be garbage-collected. If a node ID hasn't been seen in the topology for N minutes/hours, its pending download entries should be cleaned up or marked as abandoned.
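
A rough sketch of what such a TTL-based GC could look like is below. Every name here (PendingDownloadGC, observe_topology, split) is hypothetical and none of it is taken from EXO's codebase. It assumes entries are dicts with a "nodeId" field and that the caller supplies timestamps. A node ID that has never been seen gets a grace period starting from the first time the GC notices it, so a freshly restarted coordinator does not drop entries before the topology has settled.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PendingDownloadGC:
    ttl_s: float                                   # how long a node may be absent
    last_seen: dict[str, float] = field(default_factory=dict)

    def observe_topology(self, live_node_ids: set[str], now: float) -> None:
        """Record that these node IDs are present in the topology at `now`."""
        for node_id in live_node_ids:
            self.last_seen[node_id] = now

    def split(
        self, downloads: list[dict[str, Any]], now: float
    ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
        """Partition entries into (kept, abandoned) based on owner liveness."""
        kept: list[dict[str, Any]] = []
        abandoned: list[dict[str, Any]] = []
        for entry in downloads:
            # Unknown owners start their grace period now.
            seen = self.last_seen.setdefault(entry["nodeId"], now)
            (kept if now - seen <= self.ttl_s else abandoned).append(entry)
        return kept, abandoned


gc = PendingDownloadGC(ttl_s=600.0)
downloads = [{"nodeId": "node-live"}, {"nodeId": "node-retired"}]

gc.observe_topology({"node-live"}, now=0.0)
kept_early, abandoned_early = gc.split(downloads, now=10.0)

gc.observe_topology({"node-live"}, now=700.0)
kept_late, abandoned_late = gc.split(downloads, now=700.0)

print(len(kept_early), len(abandoned_early))  # 2 0
print(len(kept_late), len(abandoned_late))    # 1 1
```

Abandoned entries could then be removed from state without touching disk or HuggingFace, which would avoid the reconcile step that currently re-creates them.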

Workaround

None available without an EXO version change. The failing HF file-list fetch is currently acting as an accidental guard against the 1.1 TB download — but this is fragile and depends on network conditions.

Related: #1973 (RDMA regression), #1975 (JACCL placement), #1568 (similar delete loop, different trigger)
