[BUG] DownloadPending entries for retired node IDs never GC'd — infinite _delete_download loop survives all restarts

## Summary

EXO 1.0.71 (macOS app, 2-node M3 Ultra cluster) has `DownloadPending` entries in `/state.downloads` for **two retired models** whose node IDs no longer exist in the topology. The coordinator's `_delete_download` loops indefinitely — surviving event-log clears, model-card removal, and clean simultaneous restarts of both nodes. This causes unbounded event-log growth and sustained CPU waste on the peer node.

## Environment

- EXO version: **1.0.71** (macOS app, build 1000071999, Apr 23 2026)
- Nodes: 2× M3 Ultra Mac Studio, connected via Thunderbolt 5 (RDMA)
- EXO_HOME: `/Volumes/logic/00_AI_Models_Cluster/exo` (Node A), `/Volumes/omni/...` (Node B)
- Python 3.12, MLX backend

## Symptoms

```
INFO  exo.download.coordinator:_delete_download:321  Deleting model files for mlx-community/Kimi-K2.5
WARN  exo.download.coordinator:_delete_download:327  Model mlx-community/Kimi-K2.5 was not found on disk
INFO  exo.download.coordinator:_delete_download:321  Deleting model files for mlx-community/Kimi-K2.5
...
```

The loop fires roughly 3–19 times per second (the rate varies) and appends events continuously. `lastEventAppliedIdx` climbs by ~190 every 20 seconds (about 9.5 events per second) with no real work happening. The peer node replays these events at >100% CPU.

## Root cause (confirmed via investigation)

`/state.downloads` shows `DownloadPending` entries keyed to **dead node IDs** — instances that were retired months ago and no longer appear in `/state.topology.nodes`:

```json
{
  "nodeId": "12D3KooWS2vLVdZM46hHFdV5mjffvjNKLri3d8bvNggmFmtpnmZU",  // not in topology
  "shardMetadata": { "PipelineShardMetadata": { "modelCard": { "modelId": "mlx-community/Kimi-K2.5", ... }}},
  "modelDirectory": "/Volumes/logic/.../models/mlx-community--Kimi-K2.5",
  "downloaded": { "inBytes": 0 },
  "total": { "inBytes": 0 }
}
```

The coordinator correctly tries to clean up stale download entries, but:
1. `modelDirectory` doesn't exist (models were retired and removed)
2. `rmtree` fails with "not found on disk"
3. Coordinator then fetches a file list from HuggingFace to reconcile → recreates `caches/` entry
4. Re-derives the `DownloadPending` from the cache
5. Loop repeats from step 1
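To illustrate why this cycle never converges, here is a toy model of the steps above. It is not EXO's coordinator code: the function name, the entry shape (a dict with a `nodeId` key), and the `gc_dead_nodes` switch are all hypothetical, and the iteration cap exists only so the example terminates.

```python
def run_cycle(pending, live_node_ids, gc_dead_nodes, max_iterations=5):
    """Toy model of the delete/reconcile cycle; returns the number of delete attempts.

    Hypothetical sketch: each pass "deletes" every pending entry (steps 1-2),
    then reconciliation re-creates the same entry (steps 3-4).
    """
    attempts = 0
    for _ in range(max_iterations):
        if gc_dead_nodes:
            # Proposed behaviour: drop entries whose node is not in the topology.
            pending = [e for e in pending if e["nodeId"] in live_node_ids]
        if not pending:
            break  # fixed point reached, nothing left to delete
        rederived = []
        for entry in pending:
            attempts += 1                  # delete attempt on a missing directory
            rederived.append(dict(entry))  # reconcile re-derives the same entry
        pending = rederived
    return attempts


stale = [{"nodeId": "retired-node"}]
without_gc = run_cycle(stale, {"live-node"}, gc_dead_nodes=False)
with_gc = run_cycle(stale, {"live-node"}, gc_dead_nodes=True)
```

Without the filter, the model only stops because of the iteration cap. With it, the stale entry is dropped before the first delete attempt.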

The **node IDs** in these entries (`12D3KooWS2vLVdZM...`, `12D3KooWLtdo3u21...`) belong to old EXO instances that have since been restarted and received new peer IDs. There is no garbage-collection mechanism to remove `DownloadPending` entries for node IDs that are no longer in the mesh.
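The check we did by hand can be sketched as a small helper. It assumes the download entries have already been extracted from `/state.downloads` into plain dicts carrying a `nodeId` key (as in the JSON above) and that the live node IDs have been collected from `/state.topology.nodes`. The function name and the placeholder IDs are hypothetical, and the real state layout may wrap entries differently.

```python
from typing import Any, Iterable


def find_orphaned_downloads(
    downloads: Iterable[dict[str, Any]],
    live_node_ids: Iterable[str],
) -> list[dict[str, Any]]:
    """Return download entries whose nodeId is absent from the topology."""
    live = set(live_node_ids)
    return [entry for entry in downloads if entry.get("nodeId") not in live]


# Placeholder data, not real peer IDs.
downloads = [
    {"nodeId": "node-retired", "modelId": "mlx-community/Kimi-K2.5"},
    {"nodeId": "node-live", "modelId": "example/other-model"},
]
orphans = find_orphaned_downloads(downloads, ["node-live"])
```

In our cluster, both looping models would show up in `orphans`, since neither node ID appears in the topology.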

## What does NOT fix it

We ruled out every filesystem-level approach we could think of:

| Attempt | Result |
|---|---|
| Single-node restart (events.bin quarantined) | Loop re-derived from peer within seconds |
| Both-node simultaneous restart (both events.bin cleared) | Loop returned within 15s of both nodes coming up |
| Remove model card TOML | No effect (one model had no card and still looped) |
| Clear entire `$EXO_HOME/event_log/{master,api}` on both nodes while both down | Loop returned on next start |
| Create empty placeholder dirs at `modelDirectory` so `rmtree` succeeds | EXO deletes them, immediately refetches file list from HF, re-creates DownloadPending |
| Clear `~/Library/Caches/exolabs.EXO/fsCachedData` | No effect (UI cache, not coordinator) |

The state appears to be reconstructed from EXO's internal coordinator logic on startup, not from any clearable on-disk file.

## Impact

- Constant event-log growth (observed 160 MB single events file on one node)
- Peer node sustained at >100% CPU replaying events
- **The same mechanism caused a production crash**: when the `caches/` directory for a different model existed and was non-empty, `rmtree` threw `OSError: [Errno 66] Directory not empty`, crashing the EXO process. This left the peer node unreachable for 6 days (no auto-revive mechanism).
- The models involved have a combined declared size of ~1.1 TB (662 GB + 470 GB). If the HF file-list fetch ever succeeds, EXO would attempt a download of that size — causing OOM and evicting all loaded models.
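For the crash specifically, a delete path that never lets an `OSError` escape would at least keep the process alive. The following is a minimal sketch assuming the deletion goes through `shutil.rmtree`. We have not read the coordinator source, so `delete_model_dir` is a hypothetical name and not EXO's actual function.

```python
import shutil
import tempfile
from pathlib import Path


def delete_model_dir(path: Path) -> bool:
    """Remove a model directory; return True only if something was deleted.

    Hypothetical helper: it never raises OSError, so a missing or
    undeletable directory cannot take the process down.
    """
    try:
        shutil.rmtree(path)
        return True
    except FileNotFoundError:
        # Already gone: treat as a terminal state, not as a reason to retry.
        return False
    except OSError as exc:
        print(f"could not remove {path}: {exc}")
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "mlx-community--example"
    (target / "nested").mkdir(parents=True)
    (target / "nested" / "weights.bin").write_bytes(b"x")
    removed_first = delete_model_dir(target)   # directory exists, gets removed
    removed_again = delete_model_dir(target)   # directory is already gone
```

Returning a boolean instead of raising lets the caller treat "nothing to delete" as done, rather than as a trigger for another reconcile pass.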

## Expected behavior

`DownloadPending` entries for node IDs that are no longer present in the mesh topology should be garbage-collected. If a node ID hasn't been seen in the topology for N minutes/hours, its pending download entries should be cleaned up or marked as abandoned.
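One possible shape for that garbage collection is a sweeper with a grace period. This is a sketch under our own assumptions: `StaleDownloadSweeper`, its fields, and the dict-based entry shape are hypothetical, and the clock starts when an entry is first observed without a live node, so it also works after a restart with no history.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class StaleDownloadSweeper:
    """Marks pending downloads as abandoned once their node has been absent too long."""

    grace_seconds: float
    missing_since: dict[str, float] = field(default_factory=dict)

    def sweep(
        self,
        entries: Iterable[dict[str, Any]],
        live_node_ids: Iterable[str],
        now: float,
    ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
        """Return (keep, abandoned) for the given entries at time `now`."""
        live = set(live_node_ids)
        keep, abandoned = [], []
        for entry in entries:
            node_id = entry["nodeId"]
            if node_id in live:
                self.missing_since.pop(node_id, None)  # node is back, reset the clock
                keep.append(entry)
                continue
            # First sighting of an orphaned entry starts its grace period.
            since = self.missing_since.setdefault(node_id, now)
            if now - since >= self.grace_seconds:
                abandoned.append(entry)
            else:
                keep.append(entry)
        return keep, abandoned


sweeper = StaleDownloadSweeper(grace_seconds=600)
entries = [{"nodeId": "dead"}, {"nodeId": "live"}]
keep_early, gone_early = sweeper.sweep(entries, {"live"}, now=0.0)
keep_late, gone_late = sweeper.sweep(entries, {"live"}, now=600.0)
```

The grace period avoids discarding entries for a node that is only briefly disconnected. Abandoned entries would then be dropped from state without attempting a delete or a refetch.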

## Workaround

None available without an EXO version change. The failing HF file-list fetch is currently acting as an accidental guard against the 1.1 TB download — but this is fragile and depends on network conditions.

Related: #1973 (RDMA regression), #1975 (JACCL placement), #1568 (similar delete loop, different trigger)

Issue: #2136
