A high-performance, distributed implementation of the EGGROLL algorithm designed for scaling across multiple GPUs and nodes.
The system follows a Coordinator-Worker model designed for high throughput and fault tolerance.
- Data Requirements:
  - Each compute node (worker) must have `input.bin` (training data) and `decoding.bin` (tokenizer vocabulary) in its working directory.
  - The coordinator does not require these files.
- Population Splitting:
  - The total `POPULATION_SIZE` (e.g., 327,680) is divided into smaller `CHUNK_SIZE` units (e.g., 40,960).
  - Example: a population of ~320k is split into 8 chunks of ~40k each.
- Job Queueing & Assignment:
  - The Coordinator maintains a queue of pending chunks for the current step.
  - Chunks are assigned to the first available worker that requests a job.
  - If a worker times out (straggler), the chunk is re-queued and assigned to another worker.
- Result Aggregation:
  - Workers compute fitness scores for their assigned chunk.
  - Scores are ternary-packed (compressed to ~1.6 bits/value) and sent back to the Coordinator.
  - Once all chunks are completed, the Coordinator advances to the next step and broadcasts the updated state.
- Custom Binary Protocol: Lightweight, versioned protocol with fixed-size headers and big-endian byte order for cross-platform compatibility.
- Fault Tolerance: Robust against worker failures or network delays. The coordinator detects stragglers and re-assigns their work to other available workers.
- Efficiency:
  - Ternary Packing: Fitness scores are ternary and packed five values per byte (~1.6 bits per value), significantly reducing network bandwidth.
  - CUDA Graphs: Optimizer steps are captured as CUDA Graphs to reduce kernel launch overhead.
  - Uniform Random Distribution: Model initialization and evolution noise use a uniform distribution by default (not Gaussian).
- Adaptive Training:
  - Adaptive Noise Scaling: Automatically adjusts the noise scale per layer/parameter to maintain stability.
  - Chunk Mean Filter: Favors perturbations that align with the local chunk trend.
  - Adaptive Thresholding: Filters out weak signals based on mean absolute difference.
- Optimizers:
  - AdamW: Implements the AdamW optimizer (`d-eggs/include/optimizer/adam.cuh`) with quantized updates, maintaining float moments while updating integer weights.
  - SGD / Quantized Update: Uses a threshold-based update mechanism (`d-eggs/include/optimizer/sgd.cuh`) to bridge floating-point gradients with integer-only weights.
The worker implements a custom CUDA kernel for sequence generation:
- Presence Penalty: Penalizes tokens that have already appeared in the sequence to reduce repetition.
- Temperature Scaling: Adjusts the randomness of predictions. Higher temperature flattens the distribution.
- Min-P Sampling: An efficient alternative to Top-P (nucleus) sampling. It truncates the tail of the distribution by ignoring tokens with probability less than `min_p * max_prob`.
- Selection: Performs a weighted random selection from the filtered distribution.
Configuration via `include/config.h`:

- `SAMPLING_TEMP` (default: 0.6)
- `SAMPLING_MIN_P` (default: 0.08)
- `SAMPLING_PRESENCE_PENALTY` (default: 0.2)
These features are implemented but disabled by default in `include/config.h`. Enable them by editing the config or passing flags at compile time.
- Muon Optimizer (`USE_MUON`): Momentum-based optimizer for potentially faster convergence.
- NTT Mode (`NTT_MODE`): Number Theoretic Transform embeddings (Walsh-Hadamard, Fermat).
- Gaussian Noise (`HOST_GAUSSIAN`, `DEVICE_GAUSSIAN`): Alternative noise-generation strategies.
- NVIDIA GPU with CUDA support (Compute Capability 8.6+ recommended).
- GCC/G++ and NVCC.
Run `make` to build. This produces three binaries:

- `coordinator`: The central server.
- `worker`: The GPU worker.
- `print_arch`: Utility to print the model architecture and parameter counts.
- Prepare Data: Ensure `input.bin` and `decoding.bin` are present on all worker nodes.
- Start the Coordinator: `./coordinator --save-dir ./checkpoints`. The coordinator listens on port `12345`.
- Start Workers: `./worker <coordinator_ip>`
  - Example (local): `./worker 127.0.0.1`
  - For a multi-GPU setup on a single machine, use the provided script: `./run-workers.sh`
Typical PCIe traffic:
Typical coordinator output:
Configuration is handled via `include/config.h`. Key parameters include:

- `HIDDEN_DIM`, `N_LAYERS`, `N_HEADS`: Model architecture.
- `POPULATION_SIZE`, `CHUNK_SIZE`: Training scale.
- `USE_MUON`, `ROPE_SCALE_BIT`: Advanced features.

Recompile after changing `config.h`.
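As a hypothetical illustration of what such a header might look like: the population, chunk, and sampling values below come from this README, while the model-dimension values are placeholders, not the repository's defaults.

```cpp
// Hypothetical include/config.h fragment (illustrative only).
#define HIDDEN_DIM 512            // placeholder value
#define N_LAYERS   8              // placeholder value
#define N_HEADS    8              // placeholder value

#define POPULATION_SIZE 327680    // total perturbations per step
#define CHUNK_SIZE      40960     // perturbations per worker job (8 chunks)

#define SAMPLING_TEMP             0.6f
#define SAMPLING_MIN_P            0.08f
#define SAMPLING_PRESENCE_PENALTY 0.2f

// Optional features, disabled by default:
// #define USE_MUON 1
// #define NTT_MODE 1
```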

