Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed
In your micromamba environment of choice, run the following command to install the package in editable mode:
pip install -e .
This exposes the experiment entry points (gc-capacity, gc-speed, gc-groks), the suite dispatcher (gc-dispatch), and the figure renderer (gc-figures). The experiment runs database lives at runs.db (override with GC_WALLOW_DB); the schema is declared in wallow.toml and auto-creates on first use.
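As a rough illustration of the GC_WALLOW_DB override described above, the sketch below resolves a database path from the environment and falls back to runs.db. The helper name resolve_runs_db is hypothetical and is not taken from the package; the actual lookup logic may differ.

```python
import os
from pathlib import Path


def resolve_runs_db(default: str = "runs.db") -> Path:
    """Return the runs database path (hypothetical helper, not the package's code).

    Prefers the GC_WALLOW_DB environment variable when it is set and non-empty,
    otherwise falls back to the default location named in this README.
    """
    override = os.environ.get("GC_WALLOW_DB")
    return Path(override) if override else Path(default)


if __name__ == "__main__":
    print(resolve_runs_db())
```

Setting GC_WALLOW_DB before launching a run is then enough to point everything at a different database file.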
The paper's data is produced by a set of YAML-driven suites under configs/. Each suite is dispatched in parallel across the available GPUs by gc-dispatch.
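This README does not describe how gc-dispatch schedules runs, so the following is only a sketch of one simple way to spread a suite's runs across GPUs: round-robin assignment. The function name assign_round_robin and the run names are hypothetical and do not come from the package.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def assign_round_robin(run_names: Sequence[str], gpu_ids: Sequence[int]) -> Dict[int, List[str]]:
    """Assign runs to GPUs in round-robin order (illustrative assumption only)."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    plan: Dict[int, List[str]] = defaultdict(list)
    for index, name in enumerate(run_names):
        # Run i goes to GPU i mod (number of GPUs).
        plan[gpu_ids[index % len(gpu_ids)]].append(name)
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical run names for a five-run suite on two GPUs.
    runs = [f"run_{i}" for i in range(5)]
    for gpu, assigned in assign_round_robin(runs, [0, 1]).items():
        print(gpu, assigned)
```

A worker per GPU would then typically pin itself to its device (for example through CUDA_VISIBLE_DEVICES) and work through its list in order.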
To reproduce every suite end-to-end on a single multi-GPU node, run the wrapper script from the repo root with your micromamba environment name:
scripts/replicate.sh <micromamba-env>
This launches a detached tmux session that runs all ten suites (central, weight_decay_sweep, alpha_sweep, dropout_sweep, lr_sweep, init_scale_sweep, task_add, depth_scaling, heads_sweep, task_mul) sequentially.
Once the wallow database is populated, render the figures from it (one folder per config under figures/):
gc-figures --all
To render a single config or restrict to a figure family, see gc-figures --help.
Use the following BibTeX entry to cite this work:
@article{song2026grokking,
title={Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds},
author={Song, Yiding and Ye, Hanming},
year={2026}
}
Unless otherwise stated, the files and code in this repository are licensed under the GNU GENERAL PUBLIC LICENSE (Version 3), Copyright (C) 2026 Yiding Song and Hanming Ye.
Note: the files modular.py and transformer.py are adapted from the code by Amund Tveit (available at adveit/torch_grokking under the MIT License), which itself is a PyTorch port of the original MLX code by Jason Stock (available at stockeh/mlx-grokking). We have modified modular.py to add different split types and random data generation, and have left transformer.py mostly untouched. The trainers used to finetune the neural network also take inspiration from the code of Tveit and Stock.
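For readers unfamiliar with the task, the sketch below shows the general shape of a modular arithmetic dataset with a random train/test split. It is not the code in modular.py: the function name make_modular_dataset, the modulus p=97, the 50% train fraction, and the single random split are all assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[int, int, int]  # (a, b, result of the operation mod p)


def make_modular_dataset(
    p: int = 97,
    op: str = "add",
    train_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Enumerate every pair (a, b) mod p, label it, and split it at random.

    Illustrative only; all names and defaults here are assumptions.
    """
    if op == "add":
        examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    elif op == "mul":
        examples = [(a, b, (a * b) % p) for a in range(p) for b in range(p)]
    else:
        raise ValueError(f"unsupported op: {op!r}")

    # Shuffle with a seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


if __name__ == "__main__":
    train, test = make_modular_dataset()
    print(len(train), len(test))
```

The train and test sets are disjoint and together cover all p * p pairs, which is the setting in which a model can first memorise the training pairs and only later generalise to the held-out ones.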