Add checkpoint-restore before testing mode #149

Open
kamilchodola wants to merge 16 commits into devnets/bal/6 from codex/checkpoint-before-testing

@kamilchodola

Contributor

Summary

  • Add checkpoint_before_testing workflow dispatch input for Linux Docker/CRIU checkpoint-restore runs
  • Add run.sh -C mode, which checkpoints gas-execution-client and restores it before each measured test
  • Keep checkpoint mode mutually exclusive with restart_before_testing / -R
  • Install criu in the workflow when checkpoint mode is requested, and fail with a clear error if Docker checkpoint support is unavailable
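
A minimal sketch of how run.sh might enforce the -C / -R exclusivity at argument-parsing time, assuming it uses getopts. The function name parse_mode_flags and the CHECKPOINT_BEFORE_TESTING / RESTART_BEFORE_TESTING variables are hypothetical, and the real script parses other options as well.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only the -C and -R flags are handled here.

parse_mode_flags() {
  local OPTIND=1 opt
  CHECKPOINT_BEFORE_TESTING=false
  RESTART_BEFORE_TESTING=false
  while getopts ":CR" opt; do
    case "$opt" in
      C) CHECKPOINT_BEFORE_TESTING=true ;;
      R) RESTART_BEFORE_TESTING=true ;;
      \?) echo "unknown option: -$OPTARG" >&2; return 2 ;;
    esac
  done
  # Checkpoint-restore and container restart are alternative ways to reset
  # client state before a test, so requesting both is rejected.
  if [[ "$CHECKPOINT_BEFORE_TESTING" == true && "$RESTART_BEFORE_TESTING" == true ]]; then
    echo "error: -C and -R are mutually exclusive" >&2
    return 1
  fi
  return 0
}
```

Usage would be parse_mode_flags "$@" || exit; combined flags such as -CR are split by getopts, so they are rejected the same way as -C -R.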

Notes

Docker/CRIU checkpoints snapshot the execution client process memory, not the bind-mounted data directory. This is an experimental Linux-only alternative to container restart for measuring cache/process-state effects.

Testing

  • Parsed .github/workflows/repricing-nethermind.yml with PyYAML
  • Ran git diff --check
  • Ran bash -n against a CRLF-stripped stream of run.sh
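
One way to run the CRLF-stripped syntax check, assuming the carriage returns are removed with tr (the exact command used is not shown in the PR, and check_syntax_crlf is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Syntax-check a script that may have Windows (CRLF) line endings.
# tr deletes every carriage return; bash -n parses stdin without executing it.
# The pipeline's exit status is that of bash -n (the last command).
check_syntax_crlf() {
  tr -d '\r' < "$1" | bash -n
}
```

For example, check_syntax_crlf run.sh exits non-zero and prints the parse error if the script has a syntax problem.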

kamilchodola force-pushed the codex/checkpoint-before-testing branch from b895a1e to 5afd879 on May 19, 2026 09:36
Replace the complex 3-layer overlay approach (mount --move to ready-lower,
then mount test overlay on top) with a simpler strategy:

- After setup completes, snapshot the overlay upper dir via rsync/cp
- Before each CRIU restore, wipe the upper and work dirs and restore upper from the snapshot
- Removes mount --move dependency and extra overlay layer
- Setup files now run normally without checkpoint interference
- Checkpoint is taken after all setup (gas-bump, funding) completes
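
A rough sketch of the snapshot/restore strategy, assuming the overlay is unmounted while its upper and work dirs are manipulated (the kernel's overlayfs documentation treats changes to the layers of a mounted overlay as undefined behavior). The function names and directory arguments are hypothetical, not taken from run.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for "snapshot upper once, restore before each test".
# Assumes the overlay is NOT mounted while these run.

# Take one snapshot of the upper dir after setup (gas-bump, funding) finishes.
snapshot_upper_dir() {
  local upper="$1" snapshot="$2"
  rm -rf -- "$snapshot"
  if command -v rsync >/dev/null 2>&1; then
    mkdir -p -- "$snapshot"
    # Trailing slashes: copy the contents of upper into snapshot.
    rsync -a "$upper/" "$snapshot/"
  else
    # snapshot does not exist yet, so cp -a creates it as a copy of upper.
    cp -a -- "$upper" "$snapshot"
  fi
}

# Before each CRIU restore: discard upper and work, then rebuild them.
restore_upper_dir() {
  local upper="$1" work="$2" snapshot="$3"
  rm -rf -- "$upper" "$work"
  cp -a -- "$snapshot" "$upper"
  # overlayfs expects an empty work dir on the same filesystem as upper.
  mkdir -p -- "$work"
}
```

Compared with the removed mount --move layering, this trades a copy of the upper dir per test for a single overlay layer; the cost scales with how much the setup phase wrote, not with the size of the lower (base) data.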
kamilchodola force-pushed the codex/checkpoint-before-testing branch from ed8804c to 1887197 on May 19, 2026 14:32
prepare_overlay_for_client runs in a $() subshell, so assignments it
makes to the associative array ACTIVE_OVERLAY_LOWERS are lost in the
parent shell. Print "merged|lower" on stdout instead, parse it in
register_overlay_for_client, and strip the "|lower" suffix before
passing data_dir as --dataDir.
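
The subshell pitfall and the pipe-delimited fix can be reproduced in isolation. This sketch keeps the two function names from the commit message, but their bodies, the broken_prepare helper and the /data/<client> paths are hypothetical, and it assumes paths never contain a '|' character.

```shell
#!/usr/bin/env bash
# Sketch only: bodies and paths are hypothetical.

declare -gA ACTIVE_OVERLAY_LOWERS=()

# Broken pattern: when called as $(broken_prepare ...), this runs in a
# subshell, so the associative-array write never reaches the parent shell.
broken_prepare() {
  ACTIVE_OVERLAY_LOWERS["$1"]="/data/$1/lower"
  echo "/data/$1/merged"
}

# Fixed pattern: report both paths on stdout as "merged|lower".
prepare_overlay_for_client() {
  local client="$1"
  # (the real function would set up and mount the overlay here)
  printf '%s|%s\n' "/data/${client}/merged" "/data/${client}/lower"
}

# Called directly (not via $()), so the array write persists.
register_overlay_for_client() {
  local client="$1" result="$2"
  ACTIVE_OVERLAY_LOWERS["$client"]="${result#*|}"   # text after the first '|'
}

broken_dir=$(broken_prepare demo)
# ACTIVE_OVERLAY_LOWERS[demo] is still unset here.

data_dir=$(prepare_overlay_for_client nethermind)
register_overlay_for_client nethermind "$data_dir"
data_dir="${data_dir%%|*}"   # strip "|lower" before using it as --dataDir
```

Note that register_overlay_for_client must not itself be wrapped in $(), or its array write would be lost in the same way.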
- Store checkpoint export on tmpfs (RAM) to eliminate disk I/O
- Add --keep flag to checkpoint (keep container during dump)
- Add --print-stats to both checkpoint and restore for timing visibility
- Remove wait_for_rpc after CRIU restore (process resumes mid-execution)
- Log checkpoint tar size for diagnostics
- Add tmpfs setup/teardown lifecycle management
- A 4G tmpfs was too small for the Nethermind checkpoint (62GB RAM machine)
- Remove socket drain before checkpoint (not needed with MaxActivePeers=0)
- Remove podman rm after checkpoint (--keep keeps container alive,
  matching benchmarkoor's approach)
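
For the tar-size logging and the tmpfs sizing issue, comparing the archive size with the free space on the tmpfs shows when the mount is too small. A pair of diagnostic helpers might look like this; both function names are hypothetical, the sketch assumes GNU coreutils (stat -c, df --output, numfmt), and it does not show the actual checkpoint/restore commands.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helpers; requires GNU coreutils.

# Print the size of a checkpoint archive in bytes and in human-readable form.
log_checkpoint_size() {
  local archive="$1" bytes
  bytes=$(stat -c %s -- "$archive") || return 1
  echo "checkpoint archive ${archive}: ${bytes} bytes ($(numfmt --to=iec "$bytes"))"
}

# Print the free space, in bytes, on the filesystem holding a directory,
# e.g. the tmpfs mount used for checkpoint exports.
free_bytes_on() {
  # df prints a header line first; keep the value and drop padding spaces.
  df --output=avail -B1 -- "$1" | tail -n 1 | tr -d ' '
}
```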