refactor: replace file-based HTTP cache with SQLite backend #5501
baszalmstra wants to merge 14 commits into prefix-dev:main
Should we also update the pixi clean command to remove the SQLite file?
48197d0 to eac6277
The PyPI mapping system was using cacache (CACacheManager), which creates many small files on disk. This works poorly on HPC and network filesystems where metadata operations on many small files are expensive. Replace CACacheManager with a new SqliteCacheManager that stores all HTTP cache entries in a single SQLite database file.
The implementation:
- Uses WAL journal mode for good concurrent read performance
- Sets synchronous=NORMAL since this is a cache (crash data loss is OK)
- Configures a 5s busy_timeout for concurrent process access
- Serializes HttpResponse + CachePolicy together as JSON blobs
- Fully respects HTTP cache semantics (same CacheManager trait)
The SQLite database is stored at: ~/.cache/pixi/conda-pypi-mapping/http_cache.sqlite
https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
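The connection settings listed in this commit message can be sketched with Python's standard sqlite3 module. This is only an illustration of the three pragmas, not the PR's Rust code; the names open_cache and pragma are hypothetical helpers introduced here.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_cache(path: Path) -> sqlite3.Connection:
    """Open (or create) a cache database with the pragmas named above."""
    # Create missing parent directories before SQLite opens the file.
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # NORMAL syncs less often than FULL; acceptable for a rebuildable cache.
    conn.execute("PRAGMA synchronous=NORMAL")
    # Wait up to 5000 ms for a lock held by another process before failing.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn


def pragma(conn: sqlite3.Connection, name: str):
    """Read back the current value of a pragma (hypothetical helper)."""
    return conn.execute(f"PRAGMA {name}").fetchone()[0]
```

Reading the pragmas back after opening is a cheap way to confirm that the settings took effect, since SQLite ignores unknown pragmas without raising an error.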
bincode serializes the response body as raw bytes, avoiding the base64 overhead that serde_json would introduce for the Vec<u8> body field. This also matches what the original CACacheManager used. https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
… columns
Instead of serializing the entire HttpResponse+CachePolicy as a single blob, split the schema into three columns:
- body: raw BLOB (no serialization overhead for response bytes)
- response_meta: JSON (headers, status, url, version)
- policy: JSON (HTTP cache policy)
This avoids any encoding overhead for the response body and keeps the metadata human-readable for debugging.
https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
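A minimal sketch of the three-column layout, again in Python's sqlite3 for illustration. The column names body, response_meta, and policy come from the commit message; the table name http_cache, the cache_key column, and the put/get helpers are assumptions made for this example and may not match the actual crate.

```python
import json
import sqlite3

# Hypothetical table and key column; only the three value columns are
# taken from the commit message.
SCHEMA = """
CREATE TABLE IF NOT EXISTS http_cache (
    cache_key     TEXT PRIMARY KEY,
    body          BLOB NOT NULL,
    response_meta TEXT NOT NULL,
    policy        TEXT NOT NULL
)
"""


def put(conn, key: str, body: bytes, meta: dict, policy: dict) -> None:
    """Insert or overwrite an entry; the body is stored as raw bytes."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO http_cache "
            "(cache_key, body, response_meta, policy) VALUES (?, ?, ?, ?)",
            (key, body, json.dumps(meta), json.dumps(policy)),
        )


def get(conn, key: str):
    """Return (body, meta, policy) for a key, or None if it is missing."""
    row = conn.execute(
        "SELECT body, response_meta, policy FROM http_cache WHERE cache_key = ?",
        (key,),
    ).fetchone()
    if row is None:
        return None
    body, meta, policy = row
    return bytes(body), json.loads(meta), json.loads(policy)
```

Because the body never passes through JSON, arbitrary bytes (including null bytes) round-trip unchanged, while the two metadata columns stay readable from a plain SQLite shell.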
Move the SQLite-backed CacheManager out of pypi_mapping into a standalone crate at crates/http_cache_sqlite. This implementation is not pixi-specific and can be reused by any consumer of http-cache-reqwest that wants a single-file SQLite cache instead of many small files. https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
Tests cover:
- get on missing key returns None
- put then get roundtrips body, status, and headers
- put overwrites existing entries
- delete removes entries
- delete on nonexistent key is ok
- multiple keys are independent
- response headers are preserved
- binary body (all 256 byte values including null)
- empty body
- data persists across reopen of the database
- parent directories are created automatically
https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
The workspace clippy config disallows std::fs methods. Switch create_dir_all to fs_err::create_dir_all for better error messages. https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
Store the SQLite database directly as ~/.cache/pixi/conda-pypi-mapping.sqlite instead of nesting it inside a subdirectory. Simpler and avoids creating an extra directory just for one file. https://claude.ai/code/session_01XykR7AMvHDmUnrhnzptwW1
eac6277 to b3ab5eb
81b3600 to 587ce2f
Comparing the new
Results
Summary: SQLite is 12–35x faster across all steady-state operations. The only case where cacache is faster is cold start
No await points are held while the lock is held, so a synchronous mutex avoids the overhead of the tokio runtime for lock acquisition.
Set mmap_size to 32 MB so SQLite can use memory-mapped I/O for read operations. This is a cap, not a pre-allocation: the OS maps only what the file actually uses, and SQLite silently falls back to read() if mmap is unavailable.
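The mmap cap can be sketched the same way with Python's sqlite3 module. The helper name enable_mmap is hypothetical, and the value SQLite reports back depends on how the library was built, so the sketch only reads it without assuming a particular result.

```python
import sqlite3
import tempfile
from pathlib import Path

MMAP_CAP_BYTES = 32 * 1024 * 1024  # 32 MB, the value named in the commit message


def enable_mmap(conn: sqlite3.Connection, cap: int = MMAP_CAP_BYTES) -> int:
    """Ask SQLite to allow up to `cap` bytes of memory-mapped I/O.

    Returns the limit SQLite reports back, or 0 when it reports nothing
    (for example when the build or storage backend has no mmap support).
    """
    row = conn.execute(f"PRAGMA mmap_size={int(cap)}").fetchone()
    return int(row[0]) if row else 0


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(Path(tmp) / "cache.sqlite")
    effective = enable_mmap(conn)
    conn.close()
```

The reported limit is an upper bound on how much of the file may be mapped; a small cache file only ever maps its own size.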
…und-trip Add From conversions between local HttpVersion and upstream http_cache::HttpVersion, then build/deconstruct HttpResponse by accessing its fields directly. This eliminates two serde_json::Value round-trips per cache get/put.
Reuse compiled SQL statements across calls by using prepare_cached instead of execute, matching what we already do for get.
525c93d to 1e4a75b
@nichmor I fixed the clean situation as well.
I'm very hesitant to merge this as it adds a huge dependency to the cargo workspace: https://crates.io/crates/libsqlite3-sys. This will compile the libsqlite3 package on
Here is a little overview of
I would like to challenge you to figure out whether we can avoid this dependency, use a different strategy to solve the given issue, or make use of SQLite in other parts of pixi's caching to make it a more impactful introduction to the overall project.
What have you used for plotting?
Closing this for now.


Description
Replaces the default file-based CACacheManager with a new SqliteCacheManager that stores all cached HTTP responses in a single SQLite database file instead of many small files on disk.
Motivation:
Implementation Details:
- SqliteCacheManager implements the CacheManager trait from http_cache_reqwest
- synchronous = NORMAL since this is a cache and data loss on crash is acceptable
Fixes #5439
How Has This Been Tested?
The change integrates with existing HTTP caching infrastructure. The CacheManager trait implementation ensures compatibility with the http_cache_reqwest library's cache layer. Existing code paths that use HTTP caching will automatically use the new SQLite backend without modification.
Further testing should be done manually and in CI.
AI Disclosure
Written by Claude Code Opus 4.6 Extended.
Checklist: