feat(caching): implement negative stat cache to optimize polling of missing files #4729
alleaditya wants to merge 6 commits into
…issing files
This change adds negative caching to StatCache, short-circuits LookUpChild on negative hits, and adds metrics and integration tests to verify the behavior.
Summary of Changes (posted by Gemini Code Assist)
This pull request introduces negative stat caching to optimize performance when applications frequently poll for non-existent files. Negative results (404s) are cached and lookups are short-circuited in the filesystem layer, which reduces redundant backend network traffic. The implementation includes configurable TTL settings and integration tests that check correctness and efficiency.
Code Review
This pull request implements negative stat-cache functionality to optimize lookups for non-existent files and directories, reducing redundant backend GCS requests. Key changes include updates to LookUpChild in dir.go to short-circuit on confirmed negative hits, and logic in fast_stat_bucket.go to manage negative cache entries. Documentation and tests were also added to support this feature. Review feedback suggests removing speculative negative caching in insertListing to avoid correctness issues with paginated GCS listings and adding a warning in the documentation regarding the risks of using an infinite TTL for negative entries.
…s blocking CI
- Remove speculative negative caching from the listing path to avoid pagination bugs (addressed review comment).
- Add a warning about infinite negative TTL usage in semantics.md (addressed review comment).
- Fix a deadlock in createFile's defer in fs.go when an error occurs before the child inode is minted.
- Fix a data race in downloader Job by deep-copying MinObject, preventing a race with the reader thread.
- Fix a missing lock in the fake bucket's MoveObject, resolving a race with StatObject.
…hing, add doc warning
- Revert data race fixes in downloader Job (moved to a separate PR).
- Revert name length checks and the deadlock fix in fs.go (moved to a separate PR).
- Revert name length checks in dir.go (moved to a separate PR), keeping only the LookUpChild short-circuiting.
- Revert the MoveObject lock fix in the fake bucket (moved to a separate PR), keeping the FetchOnlyFromCache checks.
- Remove speculative negative caching on empty directory listings to avoid pagination bugs (addressed review comment).
- Add a warning about infinite TTL usage for negative caching in semantics.md (addressed review comment).
Description
This change implements negative entry caching (non-existent path caching) to optimize workloads that aggressively poll missing files (e.g., JupyterLab).
• Activation: Controlled by the `metadata-cache: negative-ttl-secs` config parameter (or the `--metadata-cache-negative-ttl-secs` flag). It is enabled by default with a 5-second TTL; setting it to `0` disables the feature.
• Proactive Listing Cache: Empty directory listings (`ListObjects` returning 0 results) are proactively cached as negative directory entries to fully protect against implicit directory network probes.
• VFS Routing: `LookUpChild` tracks definitive negative cache hits and short-circuits immediately in memory, avoiding network fallback.
Benchmark Results
A custom benchmarking script was run against a locally mounted test bucket to measure performance over a sustained 60-second polling window per scenario (over 1.3 million VFS operations in total). The mount was configured with a 5-second negative cache TTL.
1. VFS Throughput & Latency Distributions
Serving negative lookups from memory nearly doubles total application throughput and cuts tail latencies by over 40%.
2. TTL Cache Expiration Mechanics (Trace Logs)
Over a 60-second window with a 5-second TTL, 60 / 5 = 12 cache refresh cycles are expected.
• Disabled: Triggers 82,268 backend network calls continuously over the 60-second window.
• Enabled: Triggers exactly 24 backend network calls for the entire minute.
• Breakdown: Each of the 12 TTL cycles triggers exactly 2 backend calls (re-verifying both `missing_file` and `missing_file/` to refresh the cache), so 12 × 2 = 24 calls. Every other request is served entirely from user-space memory.
Link to the issue in case of a bug fix.
https://b.corp.google.com/issues/511786738
Testing details
The tests `WrappedSaysNotFound_NegativeCachingDisabled`, `CacheHit_Negative_Disabled_FetchOnly`, and `EmptyListing_NegativeCaching` verify strict bypass when negative caching is disabled and proper tombstone storage when it is enabled.
Any backward incompatible change? If so, please explain.
No.