
feat(caching): implement negative stat cache to optimize polling of missing files #4729

Open
alleaditya wants to merge 6 commits into GoogleCloudPlatform:master from alleaditya:add-negative-stat-cache


@alleaditya alleaditya commented May 25, 2026

Description

This change implements negative entry caching (non-existent path caching) to optimize workloads that aggressively poll missing files (e.g., JupyterLab).

• Activation: Controlled by the metadata-cache: negative-ttl-secs config parameter (or the --metadata-cache-negative-ttl-secs flag). It is enabled by default with a 5-second TTL. Setting it to 0 disables the feature.
• Proactive Listing Cache: Empty directory listings (ListObjects returning 0 results) are proactively cached as negative directory entries to fully protect against implicit directory network probes.
• VFS Routing: LookUpChild tracks definitive negative cache hits and short-circuits immediately in memory, avoiding network fallback.

Benchmark Results

A custom benchmarking script was executed against a locally mounted test bucket to measure performance over a sustained 60-second polling window per scenario (aggregating over 1.3 million total VFS operations). The bucket was configured with a 5s negative cache TTL.

1. VFS Throughput & Latency Distributions

Serving negative hits from memory nearly doubles lookup throughput and cuts P99 tail latency by about 41%:

| Metric | Negative Caching Disabled | Negative Caching Enabled | Improvement |
| --- | --- | --- | --- |
| Total Ops Completed | 458,234 lookups | 877,848 lookups | 91.5% more throughput |
| Average Latency | 0.1303 ms (130 µs) | 0.0679 ms (67 µs) | 48% faster |
| Median (P50) | 0.1171 ms (117 µs) | 0.0602 ms (60 µs) | 49% faster |
| P99 Tail Latency | 0.2939 ms (293 µs) | 0.1729 ms (172 µs) | 41% faster tail |

2. TTL Cache Expiration Mechanics (Trace Logs)

Over a 60-second window with a 5-second TTL, exactly 12 cache expirations are mathematically expected.

• Disabled: Triggers 82,268 backend network calls continuously over the 60-second window.
• Enabled: Triggers exactly 24 backend network calls for the entire minute.
• Proof: Each of the 12 TTL expirations triggers exactly 2 backend calls (re-verifying both missing_file and missing_file/ to refresh the cache). 12 expirations × 2 calls = 24 calls. Every other request is intercepted entirely in user-space memory.

Link to the issue in case of a bug fix.

https://b.corp.google.com/issues/511786738

Testing details

  1. Manual - Verified sustained 60-second polling behavior against local mounts, validating exact TTL expiration counts and throughput improvements. (Test script https://paste.googleplex.com/5738355712196608)
  2. Unit tests - Added WrappedSaysNotFound_NegativeCachingDisabled, CacheHit_Negative_Disabled_FetchOnly, and EmptyListing_NegativeCaching to verify strict bypass when disabled and proper tombstone storage when enabled.
  3. Integration tests - Ran the existing E2E caching suites successfully.

Any backward incompatible change? If so, please explain.

No.

…issing files

This change adds negative caching to StatCache, short-circuits LookUpChild on negative hits, and adds metrics/integration tests to verify.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces negative stat caching to optimize performance in scenarios where applications frequently poll for non-existent files. By caching negative results (404s) and short-circuiting lookups in the filesystem layer, the change significantly reduces redundant backend network traffic. The implementation includes configurable TTL settings and comprehensive integration tests to ensure correctness and efficiency.

Highlights

  • Negative Stat Caching: Implemented negative caching for non-existent paths to reduce redundant backend GCS requests for workloads that poll missing files.
  • Short-circuit Logic: Updated LookUpChild to short-circuit on confirmed negative hits for both files and directories, preventing unnecessary backend calls.
  • Metrics and Testing: Added integration tests to verify that negative cache hits correctly short-circuit and do not emit backend network requests.
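The short-circuit described in the highlights can be sketched roughly as follows. This is a simplified, sequential illustration under stated assumptions: `lookUpChild`, `cacheState`, and `statFunc` are hypothetical names, the cache is a plain map without TTLs, and the real LookUpChild in dir.go may structure its file and directory lookups differently.

```go
package main

import "fmt"

// cacheState is a hypothetical tri-state result for a cache probe.
type cacheState int

const (
	cacheMiss     cacheState = iota // nothing known about the name
	cachePositive                   // name is known to exist
	cacheNegative                   // name is known to be missing
)

// statFunc stands in for a backend stat call; it returns true if the object exists.
type statFunc func(name string) bool

// lookUpChild checks both the file form ("name") and the directory form
// ("name/"). If both are confirmed negative in the cache, it returns
// immediately without touching the backend.
func lookUpChild(name string, cache map[string]cacheState, backend statFunc) (found bool, backendCalls int) {
	candidates := []string{name, name + "/"}

	allNegative := true
	for _, n := range candidates {
		if cache[n] != cacheNegative {
			allNegative = false
		}
	}
	if allNegative {
		return false, 0 // definitive negative hit: no network fallback
	}

	for _, n := range candidates {
		switch cache[n] {
		case cachePositive:
			return true, backendCalls
		case cacheNegative:
			continue
		}
		backendCalls++
		if backend(n) {
			cache[n] = cachePositive
			return true, backendCalls
		}
		cache[n] = cacheNegative // remember the miss for later lookups
	}
	return false, backendCalls
}

func main() {
	cache := map[string]cacheState{}
	backend := func(string) bool { return false } // bucket with no such object

	found, calls := lookUpChild("missing_file", cache, backend)
	fmt.Println(found, calls) // false 2
	found, calls = lookUpChild("missing_file", cache, backend)
	fmt.Println(found, calls) // false 0
}
```

The first lookup pays for two backend calls, one per candidate name, which matches the two calls per refresh reported in the trace-log section of the description.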


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements negative stat-cache functionality to optimize lookups for non-existent files and directories, reducing redundant backend GCS requests. Key changes include updates to LookUpChild in dir.go to short-circuit on confirmed negative hits, and logic in fast_stat_bucket.go to manage negative cache entries. Documentation and tests were also added to support this feature. Review feedback suggests removing speculative negative caching in insertListing to avoid correctness issues with paginated GCS listings and adding a warning in the documentation regarding the risks of using an infinite TTL for negative entries.
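One plausible reading of the pagination concern is that a single empty page does not prove a prefix is empty if the response still carries a continuation token. The sketch below illustrates that check; `listingPage` and `provesEmpty` are hypothetical names, and the assumption that an empty page can come with a non-empty token is ours, not something stated in the review.

```go
package main

import "fmt"

// listingPage is a hypothetical stand-in for one page of a list response.
type listingPage struct {
	Objects       []string // object names returned on this page
	Prefixes      []string // collapsed "directory" prefixes on this page
	NextPageToken string   // non-empty when more pages may follow
}

// provesEmpty reports whether this single page is enough to conclude that the
// listed prefix has no children. An empty page with a continuation token is
// not proof, because later pages may still contain results.
func provesEmpty(p listingPage) bool {
	return len(p.Objects) == 0 && len(p.Prefixes) == 0 && p.NextPageToken == ""
}

func main() {
	fmt.Println(provesEmpty(listingPage{}))                       // true
	fmt.Println(provesEmpty(listingPage{NextPageToken: "tok"}))   // false
	fmt.Println(provesEmpty(listingPage{Objects: []string{"a"}})) // false
}
```

Under this reading, caching a negative directory entry after the first empty page could hide objects that only appear on a later page, which is consistent with the decision to drop the speculative caching from the listing path.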

Comment thread internal/storage/caching/fast_stat_bucket.go Outdated
Comment thread docs/semantics.md
…s blocking CI

- Remove speculative negative caching from listing path to avoid pagination bugs (addressed review comment).
- Add warning about infinite negative TTL usage in semantics.md (addressed review comment).
- Fix deadlock in createFile's defer in fs.go when error occurs before child inode is minted.
- Fix data race in downloader Job by deep-copying MinObject, preventing race with reader thread.
- Fix missing lock in fake bucket's MoveObject, resolving race with StatObject.
…hing, add doc warning

- Revert data race fixes in downloader Job (moved to separate PR).
- Revert name length checks and deadlock fix in fs.go (moved to separate PR).
- Revert name length checks in dir.go (moved to separate PR), keeping only LookUpChild short-circuiting.
- Revert MoveObject lock fix in fake bucket (moved to separate PR), keeping FetchOnlyFromCache checks.
- Remove speculative negative caching on empty directory listing to avoid pagination bugs (addressed review comment).
- Add warning about infinite TTL usage for negative caching in semantics.md (addressed review comment).
@alleaditya alleaditya requested review from vadlakondaswetha and removed request for geertj May 25, 2026 13:36
@alleaditya alleaditya self-assigned this May 25, 2026
@alleaditya alleaditya added the labels execute-perf-test (Execute performance test in PR), execute-integration-tests (Run only integration tests), and execute-integration-tests-on-zb (To run E2E tests on zonal bucket), and removed the labels execute-perf-test and execute-integration-tests May 25, 2026
@alleaditya alleaditya requested a review from raj-prince June 1, 2026 08:31
@alleaditya alleaditya enabled auto-merge (squash) June 1, 2026 08:31
@alleaditya alleaditya disabled auto-merge June 1, 2026 08:31
Labels

execute-integration-tests-on-zb To run E2E tests on zonal bucket.
