Add local rpc cache access to avoid timing issues due to delayed commits #18575
Conversation
Pull request overview
This PR addresses timing issues in RPC calls by introducing a local cache mechanism that provides access to uncommitted state data from the execution layer, preventing race conditions when state changes haven't yet been committed to the database.
Key Changes:
- Added a new `Cache` type that provides RPC layer access to in-memory execution state
- Integrated the cache into state reader creation for historical and latest state queries
- Enhanced error messages for pruned state with more diagnostic information
- Refactored forkchoice update logic to manage execution context lifecycle
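The first two bullets above describe the core mechanism of the PR. As a rough illustration only (the interface and function names below are invented for this sketch, not the actual `CreateHistoryCachedStateReader`/`cachedHistoryReaderV3` API), state-reader creation can prefer a cache-backed reader when the cache covers the requested transaction number and otherwise fall back to committed DB state:

```go
// Hypothetical types, not Erigon's: a reader-selection helper in the spirit of
// the change described above.
type StateReader interface {
	ReadAccountData(addr [20]byte) ([]byte, error)
}

type StateCache interface {
	HasTxNum(txNum uint64) bool // can the in-memory execution state serve this txNum?
	ReaderAt(txNum uint64) StateReader
}

// newStateReader prefers the uncommitted in-memory state when available,
// avoiding the race where RPC reads land before the commit to the database.
func newStateReader(cache StateCache, db StateReader, txNum uint64) StateReader {
	if cache != nil && cache.HasTxNum(txNum) {
		return cache.ReaderAt(txNum)
	}
	return db
}
```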
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| rpc/transactions/tracing.go | Added stateCache parameter to ComputeBlockContext and use cached reader when available |
| rpc/rpchelper/helper.go | Implemented CreateHistoryCachedStateReader and cachedHistoryReaderV3 for historical state access |
| rpc/jsonrpc/*.go | Updated calls to pass stateCache parameter through the call chain |
| execution/execmodule/*.go | Added Cache/CacheView types and integrated with EthereumExecutionModule |
| db/state/temporal_mem_batch.go | Changed from single-value to versioned storage to support historical queries |
| node/eth/backend.go | Wired up execmodule.Cache to RPC configuration |
| common/dbg/*.go | Renamed ContextWithDebug to WithDebug for consistency |
Comments suppressed due to low confidence (1)
execution/stagedsync/stageloop/stageloop.go:1
- Changed from `context.WithTimeout(ms.Ctx, 10*time.Second)` to `context.WithCancel(ms.Ctx)`. Removing the timeout could cause tests to hang indefinitely if the stream doesn't receive expected messages. Consider if this timeout removal is intentional or if a longer timeout would be more appropriate.
// Copyright 2024 The Erigon Authors
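The suppressed comment concerns removing a timeout around a blocking test wait. A minimal, generic sketch of the pattern the comment argues for (this is not the Erigon test code; `recvWithTimeout` and the 30s bound are invented for illustration): bound the blocking receive so a missing message fails fast instead of hanging the test.

```go
import (
	"context"
	"errors"
	"time"
)

// recvWithTimeout wraps a blocking receive (a stand-in for stream.Recv) in a
// bounded context, so the caller gets an error instead of waiting forever.
func recvWithTimeout(parent context.Context, d time.Duration, recv func() error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- recv() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errors.New("timed out waiting for stream message")
	}
}

// usage in a test, for example:
// err := recvWithTimeout(ctx, 30*time.Second, func() error { _, err := stream.Recv(); return err })
```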
```diff
 	engine rules.Engine

-	doingPostForkchoice atomic.Bool
+	doingPostForkchoice bool
```
Changed from atomic.Bool to plain bool, but this field is accessed and modified across goroutines (e.g., in runPostForkchoiceInBackground). Without atomic operations or proper locking for all accesses, this creates a data race.
This misses the bigger picture: I have also introduced an object-level lock, because the currentContext also needs to be protected, so there is no point in maintaining a separate lock here. It's a style choice (i.e. don't mix implicit and explicit locking).
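A minimal sketch of that style choice, using invented field and method names (the real EthereumExecutionModule fields differ): one object-level mutex guards both the flag and the execution context, instead of mixing an atomic flag with separate locking.

```go
import (
	"context"
	"sync"
)

// forkchoiceState is illustrative only: a single lock protects every field,
// so the flag and the context can never be observed out of sync.
type forkchoiceState struct {
	mu                  sync.Mutex
	doingPostForkchoice bool
	currentContext      context.Context
	cancel              context.CancelFunc
}

// tryStartPostForkchoice reports whether this caller may run the background work.
func (s *forkchoiceState) tryStartPostForkchoice(parent context.Context) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.doingPostForkchoice {
		return false // already in flight
	}
	s.doingPostForkchoice = true
	s.currentContext, s.cancel = context.WithCancel(parent)
	return true
}

func (s *forkchoiceState) finishPostForkchoice() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.doingPostForkchoice = false
	if s.cancel != nil {
		s.cancel()
		s.cancel = nil
	}
}
```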
```go
type dataWithTxNum struct {
	data  []byte
	txNum uint64
	dir   iodir
```
Yes - but this is not new to this PR
```go
if !ok {
	return nil, false, nil
}
for i, dataWithTxNum := range dataWithTxNums {
```
Maybe a `for i := 0; ...; i++ {` style loop would be faster here, because `dataWithTxNum` is a non-pointer struct and will be copied on each iteration.
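Sketch of the suggested index-based style (the real loop body is not shown in the hunk above, so the function below is an invented example; only the iteration pattern matters):

```go
// dataWithTxNum is redeclared here, trimmed to the fields the loop needs, so
// the snippet stands alone; see the struct in the hunk above.
type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

// totalDataLen iterates by index and takes a pointer to each element,
// avoiding the per-iteration struct copy that `for _, d := range ...` implies.
func totalDataLen(dataWithTxNums []dataWithTxNum) int {
	total := 0
	for i := 0; i < len(dataWithTxNums); i++ {
		d := &dataWithTxNums[i]
		total += len(d.data)
	}
	return total
}
```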
```diff
 	latestStateLock sync.RWMutex
-	domains [kv.DomainLen]map[string]dataWithStep
-	storage *btree2.Map[string, dataWithStep] // TODO: replace hardcoded domain name to per-config configuration of available Guarantees/AccessMethods (range vs get)
+	domains [kv.DomainLen]map[string][]dataWithTxNum
```
For me it's unclear: how many elements will be in this array per key? Where is the limit? Is there no limit?
There is no limit, so in practice it's determined by the number of updates per key over the lifetime of the SharedDomains. Maybe we need to distinguish between fork choice and sync: I think for sync we probably only need the latest value, while for fork choice we need all of them, but there we will only be handling a low number of blocks.
I'll address this in the next PR, which deals with caching more generally.
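To make the discussion concrete, here is a rough sketch of the versioned per-key storage being described (invented types and method names, not the actual temporal_mem_batch.go code): every write appends a version, so per-key memory grows with the number of updates until the batch is discarded, and a historical read picks the newest version at or below the requested txNum.

```go
import "sync"

type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

// versionedDomain keeps every version of a key (no limit, as noted above).
type versionedDomain struct {
	mu      sync.RWMutex
	entries map[string][]dataWithTxNum
}

// put appends a new version of key as of txNum.
func (d *versionedDomain) put(key string, val []byte, txNum uint64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.entries[key] = append(d.entries[key], dataWithTxNum{data: val, txNum: txNum})
}

// getAsOf returns the newest version with txNum <= asOf; this is what a
// historical RPC query needs, while plain sync would only read the last element.
func (d *versionedDomain) getAsOf(key string, asOf uint64) ([]byte, bool) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	versions := d.entries[key]
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].txNum <= asOf {
			return versions[i].data, true
		}
	}
	return nil, false
}
```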
```diff
 	latestStateLock sync.RWMutex
-	domains [kv.DomainLen]map[string]dataWithStep
-	storage *btree2.Map[string, dataWithStep] // TODO: replace hardcoded domain name to per-config configuration of available Guarantees/AccessMethods (range vs get)
+	domains [kv.DomainLen]map[string][]dataWithTxNum
```
I think then we also need to change the `SizeEstimate()` method: from `* 4` to `* 8`.
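Continuing with the illustrative types from the sketch above (this is not the actual `SizeEstimate()` method): with versioned entries the estimate has to charge each retained version, including its fixed-size txNum field, which is why the per-element multiplier the reviewer points at grows.

```go
// sizeEstimate is a hypothetical per-map accounting; the real method and its
// constants (the "* 4" / "* 8" mentioned above) live in temporal_mem_batch.go.
func sizeEstimate(entries map[string][]dataWithTxNum) uint64 {
	var total uint64
	for k, versions := range entries {
		total += uint64(len(k))
		for i := range versions {
			total += uint64(len(versions[i].data)) + 8 // +8 bytes for the txNum field
		}
	}
	return total
}
```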
…#20735)

## Summary

Move the `status != ExecutionStatusSuccess` check above the state-diff wait loop in `InsertChain` so a rejected block returns promptly instead of blocking on `stream.Recv()` for events that will never arrive.

## Root cause

In the Hive [`eels/consume-rlp`](https://hive.ethpandaops.io/#/test/generic/1776837107-92f3a4a260c3454b01b849f4a5979b95) run against `erigon_default`, **539 tests fail** — all with the same pattern. Client logs show Erigon correctly rejecting the invalid block:

```
[4/6 Execution] Execution failed err="invalid block, txnIdx=N, blob gas limit reached"
Cannot update chain head err="updateForkChoice: [4/6 Execution] invalid block..."
[WARN] flag --externalcl was provided, but no CL requests to engine-api in 2m0s
```

But `erigon import` then hangs in `InsertChain`'s `stream.Recv()` loop waiting for state-diff events that never arrive for a rejected block. That blocks the import command long enough for Hive's 180s `--client.checktimelimit` to fire, so the main erigon process never starts and Hive reports "unable to connect to client (erigon_default)".

The wait loop (introduced in erigontech#18575, Jan 2026) was ordered with the status check *after* it:

```go
// old order — unreachable on bad block:
for len(insertedBlocks) > 0 { req, err := stream.Recv(); ... } // hangs here
...
if status != ExecutionStatusSuccess { return fmt.Errorf(...) } // never reached
```

The `if status != Success` branch was never reachable for a rejected block because the loop blocks first.

## Why this only hits 539/43924 tests

Tests whose invalid block fails the structural `HashCheck(true)` check (empty receipt hash, mismatched tx root, etc.) return early from `InsertChain` before reaching the stream wait. The 539 failures are all cases where the block is structurally valid but semantically rejected during execution:

| count | test |
|---:|---|
| 198 | `test_insufficient_balance_blob_tx_combinations` |
| 149 | `test_invalid_blob_gas_used_in_header` |
| 106 | `test_invalid_block_blob_count` |
| 20 | `test_invalid_layout` |
| 19 | `test_invalid_multi_type_requests` |
| … | (plus `test_*_negative` / `test_system_contract_*`, totaling 47 more) |

E.g. for `test_invalid_block_blob_count`, single-tx variants like `(7,)` fail structurally and pass the hive test, while multi-tx variants like `(1,1,5)` — where each tx is valid but the block exceeds the blob limit — only fail during execution and hit this hang.

## Fix

Check `status` immediately after `UpdateForkChoice` and return an error before entering the wait loop. The wait loop is now only reached on success, which is the only path that actually produces state-diff events.

`WriteHeadBlockHash(lvh)` also moves below the check, which is correct: on rejection `lvh` is the existing head (from `rawdb.ReadHeadBlockHash` in `updateForkChoice`'s invalid-block branch), so the old code was either writing a no-op or (more commonly) never reaching that write anyway because it hung first.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This is a set of fixes that resolves the issues encountered on Hive following #18495.
The fix here makes the SharedDomain update data available to the RPC code via a kvcache.Cache object. This is a temporary fix which I will rationalize while testing this branch, which I am currently working on: https://github.com/erigontech/erigon/tree/add_execution_context_with_caches.
The main issues with the current fix are:
I don't think issue 3 is new; I think it has been there since application-level updates were introduced to SharedDomains. It may be that we need to include/use the accumulator within the execution domain, or it may be that we need to remove it and use a cache object. I have not done this code analysis yet, but the change points in this PR will likely be where the affected code needs to be changed.
I think we also need to review how changes get propagated to remote RPCs. But all of that is for another PR; this is just enough change to make all of the test cases pass.
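To make the description above concrete, here is a very rough sketch of the publish/read split (invented API, not the real kvcache.Cache plumbing referenced in the PR): the execution side publishes its not-yet-committed SharedDomain updates, and the RPC side reads through them before falling back to the database.

```go
import "sync"

// execCache is illustrative only, not the real kvcache.Cache.
type execCache struct {
	mu      sync.RWMutex
	txNum   uint64            // txNum the cached state corresponds to
	updates map[string][]byte // key -> latest value taken from SharedDomains
}

// publish is called by the execution layer once a block has been executed but
// before its state is committed to the database.
func (c *execCache) publish(txNum uint64, updates map[string][]byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.txNum = txNum
	c.updates = updates
}

// get is called from the RPC layer; ok == false means "fall back to the DB",
// which closes the window where RPC reads race a delayed commit.
func (c *execCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	val, ok := c.updates[key]
	return val, ok
}
```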