Add local rpc cache access to avoid timing issues due to delayed commits #18575
Conversation
Pull request overview
This PR addresses timing issues in RPC calls by introducing a local cache mechanism that provides access to uncommitted state data from the execution layer, preventing race conditions when state changes haven't yet been committed to the database.
Key Changes:
- Added a new `Cache` type that provides RPC layer access to in-memory execution state
- Integrated the cache into state reader creation for historical and latest state queries
- Enhanced error messages for pruned state with more diagnostic information
- Refactored forkchoice update logic to manage execution context lifecycle
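The first two bullets above describe the core mechanism of the PR. As a rough illustration only (the interface and function names below are invented for this sketch, not the actual `CreateHistoryCachedStateReader`/`cachedHistoryReaderV3` API), state-reader creation can prefer a cache-backed reader when the cache covers the requested transaction number and otherwise fall back to committed DB state:

```go
// Hypothetical types, not Erigon's: a reader-selection helper in the spirit of
// the change described above.
type StateReader interface {
	ReadAccountData(addr [20]byte) ([]byte, error)
}

type StateCache interface {
	HasTxNum(txNum uint64) bool // can the in-memory execution state serve this txNum?
	ReaderAt(txNum uint64) StateReader
}

// newStateReader prefers the uncommitted in-memory state when available,
// avoiding the race where RPC reads land before the commit to the database.
func newStateReader(cache StateCache, db StateReader, txNum uint64) StateReader {
	if cache != nil && cache.HasTxNum(txNum) {
		return cache.ReaderAt(txNum)
	}
	return db
}
```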
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| rpc/transactions/tracing.go | Added stateCache parameter to ComputeBlockContext and use cached reader when available |
| rpc/rpchelper/helper.go | Implemented CreateHistoryCachedStateReader and cachedHistoryReaderV3 for historical state access |
| rpc/jsonrpc/*.go | Updated calls to pass stateCache parameter through the call chain |
| execution/execmodule/*.go | Added Cache/CacheView types and integrated with EthereumExecutionModule |
| db/state/temporal_mem_batch.go | Changed from single-value to versioned storage to support historical queries |
| node/eth/backend.go | Wired up execmodule.Cache to RPC configuration |
| common/dbg/*.go | Renamed ContextWithDebug to WithDebug for consistency |
Comments suppressed due to low confidence (1)
execution/stagedsync/stageloop/stageloop.go:1
- Changed from `context.WithTimeout(ms.Ctx, 10*time.Second)` to `context.WithCancel(ms.Ctx)`. Removing the timeout could cause tests to hang indefinitely if the stream doesn't receive expected messages. Consider if this timeout removal is intentional or if a longer timeout would be more appropriate.
// Copyright 2024 The Erigon Authors
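The suppressed comment concerns removing a timeout around a blocking test wait. A minimal, generic sketch of the pattern the comment argues for (this is not the Erigon test code; `recvWithTimeout` and the 30s bound are invented for illustration): bound the blocking receive so a missing message fails fast instead of hanging the test.

```go
import (
	"context"
	"errors"
	"time"
)

// recvWithTimeout wraps a blocking receive (a stand-in for stream.Recv) in a
// bounded context, so the caller gets an error instead of waiting forever.
func recvWithTimeout(parent context.Context, d time.Duration, recv func() error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- recv() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errors.New("timed out waiting for stream message")
	}
}

// usage in a test, for example:
// err := recvWithTimeout(ctx, 30*time.Second, func() error { _, err := stream.Recv(); return err })
```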
```diff
 	engine rules.Engine

-	doingPostForkchoice atomic.Bool
+	doingPostForkchoice bool
```
Changed from atomic.Bool to plain bool, but this field is accessed and modified across goroutines (e.g., in runPostForkchoiceInBackground). Without atomic operations or proper locking for all accesses, this creates a data race.
This misses the bigger picture: I have also introduced an object-level lock, because the currentContext also needs to be protected, so there is no point in maintaining a separate lock here. It's a style choice (i.e. don't mix implicit and explicit locking).
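A minimal sketch of that style choice, using invented field and method names (the real EthereumExecutionModule fields differ): one object-level mutex guards both the flag and the execution context, instead of mixing an atomic flag with separate locking.

```go
import (
	"context"
	"sync"
)

// forkchoiceState is illustrative only: a single lock protects every field,
// so the flag and the context can never be observed out of sync.
type forkchoiceState struct {
	mu                  sync.Mutex
	doingPostForkchoice bool
	currentContext      context.Context
	cancel              context.CancelFunc
}

// tryStartPostForkchoice reports whether this caller may run the background work.
func (s *forkchoiceState) tryStartPostForkchoice(parent context.Context) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.doingPostForkchoice {
		return false // already in flight
	}
	s.doingPostForkchoice = true
	s.currentContext, s.cancel = context.WithCancel(parent)
	return true
}

func (s *forkchoiceState) finishPostForkchoice() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.doingPostForkchoice = false
	if s.cancel != nil {
		s.cancel()
		s.cancel = nil
	}
}
```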
```go
type dataWithTxNum struct {
	data  []byte
	txNum uint64
	dir   iodir
```
Yes - but this is not new to this PR
```go
if !ok {
	return nil, false, nil
}
for i, dataWithTxNum := range dataWithTxNums {
```
Maybe a `for i := 0; ...; i++ {` style loop would be faster here, because `dataWithTxNum` is a non-pointer struct and will be copied on each iteration.
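Sketch of the suggested index-based style (the real loop body is not shown in the hunk above, so the function below is an invented example; only the iteration pattern matters):

```go
// dataWithTxNum is redeclared here, trimmed to the fields the loop needs, so
// the snippet stands alone; see the struct in the hunk above.
type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

// totalDataLen iterates by index and takes a pointer to each element,
// avoiding the per-iteration struct copy that `for _, d := range ...` implies.
func totalDataLen(dataWithTxNums []dataWithTxNum) int {
	total := 0
	for i := 0; i < len(dataWithTxNums); i++ {
		d := &dataWithTxNums[i]
		total += len(d.data)
	}
	return total
}
```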
```diff
 	latestStateLock sync.RWMutex
-	domains [kv.DomainLen]map[string]dataWithStep
-	storage *btree2.Map[string, dataWithStep] // TODO: replace hardcoded domain name to per-config configuration of available Guarantees/AccessMethods (range vs get)
+	domains [kv.DomainLen]map[string][]dataWithTxNum
```
For me it's unclear: how many elements will be in this array per key? Where is the limit? Is there no limit?
There is no limit, so in practice it's determined by the number of updates per key over the lifetime of the SharedDomains. Maybe we need to distinguish between fork choice and sync: I think for sync we probably only need the latest value, while for fork choice we need all of them, but there we will only be handling a low number of blocks.
I'll address this in the next PR, which deals with caching more generally.
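To make the discussion concrete, here is a rough sketch of the versioned per-key storage being described (invented types and method names, not the actual temporal_mem_batch.go code): every write appends a version, so per-key memory grows with the number of updates until the batch is discarded, and a historical read picks the newest version at or below the requested txNum.

```go
import "sync"

type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

// versionedDomain keeps every version of a key (no limit, as noted above).
type versionedDomain struct {
	mu      sync.RWMutex
	entries map[string][]dataWithTxNum
}

// put appends a new version of key as of txNum.
func (d *versionedDomain) put(key string, val []byte, txNum uint64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.entries[key] = append(d.entries[key], dataWithTxNum{data: val, txNum: txNum})
}

// getAsOf returns the newest version with txNum <= asOf; this is what a
// historical RPC query needs, while plain sync would only read the last element.
func (d *versionedDomain) getAsOf(key string, asOf uint64) ([]byte, bool) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	versions := d.entries[key]
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].txNum <= asOf {
			return versions[i].data, true
		}
	}
	return nil, false
}
```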
```diff
 	latestStateLock sync.RWMutex
-	domains [kv.DomainLen]map[string]dataWithStep
-	storage *btree2.Map[string, dataWithStep] // TODO: replace hardcoded domain name to per-config configuration of available Guarantees/AccessMethods (range vs get)
+	domains [kv.DomainLen]map[string][]dataWithTxNum
```
I think then we also need to change the `SizeEstimate()` method: from `* 4` to `* 8`.
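Continuing with the illustrative types from the sketch above (this is not the actual `SizeEstimate()` method): with versioned entries the estimate has to charge each retained version, including its fixed-size txNum field, which is why the per-element multiplier the reviewer points at grows.

```go
// sizeEstimate is a hypothetical per-map accounting; the real method and its
// constants (the "* 4" / "* 8" mentioned above) live in temporal_mem_batch.go.
func sizeEstimate(entries map[string][]dataWithTxNum) uint64 {
	var total uint64
	for k, versions := range entries {
		total += uint64(len(k))
		for i := range versions {
			total += uint64(len(versions[i].data)) + 8 // +8 bytes for the txNum field
		}
	}
	return total
}
```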
…#20735)

## Summary

Move the `status != ExecutionStatusSuccess` check above the state-diff wait loop in `InsertChain` so a rejected block returns promptly instead of blocking on `stream.Recv()` for events that will never arrive.

## Root cause

In the Hive [`eels/consume-rlp`](https://hive.ethpandaops.io/#/test/generic/1776837107-92f3a4a260c3454b01b849f4a5979b95) run against `erigon_default`, **539 tests fail** — all with the same pattern. Client logs show Erigon correctly rejecting the invalid block:

```
[4/6 Execution] Execution failed err="invalid block, txnIdx=N, blob gas limit reached"
Cannot update chain head err="updateForkChoice: [4/6 Execution] invalid block..."
[WARN] flag --externalcl was provided, but no CL requests to engine-api in 2m0s
```

But `erigon import` then hangs in `InsertChain`'s `stream.Recv()` loop waiting for state-diff events that never arrive for a rejected block. That blocks the import command long enough for Hive's 180s `--client.checktimelimit` to fire, so the main erigon process never starts and Hive reports "unable to connect to client (erigon_default)".

The wait loop (introduced in erigontech#18575, Jan 2026) was ordered with the status check *after* it:

```go
// old order — unreachable on bad block:
for len(insertedBlocks) > 0 { req, err := stream.Recv(); ... } // hangs here
...
if status != ExecutionStatusSuccess { return fmt.Errorf(...) } // never reached
```

The `if status != Success` branch was never reachable for a rejected block because the loop blocks first.

## Why this only hits 539/43924 tests

Tests whose invalid block fails the structural `HashCheck(true)` check (empty receipt hash, mismatched tx root, etc.) return early from `InsertChain` before reaching the stream wait. The 539 failures are all cases where the block is structurally valid but semantically rejected during execution:

| count | test |
|---:|---|
| 198 | `test_insufficient_balance_blob_tx_combinations` |
| 149 | `test_invalid_blob_gas_used_in_header` |
| 106 | `test_invalid_block_blob_count` |
| 20 | `test_invalid_layout` |
| 19 | `test_invalid_multi_type_requests` |
| … | (plus `test_*_negative` / `test_system_contract_*`, totaling 47 more) |

E.g. for `test_invalid_block_blob_count`, single-tx variants like `(7,)` fail structurally and pass the hive test, while multi-tx variants like `(1,1,5)` — where each tx is valid but the block exceeds the blob limit — only fail during execution and hit this hang.

## Fix

Check `status` immediately after `UpdateForkChoice` and return an error before entering the wait loop. The wait loop is now only reached on success, which is the only path that actually produces state-diff events.

`WriteHeadBlockHash(lvh)` also moves below the check, which is correct: on rejection `lvh` is the existing head (from `rawdb.ReadHeadBlockHash` in `updateForkChoice`'s invalid-block branch), so the old code was either writing a no-op or (more commonly) never reaching that write anyway because it hung first.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This is a set of fixes that resolves the issues encountered on Hive following #18495.
The fix here makes the SharedDomain update data available to the RPC code via a kvcache.Cache object. This is a temporary fix which I will rationalize while testing this branch, which I am currently working on: https://github.com/erigontech/erigon/tree/add_execution_context_with_caches.
The main issues with the current fix are:
I don't think issue 3 is new; I think it has been there since application-level updates were introduced to SharedDomains. It may be that we need to include/use the accumulator within the execution domain, or it may be that we need to remove it and use a cache object. I have not done this code analysis yet, but the change points in this PR will likely be where the affected code needs to be changed.
I think we also need to review how changes get propagated to remote RPCs. But all of that is for another PR; this is just enough change to make all of the test cases pass.
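To make the description above concrete, here is a very rough sketch of the publish/read split (invented API, not the real kvcache.Cache plumbing referenced in the PR): the execution side publishes its not-yet-committed SharedDomain updates, and the RPC side reads through them before falling back to the database.

```go
import "sync"

// execCache is illustrative only, not the real kvcache.Cache.
type execCache struct {
	mu      sync.RWMutex
	txNum   uint64            // txNum the cached state corresponds to
	updates map[string][]byte // key -> latest value taken from SharedDomains
}

// publish is called by the execution layer once a block has been executed but
// before its state is committed to the database.
func (c *execCache) publish(txNum uint64, updates map[string][]byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.txNum = txNum
	c.updates = updates
}

// get is called from the RPC layer; ok == false means "fall back to the DB",
// which closes the window where RPC reads race a delayed commit.
func (c *execCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	val, ok := c.updates[key]
	return val, ok
}
```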