
Add local rpc cache access to avoid timing issues due to delayed commits#18575

Merged
mh0lt merged 28 commits into main from add_local_rpc_cache_access
Jan 11, 2026
Conversation

@mh0lt
Contributor

@mh0lt mh0lt commented Jan 6, 2026

This is a set of fixes that resolves the issues encountered on hive post #18495.

The fix here makes the SharedDomain update data available to the RPC code via a kvcache.Cache object. This is a temporary fix which I will rationalize while testing this branch: https://github.com/erigontech/erigon/tree/add_execution_context_with_caches, which I am currently testing.

The main issues with the current fix are:

  1. The LocalCache I introduced needs better management of its initialization.
  2. The SharedDomains name is really confusing and will change to ExecutionContext.
  3. There is now duplicate code between the updates in the SharedDomain objects and the contents of the Accumulator in shards.Notifications.

I don't think 3 is new; I think it has been there since application-level updates were introduced to SharedDomains. It may be that we need to include/use the accumulator within the execution domain, or it may be that we need to remove it and use a cache object. I have not done this code analysis yet, but the change points in the PR will likely be where the affected code needs to be changed.

I think we also need to review how changes get propagated to remote RPCs. But all of that is for another PR; this is just enough change to make all of the test cases pass.
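As a rough sketch of the approach (hypothetical interfaces and names, not the code in this PR or Erigon's kvcache API): the RPC-side state reader checks the local cache for pending, uncommitted values first and only falls back to the committed database state on a miss.

```go
// Illustrative cache-first state reader; LocalCache and DBReader are
// assumed interfaces for this sketch, not Erigon's actual types.
package example

type LocalCache interface {
	// Get returns the latest uncommitted value for a key, if the execution
	// layer has written one that has not yet been committed to the database.
	Get(key []byte) (value []byte, ok bool)
}

type DBReader interface {
	Get(key []byte) ([]byte, error)
}

// CachedStateReader prefers uncommitted execution-layer state over the
// committed database view, avoiding races with delayed commits.
type CachedStateReader struct {
	cache LocalCache
	db    DBReader
}

func (r *CachedStateReader) ReadAccountData(key []byte) ([]byte, error) {
	if r.cache != nil {
		if v, ok := r.cache.Get(key); ok {
			return v, nil // serve the pending value directly
		}
	}
	return r.db.Get(key) // fall back to committed state
}
```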

@yperbasis yperbasis requested a review from Copilot January 7, 2026 08:54
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses timing issues in RPC calls by introducing a local cache mechanism that provides access to uncommitted state data from the execution layer, preventing race conditions when state changes haven't yet been committed to the database.

Key Changes:

  • Added a new Cache type that provides RPC layer access to in-memory execution state
  • Integrated the cache into state reader creation for historical and latest state queries
  • Enhanced error messages for pruned state with more diagnostic information
  • Refactored forkchoice update logic to manage execution context lifecycle
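A minimal sketch of what a Cache/CacheView pair along these lines could look like; the fields and method names below are assumptions for illustration, not the actual types added in execution/execmodule.

```go
// Assumed Cache/CacheView shape: the execution layer writes state into the
// cache as it applies blocks, and the RPC layer takes a read-only view.
package example

import "sync"

type Cache struct {
	mu    sync.RWMutex
	state map[string][]byte
}

func NewCache() *Cache {
	return &Cache{state: make(map[string][]byte)}
}

// Put is called by the execution layer before the data reaches the database.
func (c *Cache) Put(key string, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.state[key] = val
}

// CacheView is a read-only handle handed to the RPC layer for a request.
type CacheView struct{ c *Cache }

func (c *Cache) View() *CacheView { return &CacheView{c: c} }

func (v *CacheView) Get(key string) ([]byte, bool) {
	v.c.mu.RLock()
	defer v.c.mu.RUnlock()
	val, ok := v.c.state[key]
	return val, ok
}
```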

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| rpc/transactions/tracing.go | Added stateCache parameter to ComputeBlockContext and use cached reader when available |
| rpc/rpchelper/helper.go | Implemented CreateHistoryCachedStateReader and cachedHistoryReaderV3 for historical state access |
| rpc/jsonrpc/*.go | Updated calls to pass stateCache parameter through the call chain |
| execution/execmodule/*.go | Added Cache/CacheView types and integrated with EthereumExecutionModule |
| db/state/temporal_mem_batch.go | Changed from single-value to versioned storage to support historical queries |
| node/eth/backend.go | Wired up execmodule.Cache to RPC configuration |
| common/dbg/*.go | Renamed ContextWithDebug to WithDebug for consistency |
Comments suppressed due to low confidence (1)

execution/stagedsync/stageloop/stageloop.go:1

  • Changed from context.WithTimeout(ms.Ctx, 10*time.Second) to context.WithCancel(ms.Ctx). Removing the timeout could cause tests to hang indefinitely if the stream doesn't receive expected messages. Consider if this timeout removal is intentional or if a longer timeout would be more appropriate.
// Copyright 2024 The Erigon Authors
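For context, a minimal standalone contrast of the two setups the comment refers to (illustrative only; not the test code in stageloop.go, and the timeout is shortened so the example terminates quickly):

```go
// With a timeout, a blocked wait gives up on its own; with only a cancel,
// it blocks until something explicitly cancels the context.
package main

import (
	"context"
	"fmt"
	"time"
)

func waitForMessage(ctx context.Context, msgs <-chan string) error {
	select {
	case m := <-msgs:
		fmt.Println("got:", m)
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	msgs := make(chan string) // never receives anything, like a stream with no events

	// old style: the wait is bounded (10s in the real test, 100ms here)
	ctxTimeout, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println("timeout ctx:", waitForMessage(ctxTimeout, msgs))

	// new style: without an external cancel this wait would block forever,
	// so we cancel immediately just to let the example finish
	ctxCancel, cancelNow := context.WithCancel(context.Background())
	cancelNow()
	fmt.Println("cancel ctx:", waitForMessage(ctxCancel, msgs))
}
```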


Comment thread rpc/jsonrpc/receipts/receipts_generator.go Outdated
Comment thread db/state/temporal_mem_batch.go
Comment thread db/state/temporal_mem_batch.go
Comment thread db/state/inverted_index.go Outdated
Comment thread execution/execmodule/forkchoice.go
Comment thread rpc/rpchelper/helper.go Outdated
engine rules.Engine

doingPostForkchoice atomic.Bool
doingPostForkchoice bool

Copilot AI Jan 7, 2026


Changed from atomic.Bool to plain bool, but this field is accessed and modified across goroutines (e.g., in runPostForkchoiceInBackground). Without atomic operations or proper locking for all accesses, this creates a data race.

Contributor Author


This misses the bigger picture. I have also introduced an object-level lock because the currentContext also needs to be protected, so there is no point in maintaining a separate lock here. It's a style choice (i.e. don't mix implicit and explicit locking).
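A rough sketch of the locking style described here (hypothetical struct and method names): a single object-level mutex guards both the flag and the current context, instead of mixing an atomic flag with explicit locking elsewhere.

```go
// One mutex protects everything the post-forkchoice work touches.
package example

import "sync"

type executionModule struct {
	mu                  sync.Mutex
	doingPostForkchoice bool
	currentContext      any // whatever per-forkchoice state needs protecting
}

// tryStartPostForkchoice claims the background slot; returns false if a run
// is already in progress in another goroutine.
func (m *executionModule) tryStartPostForkchoice(ctx any) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.doingPostForkchoice {
		return false
	}
	m.doingPostForkchoice = true
	m.currentContext = ctx
	return true
}

func (m *executionModule) finishPostForkchoice() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.doingPostForkchoice = false
	m.currentContext = nil
}
```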

Comment thread cmd/utils/app/import_cmd.go
Comment thread execution/execmodule/forkchoice.go Outdated
Comment thread db/state/inverted_index.go Outdated
type dataWithTxNum struct {
data []byte
txNum uint64
dir iodir
Collaborator


dir seems unused

Contributor Author


Yes - but this is not new to this PR

if !ok {
return nil, false, nil
}
for i, dataWithTxNum := range dataWithTxNums {
Collaborator


Maybe a `for i := 0; i < len(dataWithTxNums); i++ {` style would be faster here, because dataWithTxNum is non-pointer and will be copied on each iteration.
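A standalone illustration of the two iteration styles (the slice type here is just a stand-in for demonstration):

```go
// Range-by-value copies each struct on every iteration; indexing does not.
package example

type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

func sumByRange(items []dataWithTxNum) (n uint64) {
	for _, d := range items { // d is a fresh copy of the element each time
		n += d.txNum
	}
	return n
}

func sumByIndex(items []dataWithTxNum) (n uint64) {
	for i := 0; i < len(items); i++ { // reads the field in place, no copy
		n += items[i].txNum
	}
	return n
}
```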

latestStateLock sync.RWMutex
domains [kv.DomainLen]map[string]dataWithStep
storage *btree2.Map[string, dataWithStep] // TODO: replace hardcoded domain name to per-config configuration of available Guarantees/AccessMethods (range vs get)
domains [kv.DomainLen]map[string][]dataWithTxNum
Collaborator


This is unclear to me - how many elements will be in this array per key? Where is the limit? Is there no limit?

Contributor Author


There is no limit - so in practice it's determined by the number of updates per key over the lifetime of the SD. Maybe we need to distinguish between fork choice and sync. I think for sync we probably only need the latest; for FC we need all, but it will only be handling a low number of blocks.

I'll address this in the next PR, which deals with caching more generally.
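To make the per-key versioning concrete, here is a minimal sketch under the assumption that every update appends an entry and historical reads select by txNum (illustrative names, not the temporal_mem_batch implementation):

```go
// Versioned in-memory batch: one slice of (txNum, value) entries per key,
// appended in txNum order, unbounded for the lifetime of the batch.
package example

import "sort"

type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

type versionedBatch struct {
	domains map[string][]dataWithTxNum
}

// put appends a new version; callers apply updates in increasing txNum order.
func (b *versionedBatch) put(key string, txNum uint64, data []byte) {
	if b.domains == nil {
		b.domains = make(map[string][]dataWithTxNum)
	}
	b.domains[key] = append(b.domains[key], dataWithTxNum{data: data, txNum: txNum})
}

// getAsOf returns the newest value written at or before txNum.
func (b *versionedBatch) getAsOf(key string, txNum uint64) ([]byte, bool) {
	versions, ok := b.domains[key]
	if !ok {
		return nil, false
	}
	// first version strictly newer than txNum; the one before it is the answer
	i := sort.Search(len(versions), func(i int) bool { return versions[i].txNum > txNum })
	if i == 0 {
		return nil, false // nothing old enough
	}
	return versions[i-1].data, true
}
```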

latestStateLock sync.RWMutex
domains [kv.DomainLen]map[string]dataWithStep
storage *btree2.Map[string, dataWithStep] // TODO: replace hardcoded domain name to per-config configuration of available Guarantees/AccessMethods (range vs get)
domains [kv.DomainLen]map[string][]dataWithTxNum
Collaborator


I think we then also need to change the SizeEstimate() method: from * 4 to * 8.

Contributor Author


ok

@mh0lt mh0lt merged commit a1fd2c9 into main Jan 11, 2026
26 of 30 checks passed
@mh0lt mh0lt deleted the add_local_rpc_cache_access branch January 11, 2026 11:47
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Apr 23, 2026
…#20735)

## Summary

Move the `status != ExecutionStatusSuccess` check above the state-diff
wait loop in `InsertChain` so a rejected block returns promptly instead
of blocking on `stream.Recv()` for events that will never arrive.

## Root cause

In the Hive
[`eels/consume-rlp`](https://hive.ethpandaops.io/#/test/generic/1776837107-92f3a4a260c3454b01b849f4a5979b95)
run against `erigon_default`, **539 tests fail** — all with the same
pattern. Client logs show Erigon correctly rejecting the invalid block:

```
[4/6 Execution] Execution failed  err="invalid block, txnIdx=N, blob gas limit reached"
Cannot update chain head          err="updateForkChoice: [4/6 Execution] invalid block..."
[WARN] flag --externalcl was provided, but no CL requests to engine-api in 2m0s
```

But `erigon import` then hangs in `InsertChain`'s `stream.Recv()` loop
waiting for state-diff events that never arrive for a rejected block.
That blocks the import command long enough for Hive's 180s
`--client.checktimelimit` to fire, so the main erigon process never
starts and Hive reports "unable to connect to client (erigon_default)".

The wait loop (introduced in erigontech#18575, Jan 2026) was ordered with the
status check *after* it:

```go
// old order — unreachable on bad block:
for len(insertedBlocks) > 0 { req, err := stream.Recv(); ... }   // hangs here
...
if status != ExecutionStatusSuccess { return fmt.Errorf(...) }    // never reached
```

The `if status != Success` branch was never reachable for a rejected
block because the loop blocks first.

## Why this only hits 539/43924 tests

Tests whose invalid block fails the structural `HashCheck(true)` check
(empty receipt hash, mismatched tx root, etc.) return early from
`InsertChain` before reaching the stream wait. The 539 failures are all
cases where the block is structurally valid but semantically rejected
during execution:

| count | test |
|---:|---|
| 198 | `test_insufficient_balance_blob_tx_combinations` |
| 149 | `test_invalid_blob_gas_used_in_header` |
| 106 | `test_invalid_block_blob_count` |
| 20  | `test_invalid_layout` |
| 19  | `test_invalid_multi_type_requests` |
| … | (plus `test_*_negative` / `test_system_contract_*`, totaling 47 more) |

E.g. for `test_invalid_block_blob_count`, single-tx variants like `(7,)`
fail structurally and pass the hive test, while multi-tx variants like
`(1,1,5)` — where each tx is valid but the block exceeds the blob limit
— only fail during execution and hit this hang.

## Fix

Check `status` immediately after `UpdateForkChoice` and return an error
before entering the wait loop. The wait loop is now only reached on
success, which is the only path that actually produces state-diff
events. `WriteHeadBlockHash(lvh)` also moves below the check, which is
correct: on rejection `lvh` is the existing head (from
`rawdb.ReadHeadBlockHash` in `updateForkChoice`'s invalid-block branch),
so the old code was either writing a no-op or (more commonly) never
reaching that write anyway because it hung first.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
lupin012 pushed a commit that referenced this pull request Apr 24, 2026