
Optimize Stream Sync Peer Discovery: Reduce Startup Time to Seconds #4937

Merged
GheisMohammadi merged 5 commits into dev from improve/discovery on Mar 18, 2026

@GheisMohammadi
Collaborator

Summary

This PR addresses the critical issue where stream sync takes ~15 minutes to find enough peers on node startup, while restarting the service allows much faster peer discovery. The solution implements a comprehensive peer discovery optimization strategy with startup mode, network-aware limits, and improved timing algorithms.

Problem Statement

Issue: Stream sync on node startup takes approximately 15 minutes to find minimum streams and begin syncing, while restarting the service enables much faster peer discovery.

Root Causes:

  • 30-second discovery intervals combined with cooldowns create long delays
  • Hardcoded peer discovery limits not optimized for different networks
  • No early exit mechanism when sufficient peers are found
  • Inefficient timing algorithms with aggressive backoff

Solution Overview

1. Startup Mode Implementation

  • Faster Initial Discovery: Reduced timeouts and intervals during first 10 minutes
  • Early Exit: Automatically exits startup mode when enough streams are found
  • Adaptive Timing: Different timing strategies for startup vs normal operation

2. Network-Aware Peer Discovery

  • Configuration-Aligned: Target valid peers match actual stream sync requirements
  • Network-Specific Limits: Different DHT request limits per network type
  • Realistic Expectations: Success rates based on actual DHT behavior

3. Improved Timing Algorithms

  • Centralized Constants: All timing values moved to const.go
  • Adaptive Sleep: Increase sleep when peers found, decrease when none found
  • Better Backoff: 30% decrease instead of aggressive /= 2

Configuration Alignment

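| Network | InitStreams | Target Valid | DHT Request | Success Rate |
|---|---|---|---|---|
| Mainnet | 8 | 8 | 20 | 40% |
| Testnet | 3 | 3 | 10 | 30% |
| Pangaea | 3 | 3 | 10 | 30% |
| Partner | 3 | 3 | 10 | 30% |
| Stressnet | 4 | 4 | 12 | 33% |
| Devnet | 4 | 4 | 12 | 33% |
| Localnet | 4 | 4 | 12 | 33% |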
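
As an illustration, the table can be expressed as a lookup keyed by network. The numbers are taken from the table; the type `discoveryLimits`, the string keys, and the helper names are assumptions rather than the PR's actual API. The Success Rate column is consistent with Target Valid divided by DHT Request, which the sketch recomputes.

```go
package main

import "fmt"

// discoveryLimits (hypothetical) groups the per-network values from the table.
type discoveryLimits struct {
	initStreams      int
	targetValidPeers int
	dhtRequestLimit  int
}

// limitsByNetwork mirrors the Configuration Alignment table.
var limitsByNetwork = map[string]discoveryLimits{
	"mainnet":   {initStreams: 8, targetValidPeers: 8, dhtRequestLimit: 20},
	"testnet":   {initStreams: 3, targetValidPeers: 3, dhtRequestLimit: 10},
	"pangaea":   {initStreams: 3, targetValidPeers: 3, dhtRequestLimit: 10},
	"partner":   {initStreams: 3, targetValidPeers: 3, dhtRequestLimit: 10},
	"stressnet": {initStreams: 4, targetValidPeers: 4, dhtRequestLimit: 12},
	"devnet":    {initStreams: 4, targetValidPeers: 4, dhtRequestLimit: 12},
	"localnet":  {initStreams: 4, targetValidPeers: 4, dhtRequestLimit: 12},
}

// limitsFor returns the limits for a network and whether the network is known.
func limitsFor(network string) (discoveryLimits, bool) {
	l, ok := limitsByNetwork[network]
	return l, ok
}

// successRatePercent is the share of DHT results that must be valid peers,
// rounded down to a whole percent.
func successRatePercent(l discoveryLimits) int {
	return l.targetValidPeers * 100 / l.dhtRequestLimit
}

func main() {
	for _, name := range []string{"mainnet", "testnet", "devnet"} {
		l, _ := limitsFor(name)
		fmt.Printf("%s: request %d, need %d valid (%d%%)\n",
			name, l.dhtRequestLimit, l.targetValidPeers, successRatePercent(l))
	}
}
```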

@GheisMohammadi GheisMohammadi self-assigned this Aug 8, 2025
@GheisMohammadi GheisMohammadi changed the title from "Optimize Stream Sync Peer Discovery: Reduce Startup Time from 15 Minutes to Seconds" to "Optimize Stream Sync Peer Discovery: Reduce Startup Time to Seconds" Aug 8, 2025
@GheisMohammadi GheisMohammadi marked this pull request as draft August 8, 2025 02:13
- Add SetEnoughStreamsCallback method to Operator interface
- Implement SetEnoughStreamsCallback in streamManager struct
- Add enoughStreamsCallback field to streamManager
- Call callback when softHaveEnoughStreams() is true
- Enables early exit from startup mode when enough streams are found
- Add startup mode with faster timings for initial peer discovery
- Centralize all timing constants in const.go
- Add network-specific DHT request limits and target valid peer counts
- Align target valid peers with stream sync configuration requirements
- Improve adaptive sleep logic with SleepDecreaseRatio
- Add comprehensive timing constants for startup vs normal modes
- Add startup mode with faster timings for initial peer discovery
- Implement early exit from startup mode when enough streams found
- Add network-specific peer discovery limits and validation
- Improve advertise() function with better error handling and logging
- Add isValidPeer() stub for future peer validation logic
- Use configurable DHT request limits instead of hardcoded values
- Add target valid peer counting and filtering
- Improve advertiseLoop() with adaptive timing based on discovery success
- Add TestProtocol_Advertise with mock discovery and proper setup
- Add TestProtocol_AdvertiseLoop with timing validation
- Add TestProtocol_ExitStartupMode for startup mode functionality
- Add TestProtocol_GetDHTRequestLimit for network-specific limits
- Add TestProtocol_GetTargetValidPeers for target peer validation
- Fix interface compilation errors by adding SetEnoughStreamsCallback stubs
- Add mockDiscovery struct to properly mock discovery interface
- Update tests to handle nil pointer scenarios and proper initialization
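
The callback wiring named in the commit notes above might look roughly like this. The identifiers `streamManager`, `SetEnoughStreamsCallback`, `enoughStreamsCallback`, and `softHaveEnoughStreams` come from those notes; the `func()` signature, the `softLoCap` threshold, and the `addStream` method are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// streamManager is a reduced stand-in for the real type; only the
// callback-related parts are sketched here.
type streamManager struct {
	mu                    sync.Mutex
	streams               int
	softLoCap             int    // hypothetical "enough streams" threshold
	enoughStreamsCallback func() // assumed signature
}

// SetEnoughStreamsCallback registers the function to call once enough
// streams are available.
func (sm *streamManager) SetEnoughStreamsCallback(cb func()) {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	sm.enoughStreamsCallback = cb
}

// softHaveEnoughStreams must be called with sm.mu held.
func (sm *streamManager) softHaveEnoughStreams() bool {
	return sm.streams >= sm.softLoCap
}

// addStream (hypothetical) records a new stream and notifies the listener
// when the threshold is met. The callback runs outside the lock and may fire
// on every later add, so it should be idempotent.
func (sm *streamManager) addStream() {
	sm.mu.Lock()
	sm.streams++
	var cb func()
	if sm.softHaveEnoughStreams() {
		cb = sm.enoughStreamsCallback
	}
	sm.mu.Unlock()
	if cb != nil {
		cb()
	}
}

func main() {
	sm := &streamManager{softLoCap: 2}
	sm.SetEnoughStreamsCallback(func() {
		fmt.Println("enough streams: exit startup mode")
	})
	sm.addStream() // below the threshold, nothing is printed
	sm.addStream() // threshold reached, the callback runs
}
```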
@GheisMohammadi GheisMohammadi marked this pull request as ready for review March 11, 2026 14:25
@GheisMohammadi GheisMohammadi requested review from Frozen and mur-me March 11, 2026 14:25
@mur-me
Collaborator

mur-me commented Mar 12, 2026

I've tested this PR on the 9 devnet RPC nodes, and the results are promising:

| Hostname | Sync Type | Started At (UTC) | Finished At (UTC) | Duration |
|---|---|---|---|---|
| dco-exptest0-01 | sync | 2026-03-12T08:18:54 | 2026-03-12T08:29:21 | 10m 26s |
| dco-exptest0-01 | epochsync | 2026-03-12T08:18:54 | 2026-03-12T08:18:54 | <1s |
| dco-exptest0-02 | sync | 2026-03-12T08:20:34 | 2026-03-12T08:25:04 | 4m 30s |
| dco-exptest0-02 | epochsync | 2026-03-12T08:20:34 | 2026-03-12T08:32:04 | 11m 30s |
| dco-exptest0-03 | sync | 2026-03-12T08:21:36 | 2026-03-12T08:27:35 | 5m 59s |
| dco-exptest0-03 | epochsync | 2026-03-12T08:21:36 | 2026-03-12T08:24:34 | 2m 58s |
| dco-exptest0-04 | sync | 2026-03-12T08:23:07 | 2026-03-12T08:33:40 | 10m 33s |
| dco-exptest0-04 | epochsync | 2026-03-12T08:23:07 | 2026-03-12T08:31:47 | 8m 40s |
| dco-exptest0-05 | sync | 2026-03-12T08:24:47 | 2026-03-12T08:27:18 | 2m 31s |
| dco-exptest0-05 | epochsync | 2026-03-12T08:24:47 | 2026-03-12T08:33:48 | 9m 01s |
| dco-exptest1-01 | sync | 2026-03-12T08:25:49 | 2026-03-12T08:37:06 | 11m 17s |
| dco-exptest1-01 | epochsync | 2026-03-12T08:25:49 | 2026-03-12T08:26:43 | 54s |
| dco-exptest1-02 | sync | 2026-03-12T08:27:35 | 2026-03-12T08:27:59 | 24s |
| dco-exptest1-02 | epochsync | 2026-03-12T08:27:35 | 2026-03-12T08:27:59 | 24s |
| dco-exptest1-03 | sync | 2026-03-12T08:29:14 | 2026-03-12T08:35:13 | 5m 59s |
| dco-exptest1-03 | epochsync | 2026-03-12T08:29:14 | 2026-03-12T08:30:13 | 59s |
| dco-exptest1-04 | sync | 2026-03-12T08:30:54 | 2026-03-12T08:31:23 | 29s |
| dco-exptest1-04 | epochsync | 2026-03-12T08:30:54 | 2026-03-12T08:30:54 | <1s |

And here is how it looked before, after the previous update on 9 March:

| Hostname | Sync Type | Started At | Finished At | Duration |
|---|---|---|---|---|
| dco-exptest0-01 | epochsync | 2026-03-09 14:32:00 | 2026-03-09 14:32:20 | 20s |
| dco-exptest0-01 | sync | 2026-03-09 14:32:00 | 2026-03-09 16:35:30 | 2h 03m 30s |
| dco-exptest0-02 | epochsync | 2026-03-09 14:36:00 | 2026-03-09 18:37:15 | 4h 01m 15s |
| dco-exptest0-02 | sync | 2026-03-09 14:36:00 | 2026-03-09 16:44:15 | 2h 08m 15s |
| dco-exptest0-03 | epochsync | 2026-03-09 14:37:30 | 2026-03-09 14:38:15 | 45s |
| dco-exptest0-03 | sync | 2026-03-09 14:37:30 | 2026-03-09 16:47:15 | 2h 09m 45s |
| dco-exptest0-04 | epochsync | 2026-03-09 14:36:45 | 2026-03-09 14:38:45 | 2m |
| dco-exptest0-04 | sync | 2026-03-09 14:36:45 | 2026-03-09 15:33:45 | 57m |
| dco-exptest0-05 | epochsync | 2026-03-09 14:40:00 | 2026-03-09 14:42:30 | 2m 30s |
| dco-exptest0-05 | sync | 2026-03-09 14:40:00 | 2026-03-09 15:00:30 | 20m 30s |
| dco-exptest1-01 | epochsync | 2026-03-09 14:37:30 | 2026-03-09 14:38:15 | 45s |
| dco-exptest1-01 | sync | 2026-03-09 14:37:30 | 2026-03-09 14:39:15 | 1m 45s |
| dco-exptest1-02 | epochsync | 2026-03-09 14:36:00 | 2026-03-09 14:38:00 | 2m |
| dco-exptest1-02 | sync | 2026-03-09 14:36:00 | 2026-03-09 14:49:00 | 13m |
| dco-exptest1-03 | epochsync | 2026-03-09 14:38:30 | 2026-03-09 14:39:30 | 1m |
| dco-exptest1-03 | sync | 2026-03-09 14:38:30 | 2026-03-09 14:49:30 | 11m |
| dco-exptest1-04 | epochsync | 2026-03-09 14:39:30 | 2026-03-09 14:40:15 | 45s |
| dco-exptest1-04 | sync | 2026-03-09 14:39:30 | 2026-03-09 14:57:15 | 17m 45s |

@mur-me
Collaborator

mur-me commented Mar 12, 2026

The percentage improvements are below.
NOTE: anything under 10-15 minutes can be treated as a good result, so ignore the negative percentages.

| Hostname | Sync Type | Old Duration | New Duration | Improvement | Note |
|---|---|---|---|---|---|
| dco-exptest0-01 | sync | 2h 03m 30s | 10m 26s | 91% | |
| dco-exptest0-01 | epochsync | 20s | <1s | 95% | |
| dco-exptest0-02 | sync | 2h 08m 15s | 4m 30s | 96% | |
| dco-exptest0-02 | epochsync | 4h 01m 15s | 11m 30s | 95% | |
| dco-exptest0-03 | sync | 2h 09m 45s | 5m 59s | 95% | |
| dco-exptest0-03 | epochsync | 45s | 2m 58s | -295% | absolute result is good |
| dco-exptest0-04 | sync | 57m | 10m 33s | 81% | |
| dco-exptest0-04 | epochsync | 2m | 8m 40s | -333% | absolute result is good |
| dco-exptest0-05 | sync | 20m 30s | 2m 31s | 87% | |
| dco-exptest0-05 | epochsync | 2m 30s | 9m 01s | -260% | absolute result is good |
| dco-exptest1-01 | sync | 1m 45s | 11m 17s | -544% | absolute result is good |
| dco-exptest1-01 | epochsync | 45s | 54s | -20% | |
| dco-exptest1-02 | sync | 13m | 24s | 96% | |
| dco-exptest1-02 | epochsync | 2m | 24s | 80% | |
| dco-exptest1-03 | sync | 11m | 5m 59s | 45% | |
| dco-exptest1-03 | epochsync | 1m | 59s | 1% | |
| dco-exptest1-04 | sync | 17m 45s | 29s | 97% | |
| dco-exptest1-04 | epochsync | 45s | <1s | 98% | |

@mur-me
Collaborator

mur-me commented Mar 12, 2026

And overall improvement:

| Shard | Sync Type | Avg Old Duration | Avg New Duration | Improvement |
|---|---|---|---|---|
| 0 | sync | 1h 27m 24s | 6m 07s | ~93% faster |
| 0 | epochsync | 48m 58s | 4m 45s | ~90% faster |
| 1 | sync | 7m 55s | 3m 12s | ~59% faster |
| 1 | epochsync | 1m 03s | 34s | ~46% faster |

@mur-me
Collaborator

mur-me commented Mar 12, 2026

Additional thanks for the `Exiting startup mode early - sufficient peers found` log message. Now I can easily calculate from the logs how long it takes to add enough streams.

LogQL query:

{ip=~"$instance"}
|= `[StreamManager] initialized with trusted peer checking enabled`
  or `Exiting startup mode early - sufficient peers found`
| json
| logfmt
| protocol_ID=~"$topic" or Protocol=~"$topic"
| line_format "{{.protocol_ID}}{{.Protocol}} topic on {{.host}} {{.ip}} {{.message}}"

Collaborator

@mur-me mur-me left a comment

LGTM, good improvement on stream discovery speed after the node restart

@GheisMohammadi GheisMohammadi merged commit 0d238a7 into dev Mar 18, 2026
3 checks passed

Labels

libp2p Peer to Peer networking p2p_stream stream-sync

2 participants