Skip to content

chore(spanner): use channel affinity#13231

Open
rahul2393 wants to merge 6 commits into
mainfrom
use-channel-affinity
Open

chore(spanner): use channel affinity#13231
rahul2393 wants to merge 6 commits into
mainfrom
use-channel-affinity

Conversation

@rahul2393
Copy link
Copy Markdown
Contributor

Internal reference: go/grpc-gcp-fixes#bookmark=id.q4x3oa8l672

@rahul2393 rahul2393 requested review from a team as code owners May 20, 2026 04:46
}

public static TracingFramework getActiveTracingFramework() {
synchronized (lock) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: this change is unrelated, but since it was small perf change I included it here, this was on critical request path being shown up in benchmark mutex profile

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the use of logical affinity keys and explicit unbinding with direct channel ID affinity using AtomicReference for Spanner operations. This change simplifies the integration with grpc-gcp by removing the need for affinity-key map management and the associated cleanup logic. Additionally, SpannerOptions was updated to use a volatile variable for the active tracing framework, removing unnecessary synchronization. I have no feedback to provide.

@rahul2393
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a ChannelAffinityRef mechanism to GcpManagedChannel to facilitate sticky channel routing without the need for an internal affinity-key map. This new approach is integrated into the Spanner client, replacing the previous CHANNEL_HINT and UNBIND_CHANNEL_HINT logic, and allows for more direct channel management during retries. The changes also include the removal of the GrpcGcpAffinityUtil class, the deprecation of explicit affinity cleanup methods in SpannerRpc, and updates to SpannerOptions for more efficient tracing framework access. Regarding the feedback, a potential NullPointerException was identified in GcpManagedChannel.newCall when no channels are available; it is recommended to handle this case by returning a NoopGcpClientCall.

Comment on lines +2023 to +2024
return new GcpClientCall.SimpleGcpClientCall<>(
this, getChannelRefByAffinityRef(channelAffinityRef), methodDescriptor, callOptions);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

getChannelRefByAffinityRef can return null if no channels are available (e.g., if pickLeastBusyChannel returns null). This would cause a NullPointerException inside the SimpleGcpClientCall constructor, which likely doesn't expect a null ChannelRef.

The existing code for handling affinity by key (getChannelRef(affinityKey)) checks for a null ChannelRef and returns a NoopGcpClientCall in that case. A similar null check should be added here to gracefully handle the case where no channel is available.

      ChannelRef channelRef = getChannelRefByAffinityRef(channelAffinityRef);
      if (channelRef == null) {
        logger.log(Level.WARNING, log("No channel available for affinity ref, this should not happen"));
        return new GcpClientCall.NoopGcpClientCall<>(
            Status.UNAVAILABLE.withDescription(
                "No channel available for affinity ref, this should not happen"));
      }
      return new GcpClientCall.SimpleGcpClientCall<>(this, channelRef, methodDescriptor, callOptions);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed:
pickLeastBusyChannel never returns null and NoopGcpClientCall doesn't exist — same null-handling (none) as the sibling getChannelRef(key) path, so no change needed.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the CHANNEL_HINT and UNBIND_CHANNEL_HINT mechanism with a new ChannelAffinityRef system for managing sticky channel routing in gRPC-GCP. This change eliminates the need for gRPC-GCP to maintain internal affinity-key maps for Spanner operations. Additionally, SpannerOptions was optimized by replacing synchronized access to the tracing framework with a volatile field. Review feedback identified a potential logic error in the channel selection loop where the 'use different channel' flag could be lost during retries, and recommended adding activity checks when picking fallback channels.

Comment on lines +1704 to +1718
while (true) {
ChannelRef channelRef = affinityRef.channelRef.get();
boolean useDifferentChannel = affinityRef.useDifferentChannelOnNextCall.getAndSet(false);
if (!useDifferentChannel && channelRef != null && channelRef.isActive()) {
return channelRef;
}

ChannelRef selectedChannelRef =
useDifferentChannel
? pickLeastBusyChannelDifferentFrom(channelRef)
: pickLeastBusyChannel(/* forFallback= */ false);
if (affinityRef.channelRef.compareAndSet(channelRef, selectedChannelRef)) {
return selectedChannelRef;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The useDifferentChannelOnNextCall.getAndSet(false) call is located inside the while (true) loop. If the compareAndSet operation fails due to contention, the useDifferentChannel flag will be false in the next iteration, potentially losing the intent to pick a different channel. Moving the getAndSet call outside the loop ensures the flag is correctly handled even if retries occur. Additionally, once the reference has changed (causing the CAS to fail), the "different channel" requirement is effectively satisfied for the next iteration.

    boolean useDifferentChannel = affinityRef.useDifferentChannelOnNextCall.getAndSet(false);
    while (true) {
      ChannelRef channelRef = affinityRef.channelRef.get();
      if (!useDifferentChannel && channelRef != null && channelRef.isActive()) {
        return channelRef;
      }

      ChannelRef selectedChannelRef =
          useDifferentChannel
              ? pickLeastBusyChannelDifferentFrom(channelRef)
              : pickLeastBusyChannel(/* forFallback= */ false);
      if (affinityRef.channelRef.compareAndSet(channelRef, selectedChannelRef)) {
        return selectedChannelRef;
      }
      useDifferentChannel = false;
    }
References
  1. In performance-sensitive code, prefer using explicit locks or atomic operations over the 'synchronized' keyword to protect shared state while ensuring thread safety and visibility.
  2. Ensure that state flags in channel management logic are thread-safe, using constructs like AtomicBoolean or AtomicInteger.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both versions current/suggested consume the flag once on the first iteration and drop it after — the suggested rewrite is the same behavior with extra lines.

Comment on lines +1729 to +1733
for (ChannelRef candidate : channelRefs) {
if (candidate != excludedChannelRef) {
return candidate;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback loop in pickLeastBusyChannelDifferentFrom returns the first candidate that is not the excluded one, which might not be the least busy channel. Additionally, it should verify that the candidate is still active using candidate.isActive(), as channels might be marked inactive before being removed from the channelRefs list.

Suggested change
for (ChannelRef candidate : channelRefs) {
if (candidate != excludedChannelRef) {
return candidate;
}
}
for (ChannelRef candidate : channelRefs) {
if (candidate != excludedChannelRef && candidate.isActive()) {
return candidate;
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale review

@rahul2393
Copy link
Copy Markdown
Contributor Author

Screenshot 2026-05-20 at 7 11 55 PM

Prober running stale_read at 400QPS on DirectPath shows 200us save

@rahul2393 rahul2393 force-pushed the use-channel-affinity branch from 1f07a11 to 36329d8 Compare May 20, 2026 14:27
@rahul2393
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new ChannelAffinityRef mechanism in GcpManagedChannel to replace the previous hint-based affinity system, simplifying channel lifecycle management for Cloud Spanner. The changes remove the need for explicit affinity unbinding and the GrpcGcpAffinityUtil class. Additionally, SpannerOptions was updated to use a volatile field for the tracing framework to improve concurrency. A critical review comment points out that getChannelRefByAffinityRef could return null, potentially causing a NullPointerException in newCall, and suggests throwing an exception instead to maintain invariants.

? pickLeastBusyChannelDifferentFrom(channelRef)
: pickLeastBusyChannel(/* forFallback= */ false);
if (affinityRef.channelRef.compareAndSet(channelRef, selectedChannelRef)) {
return selectedChannelRef;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

If selectedChannelRef is null, this method can return null. The caller, newCall, passes this to the GcpClientCall constructor, which will cause a NullPointerException as it does not expect a null ChannelRef. This method should throw an exception if no channel is available instead of returning null to maintain the expected invariant.

References
  1. Avoid adding defensive null checks for values that are guaranteed to be non-null by design, as this can hide invariant breaks. Instead, ensure the method returns a valid object or throws an exception.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither pickLeastBusyChannel nor pickLeastBusyChannelDifferentFrom can return null — empty-pool case throws on channelRefs.get(0) long before this point, so the NPE path doesn't exist.

@rahul2393 rahul2393 requested a review from olavloite May 20, 2026 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant