Skip to content

This PR fixes UNPREPARED handling in the Python driver execution path…#725

Open
calebxyz wants to merge 1 commit intomasterfrom
scylladb-27657-v2
Open

This PR fixes UNPREPARED handling in the Python driver execution path…#725
calebxyz wants to merge 1 commit intomasterfrom
scylladb-27657-v2

Conversation

@calebxyz
Copy link

@calebxyz calebxyz commented Mar 4, 2026

Problem

When the server returned UNPREPARED (PreparedQueryNotFound) during an EXECUTE, the driver could fail to recover in some valid cases (cache miss, race, coordinator restart, id mismatch edge cases or crushes), even when the in-flight request still had enough PreparedStatement context to recover.

Expected behavior

If the driver still has the original PreparedStatement context, it should:

  1. issue PREPARE,
  2. then retry the original EXECUTE, regardless of why the coordinator no longer recognizes the prepared id.

What this PR changes

In ResponseFuture PreparedQueryNotFound handling:

  1. Cache the in-flight self.prepared_statement first (when present).
  2. Attempt lookup by the returned UNPREPARED id from cluster._prepared_statements.
  3. Reprepare using the resolved statement.
  4. If returned-id lookup fails:
    • fallback to in-flight context if available,
    • otherwise fail as unknown prepared statement.

This preserves correctness while improving recovery in real-world coordinator invalidation/race/restart scenarios.

Tests

Added/updated unit coverage in tests/unit/test_response_future.py:

  • validates reprepare path for PreparedQueryNotFound,
  • validates fallback to in-flight context,
  • validates preference for statement resolved by returned UNPREPARED id when available.

Fixes: scylladb/scylladb#27657

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the Python driver’s recovery behavior when an EXECUTE receives an UNPREPARED (PreparedQueryNotFound) error, allowing the driver to re-PREPARE and retry using either the returned query id (preferred) or the in-flight prepared statement context (fallback).

Changes:

  • Update ResponseFuture._set_result() to cache in-flight prepared_statement, resolve by returned UNPREPARED id when possible, and fall back to in-flight context when not.
  • Adjust logging for unknown-prepared and mismatch/fallback scenarios.
  • Add unit tests covering fallback to local context and preference for returned-id cache resolution.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
cassandra/cluster.py Updates UNPREPARED handling in ResponseFuture._set_result() to improve reprepare/retry correctness across cache-miss and id-mismatch scenarios.
tests/unit/test_response_future.py Adds unit coverage for the new UNPREPARED resolution and fallback behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

else:
query_id = response.info
# Cache local in-flight context first so lookup by either id can succeed.
self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds to cluster._prepared_statements via direct dict assignment. The Cluster has add_prepared() which wraps updates in _prepared_statement_lock (and _prepared_statements is a WeakValueDictionary, which is not thread-safe). To avoid races with concurrent prepares/updates, use self.session.cluster.add_prepared(...) (or take _prepared_statement_lock) instead of writing the dict directly.

Suggested change
self.session.cluster._prepared_statements[self.prepared_statement.query_id] = self.prepared_statement
self.session.cluster.add_prepared(self.prepared_statement)

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well almost right the api is def add_prepared(self, query_id, prepared_statement):

if not self.prepared_statement:
log.error("Tried to execute unknown prepared statement: id=%s",
query_id.encode('hex'))
log.error(f"Tried to execute unknown prepared statement: id={hexlify(query_id)}")
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new log.error uses an f-string with hexlify(query_id), which eagerly formats even if the log level is disabled and also renders the bytes representation (e.g., b'616263') in the message. Prefer logger parameterized formatting (e.g., log.error("... %s", ...)) and consider decoding/formatting the hex value to a plain string for readability/consistency with other messages (e.g., .decode('ascii') or query_id.hex()).

Suggested change
log.error(f"Tried to execute unknown prepared statement: id={hexlify(query_id)}")
log.error(
"Tried to execute unknown prepared statement: id=%s",
hexlify(query_id).decode('ascii'),
)

Copilot uses AI. Check for mistakes.
@Lorak-mmk Lorak-mmk self-requested a review March 4, 2026 10:36
@calebxyz calebxyz force-pushed the scylladb-27657-v2 branch from 0f3737f to 8ed0715 Compare March 4, 2026 10:36
@calebxyz calebxyz force-pushed the scylladb-27657-v2 branch 2 times, most recently from c7706e5 to 3d3dc24 Compare March 4, 2026 10:45
… for prepared statements.

  ### Problem

  When the server returned UNPREPARED (PreparedQueryNotFound) during an EXECUTE, the driver could fail to recover in some valid cases (cache miss, race, coordinator restart, id mismatch edge cases), even when the in-flight request still had enough PreparedStatement context to recover.

  ### Expected behavior

  If the driver still has the original PreparedStatement context, it should:

  1. issue PREPARE,
  2. then retry the original EXECUTE,
     regardless of why the coordinator no longer recognizes the prepared id.

  ### What this PR changes

  In ResponseFuture PreparedQueryNotFound handling:

  1. Cache the in-flight self.prepared_statement first (when present).
  2. Attempt lookup by the returned UNPREPARED id from cluster._prepared_statements.
  3. Reprepare using the resolved statement.
  4. If returned-id lookup fails:
      - fallback to in-flight context if available,
      - otherwise fail as unknown prepared statement.

  This preserves correctness while improving recovery in real-world coordinator invalidation/race/restart scenarios.

  ### Tests

  Added/updated unit coverage in tests/unit/test_response_future.py:

  - validates reprepare path for PreparedQueryNotFound,
  - validates fallback to in-flight context,
  - validates preference for statement resolved by returned UNPREPARED id when available.

Fixes: scylladb/scylladb#27657
@calebxyz calebxyz force-pushed the scylladb-27657-v2 branch from 3d3dc24 to 9815e0f Compare March 4, 2026 10:48
rf.prepared_statement.query_string = "SELECT * FROM foobar"
rf.prepared_statement.keyspace = "FooKeyspace"

# Different query id in UNPREPARED response should not prevent reprepare when local context exists.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? Prepared id suddenly changing seems like a serious protocol violation. What are the scenarios when it can happen?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because thats what i understand , whats expected , there can be a crush a race with the server , and then the prepared statement will be evicted and we need to rerun it.
it has lower performance but returning an error is incorrect.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not questioning the need for automatic repreparation. It is a well-known mechanism, and all drivers do this, including Python driver before your changes.
What this review comment is about is about prepared statements id. Those IDs are constant - dependent on query string, and connection keyspace. All existing drivers depend on that.

Statement getting evicted from the cache is a known and handled scenario, but the id doesn't change in this scenario, even across restarts.

Comment on lines 4716 to +4752
@@ -4745,11 +4743,13 @@ def _set_result(self, host, connection, pool, response):
(current_keyspace, prepared_keyspace)))
return

log.debug("Re-preparing unrecognized prepared statement against host %s: %s",
host, prepared_statement.query_string)
prepared_keyspace = prepared_statement.keyspace \
log.debug(
"Re-preparing unrecognized prepared statement against host %s: %s",
host, self.prepared_statement.query_string
)
prepared_keyspace = self.prepared_statement.keyspace \
if ProtocolVersion.uses_keyspace_flag(self.session.cluster.protocol_version) else None
prepare_message = PrepareMessage(query=prepared_statement.query_string,
prepare_message = PrepareMessage(query=self.prepared_statement.query_string,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at those changes and I don't see what you are trying to achieve.
In PR description you wrote:

Expected behavior

If the driver still has the original PreparedStatement context, it should:

  1. issue PREPARE,
  2. then retry the original EXECUTE, regardless of why the coordinator no longer recognizes the prepared id.

But this is what the driver already does?

  1. Cache the in-flight self.prepared_statement first (when present).

Why does it need to be done first?

In general the existing logic looks sound to me:

  • In first step it determines id of preprared statement, trying both self.prepared_statement and response.info, and also validating protocol assumptions.
  • Then it resolves prepared statement object, also trying both cluster-scoped cache (self.session.cluster._prepared_statements) and self.prepared_statement. It also adds self.prepared_statement to cache in that case.

So, what problem is this PR trying to solve? In what scenario does existing logic fail?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of this scenarios scylladb/scylladb#27657
the driver should retry but it doesn't

Copy link

@Lorak-mmk Lorak-mmk Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linked issue describes the need for automatic reprepare. Driver already had mechanism for automatic repreparation - your PR doesn't add it, just modifies its logic. Which is why I'm asking what EXACTLY was wrong with the previous logic, how exactly it not working.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd have a test that fails before the fix and passes after it. May not be easy to reproduce perhaps.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd have a test that fails before the fix and passes after it.

Sure, but for now I'd be happy with a textual reasoning, so that I can understand the changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lorak-mmk , here is the part of the mentioned discussion: scylladb/scylladb#27657 (comment)

If I understand correctly there was an experiment when Scylla returned UNPREPARED = 0x2500 sc to driver and it didnt repreapre the statement. So if driver already has mechanism for automatic repreparation it didnt work during the experiment - scylladb/scylladb#27657 (comment)

Copy link

@Lorak-mmk Lorak-mmk Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I got that. What I don't understand from this PR description is why exactly it didn't work, and how those changes address that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR the logic works roughly like this:

If self.prepared_statement exists, the code forces the query_id to be the local one (self.prepared_statement.query_id).
It then asserts that this id matches the one returned by the server (response.info).
After that it tries to resolve the statement:
It looks it up in the cache using
cluster._prepared_statements[query_id].
If that lookup fails (KeyError), it falls back to the local prepared statement (self.prepared_statement) and stores it in the cache under that query_id.

now this is might be problematic because:

  1. It ignores the server’s returned ID when a local statement exists
    If we already have a local prepared statement, the code overwrites the ID with the local one before doing the lookup.
    That means the ID returned by the server is effectively ignored in this path, even though it might be the correct one to use.

  2. The assertion can break recovery
    The code asserts that the local ID and the server-returned ID must match.
    However, in real situations (for example cache invalidation, server restart or crushes), the server might legitimately return a different ID.
    When that happens, the assertion triggers and kills the flow before the driver has a chance to recover or re-prepare the statement gracefully.

  3. Cache updates may reinforce the wrong mapping
    If the cache lookup fails, the code stores the local prepared statement back into the cache under the current query_id.
    But in this branch the query_id is the local ID, not necessarily the one the server returned.
    Because of this, the logic never explicitly tries the clean path of:
    First resolve using the ID the server returned, and only fall back to the local statement if needed.

In some LWT tests we actually see this behavior, instead of re-sending its just returns does nothing and waits forever

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I got that. What I don't understand from this PR description is why exactly it didn't work, and how those changes address that.

why exactly it didn't work

This was actually a question for the drivers team :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually a question for the drivers team :)

sadly almost 2 months ago :(

@mykaul
Copy link

mykaul commented Mar 4, 2026

Dear @copilot (1) explain what is wrong in the current implementation and how this code change fixes the issue (2) can you come up with a test that fails before the change and passes after?

Copy link

Copilot AI commented Mar 4, 2026

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

@mykaul
Copy link

mykaul commented Mar 4, 2026

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

Oh dear. Oh well. Will take a look.

@calebxyz
Copy link
Author

calebxyz commented Mar 4, 2026

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

Oh dear. Oh well. Will take a look.

@mykaul was my explanation bad ? :) #725 (comment)

@mykaul
Copy link

mykaul commented Mar 4, 2026

@mykaul I've opened a new pull request, #726, to work on those changes. Once the pull request is ready, I'll request review from you.

Oh dear. Oh well. Will take a look.

@mykaul was my explanation bad ? :) #725 (comment)

Honestly - #726 (comment) is not bad as an explanation.

(What I'm still missing is how we can get to a situation of a different ID. That's not very clear to me, but I suspect it's a server-side issue?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

6 participants