hash-based deterministic sampling (duckDB only) #3122

Open
David Eccles (gringer) wants to merge 2 commits into moj-analytical-services:master from gringer:deterministic_u

@gringer

David Eccles (gringer) commented Jun 19, 2026

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

This is an attempt at fixing #2882. The current bernoulli sampling is not deterministic when using DuckDB (as can be demonstrated by repeated runs of the u probability estimation, see here). I have not tested this with any other database backends, so I am not providing a change for them.

Give a brief description for the solution you have provided

The sampling approach is modified to use a hash of the unique_id field (together with the seed, if provided) for ordering the table prior to choosing a sample with a defined row count limit. This produces a fixed sample size that should be reproducible when used with the same input table and seed.
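
To illustrate the idea outside SQL, here is a minimal plain-Python sketch, not the query used in this PR. The helper name `hash_sample`, the choice of SHA-256, and the `seed:unique_id` token format are all assumptions made for the example.

```python
import hashlib


def hash_sample(rows, sample_size, seed=None, id_key="unique_id"):
    """Order rows by a hash of (seed, unique_id) and keep the first sample_size."""

    def sort_key(row):
        # Hypothetical token format: the seed and the ID joined by a colon.
        token = f"{seed}:{row[id_key]}".encode("utf-8")
        return hashlib.sha256(token).hexdigest()

    # The ordering depends only on the IDs and the seed, not on input row order.
    return sorted(rows, key=sort_key)[:sample_size]


rows = [{"unique_id": i, "name": f"person_{i}"} for i in range(1000)]

sample_a = hash_sample(rows, 50, seed=1)
sample_b = hash_sample(list(reversed(rows)), 50, seed=1)  # same rows, shuffled order
sample_c = hash_sample(rows, 50, seed=2)                  # different seed

print(sample_a == sample_b)  # same seed and IDs give the same sample
print(len(sample_a))         # the sample size is fixed by the limit
```

Because the sort key is derived from the row's own ID rather than its position, the sample does not change when the input arrives in a different order, provided the IDs are unique.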

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

@RobinL

Member

Interesting, thanks. In practice, does this definitely give deterministic results? I.e. does it fix the issues you were having before?

@gringer

David Eccles (gringer) commented Jun 19, 2026

Author

It does for u probabilities, at least in the one case that I've tested; I haven't checked further than that.

I got different u probability results from different runs with the bernoulli sampling, and the same results from different runs when doing hash-based sampling.

I've been trying to think through what might work as a test case that exercises the sampling method itself, rather than any higher-level mechanisms:

  1. Carry out sampling on a large table that includes a numeric unique ID
  2. Select the top 100 rows of that sampled table, reverse-sorted by unique ID
  3. Add up the unique ID column from that table
  4. Repeat from step 1, and see if the sum changes
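
The four steps above could be sketched roughly as follows. This is a plain-Python stand-in, not a Splink test: `seeded_sampler` is a hypothetical placeholder for whichever backend sampling call is under test, and the helper names are made up for the example.

```python
import random


def top_id_checksum(sampled_ids, top_n=100):
    """Steps 2 and 3: take the top_n largest IDs and add them up."""
    return sum(sorted(sampled_ids, reverse=True)[:top_n])


def is_repeatable(sampler, ids, runs=5):
    """Step 4: repeat the sampling and check that the checksum never changes."""
    checksums = {top_id_checksum(sampler(ids)) for _ in range(runs)}
    return len(checksums) == 1


def seeded_sampler(ids, proportion=0.01, seed=1):
    """Step 1 (placeholder): a seeded Bernoulli-style sample of the IDs."""
    rng = random.Random(seed)
    return [i for i in ids if rng.random() < proportion]


ids = list(range(1, 100_001))  # a large table with a numeric unique ID
repeatable = is_repeatable(seeded_sampler, ids)
print(repeatable)
```

Swapping `seeded_sampler` for a function that runs the real sampling query would turn this into a check of the backend rather than of Python's random number generator.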

@RobinL

Robin Linacre (RobinL) commented Jun 19, 2026

Member

Thanks - that's useful. I'm not sure what the underlying mechanism is here (i.e. what's making the difference), because we thought that DuckDB bernoulli sampling was deterministic.

In any case, I think this change is a good idea, because it's more consistent with how we're doing random row selection elsewhere in Splink 5.

One other point: I'm not sure whether you realised this, but we recently swapped over branches, so master is now the Splink 5 development branch and Splink 4 is now on the splink4_maintenance branch. So this change targets Splink 5. I think this is desirable, because Splink 5 shifts over to a hash-based approach for other sampling, so it makes sense for estimate_u to be consistent with that. So I think I'd rather this be a Splink 5 only thing. I guess it would also break backwards compatibility on the seed parameter, so there's some justification for not putting it in Splink 4. If you really need it in Splink 4, I'm still happy to consider doing that.

I actually just did a vibe-coded experiment here to see what this looks like as a more fundamental change, where we remove random sampling using tablesample-type approaches from all backends. If this goes smoothly, and you're happy for me to do so, I'll add some tidied-up commits along these lines to this PR before merging, and close #3123.

@gringer

Author

That'd be great, thanks

@gringer

David Eccles (gringer) commented Jun 22, 2026

Author

Creating a minimal reproducible test case for the non-reproducibility is proving difficult, because my example above gave reproducible results (i.e. it didn't have reproducibility issues), and the sampling process is more complex than I had expected. It seems to be something like this:

  1. Create concatenated table
  2. Sort by unique id [note: does not include source field, so there might be some ordering issues there, but I didn't encounter reproducibility issues when that was the only factor]
  3. Sample the sorted table (e.g. via bernoulli sampling)
  4. Split the sampled table into tables based on the source field
  5. Carry out chunked 'block and count'

Maybe there's something unexpected going on with query optimisation that's beyond my current understanding.
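
As a toy illustration of why row order matters in steps 2 and 3 above, the sketch below uses a seeded sampler whose decisions depend on row position. It is not a claim about what DuckDB does internally; the table contents and helper names are assumptions made for the example. Two orderings that both satisfy "sorted by unique id" (ties between sources broken differently) give different samples for the same seed.

```python
import random


def seeded_bernoulli(rows, proportion, seed):
    """Keep each row with the given probability; draws are consumed in row order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < proportion]


# Step 1: concatenate two sources that share unique_id values.
table_0 = [("input_table_0", uid) for uid in range(2000)]
table_1 = [("input_table_1", uid) for uid in range(2000)]
concat = table_0 + table_1

# Step 2: two valid results of "order by unique_id"; only the tie order differs.
order_a = sorted(concat, key=lambda r: (r[1], r[0]))
order_b = sorted(sorted(concat, key=lambda r: r[0], reverse=True), key=lambda r: r[1])

# Step 3: same seed, same proportion, different row order.
sample_a = seeded_bernoulli(order_a, 0.05, seed=1)
sample_b = seeded_bernoulli(order_b, 0.05, seed=1)

print(sample_a == sample_b)  # the two samples contain different rows

# Step 4: split by source; each sampled row comes from the other table.
sources_a = [source for source, _ in sample_a]
sources_b = [source for source, _ in sample_b]
```

A sample chosen by hashing each row's own ID would not be affected by this kind of tie-breaking, since it does not depend on position.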

@gringer

David Eccles (gringer) commented Jun 23, 2026

Author

It's possible that this could also fix another error that we're getting, which variously manifests as a segmentation fault and a bitpacking error on a dataset that has a lot of duplicated rows (where names and DOB are almost always identical, but an integer address ID is different):

u probability estimation error: "Invalid bitpacking mode"
2026-06-23 16:38:45,625 [INFO] datalinking.py:180 - Estimating linking prior...
2026-06-23 16:38:47,040 [INFO] training.py:154 - Probability two random records match is estimated to be  2.62e-07.
This means that amongst all possible pairwise record comparisons, one in 3,812,084.23 are expected to match.  With 200,431,215,806,272 total possible comparisons, we expect a total of around 52,577,856.00 matching pairs
2026-06-23 16:38:47,040 [INFO] datalinking.py:186 - Estimating u probabilities using random sampling...
2026-06-23 16:38:47,040 [INFO] estimate_u.py:337 - ----- Estimating u probabilities using random sampling -----
2026-06-23 16:38:47,040 [INFO] estimate_u.py:338 - Estimating u with: max_pairs = 10,000,000, min_count_per_level = 100, num_chunks = 10
2026-06-23 16:38:52,094 [ERROR] datalinking.py:847 - Data linking failed! Error executing the following sql for table `__splink__df_concat_sample`(__splink__df_concat_sample_b5656a32b):
CREATE TABLE __splink__df_concat_sample_b5656a32b AS
WITH

__splink__df_concat as (
    select
    '__splink__input_table_0' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_0

    UNION ALL

    select
    '__splink__input_table_1' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_1
)
select *
from (select * from __splink__df_concat order by "unique_id")
USING SAMPLE bernoulli(0.02233661298903751%) REPEATABLE(1)

Error was: INTERNAL Error: Invalid bitpacking mode
...
splink.internals.exceptions.SplinkException: Error executing the following sql for table `__splink__df_concat_sample`(__splink__df_concat_sample_b5656a32b):
CREATE TABLE __splink__df_concat_sample_b5656a32b AS
WITH

__splink__df_concat as (
    select
    '__splink__input_table_0' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_0

    UNION ALL

    select
    '__splink__input_table_1' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_1
)
select *
from (select * from __splink__df_concat order by "unique_id")
USING SAMPLE bernoulli(0.02233661298903751%) REPEATABLE(1)

Error was: INTERNAL Error: Invalid bitpacking mode

u probability estimation error: "Segmentation fault"
2026-06-23 16:44:35,801 [INFO] datalinking.py:180 - Estimating linking prior...
2026-06-23 16:44:37,150 [INFO] training.py:154 - Probability two random records match is estimated to be  2.62e-07.
This means that amongst all possible pairwise record comparisons, one in 3,812,084.23 are expected to match.  With 200,431,215,806,272 total possible comparisons, we expect a total of around 52,577,856.00 matching pairs
2026-06-23 16:44:37,151 [INFO] datalinking.py:186 - Estimating u probabilities using random sampling...
2026-06-23 16:44:37,151 [INFO] estimate_u.py:337 - ----- Estimating u probabilities using random sampling -----
2026-06-23 16:44:37,151 [INFO] estimate_u.py:338 - Estimating u with: max_pairs = 10,000,000, min_count_per_level = 100, num_chunks = 10
Segmentation fault (core dumped)
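
For reference, the sampling percentage in the failing query looks consistent with the logged settings: it matches the square root of max_pairs divided by the total possible comparisons. This is an observation from the numbers in the log above, not a statement of how Splink computes the proportion; the variable names are made up for the check.

```python
import math

max_pairs = 10_000_000                   # from the estimate_u.py log line
total_comparisons = 200_431_215_806_272  # from the training.py log line
logged_percent = 0.02233661298903751     # from the bernoulli(...) clause

# Sampling a fraction p of the rows keeps roughly p**2 of the pairs,
# so p = sqrt(max_pairs / total_comparisons) targets about max_pairs pairs.
proportion = math.sqrt(max_pairs / total_comparisons)
percent = 100 * proportion

print(percent)
print(math.isclose(percent, logged_percent, rel_tol=1e-6))
```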
