hash-based deterministic sampling (duckDB only) by gringer · Pull Request #3122 · moj-analytical-services/splink

David Eccles (gringer) (gringer) · 2026-06-19T05:50:06Z

Type of PR

BUG
FEAT
MAINT
DOC

Is your Pull Request linked to an existing Issue or Pull Request?

This is an attempt at fixing #2882. The current Bernoulli sampling is not deterministic when using DuckDB (as can be demonstrated by repeated runs of the u probability estimation, see here). I have not tested this with any other database backend, so I am not providing a change for any of them.

Give a brief description for the solution you have provided

The sampling approach is modified to use a hash of the unique_id field (together with the seed, if provided) for ordering the table prior to choosing a sample with a defined row count limit. This produces a fixed sample size that should be reproducible when used with the same input table and seed.
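As a rough illustration of the idea described above (not the code in this PR), the following sketch orders IDs by a hash of the seed and the unique ID, then keeps a fixed number of rows. The function name `hash_sample`, the use of SHA-256, and the `seed:id` text format are all assumptions made for the example; the PR itself does this inside DuckDB SQL.

```python
import hashlib


def hash_sample(unique_ids, sample_size, seed=None):
    """Pick a fixed-size sample by ordering on a hash of (seed, unique id).

    Hypothetical helper for illustration only.
    """
    def sort_key(uid):
        # Mix the optional seed into the hashed text so that each seed
        # gives its own, reproducible ordering.
        text = f"{seed}:{uid}"
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Tie-break on the id itself so the ordering is total.
        return (digest, str(uid))

    return sorted(unique_ids, key=sort_key)[:sample_size]


ids = list(range(10_000))
first = hash_sample(ids, 100, seed=1)
# The key depends only on the id and the seed, so input order is irrelevant.
second = hash_sample(list(reversed(ids)), 100, seed=1)
other_seed = hash_sample(ids, 100, seed=2)
```

Because the sort key is a pure function of the ID and the seed, the same table and seed always yield the same rows, and the sample size is exactly the requested limit rather than a random quantity.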

PR Checklist

Added documentation for changes
Added feature to example notebooks or tutorial (if appropriate)
Added tests (if appropriate)
Updated CHANGELOG.md (if appropriate)
Made changes based off the latest version of Splink
Run the linter
Run the spellchecker (if appropriate)

Robin Linacre (RobinL) · 2026-06-19T06:09:15Z

Interesting, thanks. In practice, does this definitely give deterministic results? I.e. does it fix the issues you were having before?

David Eccles (gringer) (gringer) · 2026-06-19T07:58:59Z

It does for u probabilities, at least in the one case that I've tested; I haven't checked further than that.

I got different u probability results from different runs with the bernoulli sampling, and the same results from different runs when doing hash-based sampling.

I've been trying to think through what might work as a test case, testing the sampling method, rather than any higher-level mechanisms:

1. Carry out sampling on a large table that includes a numeric unique ID
2. Select the top 100 rows of that sampled table, reverse-sorted by unique ID
3. Add up the unique ID column from that table
4. Repeat from step 1, and see if the sum changes
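The steps above could be sketched roughly as follows. This is only a stand-in: `sample_ids` is a hypothetical placeholder that uses a SHA-256 ordering in pure Python, whereas a real test would run the backend's sampling query at that point.

```python
import hashlib


def sample_ids(ids, sample_size, seed):
    # Placeholder sampler (an assumption for this sketch): order by a hash
    # of seed and id, then keep a fixed number of rows.
    def key(uid):
        return hashlib.sha256(f"{seed}:{uid}".encode("utf-8")).hexdigest()

    return sorted(ids, key=key)[:sample_size]


def checksum_of_sample(ids, sample_size, seed, top_n=100):
    sampled = sample_ids(ids, sample_size, seed)   # sample the large table
    top = sorted(sampled, reverse=True)[:top_n]    # top rows, reverse-sorted by id
    return sum(top)                                # add up the unique id column


ids = range(100_000)
# Repeat the whole procedure and collect the sums.
runs = [checksum_of_sample(ids, sample_size=5_000, seed=1) for _ in range(3)]
is_deterministic = len(set(runs)) == 1
```

A sum is a cheap fingerprint: if any run picks a different set of top rows, the total will almost certainly differ, so comparing the sums across runs is enough to flag non-determinism.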

Robin Linacre (RobinL) · 2026-06-19T08:27:10Z

Thanks - that's useful. I'm not sure what the underlying mechanism is here (i.e. what's making the difference) because we thought that duckdb bernoulli sampling was deterministic.

In any case, I think this change is a good idea, because it's more consistent with how we're doing random row selection elsewhere in Splink 5.

One other point: I'm not sure you realised this but we recently swapped over branches so master is now the Splink 5 development branch, and Splink 4 is now on the splink4_maintenance branch. So this change targets Splink 5. I think this is desirable, because Splink 5 shifts over to a hash based approach for other sampling, so it makes sense for that change to be consistent and be used for estimate_u as well. So I think I'd rather this be a Splink 5 only thing. I guess also it would break backwards compatibility on the seed parameter, so there's some justification for not putting it in Splink 4. If you really need it in Splink 4, I'm still happy to consider doing that.

I actually just did a vibe-coded experiment here to see what this looks like as a more fundamental change where we remove random sampling using tablesample type approaches from all backends. If this goes smoothly, and you're happy for me to do so, I'll add some tidied up commits along these lines to this PR before merging, and close 3123

David Eccles (gringer) (gringer) · 2026-06-19T09:36:57Z

That'd be great, thanks

David Eccles (gringer) (gringer) · 2026-06-22T06:01:54Z

Creating a minimal reproducible test case for the non-reproducibility is proving difficult, because my above example had reproducible results (i.e. it didn't have reproducibility issues), and the sampling process is more complex than I had expected. It seems to be something like this:

1. Create concatenated table
2. Sort by unique id [note: does not include source field, so there might be some ordering issues there, but I didn't encounter reproducibility issues when that was the only factor]
3. Sample the sorted table (e.g. via bernoulli sampling)
4. Split the sampled table into tables based on the source field
5. Carry out chunked 'block and count'

Maybe there's something unexpected going on with query optimisation that's beyond my current understanding.
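To make the ordering concern in the steps above concrete, here is a small pure-Python model. It is not how DuckDB implements `USING SAMPLE`, and it does not claim to be the cause of the behaviour seen here; it only shows the general property that a seeded, row-by-row Bernoulli draw depends on row order, while a hash-ordered sample does not. The function names and the SHA-256 choice are assumptions.

```python
import hashlib
import random


def bernoulli_sample(rows, fraction, seed):
    rng = random.Random(seed)
    # One random draw per row, consumed in row order: which rows are kept
    # depends on where each row sits in the input.
    return [row for row in rows if rng.random() < fraction]


def hash_order_sample(rows, sample_size, seed):
    def key(row):
        return hashlib.sha256(f"{seed}:{row}".encode("utf-8")).hexdigest()

    # The key depends only on the row content and the seed.
    return sorted(rows, key=key)[:sample_size]


# Two source tables sharing the same unique ids, concatenated and sorted by id.
rows = [(source, uid) for uid in range(5_000) for source in ("a", "b")]
# The same rows, but tied ids arrive in the opposite source order.
swapped = [(source, uid) for uid in range(5_000) for source in ("b", "a")]

bern_1 = set(bernoulli_sample(rows, 0.05, seed=1))
bern_2 = set(bernoulli_sample(swapped, 0.05, seed=1))
hash_1 = set(hash_order_sample(rows, 500, seed=1))
hash_2 = set(hash_order_sample(swapped, 500, seed=1))
```

In this model the two Bernoulli samples differ even though the seed is fixed, because the ties on unique id were resolved differently, while the two hash-ordered samples are identical.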

David Eccles (gringer) (gringer) · 2026-06-23T05:50:24Z

It's possible that this could also fix another error that we're getting, which variously manifests as a segmentation fault and a bitpacking error on a dataset that has a lot of duplicated rows (where names and DOB are almost always identical, but an integer address ID is different):

u probability estimation error: "Invalid bitpacking mode"

2026-06-23 16:38:45,625 [INFO] datalinking.py:180 - Estimating linking prior...
2026-06-23 16:38:47,040 [INFO] training.py:154 - Probability two random records match is estimated to be  2.62e-07.
This means that amongst all possible pairwise record comparisons, one in 3,812,084.23 are expected to match.  With 200,431,215,806,272 total possible comparisons, we expect a total of around 52,577,856.00 matching pairs
2026-06-23 16:38:47,040 [INFO] datalinking.py:186 - Estimating u probabilities using random sampling...
2026-06-23 16:38:47,040 [INFO] estimate_u.py:337 - ----- Estimating u probabilities using random sampling -----
2026-06-23 16:38:47,040 [INFO] estimate_u.py:338 - Estimating u with: max_pairs = 10,000,000, min_count_per_level = 100, num_chunks = 10
2026-06-23 16:38:52,094 [ERROR] datalinking.py:847 - Data linking failed! Error executing the following sql for table `__splink__df_concat_sample`(__splink__df_concat_sample_b5656a32b):
CREATE TABLE __splink__df_concat_sample_b5656a32b AS
WITH

__splink__df_concat as (
    select
    '__splink__input_table_0' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_0

    UNION ALL

    select
    '__splink__input_table_1' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_1
)
select *
from (select * from __splink__df_concat order by "unique_id")
USING SAMPLE bernoulli(0.02233661298903751%) REPEATABLE(1)

Error was: INTERNAL Error: Invalid bitpacking mode
...
splink.internals.exceptions.SplinkException: Error executing the following sql for table `__splink__df_concat_sample`(__splink__df_concat_sample_b5656a32b):
CREATE TABLE __splink__df_concat_sample_b5656a32b AS
WITH

__splink__df_concat as (
    select
    '__splink__input_table_0' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_0

    UNION ALL

    select
    '__splink__input_table_1' as source_dataset,
    "DayOfBirth",
    "MonthOfBirth",
    "NYSIIS_first_names",
    "NYSIIS_first_names_array",
    "NYSIIS_last_names",
    "NYSIIS_last_names_array",
    "YearOfBirth",
    "snz_uid",
    "snz_unique_nbr",
    "splink_birth_date",
    "splink_first_names",
    "splink_first_names_array",
    "splink_full_names_array",
    "splink_last_names",
    "splink_last_names_array",
    "splink_meshblock_code",
    "splink_p_address_register_uid",
    "splink_r_address_register_uid",
    "unique_id"
            from __splink__input_table_1
)
select *
from (select * from __splink__df_concat order by "unique_id")
USING SAMPLE bernoulli(0.02233661298903751%) REPEATABLE(1)

Error was: INTERNAL Error: Invalid bitpacking mode
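As a sanity check on the logged figures, the sampling percentage in the failing query is numerically consistent with the square root of `max_pairs` divided by the total number of comparisons. This is an observation from the numbers in the log above, not a statement about how Splink computes it internally.

```python
import math

# Figures copied from the log above.
max_pairs = 10_000_000
total_comparisons = 200_431_215_806_272
one_in = 3_812_084.23

# Sampling a proportion p of rows keeps roughly p**2 of the pairwise
# comparisons, so p = sqrt(max_pairs / total) targets max_pairs pairs.
proportion = math.sqrt(max_pairs / total_comparisons)
percent = 100 * proportion

# The expected number of matching pairs quoted in the log.
expected_matches = total_comparisons / one_in

print(f"bernoulli percentage ~ {percent:.8f}%")
print(f"expected matching pairs ~ {expected_matches:,.0f}")
```

The computed percentage agrees with the `bernoulli(0.02233661298903751%)` clause to within rounding, which means only about one row in 4,500 is being sampled from the ordered subquery.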

u probability estimation error: "Segmentation fault"

2026-06-23 16:44:35,801 [INFO] datalinking.py:180 - Estimating linking prior...
2026-06-23 16:44:37,150 [INFO] training.py:154 - Probability two random records match is estimated to be  2.62e-07.
This means that amongst all possible pairwise record comparisons, one in 3,812,084.23 are expected to match.  With 200,431,215,806,272 total possible comparisons, we expect a total of around 52,577,856.00 matching pairs
2026-06-23 16:44:37,151 [INFO] datalinking.py:186 - Estimating u probabilities using random sampling...
2026-06-23 16:44:37,151 [INFO] estimate_u.py:337 - ----- Estimating u probabilities using random sampling -----
2026-06-23 16:44:37,151 [INFO] estimate_u.py:338 - Estimating u with: max_pairs = 10,000,000, min_count_per_level = 100, num_chunks = 10
Segmentation fault (core dumped)
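For comparison with the failing statement, here is a sketch of what a hash-ordered version of the final `select` might look like. The helper `hash_ordered_sample_sql` is hypothetical, and the exact SQL used in this PR may differ; the sketch assumes DuckDB's `hash()` and `concat()` functions and a caller-supplied row limit. Whether this avoids the internal error or the segmentation fault has not been tested.

```python
def hash_ordered_sample_sql(table, id_column, row_limit, seed=None):
    """Build a DuckDB-style query that samples by hash order plus a row limit.

    Hypothetical helper: table and column names are interpolated as-is,
    so they must come from trusted code, not user input.
    """
    limit = int(row_limit)
    if limit < 0:
        raise ValueError("row_limit must be non-negative")
    # Append the seed (if any) to the hashed text so each seed gives its
    # own reproducible ordering.
    seed_text = "" if seed is None else str(int(seed))
    return (
        f"select *\n"
        f"from {table}\n"
        f"order by hash(concat(cast(\"{id_column}\" as varchar), '{seed_text}')), "
        f"\"{id_column}\"\n"
        f"limit {limit}"
    )


sql = hash_ordered_sample_sql("__splink__df_concat", "unique_id", 2000, seed=1)
print(sql)
```

Unlike the `USING SAMPLE bernoulli(...) REPEATABLE(1)` form, this does not need the inner `order by "unique_id"` subquery, because the hash ordering already fixes which rows are selected regardless of how the input rows arrive.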

David Eccles (gringer) force-pushed the deterministic_u branch from fb5c90b to 0a24949 on June 23, 2026 05:36

hash-based deterministic sampling (duckDB only)

430bf84

David Eccles (gringer) force-pushed the deterministic_u branch from 0a24949 to 430bf84 on June 24, 2026 00:00

Update CHANGELOG.md

78e08a6

David Eccles (gringer) wants to merge 2 commits into moj-analytical-services:master from gringer:deterministic_u
