hash-based deterministic sampling (duckDB only)#3122
hash-based deterministic sampling (duckDB only)#3122David Eccles (gringer) (gringer) wants to merge 2 commits into
Conversation
|
Interesting, thanks. In practice does this definitely give deterministic results? I e. Does it fix the issues you were having before? |
|
It does for u probabilities, at least in the one case that I've tested; I haven't checked further than that. I got different u probability results from different runs with the bernoulli sampling, and the same results from different runs when doing hash-based sampling. I've been trying to think through what might work as a test case, testing the sampling method, rather than any higher-level mechanisms:
|
|
Thanks - that's useful. I'm not sure what the underlying mechanism is here (i.e. what's making the difference) because we thought that duckdb bernoulli sampling was deterministic. In any case, I think this change is a good idea, because it's more consistent with how we're doing random row selection elsewhere in Splink 5. One other point: I'm not sure you realised this but we recently swapped over branches so I actually just did an vibe-coded experiment here to see what this looks like as a more fundamental change where we remove random sampling using |
|
That'd be great, thanks |
|
Creating a minimally-reproducible test case for non-reproduciblity is proving difficult, because my above example had reproducible results (i.e. it didn't have reproducibility issues), and the sampling process is more complex than I had expected. It seems to be something like this:
Maybe there's something unexpected going on with query optimisation that's beyond my current understanding. |
fb5c90b to
0a24949
Compare
|
It's possible that this could also fix another error that we're getting, which variously manifests as a segmentation fault and a bitpacking error on a dataset that has a lot of duplicated rows (where names and DOB are almost always identical, but an integer address ID is different): u probability estimation error: "Invalid bitpacking mode"u probability estimation error: "Segmentation fault" |
0a24949 to
430bf84
Compare
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
This is an attempt at fixing #2882. The current bernoulli sampling is not deterministic when using DuckDB (as can be demonstrated by repeated runs of the u probability estimation, see here). I have not tested this with any other database backends, so am not providing a change for any other backends.
Give a brief description for the solution you have provided
The sampling approach is modified to use a hash of the
unique_idfield (together with the seed, if provided) for ordering the table prior to choosing a sample with a defined row count limit. This produces a fixed sample size that should be reproducible when used with the same input table and seed.PR Checklist