Firstly, a huge thanks for Splink. I am stunned at how complete and well documented the package is. It's easily among the very best OSS I've had the pleasure of using ❤️
My question – I'm dealing with a dataset of ~10M user records that is keyed by a tenant ID. I'm looking to dedupe only within each tenant. I have 2 ideas of how to approach this but no strong sense of which is preferable:
Any thoughts?
Replies: 1 comment 1 reply
Thanks, appreciate the feedback on the docs!
On (1), this is not expected: if your blocking rule includes
AND l.tenant_id = r.tenant_id
then Splink should not create inter-tenant comparisons.
The performance partly depends on how many tenants you have and which backend you're using.
If you have fewer than, say, about 100 tenants and are using DuckDB, option (2) should give you good performance. If you're using Spark, my guess is that (1) will be the faster approach.
If you're still having trouble with leakage in (1), please provide a reprex and we can look into it.
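As a rough illustration of the blocking rule point in (1), here is a minimal sketch that uses Python's built-in `sqlite3` rather than Splink itself. The `users` table, the toy rows, the `surname` column, and the `candidate_pairs` helper are all hypothetical, and the self-join is only a simplified stand-in for how a blocking rule restricts candidate pairs, not Splink's actual generated SQL. It shows that adding an equality condition on `tenant_id` leaves only within-tenant pairs.

```python
import sqlite3

# Hypothetical toy data: (unique_id, tenant_id, surname)
rows = [
    (1, "a", "smith"),
    (2, "a", "smith"),
    (3, "b", "smith"),
    (4, "b", "smith"),
    (5, "b", "jones"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (unique_id INTEGER, tenant_id TEXT, surname TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)


def candidate_pairs(blocking_rule):
    """Return candidate record pairs allowed by a blocking-rule-style condition.

    Simplified assumption: pairs come from a self-join on the rule, and each
    unordered pair is kept once by requiring l.unique_id < r.unique_id.
    """
    sql = f"""
        SELECT l.unique_id, r.unique_id, l.tenant_id, r.tenant_id
        FROM users AS l
        INNER JOIN users AS r
          ON {blocking_rule}
        WHERE l.unique_id < r.unique_id
        ORDER BY l.unique_id, r.unique_id
    """
    return con.execute(sql).fetchall()


# Rule with the tenant equality condition: only within-tenant pairs survive.
within = candidate_pairs("l.surname = r.surname AND l.tenant_id = r.tenant_id")

# Same rule without the tenant condition: cross-tenant pairs appear.
unrestricted = candidate_pairs("l.surname = r.surname")

cross_tenant = [p for p in unrestricted if p[2] != p[3]]
print(len(within), len(unrestricted), len(cross_tenant))
```

If you see cross-tenant pairs in your own output, it is worth checking that every blocking rule you pass carries the tenant condition, since a single rule without it is enough to let such pairs through in this simplified picture.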