Deduping multi-tenanted data #2579

imranolas · 2025-01-06T12:09:59Z

imranolas
Jan 6, 2025

Firstly, a huge thanks for Splink. I'm stunned at how complete and well documented the package is. It's easily among the very best OSS I've had the pleasure of using ❤️

My question: I'm dealing with a dataset of ~10M user records that is keyed by a tenant ID. I'm looking to dedupe only within each tenant. I have 2 ideas of how to approach this but no strong sense of which is preferable:

1. Use blocking rules to limit comparisons to records within the same tenant. In my tests I saw records from different tenants leaking into the same cluster. Not sure if this is expected or if I made an error.
2. Create a linker for each tenant and run it as a dedicated task. I sense this is simpler and offers better guarantees of correctness, but I worry about performance.

Any thoughts?

RobinL · 2025-01-06T13:33:09Z

RobinL
Jan 6, 2025
Maintainer

Thanks, appreciate the feedback on the docs!

On (1), this is not expected - if your blocking rule has an AND l.tenant_id = r.tenant_id then Splink should not create inter-tenant comparisons.
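To make the effect of the tenant condition concrete, here is a minimal sketch of what a blocking rule does to the candidate pairs. It does not call Splink; it uses Python's built-in sqlite3 and treats the blocking rule as the condition of a self-join with aliases l and r. The table name, column names and sample rows are made up for illustration. If you use several blocking rules, each one would presumably need the tenant condition, since a pair only has to satisfy one rule to become a candidate.

```python
import sqlite3

# Made-up sample data: (unique_id, tenant_id, surname)
rows = [
    (1, "t1", "smith"),
    (2, "t1", "smith"),
    (3, "t2", "smith"),
    (4, "t2", "smith"),
    (5, "t2", "jones"),
]


def candidate_pairs(blocking_rule):
    """Return the record pairs that satisfy a blocking rule.

    The rule is plain SQL over the aliases l and r, used as the
    condition of a self-join. This is an illustration only, not
    Splink's actual implementation.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE users (unique_id INTEGER, tenant_id TEXT, surname TEXT)"
    )
    con.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    sql = f"""
        SELECT l.unique_id, r.unique_id
        FROM users AS l
        JOIN users AS r
          ON {blocking_rule}
        WHERE l.unique_id < r.unique_id  -- keep each pair once
        ORDER BY l.unique_id, r.unique_id
    """
    pairs = con.execute(sql).fetchall()
    con.close()
    return pairs


# With the tenant condition, only same-tenant pairs are generated.
same_tenant = candidate_pairs(
    "l.surname = r.surname AND l.tenant_id = r.tenant_id"
)

# Without it, records from t1 and t2 are paired with each other.
no_tenant = candidate_pairs("l.surname = r.surname")

print(same_tenant)
print(len(no_tenant))
```

With the equality condition in place, no pair spans two tenants, so no cluster built from those pairs can either.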

The performance partly depends on how many tenants you have and which backend you're using.

If you have fewer than about 100 tenants and are using DuckDB, option (2) should give you good performance. If you're using Spark, my guess is that (1) will be the faster approach.
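For option (2), the overall shape is a partition by tenant followed by one independent run per partition. The sketch below shows only that shape, in plain Python. The function dedupe_one_tenant is a hypothetical stand-in for building and running a per-tenant linker: here it just groups on lowercased email so the example runs on its own. The field names and records are also made up.

```python
from collections import defaultdict


def split_by_tenant(records):
    """Group records into one list per tenant_id."""
    by_tenant = defaultdict(list)
    for rec in records:
        by_tenant[rec["tenant_id"]].append(rec)
    return dict(by_tenant)


def dedupe_one_tenant(tenant_records):
    """Hypothetical placeholder for a per-tenant linker run.

    Clusters on exact lowercased email purely for illustration;
    a real run would train and predict with a linker instead.
    """
    clusters = defaultdict(list)
    for rec in tenant_records:
        clusters[rec["email"].lower()].append(rec["unique_id"])
    return sorted(clusters.values())


# Made-up sample records
records = [
    {"unique_id": 1, "tenant_id": "t1", "email": "a@x.com"},
    {"unique_id": 2, "tenant_id": "t1", "email": "A@x.com"},
    {"unique_id": 3, "tenant_id": "t2", "email": "a@x.com"},
    {"unique_id": 4, "tenant_id": "t2", "email": "b@x.com"},
]

# Each tenant is processed in isolation, so clusters cannot mix tenants.
results = {
    tenant: dedupe_one_tenant(recs)
    for tenant, recs in split_by_tenant(records).items()
}

print(results)
```

Records 1 and 3 share an email but sit in different tenants, so they are never compared. The trade-off is one run per tenant, which is why the number of tenants matters for performance.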

If you're still seeing leakage with (1), please provide a reprex and we can look into it.

1 reply

imranolas Jan 8, 2025
Author

That's good to know. I've not been able to recreate the issue so I reckon it was my error somewhere.

This has been immensely helpful, thanks 🙏
