Deduping multi-tenanted data #2579

Answered by RobinL
imranolas asked this question in Q&A
Thanks, appreciate the feedback on the docs!

On (1), this is not expected: if your blocking rule includes AND l.tenant_id = r.tenant_id, then Splink should not create inter-tenant comparisons.
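To illustrate why a tenant-equality condition in the blocking rule should rule out cross-tenant pairs, here is a minimal sketch. It is not Splink's own code: it uses Python's built-in sqlite3 as a stand-in backend and mimics the general idea of applying a blocking rule as the join condition of a self-join. The table name df, the column names, and the toy records are all made up for illustration.

```python
import sqlite3

# Hypothetical toy records: (unique_id, tenant_id, first_name)
records = [
    (1, "a", "john"),
    (2, "a", "john"),
    (3, "b", "john"),
    (4, "b", "john"),
    (5, "b", "mary"),
]

# Blocking rule in the l./r. style used above. The tenant_id equality
# restricts candidate pairs to records from the same tenant.
blocking_rule = "l.first_name = r.first_name AND l.tenant_id = r.tenant_id"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (unique_id INTEGER, tenant_id TEXT, first_name TEXT)"
)
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", records)

# Self-join with the blocking rule as the join condition (an assumption
# about the general shape of blocking, not Splink's generated SQL).
# The unique_id ordering keeps each pair once and drops self-pairs.
pairs = conn.execute(
    f"""
    SELECT l.unique_id, r.unique_id, l.tenant_id, r.tenant_id
    FROM df AS l
    JOIN df AS r
      ON {blocking_rule}
    WHERE l.unique_id < r.unique_id
    ORDER BY l.unique_id, r.unique_id
    """
).fetchall()

# Any pair whose two tenant_ids differ would be "leakage".
cross_tenant = [p for p in pairs if p[2] != p[3]]

print("candidate pairs:", pairs)
print("cross-tenant pairs:", cross_tenant)
```

If a check like this against your real blocked pairs returns cross-tenant rows, it is worth confirming that every blocking rule you pass carries the tenant condition, since pairs produced by any one rule are kept.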

The performance partly depends on how many tenants you have and which backend you're using.

If you have fewer than about 100 tenants and are using DuckDB, option (2) should give you good performance. If you're using Spark, my guess is that (1) will be the faster approach.

If you're still seeing leakage with (1), please provide a reprex and we can look into it.

Answer selected by imranolas