cluster_pairwise_predictions_at_threshold shows me links below threshold_match_probability #2528
-
When we cluster in Splink we use the 'connected components' algorithm: links below the threshold are effectively ignored, and any records that are still connected to each other are clustered together, even if the connection only exists via another record. For example, with an edges table:
clustering at a threshold of 0.95 we would find all three records in the same cluster:
This is because records end up in the same cluster whenever they are connected through a chain of links above the threshold, even if the direct link between two of them falls below it.

By default cluster studio will show you all links that exist in the edges data you supply to it, regardless of whether or not they fall below the clustering threshold. This can be very useful for model QA: if you find many clusters with lots of links below this threshold, it may be a sign of issues with your model (perhaps the high-probability links you have are false positives, or maybe your low-probability links are false negatives because your model is missing nuance).

If you want to view clusters without these links, there is a slider in the controls section that allows you to filter out all links below a given threshold. If you really wanted to remove them entirely from the dashboard, you could filter your edges table before passing it in to create the dashboard.
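To make the behaviour concrete, here is a minimal pure-Python sketch of connected components at a threshold. It is not Splink's implementation, and the three-row edges list is a made-up illustration rather than the table from the reply above. The function name `cluster_at_threshold` and the choice to keep links with probability `>=` the threshold are also assumptions.

```python
from collections import defaultdict

# Hypothetical edges: (record_l, record_r, match_probability).
# A-C is a weak direct link, but A and C are both strongly linked to B.
edges = [
    ("A", "B", 0.99),
    ("B", "C", 0.98),
    ("A", "C", 0.60),
]


def cluster_at_threshold(edges, threshold):
    """Group records connected by a chain of links at or above threshold."""
    parent = {}

    def find(node):
        # Register unseen nodes, then walk up to the root (with path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right, prob in edges:
        # Look up both records so that unlinked ones become singletons.
        root_l, root_r = find(left), find(right)
        if prob >= threshold:
            parent[root_l] = root_r  # merge the two components

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return sorted(sorted(members) for members in clusters.values())


# All three records share a cluster, although the A-C link is only 0.60.
clusters = cluster_at_threshold(edges, 0.95)
print(clusters)

# Filtering the edges first is one way to hide weak links before plotting.
strong_edges = [edge for edge in edges if edge[2] >= 0.95]
print(strong_edges)
```

The weak A-C row is still present in the unfiltered edges, which is why a dashboard built from them would draw it even though it played no part in forming the cluster.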
-
Hello Splink community!
I have this code:
Since I set threshold_match_probability=0.95, I expect the tool to show me only clusters for which the match probability between nodes is higher than 0.95. However, the tool shows me links below that threshold.
Do you have any idea why?
Thanks!