Replies: 3 comments 4 replies
-
Hi @adelinor. This is great work! We haven't actually done any internal benchmarking of this kind, so this is super useful. You ask:
I don't know this for sure, but I think the longer runtimes are most likely driven by the specific SQL comparisons, rather than purely by more data being added to the table. I'm not surprised that the 1000-char text adds a lot to the runtime, as this is quite a computationally intensive calculation (could you consider truncating the text, or cleaning it with e.g. a regex prior to linking?). I am much more surprised that the date functions add a lot of runtime. Are you able to post the SQL for the exact comparison for 'How many = 5'? I'd like to convert that one into an issue so we can investigate it more thoroughly.
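The truncation/cleaning suggestion above could look something like this before the data reaches the linker. This is only a sketch: the function name, the regex rules, and the 200-character cap are illustrative assumptions, not from this thread.

```python
import re

def clean_text(value, max_len=200):
    """Normalise free text before fuzzy comparison (illustrative sketch).

    Lower-cases, replaces punctuation with spaces, collapses whitespace,
    and truncates, so expensive functions like jaccard() operate on
    shorter, cleaner strings.
    """
    if value is None:
        return None
    text = value.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text[:max_len]
```

In PySpark the equivalent would typically be applied to the input DataFrame (e.g. with `regexp_replace` and `substring`) before `predict`, so the cleaning cost is paid once per record rather than once per pairwise comparison.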
-
Hi @RobinL, many thanks for commenting on this discussion. Attached is the comparison rule added in the How many = 5 run. Looking at the SQL generated by the expression:

```json
{
  "sql_condition": "jaccard(`col_l`, `col_r`) >= x",
  "label_for_charts": "Jaccard >= x",
  "m_probability": 0.0123,
  "u_probability": 0.0456
}
```

it would be preferable to compute the Jaccard distance once and use the result in each threshold comparison. Let me know if I should try something that would help the performance investigation. Kind regards
-
This is an old thread now, but for future reference we've just added a section to the docs benchmarking the performance of different comparison functions:
-
Hi,
The execution times shown below are for the predict step with Splink version 3.9.2 running on Databricks (PySpark).
All executions of the predict step use the same data (169 million comparisons pass the blocking rules) and the same settings, except for the comparison rules.
The first run has only a single comparison (How many = 1), for the column shown under Adding. The second run adds a new comparison rule (How many = 2).
The first visible impact is a custom date comparison (see How many = 5), which adds 40 minutes of execution time. Should I troubleshoot the SQL expressions, or is the added time simply because more data is being added to the Splink predict result table?
The comparison of 1000-char text has a big impact on execution times (see How many = 7), adding about 3 hours :( Is this to be expected?
Any ideas or suggestions are welcome. Many thanks.
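One thing worth trying for the custom date comparison described above: if the SQL evaluates a date function separately in each level's condition, computing the day difference once and banding it can be much cheaper. This is a sketch under assumed band boundaries, not the actual comparison from this thread.

```python
from datetime import date

def date_band(d1, d2):
    """Band two dates by their absolute difference in days.

    The difference is computed once; each level is then a cheap
    integer comparison, mirroring a SQL CASE WHEN ladder over a
    single precomputed datediff() rather than repeated calls.
    """
    if d1 is None or d2 is None:
        return "null"
    delta = abs((d1 - d2).days)
    if delta == 0:
        return "exact"
    if delta <= 30:
        return "within_month"
    if delta <= 365:
        return "within_year"
    return "other"
```

If the custom comparison instead uses a string-similarity function over formatted dates, the same compute-once principle applies: materialise the expensive expression once and write each threshold level against the cached value.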