Replies: 3 comments 4 replies
-
Hi @adelinor. This is great work! We haven't actually done any internal benchmarking of this kind, so this is super useful. You ask:
I don't know this for sure, but I think the longer runtimes are most likely driven by the specific SQL comparisons, rather than purely by more data being added to the table. I'm not surprised that the 1000-char text adds a lot to the runtime, as this is quite a computationally intensive calculation (could you consider truncating the text, or cleaning it with e.g. a regex prior to linking?). I am much more surprised that the date functions add a lot of runtime. Are you able to post the SQL for the exact comparison for 'How many = 5'? I'd like to convert that one into an issue so we can investigate it more thoroughly.
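The truncation/cleaning suggestion above could look something like this before the data reaches the linker. This is only a sketch: the function name, the regex rules, and the 200-character cap are illustrative assumptions, not from this thread.

```python
import re

def clean_text(value, max_len=200):
    """Normalise free text before fuzzy comparison (illustrative sketch).

    Lower-cases, replaces punctuation with spaces, collapses whitespace,
    and truncates, so expensive functions like jaccard() operate on
    shorter, cleaner strings.
    """
    if value is None:
        return None
    text = value.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text[:max_len]
```

In PySpark the equivalent would typically be applied to the input DataFrame (e.g. with `regexp_replace` and `substring`) before `predict`, so the cleaning cost is paid once per record rather than once per pairwise comparison.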
-
Hi @RobinL, many thanks for commenting on this discussion. Attached is the comparison rule added in the How many = 5 run. Looking at the SQL generated by the expression:

```json
{
  "sql_condition": "jaccard(`col_l`, `col_r`) >= x",
  "label_for_charts": "Jaccard >= x",
  "m_probability": 0.0123,
  "u_probability": 0.0456
}
```

it would be preferable to compute the Jaccard distance once and use the result in each threshold comparison. Let me know if I should try something that would help the performance investigation. Kind regards
-
This is an old thread now, but for future reference we've just added a section to the docs benchmarking the performance of different comparison functions:
-
Hi,
The execution times shown below are for the predict step with Splink version 3.9.2 running on Databricks (PySpark).
All executions of the predict step use the same data (169 million comparisons pass the blocking rules) and the same settings, except for the comparison rules.
The first run has only a single comparison (How many = 1), for the column shown under Adding. The second run adds a new comparison rule (How many = 2).
The first visible impact is a custom date comparison (see How many = 5), which adds 40 minutes of execution time. Should I troubleshoot the SQL expressions, or is the added time simply because more data is being added to the Splink predict result table?
The comparison of 1000-char text has a big impact on execution times (see How many = 7), adding about 3 hours :( Is this to be expected?
Any ideas or suggestions are welcome. Many thanks.
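One thing worth trying for the custom date comparison described above: if the SQL evaluates a date function separately in each level's condition, computing the day difference once and banding it can be much cheaper. This is a sketch under assumed band boundaries, not the actual comparison from this thread.

```python
from datetime import date

def date_band(d1, d2):
    """Band two dates by their absolute difference in days.

    The difference is computed once; each level is then a cheap
    integer comparison, mirroring a SQL CASE WHEN ladder over a
    single precomputed datediff() rather than repeated calls.
    """
    if d1 is None or d2 is None:
        return "null"
    delta = abs((d1 - d2).days)
    if delta == 0:
        return "exact"
    if delta <= 30:
        return "within_month"
    if delta <= 365:
        return "within_year"
    return "other"
```

If the custom comparison instead uses a string-similarity function over formatted dates, the same compute-once principle applies: materialise the expensive expression once and write each threshold level against the cached value.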