[batch] make batches query go brrrrrrr #14629

ehigham · 2024-07-19T20:49:55Z

This PR optimises the V2 batches query along with its associated subqueries.
Listing batches for the CI user now executes in 1s were previously it had taken ~14s.
Resolves #14599

ehigham · 2024-07-24T20:12:17Z

Verified that the queries return the same results for ci by running both old and new queries and diffing the results.

ehigham · 2024-07-25T19:49:53Z

I believe 14366 played a part in motivating the lateral joins. For that query, the execution pans are identical. Here, I'm seeing better performance with the more natural query. I didn't go further for compatibility with user-defined filter operations.

chrisvittal

Okay. Wrapped my head around this query and its parts. I think I'm convinced that this doesn't alter the results of this query.

batch/batch/front_end/query/query.py

cjllanwarne · 2024-07-29T19:02:39Z

batch/batch/front_end/query/query.py

@@ -478,10 +482,14 @@ def __init__(self, term: str):
        self.term = term

    def query(self) -> Tuple[str, List[str]]:


Nit-picking perhaps, but is this "query" naming semantic correct when this is really producing a WHERE clause? It's also a little confusing downstream in the query_v2 function where all of these additions are called "queries" when they're really just lists of clauses being generated (or not) for a single larger query.

I think some SQL implementations might call this an Expression - I didn't find a definite grammar from a quick google. I'm not going to make these kinds of changes in this PR. If you'd like to make a follow-up change to improve this that might help clean up confusion.

cjllanwarne · 2024-07-29T19:38:34Z

batch/batch/front_end/query/query.py

+EXISTS (
+  SELECT NULL
+  FROM job_group_attributes AS attrs
+  WHERE attrs.batch_id = batches.id


I see this new query explicitly mentions batches.id which wasn't part of the previous query. Are we introducing a new field here? Or was this just implicitly included before? Is it possible to document somehow the tables we expect to run these queries against, and what assumptions they're making about what other tables have already have been joined in (and on what keys)?

That's a good point - I should just use the job_groups table.

cjllanwarne · 2024-07-30T15:06:34Z

batch/batch/front_end/query/query_v2.py

+FROM batches
+INNER JOIN job_groups


I might benefit from a in-person explanation for why these are the same. It seems possible that

job_groups LEFT JOIN batches would give a different result to batches INNER JOIN job_groups

Like... wouldn't a job group that isn't represented in batches would appear in the left query but not the right query?

But maybe there's some data-model reason why that's not the case? (eg that it's impossible to be a job group without a batch?)

This question also applies for all the other joins down the chain

The aim of the query is to return a list of batches annotated with some high level metadata like cost, job statuses and whether or not the batch has been cancelled.

To convince yourself that they're the same, you also need to look at the data in the where-clause (defined above this f-string). Of note, we're filtering for job groups with job_group_id = 0.

All batches have at least one job group with job_group_id = 0 (ie the root job group). Performing and inner join or left join will yield the same answer. Operationally I'm doing batches inner join (job_groups filtered by job_group_id = 0).

A job group cannot exist outside of a batch.

A batch is cancelled when a record exists in the job_groups_cancelled table with the same id and job_group_id = 0, ie:

SELECT B.id, C.id IS NOT NULL AS cancelled FROM batches AS B LEFT JOIN job_groups_cancelled AS C USING (id) WHERE C.job_group_id = 0;

For efficency, we do not add a record into job_groups_cancelled for every cancelled job group.
Instead, a job group is cancelled if a record exists in job_groups_cancelled for an ancestral job group.

The table job_group_self_and_ancestors tells us a job group's ancestor_id and level for each job group.
For the root job group, this ancestor_id is the job group's job_group_id (as the table name implies).

SELECT G.batch_id , G.job_group_id , C.job_group_id IS NOT NULL AS cancelled FROM job_groups AS G INNER JOIN job_groups_self_and_ancestors AS A ON G.batch_id = A.batch_id AND G.job_group_id = A.job_group_id LEFT JOIN job_groups_cancelled AS C ON G.batch_id = C.id AND A.ancestor_id = C.job_group_id;

As I understand it, job_groups_n_jobs_in_complete_states contains aggregated information for the job group hierarchy so we only need the record for job_group_id 0.

The lateral join on the table derived from aggregated_job_group_costs is to avoid computing the costs of the batch for all batches and then joining.

This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?

That might indeed make the query more human readable, but names like job_groups are exposed to user defined queries that you see in other files. A follow-up change would improve that but harder to test.

Co-authored-by: Chris Llanwarne <[email protected]>

cjllanwarne

I think I've convinced myself I understand this now. I've requested changes because I have a minor suggestion, but feel free to push back on that, in which case I'll convert into an approval

cjllanwarne · 2024-07-31T14:41:25Z

batch/batch/front_end/query/query.py

+  WHERE attrs.batch_id = job_groups.batch_id
+    AND attrs.job_group_id = job_groups.job_group_id


Does attrs.job_group_id = job_groups.job_group_id not imply attrs.batch_id = job_groups.batch_id?

No, job_group_id is unique to the batch and is not globally unique.

cjllanwarne · 2024-07-31T15:07:04Z

batch/batch/front_end/query/query_v2.py

+INNER JOIN billing_project_users
+  ON batches.billing_project = billing_project_users.billing_project


Also not in this PR, but thinking out loud - this took me a while to parse since it's not used elsewhere in the query and not in the return value. Eventually I noticed it's part of a where clause that's generated above and joined in below.

So thinking out loud, that feels a little icky because we have to remember to add this here and we have to get the table names right for a where clause we happen to be adding below. I wonder whether we could change the Query interface from its current two parts (where clause, parameters) to three parts (join clause, where clause, parameters) to make that a little harder to mix up.

Completely agree - there's a lot to remember here. This part of the query (plus the billing_project_users.user = ? in the WHERE clause) is what controls what users can see.
Your suggestion would be really useful for #14410 FYI

cjllanwarne · 2024-07-31T15:13:40Z

batch/batch/front_end/query/query_v2.py

+  ON batches.id = job_groups_n_jobs_in_complete_states.id
+  AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id
+LEFT JOIN (SELECT *, 1 AS cancelled FROM job_groups_cancelled) AS cancelled_t
+  ON batches.id = cancelled_t.id


Not in this PR either, but it was confusing to me that the ID field in a table called job_groups_cancelled represents the batch ID, not the job group ID

I think this was one field that was missed when job groups were added. Created #14646

cjllanwarne · 2024-07-31T15:23:05Z

batch/batch/front_end/query/query_v2.py

+FROM batches
+INNER JOIN job_groups


This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?

cjllanwarne

I agree this change is equivalent to what was there before and should be much more performant 👍

This reverts commit 21ad23a.

This reverts commit 21ad23a. (partially) Resolves #14599

#14629 improved the speed of listing a user's batches but introduced a large regression for listing all batches readble by that user. This change fixes that regression by making use index hints and `STRAIGHT_JOIN`s. The index hint tells MySQL to never consider the index `batches_deleted` as it has very low cardinality. In some forms of this query, the planner tries to use it to its peril. A problem in query 0 with #14629 (see below) was that fewer filters on batches made the optimiser consider joins in a suboptimal order - it did a table scan on `job_groups` first then sorted the results by to `batches.id DESC` instead of doing an index scan on `batches` in reverse. Using `STRAIGHT_JOIN`s instead of `INNER JOIN` mades the optimiser start from `batches` and read its index in reverse before considering other tables in subsequent joins. From the [documentation](https://dev.mysql.com/doc/refman/8.4/en/join.html): > STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. This is advantageous for a couple of reasons: - We want to list newer batches first - For this query, the `batches` table has more applicables indexes - We want the variable to order by to be in the primary key of the first table so we can read the index in reverse Before and after timings, collected by running the query 5 times, then using profiles gathered by MySQL. ``` +-------+---------------------------------------------------* | query | description | +-------+---------------------------------------------------+ | 0 | All batches accessible to user `ci` | | 1 | All batches accessible to user `ci` owned by `ci` | +-------+---------------------------------------------------* +-------+--------+--------------------------------------------------------+------------+------------+ | query | branch | timings | mean | stdev | +-------+--------+--------------------------------------------------------+------------+------------+ | 0 | main | 0.05894400,0.05207850,0.07067875,0.06281800,0.060250 | 0.06095385 | 0.00602255 | | 1 | main | 14.1106150,12.2619323,13.8442850,12.0749633,14.0297822 | 13.2643156 | 0.90087263 | +-------+--------+--------------------------------------------------------+------------+------------+ | 0 | PR | 0.04717375,0.04974350,0.04312150,0.04070850,0.04193650 | 0.04453675 | 0.00339069 | | 1 | PR | 0.04423925,0.03967550,0.03935425,0.04056875,0.05286850 | 0.04334125 | 0.00507140 | +-------+--------+--------------------------------------------------------+------------+------------+ ``` I'm hopeful that this won't introduce regressions for most use cases. While I haven't benchmarked other queries, the MySQL client does feel more responsive for a wider array of users. One notable exception is for the user `dking` who owns 3.7x more batches than has access to, of which all have been deleted. I don't think this is a common enough use case to make this query even more complicated than it already is. Resolves #14599

ehigham added 4 commits July 19, 2024 16:49

[batch] make batches query go brrrrrrr

537f7af

oops

3732acc

more oops

0c5c3f7

dont need the index

f2faadd

ehigham requested a review from cjllanwarne July 24, 2024 20:12

ehigham assigned cjllanwarne Jul 24, 2024

ehigham requested a review from chrisvittal July 29, 2024 19:09

ehigham assigned chrisvittal Jul 29, 2024

chrisvittal approved these changes Jul 29, 2024

View reviewed changes

cjllanwarne reviewed Jul 30, 2024

View reviewed changes

ehigham and others added 2 commits July 30, 2024 11:18

Update batch/batch/front_end/query/query.py

d3eff45

Co-authored-by: Chris Llanwarne <[email protected]>

dont ref batches table

5dbf80e

ehigham requested a review from cjllanwarne July 30, 2024 16:16

cjllanwarne requested changes Jul 31, 2024

View reviewed changes

ehigham requested a review from cjllanwarne July 31, 2024 15:41

ehigham added WIP and removed WIP labels Jul 31, 2024

cjllanwarne approved these changes Jul 31, 2024

View reviewed changes

hail-ci-robot merged commit 21ad23a into hail-is:main Jul 31, 2024
3 checks passed

ehigham deleted the ehigham/batches-go-brrrrr branch July 31, 2024 20:08

ehigham added a commit to ehigham/hail that referenced this pull request Aug 1, 2024

Revert "[batch] make batches query go brrrrrrr (hail-is#14629)"

b68514d

This reverts commit 21ad23a.

hail-ci-robot pushed a commit that referenced this pull request Aug 1, 2024

Revert "[batch] make batches query go brrrrrrr (#14629)" (#14648)

84dd26c

This reverts commit 21ad23a. (partially) Resolves #14599

ehigham mentioned this pull request Aug 1, 2024

[batch] make batches query go brrrrrrr (for realsies) #14649

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[batch] make batches query go brrrrrrr #14629

[batch] make batches query go brrrrrrr #14629

ehigham commented Jul 19, 2024 •

edited

Loading

ehigham commented Jul 24, 2024

ehigham commented Jul 25, 2024 •

edited

Loading

chrisvittal left a comment

cjllanwarne Jul 29, 2024

ehigham Jul 30, 2024 •

edited

Loading

cjllanwarne Jul 29, 2024

ehigham Jul 30, 2024

cjllanwarne Jul 30, 2024

ehigham Jul 30, 2024

ehigham Jul 30, 2024 •

edited

Loading

ehigham Jul 30, 2024 •

edited

Loading

ehigham Jul 30, 2024

ehigham Jul 30, 2024

cjllanwarne Jul 31, 2024

ehigham Jul 31, 2024

cjllanwarne left a comment

cjllanwarne Jul 31, 2024

ehigham Jul 31, 2024 •

edited

Loading

cjllanwarne Jul 31, 2024

ehigham Jul 31, 2024 •

edited

Loading

cjllanwarne Jul 31, 2024

ehigham Jul 31, 2024

cjllanwarne Jul 31, 2024

cjllanwarne left a comment

		@@ -478,10 +482,14 @@ def __init__(self, term: str):
		self.term = term

		def query(self) -> Tuple[str, List[str]]:

		WHERE attrs.batch_id = job_groups.batch_id
		AND attrs.job_group_id = job_groups.job_group_id

		INNER JOIN billing_project_users
		ON batches.billing_project = billing_project_users.billing_project

[batch] make batches query go brrrrrrr #14629

[batch] make batches query go brrrrrrr #14629

Conversation

ehigham commented Jul 19, 2024 • edited Loading

ehigham commented Jul 24, 2024

ehigham commented Jul 25, 2024 • edited Loading

chrisvittal left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehigham Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehigham Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

ehigham Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjllanwarne left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehigham Jul 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehigham Jul 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjllanwarne left a comment

Choose a reason for hiding this comment

ehigham commented Jul 19, 2024 •

edited

Loading

ehigham commented Jul 25, 2024 •

edited

Loading

ehigham Jul 30, 2024 •

edited

Loading

ehigham Jul 30, 2024 •

edited

Loading

ehigham Jul 30, 2024 •

edited

Loading

ehigham Jul 31, 2024 •

edited

Loading

ehigham Jul 31, 2024 •

edited

Loading