-
Notifications
You must be signed in to change notification settings - Fork 249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] make batches query go brrrrrrr #14629
[batch] make batches query go brrrrrrr #14629
Conversation
Verified that the queries return the same results for ci by running both old and new queries and diffing the results. |
I believe 14366 played a part in motivating the lateral joins. For that query, the execution pans are identical. Here, I'm seeing better performance with the more natural query. I didn't go further for compatibility with user-defined filter operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Wrapped my head around this query and its parts. I think I'm convinced that this doesn't alter the results of this query.
@@ -478,10 +482,14 @@ def __init__(self, term: str): | |||
self.term = term | |||
|
|||
def query(self) -> Tuple[str, List[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit-picking perhaps, but is this "query" naming semantic correct when this is really producing a WHERE clause? It's also a little confusing downstream in the query_v2 function where all of these additions are called "queries" when they're really just lists of clauses being generated (or not) for a single larger query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some SQL implementations might call this an Expression
- I didn't find a definite grammar from a quick google. I'm not going to make these kinds of changes in this PR. If you'd like to make a follow-up change to improve this that might help clean up confusion.
batch/batch/front_end/query/query.py
Outdated
EXISTS ( | ||
SELECT NULL | ||
FROM job_group_attributes AS attrs | ||
WHERE attrs.batch_id = batches.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this new query explicitly mentions batches.id
which wasn't part of the previous query. Are we introducing a new field here? Or was this just implicitly included before? Is it possible to document somehow the tables we expect to run these queries against, and what assumptions they're making about what other tables have already have been joined in (and on what keys)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point - I should just use the job_groups table.
FROM batches | ||
INNER JOIN job_groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might benefit from a in-person explanation for why these are the same. It seems possible that
job_groups LEFT JOIN batches
would give a different result to batches INNER JOIN job_groups
Like... wouldn't a job group that isn't represented in batches would appear in the left query but not the right query?
But maybe there's some data-model reason why that's not the case? (eg that it's impossible to be a job group without a batch?)
This question also applies for all the other joins down the chain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aim of the query is to return a list of batches annotated with some high level metadata like cost, job statuses and whether or not the batch has been cancelled.
To convince yourself that they're the same, you also need to look at the data in the where-clause (defined above this f-string). Of note, we're filtering for job groups with job_group_id = 0
.
All batches have at least one job group with job_group_id = 0
(ie the root job group). Performing and inner join or left join will yield the same answer. Operationally I'm doing batches
inner join
(job_groups
filtered by job_group_id = 0
).
A job group cannot exist outside of a batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A batch is cancelled when a record exists in the job_groups_cancelled
table with the same id and job_group_id = 0
, ie:
SELECT B.id, C.id IS NOT NULL AS cancelled
FROM batches AS B
LEFT JOIN job_groups_cancelled AS C USING (id)
WHERE C.job_group_id = 0;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For efficency, we do not add a record into job_groups_cancelled
for every cancelled job group.
Instead, a job group is cancelled if a record exists in job_groups_cancelled
for an ancestral job group.
The table job_group_self_and_ancestors
tells us a job group's ancestor_id
and level
for each job group.
For the root job group, this ancestor_id
is the job group's job_group_id
(as the table name implies).
SELECT G.batch_id
, G.job_group_id
, C.job_group_id IS NOT NULL AS cancelled
FROM job_groups AS G
INNER JOIN job_groups_self_and_ancestors AS A
ON G.batch_id = A.batch_id
AND G.job_group_id = A.job_group_id
LEFT JOIN job_groups_cancelled AS C
ON G.batch_id = C.id
AND A.ancestor_id = C.job_group_id;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, job_groups_n_jobs_in_complete_states
contains aggregated information for the job group hierarchy so we only need the record for job_group_id 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lateral join on the table derived from aggregated_job_group_costs
is to avoid computing the costs of the batch for all batches and then joining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group
here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might indeed make the query more human readable, but names like job_groups
are exposed to user defined queries that you see in other files. A follow-up change would improve that but harder to test.
Co-authored-by: Chris Llanwarne <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've convinced myself I understand this now. I've requested changes because I have a minor suggestion, but feel free to push back on that, in which case I'll convert into an approval
WHERE attrs.batch_id = job_groups.batch_id | ||
AND attrs.job_group_id = job_groups.job_group_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does attrs.job_group_id = job_groups.job_group_id
not imply attrs.batch_id = job_groups.batch_id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, job_group_id
is unique to the batch and is not globally unique.
INNER JOIN billing_project_users | ||
ON batches.billing_project = billing_project_users.billing_project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not in this PR, but thinking out loud - this took me a while to parse since it's not used elsewhere in the query and not in the return value. Eventually I noticed it's part of a where clause that's generated above and joined in below.
So thinking out loud, that feels a little icky because we have to remember to add this here and we have to get the table names right for a where clause we happen to be adding below. I wonder whether we could change the Query
interface from its current two parts (where clause, parameters) to three parts (join clause, where clause, parameters) to make that a little harder to mix up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree - there's a lot to remember here. This part of the query (plus the billing_project_users.user = ?
in the WHERE
clause) is what controls what users can see.
Your suggestion would be really useful for #14410 FYI
ON batches.id = job_groups_n_jobs_in_complete_states.id | ||
AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id | ||
LEFT JOIN (SELECT *, 1 AS cancelled FROM job_groups_cancelled) AS cancelled_t | ||
ON batches.id = cancelled_t.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR either, but it was confusing to me that the ID field in a table called job_groups_cancelled represents the batch ID, not the job group ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was one field that was missed when job groups were added. Created #14646
FROM batches | ||
INNER JOIN job_groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like INNER JOIN job_groups AS root_job_group
here, to make it a little easier for humans to parse the tables' name/intention/usage in the rest of the query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this change is equivalent to what was there before and should be much more performant 👍
This reverts commit 21ad23a.
#14629 improved the speed of listing a user's batches but introduced a large regression for listing all batches readble by that user. This change fixes that regression by making use index hints and `STRAIGHT_JOIN`s. The index hint tells MySQL to never consider the index `batches_deleted` as it has very low cardinality. In some forms of this query, the planner tries to use it to its peril. A problem in query 0 with #14629 (see below) was that fewer filters on batches made the optimiser consider joins in a suboptimal order - it did a table scan on `job_groups` first then sorted the results by to `batches.id DESC` instead of doing an index scan on `batches` in reverse. Using `STRAIGHT_JOIN`s instead of `INNER JOIN` mades the optimiser start from `batches` and read its index in reverse before considering other tables in subsequent joins. From the [documentation](https://dev.mysql.com/doc/refman/8.4/en/join.html): > STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. This is advantageous for a couple of reasons: - We want to list newer batches first - For this query, the `batches` table has more applicables indexes - We want the variable to order by to be in the primary key of the first table so we can read the index in reverse Before and after timings, collected by running the query 5 times, then using profiles gathered by MySQL. ``` +-------+---------------------------------------------------* | query | description | +-------+---------------------------------------------------+ | 0 | All batches accessible to user `ci` | | 1 | All batches accessible to user `ci` owned by `ci` | +-------+---------------------------------------------------* +-------+--------+--------------------------------------------------------+------------+------------+ | query | branch | timings | mean | stdev | +-------+--------+--------------------------------------------------------+------------+------------+ | 0 | main | 0.05894400,0.05207850,0.07067875,0.06281800,0.060250 | 0.06095385 | 0.00602255 | | 1 | main | 14.1106150,12.2619323,13.8442850,12.0749633,14.0297822 | 13.2643156 | 0.90087263 | +-------+--------+--------------------------------------------------------+------------+------------+ | 0 | PR | 0.04717375,0.04974350,0.04312150,0.04070850,0.04193650 | 0.04453675 | 0.00339069 | | 1 | PR | 0.04423925,0.03967550,0.03935425,0.04056875,0.05286850 | 0.04334125 | 0.00507140 | +-------+--------+--------------------------------------------------------+------------+------------+ ``` I'm hopeful that this won't introduce regressions for most use cases. While I haven't benchmarked other queries, the MySQL client does feel more responsive for a wider array of users. One notable exception is for the user `dking` who owns 3.7x more batches than has access to, of which all have been deleted. I don't think this is a common enough use case to make this query even more complicated than it already is. Resolves #14599
This PR optimises the V2 batches query along with its associated subqueries.
Listing batches for the CI user now executes in 1s were previously it had taken ~14s.
Resolves #14599