-
Notifications
You must be signed in to change notification settings - Fork 249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] make batches query go brrrrrrr #14629
Changes from 4 commits
537f7af
3732acc
0c5c3f7
f2faadd
d3eff45
5dbf80e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -455,10 +455,14 @@ def __init__(self, term: str): | |
self.term = term | ||
|
||
def query(self) -> Tuple[str, List[str]]: | ||
sql = """ | ||
((job_groups.batch_id, job_groups.job_group_id) IN | ||
(SELECT batch_id, job_group_id FROM job_group_attributes | ||
WHERE `key` = %s OR `value` = %s)) | ||
sql = """\ | ||
EXISTS ( | ||
SELECT NULL | ||
FROM job_group_attributes AS attrs | ||
WHERE attrs.batch_id = batches.id | ||
AND attrs.job_group_id = job_groups.job_group_id | ||
AND (attrs.`key` = %s or attrs.`value` = %s) | ||
ehigham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
""" | ||
return (sql, [self.term, self.term]) | ||
|
||
|
@@ -478,10 +482,14 @@ def __init__(self, term: str): | |
self.term = term | ||
|
||
def query(self) -> Tuple[str, List[str]]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit-picking perhaps, but is this "query" naming semantic correct when this is really producing a WHERE clause? It's also a little confusing downstream in the query_v2 function where all of these additions are called "queries" when they're really just lists of clauses being generated (or not) for a single larger query. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think some SQL implementations might call this an |
||
sql = """ | ||
((job_groups.batch_id, job_groups.job_group_id) IN | ||
(SELECT batch_id, job_group_id FROM job_group_attributes | ||
WHERE `key` LIKE %s OR `value` LIKE %s)) | ||
sql = """\ | ||
EXISTS ( | ||
SELECT NULL | ||
FROM job_group_attributes AS attrs | ||
WHERE attrs.batch_id = batches.id | ||
AND attrs.job_group_id = job_groups.job_group_id | ||
AND (attrs.`key` LIKE %s OR attrs.`value` LIKE %s) | ||
) | ||
""" | ||
escaped_term = f'%{self.term}%' | ||
return (sql, [escaped_term, escaped_term]) | ||
|
@@ -505,10 +513,15 @@ def query(self) -> Tuple[str, List[str]]: | |
value = self.value | ||
if isinstance(self.operator, PartialMatchOperator): | ||
value = f'%{value}%' | ||
sql = f""" | ||
((job_groups.batch_id, job_groups.job_group_id) IN | ||
(SELECT batch_id, job_group_id FROM job_group_attributes | ||
WHERE `key` = %s AND `value` {op} %s)) | ||
sql = f"""\ | ||
EXISTS ( | ||
SELECT NULL | ||
FROM job_group_attributes AS attrs | ||
WHERE attrs.batch_id = batches.id | ||
AND attrs.job_group_id = job_groups.job_group_id | ||
AND attrs.`key` = %s | ||
AND attrs.`value` {op} %s | ||
) | ||
""" | ||
return (sql, [self.key, value]) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,40 +125,46 @@ def parse_list_batches_query_v2(user: str, q: str, last_batch_id: Optional[int]) | |
where_conditions.append(f'({cond})') | ||
where_args += args | ||
|
||
sql = f""" | ||
SELECT batches.*, | ||
cancelled_t.cancelled IS NOT NULL AS cancelled, | ||
sql = f"""\ | ||
SELECT | ||
batches.*, | ||
cancelled_t.cancelled <=> 1 AS cancelled, | ||
job_groups_n_jobs_in_complete_states.n_completed, | ||
job_groups_n_jobs_in_complete_states.n_succeeded, | ||
job_groups_n_jobs_in_complete_states.n_failed, | ||
job_groups_n_jobs_in_complete_states.n_cancelled, | ||
cost_t.cost, cost_t.cost_breakdown | ||
FROM job_groups | ||
LEFT JOIN batches ON batches.id = job_groups.batch_id | ||
LEFT JOIN billing_projects ON batches.billing_project = billing_projects.name | ||
LEFT JOIN job_groups_n_jobs_in_complete_states ON job_groups.batch_id = job_groups_n_jobs_in_complete_states.id AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id | ||
LEFT JOIN LATERAL ( | ||
SELECT 1 AS cancelled | ||
FROM job_group_self_and_ancestors | ||
INNER JOIN job_groups_cancelled | ||
ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND | ||
job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id | ||
WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND | ||
job_groups.job_group_id = job_group_self_and_ancestors.job_group_id | ||
) AS cancelled_t ON TRUE | ||
STRAIGHT_JOIN billing_project_users ON batches.billing_project = billing_project_users.billing_project | ||
LEFT JOIN LATERAL ( | ||
SELECT COALESCE(SUM(`usage` * rate), 0) AS cost, JSON_OBJECTAGG(resources.resource, COALESCE(`usage` * rate, 0)) AS cost_breakdown | ||
FROM ( | ||
SELECT resource_id, CAST(COALESCE(SUM(`usage`), 0) AS SIGNED) AS `usage` | ||
cost_t.cost, | ||
cost_t.cost_breakdown | ||
FROM batches | ||
INNER JOIN job_groups | ||
Comment on lines
+138
to
+139
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might benefit from a in-person explanation for why these are the same. It seems possible that
Like... wouldn't a job group that isn't represented in batches would appear in the left query but not the right query? But maybe there's some data-model reason why that's not the case? (eg that it's impossible to be a job group without a batch?) This question also applies for all the other joins down the chain There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The aim of the query is to return a list of batches annotated with some high level metadata like cost, job statuses and whether or not the batch has been cancelled. To convince yourself that they're the same, you also need to look at the data in the where-clause (defined above this f-string). Of note, we're filtering for job groups with All batches have at least one job group with A job group cannot exist outside of a batch. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A batch is cancelled when a record exists in the SELECT B.id, C.id IS NOT NULL AS cancelled
FROM batches AS B
LEFT JOIN job_groups_cancelled AS C USING (id)
WHERE C.job_group_id = 0; There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For efficency, we do not add a record into The table SELECT G.batch_id
, G.job_group_id
, C.job_group_id IS NOT NULL AS cancelled
FROM job_groups AS G
INNER JOIN job_groups_self_and_ancestors AS A
ON G.batch_id = A.batch_id
AND G.job_group_id = A.job_group_id
LEFT JOIN job_groups_cancelled AS C
ON G.batch_id = C.id
AND A.ancestor_id = C.job_group_id; There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I understand it, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The lateral join on the table derived from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This all is beginning to make sense. Since I tripped over this many times understanding the rest of the query, how would you feel about doing something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That might indeed make the query more human readable, but names like |
||
ON batches.id = job_groups.batch_id | ||
INNER JOIN billing_projects | ||
ON batches.billing_project = billing_projects.name | ||
INNER JOIN billing_project_users | ||
ON batches.billing_project = billing_project_users.billing_project | ||
Comment on lines
+143
to
+144
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also not in this PR, but thinking out loud - this took me a while to parse since it's not used elsewhere in the query and not in the return value. Eventually I noticed it's part of a where clause that's generated above and joined in below. So thinking out loud, that feels a little icky because we have to remember to add this here and we have to get the table names right for a where clause we happen to be adding below. I wonder whether we could change the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Completely agree - there's a lot to remember here. This part of the query (plus the |
||
INNER JOIN job_groups_n_jobs_in_complete_states | ||
ON batches.id = job_groups_n_jobs_in_complete_states.id | ||
AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id | ||
LEFT JOIN (SELECT *, 1 AS cancelled FROM job_groups_cancelled) AS cancelled_t | ||
ON batches.id = cancelled_t.id | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not in this PR either, but it was confusing to me that the ID field in a table called job_groups_cancelled represents the batch ID, not the job group ID There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this was one field that was missed when job groups were added. Created #14646 |
||
AND job_groups.job_group_id = cancelled_t.job_group_id | ||
INNER JOIN LATERAL ( | ||
WITH resource_costs AS ( | ||
SELECT | ||
resource_id, | ||
CAST(COALESCE(SUM(`usage`), 0) AS SIGNED) AS `usage` | ||
FROM aggregated_job_group_resources_v3 | ||
WHERE job_groups.batch_id = aggregated_job_group_resources_v3.batch_id AND job_groups.job_group_id = aggregated_job_group_resources_v3.job_group_id | ||
WHERE batch_id = batches.id | ||
GROUP BY resource_id | ||
) AS usage_t | ||
LEFT JOIN resources ON usage_t.resource_id = resources.resource_id | ||
) | ||
SELECT | ||
COALESCE(SUM(`usage` * rate), 0) AS cost, | ||
JSON_OBJECTAGG(resource, COALESCE(`usage` * rate, 0)) AS cost_breakdown | ||
FROM resource_costs | ||
INNER JOIN resources USING (resource_id) | ||
) AS cost_t ON TRUE | ||
WHERE {' AND '.join(where_conditions)} | ||
ORDER BY job_groups.batch_id DESC | ||
ORDER BY batches.id DESC | ||
LIMIT 51; | ||
""" | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this new query explicitly mentions
batches.id
which wasn't part of the previous query. Are we introducing a new field here? Or was this just implicitly included before? Is it possible to document somehow the tables we expect to run these queries against, and what assumptions they're making about what other tables have already have been joined in (and on what keys)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point - I should just use the job_groups table.