
RFC: New vtgate Metrics #17585

Open
systay opened this issue Jan 21, 2025 · 7 comments

@systay
Collaborator

systay commented Jan 21, 2025

1. Summary

Vitess currently classifies query plan types in a way that is neither intuitive nor helpful for performance analysis. In particular, QueriesProcessed and QueriesRouted rely on plan-type designations that are inconsistent across different operators (e.g., IN, Concatenate, DDL, Reference, FkCascade, and InsertSelect). This proposal introduces two new metrics and deprecates the older, less-informative ones.

2. Motivation

  1. Inconsistent Plan-Type Metric

    • The plan type was derived from the root operator’s name, leading to wide variability in reported plan types.
    • Some operators (like Route) reported the route type, while others forwarded whatever their child operator returned.
  2. Limited Usefulness

    • Metrics such as QueriesProcessed and QueriesRouted provide only a coarse breakdown.
    • Deeper insight (e.g., how complex or how many shards were involved) is missing or buried in logs/other metrics.
  3. Need for Clarity

    • We want an easily understandable categorization of query complexity to better diagnose query performance in vtgate.
    • We also want to track how many shards each query touches, to inform sharding logic decisions and query optimization.

3. Proposed Changes

3.1 Deprecation

  • Metrics to Deprecate:
    • QueriesProcessed
    • QueriesRouted
  • These will be marked as deprecated but still exist for at least one release to allow for a smooth transition.

3.2 New Metric: QueriesProcessedByQueryType

We propose categorizing queries into eight distinct buckets to capture both common and potentially problematic execution patterns:

  1. Passthrough

    • The query is forwarded directly to a single shard without modification.
    • Minimal overhead in vtgate; the fastest possible path.
  2. MultiShard

    • The query is routed to multiple shards, but not all shards in the cluster.
    • vtgate may need to concatenate results but does not perform extensive operator logic beyond routing.
  3. Scatter

    • The query is sent to all shards.
    • Indicates potential performance overhead, as every shard must be involved in the request.
  4. Lookup

    • The query requires at least two calls: typically a vindex lookup first, followed by the main query.
    • A common pattern for partial fan-out or locating specific shard(s) through a lookup vindex.
  5. Join

    • The plan includes at least one join operator at vtgate level (e.g., join of two routes).
    • Useful for quickly spotting queries that might be combining data across multiple shards.
  6. Complex

    • Catches any plan more involved than the above categories (e.g., subqueries, nested operators, multi-stage pipelines).
    • Indicates a need for further investigation or optimization if it appears frequently.
  7. OnlineDDL

    • Vitess-managed DDL statements performed online (e.g., schema migrations orchestrated by vtgate).
    • Tracked separately to measure usage and performance impact of online operations.
  8. DirectDDL

    • DDL statements that are directly passed to the underlying MySQL instances.
    • Does not go through Vitess’s online migration flow.

Example Metric Name

  • QueriesProcessedByQueryType{queryType="Passthrough"}
  • QueriesProcessedByQueryType{queryType="Complex"}, etc.
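To make the bucket semantics concrete, here is a minimal Go sketch of a counter labeled by these eight categories. The type and function names are illustrative only; the real implementation would use Vitess's stats package rather than a bare array:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// QueryType mirrors the eight proposed buckets. Names are illustrative.
type QueryType int

const (
	Passthrough QueryType = iota
	MultiShard
	Scatter
	Lookup
	Join
	Complex
	OnlineDDL
	DirectDDL
	numQueryTypes
)

func (t QueryType) String() string {
	return [...]string{"Passthrough", "MultiShard", "Scatter", "Lookup",
		"Join", "Complex", "OnlineDDL", "DirectDDL"}[t]
}

// queryCounts is a stand-in for the real labeled counter
// (e.g. a CountersWithSingleLabel in the Vitess stats package).
var queryCounts [numQueryTypes]atomic.Int64

// recordQuery would be called once per query, after plan classification.
func recordQuery(t QueryType) {
	queryCounts[t].Add(1)
}

func main() {
	recordQuery(Passthrough)
	recordQuery(Passthrough)
	recordQuery(Complex)
	fmt.Printf("QueriesProcessedByQueryType{queryType=%q} = %d\n",
		Passthrough, queryCounts[Passthrough].Load())
}
```

The classification itself would happen at plan-build time, so each plan carries its bucket and the execution path only increments a counter.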

3.3 New Metric: QueriesProcessedByStatementType

  • Purpose: Categorize queries by the high-level SQL statement type.

Possible Categories (not exhaustive):

  • SELECT
  • INSERT
  • UPDATE
  • DELETE
  • SET
  • DDL (could be further subdivided if desired)
  • Others as needed (ALTER, CREATE, DROP, etc.)
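A rough sketch of the statement-type classification. A real implementation would classify from the parsed AST rather than raw SQL text; the function name and category strings here are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// statementType returns a coarse label for a SQL statement based on its
// leading keyword. This is a simplification for illustration: vtgate would
// derive this from the AST it already builds during planning.
func statementType(sql string) string {
	fields := strings.Fields(strings.ToUpper(strings.TrimSpace(sql)))
	if len(fields) == 0 {
		return "OTHER"
	}
	switch fields[0] {
	case "SELECT", "INSERT", "UPDATE", "DELETE", "SET":
		return fields[0]
	case "ALTER", "CREATE", "DROP", "RENAME", "TRUNCATE":
		return "DDL" // could be further subdivided if desired
	default:
		return "OTHER"
	}
}

func main() {
	fmt.Println(statementType("select * from t"))         // SELECT
	fmt.Println(statementType("ALTER TABLE t ADD c int")) // DDL
}
```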

3.4 New Histogram: “Shards Accessed per Query”

  • Purpose: Track how many shards a query invocation talks to.
  • Buckets: 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
  • Metric Name: ShardsAccessedHistogram (or similar).

This histogram shows the distribution of queries across shard counts, helping identify where queries may be fanning out more than expected.
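The bucketing logic is straightforward: a query is counted in the smallest bucket boundary that is at least the number of shards it touched. A sketch, assuming the powers-of-two boundaries proposed above:

```go
package main

import "fmt"

// shardBuckets are the proposed histogram boundaries.
var shardBuckets = []int{0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512}

// bucketFor returns the smallest boundary >= n, i.e. the bucket a query
// touching n shards would be counted in. Returns -1 past the last boundary;
// a real histogram would have an overflow (+Inf) bucket instead.
func bucketFor(n int) int {
	for _, b := range shardBuckets {
		if n <= b {
			return b
		}
	}
	return -1
}

func main() {
	for _, n := range []int{1, 3, 5, 100} {
		fmt.Printf("%d shards -> bucket %d\n", n, bucketFor(n))
	}
}
```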

4. Backward Compatibility

  • We will keep the old metrics running in parallel, marked as “DEPRECATED,” for at least one release cycle.
  • Developers and operators should be encouraged to migrate to the new metrics.

5. Open Questions

  1. Granularity: Do we need more nuanced categories within “Complex”?
  2. Bucket Sizes: Are the proposed histogram buckets (0, 1, 2, 4, …) adequate for most production workloads?
  3. Eventual Removal: What is the timeline for fully removing QueriesProcessed and QueriesRouted?
@systay
Collaborator Author

systay commented Jan 21, 2025

I believe that implementing this RFC would solve the following issue: #16391

@systay
Collaborator Author

systay commented Jan 22, 2025

I thought it would be useful to list the personas that might influence or benefit from these metrics, and that we should consider in the analysis. Here are the most common ones in the context of a system like Vitess:

  1. SRE / DevOps Engineer
  • Key Interests: Reliability, uptime, and scaling. They need quick insights into system health and clear signals for when to scale or investigate issues.
  • Metric Needs:
    1. High-level “are we stable or at risk?” metrics (throughput, errors, latency).
    2. Immediate detectability of anomalies (e.g., sudden spike in multi-shard queries).
    3. Trends for capacity planning and alert thresholds.
  2. Database Administrator (DBA)
  • Key Interests: Query performance, indexing strategies, schema design, and resource optimization.
  • Metric Needs:
    1. Detailed breakdown of query behaviors (scatter vs. single-shard vs. multi-operator).
    2. Histograms that reveal read/write patterns and fan-out.
    3. Ability to diagnose root causes of slow or blocking queries (though DBAs often have more system access than other roles).
  3. Application Developer
  • Key Interests: Ensuring queries are correct, efficient, and aligned with application logic. They may need feedback if their changes degrade or improve performance.
  • Metric Needs:
    1. Simple categories (e.g., which queries are “simple vs. complex”) to spot if an app refactor introduced heavier queries.
    2. Statement-type metrics (e.g., INSERT/SELECT spikes) to confirm expected usage patterns.
    3. Real-time visibility into the effect of changes without having to parse logs or run deeper queries.
  4. Customer Engineering (CE)
  • Key Interests: Diagnosing customer issues without direct access to logs or ability to run VEXPLAIN. Confirming that recommended fixes have been implemented.
  • Metric Needs:
    1. Snapshot “fingerprint” metrics showing plan/operator usage.
    2. Time-series data to validate changes over time (e.g., did problematic queries or plans decrease?).
    3. Enough granularity to detect newly introduced “bad” patterns or complex plans.

@systay
Collaborator Author

systay commented Jan 22, 2025

Updated the categories for QueriesProcessedByQueryType, adding two new ones: Join and MultiShard.

@lizztheblizz

I like this renewed focus! The main question/concern I would have remains around breaking down the Complex category, as well as other known problematic categories.

Indicates a need for further investigation or optimization if it appears frequently.

Performance impact of queries in the Complex category has the potential to be exponential, so even a relatively small number of queries, percentage-wise, would always warrant a closer look. How can we use the proposed dimensions to "zoom in" on this category to better understand what is happening? Does the proposal intend to provide answers for the new types of diagnostic questions these categories will result in?

To give a more practical example, how would we fill the blank in the following process?

  1. "Performance seems bad"
  2. Identify a meaningful subset of Complex queries as reported by vtgate, let's say it's 2% of all reported queries. We have historical metrics to show that it recently increased from being just 1%.
  3. ???
  4. "Performance is good"

One possible approach that doesn't rely purely on Metrics:
We could implement something analogous to the MySQL Slow Query Log, i.e. a configurable filtered log stream, capturing only queries from specified buckets, capturing queries surpassing a specific execution time, or any other combination of filters. It could even include the actual plans used for the queries in question. Then diagnostic step 3 could be:
3. Capture the filtered subset of queries from the vtgate query logs matching the Complex category and analyze it for new and existing patterns.

Are there ways we could answer the question using purely Metrics?
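A sketch of what such a filter predicate might look like (all names here are illustrative, not an actual vtgate configuration):

```go
package main

import "fmt"

// logEntry and logFilter sketch the configurable filtered log stream
// described above: keep only queries in the named buckets, or queries
// slower than a threshold.
type logEntry struct {
	queryType  string // e.g. "Complex", "Scatter"
	durationMs int
	sql        string
}

type logFilter struct {
	buckets      map[string]bool // buckets to always capture
	slowerThanMs int             // capture anything slower than this
}

func (f logFilter) keep(e logEntry) bool {
	return f.buckets[e.queryType] || e.durationMs >= f.slowerThanMs
}

func main() {
	f := logFilter{
		buckets:      map[string]bool{"Complex": true},
		slowerThanMs: 500,
	}
	fmt.Println(f.keep(logEntry{queryType: "Complex", durationMs: 10}))     // true
	fmt.Println(f.keep(logEntry{queryType: "Passthrough", durationMs: 20})) // false
}
```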

@lizztheblizz

Unrelated to my prior comment, but here is another unmistakable metrics gap: a perspective that can only be gained from the vtgate's POV:

TransactionsProcessedByShardDistribution

  • SingleShard
  • MultiShard

TransactionsProcessedByTransactionType

  • ReadOnlySingleTable
  • ReadOnlyMultiTable
  • ReadWriteSingleTable
  • ReadWriteMultiTable

There might be more to explore there, but given that we already track the transaction context, it feels relatively trivial to start gathering some meaningful numbers about these events.
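The categories above could be derived from state vtgate already tracks, roughly like this (a sketch; the type, field, and function names are hypothetical):

```go
package main

import "fmt"

// txnInfo stands in for the transaction state vtgate already maintains.
type txnInfo struct {
	shards map[string]bool // shards participating in the transaction
	tables map[string]bool // tables touched by its statements
	wrote  bool            // whether any statement modified data
}

// shardDistribution yields the TransactionsProcessedByShardDistribution label.
func shardDistribution(t txnInfo) string {
	if len(t.shards) > 1 {
		return "MultiShard"
	}
	return "SingleShard"
}

// transactionType yields the TransactionsProcessedByTransactionType label.
func transactionType(t txnInfo) string {
	rw := "ReadOnly"
	if t.wrote {
		rw = "ReadWrite"
	}
	tbl := "SingleTable"
	if len(t.tables) > 1 {
		tbl = "MultiTable"
	}
	return rw + tbl
}

func main() {
	t := txnInfo{
		shards: map[string]bool{"-80": true},
		tables: map[string]bool{"users": true, "orders": true},
		wrote:  true,
	}
	fmt.Println(shardDistribution(t), transactionType(t)) // SingleShard ReadWriteMultiTable
}
```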

@harshit-gangal
Member

Unrelated to my prior comment, but another unmistakable Metrics gap in perspective that can only be gained from the vtgate's POV:

TransactionsProcessedByShardDistribution

  • SingleShard
  • MultiShard

TransactionsProcessedByTransactionType

  • ReadOnlySingleTable
  • ReadOnlyMultiTable
  • ReadWriteSingleTable
  • ReadWriteMultiTable

There might be more to explore there, but given that we already track the transaction context, it feels relatively trivial to start gathering some meaningful numbers about these events.

Good news! We do have these metrics now.
The old one, which still exists, is QueriesRouted with Commit as the PlanType, and that is not good enough.

In v22 you will see CommitMode, which tells you whether the commit happened as Single, Multi, or 2pc.

We just added a boolean to track whether anything was modified inside the transaction.
We can add a metric to emit that as well :)

@harshit-gangal
Member

harshit-gangal commented Jan 24, 2025

I like this renewed focus! The main question/concern I would have remains around breaking down the Complex category, as well as other known problematic categories.

Indicates a need for further investigation or optimization if it appears frequently.

Performance impact of queries in the Complex category has the potential to be exponential, so even a relatively small number of queries, percentage-wise, would always warrant a closer look. How can we use the proposed dimensions to "zoom in" on this category to better understand what is happening? Does the proposal intend to provide answers for the new types of diagnostic questions these categories will result in?

To give a more practical example, how would we fill the blank in the following process?

  1. "Performance seems bad"
  2. Identify a meaningful subset of Complex queries as reported by vtgate, let's say it's 2% of all reported queries. We have historical metrics to show that it recently increased from being just 1%.
  3. ???
  4. "Performance is good"

One possible approach that doesn't rely purely on Metrics: We could implement something analogous to the MySQL Slow Query Log, i.e. a configurable filtered log stream, capturing only queries from specified buckets, capturing queries surpassing a specific execution time, or any other combination of filters. It could even include the actual plans used for the queries in question. Then diagnostic step 3 could be: 3. Capture the filtered subset of queries from the vtgate query logs matching the Complex category and analyze it for new and existing patterns.

Are there ways we could answer the question using purely Metrics?

It is not feasible to emit slow queries via metrics. We can use the query log stream or the plan stream, both of which exist today.
For PSDB, query insights already exists to show that.
For evaluation, I believe we can just look at the VTGate UI, search for the Complex plan type, and explore optimizations from there.
