
RFC: New vtgate Metrics #17585

Open
systay opened this issue Jan 21, 2025 · 7 comments

@systay
Collaborator

systay commented Jan 21, 2025

1. Summary

Vitess currently classifies query plan types in a way that is neither intuitive nor helpful for performance analysis. In particular, QueriesProcessed and QueriesRouted rely on plan-type designations that are inconsistent across different operators (e.g., IN, Concatenate, DDL, Reference, FkCascade, and InsertSelect). This proposal introduces two new metrics and deprecates the older, less-informative ones.

2. Motivation

  1. Inconsistent Plan-Type Metric

    • The plan type was derived from the root operator’s name, leading to wide variability in reported plan types.
    • Some operators (like Route) reported the route type, while others forwarded whatever their child operator returned.
  2. Limited Usefulness

    • Metrics such as QueriesProcessed and QueriesRouted provide only a coarse breakdown.
    • Deeper insight (e.g., how complex or how many shards were involved) is missing or buried in logs/other metrics.
  3. Need for Clarity

    • We want an easily understandable categorization of query complexity to better diagnose query performance in vtgate.
    • We also want to track how many shards each query touches, to inform sharding logic decisions and query optimization.

3. Proposed Changes

3.1 Deprecation

  • Metrics to Deprecate:
    • QueriesProcessed
    • QueriesRouted
  • These will be marked as deprecated but still exist for at least one release to allow for a smooth transition.

3.2 New Metric: QueriesProcessedByQueryType

We propose categorizing queries into eight distinct buckets to capture both common and potentially problematic execution patterns:

  1. Passthrough

    • The query is forwarded directly to a single shard without modification.
    • Minimal overhead in vtgate; the fastest possible path.
  2. MultiShard

    • The query is routed to multiple shards, but not all shards in the cluster.
    • vtgate may need to concatenate results but does not perform extensive operator logic beyond routing.
  3. Scatter

    • The query is sent to all shards.
    • Indicates potential performance overhead, as every shard must be involved in the request.
  4. Lookup

    • The query requires at least two calls: typically a vindex lookup first, followed by the main query.
    • A common pattern for partial fan-out or locating specific shard(s) through a lookup vindex.
  5. Join

    • The plan includes at least one join operator at vtgate level (e.g., join of two routes).
    • Useful for quickly spotting queries that might be combining data across multiple shards.
  6. Complex

    • Catches any plan more involved than the above categories (e.g., subqueries, nested operators, multi-stage pipelines).
    • Indicates a need for further investigation or optimization if it appears frequently.
  7. OnlineDDL

    • Vitess-managed DDL statements performed online (e.g., schema migrations orchestrated by vtgate).
    • Tracked separately to measure usage and performance impact of online operations.
  8. DirectDDL

    • DDL statements that are directly passed to the underlying MySQL instances.
    • Does not go through Vitess’s online migration flow.

Example Metric Name

  • QueriesProcessedByQueryType{queryType="Passthrough"}
  • QueriesProcessedByQueryType{queryType="Complex"}, etc.
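To make the bucket semantics concrete, here is a minimal Go sketch of a counter labeled by these eight categories. The type and function names are illustrative only; the real implementation would use Vitess's stats package rather than a bare array:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// QueryType mirrors the eight proposed buckets. Names are illustrative.
type QueryType int

const (
	Passthrough QueryType = iota
	MultiShard
	Scatter
	Lookup
	Join
	Complex
	OnlineDDL
	DirectDDL
	numQueryTypes
)

func (t QueryType) String() string {
	return [...]string{"Passthrough", "MultiShard", "Scatter", "Lookup",
		"Join", "Complex", "OnlineDDL", "DirectDDL"}[t]
}

// queryCounts is a stand-in for the real labeled counter
// (e.g. a CountersWithSingleLabel in the Vitess stats package).
var queryCounts [numQueryTypes]atomic.Int64

// recordQuery would be called once per query, after plan classification.
func recordQuery(t QueryType) {
	queryCounts[t].Add(1)
}

func main() {
	recordQuery(Passthrough)
	recordQuery(Passthrough)
	recordQuery(Complex)
	fmt.Printf("QueriesProcessedByQueryType{queryType=%q} = %d\n",
		Passthrough, queryCounts[Passthrough].Load())
}
```

The classification itself would happen at plan-build time, so each plan carries its bucket and the execution path only increments a counter.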

3.3 New Metric: QueriesProcessedByStatementType

  • Purpose: Categorize queries by the high-level SQL statement type.

Possible Categories (not exhaustive):

  • SELECT
  • INSERT
  • UPDATE
  • DELETE
  • SET
  • DDL (could be further subdivided if desired)
  • Others as needed (ALTER, CREATE, DROP, etc.)
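A rough sketch of the statement-type classification. A real implementation would classify from the parsed AST rather than raw SQL text; the function name and category strings here are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// statementType returns a coarse label for a SQL statement based on its
// leading keyword. This is a simplification for illustration: vtgate would
// derive this from the AST it already builds during planning.
func statementType(sql string) string {
	fields := strings.Fields(strings.ToUpper(strings.TrimSpace(sql)))
	if len(fields) == 0 {
		return "OTHER"
	}
	switch fields[0] {
	case "SELECT", "INSERT", "UPDATE", "DELETE", "SET":
		return fields[0]
	case "ALTER", "CREATE", "DROP", "RENAME", "TRUNCATE":
		return "DDL" // could be further subdivided if desired
	default:
		return "OTHER"
	}
}

func main() {
	fmt.Println(statementType("select * from t"))         // SELECT
	fmt.Println(statementType("ALTER TABLE t ADD c int")) // DDL
}
```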

3.4 New Histogram: “Shards Accessed per Query”

  • Purpose: Track how many shards a query invocation talks to.
  • Buckets: 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
  • Metric Name: ShardsAccessedHistogram (or similar).

This histogram shows the distribution of queries across shard counts, helping identify where queries may be fanning out more than expected.
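The bucketing logic is straightforward: a query is counted in the smallest bucket boundary that is at least the number of shards it touched. A sketch, assuming the powers-of-two boundaries proposed above:

```go
package main

import "fmt"

// shardBuckets are the proposed histogram boundaries.
var shardBuckets = []int{0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512}

// bucketFor returns the smallest boundary >= n, i.e. the bucket a query
// touching n shards would be counted in. Returns -1 past the last boundary;
// a real histogram would have an overflow (+Inf) bucket instead.
func bucketFor(n int) int {
	for _, b := range shardBuckets {
		if n <= b {
			return b
		}
	}
	return -1
}

func main() {
	for _, n := range []int{1, 3, 5, 100} {
		fmt.Printf("%d shards -> bucket %d\n", n, bucketFor(n))
	}
}
```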

4. Backward Compatibility

  • We will keep the old metrics running in parallel, marked as “DEPRECATED,” for at least one release cycle.
  • Developers and operators should be encouraged to migrate to the new metrics.

5. Open Questions

  1. Granularity: Do we need more nuanced categories within “Complex”?
  2. Bucket Sizes: Are the proposed histogram buckets (0, 1, 2, 4, …) adequate for most production workloads?
  3. Eventual Removal: What is the timeline for fully removing QueriesProcessed and QueriesRouted?
@systay
Collaborator Author

systay commented Jan 21, 2025

I believe that implementing this RFC would solve the following issue: #16391

@systay
Collaborator Author

systay commented Jan 22, 2025

I thought it would be useful to list the personas that might influence or benefit from these metrics, and that we should consider in the analysis. Here are the most common ones in the context of a system like Vitess:

  1. SRE / DevOps Engineer
  • Key Interests: Reliability, uptime, and scaling. They need quick insights into system health and clear signals for when to scale or investigate issues.
  • Metric Needs:
    1. High-level “are we stable or at risk?” metrics (throughput, errors, latency).
    2. Immediate detectability of anomalies (e.g., sudden spike in multi-shard queries).
    3. Trends for capacity planning and alert thresholds.
  2. Database Administrator (DBA)
  • Key Interests: Query performance, indexing strategies, schema design, and resource optimization.
  • Metric Needs:
    1. Detailed breakdown of query behaviors (scatter vs. single-shard vs. multi-operator).
    2. Histograms that reveal read/write patterns and fan-out.
    3. Ability to diagnose root causes of slow or blocking queries (though DBAs often have more system access than other roles).
  3. Application Developer
  • Key Interests: Ensuring queries are correct, efficient, and aligned with application logic. They may need feedback if their changes degrade or improve performance.
  • Metric Needs:
    1. Simple categories (e.g., which queries are “simple vs. complex”) to spot if an app refactor introduced heavier queries.
    2. Statement-type metrics (e.g., INSERT/SELECT spikes) to confirm expected usage patterns.
    3. Real-time visibility into the effect of changes without having to parse logs or run deeper queries.
  4. Customer Engineering (CE)
  • Key Interests: Diagnosing customer issues without direct access to logs or ability to run VEXPLAIN. Confirming that recommended fixes have been implemented.
  • Metric Needs:
    1. Snapshot “fingerprint” metrics showing plan/operator usage.
    2. Time-series data to validate changes over time (e.g., did problematic queries or plans decrease?).
    3. Enough granularity to detect newly introduced “bad” patterns or complex plans.

@systay
Collaborator Author

systay commented Jan 22, 2025

Updated the categories for QueriesProcessedByQueryType, adding two new ones: Join and MultiShard.

@lizztheblizz

I like this renewed focus! The main question/concern I would have remains around breaking down the Complex category, as well as other known problematic categories.

Indicates a need for further investigation or optimization if it appears frequently.

Performance impact of queries in the Complex category has the potential to be exponential, so even a relatively small number of queries, percentage-wise, would always warrant a closer look. How can we use the proposed dimensions to "zoom in" on this category to better understand what is happening? Does the proposal intend to provide answers for the new types of diagnostic questions these categories will result in?

To give a more practical example, how would we fill the blank in the following process?

  1. "Performance seems bad"
  2. Identify a meaningful subset of Complex queries as reported by vtgate, let's say it's 2% of all reported queries. We have historical metrics to show that it recently increased from being just 1%.
  3. ???
  4. "Performance is good"

One possible approach that doesn't rely purely on Metrics:
We could implement something analogous to the MySQL Slow Query Log, i.e. a configurable filtered log stream, capturing only queries from specified buckets, capturing queries surpassing a specific execution time, or any other combination of filters. It could even include the actual plans used for the queries in question. Then diagnostic step 3 could be:
3. Capture the filtered subset of queries from the vtgate query logs matching the Complex category and analyze it for new and existing patterns.

Are there ways we could answer the question using purely Metrics?
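A sketch of what such a filter predicate might look like (all names here are illustrative, not an actual vtgate configuration):

```go
package main

import "fmt"

// logEntry and logFilter sketch the configurable filtered log stream
// described above: keep only queries in the named buckets, or queries
// slower than a threshold.
type logEntry struct {
	queryType  string // e.g. "Complex", "Scatter"
	durationMs int
	sql        string
}

type logFilter struct {
	buckets      map[string]bool // buckets to always capture
	slowerThanMs int             // capture anything slower than this
}

func (f logFilter) keep(e logEntry) bool {
	return f.buckets[e.queryType] || e.durationMs >= f.slowerThanMs
}

func main() {
	f := logFilter{
		buckets:      map[string]bool{"Complex": true},
		slowerThanMs: 500,
	}
	fmt.Println(f.keep(logEntry{queryType: "Complex", durationMs: 10}))     // true
	fmt.Println(f.keep(logEntry{queryType: "Passthrough", durationMs: 20})) // false
}
```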

@lizztheblizz

Unrelated to my prior comment, but here is another unmistakable metrics gap: a perspective that can only be gained from the vtgate's POV:

TransactionsProcessedByShardDistribution

  • SingleShard
  • MultiShard

TransactionsProcessedByTransactionType

  • ReadOnlySingleTable
  • ReadOnlyMultiTable
  • ReadWriteSingleTable
  • ReadWriteMultiTable

There might be more to explore there, but given that we already track the transaction context, it feels relatively trivial to start gathering some meaningful numbers about these events.
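The categories above could be derived from state vtgate already tracks, roughly like this (a sketch; the type, field, and function names are hypothetical):

```go
package main

import "fmt"

// txnInfo stands in for the transaction state vtgate already maintains.
type txnInfo struct {
	shards map[string]bool // shards participating in the transaction
	tables map[string]bool // tables touched by its statements
	wrote  bool            // whether any statement modified data
}

// shardDistribution yields the TransactionsProcessedByShardDistribution label.
func shardDistribution(t txnInfo) string {
	if len(t.shards) > 1 {
		return "MultiShard"
	}
	return "SingleShard"
}

// transactionType yields the TransactionsProcessedByTransactionType label.
func transactionType(t txnInfo) string {
	rw := "ReadOnly"
	if t.wrote {
		rw = "ReadWrite"
	}
	tbl := "SingleTable"
	if len(t.tables) > 1 {
		tbl = "MultiTable"
	}
	return rw + tbl
}

func main() {
	t := txnInfo{
		shards: map[string]bool{"-80": true},
		tables: map[string]bool{"users": true, "orders": true},
		wrote:  true,
	}
	fmt.Println(shardDistribution(t), transactionType(t)) // SingleShard ReadWriteMultiTable
}
```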

@harshit-gangal
Member

Unrelated to my prior comment, but another unmistakable Metrics gap in perspective that can only be gained from the vtgate's POV:

TransactionsProcessedByShardDistribution

  • SingleShard
  • MultiShard

TransactionsProcessedByTransactionType

  • ReadOnlySingleTable
  • ReadOnlyMultiTable
  • ReadWriteSingleTable
  • ReadWriteMultiTable

There might be more to explore there, but given that we already track the transaction context, it feels relatively trivial to start gathering some meaningful numbers about these events.

Good news! We do have these metrics now.
The old one, which still exists, is QueriesRouted with Commit as the PlanType, and that is not good enough.

In v22 you will see CommitMode, which tells you whether the commit happened as Single, Multi, or 2pc.

We just added a boolean to track whether anything was modified inside the transaction.
We can add a metric to emit that as well :)

@harshit-gangal
Member

harshit-gangal commented Jan 24, 2025

I like this renewed focus! The main question/concern I would have remains around breaking down the Complex category, as well as other known problematic categories.

Indicates a need for further investigation or optimization if it appears frequently.

Performance impact of queries in the Complex category has the potential to be exponential, so even a relatively small number of queries, percentage-wise, would always warrant a closer look. How can we use the proposed dimensions to "zoom in" on this category to better understand what is happening? Does the proposal intend to provide answers for the new types of diagnostic questions these categories will result in?

To give a more practical example, how would we fill the blank in the following process?

  1. "Performance seems bad"
  2. Identify a meaningful subset of Complex queries as reported by vtgate, let's say it's 2% of all reported queries. We have historical metrics to show that it recently increased from being just 1%.
  3. ???
  4. "Performance is good"

One possible approach that doesn't rely purely on Metrics: We could implement something analogous to the MySQL Slow Query Log, i.e. a configurable filtered log stream, capturing only queries from specified buckets, capturing queries surpassing a specific execution time, or any other combination of filters. It could even include the actual plans used for the queries in question. Then diagnostic step 3 could be: 3. Capture the filtered subset of queries from the vtgate query logs matching the Complex category and analyze it for new and existing patterns.

Are there ways we could answer the question using purely Metrics?

It is not feasible to emit slow queries via metrics. We can use the query log stream or the plan stream, both of which exist today.
For PSDB, query insights already exists to show that.
For evaluation, I believe we can just look at the VTGate UI, search for the Complex plan type, and explore optimizations from there.
