[release-v2.6] [DOC] Add doc for compare function for metrics doc (#4035

) Co-authored-by: Martin Disibio <[email protected]> Co-authored-by: Kim Nylander <[email protected]>
grafana · Aug 30, 2024 · f924cbf · f924cbf
1 parent 54c9622
commit f924cbf
Showing 1 changed file with 71 additions and 20 deletions.
diff --git a/docs/sources/tempo/traceql/metrics-queries.md b/docs/sources/tempo/traceql/metrics-queries.md
@@ -31,32 +31,50 @@ This capability is available in Grafana Cloud and Grafana (10.4 and newer).
 
 To enable TraceQL metrics, refer to [Configure TraceQL metrics](https://grafana.com/docs/tempo/latest/operations/traceql-metrics/) for more information.
 
+## Exemplars
+
+Exemplars are a powerful feature of TraceQL metrics.
+They allow you to see an exact trace that contributed to a given metric value.
+This is particularly useful when you want to understand why a given metric is high or low.
+
+Exemplars are available in TraceQL metrics for all functions.
+To get exemplars, you need to configure it in the query-frontend with the parameter `query_frontend.metrics.exemplars`,
+or pass a query hint in your query.
+
+```
+{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target) with (exemplars=true)
+```
+
 ## Functions
 
 TraceQL supports include `rate`, `count_over_time`, `quantile_over_time`, and `histogram_over_time` functions.
 These functions can be added as an operator at the end of any TraceQL query.
 
 `rate`
-: calculates the number of matching spans per second
-  
+: Calculates the number of matching spans per second
+
 `count_over_time`
-: counts the number of matching spans per time interval (see the `step` API parameter)
- 
+: Counts the number of matching spans per time interval (see the `step` API parameter)
+
 `quantile_over_time`
-: the quantile of the values in the specified interval
-  
+: The quantile of the values in the specified interval
+
 `histogram_over_time`
-: evaluate frequency distribution over time. Example: `histogram_over_time(duration) by (span.foo)`
+: Evaluate frequency distribution over time. Example: `histogram_over_time(duration) by (span.foo)`
+
+`compare`
+: Used to split the stream of spans into two groups: a selection and a baseline. The function returns time-series for all attributes found on the spans to highlight the differences between the two groups.
 
-## The `rate` function
+### The `rate` function
 
 The following query shows the rate of errors by service and span name.
 
 ```
 { status = error } | rate() by (resource.service.name, name)
 ```
 
-This example calculates the rate of the erroring spans coming from the service `foo`. Rate is a `spans/sec` quantity.
+This example calculates the rate of the erroring spans coming from the service `foo`.
+Rate is a `spans/sec` quantity.
 
 ```
 { resource.service.name = "foo" && status = error } | rate()
@@ -69,12 +87,14 @@ Combined with the `by()` operator, this can be even more powerful.
 ```
 
 This example still rates the erroring spans in the service `foo` but the metrics have been broken
-down by HTTP route. This might let you determine that `/api/sad` had a higher rate of erroring
+down by HTTP route.
+This might let you determine that `/api/sad` had a higher rate of erroring
 spans than `/api/happy`, for example.
 
 ### The `quantile_over_time` and `histogram_over_time` functions
 
-The `quantile_over_time()` and `histogram_over_time()` functions let you aggregate numerical values, such as the all important span duration. Notice that you can specify multiple quantiles in the same query.
+The `quantile_over_time()` and `histogram_over_time()` functions let you aggregate numerical values, such as the all important span duration.
+You can specify multiple quantiles in the same query.
 
 ```
 { name = "GET /:endpoint" } | quantile_over_time(duration, .99, .9, .5)
@@ -94,16 +114,47 @@ To demonstrate this flexibility, consider this nonsensical quantile on `span.htt
 { name = "GET /:endpoint" } | quantile_over_time(span.http.status_code, .99, .9, .5)
 ```
 
-## Exemplars
+### The `compare` function
 
-Exemplars are a powerful feature of TraceQL metrics.
-They allow you to see an exact trace that contributed to a given metric value.
-This is particularly useful when you want to understand why a given metric is high or low.
+This adds a new metrics function `compare` which is used to split the stream of spans into two groups: a selection and a baseline.
+It returns time-series for all attributes found on the spans to highlight the differences between the two groups.
+This is a powerful function that is best understood by looking at example outputs below:
 
-Exemplars are available in TraceQL metrics for all functions.
-To get exemplars, you need to configure it in the query-frontend with the parameter `query_frontend.metrics.exemplars`,
-or pass a query hint in your query.
+The function is used like other metrics functions: when it's placed after any search query, and converts it into a metrics query:
+`...any spanset pipeline... | compare({subset filters}, <topN>, <start timestamp>, <end timestamp>)`
 
+Example:
+```
+{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})
+```
+This function is generally run as an instant query.  It may return may exceed gRPC payloads when run as a query range.
+#### Parameters
+
+The `compare` function has four parameters:
+
+1. Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If it matches, then the span is considered to be part of the selection. Otherwise, it is part of the baseline.  Common filters are expected to be things like `{status=error}` (what is different about errors?) or `{duration>1s}` (what is different about slow spans?)
+
+1. Optional. The second parameter is the top `N` values to return per attribute. If an attribute exceeds this limit in either the selection group or baseline group, then only the top `N` values (based on frequency) are returned, and an error indicator for the attribute is included output (see below).  Defaults to `10`.
+
+1. Optional. Start and End timestamps in Unix nanoseconds, which can be used to constrain the selection window by time, in addition to the filter. For example, the overall query could cover the past hour, and the selection window only a 5 minute time period in which there was an anomaly. These timestamps must both be given, or neither.
+
+#### Output
+
+The outputs are flat time-series for each attribute/value found in the spans.
+
+Each series has a label `__meta_type` which denotes which group it is in, either `selection` or `baseline`.
+
+Example output series:
+```
+{ __meta_type="baseline", resource.cluster="prod" } 123
+{ __meta_type="baseline", resource.cluster="qa" } 124
+{ __meta_type="selection", resource.cluster="prod" } 456   <--- significant difference detected
+{ __meta_type="selection", resource.cluster="qa" } 125
+{ __meta_type="selection", resource.cluster="dev"} 126  <--- cluster=dev was found in the highlighted spans but not in the baseline
+```
+
+When an attribute reaches the topN limit, there will also be present an error indicator.
+This example means the attribute `resource.cluster` had too many values.
+```
+{ __meta_error="__too_many_values__", resource.cluster=<nil> }
 ```
-{ name = "GET /:endpoint" } | quantile_over_time(duration, .99) by (span.http.target) with (exemplars=true)
-```