Skip to content

ESQL basic exponential histogram support #137549

@JonasKunz

Description

@JonasKunz

Add basic ESQL support for the new exponential_histogram ES field type ( in tech-preview):

Stretch for 9.3 / likely post 9.3:

  • Add support for COUNT aggregation
  • Maybe add a new VALUE_COUNT aggregation (depends on the semantics of COUNT)
  • Add support for VARIANCE aggregation
  • Add support for STD_DEV aggregation
  • Add support for MEDIAN_ABSOLUTE_DEVIATION aggregation
  • Add support for == operator
  • Add support for string conversion (TO_STRING, TO_EXPONENTIAL_HISTOGRAM)

Details

For 9.3, we are putting the new exponential_histogram field type into tech-preview and would also like to provide the means to query and analyze such fields using ES|QL.
While this issue focuses on exponential histograms, the semantics defined here for aggregations and functions should apply in the same way to other numeric value sketch types (else e.g. the histogram field type with T-Digests).

Histograms are a form of pre-aggregated metrics. E.g. OpenTelemetry SDKs periodically export histograms, each individual histogram summarizing a set observations.
The typical observability use-case for histograms is observing response times, which I'll use as example throughout this explanation.

So for example, an OpenTelemetry SDK monitors the HTTP response times of an endpoint and has an export interval of 10 seconds.
So every 10 seconds, a histogram is ingested into ES summarizing all the response times that were observed in those 10 seconds.
In case of exponential histograms, this histogram consists of

  • The sum of all observed response times
  • The minimum observed response time
  • The maximum observed response time
  • A set of buckets (growing exponentially in their width, hence the name) with counts for approximating the distribution
  • The total number of observations, which is exactly the sum of the counts of all buckets. E.g. if 17 requests occured in the last 10 second window, this will be 17

Note that exponential histograms come with the exact sum, min and max. In contrast to this our classic histogram type does not store these currently.
However, they can be approximated based on the buckets (or rather centroids in case of T-Digest).

Scalar functions on histograms

Because histograms essentially represent a set of observations instead of a single observation, it does make sense to define scalar functions for them which otherwise typically would be aggregations.
For example it is possible and makes sense to compute a percentile on a single histogram value, while the same doesn't make sense for a double.

The plan for this issue is to make use of this fact and define several scalar functions on histograms, as they are easy to develop and maintain and can be used to implement the desired aggregations via surrogates.
Note that these functions are currently not planned to be exposed to users, they are purely used in surrogates.
We can later expose them if we want to, but then we should go through the full language design process for them. Right now they should be considered just an implementation detail.

We'll add the following internal scalar functions as building blocks for the aggregations:

  • HISTOGRAM_PERCENTILE(histogram, percentile): estimates the given percentile for the given histogram
  • HISTOGRAM_SUM(histogram): extracts (or in case of TDigest approximates) the sum of all observations used to construct this histogram
  • HISTOGRAM_VALUE_COUNT(histogram): extracts the number of observations used to construct this histogram
  • HISTOGRAM_MIN(histogram): extracts (or in case of TDigest approximates) the minimum of all observations used to construct this histogram
  • HISTOGRAM_MAX(histogram): extracts (or in case of TDigest approximates) the maximum of all observations used to construct this histogram

Note that we might combine the implementation of sum of this functions, as they will only used as surrogates at the moment anyway.
E.g. we might implement a single EXTRACT_HISTOGRAM_COMPONENT(histogram, enumValue) function for sum/min/max/valueCount instead of individual functions, similar to how it was done for aggregate metric double.

Percentile (and merge) aggregation

In order to estimate percentiles, we need to merge histograms first before estimating the percentiles.
So we'll add a MERGE aggregation, which merges all input histograms into a single histogram of the same type.
Similar to the scalar functions, I'm planning on not exposing this aggregation yet to end users. Instead it will be used as a surrogate to implement the existing PERCENTILE aggregation:

PERCENTILE(histogram, percentile) = HISTOGRAM_PERCENTILE(MERGE(histogram), percentile).

Merge is a lossy operation, as we typically have a restricted budget for buckets.
Therefore, it should be avoided where possible. E.g. we shouldn't implement VARIANCE(histogram) as HISTOGRAM_VARIANCE(MERGE(histogram)).

Sum, Avg, min and max aggregation

I would propose to semantically define AVG, SUM, MIN and MAX on histograms as applying the corresponding aggregation on all observations summarized by the histograms combined.

As a result, we would implementthe aggregations using the following surrogates:

  • AVG(histogram) = SUM(HISTOGRAM_SUM(histogram)) / SUM(HISTOGRAM_VALUE_COUNT(histogram))
  • SUM(histogram) = SUM(HISTOGRAM_SUM(histogram))
  • MIN(histogram) = MIN(HISTOGRAM_MIN(histogram))
  • MAX(histogram) = MAX(HISTOGRAM_MAX(histogram))

I'd propose this behaviour as it leads to beginner-friendly behaviour and queries:

FROM http_metrics
| STATS PERCENTILE(responseTime, 90), AVG(responseTime), MIN(responseTime), MAX(responseTime) BY endpoint

This query gives you the percentile, average and min and max of all observed response times.

Count (and maybe Value_Count) aggregation

Count is a more difficult aggregation to define, as there are two possible semantics:

  • Count being a "count the number of non-null rows" function: COUNT will return the number of histograms
  • Count aligning to the AVG and friends definition proposed above: COUNT(histogram) = SUMN(HISTOGRAM_COUNT(histogram)) will return the number of observations (!= number of histograms)

Coming from the SQL world, my gut feeling here is that the COUNT semantic should be the first alternative suggested as it shouldn't have special behaviour per type.
That would mean that we need an alternative function to count the number of observations, e.g. to get the number of requests: Something like VALUE_COUNT.

I think this needs thought and input from the language experts, therefore I'm planning on neither COUNT nor VALUE_COUNT making it into the 9.3 tech preview.

Multi-value support

Both exponential_histograms and histogram ES fields don't support multi values.
Due to additional implementation effort required, the current exponential histogram block implementation also does not support multi values either.

I don't think this is required for the initial, basic analytics use-cases we intended to support in the tech-preview.
If there are any important technical reasons requiring this support, it would be great to know so that we could re-prioritize.

Equality operator

The equality operator would simply compare all histogram components for equality: E.g. sum, count, min, max and all buckets must be equal.
Note that this does not imply that the raw observations were actually equal, therefore I doubt this operators usefulness in practice.

That's why I'm not prioritizing this for the 9.3 tech preview.

String conversion (TO_STRING, TO_EXPONENTIAL_HISTOGRAM)

I would propose to simply use the JSON-formatting also used for the field type as string representation.
Again, I think this is of limited use for typical analytical use-cases, so not prioritizing this for the 9.3 tech preview.
If there are any technical reasons or you think I'm wrong here, please let me know to re-prioritize.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions