Description
Add basic ESQL support for the new `exponential_histogram` ES field type (in tech-preview):
- Add an ESQL data and block type
- Add a blockloader
- Add support for `PERCENTILE` aggregation (ESQL: Implement percentiles agg for exponential_histogram #137720)
- Add support for `MEDIAN` aggregation (ESQL: Add support for exponential_histogram in median #138402)
- Add support for `AVG` aggregation (ESQL: Add exponential_histogram support to AVG and SUM #138177)
- Add support for `MIN` aggregation (ESQL: Implement MIN and MAX agg for exponential histograms #138091)
- Add support for `MAX` aggregation (ESQL: Implement MIN and MAX agg for exponential histograms #138091)
- Add support for `SUM` aggregation (ESQL: Add exponential_histogram support to AVG and SUM #138177)
- Add support for `CASE` (as return value) (ESQL: Add support for exponential_histogram as case value #138300)
- Add support for `COALESCE` (ESQL: Implement coalesce for exponential histograms #137897)
Stretch for 9.3 / likely post 9.3:
- Add support for `COUNT` aggregation
- Maybe add a new `VALUE_COUNT` aggregation (depends on the semantics of `COUNT`)
- Add support for `VARIANCE` aggregation
- Add support for `STD_DEV` aggregation
- Add support for `MEDIAN_ABSOLUTE_DEVIATION` aggregation
- Add support for the `==` operator
- Add support for string conversion (`TO_STRING`, `TO_EXPONENTIAL_HISTOGRAM`)
Details
For 9.3, we are putting the new exponential_histogram field type into tech-preview and would also like to provide the means to query and analyze such fields using ES|QL.
While this issue focuses on exponential histograms, the semantics defined here for aggregations and functions should apply in the same way to other numeric value sketch types (e.g. the `histogram` field type with T-Digests).
Histograms are a form of pre-aggregated metrics: e.g. OpenTelemetry SDKs periodically export histograms, each individual histogram summarizing a set of observations.
The typical observability use-case for histograms is observing response times, which I'll use as the example throughout this explanation.
For example, suppose an OpenTelemetry SDK monitors the HTTP response times of an endpoint and has an export interval of 10 seconds.
Then every 10 seconds, a histogram is ingested into ES summarizing all the response times that were observed in those 10 seconds.
In the case of exponential histograms, such a histogram consists of:
- The sum of all observed response times
- The minimum observed response time
- The maximum observed response time
- A set of buckets (growing exponentially in their width, hence the name) with counts for approximating the distribution
- The total number of observations, which is exactly the sum of the counts of all buckets. E.g. if 17 requests occurred in the last 10-second window, this will be 17
Note that exponential histograms come with the exact sum, min and max. In contrast, our classic histogram type does not currently store these.
However, they can be approximated based on the buckets (or rather the centroids in the case of T-Digest).
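To make the "exponentially growing buckets" concrete, here is a minimal sketch of the bucket boundaries, assuming the OpenTelemetry-style base of `2^(2^-scale)`; the function and parameter names are purely illustrative, not the actual ES implementation:

```python
# Sketch of exponential histogram bucket boundaries, assuming the
# OpenTelemetry-style base of 2^(2^-scale). Illustrative only.

def bucket_bounds(index: int, scale: int) -> tuple[float, float]:
    """Bucket `index` covers the interval (base**index, base**(index + 1)]."""
    base = 2.0 ** (2.0 ** -scale)
    return base ** index, base ** (index + 1)

# At scale 0 the base is 2, so each bucket is twice as wide as its predecessor:
print(bucket_bounds(0, 0))  # (1.0, 2.0)
print(bucket_bounds(1, 0))  # (2.0, 4.0)
print(bucket_bounds(2, 0))  # (4.0, 8.0)
```

Higher scales shrink the base towards 1 and therefore yield finer buckets at the cost of needing more of them for the same value range.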
Scalar functions on histograms
Because histograms essentially represent a set of observations instead of a single observation, it makes sense to define scalar functions on them that would otherwise typically be aggregations.
For example it is possible and makes sense to compute a percentile on a single histogram value, while the same doesn't make sense for a double.
The plan for this issue is to make use of this fact and define several scalar functions on histograms, as they are easy to develop and maintain and can be used to implement the desired aggregations via surrogates.
Note that these functions are currently not planned to be exposed to users, they are purely used in surrogates.
We can later expose them if we want to, but then we should go through the full language design process for them. Right now they should be considered just an implementation detail.
We'll add the following internal scalar functions as building blocks for the aggregations:
- `HISTOGRAM_PERCENTILE(histogram, percentile)`: estimates the given percentile for the given histogram
- `HISTOGRAM_SUM(histogram)`: extracts (or in the case of T-Digest approximates) the sum of all observations used to construct this histogram
- `HISTOGRAM_VALUE_COUNT(histogram)`: extracts the number of observations used to construct this histogram
- `HISTOGRAM_MIN(histogram)`: extracts (or in the case of T-Digest approximates) the minimum of all observations used to construct this histogram
- `HISTOGRAM_MAX(histogram)`: extracts (or in the case of T-Digest approximates) the maximum of all observations used to construct this histogram
Note that we might combine the implementation of some of these functions, as they will only be used as surrogates at the moment anyway.
E.g. we might implement a single `EXTRACT_HISTOGRAM_COMPONENT(histogram, enumValue)` function for sum/min/max/valueCount instead of individual functions, similar to how it was done for `aggregate_metric_double`.
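A rough sketch of that single-extractor alternative: one internal function dispatching on an enum value instead of separate `HISTOGRAM_SUM`/`HISTOGRAM_MIN`/`HISTOGRAM_MAX`/`HISTOGRAM_VALUE_COUNT` implementations. The enum and the dict-shaped histogram here are assumptions for illustration only:

```python
# Single-extractor sketch: one function for all histogram components.
# The Component enum and dict-shaped histogram are illustrative assumptions.
from enum import Enum

class Component(Enum):
    SUM = "sum"
    MIN = "min"
    MAX = "max"
    VALUE_COUNT = "value_count"

def extract_histogram_component(histogram: dict, component: Component):
    # All four extractions share one implementation; only the key differs.
    return histogram[component.value]

h = {"sum": 42.0, "min": 1.0, "max": 20.0, "value_count": 5}
print(extract_histogram_component(h, Component.SUM))  # 42.0
print(extract_histogram_component(h, Component.VALUE_COUNT))  # 5
```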
Percentile (and merge) aggregation
In order to estimate percentiles, we need to merge histograms first before estimating the percentiles.
So we'll add a MERGE aggregation, which merges all input histograms into a single histogram of the same type.
Similar to the scalar functions, I'm planning on not exposing this aggregation yet to end users. Instead it will be used as a surrogate to implement the existing PERCENTILE aggregation:
`PERCENTILE(histogram, percentile) = HISTOGRAM_PERCENTILE(MERGE(histogram), percentile)`
Merge is a lossy operation, as we typically have a restricted budget for buckets.
Therefore, it should be avoided where possible. E.g. we shouldn't implement `VARIANCE(histogram)` as `HISTOGRAM_VARIANCE(MERGE(histogram))`.
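The merge-then-estimate surrogate can be sketched as follows. The bucket layout (`{bucket_index: count}` at a shared scale) and the midpoint-based estimate are simplifying assumptions, not the actual ES data structures or algorithm:

```python
# Sketch of PERCENTILE(histogram, p) = HISTOGRAM_PERCENTILE(MERGE(histogram), p).
# Buckets are modeled as {bucket_index: count} at a shared scale; illustrative only.
from collections import Counter

def merge(histograms):
    """MERGE: combine bucket counts of histograms sharing the same scale."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged

def histogram_percentile(buckets, pct, scale=0):
    """HISTOGRAM_PERCENTILE: walk cumulative counts, return a bucket midpoint."""
    base = 2.0 ** (2.0 ** -scale)
    target = pct / 100.0 * sum(buckets.values())
    seen = 0
    for index in sorted(buckets):
        seen += buckets[index]
        if seen >= target:
            return (base ** index + base ** (index + 1)) / 2

h1 = {0: 5, 1: 3}  # 5 observations in (1, 2], 3 in (2, 4]
h2 = {1: 2}        # 2 observations in (2, 4]
print(histogram_percentile(merge([h1, h2]), 90))  # 3.0
```

Merging first is what makes the estimate reflect all observations across rows; estimating per row and then aggregating the estimates would give a different (and generally wrong) answer.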
Sum, Avg, min and max aggregation
I would propose to semantically define AVG, SUM, MIN and MAX on histograms as applying the corresponding aggregation on all observations summarized by the histograms combined.
As a result, we would implement the aggregations using the following surrogates:
- `AVG(histogram) = SUM(HISTOGRAM_SUM(histogram)) / SUM(HISTOGRAM_VALUE_COUNT(histogram))`
- `SUM(histogram) = SUM(HISTOGRAM_SUM(histogram))`
- `MIN(histogram) = MIN(HISTOGRAM_MIN(histogram))`
- `MAX(histogram) = MAX(HISTOGRAM_MAX(histogram))`
I'd propose these semantics as they lead to beginner-friendly behaviour and queries:
```esql
FROM http_metrics
| STATS PERCENTILE(responseTime, 90), AVG(responseTime), MIN(responseTime), MAX(responseTime) BY endpoint
```
This query gives you the percentile, average, min and max of all observed response times.
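A tiny numeric check of the `AVG` surrogate above: the result is the mean over all raw observations, not the mean of per-histogram means. The field names here are assumptions for illustration:

```python
# Toy check of AVG(histogram) = SUM(HISTOGRAM_SUM) / SUM(HISTOGRAM_VALUE_COUNT).
# Field names are illustrative assumptions.

histograms = [
    {"sum": 30.0, "value_count": 3},  # e.g. the observations 5, 10, 15
    {"sum": 20.0, "value_count": 1},  # e.g. the observation 20
]

avg = sum(h["sum"] for h in histograms) / sum(h["value_count"] for h in histograms)
print(avg)  # 12.5 -- the mean of all four raw observations

# Naively averaging the per-histogram means would instead give
# (10.0 + 20.0) / 2 = 15.0, over-weighting the window with fewer observations.
```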
Count (and maybe Value_Count) aggregation
Count is a more difficult aggregation to define, as there are two possible semantics:
- Count being a "count the number of non-null rows" function: `COUNT` will return the number of histograms
- Count aligning to the `AVG`-and-friends definition proposed above: `COUNT(histogram) = SUM(HISTOGRAM_VALUE_COUNT(histogram))` will return the number of observations (!= number of histograms)
Coming from the SQL world, my gut feeling is that `COUNT` should follow the first alternative, as it shouldn't have special behaviour per type.
That would mean we need an alternative function to count the number of observations, e.g. to get the number of requests: something like `VALUE_COUNT`.
I think this needs thought and input from the language experts, therefore I'm planning on neither `COUNT` nor `VALUE_COUNT` making it into the 9.3 tech preview.
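The difference between the two candidate semantics in a toy example, assuming three ingested histograms summarizing 17, 5 and 8 observations respectively:

```python
# The two candidate COUNT semantics on three histograms with 17, 5 and 8
# observations each. Purely illustrative.
observation_counts = [17, 5, 8]

count_rows = len(observation_counts)          # semantic 1: number of histogram rows
count_observations = sum(observation_counts)  # semantic 2: number of observations
print(count_rows, count_observations)  # 3 30
```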
Multi-value support
Neither the `exponential_histogram` nor the `histogram` ES field type supports multi-values.
Due to the additional implementation effort required, the current exponential histogram block implementation does not support multi-values either.
I don't think this is required for the initial, basic analytics use-cases we intend to support in the tech-preview.
If there are any important technical reasons requiring this support, it would be great to know so that we could re-prioritize.
Equality operator
The equality operator would simply compare all histogram components for equality: E.g. sum, count, min, max and all buckets must be equal.
Note that this does not imply that the raw observations were actually equal, therefore I doubt this operator's usefulness in practice.
That's why I'm not prioritizing this for the 9.3 tech preview.
String conversion (`TO_STRING`, `TO_EXPONENTIAL_HISTOGRAM`)
I would propose to simply use the JSON formatting already used for the field type as the string representation.
Again, I think this is of limited use for typical analytical use-cases, so not prioritizing this for the 9.3 tech preview.
If there are any technical reasons or you think I'm wrong here, please let me know to re-prioritize.