Improve overlap percent estimation for low-density ranges in StatisticRange #27570

anton-kutuzov · 2025-12-07T19:06:08Z

Description

Join order can be misestimated due to the uniform distribution assumption in StatisticRange.overlapPercentWith().

When a column has a very wide numeric range but few distinct values (e.g., distinctValues = 14, low = 1, high = 3.6e9), the current overlap estimation becomes extremely small (e.g., 8.19e-10), underestimating join cardinalities.

Example:

SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.eid = t2.eid
WHERE CAST(event_date AS DATE) = DATE '2025-09-07'
  AND t1.platform_id IN (1, 2, 3, 4);

table1 is large and table2 is small.
The column platform_id in table1 has 14 distinct values, with low = 1 and high = 3,662,098,119.
In this case, StatisticRange.overlapPercentWith() estimates the overlap as
(4 - 1) / (3,662,098,119 - 1) ≈ 8.19e-10,
which effectively means "all rows are filtered out".
In reality, the filter IN (1, 2, 3, 4) should keep roughly 4 out of 14 values (~29%).
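For reference, the two numbers above can be reproduced with a few lines of Java. This is only an arithmetic sketch: the class and method names are hypothetical and nothing here calls Trino's StatisticRange.

```java
final class OverlapArithmetic
{
    private OverlapArithmetic() {}

    // Uniform-distribution estimate: length of the intersection divided by
    // the length of the column's [low, high] range.
    static double uniformOverlap(double low, double high, double filterLow, double filterHigh)
    {
        double intersectLow = Math.max(low, filterLow);
        double intersectHigh = Math.min(high, filterHigh);
        return Math.max(0.0, intersectHigh - intersectLow) / (high - low);
    }

    // NDV-based estimate: fraction of the column's distinct values that the
    // filter can match, capped at 1.
    static double ndvOverlap(double filterDistinctValues, double columnDistinctValues)
    {
        return Math.min(filterDistinctValues, columnDistinctValues) / columnDistinctValues;
    }

    public static void main(String[] args)
    {
        // platform_id: low = 1, high = 3,662,098,119, 14 distinct values; filter IN (1, 2, 3, 4)
        double uniform = uniformOverlap(1, 3_662_098_119.0, 1, 4);
        double ndv = ndvOverlap(4, 14);
        System.out.println("uniform estimate:   " + uniform);
        System.out.println("NDV-based estimate: " + ndv);
    }
}
```

The uniform estimate is about nine orders of magnitude below the NDV-based one, which is the gap this change targets.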

Solution:
Introduce a density check, density = distinctValues / (high - low), and combine the uniform overlap with an NDV-based estimate when the density is low.
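A minimal sketch of that idea follows, assuming a simple two-branch form: use the NDV ratio when the lower of the two densities falls below a threshold, otherwise keep the uniform estimate. The class name, the method signature, and the threshold value are all hypothetical (the actual value of DENSITY_HEURISTIC_THRESHOLD is not shown in this thread), and the NaN, infinite-range, and single-point cases handled by the real method are left out.

```java
final class DensityHeuristicSketch
{
    // Hypothetical value chosen for illustration only.
    static final double DENSITY_THRESHOLD = 1e-3;

    private DensityHeuristicSketch() {}

    // Distinct values per unit of range length.
    static double density(double distinctValues, double low, double high)
    {
        return distinctValues / (high - low);
    }

    // Fraction of "this" range assumed to overlap with "other".
    static double overlapFraction(
            double low, double high, double distinctValues,
            double otherLow, double otherHigh, double otherDistinctValues)
    {
        double length = high - low;
        double intersect = Math.max(0.0, Math.min(high, otherHigh) - Math.max(low, otherLow));
        if (intersect <= 0 || length <= 0) {
            return 0.0;
        }
        double minDensity = Math.min(
                density(distinctValues, low, high),
                density(otherDistinctValues, otherLow, otherHigh));
        if (minDensity < DENSITY_THRESHOLD) {
            // Sparse range: the range length says little about where the values are,
            // so fall back to the ratio of distinct values.
            return Math.min(distinctValues, otherDistinctValues) / distinctValues;
        }
        // Dense range: keep the uniform-distribution estimate.
        return intersect / length;
    }
}
```

With the numbers from the description, overlapFraction(1, 3_662_098_119.0, 14, 1, 4, 4) takes the sparse branch and returns 4/14, while a dense column such as [1, 100] with 100 distinct values still gets the uniform estimate 3/99.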

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

cla-bot · 2025-12-07T19:06:10Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Anton Kutuzov.
This is most likely caused by a git client misconfiguration; please make sure to:

Check whether your git client is configured with an email to sign commits: git config --list | grep email
If not, set it up using: git config --global user.email email@example.com
Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

raunaqmorarka · 2025-12-09T18:58:57Z

core/trino-main/src/main/java/io/trino/cost/StatisticRange.java

+            double otherDensity = other.distinctValues / other.length();
+            double minDensity = minExcludeNaN(thisDensity, otherDensity);
+
+            if (!isNaN(thisDensity) && !isNaN(otherDensity)


!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)
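Whether the two conditions are interchangeable depends on how minExcludeNaN treats a single NaN argument. The sketch below assumes it returns the other argument in that case, as the name suggests; the helper is a local stand-in written for illustration, not Trino's implementation, and the class and method names are hypothetical.

```java
final class NanCheckSketch
{
    private NanCheckSketch() {}

    // Assumed behavior: ignore a NaN argument, return NaN only if both are NaN.
    static double minExcludeNaN(double a, double b)
    {
        if (Double.isNaN(a)) {
            return b;
        }
        if (Double.isNaN(b)) {
            return a;
        }
        return Math.min(a, b);
    }

    // Condition as written in the diff.
    static boolean bothKnown(double thisDensity, double otherDensity)
    {
        return !Double.isNaN(thisDensity) && !Double.isNaN(otherDensity);
    }

    // Condition as proposed in the review comment.
    static boolean minKnown(double thisDensity, double otherDensity)
    {
        return !Double.isNaN(minExcludeNaN(thisDensity, otherDensity));
    }
}
```

Under that assumption the two checks agree when both densities are known or both are NaN, and differ when exactly one is NaN: bothKnown is false while minKnown is true.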

raunaqmorarka · 2025-12-09T19:02:31Z

core/trino-main/src/main/java/io/trino/cost/StatisticRange.java

+            if (!isNaN(thisDensity) && !isNaN(otherDensity)
+                    && isFinite(length()) && isFinite(other.length())
+                    && minDensity < DENSITY_HEURISTIC_THRESHOLD) {
+                return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;


Can this be return min(other.distinctValues / this.distinctValues, 1);?
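A quick way to compare the two forms, assuming both distinct-value counts are positive and not NaN (so that minExcludeNaN behaves like a plain min). The class and method names are hypothetical.

```java
final class ReturnExpressionSketch
{
    private ReturnExpressionSketch() {}

    // Form in the diff, with minExcludeNaN replaced by Math.min for non-NaN inputs.
    static double current(double thisDistinctValues, double otherDistinctValues)
    {
        return Math.min(thisDistinctValues, otherDistinctValues) / thisDistinctValues;
    }

    // Form proposed in the review comment.
    static double suggested(double thisDistinctValues, double otherDistinctValues)
    {
        return Math.min(otherDistinctValues / thisDistinctValues, 1);
    }
}
```

For positive inputs both forms give other / this when other is smaller and 1 otherwise; they can only diverge for NaN or zero counts, which this sketch does not cover.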

raunaqmorarka · 2025-12-09T19:06:50Z

core/trino-main/src/main/java/io/trino/cost/StatisticRange.java

        }
+
        if (lengthOfIntersect > 0) {
+            double thisDensity = this.distinctValues / length();


Please add a code comment explaining this section.

raunaqmorarka · 2025-12-09T19:42:57Z

core/trino-main/src/main/java/io/trino/cost/StatisticRange.java

+            double minDensity = minExcludeNaN(thisDensity, otherDensity);
+
+            if (!isNaN(thisDensity) && !isNaN(otherDensity)
+                    && isFinite(length()) && isFinite(other.length())


Why do we check that the lengths are finite?
I think we want to skip the lengthOfIntersect == length() case here.

anton-kutuzov requested a review from raunaqmorarka December 7, 2025 19:06

anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from e52a475 to 9479bf0 Compare December 7, 2025 19:07

cla-bot bot added the cla-signed label Dec 7, 2025

Commit bf0ef28: Improve overlap percent estimation for low-density ranges in StatisticRange

anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from 9479bf0 to bf0ef28 Compare December 7, 2025 19:44

wendigo requested a review from findepi December 9, 2025 12:38

raunaqmorarka reviewed Dec 9, 2025

raunaqmorarka requested review from martint and sopel39 December 9, 2025 19:43
