@anton-kutuzov
Contributor

Description

Join order can be misestimated due to the uniform distribution assumption in StatisticRange.overlapPercentWith().

When a column has a very wide numeric range but few distinct values (e.g., distinctValues = 14, low = 1, high = 3.6e9), the current overlap estimation becomes extremely small (e.g., 8.19e-10), underestimating join cardinalities.

Example:

SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.eid = t2.eid
WHERE CAST(event_date AS DATE) = DATE '2025-09-07'
  AND t1.platform_id IN (1, 2, 3, 4);

table1 is large and table2 is small.
The column platform_id in table1 has 14 distinct values, with low = 1 and high = 3,662,098,119.
In this case, the method StatisticRange.overlapPercentWith() estimates the overlap as
(4 - 1) / (3,662,098,119 - 1) ≈ 8.19e-10
which effectively means “all rows are filtered out”.
But in reality, the filter IN (1,2,3,4) should keep roughly 4 out of 14 values (~29%).
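
The arithmetic above can be reproduced with a small standalone sketch. The class and method names below (`OverlapEstimateSketch`, `uniformOverlap`, `ndvOverlap`) are hypothetical and are not part of StatisticRange; they only contrast the range-based estimate with a distinct-value-based one, assuming the filter IN (1, 2, 3, 4) is summarized as the range [1, 4] with 4 distinct values.

```java
final class OverlapEstimateSketch
{
    private OverlapEstimateSketch() {}

    // Uniform-distribution estimate: the fraction of the column's value range
    // [columnLow, columnHigh] that is covered by [filterLow, filterHigh].
    // Assumes columnHigh > columnLow and finite bounds.
    static double uniformOverlap(double filterLow, double filterHigh, double columnLow, double columnHigh)
    {
        double intersectLow = Math.max(filterLow, columnLow);
        double intersectHigh = Math.min(filterHigh, columnHigh);
        if (intersectHigh < intersectLow) {
            return 0; // the ranges do not intersect
        }
        return (intersectHigh - intersectLow) / (columnHigh - columnLow);
    }

    // NDV-based estimate: the fraction of the column's distinct values
    // that the filter can match, capped at 1.
    static double ndvOverlap(double filterDistinctValues, double columnDistinctValues)
    {
        return Math.min(filterDistinctValues / columnDistinctValues, 1.0);
    }

    public static void main(String[] args)
    {
        // Range-based estimate for the example: roughly 8.19e-10
        System.out.println(uniformOverlap(1, 4, 1, 3_662_098_119d));
        // NDV-based estimate for the example: 4 / 14, roughly 0.29
        System.out.println(ndvOverlap(4, 14));
    }
}
```

The two estimates differ by about nine orders of magnitude for this column, which is the gap the description attributes to the uniform distribution assumption.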

Solution:
Introduce a density check, density = distinctValues / (high - low), and combine the uniform overlap with an NDV-based estimate when the density is low.
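
A minimal sketch of how such a density check could look, loosely following the diff fragments quoted in the review comments of this pull request. It is not the actual StatisticRange code: the class name `DensityAwareRangeSketch` is hypothetical, the threshold value is an assumption (the real value of DENSITY_HEURISTIC_THRESHOLD is not shown in this conversation), and the sketch only accepts ranges with positive, finite length so that NaN and infinity handling can be left out.

```java
final class DensityAwareRangeSketch
{
    // Assumed cutoff for illustration only; not the value used in the pull request.
    static final double DENSITY_HEURISTIC_THRESHOLD = 1e-3;

    final double low;
    final double high;
    final double distinctValues;

    DensityAwareRangeSketch(double low, double high, double distinctValues)
    {
        // Simplification: only ranges with finite, positive length and positive NDV.
        if (!(low < high) || Double.isInfinite(low) || Double.isInfinite(high) || !(distinctValues > 0)) {
            throw new IllegalArgumentException("sketch supports only finite ranges with low < high and NDV > 0");
        }
        this.low = low;
        this.high = high;
        this.distinctValues = distinctValues;
    }

    double length()
    {
        return high - low;
    }

    // Estimated fraction of this range that overlaps with the other range.
    double overlapPercentWith(DensityAwareRangeSketch other)
    {
        double lengthOfIntersect = Math.min(high, other.high) - Math.max(low, other.low);
        if (lengthOfIntersect < 0) {
            return 0; // disjoint ranges
        }

        // Density = distinct values per unit of range length.
        double thisDensity = distinctValues / length();
        double otherDensity = other.distinctValues / other.length();

        // Sparse range: few distinct values spread over a wide interval, so the
        // uniform assumption is unreliable. Use the NDV ratio instead, capped at 1.
        if (Math.min(thisDensity, otherDensity) < DENSITY_HEURISTIC_THRESHOLD) {
            return Math.min(other.distinctValues / distinctValues, 1.0);
        }

        // Dense range: keep the uniform-distribution estimate.
        return lengthOfIntersect / length();
    }
}
```

With the example from the description, `new DensityAwareRangeSketch(1, 3_662_098_119d, 14).overlapPercentWith(new DensityAwareRangeSketch(1, 4, 4))` takes the low-density branch and returns 4 / 14, while two dense ranges such as [0, 100] with 100 distinct values and [0, 25] with 25 distinct values still get the uniform estimate of 0.25.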

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot

cla-bot bot commented Dec 7, 2025

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Anton Kutuzov.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email email@example.com`
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from e52a475 to 9479bf0 on December 7, 2025 19:07
@cla-bot added the cla-signed label on Dec 7, 2025
@anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from 9479bf0 to bf0ef28 on December 7, 2025 19:44
@wendigo requested a review from findepi on December 9, 2025 12:38

double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
Member

!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)

if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
&& minDensity < DENSITY_HEURISTIC_THRESHOLD) {
return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Member

Can this be `return min(other.distinctValues / this.distinctValues, 1);`?
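
For comparison, the two forms of the return expression can be checked side by side. The sketch below is an illustration only: `minExcludeNaN` is re-implemented locally under the assumption that it returns the non-NaN argument when exactly one argument is NaN, and the class and method names (`ReturnExpressionSketch`, `currentForm`, `suggestedForm`) are hypothetical.

```java
final class ReturnExpressionSketch
{
    private ReturnExpressionSketch() {}

    // Local stand-in, assumed to behave like the helper used in the diff:
    // NaN arguments are ignored when the other argument is a number.
    static double minExcludeNaN(double a, double b)
    {
        if (Double.isNaN(a)) {
            return b;
        }
        if (Double.isNaN(b)) {
            return a;
        }
        return Math.min(a, b);
    }

    // Expression as written in the diff fragment.
    static double currentForm(double thisDistinctValues, double otherDistinctValues)
    {
        return minExcludeNaN(thisDistinctValues, otherDistinctValues) / thisDistinctValues;
    }

    // Expression proposed in the review comment.
    static double suggestedForm(double thisDistinctValues, double otherDistinctValues)
    {
        return Math.min(otherDistinctValues / thisDistinctValues, 1);
    }
}
```

Under that assumption the two forms agree for positive, non-NaN inputs and differ only when other.distinctValues is NaN: the current form yields 1, the suggested form yields NaN. If the surrounding condition already rules out NaN distinct-value counts, the forms are interchangeable in that branch.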

}

if (lengthOfIntersect > 0) {
double thisDensity = this.distinctValues / length();
Member

Please add a code comment explaining this section

double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
Member

Why do we check that the lengths are finite?
I think we want to skip the lengthOfIntersect == length() case here.
