-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Improve overlap percent estimation for low-density ranges in StatisticRange #27570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Improve overlap percent estimation for low-density ranges in StatisticRange #27570
Conversation
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Anton Kutuzov.
|
e52a475 to
9479bf0
Compare
9479bf0 to
bf0ef28
Compare
| double otherDensity = other.distinctValues / other.length(); | ||
| double minDensity = minExcludeNaN(thisDensity, otherDensity); | ||
|
|
||
| if (!isNaN(thisDensity) && !isNaN(otherDensity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)
| if (!isNaN(thisDensity) && !isNaN(otherDensity) | ||
| && isFinite(length()) && isFinite(other.length()) | ||
| && minDensity < DENSITY_HEURISTIC_THRESHOLD) { | ||
| return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be return min(other.distinctValues / this.distinctValues, 1); ?
| } | ||
|
|
||
| if (lengthOfIntersect > 0) { | ||
| double thisDensity = this.distinctValues / length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a code comment explaining this section
| double minDensity = minExcludeNaN(thisDensity, otherDensity); | ||
|
|
||
| if (!isNaN(thisDensity) && !isNaN(otherDensity) | ||
| && isFinite(length()) && isFinite(other.length()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we check that the lengths are finite ?
I think we want to skip lengthOfIntersect == length() case here
Description
Join order can be misestimated due to the uniform distribution assumption in
StatisticRange.overlapPercentWith().When a column has a very wide numeric range but few distinct values
(e.g., distinctValues = 14, low = 1, high = 3.6e9), the current overlap estimation becomes extremely small(e.g., 8.19e-10), underestimating join cardinalities.Example:
table1is large andtable2is small.The column
platform_idintable1has14distinct values, withlow = 1andhigh = 3 662 098 119.In this case, the method
StatisticRange.overlapPercentWith()estimates the overlap as(4 - 1) / (3,662,098,119 - 1) ≈ 8.19e-10which effectively means “all rows are filtered out”.
But in reality, the filter
IN (1,2,3,4)should keep roughly4out of14values(~29%).Solution:
Introduce a density check
density = distinctValues / (high - low)and combine uniform overlap with NDV-based estimate when density is low.Additional context and related issues
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: