The current statistics histogram of a multi-column index actually serves as a single column index · pingcap/tidb#22589


Labels: component/statistics, help wanted, sig/planner, type/enhancement

Description

Development Task

If you go through the BetweenRowCount of the Histogram, you'll see that TiDB uses calcFraction4Datums to calculate the ratio (given range) / (total bucket).
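To make the (given range) / (total bucket) idea concrete, here is a minimal sketch of a linear interpolation inside one bucket. It is not TiDB's code: the `fraction` helper, its clamping rules, and the numbers in `main` are assumptions made up for illustration, and the bucket bounds are assumed to be already converted to float64 scalars.

```go
package main

import "fmt"

// fraction is a hypothetical helper (not TiDB's calcFraction4Datums).
// It estimates which share of the bucket [lower, upper] lies below value,
// assuming values are spread uniformly inside the bucket.
func fraction(lower, upper, value float64) float64 {
	if upper <= lower {
		// Degenerate bucket: no information, so guess the middle.
		return 0.5
	}
	if value <= lower {
		return 0
	}
	if value >= upper {
		return 1
	}
	return (value - lower) / (upper - lower)
}

func main() {
	// Assumed bucket: bounds 0 and 100, holding 1000 rows.
	const rowsInBucket = 1000.0
	// Estimated rows in the range [25, 75) inside that bucket.
	est := (fraction(0, 100, 75) - fraction(0, 100, 25)) * rowsInBucket
	fmt.Println(est)
}
```

The estimate is only as good as the scalars passed in, which is why the conversion of the bucket bounds to scalars matters so much for index histograms.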

But if you look deeper into calcFraction4Datums, you'll see that we use convertDatumToScalar(value *types.Datum, commonPrefixLen int) float64 to calculate that value for the STRING/[]BYTE types. The index's histogram goes through this path, since a bucket bound is the encoded value of the index's column values.

You'll notice that, for a large dataset, it is very unlikely that the lower bound and the upper bound of one bucket have the same value in the first column. => In most cases, the two bounds of a bucket already differ in the first index column.

However, convertDatumToScalar only considers the first 8 bytes after the common prefix length (which is 0 in most cases). So the index's histogram can only deal with the case where the range on the first column is a point. Take a two-column index (a, b) and the filter a = 1 and b >= 2 and b <= 100. In most cases convertDatumToScalar can only see the a = 1 part, so it treats a = 1 and b >= 2 and b <= 100 the same as a = 1 and produces a wrong estimate.
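The truncation can be reproduced with a small self-contained model. This is a sketch under stated assumptions, not TiDB's implementation: it assumes each integer column is encoded as one flag byte followed by 8 big-endian bytes with the sign bit flipped, and the helpers `encodeKey`, `commonPrefixLen` and `scalarAfterPrefix` are hypothetical names invented for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeKey builds a simplified, order-preserving index key (assumed layout):
// per column, one flag byte plus 8 big-endian bytes with the sign bit flipped.
func encodeKey(vals ...int64) []byte {
	var key []byte
	for _, v := range vals {
		var buf [8]byte
		binary.BigEndian.PutUint64(buf[:], uint64(v)^(1<<63))
		key = append(key, 0x03)
		key = append(key, buf[:]...)
	}
	return key
}

// commonPrefixLen returns how many leading bytes a and b share.
func commonPrefixLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// scalarAfterPrefix reads at most 8 bytes after the common prefix,
// zero-padded, as a big-endian integer. Everything after those 8 bytes
// is ignored, which is the behavior described above.
func scalarAfterPrefix(key []byte, prefixLen int) float64 {
	var buf [8]byte
	copy(buf[:], key[prefixLen:])
	return float64(binary.BigEndian.Uint64(buf[:]))
}

func main() {
	// Assumed bucket bounds that differ in the first column.
	lower, upper := encodeKey(0, 0), encodeKey(1000, 0)
	p := commonPrefixLen(lower, upper)

	// Both ends of the filter a = 1 and b >= 2 and b <= 100.
	lo := scalarAfterPrefix(encodeKey(1, 2), p)
	hi := scalarAfterPrefix(encodeKey(1, 100), p)
	fmt.Println(lo == hi) // the b range collapses to a single point
}
```

In this model the 8-byte window ends before the low-order bytes of b, so b = 2 and b = 100 map to the same scalar and the range on b contributes nothing to the fraction.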

In the current TiDB codebase we handle this case in another way. Still, it shows that the index's histogram is currently useless in most cases (it just serves as a single column's histogram with enough precision).

The structure needs to be improved.

Contributor guide

Research direction: Investigate the `convertDatumToScalar` function and the histogram bucket bound encoding. Understand how the current implementation only uses the first 8 bytes after the common prefix, which makes multi-column index histograms behave like single-column histograms. Explore ways to make the histogram represent multi-column index ranges accurately, possibly by considering all columns in the bound.
Tech stack: Go
Domain: database
Issue type: Refactor
Difficulty: 4
Estimated time: Over 1 week
Activity status: Active
Clarity: Clear
Prerequisites: Go, Statistics
Newbie friendliness: 30
