The current statistics histogram of a multi-column index actually serves as a single column index
#22,589 opened on Jan 28, 2021
Description
Development Task
If you go through BetweenRowCount of the Histogram, you'll see that TiDB uses calcFraction4Datums to calculate the fraction (given range) / (total bucket).
But if you look deeper into calcFraction4Datums, you'll see that we use convertDatumToScalar(value *types.Datum, commonPrefixLen int) float64 to calculate that value for the STRING/[]BYTE types. The index's histogram goes through this path, since each bucket bound is the encoded value of the index's column values.
Note that for a large dataset, the lower bound and the upper bound of a bucket rarely have the same value in the first column. In other words, in most cases no bucket has bounds that agree on the first index column.
However, convertDatumToScalar only considers the first 8 bytes after the common prefix length (which is 0 in most cases). So the index's histogram can only deal with the case where the range on the first column is a point. Suppose there's a two-column index (a, b) and the filter is a = 1 and b >= 2 and b <= 100. In most cases convertDatumToScalar can only account for a = 1, so it treats a = 1 and b >= 2 and b <= 100 the same as a = 1 and then estimates a wrong value.
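To make the effect concrete, here is a minimal standalone sketch, not TiDB's actual code. It assumes a simplified key encoding (each column as 8 big-endian bytes, with no flag bytes) and hypothetical helpers encodeKey, commonPrefixLen, bytesToScalar, and fraction that only mimic the idea described above: take the first 8 bytes after the common prefix of the bucket bounds and interpolate. The bucket bounds are chosen so that they share no common prefix, which is the case this issue is about.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeKey is a hypothetical, simplified index-key encoding:
// every column becomes 8 big-endian bytes, concatenated in order.
func encodeKey(cols ...uint64) []byte {
	key := make([]byte, 0, 8*len(cols))
	for _, c := range cols {
		var b [8]byte
		binary.BigEndian.PutUint64(b[:], c)
		key = append(key, b[:]...)
	}
	return key
}

// commonPrefixLen returns how many leading bytes a and b share.
func commonPrefixLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// bytesToScalar looks only at the first 8 bytes after the common prefix,
// zero-padding if fewer remain. Everything after those 8 bytes is ignored.
func bytesToScalar(key []byte, prefixLen int) float64 {
	var buf [8]byte
	if prefixLen < len(key) {
		copy(buf[:], key[prefixLen:])
	}
	return float64(binary.BigEndian.Uint64(buf[:]))
}

// fraction linearly interpolates value between the two bucket bounds.
func fraction(lower, upper, value []byte) float64 {
	p := commonPrefixLen(lower, upper)
	lo, hi := bytesToScalar(lower, p), bytesToScalar(upper, p)
	v := bytesToScalar(value, p)
	switch {
	case hi <= lo:
		return 0.5
	case v <= lo:
		return 0
	case v >= hi:
		return 1
	}
	return (v - lo) / (hi - lo)
}

func main() {
	// One bucket of an index on (a, b); the bounds differ in their first byte.
	lower := encodeKey(0, 0)
	upper := encodeKey(1<<60, 0)

	// Same a, different b: both keys map to the same position in the bucket.
	fmt.Println(fraction(lower, upper, encodeKey(1<<59, 2)))
	fmt.Println(fraction(lower, upper, encodeKey(1<<59, 100)))

	// A different a does move the position.
	fmt.Println(fraction(lower, upper, encodeKey(1<<58, 2)))
}
```

Under these assumptions, the two keys that differ only in b get the same fraction, so a range on b contributes nothing to the interpolation, while a change in a does.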
TiDB's current codebase handles this case in another way. Still, this shows that the index histogram is currently useless in most cases (it just serves as a single column's histogram with enough precision).
The structure needs to be improved.