pingcap/tidb

Consider using a disk-based hash table for hash join to avoid OOM

Open

#11607, created on August 5, 2019

(1 comment) (0 reactions) (1 assignee) Go (40,090 stars) (6,186 forks)
Labels: epic/memory-management, help wanted, sig/execution, type/enhancement

Description

Feature Request

Is your feature request related to a problem? Please describe:

Consider using a disk-based hash table for hash join to avoid OOM.

HashJoinExecutor uses a hash table that maps join keys to inner table rows.

TiDB's hash join is implemented with innerResult and mvmap.MVMap. innerResult stores all the rows of the inner table, and mvmap.MVMap stores the mapping (join key, inner table row pointer). Together, these two structures map join keys to inner table rows. When the inner table is particularly large, innerResult takes up a lot of memory; when the join keys are particularly large, mvmap.MVMap also takes up a lot of memory. Either case can lead to OOM.
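To make the two-structure layout concrete, here is a minimal sketch, not TiDB's actual code. The names `row`, `hashTable`, `build`, and `probe` are hypothetical; the slice stands in for innerResult and the map stands in for mvmap.MVMap, holding row positions rather than real row pointers.

```go
package main

import "fmt"

// row is a simplified inner-table row (hypothetical; real rows hold typed columns).
type row struct {
	key   string
	value string
}

// hashTable pairs a row store with a key-to-position index.
type hashTable struct {
	rows  []row            // plays the role of innerResult: every inner row
	index map[string][]int // plays the role of mvmap.MVMap: join key -> row positions
}

func newHashTable() *hashTable {
	return &hashTable{index: make(map[string][]int)}
}

// build appends an inner row and records its position under its join key.
func (h *hashTable) build(r row) {
	h.index[r.key] = append(h.index[r.key], len(h.rows))
	h.rows = append(h.rows, r)
}

// probe returns every inner row whose join key equals key.
func (h *hashTable) probe(key string) []row {
	var out []row
	for _, pos := range h.index[key] {
		out = append(out, h.rows[pos])
	}
	return out
}

func main() {
	h := newHashTable()
	h.build(row{"a", "x"})
	h.build(row{"b", "y"})
	h.build(row{"a", "z"})
	// Two inner rows share key "a"; no row has key "c".
	fmt.Println(len(h.probe("a")), len(h.probe("c"))) // prints: 2 0
}
```

In this sketch both `rows` and `index` grow with the inner table, which is why either one can exhaust memory on its own.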

Describe the feature you'd like:

  1. We already have a config item, mem-quota-query, which sets the memory quota for a query in bytes.
  2. Introduce a new config item, oom-use-tmp-storage, which defaults to true. When it is true, some executors (in this issue, hash join) may use temporary disk storage once mem-quota-query is exceeded.
  3. Show the disk usage of an executor in explain analyze.
  4. Show the disk usage of a query in SELECT * FROM information_schema.processlist;
  5. Consider disk usage in the cost model.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

tasks:

  1. Improvement of mvmap.MVMap
  • hash join #11832
  • index join
  • performance and code cleanup #11937
  2. Disk-based innerResult
  • hash join
    • utilities: #12116
    • implement disk-based hash join: #12067
  • index join
  3. Cost model, explain analyze, and disk usage control
  • change the cost model of a hash join if it will be spilled #13246
  • show disk usage information in explain analyze #12625

Some tiny issues

  • [For new contributor] Show disk usage of a query in SELECT * FROM information_schema.processlist; #13931
  • [For new contributor] Show disk usage of a query in slow query and statement summary #16883
  • add metrics for disk usage of a query #17263
  • [For new contributor] change the default value of mem-quota-query #12937
  • [For new contributor] temporary storage usage limitation of all queries #13983
  • [For new contributor] Define temporary storage in config file #13982
  • [help wanted] multiple instances of tidb-server may use the same temporary directory #13981
