prometheus/prometheus

WAL Replay not using more memory than before the restart

Open

#16,942 created on July 29, 2025

(8 comments) (0 reactions) (0 assignees) Go (64,042 stars) (10,408 forks)
component/wal, help wanted, not-as-easy-as-it-looks

Description

This is to track some ideas from https://github.com/prometheus/prometheus/issues/6934

The current algorithm tries to replay the WAL as fast as possible after a restart, which can use more memory than Prometheus was using before the restart.

This can be a problem when Prometheus is already under pressure (tons of metrics and a low memory limit) and some operation, such as an expensive query or API call, is OOM-ing it. Recovery is then impossible because startup uses even more memory, so the WAL has to be removed manually.

For other OOMs, where no specific query or API call is the cause and memory use is simply high because too many series are scraped, this feature (improving startup memory use) will not help on its own. It might, however, unlock other options, such as compacting/truncating on start to move the bulk of the load into TSDB blocks for further debugging and work.

Please use this issue if you have thoughts on replay memory consumption alone. To discuss OOM detection ideas and general OOM handling or safeguards, let's use https://github.com/prometheus/prometheus/issues/13939. For general unexpected OOMs, where Prometheus clearly uses an unexpected amount of memory given the scraped/ingested load you put through it, please open a separate issue.

Acceptance Criteria

  • A mode (or the default, if fast enough) in which Prometheus startup does not use more memory than "normal" use.

Ideas
