jaegertracing/jaeger

Jaeger Query can OOM when retrieving large traces

Open

#1051 opened on September 5, 2018

(38 comments) (1 reaction) (0 assignees) Go (18,974 stars) (2,326 forks)
Labels: area/storage, bug, help wanted, performance

Description

Requirement - what kind of business use case are you trying to solve?

  • Retrieve large traces
  • Be resilient against bad instrumentation that uses the same traceID for all traces

Problem - what in Jaeger blocks you from solving the requirement?

Jaeger Query OOMs on retrieval of large traces on Cassandra. If someone is crafty, they can easily create a trace with millions of spans, and attempt to retrieve it to systematically bring down all jaeger-query instances.

Proposed Solution - Cassandra

We might do some combination of the following:

  • Trace retrieval limits: Test that the number of spans per trace is less than a user defined threshold before retrieving spans.
  • Protect against large spans submitted on the HTTP POST endpoints by setting a user defined span size limit.
  • Limit number of concurrent requests served by the HTTP GET handler so that we can accurately predict and bound worst case memory utilization.
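To make the three ideas above concrete, here is a minimal sketch in Go using only the standard library. It is not Jaeger's code: `checkSpanCount`, `ErrTraceTooLarge`, `limiter`, and `limitBody` are hypothetical names, and how the span count is obtained from Cassandra before the full read is left open. Note also that `http.MaxBytesReader` caps the whole POST body, which only approximates a per-span size limit.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrTraceTooLarge is a hypothetical sentinel error, not part of Jaeger.
var ErrTraceTooLarge = errors.New("trace exceeds configured span limit")

// checkSpanCount rejects a trace whose span count exceeds maxSpans.
// Assumption: spanCount comes from a cheap count lookup done before
// any spans are loaded into memory.
func checkSpanCount(spanCount, maxSpans int) error {
	if spanCount > maxSpans {
		return fmt.Errorf("%w: %d spans, limit %d", ErrTraceTooLarge, spanCount, maxSpans)
	}
	return nil
}

// limiter bounds in-flight requests with a buffered channel used as a semaphore.
type limiter struct{ slots chan struct{} }

func newLimiter(maxInFlight int) *limiter {
	return &limiter{slots: make(chan struct{}, maxInFlight)}
}

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously taken with tryAcquire.
func (l *limiter) release() { <-l.slots }

// wrap answers 503 when no slot is free, so worst-case memory is roughly
// maxInFlight times the largest trace allowed by checkSpanCount.
func (l *limiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.tryAcquire() {
			http.Error(w, "too many concurrent trace requests", http.StatusServiceUnavailable)
			return
		}
		defer l.release()
		next.ServeHTTP(w, r)
	})
}

// limitBody caps the request body size for POST endpoints.
// Reads beyond maxBytes fail inside the wrapped handler.
func limitBody(maxBytes int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxBytes)
		next.ServeHTTP(w, r)
	})
}

func main() {
	lim := newLimiter(1)
	fmt.Println("first request admitted:", lim.tryAcquire())
	fmt.Println("second request admitted:", lim.tryAcquire())
	lim.release()
	fmt.Println("oversized trace error:", checkSpanCount(2000000, 100000))
}
```

Rejecting with 503 instead of queueing keeps the bound strict: a queued request still holds a connection and may hold buffers, whereas a rejected one costs almost nothing. Whether clients should retry is a separate decision.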

Any open questions to address

  • Does this affect ES as well?
