jaegertracing/jaeger

Jaeger Query can OOM when retrieving large traces

Open

#1051 opened on Sep 5, 2018

Labels: area/storage, bug, help wanted, performance

Description

Requirement - what kind of business use case are you trying to solve?

  • Retrieve large traces
  • Be resilient against bad instrumentation that uses the same traceID for all traces

Problem - what in Jaeger blocks you from solving the requirement?

Jaeger Query runs out of memory (OOMs) when retrieving large traces from Cassandra. Anyone who wants to can easily create a trace with millions of spans and then request it, systematically bringing down all jaeger-query instances.

Proposed Solution - Cassandra

We might do some combination of the following:

  • Trace retrieval limits: Check that the number of spans in a trace is below a user-defined threshold before retrieving the spans.
  • Protect against large spans submitted to the HTTP POST endpoints by setting a user-defined span size limit.
  • Limit the number of concurrent requests served by the HTTP GET handler so that we can accurately predict and bound worst-case memory utilization.
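
To make the three ideas above concrete, here is a rough sketch in Go using only the standard library. It is not taken from the Jaeger codebase: `ErrTraceTooLarge`, `checkSpanCount`, `limitBody`, and `limitConcurrency` are hypothetical names, the span count is assumed to come from some cheap storage-side count query, and the POST guard caps the whole request body as a stand-in for a per-span size limit.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrTraceTooLarge is a hypothetical sentinel error, not an existing Jaeger error.
var ErrTraceTooLarge = errors.New("trace exceeds configured span limit")

// checkSpanCount rejects a trace before any spans are loaded into memory.
// spanCount is assumed to come from a cheap count query against storage.
// A maxSpans of 0 or less disables the check.
func checkSpanCount(spanCount, maxSpans int) error {
	if maxSpans > 0 && spanCount > maxSpans {
		return fmt.Errorf("%w: %d spans, limit %d", ErrTraceTooLarge, spanCount, maxSpans)
	}
	return nil
}

// limitBody caps the request body size on the POST path. Reads past
// maxBytes fail, so an oversized submission is never fully buffered.
// This bounds the whole body, which only approximates a per-span limit.
func limitBody(maxBytes int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxBytes)
		next.ServeHTTP(w, r)
	})
}

// limitConcurrency admits at most n requests at a time on the GET path
// and answers 503 immediately when all slots are taken.
func limitConcurrency(n int, next http.Handler) http.Handler {
	sem := make(chan struct{}, n) // buffered channel used as a counting semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }() // release the slot when the request finishes
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "too many concurrent trace requests", http.StatusServiceUnavailable)
		}
	})
}
```

A trace handler could be wrapped as `limitConcurrency(8, handler)` and call `checkSpanCount` before reading spans. With both in place, worst-case memory for trace retrieval is roughly the concurrency limit times the span limit times the largest span size, which is the kind of bound the third bullet asks for. Rejecting with 503 rather than queueing is a design choice here; queueing would need its own bound.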

Any open questions to address

  • Does this affect Elasticsearch (ES) as well?
