jaegertracing/jaeger
Jaeger Query can OOM when retrieving large traces
Open
#1051 opened on Sep 5, 2018
area/storage, bug, help wanted, performance
Description
Requirement - what kind of business use case are you trying to solve?
- Retrieve large traces
- Be resilient against bad instrumentation that reuses the same traceID for all traces
Problem - what in Jaeger blocks you from solving the requirement?
Jaeger Query runs out of memory (OOM) when retrieving large traces from Cassandra. A malicious user could easily create a trace with millions of spans and then request it repeatedly to systematically bring down all jaeger-query instances.
Proposed Solution - Cassandra
We might do some combination of the following:
- Trace retrieval limits: check that the number of spans in a trace is below a user-defined threshold before retrieving its spans.
- Protect against large spans submitted to the HTTP POST endpoints by enforcing a user-defined span size limit.
- Limit the number of concurrent requests served by the HTTP GET handler so that worst-case memory utilization can be predicted and bounded.
Any open questions to address
- Does this affect Elasticsearch (ES) as well?