[Feature]: Augmenting Trace Timeline visualization with Aggregate System Statistics for faster Root Cause Analysis
#6814 opened on Mar 5, 2025
Description
Requirement
As a user of Jaeger, given a trace that violates a latency SLO (e.g. a 95th percentile latency target), I would like to quickly narrow down the spans/operations that are the likely sources of the violation.
Problem
Currently, two issues prevent Jaeger from accomplishing this task:
- The Trace Timeline view gives the user no visual cue as to which operations or spans could be the cause of an SLO violation.
- Jaeger does not currently calculate where each span's latency falls within the overall latency distribution of its operation.
Both of these problems increase the manual effort developers must spend on Root Cause Analysis of SLO-violating traces.
Proposal
Inspired by TraVista, @stolet and I propose augmenting the Trace Timeline with span-level aggregate behavior.
First, we propose augmenting the Service & Operation pane to also include the percentile ranking for each operation right next to the operation name of the span. Percentile rankings that are higher than a user-defined threshold (e.g. 95) would be highlighted in red to visually indicate the anomalous nature of the request. Here is a mockup of what a percentile-enabled pane could look like:
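To make the percentile ranking concrete, here is a minimal Python sketch of how a span's rank within its operation's historical latencies could be computed and compared against the user-defined threshold. The function names, the microsecond unit, and the sample data are assumptions made for illustration only; none of this is existing Jaeger code.

```python
from bisect import bisect_right


def percentile_rank(sorted_durations, duration):
    """Return the percentage of historical durations that are <= duration.

    sorted_durations must be sorted in ascending order.
    """
    if not sorted_durations:
        raise ValueError("no historical durations for this operation")
    return 100.0 * bisect_right(sorted_durations, duration) / len(sorted_durations)


def is_anomalous(rank, threshold=95.0):
    """True when the rank exceeds the user-defined threshold (assumed default: 95)."""
    return rank > threshold


# Hypothetical history for one operation, in microseconds.
history_us = sorted([120, 135, 140, 150, 155, 160, 170, 180, 200, 950])

for span_duration in (210, 950):
    rank = percentile_rank(history_us, span_duration)
    print(span_duration, rank, is_anomalous(rank))
```

A span flagged by is_anomalous would be the one rendered in red next to its operation name.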
Second, clicking an anomalous span would open a new latency distribution graph for that operation, so that users can easily see where the span's latency lies within the operation's entire distribution. The graph is a histogram whose log-scale y-axis shows the number of requests that fall into each latency bin. The bin containing the span's latency is highlighted to provide a better visual cue to the user. Here is a mockup of what the latency distribution visualization could look like:
To support these visualizations, the traces collected by Jaeger would need to be further processed to extract the necessary statistics.
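The data behind such a graph could be prepared roughly as follows. This is only a sketch: the equal-width bins, the bin count, and the use of log10(count + 1) so that empty bins get a height of zero are illustrative assumptions, not decisions made in this proposal.

```python
import math


def bin_index(value, lo, width, num_bins):
    """Index of the bin containing value, clamped to the valid range."""
    idx = int((value - lo) // width)
    return max(0, min(idx, num_bins - 1))


def build_histogram(durations, num_bins=10):
    """Count durations into num_bins equal-width bins between min and max."""
    lo, hi = min(durations), max(durations)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * num_bins
    for d in durations:
        counts[bin_index(d, lo, width, num_bins)] += 1
    return lo, width, counts


def log_height(count):
    """Bar height on a log-scale y-axis; the +1 keeps empty bins at zero."""
    return math.log10(count + 1)


# Hypothetical durations for one operation, in microseconds.
durations_us = [100, 110, 120, 130, 900]
lo, width, counts = build_histogram(durations_us, num_bins=4)
highlighted = bin_index(900, lo, width, len(counts))  # bin to highlight for a 900 us span
heights = [log_height(c) for c in counts]
```

The log scale matters because the tail bins, which are the interesting ones for SLO violations, typically hold far fewer requests than the bins around the median.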
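As a rough illustration of that processing step, the sketch below groups span durations by service and operation so that per-operation statistics can be derived from them. The dictionary keys (service, operation, duration_us) are hypothetical and do not reflect Jaeger's actual span model.

```python
from collections import defaultdict


def durations_by_operation(spans):
    """Group span durations by (service, operation) and sort each group."""
    grouped = defaultdict(list)
    for span in spans:
        key = (span["service"], span["operation"])
        grouped[key].append(span["duration_us"])
    return {key: sorted(values) for key, values in grouped.items()}


# Hypothetical spans taken from several collected traces.
spans = [
    {"service": "frontend", "operation": "GET /cart", "duration_us": 300},
    {"service": "frontend", "operation": "GET /cart", "duration_us": 120},
    {"service": "db", "operation": "SELECT", "duration_us": 40},
]
stats_input = durations_by_operation(spans)
```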
Open questions
The key open question is how to efficiently calculate the operation-level statistics needed to enable this use case.
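One possible direction, offered here only as a starting point for discussion, is to keep a fixed set of latency buckets per operation and update the counts as spans arrive. This avoids storing every duration, at the cost of an approximate percentile rank. The class name, the bucket boundaries, and the choice to count the span's whole bucket (which yields an upper estimate) are all assumptions of this sketch.

```python
from bisect import bisect_left


class BucketedLatencyStats:
    """Streaming per-operation latency counts over fixed bucket upper bounds."""

    def __init__(self, upper_bounds_us):
        self.upper_bounds = sorted(upper_bounds_us)
        # One count per bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def _bucket(self, duration_us):
        # Index of the first upper bound that is >= duration_us.
        return bisect_left(self.upper_bounds, duration_us)

    def observe(self, duration_us):
        """Record one span duration in constant memory."""
        self.counts[self._bucket(duration_us)] += 1
        self.total += 1

    def approx_percentile_rank(self, duration_us):
        """Upper estimate: share of observations in buckets up to the span's own."""
        if self.total == 0:
            raise ValueError("no observations recorded")
        upto = sum(self.counts[: self._bucket(duration_us) + 1])
        return 100.0 * upto / self.total


# Hypothetical bucket boundaries in microseconds.
stats = BucketedLatencyStats([100, 200, 400, 800])
for d in (50, 150, 150, 300, 1000):
    stats.observe(d)
rank = stats.approx_percentile_rank(250)
```

The same bucket counts could also feed the latency distribution graph directly, so a single structure might serve both visualizations. The accuracy of the rank depends entirely on how finely the buckets are spaced near the tail.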