Streamlining eBPF Performance Optimization
Netflix recently announced the release of bpftop.
bpftop provides a dynamic real-time view of running eBPF programs. It displays the average runtime, events per second, and estimated total CPU % for each program.
The tool keeps overhead low by enabling the collection of performance statistics only while it is running.

Without bpftop, optimization work means computing these figures by hand, which needlessly complicates the process.
With bpftop, you can establish a baseline, make improvements, and verify that they actually helped, all without the extra hassle. (Source)
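To make the "manual computations" concrete, here is a minimal sketch of how the three figures could be derived from two readings of a program's cumulative counters. It assumes the kernel's per-program `run_time_ns` and `run_cnt` counters (available when BPF statistics are enabled). The `ProgSnapshot` container and `summarize` function are hypothetical names for illustration, not bpftop's actual code.

```python
from dataclasses import dataclass


@dataclass
class ProgSnapshot:
    """One reading of a program's cumulative counters (hypothetical container)."""
    run_time_ns: int  # total nanoseconds the program has spent running so far
    run_cnt: int      # total number of times the program has run so far


def summarize(prev, curr, interval_s):
    """Derive per-interval statistics from two cumulative snapshots."""
    d_time = curr.run_time_ns - prev.run_time_ns
    d_cnt = curr.run_cnt - prev.run_cnt
    return {
        # Average time per invocation during the interval.
        "avg_runtime_ns": d_time / d_cnt if d_cnt else 0.0,
        # Invocations per second during the interval.
        "events_per_sec": d_cnt / interval_s,
        # Share of one CPU's time spent in the program, as a percentage.
        "cpu_percent": d_time / (interval_s * 1e9) * 100,
    }


# Two readings taken one second apart (illustrative numbers only).
before = ProgSnapshot(run_time_ns=1_000_000, run_cnt=100)
after = ProgSnapshot(run_time_ns=3_000_000, run_cnt=600)
stats = summarize(before, after, interval_s=1.0)
```

Repeating this for every loaded program, on every refresh, is the bookkeeping a live view removes.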
Improved Alerting with Atlas Streaming Eval
Netflix moved its alerting infrastructure from traditional polling-based evaluation to real-time streaming evaluation.
This transition was prompted by scalability issues when the number of configured alerts dramatically increased, causing delays in alert notifications.

By leveraging streaming evaluation, Netflix overcame the limitations of its time-series database, Atlas, and improved scalability while maintaining reliability.
Key outcomes include accommodating a 20X increase in query volume, relaxing restrictions on high cardinality queries, and enhancing application health monitoring with correlations between SLI metrics and custom metrics derived from log data.
This shift opens doors to more actionable alerts and advanced observability capabilities, though it requires overcoming challenges in debugging and aligning the streaming path with database queries.
Overall, the transition showcases a significant advancement in Netflix’s observability infrastructure. (Source)
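The difference between polling and streaming evaluation can be illustrated with a small sketch. Instead of periodically querying a database for each alert, the rule keeps a little state and is re-evaluated as each datapoint arrives. The `StreamingAlert` class and its "rolling mean above threshold" rule are assumptions made for illustration. They are not the Atlas expression language or Netflix's implementation.

```python
from collections import deque


class StreamingAlert:
    """Evaluates a threshold rule on each datapoint as it arrives (hypothetical rule shape)."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.values = deque(maxlen=window)  # keeps only the last `window` datapoints

    def on_datapoint(self, value):
        """Record one datapoint and return True if the alert is firing."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet to fill the window
        return sum(self.values) / len(self.values) > self.threshold


# Feed datapoints one at a time, as a stream consumer would.
alert = StreamingAlert(threshold=100.0, window=3)
states = [alert.on_datapoint(v) for v in [50, 60, 70, 150, 200]]
```

Because the evaluator holds only a bounded window per rule, adding more alerts does not add query load on the time-series database in this model.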
Building Netflix’s Distributed Tracing Infrastructure
Netflix built Edgar, a distributed tracing infrastructure aimed at making troubleshooting of its streaming services more efficient.
Prior to Edgar, engineers at Netflix faced challenges in understanding and resolving streaming failures due to the lack of context provided by traditional troubleshooting methods involving metadata and logs from various microservices.
Edgar addresses this by providing comprehensive distributed tracing capabilities, allowing for the reconstruction of streaming sessions through the identification of session IDs.
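As a rough illustration of session reconstruction, the sketch below groups trace spans by a session identifier and orders each group by start time. The span fields (`session_id`, `service`, `start_ts`) and the function name are hypothetical. The source does not describe Edgar's actual data model.

```python
from collections import defaultdict


def group_by_session(spans):
    """Group spans by session ID and order each session's spans by start time."""
    sessions = defaultdict(list)
    for span in spans:
        sessions[span["session_id"]].append(span)
    for session_spans in sessions.values():
        session_spans.sort(key=lambda s: s["start_ts"])
    return dict(sessions)


# Spans arriving out of order from different microservices (illustrative data).
spans = [
    {"session_id": "a", "service": "playback", "start_ts": 2},
    {"session_id": "b", "service": "license", "start_ts": 1},
    {"session_id": "a", "service": "edge", "start_ts": 1},
]
sessions = group_by_session(spans)
```

Each entry in `sessions` then reads as a timeline of what happened to one streaming session across services.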

Leveraging Open-Zipkin for tracing and Mantis for stream processing, Edgar enables the collection, processing, and storage of traces from diverse microservices.
Key components of Edgar include trace instrumentation for context propagation, stream processing for data sampling, and storage optimization for cost-effective data retention.

Through a hybrid head-based sampling approach and storage optimization strategies such as utilizing cheaper storage options and employing better compression techniques, Edgar optimizes resource utilization while ensuring efficient troubleshooting.
Additionally, Edgar’s trace data serves multiple use cases beyond troubleshooting, including application health monitoring, resiliency engineering, regional evacuation planning, and infrastructure cost estimation for A/B testing.
In essence, Edgar significantly improves engineering productivity by providing a streamlined and effective method for troubleshooting streaming failures at scale. (Source)
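One plausible reading of a hybrid head-based sampling approach is sketched below. The decision is made once, at the start of a request. Requests matching a configured set of attributes are always traced, and everything else is sampled at random. The function name, the attribute keys, and the policy shape are assumptions for illustration, since the summary above does not spell out the exact policy.

```python
import random


def should_sample(request_attrs, always_sample, sample_rate, rng=random):
    """Decide at the start of a request whether to record its trace.

    always_sample maps an attribute name to the set of values that must
    always be traced (hypothetical policy shape).
    """
    for key, wanted in always_sample.items():
        if request_attrs.get(key) in wanted:
            return True  # targeted request: always record
    # Everything else falls back to plain random sampling.
    return rng.random() < sample_rate


# Always trace one device type under investigation, sample 1% of the rest.
policy = {"device_type": {"tv"}}
traced = should_sample({"device_type": "tv"}, policy, sample_rate=0.01)
```

Making the decision at the head of the request means every downstream service can honor the same choice, so a sampled trace is recorded in full rather than in fragments.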
As engineering systems increasingly embed machine learning into decision-making, AI observability extends these practices further by helping teams monitor model behavior, data drift, and prediction reliability across production systems.