ShadowSync: Latency Long Tail caused by Hidden Synchronization in Real-time LSM-tree based Stream Processing Systems

Abstract

Mission-critical, real-time, continuous stream processing applications that interact with the real world have stringent latency requirements. For example, e-commerce websites like Amazon improve their marketing strategy by performing real-time advertising based on customers’ behavior, and latency long tail can cause significant revenue loss. Recent work [39] showed a positive correlation between latency long tail and variance in the execution time of synchronous invocation chains (critical paths) in microservices benchmarks. This paper shows that asynchronous, very short but intense resource demands (called millibottlenecks) outside of critical paths can also cause significant latency long tail. Using a trace analysis stream processing application benchmark, we evaluated the impact of asynchronous workload bursts generated by a multi-layer data structure called LSM-tree (log-structured merge-tree) used for continuous checkpointing. Outside of the critical path, the LSM-tree relies on maintenance operations (e.g., flushing and compaction during a checkpoint) to reorganize its data in memory and on disk and keep data access latency short. Although asynchronous, such recurrent maintenance operations can cause frequent millibottlenecks, particularly when they overlap, a problem we call ShadowSync. For scheduling and statistical reasons, significant latency long tail can arise from ShadowSync caused by asynchronous recurrent operations. Our experimental results show that with typical settings of benchmark components such as RocksDB, ShadowSync can prolong request message latency by up to 2 seconds. We show that effective mitigation methods can alleviate both scheduled and statistical ShadowSync, reducing the latency long tail to less than 20% of the original at the 99.9th percentile.

Publication
In Proceedings of the 23rd ACM/IFIP International Middleware Conference (Middleware’22)