
Caching beyond RAM: Riding the cliff - Dormando (February 04, 2019)

Key-value stores are no longer confined to caching small objects over a local network. memcached is now deployed in many different environments, including geographically diverse datacenters, data warehouses, and machine learning training systems. These new workloads have items ranging from kilobytes to megabytes in size, which gives value to stitching devices together or tiering devices of different speeds.

Our previous post introduced Extstore, an extension to memcached which allows moving cache data onto flash storage. Working with Intel and Packet, we tested extstore on high-speed flash SSDs and Optane NVMe devices. These tests left many interesting questions unanswered, however.

In this post we will attempt to answer these questions, running tests under both higher scrutiny and worse conditions than before. memcached can process over 50 million keys per second on a 48-core machine using only RAM and heavy batching. While impressive, real-world scenarios are filled with overhead: small individual requests, syscalls, context switching, and so on. To avoid benchmarketing, we’ll attempt to replicate these nuances when testing across RAM, flash, and Optane devices.
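
For context, “heavy batching” means fetching many keys per round trip rather than one key per request. In the memcached ASCII protocol a single get can name multiple keys; the keys and values below are made up:

    client: get user:1 user:2 user:3
    server: VALUE user:1 0 5
            alice
            VALUE user:2 0 3
            bob
            VALUE user:3 0 5
            carol
            END

The worst-case tests below instead issue one key per get across hundreds of connections, paying the per-request overhead every time.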

Multi-device Experiments

Tail latency allows a minority of slow backend requests to impact a majority of user traffic. While caching systems can improve response times, services with high response cardinality can instead be further impacted: low hit rates mean the time taken to query the cache is simply a time tax on the requests that miss.

Interesting research now exists for shared cache pools behind services with many backends. This covers some but not all use cases, as backends return increasingly large responses that can overwhelm RAM-backed caches. The geographic expansion of applications can also drive up cost, as every service has to be deployed in every location.

To attempt to address these issues by expanding storage space, the code was tested in two configurations. For the benchmark we used “Just a Bunch of Devices” (JBOD): all devices form one large pool of pages, which spreads reads and writes evenly. JBOD support was released in memcached 1.5.10.
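
As a rough sketch of a JBOD startup, extstore is pointed at multiple devices by repeating the ext_path option; the mount points, sizes, and memory limit below are hypothetical:

    memcached -m 4096 -c 1024 \
        -o ext_path=/mnt/nvme0/extstore:500G \
        -o ext_path=/mnt/nvme1/extstore:500G \
        -o ext_path=/mnt/nvme2/extstore:500G

Pages are then allocated across all of the listed files, so reads and writes spread evenly over the devices without further tuning.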

Another approach we did not benchmark was tiered storage. With extstore, stored data are organized into logical buckets. These buckets can be grouped onto specific devices, allowing users to deploy hardware with a mix of small/fast and large/cheap NVMe devices. Even networked block devices could be used.

A draft of this work exists and will be detailed in future posts.

Test Setup

Testing used tag 4 of mc-crusher, under the “test-optane” script.

The test hardware was a server with dual 16-core Intel Xeon CPUs (32 cores total), 192GB of RAM, a 4TB flash SSD, and three 750GB Optane drives.

CPU and memory were bound to NUMA node 0, performing like a single-socket server. This should be close to what users typically deploy, and will allow us to focus our findings.
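
A sketch of that binding using the standard numactl utility (the memcached arguments here are placeholders):

    # run memcached on the CPUs and memory of NUMA node 0 only
    numactl --cpunodebind=0 --membind=0 memcached -m 49152 -c 1024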

Other testing we’ve done focused on throughput with sampled request latency: heavy batching was used over a few TCP connections to allow fetch rates of millions of keys per second. This test instead focuses on a worst-case scenario in both single-device and JBOD configurations. Rather than a few connections blasting pipelined requests, we have hundreds of connections periodically sending small sets of individual GET commands.
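
To make the contrast concrete, here is a sketch of the two traffic shapes as mc-crusher configuration lines. The option names follow mc-crusher's config format, but the specific values, and whether the tagged test-optane script uses exactly these options, are assumptions:

    # batched: a few connections each sending deep pipelines of gets
    send=ascii_get,conns=4,pipelines=32,key_count=10000000,key_prefix=extstore

    # worst case: hundreds of connections sending individual gets, paced with a sleep
    send=ascii_get,conns=400,pipelines=1,key_count=10000000,key_prefix=extstore,usleep=1000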

A RAM-only test serves as a baseline, showing our worst-case configuration without touching extstore at all. It’s important to compare the percentiles and scatter plots of RAM against the drives in question to determine the impact of moving cache to disk.

This lands us in a middle ground, with a bias toward the worst case. Since most users have some level of batching via multi-key fetches, this benchmark should be realistic.

We plot the latency percentiles as the target request rate increases. We also have a point cloud of every sample taken over time. This shows outliers and how consistent the performance is during each test.

It’s important to stress that using percentile averages to measure performance is problematic. We use them in the first graph to identify trends, then complement them with the latency summaries and a scatter plot covering the full duration of the test. Without this, it’s possible for tests to look fine in the line graph while in reality having severe outliers or dropping all traffic for several seconds.

Results

[Graphs: latency percentiles at each target request rate, and a scatter plot of every latency sample over time, for each tested configuration.]

Conclusions

The extra tests shown here establish a worst-case baseline for the performance of a single machine with one or more devices. At request rates around 500,000 per second with under a millisecond of average latency, most workloads will fit comfortably. While expanding disk space works well, further development is needed to improve throughput with multiple high-performance devices.