memcached - a distributed memory object caching system

Caching beyond RAM: the case for NVMe - Dormando (June 12th, 2018)

Caching architectures at every layer of the stack embody an implicit tradeoff between performance and cost. These tradeoffs however are constantly shifting: new inflection points can emerge alongside advances in storage technology, changes in workload patterns, or fluctuations in hardware supply and demand.

In this post we explore the design ramifications of the increasing cost of RAM on caching systems. While RAM has always been expensive, DRAM prices have risen by over 50% in 2017, and high densities of RAM involve multi-socket NUMA machines, bloating power and overall costs. Concurrently, alternative storage technologies such as Flash and Optane continue to improve. They have specialized hardware interfaces, consistent performance, high density, and relatively low costs. While there is increasing economic incentive to explore offloading caching from RAM onto NVMe or NVM devices, the implications for performance are still not widely understood.

We will explore these design implications in the context of Memcached, a distributed, simple, cache-focused key/value store. For a quick overview, see the about page or the story tutorial.

Memcached has a storage system called extstore, which allows keeping a portion of data for "less recently used" keys on disk, freeing up RAM. See the link for a full breakdown of how it works, but in short: keys stay in RAM, while values can be split off to disk. Recent and frequently accessed keys will still have their values in RAM.
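The split described above can be sketched as a toy in-memory model. This is purely illustrative (the `ExtStore` class and its methods are invented for this post, not memcached's actual code): keys and hot values stay in a RAM hash table, while cold values move to an append-only "disk" log, leaving only a small pointer in RAM.

```python
# Toy sketch of extstore's design: keys stay in RAM; cold values are written
# to an append-only disk log and replaced in RAM by a small pointer.
# All names here are illustrative, not memcached internals.

class ExtStore:
    def __init__(self):
        self.ram = {}    # key -> value bytes held fully in RAM
        self.disk = []   # append-only "disk" log of evicted values
        self.index = {}  # key -> offset into the disk log (stays in RAM)

    def set(self, key, value):
        self.ram[key] = value

    def flush_cold(self, key):
        """Move a less recently used value to disk, keeping key + pointer in RAM."""
        self.index[key] = len(self.disk)
        self.disk.append(self.ram.pop(key))

    def get(self, key):
        if key in self.ram:
            return self.ram[key]           # hot path: pure RAM hit
        return self.disk[self.index[key]]  # cold path: one disk read

store = ExtStore()
store.set("user:1", b"x" * 4096)
store.flush_cold("user:1")   # the 4KB value now lives on "disk"
print(store.get("user:1") == b"x" * 4096)
```

The key point the sketch captures: after `flush_cold`, RAM holds only the key and a small offset, while a `get` still succeeds at the cost of one disk read.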

This test was done with the help of Accelerate with Optane, which provided the hardware and guidance. Also thanks to Netflix for their adoption of extstore, with all their great feedback.

Cache RAM Breakdown

[Figure: db to cache breakdown]

For example, a 1TB database could have only 20% of its data "active" over a given time period (say 4 hours). If you want to cache all of the active data, you might need 200G of RAM. Out of that 200G of RAM, only 20% of it might be highly utilized.

[Figure: cache hit rate breakdown]

Of the cache memory used, only 10% of the RAM may be responsible for 90% of all hits to the cache; the remaining 90% of RAM accounts for just 10% of hits.

However, if you cut that 90% of the RAM, your miss rate would at least double, doubling the load on your DB. Depending on how your backend system performs, losing that RAM could double backend costs.
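The sizing argument above can be checked with quick arithmetic. The 1TB/20%/10%/90% figures come from the example; the 95% baseline hit rate is an assumption added here to make the miss-rate math concrete.

```python
# Back-of-the-envelope check of the cache sizing example above.
db_size_gb = 1000
active_fraction = 0.20
cache_gb = db_size_gb * active_fraction   # 200 GB of RAM to hold the active set

hit_rate = 0.95               # assumed baseline hit rate (not from the text)
cold_ram_hit_share = 0.10     # the "other 90%" of RAM serves only 10% of hits

# Dropping that 90% of RAM forfeits its 10% share of hits:
new_miss_rate = (1 - hit_rate) + hit_rate * cold_ram_hit_share
print(cache_gb)                                   # 200.0
print(round(new_miss_rate / (1 - hit_rate), 1))   # 2.9: misses nearly triple
```

Even under this generous assumption, backend load more than doubles, which is why trimming the "cold" 90% of cache RAM is not free.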

[Figure: cache size breakdown]

Breaking down items in memory, we might find that a vast majority (97.2% in the above case) live within only 30% of RAM. A smaller number of large, but still important items take up the other 70%. Even larger items can eat RAM very quickly.

Keep in mind these larger items can be a significant percentage of network utilization. One 8k request takes the same bandwidth as 80+ small ones.

[Figure: extstore breakdown]

How does extstore help? By splitting the values of large, less recently used items from their keys and pointing the in-RAM keys at disk, we can reclaim the majority of less-used RAM. Depending on the use case, there are several ways to spend those savings.

[Figure: even distribution]

Are other workload variations okay? In the above example, cache hits are evenly distributed across items. This theoretical system has an IO limit of 400,000 IOPS, similar to a high-end SSD or Optane drive. In this case RAM cannot be relied on to saturate the network.

At 400,000 IOPS, average item sizes of just 3072 bytes are enough to saturate a 10G NIC; 8192 bytes will saturate 25G. Properly designed clusters need extra headroom for growth, usage spikes, or failures within the pool, so item sizes down to 1024-byte averages might be possible. However, at 1024 bytes (assuming 100 bytes of RAM overhead per key), extstore will only be able to store about 10x as much to disk as would fit in the RAM it uses.
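The NIC-saturation numbers above follow directly from the IOPS limit. A quick check, using the 400,000 IOPS figure and the same 100-byte per-key overhead assumption:

```python
# Verify the saturation math: bytes/sec from the drive's IOPS ceiling,
# expressed as NIC line rate in Gbps.
iops = 400_000

def gbps(bytes_per_sec):
    return bytes_per_sec * 8 / 1e9

print(round(gbps(iops * 3072), 2))   # 9.83 Gbps: enough to fill a 10G NIC
print(round(gbps(iops * 8192), 2))   # 26.21 Gbps: enough for 25G

# At 1024-byte items with ~100 bytes of RAM kept per key, each byte of RAM
# spent on keys only places ~10 bytes on disk:
item_size, overhead = 1024, 100
print(round(item_size / overhead, 1))  # 10.2x disk:RAM ratio
```

This is the core tension: smaller items make better use of the drive's IOPS but erode extstore's RAM savings.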

Careful capacity planning is required: not everyone's workload will be compatible with external storage. Carefully evaluate how your RAM is used in your cache pools before using disk for cache. If you have exclusively small items, short TTLs, or high write rates, RAM will still be cheaper. Evaluate this against an SSD's endurance rating, expressed in "Drive Writes Per Day" (DWPD): if a 1TB device can survive 2TB of writes per 24 hours for 5 years, it has a tolerance of 2 DWPD. Optane has a high tolerance at 30 DWPD, while a high-end flash drive is 3-6 DWPD.
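The DWPD arithmetic is easy to run against your own numbers. The 50 MB/s sustained write rate below is an assumed example, not a measurement from this post:

```python
# DWPD (Drive Writes Per Day): daily writes expressed as multiples of drive capacity.
def dwpd(drive_tb, writes_tb_per_day):
    return writes_tb_per_day / drive_tb

print(dwpd(1.0, 2.0))                        # 2.0 DWPD, matching the 1TB / 2TB-per-day example

write_mb_per_sec = 50                        # assumed sustained write rate to extstore
tb_per_day = write_mb_per_sec * 86400 / 1e6  # 86400 seconds/day, MB -> TB
print(round(dwpd(1.0, tb_per_day), 2))       # 4.32 DWPD: fine for Optane (30), tight for flash (3-6)
```

If the computed figure sits near or above your drive's rating, the device will wear out before its warranty period, and RAM (or a higher-endurance drive) is the safer choice.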

Test Setup

The tests were done on an Intel Xeon machine, sporting 32 cores, 192G of RAM, a 4TB SSD, and 3x Optane 750G drives. Only one Optane drive was used during the test; as of this writing extstore only works with one drive, and this configuration reflects most of its users.

During each test run, the number of mc-crusher clients was varied, as well as the number of extstore IO threads in the server. Too few IO threads won't saturate the device; too many will overload it and cause queueing.

Each test runs for a minute after warm-up.

Latency and Throughput measurements

Since mc-crusher does not time results, two helper scripts were used to generate the result data.

For every test, a full breakdown of the latency samples is provided. Sampling was done at a maximum rate of one per millisecond.
Note: an event loop is not used; when a stack of events completes at the same time, it becomes impossible to distinguish time spent waiting to be processed from the actual request latency.
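The blocking measurement style described in the note can be sketched simply: time one request at a time and drop each sample into a power-of-ten latency bucket (matching the 10us/100us/1ms buckets discussed in the results). `do_request` here is a stand-in for a real memcached get over a socket, not part of the actual test scripts.

```python
# Blocking latency sampler: no event loop, so each measured interval is
# exactly one request's round-trip, bucketed into powers of ten.
import time
from collections import Counter

def do_request():
    time.sleep(0.0002)  # placeholder for a ~200us network round-trip

BUCKETS = [1e-5, 1e-4, 1e-3, 1e-2]  # 10us, 100us, 1ms, 10ms

def bucket(elapsed):
    for b in BUCKETS:
        if elapsed < b:
            return b
    return float("inf")

histogram = Counter()
for _ in range(100):
    start = time.perf_counter()
    do_request()
    histogram[bucket(time.perf_counter() - start)] += 1

print(histogram)  # with a ~200us fake request, samples land in the 1ms bucket
```

Because the client blocks, a sample can never include time spent queued behind other completed events, which is exactly the ambiguity the event-loop approach would introduce.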

The Tests

Three general tests were done:

The Results

Unfortunately there is some wobbling in the graphs; that is due to leaving too little RAM free during the tests. The Optane drive's performance was consistent, while the OS struggled to keep up.

For reference: A pure RAM multiget load test against this exact configuration of memcached (32 threads, etc) results in 18 million keys per second. More contrived benchmarks have gotten a server with many cores up past 50 million keys per second.

With few extstore IO threads the Optane drive comes much closer to saturating the IO limit: with 4 threads and 4 clients, 230k IOPS on Optane versus 40k on the SSD. The latency breakdown shows the SSD typically an order of magnitude higher in wait time, with Optane staying in the 10us bucket while the SSD sits in 100us and slips into 1ms.

With many extstore IO threads the underlying OS becomes saturated, causing wobbles and queueing in the Optane graph. Meanwhile, the SSD continues to benefit from extra thread resources, needed to overcome the extra latency from flash.

For many workloads, both SSD and Optane are completely viable. If a bulk of reads still come from RAM, with extstore used to service the long tail of only large objects, they both keep requests under 1ms in response time.

If you want to push the boundaries of extstore a drive like Optane goes a long way:

[Table: IO thread counts per test type, SSD vs Optane]

Learnings

Conclusions

Workloads which weren't previously possible due to cost are now possible. Most workloads containing mixed data sizes and large pools can see significant cost reductions.


Extstore requires RAM per item on disk. This chart, assuming 100 bytes of overhead per item (key + metadata), visualizes how the RAM overhead falls as item sizes get larger.
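The chart's relationship can be reproduced numerically. The 100-byte overhead figure is the same assumption used throughout the post:

```python
# RAM overhead per disk-resident item: with ~100 bytes of key + metadata
# kept in RAM, the overhead fraction falls sharply as item sizes grow.
OVERHEAD = 100  # assumed bytes of RAM per item stored on disk

for size in (512, 1024, 8192, 65536):
    pct = OVERHEAD / size * 100   # RAM overhead as a % of item size
    print(f"{size:>6}B item: {size / OVERHEAD:6.1f}x disk:RAM, {pct:5.2f}% overhead")
```

At 512 bytes the RAM cost is nearly 20% of the item; by 64KB it is a rounding error, which is why extstore pays off best on workloads with larger values.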

DRAM costs 3-4x more than Optane and 4-8x more than SSD per gigabyte, depending on the drive.

As cache shifts from RAM to Optane (or flash), money spent purely on RAM can drop to 1/3rd.
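A rough cost sketch shows where "1/3rd" comes from. The $/GB price below is a placeholder; only the 3-4x DRAM-to-Optane ratio and the idea of keeping keys plus hot values in RAM come from this post:

```python
# Rough sketch of RAM spend dropping to ~1/3rd when cold values move to Optane.
dram_per_gb = 8.0                   # assumed DRAM price, $/GB (placeholder)
optane_per_gb = dram_per_gb / 3.5   # midpoint of the 3-4x cost ratio

cache_gb = 1000
ram_only_cost = cache_gb * dram_per_gb

# Keep ~10% in RAM (keys plus hot values), push the other 90% to Optane:
hybrid_cost = cache_gb * 0.10 * dram_per_gb + cache_gb * 0.90 * optane_per_gb
print(round(hybrid_cost / ram_only_cost, 2))  # 0.36: roughly a third of the RAM-only spend
```

The exact ratio depends on item sizes (which set the RAM fraction) and current drive pricing, but the shape of the savings holds across the 3-4x range.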

Reducing RAM reduces reliance on multi-socket servers. To reach very high RAM density, NUMA-capable machines are often necessary: multiple sockets, multiple CPUs, with half of the RAM attached to each CPU. Since memcached is highly efficient, once RAM is reduced you can cut not only the RAM but also half of the motherboard, CPU, and even power costs. Cost reductions up to 80% for specific workloads are reasonable.

High speed, low latency SSD opens a new era for database and cache design. We demonstrate high performance numbers for a wide variety of use cases, for both reduction of costs and expansion of cache usage.