memcached - a distributed memory object caching system

The Volatile Benefit of Persistent Memory - Dormando (April 30, 2019)

Non-volatile memory modules and persistent memory are poised to be the next big thing for datacenter computing. With high storage densities and DRAM-like performance, with a DRAM-like interface, entirely new algorithms and systems architectures are possible. This is important for memcached, a high speed key/value cache service which is reliant on DRAM to meet its performance guarantees. This is new technology, what use can we make of it?

In collaboration with Intel, we experimented with Intel® Optane™ DC Persistent Memory Modules (hereby referred to as PMEM) and a minimally modified memcached to answer these questions. This post looks over benchmark results and possibilities for future code architectures for cache systems.

Persistent Memory

Intel® Optane™ DC Persistent Memory Modules are DDR4-compatible memory modules that can remember memory after powering off, without the aid of batteries. It does require compatible motherboards and processors to operate, however.

DDR RAM come in flavors of speed, features, and density. For servers, features and speed tend to be decided by the pairing with other components in the system: motherboard, CPU speed, and so on. The primary choice to make is how dense you want RAM to be.

Typical DRAM modules for servers are (as of today) between 8 and 64 gigabytes in size. Each density has its own price range:

     DRAM
16G  $7.5/gigabyte
32G  $7/gigabyte
64G  $9/gigabyte
128G $35/gigabyte

128G modules are new and rare, with extremely high costs to match. System builders must decide on tradeoffs: memory slots vs DRAM density. Cloud providers typically want dense, small, inexpensive systems.

These new modules from Intel come in densities of 128G, 256G, and 512G. The cost-per-gigabyte is lowest at 128G and highest at 512G. System performance scales with the number of modules in the system.

For cache systems, the 128G modules seem to fit the sweet spot. The rest of this post explores the performance of a system configured with 128G modules.

Benchmarks

With minimal changes, or none at all, what performance charactistics do we see? What can we learn about how to design the future? We went with three main configurations:

  1. No code changes. Just DRAM baseline on the supplied hardware.
  2. No code changes, using Memory Mode to expand system memory.
  3. Minimal code changes: only item memory on persistent memory storage.

Why not extstore?

extstore breakdown

Extstore is an extension to memcached which utilizes standard SSD/NVME devices to offload memory. It achieves this by keeping metadata and key data in RAM, and pointing the “value” part of an item onto storage. This tradeoff works well for items with relatively large values (1k-8k+). At low sizes (sub 1k), you need an high amount of RAM relative to the disk space you can use.

While extstore works well for large values, memcached lacks a story for handling a large volume of smaller items. PMEM could also be combined with Extstore: item metadata and keys in PMEM and values on NVME flash storage.


Test setup

Tests were done using memcached 1.5.12 and mc-crusher - the test scripts are available here

The tests were run on a single machine running fedora 27 with a kernel build of 4.15.6. The machine hardware specifications were:

The test machine was a NUMA multi socket server. One NUMA node was used to run the memcached server. The other was used to run the mc-crusher benchmark. A CPU core was also reserved for a latency sampling process.

The following OS level configuration changes were made:

/usr/bin/echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo "4096" > /proc/sys/net/core/somaxconn
cpupower set -b 0
cpupower frequency-set -g performance

Patches to Memcached

Three small patches were used to assist in the benchmark.


The Tests

Memcached Configuration


In all cases memcached is configured to use 48 worker threads. The same startup arguments were used for all tests, only varying how much memory is used, and where it came from. The only other change is disabling the LRU crawler. This is a low priority background thread and we ensure it isn’t interfering with test results at random.

Connections to memcached are sticky to a specific worker thread. All of the tests below create at least enough connections per benchmark thread to utilize all of memcached’s worker threads.

Hardware Configurations


This quickstart guide is similar to how we configured the hardware for each test.

The different modes of the Optane memory are described here - we use Memory Mode and App Direct.

DRAM

Memory Mode

memory mode diagram

App Direct Mode

app direct diagram

Results

Random selector:
uniform - select keys at random, every key has same chance of being fetched.
zipf - select keys on a power curve, lower numbered keys fetched more frequently.

Key count:
1 billion - 300G+ of storage. NOTE: DRAM baseline is always 125m keys.
125 million - working set fits in RAM for all hardware configurations.

Throughput tests: 48 connections per bench. core. Maximum request rate.
Heavily batched requests and responses. Low syscall rate.
Heavily batched requests, non batched responses. Medium syscall rate.

Latency tests: 50 requests every 75ms per connection. 400 connections per bench core. High syscall rate.
Read-only
99% read, 1% write
90% read, 10% write

Latency Percentile: - only affects Latency test graphs



Tuning these tests were a real challenge. In most of our experiments all three modes worked nearly the same. The fact that differences are visible here was an exercise in creativity.

DRAM acts as our baseline, but since we don’t have 400 gigabytes of RAM there was no easy way to compare DRAM directly vs PMEM for large workloads. This comparison is still valid: what we present is a comparison of having many machines with a DRAM-fitting workload vs one machine with a much larger than DRAM workload. Below are some quick interpretations to help understand the test results:

We spent a long time validating, re-running, and re-optimizing these tests. If you’re used to testing normal SSD’s, or even NVME drives, these results are unreal. Lets next discuss how this relates to a production deployment.

What works today?

Users of memcached have two bottlenecks to plan around: the network, and memory capacity. More recently, extstore users have to contend with flash drive burnout for workloads under heavy writes.

Network throughput

Throughput is nearly unlimited. In all cases a real network card would fail before the service would. 10 million 300 byte responses is over 20 gigabits of traffic. In almost all known memcached workloads, items vary in size significantly, from a couple bytes to half a megabyte or more. A system could easily saturate a 50 gigabit network card if it has a variety of item sizes stored.

Lets take 300b, 1kb, and 2kb TCP responses (key/value + TCP/Frame header). Now, assuming an 8 million requests per second ceiling for a heavy somewhat realistic read load for the machine we tested, what does NIC saturation look like:

At an average value of 2 kilobytes it becomes plausible to saturate a 100 gigabyte NIC with a single instance. Reality always pulls these numbers lower: syscall overhead, interrupt polling rates, and so on. Memcached also presently doesn’t scale as well on writes. We will have to put in effort to hit 100 gigabits for high write rate workloads.

People often reach for DPDK or similar custom userspace networking stacks on top of specialized hardware. Memcached prefers to keep deployment as simple as possible and does not support these libraries. Further, very few users ever break 100k requests per second on a single node, so they aren’t typically needed.

From what we see, PMEM is equivalent to DRAM for typical network cards.

Memory Density

Memory density is the other major factor determining how many memcached servers a company needs. You need enough RAM to get a high hit ratio to significantly lower backend load, or to fit a working set in memory.

There are limits to how few memcached instances you can have. During a failure other servers must pick up the slack: both database backends taking on extra misses and the network cards of remaining memcached nodes.

Having a new level of density gives more opportunity to expand the size of data caches as well. Cache more data from foreign datacenters, of AI/ML models, of pre-calculated or pre-templated renderings.


We’re off to a strong start. Even without modifying the existing code, the service performs nearly identically to DRAM in any realistic scenario. If we want to improve this situation, a user would need to first be extremely sensitive to microseconds of latency. The only other goal to optimize for is device longevity by reducing writes to PMEM. Intel guarantees full write loads for 3 years, but this could matter if a company intends to use the hardware for 5+ years under heavy sustained load. Still, if flash devices are out of reach due to concerns about early drive burnout, PMEM is a good alternative.

On that note, if you match any of these conditions we would love to hear from you! We aren’t aware of any user who would be limited by this.

Near future: restartable cache

A feature in development allows memcached to be restarted if cleanly shut down. For a lot of workloads this is a bad idea:

delete miss

However, if your application has the ability to queue and retry cache updates or invalidations, or it simply doesn’t matter for your workload, restartability can be a huge benefit. Memcached pools are sized as N+1(ish). You need enough servers to tolerate the extra miss rate when one goes down. If a server is upgraded, you have to wait for the replacement server to refill for a while before moving on.

With restartable mode the item memory and some metadata are stored in a shared memory segment. When memcached restarts, if it’s still compatible with the old data, it will rebuild its hash table and serve its previous data. This allows upgrades to be more easily done. This functions the same whether you are on DRAM or PMEM.

Even with this, you still lose the cache when rebooting a machine, such as with a kernel update. With Persistent Memory in App Direct mode, you can completely reboot (or even repair) a machine without losing the cache, since that shared memory segment will persist across reboots. So long as it’s done in a reasonable amount of time!

This change doesn’t make memcached crash safe. It is designed under a different set of tradeoffs than a traditional database.

TCO

Maximizing network usage, increasing memory density, and reducing cluster size can reduce TCO by up to 80%. Lets break this down.

Hardware Cost Metrics


High end switching (10g, 25g, 50g) can cost hundreds to thousands of dollars per network port. Reducing the number of cache servers reduces port usage and electricity costs. Since memcached can make good usage of larger network cards, assigned capacity won’t go to waste.

Memcached workloads require little CPU. Users may be running dual-socket systems just to get more RAM into a single system. Switching from a dual socket to single socket system alone can cut per-server costs nearly in half.

DRAM prices always have a curve, for example (for current server DIMMs):

  1. 16G modules at $7.5/gigabyte
  2. 32G modules at $7.0/gigabyte
  3. 64G modules at $9.0/gigabyte

Per-gigabyte costs for PMEM are (as of this writing) roughly half of DRAM of similar densities.

(Note: RAM pricing fluctuates, these numbers are for illustration purposes).

     256G         32G      256G
     DRAM         DRAM     Optane
     +---+        +---+    +----+
     |32G|        |16G|    |128G|
     |   |        |   |    |    |
     |32G|        |16G|    |128G|
     |   |        +---+    +----+
     |32G|
     |   |        $7.5/gig   $4.5/gig
     |32G|
     |   |        $1392 total
     |32G|
     |   |
     |32G|
     |   |
     |32G|
     |   |
     |32G|
     +---+

$7/gig -> $1792 total

Take the two examples above: to have 256G of cache one configuration could require 8 RAM slots, at $8 per gigabyte. $2k for RAM. Getting 8 slots could also require multi-socket servers, more expensive motherboards, or higher density RAM in fewer slots at a higher price. A combination with PMEM still requires some RAM, but in total you can cut thousands of dollars off the cost of a single cache node.

Hardware Cost Scaling


The above example shows nodes with 256G of DRAM. Performance scales nearly linearly as modules are added to a machine. Multi-terabyte nodes could be used to drastically reduce size of a cache cluster, network permitting.

Dramatically increasing cache size could reduce costs elsewhere: internet bandwidth or cross-datacenter bandwidth, expensive database servers, expensive compute nodes, and so on.

The Question Of Cloud Pricing


Direct hardware purchases are easy to calculate if you run your own datacenters. How does PMEM fit in with the cloud pricing most people are paying these days?

Cloud pricing is a big wildcard for this new technology right now. When NVME devices were introduced recently, cloud providers were putting 4TB+ drives in large instances with a maximum amount of storage and RAM. These instances were very costly and were not a great match for memcached or memcached with Extstore.

More recently nodes have been appearing with smaller amounts of NVME (750G to 1TB or so), with reduced CPU and RAM to go with. We would like to see cloud providers offer customers instances which fit into this market:

The cache and key/value storage layers are an important part of most infrastructures. Having instances tightly coupled to the needs of cloud users can be make or break for startups, and a huge line item for established companies. Users are able to push to market faster if they don’t have to worry as much about sections of their architecture which may be difficult or costly to scale.

Conclusions and Part Two

Intel® Optane™ DC Persistent Memory Modules have a high density, and very low latency. Compared to the needs of a database, memcached makes extremely few references to PMEM, so performance is nearly the same as DRAM: easily beating 1ms (even 0.5ms) SLA’s at millions of requests per second. PMEM can also drive large cost savings by either reducing cluster size or increasing available cache. Further, with Memory Mode deployment is trivial for old versions of memcached.

With NVME and now PMEM, we see new classes of ultra high density cache clusters:

As with the NVME rollout last year, we see this as an exciting time for datacenter storage. Revolutions come along rarely in this industry, and we welcome everything we can get.

We continue this discussion in part two, with a look into software and algorithmic changes we thought through and tested during the course of this project.