memcached - a distributed memory object caching system

Paper Review: MemC3 - Dormando (February 23, 2020)

Memcached’s simplistic in-memory key/value data structures make it a common target for academic works. Hundreds of papers either modify or compare with vanilla memcached.

Here we analyze one of the earlier papers which made significant performance and memory efficiency modifications to memcached. We focus on how the paper applies to real-world industry and what it provided at the time of original publication.

Paper Background

The MemC3 Paper states it was presented in April 2013. Unusually, the paper also cites the specific version of memcached tested: 1.4.13. That version was released on February 2nd, 2012.

A google citations search yields over 300 results, split among patents and papers. Publications as recent as 2019 and 2020 cite this paper, though often in lists of examples of similar systems or experiments. Since this was one of the early multithreaded optimization papers for memcached it is frequently compared to across many dozens of works.

Primary Claims

MemC3’s abstract claims to make algorithmic and engineering improvements to the original memcached, with a 30% reduction in memory for small objects and 3x network throughput. It achieves this with a novel cuckoo hash table and a version of the CLOCK algorithm to replace the doubly-linked LRU in memcached. We thus have two separate primary claims:

They present a working system as an example.

Is Code Available?

Yes! The code is available online. Based on 1.4.13 yet modified from a tarball rather than a git clone. Most papers don’t specify what versions of or start arguments to the software they compare against, let alone provide working source code.

We were able to build the software and test it. To fix a startup issue, we had to comment a single pthread CPU affinity line; which is fine since we didn’t end up testing performance.

Validating Claims

Simply put, the cuckoo hashing and CLOCK replacement seem to work as advertised; generally memory efficient while avoiding global locks.

We won’t get into the specifics of the hasing algorithm here. There are many posts which delve into probing type hash tables and memory vs speed tradeoffs.

In the 1.4.13 version of memcached, a global lock is held while an object is fetched or removed from the hash table. A bucket of mutex locks are then used to lock individual objects being fetched or modified.

In MemC3, the locks are removed. Instead various atomic operations are used to fetch and modify objects. This allows multiple simulatenous readers to the hash table while memcached could not. Extra care is taken to ensure consistency when the CLOCK algorithm evicts objects.

Quote:

Our cache integrates cache eviction with our optimistic
locking scheme for cuckoo hashing. When Evict selects a victim key
x by CLOCK, it first increases key x’s
version counter to inform other threads currently reading
x to retry; it then deletes x from the hash table to make
x unreachable for later readers, including those retries;
and finally it increases key x’s version counter again to
complete the change for x. Note that Evict and the hash
table Insert are both serialized (using locks) so when
updating the counters they can not affect each other.

The base premise of the paper checks out. Unfortunately as we dive deeper into the code, we find “at what cost?” was taken liberally. A quick review shows the following features are missing:

It’s important to note the timelines of academic publications means they were likely short on time to enable all features. We find it disingenuous to not make any mention of these removed features as some impact performance. At very least ideas or a plan for repair should have been included.

The omission of slab rebalancing, nor any mention of it in the paper, is the biggest hurdle to their claim of memory efficiency in a system replacement.

Memcached uses a slab allocator for its object data. This means available memory is cut into 1 megabyte pages which are then sliced into similarly sized objects. IE: slab class 1 may have objects which are 90 bytes or smaller, while slab class 2 has objects between 90 and 120 bytes, and so on. This alleviates memory fragmentation from a malloc/free implementation, and ensures memcached only has to evict a single object when adding one new object. These tradeoffs create reliable performance over a long uptime.

The slab rebalancing algorithm allows moving these 1 megabyte pages between slab classes. The size of objects may change over time: slowly growing or shrinking. It’s thus possible to have too little memory in an important slab class, effectively shrinking the cache size.

The big disappointment is that under any real workload MemC3 would frequently return the wrong data. If the “delete” command worked, it would return the wrong key entirely.

The following script reproduces the failure and helps explain why:

#!/usr/bin/perl
use warnings;
use strict;

use IO::Socket::INET;

my $addr = "127.0.0.1:11211";
my $sock = IO::Socket::INET->new(PeerAddr => $addr,
                                 Timeout  => 3);
my $key = 'b';
# 200 kilobyte objects.
my $size = 1024 * 200;
# data is '1' repeated.
my $data = '1' x $size;

# Set the one key.
print $sock "set $key 0 0 $size\r\n$data\r\n";
rescheck($sock, "STORED\r\n");

# Generate a huge multiget request for the same key.
my $numkeys = 5000;
my @keys = ();
for (1 .. $numkeys) {
    push(@keys, 'b');
}
my $keystring = join(' ', @keys);
chop($keystring);
print $sock "get ", $keystring, "\r\n";

# don't look at the response yet, but give the server a second
# to wake up and process the request.
sleep 2;

# start another socket
my $sock2 = IO::Socket::INET->new(PeerAddr => $addr,
                                  Timeout  => 3);

# If we could delete 'b' now, we could replace it with 'a' and return an
# entirely different key. However memc3's delete command leaks memory, so we
# demonstrate by replacing the existing key with new data instead.

my $data2 = '2' x $size;
# Re-set the key a couple times to swap the original object back out of the slab
# allocator. Probably only need to do this twice.
for (1 .. 5) {
    print $sock2 "set $key 0 0 $size\r\n$data2\r\n";
    rescheck($sock2, "STORED\r\n");
}

# finally, fetch responses and parse them from original socket.
# Ensure we're getting the original data back.
for (1 .. $numkeys) {
    get_is($sock, "$data\r\n");
}

print "\nDone setting\n";

sub get_is {
    my $sock = shift;
    my $val = shift;
    my $line = scalar <$sock>;
    if ($line =~ /^VALUE/) {
        my $rval = scalar <$sock>;
        die "invalid response: $rval (expect: $val)" unless $rval eq $val;
    } elsif ($line =~ /^END/) {
        print "REQUEST END\n";
    }
}

sub rescheck {
    my $sock = shift;
    my $val = shift;
    my $res = scalar <$sock>;
    die "invalid response: $res (expect: $val)" unless $res eq $val;
}

Here we demonstrate what happens when the server processes a large request, but cannot fit all of the data into the socket at once: it waits for the socket to become writable. With this script, the data for key ‘b’ will change from ‘1’ to ‘2’ halfway through reading! This is a fatal condition for any storage system in general.

Aside: Why We Count References

Memcached objects are “copy on write”, meaning if I set a new value to the same key the old memory stays around until all in-flight clients are done writing to the network. References then reach zero and the object memory is freed.

MemC3 only treats object references as important while examining the data structure of an object. In this case they use lock-free algorithms with version numbers and looping retries on conflicts. This does not extend to actually writing the data to the socket.

This is the case because memcached’s network handling is “zero copy” for object data. Scatter-gather syscalls are used to avoid copying object data into a buffer before writing back to the client. This nets higher performance as objects get larger and reduces memory required for a large number of client connections. If objects were strictly tiny, this copy stage wouldn’t add notable overhead. Real production deployments of memcached rarely consist of only small objects.

If a client is blocked from finishing writing data to a socket (for any reason!), without reference counting the original memory for the object may be re-used by the system and the wrong data sent back to a client.

MemC3 could get around this by copying all object data into a buffer first, but it would have to do the same retry routine:

This could lead to a pathological slowdown for frequently updated objects.

We think this issue stems from a misunderstanding in the MemC3 paper for why objects are reference counted. They state references are used to avoid evicting an object which is currently being accessed. This isn’t the whole picture: referenced object memory cannot immediately be reused, which is why the older code would avoid evicting them. This is the same as why reference counters are used everywhere else: to avoid corrupting data for in-flight requests.

Later versions of memcached (1.4.23 in 2015) removed this limitation. Now, the eviction process can loop and retry in the rare case of finding active objects in the tail. It evicts the tail object even if it’s busy, but won’t reuse its memory until it’s unreferenced. The loop then tries again and evicts another object if necessary. Active objects are immediately bumped to the top of the LRU, so the odds of hitting one during eviction are very low and the feature is only necessary for safety.

MemC3 vs Historical Memcached

The paper was written in 2012. The used 1.4.13 version of memcached already had improvements to thread scaling. Further improvements were made in 1.4.15, released in september 2012. In our own tests at the time, 1.4.15 was able to fetch 13.6 million keys per second using large batches of requests. The request rate would fall depending on how hard the LRU was being accessed. MemC3 claims 4.4M ops/sec. It’s unclear how much syscall batching was tested.

The paper uses YCSB to run a benchmark. They include some benchmark code in the repository. It works by using YCSB to generate a “trace file”, then reads the file into the benchmark memory and replays against memcached.

While they get good performance, it’s unclear if they used this same benchmark tool for testing vanilla memcached (which they get 1.5M ops/sec from). YCSB’s default module for memcached funnels all requests down a single connection, which causes all requests to be processed by a single server thread. Further, until 1.4.18 (released 2014), a bug in memcached would cause every request to update the LRU for the first 60 seconds of uptime. Normally an item was only reordered if it hadn’t been accessed in the last 60 seconds. This would cause benchmarks to run at an extremely low rate at the start, speed up at the end, then finish with a low average. Many papers did not notice this bug.

It’s possible the performance of vanilla memcached was the same or higher than MemC3 at the time of writing, and was much closer at the time of publication.

It wasn’t until 2015, with the release of 1.4.23 that the LRU locks were mitigated. We wrote about this in a blog post

Aside: How strict should academic claims be?

Casual readers of academic papers may not recognize hubris. Industry works heavily market their pro’s and bury their con’s. Academic works are very much the same: you haven’t cured cancer unless you’ve convinced someone else that you have.

It’s important to recognize the marketing of academic papers. Someone is heavily selling their idea; it’s up to the reader to discern if this work is applicable to a production tool. As an industry practitioner the work ethic of computer science papers comes off as odd. They are very thoroughly argued, yet missing the basics of repeatability, often missing code, and so on. Authors can easily argue the ‘point’ of a paper is the demonstration of an algorithm yet still claim to be a fully working replacement system. We need higher standards as this leads people to designing bad or incomplete systems. Worse, it wastes time as engineers rewrite and refactor working systems to implement a paper’s improvement; only finding the cut corners when it hits production testing.

It’s fine for MemC3 to be a new kind of key/value store, useful for users with very exactly sized small objects (8 byte keys, 1-8 byte values, etc). Advertising itself as a replacement for memcached is misleading at best since it only serves a corner case.

Conclusion

We are increasingly marketed to by scientific papers. Often big companies will partner with a school to convert internal systems into a paper to give it more credibility. We must be strict in ensuring these claims are verified, testing is accurate, and marketing is understood as just that.

In the case of MemC3, an interesting application of a cuckoo hash is overshadowed by poor testing metholodogy and cut corners. In fixing at least the data corruption issue the results could change significantly, invalidating its claim to be an overall improvement to the system.