Understanding The Memcached Source Code - LRU I

The slab allocator (I, II, III) is the core module of the cache system; it largely determines how efficiently the bottleneck resource, memory, is utilized. The other three parts, namely, the

LRU algorithm (I - this article, II, III) for entry expiration; the

event-driven model (I, II, III) based on libevent; and

consistent hashing (not complete) for data distribution,

are built around it.

More often than not, the LRU algorithm is combined with a hash map, and the combination is referred to as an

LRU Cache

In an LRU cache, the hash map enables fast access to cached objects, while the LRU list keeps the cache from growing without bound by marking expired, or so-called least recently used, objects for eviction. Next we look at how LRU works from a high-level standpoint.

Linked list

Technically, the LRU algorithm works on a linked list: whenever a list entry is used (accessed or updated), it is removed from the list and re-attached to the list head. In this way, the closer an element is to the list tail, the less recently used it is. Hence it is easy to remove irrelevant or “expired” elements from the tail based on a certain configuration.
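As a minimal illustration of this idea (plain C with made-up names such as lru_node, lru_touch and lru_evict; this is a conceptual sketch, not memcached code), a “touch” is just an unlink followed by a push to the head, and eviction simply detaches the tail:

#include <stddef.h>

/* Conceptual doubly linked LRU list; illustrative names only. */
typedef struct lru_node {
    struct lru_node *prev, *next;
} lru_node;

static lru_node *head, *tail;

/* Move a node to the head on every access, so the tail always holds the
 * least recently used entry. Works both for nodes already in the list and
 * for fresh nodes whose prev/next are NULL. */
static void lru_touch(lru_node *n) {
    if (n == head) return;
    /* unlink */
    if (n->prev) n->prev->next = n->next;
    if (n->next) n->next->prev = n->prev;
    if (n == tail) tail = n->prev;
    /* push to head */
    n->prev = NULL;
    n->next = head;
    if (head) head->prev = n;
    head = n;
    if (!tail) tail = n;
}

/* Eviction detaches the tail, i.e. the least recently used node. */
static lru_node *lru_evict(void) {
    lru_node *victim = tail;
    if (!victim) return NULL;
    if (victim->prev) victim->prev->next = NULL;
    tail = victim->prev;
    if (head == victim) head = NULL;
    return victim;
}

Memcached's real list operations, do_item_link_q and do_item_unlink_q, follow the same pattern and are read later in this article.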

Hash map

A linked list is slow when it comes to element access, hence another data structure is employed. We have seen how a linked list “strings” together the chunks within slabs to form free lists. In an LRU cache the mechanism is similar; however, this time it is the hash map entries, rather than the chunks in slabs, that get wired up, which looks like:

hash map perspective

We can also flatten the linked list to make the structure a bit clearer,

linked list perspective

Core data structure - item

typedef struct _stritem {
    /* Protected by LRU locks */
    struct _stritem *next;
    struct _stritem *prev;
    /* Rest are protected by an item lock */
    struct _stritem *h_next;    /* hash chain next */
    rel_time_t      time;       /* least recent access */
    rel_time_t      exptime;    /* expire time */
    int             nbytes;     /* size of data */
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    uint8_t         slabs_clsid;/* which slab class we're in */
    uint8_t         nkey;       /* key length, w/terminating null and padding */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        ...      // scr: cas
        char end; // scr: flexible array member indicating the item header "end"
    } data[];
    /* if it_flags & ITEM_CAS we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;
item@memcached.h

Properties in discussion

next, prev - LRU list pointers, initialized in do_item_alloc (LRU III), used by item_link_q, item_unlink_q

h_next - hash collision list pointer, initialized in do_item_alloc (LRU III), used by assoc_insert, assoc_delete, and various other methods (LRU II)

time - last access time, set in do_item_link, used by lru_pull_tail (LRU III)

exptime - expire time indicated by request argument, initialized in do_item_alloc (LRU III), used by lru_pull_tail (LRU III)

nbytes - data size indicated by request argument, initialized in do_item_alloc (LRU III)

refcount - reference count, initialized in do_slabs_alloc (Slab III), used by do_item_link

nsuffix - initialized in do_item_alloc (LRU III) with item_make_header

it_flags - initialized in do_item_alloc (LRU III), used by do_item_link, do_item_unlink

slabs_clsid - the LRU list the item belongs to, initialized in do_item_alloc (LRU III), used by item_link_q, item_unlink_q

nkey - key size, calculated in do_item_alloc (LRU III), used by assoc_delete

Memory layout of an item chunk

We have mentioned the item chunk in do_slabs_free. With the help of this data structure, we can now examine the chunk more closely.

item chunk
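To make the layout concrete, here is a rough sketch of the pointer and size arithmetic. It paraphrases the ITEM_key and ITEM_ntotal macros in memcached.h (building on the item typedef shown above and the ITEM_CAS bit of it_flags); the exact definitions may differ slightly between versions.

#include <stddef.h>
#include <stdint.h>

/* Paraphrased from the ITEM_key / ITEM_ntotal macros in memcached.h;
 * assumes the item typedef above and the ITEM_CAS flag bit. */
static size_t item_cas_size(const item *it) {
    return (it->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0; /* optional 8-byte CAS */
}

/* The key starts right after the fixed header plus the optional CAS field. */
static char *item_key(item *it) {
    return (char *)&it->data + item_cas_size(it);
}

/* Total chunk footprint:
 * header + [cas] + key + '\0' + " flags length\r\n" suffix + data ("\r\n"-terminated). */
static size_t item_ntotal(const item *it) {
    return sizeof(item) + item_cas_size(it)
         + it->nkey + 1
         + it->nsuffix
         + it->nbytes;
}

This total is what gets matched against a slab class size when the chunk is allocated.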

Next we read the relevant code that performs the above discussed LRU operations.

do_item_link

int do_item_link(item *it, const uint32_t hv) { // scr: -------------------> 1)
    ...
    it->it_flags |= ITEM_LINKED; // scr: -------------------> 2)
    it->time = current_time;

    ... // scr: stat

    /* Allocate a new CAS ID on link. */
    ... // scr: cas
    assoc_insert(it, hv); // scr: -------------------> 3)
    item_link_q(it); // scr: -------------------> 4)
    refcount_incr(&it->refcount); // scr: -------------------> 5)
    ... // scr: stat

    return 1;
}
do_item_link@item.c

1) hv is short for “hash value”, i.e. the hash of the item's key computed by the caller.

2) Set ITEM_LINKED in it->it_flags, and set current time to it->time.


The field it_flags is used in do_slabs_free and do_slabs_alloc

3) Insert the item into the hash map.

4) Insert the item into the linked list.

5) Increase the reference count.


This field is initialized as 1 in do_slabs_alloc

It is worth noting here that the reference count indicates how many sub-modules are using the same resource, so as to determine when to actually deallocate it (in this particular case, an item is referenced by both the slab module and the LRU). I have written this article that explains a similar mechanism in C++.
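As a minimal, hedged sketch of the pattern (illustrative names, not memcached code; memcached itself uses the refcount_incr/refcount_decr helpers under locks):

#include <stdlib.h>

/* Illustrative refcounted resource; "resource", "resource_acquire" and
 * "resource_release" are made-up names. */
typedef struct resource {
    unsigned short refcount;
    /* ... payload ... */
} resource;

static void resource_acquire(resource *r) {
    r->refcount++;              /* another module now holds a reference */
}

static void resource_release(resource *r) {
    if (--r->refcount == 0) {   /* last holder gone: safe to free */
        free(r);
    }
}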

item_link_q is a thread-safe wrapper around the workhorse method do_item_link_q.
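For reference, the wrapper looks roughly like this (paraphrased from item.c; the lock array and its granularity may vary between versions):

/* Paraphrased from item.c: serialize access to this class's LRU list,
 * then delegate to the workhorse. */
static void item_link_q(item *it) {
    pthread_mutex_lock(&lru_locks[it->slabs_clsid]);
    do_item_link_q(it);
    pthread_mutex_unlock(&lru_locks[it->slabs_clsid]);
}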

static void do_item_link_q(item *it) { /* item is the new head */
    item **head, **tail;
    assert((it->it_flags & ITEM_SLABBED) == 0);

    head = &heads[it->slabs_clsid]; // scr: -------------------> 1)
    tail = &tails[it->slabs_clsid];
    assert(it != *head);
    assert((*head && *tail) || (*head == 0 && *tail == 0));
    it->prev = 0; // scr: -------------------> 2)
    it->next = *head;
    if (it->next) it->next->prev = it;
    *head = it;
    if (*tail == 0) *tail = it;
    sizes[it->slabs_clsid]++; // scr: -------------------> 3)
    return;
}
do_item_link_q@item.c

1) Get the head and tail of the respective LRU linked list indicated by slabs_clsid. Note that slabs_clsid is salted with the type of the queue, hence each slab class may own several LRU lists (see the sketch after the snippet below).

2) Standard operations of “adding an element to the front”.

3) Increase the corresponding counter in the global array sizes.


static item *heads[LARGEST_ID];
static item *tails[LARGEST_ID];
...
static unsigned int sizes[LARGEST_ID];

item.c:59
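A hedged sketch of the “salting” mentioned in 1): with the segmented LRU, per-queue offsets (mirroring HOT_LRU/WARM_LRU/COLD_LRU in memcached.h) are OR-ed into the slab class id, so one slab class maps to several slots in heads/tails/sizes. The exact constants and the flow that picks the queue are version dependent and are revisited in LRU III; treat the values below as illustrative.

/* Illustrative queue offsets; treat the exact values as version dependent. */
#define HOT_LRU    0
#define WARM_LRU  64
#define COLD_LRU 128

/* Slab class ids stay well below 64, so OR-ing the queue offset yields a
 * unique index: e.g. class 5 uses heads[5] for its hot queue and
 * heads[5 | COLD_LRU] == heads[133] for its cold queue. */
static unsigned int lru_list_id(unsigned int slab_clsid, unsigned int queue) {
    return slab_clsid | queue;
}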

assoc_insert - add to hash map

int assoc_insert(item *it, const uint32_t hv) { // scr: again, hv -> hash value
    unsigned int oldbucket;

    ... // scr: expanding related operations
    } else {
        it->h_next = primary_hashtable[hv & hashmask(hashpower)]; // scr: 1)
        primary_hashtable[hv & hashmask(hashpower)] = it; // scr: -------> 2)
    }

    ... // scr: expanding related operations
}
assoc_insert@assoc.c

1) Deal with potential collisions: the item's h_next takes over the current bucket head. If there is no collision, h_next is simply set to null (the empty bucket).

2) Set the item as the new head of the bucket in primary_hashtable.


...
static item** primary_hashtable = 0;
...

assoc.c:42

The expanding logic omitted here will be covered in the next post.

do_item_unlink

void do_item_unlink(item *it, const uint32_t hv) {
    MEMCACHED_ITEM_UNLINK(ITEM_key(it), it->nkey, it->nbytes);
    if ((it->it_flags & ITEM_LINKED) != 0) {
        it->it_flags &= ~ITEM_LINKED; // scr: -------------------> 1)
        ... // scr: stat
        assoc_delete(ITEM_key(it), it->nkey, hv); // scr: ---------------> 2)
        item_unlink_q(it); // scr: -------------------> 3)
        do_item_remove(it); // scr: -------------------> *)
    }
}
do_item_unlink@item.c

1) Clear ITEM_LINKED in it->it_flags.

2) Remove the item from the hash map.

3) Remove the item from the linked list.

*) The actual releasing of an item will be covered in later posts.

Likewise, item_unlink_q is a thread-safe wrapper around the workhorse method do_item_unlink_q.

static void do_item_unlink_q(item *it) {
    item **head, **tail;
    head = &heads[it->slabs_clsid]; // scr: -------------------> 1)
    tail = &tails[it->slabs_clsid];

    if (*head == it) { // scr: -------------------> 2)
        assert(it->prev == 0);
        *head = it->next;
    }
    if (*tail == it) {
        assert(it->next == 0);
        *tail = it->prev;
    }
    assert(it->next != it);
    assert(it->prev != it);

    if (it->next) it->next->prev = it->prev;
    if (it->prev) it->prev->next = it->next;
    sizes[it->slabs_clsid]--; // scr: -------------------> 3)
    return;
}
do_item_unlink_q@item.c

1) Same as before, get the head and tail of the respective LRU linked list indicated by slabs_clsid.

2) Standard operations of “removing an element from a linked list”.

3) Decrease the corresponding counter in the global array sizes.


static item *heads[LARGEST_ID];
static item *tails[LARGEST_ID];
...
static unsigned int sizes[LARGEST_ID];

item.c:59

assoc_delete - remove from hash map

static item** _hashitem_before (const char *key, const size_t nkey, const uint32_t hv) {
    item **pos;
    unsigned int oldbucket;

    ... // scr: expanding related operations
    } else {
        pos = &primary_hashtable[hv & hashmask(hashpower)]; // scr: -----> 1)
    }

    while (*pos && ((nkey != (*pos)->nkey) || memcmp(key, ITEM_key(*pos), nkey))) {
        pos = &(*pos)->h_next; // scr: ----------------------------------> 2)
    }
    return pos;
}

...

void assoc_delete(const char *key, const size_t nkey, const uint32_t hv) {
    item **before = _hashitem_before(key, nkey, hv);

    if (*before) {
        item *nxt;
        ...
        nxt = (*before)->h_next; // scr: --------------------------------> 3)
        (*before)->h_next = 0;   /* probably pointless, but whatever. */
        *before = nxt; // scr: ------------------------------------------> 4)
        return;
    }
    /* Note: we never actually get here. the callers don't delete things
       they can't find. */
    assert(*before != 0);
}
assoc_delete@assoc.c

1) Get the hash bucket using hv.

2) Walk the collision chain and compare keys. Note that the return value is the address of the h_next member of the element before the found one; when there is no collision, that address is the bucket slot itself (see the sketch after this list).

3) Save the element after the found one in the temporary variable nxt.

4) Update the next member of the element before the found one.
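The item** return value of _hashitem_before is the classic “pointer to the next pointer” idiom: it removes a node from a singly linked chain without special-casing the bucket head. A minimal, self-contained sketch (illustrative names such as node and list_remove, not memcached code):

#include <stdio.h>
#include <stdlib.h>

/* Minimal singly linked list node; illustrative names only. */
typedef struct node {
    int key;
    struct node *next;
} node;

/* Walk the chain through the *address* of each next pointer, so the head
 * pointer and an interior next pointer are handled uniformly -- the same
 * idea as _hashitem_before returning item**. */
static void list_remove(node **head, int key) {
    node **pos = head;
    while (*pos && (*pos)->key != key) {
        pos = &(*pos)->next;      /* advance to the next "next" slot */
    }
    if (*pos) {                   /* found: splice it out */
        node *victim = *pos;
        *pos = victim->next;      /* works even when victim is the head */
        free(victim);
    }
}

int main(void) {
    node *head = NULL;
    for (int k = 3; k >= 1; k--) {    /* build 1 -> 2 -> 3 */
        node *n = malloc(sizeof *n);
        n->key = k;
        n->next = head;
        head = n;
    }
    list_remove(&head, 1);            /* removing the head needs no special case */
    for (node *p = head; p; p = p->next)
        printf("%d ", p->key);        /* prints: 2 3 */
    printf("\n");
    return 0;
}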

Take home

Try this.

That's it. Did I make a serious mistake, or miss out on anything important? Or did you simply like the read? Link me on -- I'd be chuffed to hear your feedback.