In previous posts, we discussed the different facets of an item, i.e., the slab, hash map and LRU list, as well as their associated (CRUD) methods. These build up the internal procedures that serve client requests after the corresponding commands are parsed by the drive machine. This time we will walk through those procedures by issuing telnet commands to a Memcached instance and see how the discussed modules work together on various item operations. We will also see the whole picture of the LRU lists that maintain the property of 'least recently used' in accordance with those operations.
On top of the standard LRU algorithm, Memcached (1.4.28) employs 3 lists instead of just 1 for each slab class, i.e., hot, warm and cold, a.k.a. segmented LRU. This heuristic is implemented to reduce the lock contention between item bumping (an action that moves a recently accessed item to the list head) and item reads. Moreover, unlike a casual implementation (such as the one I coded), Memcached does not bump an item right on the read action. Rather, the bumping is delayed to other operations that run when the resource is in short supply, which reflects Memcached's read-first design decision.
In normal use cases, say a social media site, the volume of read requests exceeds that of all other operations combined by orders of magnitude, hence it's a critical path that is worth extensive optimization, I suppose.
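To make the heuristic concrete, here is a minimal sketch in C of the deferred-bump idea: the read path only sets a flag and takes no LRU lock, while the actual list relinking happens later on a maintenance path. All names (cache_item, on_read, maintenance_bump) are mine, not Memcached's; this is an illustration of the idea, not the real implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define ITEM_ACTIVE 1

/* Hypothetical miniature item; the real struct _stritem has many more fields. */
typedef struct cache_item {
    struct cache_item *prev, *next;
    int flags;
} cache_item;

/* Read path: just set a flag -- no list manipulation, no LRU lock taken. */
static void on_read(cache_item *it) {
    it->flags |= ITEM_ACTIVE;
}

/* Maintenance path (the role lru_pull_tail plays): the deferred bump happens
 * here. Returns true if the tail item was active and got moved to the head. */
static bool maintenance_bump(cache_item **head, cache_item **tail) {
    cache_item *it = *tail;
    if (it == NULL || !(it->flags & ITEM_ACTIVE))
        return false;
    it->flags &= ~ITEM_ACTIVE;
    /* unlink from the tail */
    *tail = it->prev;
    if (*tail) (*tail)->next = NULL;
    /* relink at the head */
    it->prev = NULL;
    it->next = *head;
    if (*head) (*head)->prev = it;
    *head = it;
    return true;
}
```

A hot item therefore costs one flag write per read, and at most one relink per maintenance pass, instead of one relink per read.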
We start this post by issuing an item read command to a Memcached instance.
~ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
... // add some items
> get test
    /*
     * If the command string hasn't been fully processed, get the next set
     * of tokens.
     */
    if (key_token->value != NULL) { // scr: ------------------> *)
        ntokens = tokenize_command(key_token->value, tokens, MAX_TOKENS);
        key_token = tokens;
    }
    if (key_token->value != NULL || add_iov(c, "END\r\n", 5) != 0
        || (IS_UDP(c->transport) && build_udp_headers(c) != 0)) {
        out_of_memory(c, "SERVER_ERROR out of memory writing get response");
    } else { // scr: ----------------------------------------------> *)
        conn_set_state(c, conn_mwrite);
        c->msgcurr = 0;
    }
}
process_get_command@memcached.c
The only relevant steps here are 1) item_get and 2) item_update. Steps marked as *) are mostly command parsing and I/O, which will be discussed in later posts when we examine the event-driven mechanism.
do_item_get
Like other methods discussed before, item_get is a thread-safe wrapper of do_item_get.
3) If the item has expired, remove it. Note that do_item_unlink decreases the reference count held by the last step, and do_item_remove actually removes the item. These two methods will be discussed soon in item delete.
4) Simply mark the item as ITEM_ACTIVE rather than performing item bumping, which is offloaded to procedures associated with other operations. This is part of the heuristic discussed at the beginning.
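The lazy expiration test from step 3) can be sketched as follows. is_expired is my name, not Memcached's, and current_time is passed in explicitly instead of being read from a global; an expired item is only noticed, and reclaimed, when a read happens to touch it.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t rel_time_t;

/* Sketch of the lazy-expiration check performed on the read path:
 * exptime == 0 means "never expires", otherwise the item is dead as
 * soon as its expiry timestamp is in the past. */
static bool is_expired(rel_time_t exptime, rel_time_t current_time) {
    return exptime != 0 && exptime <= current_time;
}
```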
do_item_update
item_update is a thread-safe wrapper of do_item_update.
1) The only effective line in this method sets the access time of the item, and does so (passively) at most once per 60-second interval. lru_maintainer_thread is set to true by the command line argument modern, so the operations inside if (!settings.lru_maintainer_thread) are not applicable.
#define ITEM_UPDATE_INTERVAL 60
memcached.h:73
case MODERN:
    ...
    start_lru_maintainer = true;
    break;
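The rate-limiting check behind ITEM_UPDATE_INTERVAL is tiny; here is a sketch with my own helper name (maybe_touch) and with current_time passed in rather than taken from Memcached's global clock. Under these assumptions, however hot an item is, its timestamp is rewritten at most once per minute.

```c
#include <stdbool.h>
#include <stdint.h>

#define ITEM_UPDATE_INTERVAL 60

typedef uint32_t rel_time_t;

/* Refresh the access stamp only if the last recorded access is more than
 * ITEM_UPDATE_INTERVAL seconds old. (Assumes current_time is at least
 * ITEM_UPDATE_INTERVAL; the real server's clock always is in practice.) */
static bool maybe_touch(rel_time_t *item_time, rel_time_t current_time) {
    if (*item_time >= current_time - ITEM_UPDATE_INTERVAL)
        return false; /* touched recently enough: no write */
    *item_time = current_time;
    return true;
}
```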
    /* so slab size changer can tell later if item is already free or not */
    clsid = ITEM_clsid(it); // scr: -------------------------> 2)
    DEBUG_REFCNT(it, 'F');
    slabs_free(it, ntotal, clsid); // scr: ------------------> 3)
}
do_item_remove@items.c
1) Decrease the reference count; if it reaches 0, go to 2) and free the item.
2) Use ITEM_clsid to get the slab class the item belongs to. This macro strips the list type from slabs_clsid.
3) Call slabs_free to release the memory back to the slab subsystem.
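The whole remove path boils down to reference counting. Here is a minimal sketch with a stand-in free function and my own names (mini_item, fake_slabs_free, mini_item_remove); only the ITEM_clsid mask (list type in the top two bits, class in the low bits, as in Memcached 1.4.x) mirrors the real code.

```c
#include <assert.h>
#include <stdint.h>

/* Same mask as Memcached 1.4.x: strip the LRU list type (bits 6-7),
 * keeping only the slab class. */
#define ITEM_clsid(it) ((it)->slabs_clsid & ~(3 << 6))

typedef struct {
    unsigned short refcount;
    uint8_t slabs_clsid;
} mini_item;

static int freed_clsid = -1; /* records which class a free would return to */

/* Stand-in for slabs_free: just note the class instead of touching memory. */
static void fake_slabs_free(mini_item *it, int clsid) {
    (void)it;
    freed_clsid = clsid;
}

/* Mirrors the shape of do_item_remove: drop one reference; free at zero. */
static void mini_item_remove(mini_item *it) {
    assert(it->refcount > 0);
    if (--it->refcount == 0)
        fake_slabs_free(it, ITEM_clsid(it));
}
```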
Create
The item creation procedure is divided into several logical fragments by the aforementioned drive machine: 1) create an empty item object with the key and other metadata sent through; 2) read the value (from the socket) and fill the item object with it; and 3) link the item object. The workflow controller, the drive machine, will be discussed in the next post.
Now we send an add command to the Memcached instance.
> add test 0 60 11 (\r\n)
> hello world
Creating an empty item object
After the first line of the above command
> add test 0 60 11 (\r\n)
the procedure described in this section starts. The call stack of the hot path is:
ref 1
|~ Drive machine & command parser
|- process_update_command
 |- do_item_alloc
  |- slabs_clsid (Slab III) ++
  |- do_slabs_alloc (Slab III)
  |- lru_pull_tail (on hot list)
   |- do_item_update_nolock (same to do_item_update)
   |- do_item_remove
   |- item_link_q (LRU I)
   |- do_item_remove
  |- lru_pull_tail (on warm list)
   |- same as hot list
  |- lru_pull_tail (on cold list)
   |- do_item_unlink_nolock (same to do_item_unlink LRU I)
1) Set the key (i.e., test), as well as the metadata (i.e., flags, 0; exptime, 60; vlen, 11), to local variables.
2) Increase vlen by 2 to make room for the trailing \r\n in addition to the value string.
3) Call item_alloc to allocate the memory (from slab) for the item.
4) After item_alloc is called, set the properties of conn. Here ritem points to the data portion of an item chunk; and rlbytes is set to vlen. These two fields will be used to populate the data portion with the content, i.e., hello world, in the next post.
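The size accounting behind these steps can be sketched as follows. Only the "+2" (trailing \r\n after the value) and the "+1" (key terminator) mirror the real code; the 48-byte header is a stand-in for sizeof(item), the helper name needed_bytes is mine, and the real code additionally reserves a small suffix, omitted here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: how many bytes the item chunk must hold for a
 * command like "add test 0 60 11". vlen_from_command is the 11 the
 * client sent; the stored value is that plus the CRLF terminator. */
static size_t needed_bytes(const char *key, int vlen_from_command) {
    size_t header = 48;                            /* stand-in for sizeof(item) */
    size_t nkey = strlen(key);
    size_t nbytes = (size_t)vlen_from_command + 2; /* value plus \r\n */
    return header + nkey + 1 + nbytes;             /* +1 for the key terminator */
}
```

This total is what gets passed down to slabs_clsid in the next step to pick a slab class.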
do_item_alloc
Unlike the other methods we have discussed, item_alloc is a wrapper of do_item_alloc that does not add any locks. I would assume this wrapper exists simply for a consistent code style.
item *item_alloc(char *key, size_t nkey, int flags, rel_time_t exptime, int nbytes) {
    item *it;
    /* do_item_alloc handles its own locks */
    it = do_item_alloc(key, nkey, flags, exptime, nbytes, 0);
    return it;
}
    unsigned int id = slabs_clsid(ntotal); // scr: -----------------------> 2)
    if (id == 0)
        return 0;
    /* If no memory is available, attempt a direct LRU juggle/eviction */
    /* This is a race in order to simplify lru_pull_tail; in cases where
     * locked items are on the tail, you want them to fall out and cause
     * occasional OOM's, rather than internally work around them.
     * This also gives one fewer code path for slab alloc/free
     */
    for (i = 0; i < 5; i++) {
        /* Try to reclaim memory first */
        ... // scr: legacy, no lru_maintainer_thread
        it = slabs_alloc(ntotal, id, &total_chunks, 0); // scr: ----------> 3)
        ... // scr: no-expire setting
        if (it == NULL) {
            if (settings.lru_maintainer_thread) { // scr: ----------------> 4)
                lru_pull_tail(id, HOT_LRU, total_chunks, false, cur_hv);
                lru_pull_tail(id, WARM_LRU, total_chunks, false, cur_hv);
                lru_pull_tail(id, COLD_LRU, total_chunks, true, cur_hv);
            } else {
                ... // scr: legacy, no lru_maintainer_thread
            }
        } else {
            break;
        }
    }
... // scr: stat & sanity check
    /* Refcount is seeded to 1 by slabs_alloc() */
    it->next = it->prev = it->h_next = 0; // scr: ------------------------> 5)
    /* Items are initially loaded into the HOT_LRU. This is '0' but I want at
     * least a note here. Compiler (hopefully?) optimizes this out. */
    if (settings.lru_maintainer_thread) {
        if (...) { // scr: no expire setting
            ...
        } else {
            id |= HOT_LRU; // scr: ---------------------------------------> 6)
        }
    } else {
        ... // scr: legacy, no lru_maintainer_thread
    }
    it->slabs_clsid = id; // scr: ----------------------------------------> 7)
Steps 2) and 3) are discussed in detail in Slab III. To recap, slabs_clsid selects the most suitable slab class and slabs_alloc allocates one item chunk from the slab subsystem.
4) If slabs_alloc fails, try to release some memory using lru_pull_tail and retry the allocation, at most 5 times. lru_pull_tail is the focus of the next section.
5) Initialize the pointers of LRU list and hash collision list.
6) Set the list type (HOT_LRU) in the slabs_clsid, which indicates that this item belongs to the "HOT" LRU list of its respective slab class.
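The encoding in steps 6) and 7) is plain bit packing. Here is a sketch using the same values as Memcached 1.4.x's list enum: the slab class lives in the low 6 bits of the id, the list type in bits 6-7, so the combined id can index per-list arrays such as heads, tails and lru_locks directly. The helper names are mine.

```c
#include <stdint.h>

/* Same numeric values as Memcached 1.4.x. */
#define HOT_LRU   0
#define WARM_LRU  64
#define COLD_LRU  128
#define NOEXP_LRU 192

/* Combine a slab class (< 64) with a list type into one byte. */
static uint8_t pack(uint8_t slab_class, uint8_t lru) {
    return slab_class | lru;
}

/* The two halves of the combined id; slab_class_of is what the
 * ITEM_clsid macro does. */
static uint8_t slab_class_of(uint8_t id) { return id & ~(3 << 6); }
static uint8_t lru_of(uint8_t id)        { return id & (3 << 6); }
```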
    id |= cur_lru; // scr: ------------------------------------------> p)
    pthread_mutex_lock(&lru_locks[id]);
    search = tails[id]; // scr: -------------------------------------> p)
    /* We walk up *only* for locked items, and if bottom is expired. */
    for (; tries > 0 && search != NULL; tries--, search = next_it) { // scr: p)
        /* we might relink search mid-loop, so search->prev isn't reliable */
        next_it = search->prev; // scr: -----------------------------> p)
        ... // scr: irrelevant code here
        uint32_t hv = hash(ITEM_key(search), search->nkey);
        /* Attempt to hash item lock the "search" item. If locked, no
         * other callers can incr the refcount. Also skip ourselves. */
        if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)
            continue;
        /* Now see if the item is refcount locked */
        if (refcount_incr(&search->refcount) != 2) { // scr: --------> s)
            /* Note pathological case with ref'ed items in tail.
             * Can still unlink the item, but it won't be reusable yet */
            itemstats[id].lrutail_reflocked++;
            /* In case of refcount leaks, enable for quick workaround. */
            /* WARNING: This can cause terrible corruption */
            if (settings.tail_repair_time &&
                    search->time + settings.tail_repair_time < current_time) {
                itemstats[id].tailrepairs++;
                search->refcount = 1;
                /* This will call item_remove -> item_free since refcnt is 1 */
                do_item_unlink_nolock(search, hv);
                item_trylock_unlock(hold_lock);
                continue;
            }
        }
p) This method starts by selecting the tail element of the designated LRU list, using the slab class id and the list type, on the assumption that that element can be a release candidate. It then iterates over the list in reverse order (at most 5 entries), in case the elements near the tail have been recently accessed.
s) For each item selected, increase its reference count. In a normal situation, the original value of the reference count should be 1 (as you will see in the last step of the create operation), hence a value other than 2 after the increment indicates an exception that needs to be corrected. Note that the reference count is now 2, so it must be decreased at least once (back to 1) when the processing of the current item is done (i.e., when e1, e2 or e3 is reached).
e1) Remove the item directly when an expiration is detected. Here the do_item_unlink_nolock is exactly the same as the discussed do_item_unlink (I think the code is duplicated to emphasize that this method is not thread-safe), and it follows the same “unlink and remove” routine as in item delete.
e2) When a candidate is found, we might need to relocate it to another list (when move_to_lru is set in the switch case) by calling item_link_q. And we do need to call do_item_remove to decrease the reference count back to 1. The decision is made by the steps discussed below.
e3) If an item has been recently accessed, clear the ITEM_ACTIVE flag, bump it to the head of the list, decrease its reference count and iterate to the next one (at most 5 times). Remember that the flag ITEM_ACTIVE is set by item_get, and here is the place where the item actually gets bumped.
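The shape of the tail walk can be sketched like this: at most a handful of tries, skipping entries that other threads still reference. find_victim and titem are my names, and the refcount==1 test stands in for the refcount_incr(...) != 2 check above.

```c
#include <stddef.h>

typedef struct titem {
    struct titem *prev;
    unsigned short refcount; /* 1 == only the LRU list holds a reference */
} titem;

/* Walk from the tail towards the head for at most `tries` steps and
 * return the first entry nobody else references, or NULL. */
static titem *find_victim(titem *tail, int tries) {
    titem *search, *next_it;
    for (search = tail; tries > 0 && search != NULL; tries--, search = next_it) {
        next_it = search->prev; /* saved: search may be relinked mid-loop */
        if (search->refcount != 1)
            continue; /* in use elsewhere: skip, as lru_pull_tail does */
        return search;
    }
    return NULL;
}
```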
Hot & warm
1) The only difference between HOT_LRU and WARM_LRU is the threshold (limit), which is indicated by their respective configurations hot_lru_pct and warm_lru_pct.
settings.hot_lru_pct = 32;
...
case HOT_LRU_PCT:
    if (subopts_value == NULL) {
        fprintf(stderr, "Missing hot_lru_pct argument\n");
        return 1;
    };
    settings.hot_lru_pct = atoi(subopts_value);
    if (settings.hot_lru_pct < 1 || settings.hot_lru_pct >= 80) {
        fprintf(stderr, "hot_lru_pct must be > 1 and < 80\n");
        return 1;
    }
    break;
...
hot_lru_pct@memcached.c
...
settings.warm_lru_pct = 32;
...
case WARM_LRU_PCT:
    if (subopts_value == NULL) {
        fprintf(stderr, "Missing warm_lru_pct argument\n");
        return 1;
    };
    settings.warm_lru_pct = atoi(subopts_value);
    if (settings.warm_lru_pct < 1 || settings.warm_lru_pct >= 80) {
        fprintf(stderr, "warm_lru_pct must be > 1 and < 80\n");
        return 1;
    }
    break;
...
warm_lru_pct@memcached.c
2) If the threshold of HOT_LRU or WARM_LRU is reached, remove the item from the current list using do_item_unlink_q and go to e2). Therefore, e2) is responsible for relinking it to the COLD_LRU and decreasing the reference count.
3) If the current item is not "active" and the threshold is not reached, finish this method without relocating or releasing any item.
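The threshold test itself is a one-liner. Here is a sketch with my own helper name (over_limit), assuming, as the excerpts above suggest, that the limit is pct% of the class's total chunks; with the default 32%/32% split, hot and warm together may hold at most roughly two thirds of a class before items spill to cold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the HOT/WARM overflow test in lru_pull_tail: a sub-list may
 * hold at most pct% of the slab class's chunks; anything beyond that
 * gets demoted to the COLD list. */
static bool over_limit(uint64_t list_size, uint64_t total_chunks, int pct) {
    uint64_t limit = total_chunks * (uint64_t)pct / 100;
    return list_size > limit;
}
```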
Cold
4) If there are any items in the list, evict one directly with the discussed do_item_unlink_nolock to free up its resources, and go to e1). Note that the default values of evict_to_free and slab_automove are set to values that disable their respective if blocks.
...
settings.evict_to_free = 1;
...
memcached.c:215
...
case MODERN:
    /* Modernized defaults. Need to add equivalent no_* flags
     * before making truly default. */
    ...
    settings.slab_automove = 1;
    ...
memcached.c:5824
Now we input the second line of the command
> hello world
and trigger the following procedures.
Populate the item with content
As mentioned in process_update_command, the content we input is copied into conn.item using conn.ritem and conn.rlbytes. This step is handled by the drive machine, which will be discussed in detail in the next post.
...
res = read(c->sfd, c->ritem, c->rlbytes);
...
memcached.c:4421
Now we consider conn.item to be filled with all the relevant information, hence the next and final step is to link the item object.
That's it. Did I make a serious mistake, or miss out on anything important? Or did you simply like the read? Link me on -- I'd be chuffed to hear your feedback.