Customizing Redis pubsub for message persistence – Part 2


In the last post we saw how Redis can easily be modified to persist the last published message on PubSub channels. Without subscribing to the PubSub channel, we were able to get the last published message from the Redis db. In this post, I will take that idea a step further and add native capabilities within Redis to persist all unprocessed messages published on a PubSub channel in channel-specific lists. We’ll also preserve our capability to send the last published message to clients upon subscription.

But why are we doing this?

Popular open source applications that provide support for Redis are built on top of its list class of APIs. For example, let’s look at Celery, a distributed task queue written in Python. Start redis-cli MONITOR in one terminal and then start Celery in another as follows:

$ pip install celery
$ celery worker --queues=testing --broker=redis:// --without-gossip --without-mingle --without-heartbeat

You’ll find Celery polling Redis periodically, as indicated by log lines like "BRPOP" "testing" "testing\x06\x163" "testing\x06\x166" "testing\x06\x169" "1". Celery uses the Redis PubSub mechanism (which we disabled in the above command) only for internal features. Sentry, a popular exception logging and aggregation library, internally depends upon Celery. There is an open pull request that claims to add Redis PubSub based support to Celery. In the world of Ruby, background processing frameworks like Resque and Sidekiq depend upon the Redis list class of APIs.

However, with no native support for persistence of PubSub messages in Redis, it’s not difficult to understand why adopting Redis PubSub can be tricky for some. Currently, Redis simply drops a published message if no subscribers are found. Hence, the question really is whether your application is tolerant to the loss of published messages (for example, messages dropped while you were upgrading your application).

To work around the persistence problem with Redis PubSub, the usual approach is to run multiple application instances, so that some instances can continue to serve while others get deployed. Even then, your active instances might be experiencing a network partition and be unable to receive published messages. After all, the primary goal is to guarantee processing of every message received by Redis, irrespective of whether we are using a list or PubSub based backend. Native support for persisting Redis PubSub messages is clearly desirable.

Persisting dropped Redis PubSub messages in a list

In the last post we added a single line of code to persist the last published message on a channel in a separate Redis key. We’ll now update the implementation to push every received message onto the end of a channel-specific list. Replace the `setKey(c->db, c->argv[1], c->argv[2]);` line that we added last time with the following code:

// Persist messages in list only if no receivers were found
if (receivers == 0) {
    int j, pushed = 0, where = REDIS_TAIL;

    // Fetch list key from the database
    robj *lobj = lookupKeyWrite(c->db,c->argv[1]);

    // For every published message on the channel
    for (j = 2; j < c->argc; j++) {
        c->argv[j] = tryObjectEncoding(c->argv[j]);

        // Ensure we have our quicklist initialized
        if (!lobj) {
            lobj = createQuicklistObject();
            quicklistSetOptions(lobj->ptr, server.list_max_ziplist_size,
                                server.list_compress_depth);
            dbAdd(c->db,c->argv[1],lobj);
        }

        // Push message at the tail of the list
        listTypePush(lobj,c->argv[j],where);
        pushed++;
    }

    // Signal key watchers and internal event subscribers
    if (pushed) {
        char *event = (where == REDIS_HEAD) ? "lpush" : "rpush";
        signalModifiedKey(c->db,c->argv[1]);
        notifyKeyspaceEvent(REDIS_NOTIFY_LIST,event,c->argv[1],c->db->id);
    }
    server.dirty += pushed;
}

I have added some comments in the code for clarity. This code is merely a rip-off of the src/t_list.c:pushGenericCommand function, minus the replies that are usually sent to the client after an RPUSH command. Frankly, we could further refactor pushGenericCommand to turn the above code into a single function call.

Build with make (and optionally make test), start ./src/redis-server, and using ./src/redis-cli try:

$ ./src/redis-cli 
127.0.0.1:6379> publish persistent-channel this
(integer) 0
127.0.0.1:6379> publish persistent-channel is
(integer) 0
127.0.0.1:6379> publish persistent-channel gonna
(integer) 0
127.0.0.1:6379> publish persistent-channel be
(integer) 0
127.0.0.1:6379> publish persistent-channel awesome
(integer) 0
127.0.0.1:6379> lrange persistent-channel 0 -1
1) "this"
2) "is"
3) "gonna"
4) "be"
5) "awesome"
127.0.0.1:6379>

Voila! Since we published a few messages with no active subscriber, they all got persisted in a list named after the channel itself. Incoming subscribers can now process pending messages, which would otherwise have been dropped, by fetching them from this list.
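Seen from the client side, the pattern this change enables is “drain the channel’s list, then subscribe”. Below is a minimal Python sketch of that recovery loop; the dict-based store and the drain_pending helper are my own illustration, standing in for a real Redis connection and its LRANGE/DEL commands:

```python
# Illustrative only: a dict stands in for Redis, and draining the
# channel-named list replaces LRANGE + DEL on a real connection.

def drain_pending(store, channel):
    """Remove and return every message persisted in the channel's list."""
    return store.pop(channel, [])

# Five messages were published while no subscriber was attached:
store = {"persistent-channel": ["this", "is", "gonna", "be", "awesome"]}

backlog = drain_pending(store, "persistent-channel")
for message in backlog:
    pass  # process each pending message, then SUBSCRIBE for live ones
```

With a real Redis client, the drain and the delete would need to be made atomic (e.g. in a MULTI/EXEC block) so that a message published in between is not lost.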

Delivering unprocessed messages to subscribers upon subscription

Instead of depending upon clients to poll the channel list length, the server could deliver unprocessed messages to subscribers upon successful subscription. Since this can get overwhelming for subscribers if several pending messages are waiting in the list, we may not want to do this at all and instead leave it up to the clients. Let’s however preserve our feature from the last post, i.e. delivering the last published message to clients upon subscription.

Here we don’t want to remove the last published message from our persistent list; we simply wish to send it to incoming subscribers. Imagine, for example, user status, location and mood being published on separate channels. Here is a method that gives back the last element of the list without removing it (equivalent to LRANGE key -1 -1):

robj *llast(redisClient *c, robj *key) {
    listTypeEntry entry;
    robj *o, *value = NULL;
    long llen;

    // Fetch list object from db
    o = lookupKeyRead(c->db, key);

    // Ensure key contains a list
    if(o != NULL && o->encoding == REDIS_ENCODING_QUICKLIST) {
        // Get list iterator for "lrange key -1 -1" use case
        llen = listTypeLength(o);
        listTypeIterator *iter = listTypeInitIterator(o, llen-1, REDIS_TAIL);

        // Fetch last item, prepare value
        listTypeNext(iter, &entry);
        quicklistEntry *qe = &entry.entry;
        if (qe->value) {
            value = createStringObject((char *)qe->value,
                                       qe->sz);
        } else {
            value = createStringObjectFromLongLong(qe->longval);
        }
        listTypeReleaseIterator(iter);
    }
    return value;
}

We now need to replace our modifications within the subscription handling methods from the last post. Note that the caller must call decrRefCount on the returned robj to free the created string object after delivery. Kindly check this github commit for all the changes since the last post. You can also checkout my Redis fork and directly play with the modified code.

Conclusion

We saw how easy it is to modify Redis for fun and profit. By adding native persistence capabilities, we offload the task of ensuring processing of every message received by the Redis cluster. After all, no amount of magical client side logic will ever be as reliable as native support. By the way, Redis 3.0.0 was released this week with native support for clustering; check it out while it’s hot.

Leave your comments and feedback.

Back to blogging: What to expect

Hello Readers,

I started this blog as a way to share my experiments and experiences while learning web development and computer science in general. In the first two years (between Apr ’08 and Aug ’10) I wrote as many as 100 blog posts. Quite a frenzy. Ever since, I only managed to write 5-6 posts in the following 4 years, and have nearly 45 drafts which may now never get published. The good thing is that I am back to blogging, which means there is a lot to share.

Briefly, here is what (or what not) to expect in the future posts:

  1. PHP – In the past, PHP has dominated the content on this blog: mostly web demos, some quick hacks and some JAXL library examples. However, I have not been working actively with PHP since ’10 and have probably not touched it since ’12. Expect zero PHP.

  2. JAXL – No more PHP essentially means no more JAXL posts. In fact, I recently moved the JAXL repository to its own Github organization where other collaborators can maintain, improve and work on it without requiring my active involvement. This organization also contains other repositories that I managed to open source from my startup Jaxl.

  3. XMPP – Unfortunately, I am no longer in touch with progress on the XMPP specifications. The specs have evolved a lot, to the extent that some developers have reported mod_message_carbon no longer works as expected with newer Ejabberd server versions (also, the Message Carbons extension XEP-0280 has itself been deprecated). However, XMPP will always be my preferred choice whenever I need the entire suite of user-to-user messaging, group messaging, presence, contacts management, Jingle / SIP integration and other features baked into the XMPP XEPs. For my everyday messaging needs, newer technologies like ZeroMQ, AMQP (RabbitMQ), MQTT or even Redis PubSub are more suitable.

  4. Java – After some journey I am now finally working full-time with Java. I still hate it, but I am trying to adapt to it, learn it and love it for what it’s worth.

  5. Python – Thanks to my stint with Appurify, I had a chance to work full-time with Python. I even managed to work on some interesting open source Python projects. Even though it’s no longer my primary language, Python is always fun, especially when one is in a hurry to get things done.

  6. Golang / Erlang – I met Golang a year back. I met Erlang while hacking on Ejabberd, Riak etc. for my startup Jaxl and immediately fell in love with it. Nowadays, I am in love with Golang. It’s simple and precise, and has message passing semantics (buffered channels) similar to those found in Erlang (mailboxes). I highly recommend digging into these languages and getting comfortable with the message passing programming paradigm. They will change how you approach and think about your application structure. Expect lots of Golang and some Erlang.

  7. Docker – Who is not into Docker these days? If that’s not the case with you, leave this post right now and head over to the Docker user guide. That’s how important I find this piece of beauty (technology). Expect a lot about Docker in my future posts.

  8. Startups – A lot of startup fun has kept me busy since ’10, some experiences and learnings are worth sharing.

  9. Android – I have been working full-time with mobiles (both Android and iOS) since ’12. Not much application making, but a lot of hacking with the Adb protocol and libimobiledevice.

  10. System designing – Luckily, I happened to experience a lot of end-to-end system and network designing. This domain is of great interest once you start to have fun with racks, subnets, routes, switches, firewalls, DNS, multicast and the entire suite of technology under this umbrella.

I will end this post with some interesting images from the past.

Fsck'd iPhone screen
Swollen iPhone screen due to high device temperature
Rackframe
Setting up Racks

MEMQ : Fast queue implementation using Memcached and PHP only

Memcached is a scalable caching solution developed by Danga Interactive. One can do a lot of cool things using memcached, including spam control, online-offline detection of users and building scalable web services. In this post, I will demonstrate and explain how to implement fast, scalable queues in PHP.

MEMQ: Overview
Every queue is uniquely identified by its name. Let’s consider a queue named “foo” and see how MEMQ implements it inside memcached:

  • Two keys, namely foo_head and foo_tail, contain meta information about the queue
  • While queuing, the item is saved in key foo_1234, where 1234 is the current value of key foo_tail
  • While de-queuing, the item saved in key foo_123 is returned, where 123 is the current value of key foo_head
  • The values of foo_head and foo_tail start at 1; foo_head is incremented on every pop and foo_tail on every push operation
  • foo_head never exceeds foo_tail by more than one; once foo_head moves past foo_tail, the queue is considered empty
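The counter scheme above can be modelled outside memcached in a few lines. The following Python sketch is purely illustrative (a dict plays the role of the memcached store and the function names are my own), but it captures how the two meta keys and the numbered item keys interact:

```python
# Toy model of the MEMQ key scheme; a dict stands in for memcached.
# foo_tail = id of the last enqueued item, foo_head = next id to dequeue.
store = {}

def enqueue(queue, item):
    tail = store.get(queue + "_tail", 0) + 1   # allocate the next item id
    store[queue + "_tail"] = tail
    store.setdefault(queue + "_head", 1)       # head starts at 1
    store[queue + "_" + str(tail)] = item      # the item itself
    return tail

def dequeue(queue):
    head = store.get(queue + "_head", 1)
    tail = store.get(queue + "_tail", 0)
    if head > tail:                            # head past tail: queue empty
        return None
    store[queue + "_head"] = head + 1          # advance head
    return store[queue + "_" + str(head)]

enqueue("foo", "a"); enqueue("foo", "b")
assert store["foo_head"] == 1 and store["foo_tail"] == 2
assert dequeue("foo") == "a" and dequeue("foo") == "b"
assert dequeue("foo") is None
```

A real memcached deployment would use increment/add for the two meta keys so that concurrent producers and consumers each claim a distinct id; the dict model above ignores concurrency on purpose.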

MEMQ: Code
Get the source code from GitHub:
http://github.com/abhinavsingh/memq

<?php

	define('MEMQ_POOL', 'localhost:11211');
	define('MEMQ_TTL', 0);

	class MEMQ {

		private static $mem = NULL;

		private function __construct() {}

		private function __clone() {}

		private static function getInstance() {
			if(!self::$mem) self::init();
			return self::$mem;
		}

		private static function init() {
			$mem = new Memcached;
			$servers = explode(",", MEMQ_POOL);
			foreach($servers as $server) {
				list($host, $port) = explode(":", $server);
				$mem->addServer($host, $port);
			}
			self::$mem = $mem;
		}

		public static function is_empty($queue) {
			$mem = self::getInstance();
			$head = $mem->get($queue."_head");
			$tail = $mem->get($queue."_tail");

			// empty once head has moved past tail (or the meta keys don't exist yet)
			if($head > $tail || $head === FALSE || $tail === FALSE)
				return TRUE;
			else
				return FALSE;
		}

		public static function dequeue($queue, $after_id=FALSE, $till_id=FALSE) {
			$mem = self::getInstance();

			if($after_id === FALSE && $till_id === FALSE) {
				$tail = $mem->get($queue."_tail");
				if(($id = $mem->increment($queue."_head")) === FALSE)
					return FALSE;

				// item ($id - 1) exists only if ($id - 1) <= tail
				if($id <= $tail + 1) {
					return $mem->get($queue."_".($id-1));
				}
				else {
					$mem->decrement($queue."_head");
					return FALSE;
				}
			}
			else if($after_id !== FALSE && $till_id === FALSE) {
				$till_id = $mem->get($queue."_tail");
			}

			$item_keys = array();
			for($i=$after_id+1; $i<=$till_id; $i++)
				$item_keys[] = $queue."_".$i;
			$null = NULL;

			return $mem->getMulti($item_keys, $null, Memcached::GET_PRESERVE_ORDER);
		}

		public static function enqueue($queue, $item) {
			$mem = self::getInstance();

			$id = $mem->increment($queue."_tail");
			if($id === FALSE) {
				if($mem->add($queue."_tail", 1, MEMQ_TTL) === FALSE) {
					$id = $mem->increment($queue."_tail");
					if($id === FALSE)
						return FALSE;
				}
				else {
					$id = 1;
					$mem->add($queue."_head", $id, MEMQ_TTL);
				}
			}

			if($mem->add($queue."_".$id, $item, MEMQ_TTL) === FALSE)
				return FALSE;

			return $id;
		}

	}

?>

MEMQ: Usage
The class file provides three methods which can be utilized for implementing queues:

  1. MEMQ::is_empty – Returns TRUE if a queue is empty, otherwise FALSE
  2. MEMQ::enqueue – Queue up the passed item
  3. MEMQ::dequeue – De-queue an item from the queue

Specifically, MEMQ::dequeue can run in two modes depending upon the parameters passed, as described below:

  1. $queue: This is a MUST for dequeue to work. If the optional parameters are not passed, the top item from the queue is returned
  2. $after_id: If this parameter is also passed along, all items from $after_id till the end of the queue are returned
  3. $till_id: If this parameter is also passed along with $after_id, dequeue acts like a popRange function

Whenever the optional parameters are passed, MEMQ does not remove the returned items from the queue.

MEMQ: Is it working?
Add the following line of code at the end of the above class file and hit the class file from your browser. You will get back the inserted item id as the response in the browser:

var_dump(MEMQ::enqueue($_GET['q'], time()));

Let’s see what the cache keys look like in memcached:

$ telnet localhost 11211
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

get foo_head
VALUE foo_head 1 1
1
END

get foo_tail
VALUE foo_tail 1 1
2
END

get foo_1
VALUE foo_1 1 10
1265540583
END

get foo_2
VALUE foo_2 1 10
1265540585
END

MEMQ: Benchmark
Below are the benchmarking results for varying load:

  1. Queuing performance: 697.18 req/sec (n=1000, c=100) and 258.64 req/sec (n=5000, c=500)
  2. Dequeue performance: 641.27 req/sec (n=1000, c=100) and 242.87 req/sec (n=5000, c=500)

MEMQ: Why and other alternatives
There are several open source alternatives which provide a lot more scalability. However, MEMQ was written because my application doesn’t expect load in the order of 10,000 hits/sec. Listed below are a few open source alternatives for applications expecting high load:

  1. ActiveMQ: A reliable and fast solution under the Apache foundation
  2. RabbitMQ: Another reliable solution, based on the AMQP protocol
  3. Memcacheq: A mash-up of two very stable stacks, namely memcached and BerkeleyDB. However, its installation is a bit tricky.

MEMQ: Mantra and Customization
At its base, the MEMQ implementation can be visualized as follows:

There is a race between the two meta keys in memcached (foo_head and foo_tail): foo_head is incremented on every dequeue and foo_tail on every enqueue. However, foo_tail always stays ahead and never lets foo_head overtake it by more than one. Once foo_head moves past foo_tail, the queue is considered empty.

The above code file still doesn’t include utility methods like MEMQ::total_items etc. However, writing such methods should be pretty easy depending upon your application’s needs. Depending upon your requirements, you should also take care of overflowing integer values.

WordPress style “Duplicate comment detected” using Memcached and PHP

If you have a knack of leaving comments on blogs, chances are you might have experienced a wordpress error page saying “Duplicate comment detected; it looks as though you’ve already said that!“, probably because you were not sure your comment was saved the last time and tried to re-post it. In this blog post, I will put up some sample PHP code for duplicate comment detection using Memcached, without touching the database. Towards the end, I will also discuss how the script can be modified for usage in any environment, including forums and social networking websites.

Duplicate comment detection using Memcached
Here is a PHP function called is_repetitive_comment which returns some useful value if the comment is repetitive, and FALSE otherwise.

<?php

        define('COMPRESSION', 0);
        define('SIGNATURE_TTL', 60);

        $mem = new Memcache;
        $mem->addServer("localhost", 11211);

        function is_repetitive_comment($comment, $username) { // username can be ip address for anonymous env
                                                              // for per blog/forum checks pass forum id too
                                                              // for multi-host using same memcached instance, pass hostname too
                                                              // for restricting post of same comment, don't pass username
                $comment = trim($comment);
                $args = func_get_args();
                $args[0] = $comment; // sign the trimmed comment, not the raw one
                $signature = md5(implode('', $args));

                global $mem;
                if(($value = $mem->get($signature)) !== FALSE) {
                        error_log($signature." found at ".time());
                        return $value;
                }
                else {
                        $value = array('comment' => $comment,
                                       'by' => $username,
                                       /* Other information if you may want to save */
                                      );
                        $mem->set($signature, $value, COMPRESSION, SIGNATURE_TTL);
                        error_log($signature." set at ".time());
                        return FALSE;
                }
        }

?>

Is it working?
Let’s verify that the code works and then we will dig into it:

  • Save the sample code in a file, name it index.php
  • Towards the end of the script add the following 3 lines of code:
            var_dump(is_repetitive_comment("User Comment", "username"));
            sleep(5); // Simulating the case when a user might try to post the same comment again knowingly or unknowingly
                      // Similar kind of check is done in wordpress comment submission (though without memcached)
            var_dump(is_repetitive_comment("User Comment", "username"));
  • Run from command line:
    sabhinav$ php index.php
    6105b67d969642fe9e27bc052f29e259 set at 1262393877
    bool(false)
    6105b67d969642fe9e27bc052f29e259 found at 1262393882
    array(2) {
      ["comment"]=>
      string(12) "User Comment"
      ["by"]=>
      string(8) "username"
    }
  • As seen, is_repetitive_comment returns bool(false) the first time. However, after 5 seconds, when the same comment is submitted again, it throws back some useful information from the previous submission.

Working of is_repetitive_comment
Here is in brief, how memcached is used for duplicate comment detection by the script:

  • SIGNATURE_TTL defines the time limit between two similar comment submissions. Default set to 60 seconds
  • is_repetitive_comment takes two parameters, namely the comment itself and the username of the user trying to post the comment
  • The function creates a signature by combining the passed parameters and checks whether a key=$signature exists in memcache
  • If the key is found, it means the same user has posted the same comment within the past SIGNATURE_TTL i.e. 60 seconds. The function simply returns back the value set for the key in memcache
  • However, if the key is NOT found, the user is allowed to post the comment by returning FALSE. The function also sets a key=$signature in memcache

The value of key=$signature depends upon your application and use case. You might want to save some useful parameters so that you can show an appropriate error message without hitting the database for anything.
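The signature check itself is independent of PHP and memcached. Here is a small, self-contained Python sketch of the same idea; the dict-based store and the explicit now parameter are my own illustrative substitutes for memcached and its native TTL handling:

```python
import hashlib
import time

SIGNATURE_TTL = 60   # seconds between two identical submissions
store = {}           # signature -> (expires_at, saved_value); memcached stand-in

def is_repetitive_comment(comment, username, now=None):
    """Return the saved value if this (comment, user) pair was seen
    within SIGNATURE_TTL, otherwise record it and return False."""
    now = time.time() if now is None else now
    signature = hashlib.md5((comment.strip() + username).encode()).hexdigest()
    entry = store.get(signature)
    if entry is not None and entry[0] > now:   # seen and not yet expired
        return entry[1]
    value = {"comment": comment.strip(), "by": username}
    store[signature] = (now + SIGNATURE_TTL, value)
    return False

assert is_repetitive_comment("User Comment", "username", now=0) is False
assert is_repetitive_comment("User Comment", "username", now=5) == \
    {"comment": "User Comment", "by": username if False else "username"}
assert is_repetitive_comment("User Comment", "username", now=61) is False  # TTL over
```

With real memcached, the expiry bookkeeping disappears: the set call simply passes SIGNATURE_TTL and the server evicts the key on its own.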

Extracting more from the sample script
Here is how you can modify the above sample script for various environments:

  • If you are performing the repetitive comment check in an anonymous environment, i.e. commenters may not be registered users, you can pass the commenter’s ip address instead of the username
  • If you serve multiple sites out of the same box and all share the same memcached instance, you SHOULD also pass the site’s root url to the function. Otherwise you might end up showing error messages to the wrong users
  • If you want to restrict submission of the same comment per blog or forum, also pass the blog id to the function
  • If you want to simply restrict submission of the same comment throughout your site, pass only the comment to the function

Let me know if you do similar tiny little hacks using memcached 😀

How to build a custom static file serving HTTP server using Libevent in C

Libevent is an event notification library which lays the foundation for immensely successful open source projects like Memcached. As the web advances into real time, more and more websites are using a mix of technologies like HTTP Pub-Sub, HTTP long-polling and Comet, with custom lightweight HTTP servers in the backend, to create a real time user experience. In this blog post, I will start with the necessary prerequisites for setting up the development environment. Then, I will demonstrate how to build an HTTP server capable of serving static pages. Finally, I will put up a few use cases of a custom HTTP server in today’s world.

Setting up Environment
Follow these steps to install the latest version of libevent (version 2.0.3-alpha):

  • $ wget http://www.monkey.org/~provos/libevent-2.0.3-alpha.tar.gz
  • $ tar -xvzf libevent-2.0.3-alpha.tar.gz
  • $ cd libevent-2.0.3-alpha
  • $ ./configure
  • $ make
  • $ sudo make install

Check the environment by running the following piece of C code (event2.cpp):

#include <stdio.h>
#include <event2/event.h>

int main(int argc, char **argv) {
	const char *version;
	version = event_get_version();
	printf("%s\n", version);
	return 0;
}

Compile and run as following:

$ g++ -arch x86_64 -Wall -levent event2.cpp -o event2
$ ./event2
2.0.3-alpha

I had to pass -arch x86_64 flags on Mac OSX 10.5.8. This can vary depending upon your operating system.

Libsrvr: Static file serving HTTP Server
Below is the C code for a static file serving HTTP server using libevent called “Libsrvr”:

libsrvr.h

// General purpose header files
#include <iostream>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <getopt.h>
#include <sys/stat.h>

// Libevent header files
#include </usr/local/include/event2/event.h>
#include </usr/local/include/event2/http.h>
#include </usr/local/include/event2/buffer.h>

// Libsrvr configuration settings
#define LIBSRVR_SIGNATURE "libsrvr v 0.0.1"
#define LIBSRVR_HTDOCS "/Users/sabhinav/libsrvr/www"
#define LIBSRVR_INDEX "/index.html"

// Libsrvr http server and base struct
struct evhttp *libsrvr;
struct event_base *libbase;

// Libsrvr options
struct _options {
	int port;
	const char *address;
	int verbose;
} options;

  • LIBSRVR_SIGNATURE is the server signature sent as a response header for all incoming requests
  • LIBSRVR_HTDOCS is the path to the DocumentRoot for libsrvr
  • LIBSRVR_INDEX is similar to the DirectoryIndex directive of apache

libsrvr.cpp

#include "libsrvr.h"

void router(struct evhttp_request *r, void *arg) {
	const char *uri = evhttp_request_get_uri(r);

	char *static_file = (char *) malloc(strlen(LIBSRVR_HTDOCS) + strlen(uri) + strlen(LIBSRVR_INDEX) + 1);
	stpcpy(stpcpy(static_file, LIBSRVR_HTDOCS), uri);

	bool file_exists = true;
	struct stat st;
	if(stat(static_file, &st) == -1) {
		file_exists = false;
		evhttp_send_error(r, HTTP_NOTFOUND, "NOTFOUND");
	}
	else {
		if(S_ISDIR(st.st_mode)) {
			strcat(static_file, LIBSRVR_INDEX);

			if(stat(static_file, &st) == -1) {
				file_exists = false;
				evhttp_send_error(r, HTTP_NOTFOUND, "NOTFOUND");
			}
		}
	}

	if(file_exists) {
		int file_size = st.st_size;

		char *html;
		html = (char *) alloca(file_size);

		if(file_size != 0) {
			FILE *fp = fopen(static_file, "r");
			fread(html, 1, file_size, fp);
			fclose(fp);
		}

		struct evbuffer *buffer;
		buffer = evbuffer_new();

		struct evkeyvalq *headers = evhttp_request_get_output_headers(r);
		evhttp_add_header(headers, "Content-Type", "text/html; charset=UTF-8");
		evhttp_add_header(headers, "Server", LIBSRVR_SIGNATURE);

		// html is not NUL-terminated; add it by length instead of printf-style "%s"
		evbuffer_add(buffer, html, file_size);
		evhttp_send_reply(r, HTTP_OK, "OK", buffer);
		evbuffer_free(buffer);

		if(options.verbose) fprintf(stderr, "%s\t%d\n", static_file, file_size);
	}
	else {
		if(options.verbose) fprintf(stderr, "%s\t%s\n", static_file, "404 Not Found");
	}

	free(static_file);
}

int main(int argc, char **argv) {
	int opt;

	options.port = 4080;
	options.address = "0.0.0.0";
	options.verbose = 0;

	while((opt = getopt(argc,argv,"p:vh")) != -1) {
		switch(opt) {
			case 'p':
				options.port = atoi(optarg);
				break;
			case 'v':
				options.verbose = 1;
				break;
			case 'h':
				printf("Usage: ./libsrvr -p port -v[erbose] -h[elp]\n");
				exit(1);
		}
	}

	libbase = event_base_new();
	libsrvr = evhttp_new(libbase);
	evhttp_bind_socket(libsrvr, options.address, options.port);
	evhttp_set_gencb(libsrvr, router, NULL);
	event_base_dispatch(libbase);

	return 0;
}

Here is some explanation for the above code:

  • Command line options are parsed using the GNU getopt library
  • libbase is the event base for the HTTP server libsrvr
  • The HTTP server is bound to port 4080 (by default)
  • A callback is registered for every incoming HTTP request to libsrvr: the function router is invoked each time an HTTP request is received
  • Finally libbase is dispatched and the code never reaches return 0

The working of the router function is as follows:

  • The incoming request uri is converted to an absolute file path on the system
  • Checks for file or directory existence are performed
  • If the absolute path is a directory, LIBSRVR_INDEX is served out of that directory
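The router’s path resolution is independent of libevent and easy to sketch on its own. The Python below is a stand-alone illustration (the resolve helper and INDEX constant are my own names), exercised against a throwaway document root:

```python
import os
import tempfile

INDEX = "index.html"   # plays the role of LIBSRVR_INDEX

def resolve(htdocs, uri):
    """Map a request uri to a file path, falling back to the directory
    index; return None when nothing exists (the 404 case)."""
    path = htdocs + uri                  # absolute path on disk
    if os.path.isdir(path):              # directory -> serve its index file
        path = os.path.join(path, INDEX)
    return path if os.path.isfile(path) else None

# Build a throwaway document root to exercise the three cases.
root = tempfile.mkdtemp()
with open(os.path.join(root, INDEX), "w") as f:
    f.write("<html>hello</html>")

assert resolve(root, "/").endswith(INDEX)           # directory -> its index
assert resolve(root, "/" + INDEX).endswith(INDEX)   # direct file hit
assert resolve(root, "/missing.html") is None       # 404
```

A production server would additionally canonicalize the path and reject ".." components, which the C code above (like this sketch) does not do.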

Launching Libsrvr:
Compile and run the libsrvr as follows:

$ g++ -arch x86_64 -Wall -levent libsrvr.cpp -o libsrvr
$ ./libsrvr -v
/Users/sabhinav/libsrvr/www//index.html	538
/Users/sabhinav/libsrvr/www/assets/style.css	35
/Users/sabhinav/libsrvr/www/assets/script.js	27
/Users/sabhinav/libsrvr/www/dummy	404 Not Found
/Users/sabhinav/libsrvr/www/index.html	538
/Users/sabhinav/libsrvr/www/assets/style.css	35

If started under verbose mode (-v), libsrvr will output each requested file path on the console as shown above.

Use cases
Below are a few use cases of a custom HTTP server as seen in web today:

  • Facebook Chat: Uses a custom http server based on mochiweb framework
  • Yahoo finance: Uses a custom http streaming server based on libevent

Generally, iframe technique is combined with javascript hacks for streaming data from the custom http servers. Read “How to make cross-sub-domain ajax (XHR) requests using mod_proxy and iframes” for details.

Conclusion
Though a static file server finds little place in today’s world, the idea was to show the ease with which you can create your own lightweight, fast and scalable HTTP server (all thanks to Niels Provos for his libevent). Couple libsrvr with memcached for caching static files, and benchmarks show libsrvr handling over 10,000 req/sec.

Share if you like it and also let me know your thoughts through comments.

How to use locks for assuring atomic operation in Memcached?

Memcached provides atomic increment and decrement commands to manipulate integer (key, value) pairs. However, special care should be taken to ensure application performance and avoid race conditions while using memcached. In this blog post, I will first build a facebook style “like” application using the atomic increment command of memcached. I will also discuss the various technical difficulties one would face while ensuring atomicity in this application. Finally, I will demo how to ensure atomicity of a requested operation using custom locks in memcached.

Where should I care about it?
Let’s consider a sample application as depicted by the flow diagram below:
Facebook style "like" demo architecture using "memcached"

The above application is similar to the facebook “like” feature. In brief, we maintain a key per post, e.g. $key="post_id_1234_likes_count", storing the count of users who liked this post. Another key, $key="post_id_1234_user_id_9999", stores user_id_9999’s relationship with post_id_1234: for example, “liked”, which is set to 1 if liked, and “timestamp”, which is the time when the user liked this post.

Since this application is going to reside on a high traffic website, an early design decision was made to put memcached in front of the MySQL database as the primary storage medium, with periodic syncs to the database. For me, the like/dislike functionality is not as important as the other social functionality on my website.

Here is a sample code for the above functionality:

	$mem = new Memcache;
	$mem->addServer("127.0.0.1", 11211);

	function incrementLikeCount($post_id) {
		global $mem;

		// prepare post key
		$key = "post_id_".$post_id."_likes_count";

		// get old count
		$old_count = $mem->get($key);

		// false means no one liked this post before
		if($old_count === FALSE) $old_count = 0;

		// increment count
		$new_count = $old_count+1;

		// set new count value
		if($mem->set($key, $new_count, 0, 0)) {
			error_log("Incremented key ".$key." to ".$new_count);
			return TRUE;
		}
		else {
			error_log("Error occurred in incrementing key ".$key);
			return FALSE;
		}
	}

	// get incoming parameters
	$post_id = $_GET['post_id'];

	// take action
	incrementLikeCount($post_id);

Why should I care about it?
Save the above code sample in a file called memcached_no_lock.php and hit the url http://localhost/memcached_no_lock.php?post_id=1234 five times. Verify the key value in memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 2
5
END

Alright, application seems to give expected results. Next, lets verify this application for high traffic websites using apache benchmark:

centurydaily-lm:Workspace sabhinav$ ab -n 100 -c 10 http://localhost/memcached_no_lock.php?post_id=1234
Concurrency Level:      10
Time taken for tests:   0.090 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      22200 bytes
HTML transferred:       0 bytes
Requests per second:    1112.03 [#/sec] (mean)

Verify the key value in memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 2
36
END

What happened? We expected the value of $key="post_id_1234_likes_count" to reach 100, but it is actually 36. What went wrong? The behavior can be explained by simply looking at the Apache error log:

[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 1
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 1
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 1
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 2
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 2
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 3
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 3
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 3

So the log confirms that concurrency broke our application: more than one incoming request read the same old value and incremented $key to the same new value, losing updates along the way.
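The lost update can be reproduced without any server at all. Here is a purely illustrative sketch, with a plain PHP array standing in for memcached, that interleaves two "requests" the same way the log above shows:

```php
<?php
// A plain array standing in for memcached, to illustrate the race.
$fake_cache = array();

function cache_get($key) {
	global $fake_cache;
	return isset($fake_cache[$key]) ? $fake_cache[$key] : FALSE;
}

function cache_set($key, $value) {
	global $fake_cache;
	$fake_cache[$key] = $value;
}

$key = "post_id_1234_likes_count";
cache_set($key, 3);

// Two concurrent requests both read the value BEFORE either writes...
$old_a = cache_get($key); // request A reads 3
$old_b = cache_get($key); // request B also reads 3

// ...so both compute the same "new" count and one increment is lost.
cache_set($key, $old_a + 1); // A writes 4
cache_set($key, $old_b + 1); // B also writes 4, not 5

echo cache_get($key); // 4 -- one like vanished
```

This get-then-set window is exactly where memcached_no_lock.php loses likes under ab.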

How should I take care of this?
Below is a modified code sample which gives us atomic increments:

	$mem = new Memcache;
	$mem->addServer("127.0.0.1", 11211);

	function incrementLikeCount($post_id) {
		global $mem;

		// prepare post key
		$key = "post_id_".$post_id."_likes_count";

		$new_count = $mem->increment($key, 1);
		if($new_count === FALSE) {
			$new_count = $mem->add($key, 1, 0, 0);
			if($new_count === FALSE) {
				error_log("Someone raced us for first count on key ".$key);
				$new_count = $mem->increment($key, 1);
				if($new_count === FALSE) {
					error_log("Unable to increment key ".$key);
					return FALSE;
				}
				else {
					error_log("Incremented key ".$key." to ".$new_count);
					return TRUE;
				}
			}
			else {
				error_log("Initialized key ".$key." to ".$new_count);
				return TRUE;
			}
		}
		else {
			error_log("Incremented key ".$key." to ".$new_count);
			return TRUE;
		}

	}

	// get incoming parameters
	$post_id = $_GET['post_id'];

	// take action
	incrementLikeCount($post_id);

To ensure atomicity, we start by incrementing $key="post_id_1234_likes_count". Since memcached's increment() is itself atomic, we don't need any locking mechanism here. However, increment() returns FALSE if $key doesn't already exist.

Hence, if we get a FALSE response from the first increment, we try to initialize $key using memcached's add() command. The good thing about add() is that it returns FALSE if $key is already present. So if more than one thread tries to initialize $key, only one of them succeeds; all the other threads get FALSE back from add(). Finally, if the response from add() is FALSE, we know someone else initialized the key first, so we simply increment $key again.
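To see why this increment, add, increment dance is race free, it helps to model the two semantics that matter: increment() fails when the key is missing, and add() fails when the key already exists. The sketch below is illustrative only, with a plain array in place of a real Memcache connection (the real add() returns TRUE/FALSE; returning the value here keeps the sketch short):

```php
<?php
// Minimal model of the memcached semantics the fallback logic relies on.
$store = array();

// increment() fails (returns FALSE) when the key does not exist.
function model_increment($key) {
	global $store;
	if (!isset($store[$key])) return FALSE;
	return ++$store[$key];
}

// add() fails (returns FALSE) when the key already exists --
// that is what makes it a one-winner initializer.
function model_add($key, $value) {
	global $store;
	if (isset($store[$key])) return FALSE;
	$store[$key] = $value;
	return $value;
}

function likeCount($key) {
	$n = model_increment($key);
	if ($n === FALSE) {
		$n = model_add($key, 1); // first liker initializes the counter
		if ($n === FALSE) {
			$n = model_increment($key); // lost the init race, retry
		}
	}
	return $n;
}

echo likeCount("post_id_1234_likes_count"); // 1 (initialized via add)
echo likeCount("post_id_1234_likes_count"); // 2 (plain increment)
```

Whichever thread wins the add() becomes count 1; every loser falls through to an increment that is now guaranteed to succeed, so no update is ever lost.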

Let's test this modified code with Apache Benchmark. This time we will also increase concurrency from 10 to 100 threads. Save the modified code in a file called memcached_lock.php and issue the following ab command:

centurydaily-lm:Workspace sabhinav$ ab -n 10000 -c 100 http://localhost/memcached_lock.php?post_id=1234
Concurrency Level:      100
Time taken for tests:   11.006 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      2224884 bytes
HTML transferred:       0 bytes
Requests per second:    908.61 [#/sec] (mean)

Lets verify the key value inside memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 5
10000
END

Bingo! As desired we have a value of 10000 for $key inside memcached.

Using custom locks for atomicity:
There are many situations where you SHOULD process a request atomically using locks, for example while fetching a query result from the database, or while regenerating a requested page template in your custom template caching engine.

In the example below, I modify the memcached_lock.php script to ensure atomic increments without using the increment command. Instead, I build custom locks on top of memcached:

	$mem = new Memcache;
	$mem->addServer("127.0.0.1", 11211);

	function incrementLikeCount($post_id) {
		global $mem;

		// prepare post key
		$key = "post_id_".$post_id."_likes_count";

		// initialize lock
		$lock = FALSE;

		// initialize configurable parameters
		$tries = 0;
		$max_tries = 1000;
		$lock_ttl = 10;

		$new_count = $mem->get($key); // fetch the old value once
		while($lock === FALSE && $tries < $max_tries) {
			if($new_count === FALSE) $new_count = 0;
			$new_count = $new_count + 1;

			// add() will return FALSE if someone raced us for this lock.
			// ALWAYS USE add() FOR CUSTOM LOCKS: set() would let two threads
			// "acquire" the same lock. The lock key includes $key so that
			// different posts never contend for the same lock.
			$lock = $mem->add("lock_".$key."_".$new_count, 1, 0, $lock_ttl);

			$tries++;
			usleep(100*($tries%($max_tries/10))); // growing back-off between retries
		}

		if($lock === FALSE && $tries >= $max_tries) {
			error_log("Unable to increment key ".$key);
			return FALSE;
		}
		else {
	    	$mem->set($key, $new_count, 0, 0);
			error_log("Incremented key ".$key." to ".$new_count);
			return TRUE;
		}

	}

	// get incoming parameters
	$post_id = $_GET['post_id'];

	// take action
	incrementLikeCount($post_id);

Try testing it with Apache Benchmark as above and then verify the value with memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 3
100
END

Benchmarks:
We see a drop in performance from 1112 hits/sec (memcached_no_lock) to 908 hits/sec (memcached_lock using increment), but this is mostly down to the increased concurrency: at the same concurrency level of 10, the thread-protected code benchmarked at 1128 hits/sec, essentially no penalty. The custom lock code above, however, benchmarked at only 275 hits/sec.

Conclusion:
Always use memcached increment/decrement when dealing with counters on integer valued keys. For locking an arbitrary critical section, use custom locks built on the memcached add command, as demonstrated above. Custom locks also come with tunables such as $max_tries, $lock_ttl and the back-off delay.

Hope you enjoyed reading.
Do let me know through your comments.

Memcached and “N” things you can do with it – Part 1

In my last post, MySQL Query Cache, WP-Cache, APC, Memcache – What to choose, I briefly discussed four caching technologies which you might have used, knowingly or unknowingly.

Towards the end we came to the conclusion that memcached is the best caching solution when you are looking for speed and hits per second. In my experience, memcached is capable of handling more than 100 million PVs per month without any problem. However, I also discussed why memcached is unreliable and insecure.

In this post I will dig a level deeper into memcached. For ease of navigation, here is the table of contents:

  1. Basics: Memcached – Revisited
  2. Code Sample: A memcached class and how to use it
  3. N things: What else can I do with memcached
  4. Case Study: How Facebook uses memcached
  5. DON’Ts: A few things to take care of

Basics: Memcached – Revisited
Memcached was developed by Brad Fitzpatrick when LiveJournal was serving more than 20 million PVs per day. Handling 20 million PVs was no joke, and he needed a better solution for such high traffic. Since most blog posts don't change once published, he thought of a model where he could skip the database read for the same posts again and again, or at least reduce the number of database reads. And hence came memcached. Memcached is a daemon, which you can think of as a process running in the background doing its job.

If you are using Ubuntu or Debian like me, here are the steps for installing memcached:

  1. sudo apt-get install php5-memcache
  2. Go to /etc/php5/conf.d/memcache.ini and uncomment the line ;extension=memcache.so to enable this module
  3. sudo pecl install memcache
  4. Go to php.ini file and add this line: extension=memcache.so
  5. sudo apt-get install memcached
  6. sudo /etc/init.d/memcached start
  7. Restart Apache

If you are on windows, here are the steps which will help you running memcached on windows machine:

  1. Download the memcached win32 binary from my code.google vault.
  2. Unzip the downloaded file under C:\memcached
  3. Since we need to install memcached as a service, run this from the command line: C:\memcached\memcached.exe -d install
  4. Memcached by default starts with 64MB of memory, which is just not sufficient. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\memcached Server in your registry, find the ImagePath entry and change it to "C:\memcached\memcached.exe" -d runservice -m 512
  5. Start the server by running this command: C:\memcached\memcached.exe -d start
  6. Open the folder C:\PHP\ext and check for php_memcache.dll. If you are unlucky enough not to have it, download it from here for PHP-4.x and from here for PHP-5.x
  7. Add extension=php_memcache.dll in your php.ini file
  8. Restart Apache
  9. Download the full package consisting of exe's, classes and dll's from here.

A few other options which you can use when starting memcached:
memcached -d -m 2048 -l 10.0.0.40 -p 11211. This starts memcached as a daemon, using 2GB of memory, and listening on IP 10.0.0.40, port 11211. Port 11211 is the default port for memcached.

By now I assume you have a 🙂 on your face, because you have the memcached daemon running on your machine. Windows users can check by opening Task Manager and looking for a memcached.exe process. I don't see any reason for you not to be smiling, but if you are that unlucky Windows user, please leave that system and move to a Unix machine. At least try running Ubuntu on Windows by reading my post How to configure Ubuntu and LAMP on Windows.

Code Sample: Memcached class
So we have memcached set up on our system. Now we will hook up some simple code with it, which will do all the necessary talking with the daemon. Below I demonstrate a very basic and simple application to help you get started. The application consists of five files: database.class.php, memcache.class.php, log.class.php, index.php and memcache.extendedstats.php.

log.class.php
This is a very basic logger class which will log everything for you to inspect later. Here is the code:

  class logAPI {
    var $LogFile = "log.txt";

    function Logger($Log) {
      $fh = fopen($this->LogFile,"a");
      fwrite($fh,$Log);
      fclose($fh);
    }
  }

Whenever the code connects to the memcached daemon or the database, or fails to connect to either of them, this class will log an appropriate message in log.txt.

database.class.php
This is another basic database class which has methods getData() and setData(). getData() is used to retrieve rows from the database, while setData() is used to update or insert rows. For this demo application we will only use the getData() method. Here is the code:

  include_once("memcache.class.php");
  include_once("log.class.php");

  class databaseAPI {

    /************************/
    /** Database information **/
    /************************/
    var $dbhost = "localhost";
    var $dbuser = "root";
    var $dbpass = "";
    var $dbname = "gtalkbots";
    var $db = NULL;

    /******************************/
    // CONSTRUCTOR DEFINITION //
    /******************************/
    function __construct() {
      $this->connect();
    }

    /*************************************************/
    // Function establishes a connection to database //
    /*************************************************/
    function connect() {
      // Connect to the dbhost
      $connection = mysql_connect($this->dbhost,$this->dbuser,$this->dbpass) or die(mysql_error());

      // If the connection fails, report it and stop
      if(!$connection) {
        echo "Failed to establish a connection to host";
        exit;
      }
      else {
        // Select dbname
        $database = @mysql_select_db($this->dbname);

        // If selecting the database fails, report it and stop
        if(!$database) {
          echo "Failed to establish a connection to database";
          exit;
        }
        else {
          $this->db = $connection;
        }
      }
    }

    /*******************************************/
    // Function closes the database connection //
    /*******************************************/
    function close() {
      mysql_close($this->db);
    }

    /***********************************************************/
    // Function executes the query against database and returns the result set   //
    // Result returned is in associative array format, and then frees the result //
    /***********************************************************/
    function getData($query,$options=array("type"=>"array","cache"=>"on"),$resultset="") {
      $obj = null;
      // Look up the query on the memcache server if caching is on
      if($options['cache'] == "on") {
        $obj = new memcacheAPI();
        if($obj->connect()) {
          // Try to fetch from the memcache server if present
          $resultset = $obj->getCache($query);
        }
      }
      // If $resultset == "" i.e. either caching is off or the memcache server is down,
      // OR $resultset == null i.e. $query is not cached, hit the database
      if($resultset == "" || $resultset == null) {
        $result = mysql_query($query,$this->db);
        if($result) {
          if($options['type'] == "original") {
            // Return the original result resource, if the passed options request it
            $resultset = $result;
          }
          else if($options['type'] == "array") {
            // Return the associative array and number of rows
            $mysql_num_rows = mysql_num_rows($result);
            $result_arr = array();
            while($info = mysql_fetch_assoc($result)) {
              array_push($result_arr,$info);
            }
            $resultset = array("mysql_num_rows" => $mysql_num_rows,"result" => $result_arr,"false_query" => "no");
            // Free the memory held by the result resource
            mysql_free_result($result);
          }
          // Cache the $query and $resultset, if we have a cache connection
          if($obj) {
            $obj->setCache($query,$resultset);
            $obj->close();
          }
          return $resultset;
        }
        else {
          $resultset = array("false_query" => "yes");
          return $resultset;
        }
      }
      else {
        // $query was found in the cache, simply return it
        $obj->close();
        return $resultset;
      }
    }

    /**********************************************************/
    // Function executes the query against database (INSERT, UPDATE) types   //
    /************************************************************/
    function setData($query) {
      // Run the query
      $result = mysql_query($query,$this->db);
      // Return the result
      return array('result'=>$result,'mysql_affected_rows'=>mysql_affected_rows());
    }

  }

memcache.class.php
The memcache class consists of two main methods: getCache() and setCache(). getCache() looks up the (key,value) pair in memory; if it exists, the method unserializes it and returns it. setCache() is used to set a (key,value) pair in memory: it accepts the key and value, and serializes the value before storing it in the cache.

  class memcacheAPI {

	/* Define the class constructor */
	function __construct() {
	  $this->connection = new Memcache;
	  $this->log = new logAPI();
	  $this->date = date('Y-m-d H:i:s');
	  $this->log->Logger("[[".$this->date."]] "."New Instance Created\n");
	}

	/* connect() connects to the Memcache server */
	/* returns TRUE if connection established */
	/* returns FALSE if connection failed */
	function connect() {
	  $memHost = "localhost";
	  $memPort = 11211;
	  if($this->connection->connect($memHost,$memPort)) {
	    $this->log->Logger("[[".$this->date."]] "."Connection established with memcache server\n");
	    return TRUE;
	  }
	  else {
	    $this->log->Logger("[[".$this->date."]] "."Connection failed to establish with memcache server\n");
	    return FALSE;
	  }
	}

	/* close() will disconnect from the Memcache server */
	function close() {
	  if($this->connection->close()) {
	    $this->log->Logger("[[".$this->date."]] "."Connection closed with memcache server\n");
	    $this->log->Logger("=================================================================================================\n\n");
	    return TRUE;
	  }
	  else {
	    $this->log->Logger("[[".$this->date."]] "."Connection didn't close with memcache server\n");
	    $this->log->Logger("=================================================================================================\n\n");
	    return FALSE;
	  }
	}

	/* getCache() fetches the cached resultset for the passed $query */
	/* the returned resultset is null if $query is not found in cache */
	function getCache($query) {
	  /* Generate the key corresponding to the query */
	  $key = base64_encode($query);
	  /* Get the resultset from cache */
	  $resultset = $this->connection->get($key);
	  /* Unserialize the result if found in cache */
	  if($resultset != null) {
	    $this->log->Logger("[[".$this->date."]] "."Query ".$query." was found already cached\n");
	    $resultset = unserialize($resultset);
	  }
	  else {
	    $this->log->Logger("[[".$this->date."]] "."Query ".$query." was not found cached in memcache server\n");
	  }
	  return $resultset;
	}

	/* setCache() sets the serialized resultset on the Memcache server */
	function setCache($query,$resultset,$useCompression=0,$ttl=600) {
	  /* Generate the key corresponding to the query */
	  $key = base64_encode($query);
	  /* Serialize the value before storing */
	  $resultset = serialize($resultset);
	  if($this->connection->set($key,$resultset,$useCompression,$ttl)) {
	    $this->log->Logger("[[".$this->date."]] "."Query ".$query." was cached\n");
	    return TRUE;
	  }
	  else {
	    $this->log->Logger("[[".$this->date."]] "."Query ".$query." was not able to be cached\n");
	    return FALSE;
	  }
	}
  }
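The key scheme used by getCache()/setCache() is easy to verify in isolation: the query text becomes the key via base64_encode(), and the resultset survives the serialize()/unserialize() round trip unchanged:

```php
<?php
$query = "SELECT * from status LIMIT 0,1000";

// The cache key is just the base64 of the query text.
$key = base64_encode($query);
echo $key . "\n";

// Resultsets are stored serialized, and come back identical.
$resultset = array(
	"mysql_num_rows" => 2,
	"result" => array(array("id" => 1), array("id" => 2)),
	"false_query" => "no",
);
$restored = unserialize(serialize($resultset));

var_dump($restored === $resultset); // bool(true)
```

base64 keeps the key free of spaces and control characters, which memcached keys must not contain.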

index.php
With everything in place, it's time to test memcached. We will check that memcached is working by running this script twice in a row. Open a command line, point it at this code, and run php index.php. Then run php index.php once more.

  include_once("database.class.php");
  $mdb = new databaseAPI();

  $query = "SELECT * from status LIMIT 0,1000";
  $resultset = $mdb->getData($query);
  $mdb->close();

  echo "<pre>";
  print_r($resultset);
  echo "</pre>";

If everything is working fine, you will see a log.txt file generated which looks as follows.

log.txt

[[2009-01-18 09:52:57]] New Instance Created
[[2009-01-18 09:52:57]] Connection established with memcache server
[[2009-01-18 09:52:57]] Query SELECT * from status LIMIT 0,1000 was not found cached in memcache server
[[2009-01-18 09:52:57]] Query SELECT * from status LIMIT 0,1000 was cached
[[2009-01-18 09:52:57]] Connection closed with memcache server
=================================================================================================
[[2009-01-18 09:53:08]] New Instance Created
[[2009-01-18 09:53:08]] Connection established with memcache server
[[2009-01-18 09:53:08]] Query SELECT * from status LIMIT 0,1000 was found already cached
[[2009-01-18 09:53:08]] Connection closed with memcache server
=================================================================================================

From the log file we can see that the first time the results were fetched from the database, and the second time from memcached 🙂

Before we proceed further, let's walk through the flow of the above scripts. In index.php we create a new instance $mdb of database.class.php. Then we query for 1000 rows from the database; $mdb->getData($query) initiates this fetch. As control reaches the getData() method of database.class.php, it passes control to the getCache() method of memcache.class.php. There the code creates $key = base64_encode($query) and checks whether we have the resultset cached in memcached. If it doesn't exist, control passes back to getData() of database.class.php, which fetches the rows from the database. After the fetch, it passes the $resultset to the setCache() method of memcache.class.php, which serializes $resultset and caches it as ($key,$value) = (base64_encode($query), serialize($resultset)) in memcache.

The next time the same query is fired and control goes to the getCache() method of memcache.class.php, it fetches the result from cache, unserializes it and returns the result to getData() of database.class.php. And that's why you see a log like the one above.

memcache.extendedstats.php
Finally it’s time to see some statistics. Here is a simple file which will show memcache status:

  $memcache_obj = new Memcache;
  $memcache_obj->addServer('localhost', 11211);

  $stats = $memcache_obj->getExtendedStats();

  echo "<pre>";
  print_r($stats);
  echo "</pre>";

Running it from the command line using php memcache.extendedstats.php will give you a statistics array like this:

Array
(
    [localhost:11211] => Array
        (
            [pid] => 5472
            [uptime] => 17
            [time] => 1232303504
            [version] => 1.2.5
            [pointer_size] => 32
            [curr_items] => 1
            [total_items] => 1
            [bytes] => 271631
            [curr_connections] => 2
            [total_connections] => 5
            [connection_structures] => 3
            [cmd_get] => 2
            [cmd_set] => 1
            [get_hits] => 1
            [get_misses] => 1
            [evictions] => 0
            [bytes_read] => 271705
            [bytes_written] => 271614
            [limit_maxbytes] => 536870912
            [threads] => 1
        )

)

This array tells you a number of things about how your memcached daemon and caching architecture are performing. In short, here is what each variable means:

  1. pid: Process id of this server process
  2. uptime: Number of seconds this server has been running
  3. time: Current UNIX time according to the server
  4. version: Version string of this server
  5. rusage_user: Accumulated user time for this process
  6. rusage_system: Accumulated system time for this process
  7. curr_items: Current number of items stored by the server
  8. total_items: Total number of items stored by this server ever since it started
  9. bytes: Current number of bytes used by this server to store items
  10. curr_connections: Number of open connections
  11. total_connections: Total number of connections opened since the server started running
  12. connection_structures: Number of connection structures allocated by the server
  13. cmd_get: Cumulative number of retrieval requests
  14. cmd_set: Cumulative number of storage requests
  15. get_hits: Number of keys that have been requested and found present
  16. get_misses: Number of items that have been requested and not found
  17. bytes_read: Total number of bytes read by this server from network
  18. bytes_written: Total number of bytes sent by this server to network
  19. limit_maxbytes: Number of bytes this server is allowed to use for storage.

However, at this stage the figures you are probably most interested in are get_hits and get_misses. get_misses is the number of times a key was requested from memcached and not found, while get_hits is the number of times a requested key was successfully retrieved. Hence, as expected, we currently have get_misses = 1 and get_hits = 1. Try running php index.php once more and get_hits will increase by one.
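A small helper (my addition, not part of the scripts above) can turn those two counters into a cache hit ratio worth tracking over time:

```php
<?php
// Compute the cache hit ratio from one server's stats array,
// i.e. one entry of what getExtendedStats() returns.
function hitRatio($serverStats) {
	$hits   = $serverStats['get_hits'];
	$misses = $serverStats['get_misses'];
	$total  = $hits + $misses;
	return $total > 0 ? $hits / $total : 0.0;
}

// Using the sample figures from the stats dump above:
$stats = array('get_hits' => 1, 'get_misses' => 1);
echo hitRatio($stats); // 0.5
```

A persistently low ratio means your keys expire too fast, get evicted (watch the evictions counter), or are simply never asked for twice.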

N things: What else can I do
By now you have the memcached daemon running on your system and you know how to communicate with it. You also know the most basic use of memcached: caching queries to reduce load on your database. However, there is a lot more you can do with memcached.

Here I would like to present a few applications of memcached which I learnt from experience. I hope they will be enough to get you thinking differently.

  1. Restricting spammers on your site : You will often find on social networking sites like Digg, Facebook and Orkut that if you try to add many users as friends within a short span, Facebook will show you a warning, or Digg will block the action outright. Similarly, if you try to send a shout to more than 200 users on Digg, you are restricted from doing so. How would you implement this on your own site?

    Ordinary user's solution: One solution is that when user 'X' adds another user 'Y' as a friend, you check how many friends 'X' has added in the past 10 minutes. If that exceeds 20, you don't allow him to add more friends, or show him a warning message. Simple enough, and it will work fine. But what if your site has more than a million users, with hackers around the world trying to keep your servers busy? As a memcached user you should have the following solution in mind:

    Your solution: As 'X' adds 'Y' to his friend list, you set a (key,value) pair in memcached with $key = "user_x" and $TTL = 10 minutes. For the first friend added, $value = 1. As 'X' adds further friends, you simply increment the value of $key = "user_x". Once this value reaches 20 and 'X' tries to add yet another friend, your code checks $key = "user_x": if it is present and its value equals 20, you show a warning message, thereby stopping him from adding more than 20 friends within a 10 minute span. After 10 minutes, $key = "user_x" automatically expires and your code allows 'X' to add friends again. A similar solution works if you want to stop spammers from sending messages or commenting beyond a limit. Now I see confidence building in you as a memcached programmer 😀

  2. Detecting online active/inactive users : You often want a way to tell your registered users which of their friends are online. Or in a forum, you want to show the current reader who else is reading the post. I won't describe how an ordinary user would implement this, but as a memcached user your solution should be:

    Ordinary user’s solution: You don’t want to think of this.

    Your solution: As user ‘X’ visit post#12345 in the forum, not only you will fetch post data from the database, you also set two (key,value) pairs.

    $key1 = “post_12345”
    $value1 = [[comma separated list of user names]]
    $TTL1 = [[some arbitrary large value is fine]]

    $key2 = “user_x”
    $value2 = “post_12345”
    $TTL2 = 10 minutes. We assume a user to be inactive if he stays on a particular post for more than 10 minutes (tunable), and we will mark him as inactive.

    (key1,value1) is post specific data, while (key2,value2) is user specific data. Now every time a user visits post#12345, you do two things: read all the comma separated user names from $value1, then check each one's corresponding $key2 value. If the corresponding $key2 exists and equals $value2 = "post_12345", i.e. the user is on the same post and not idle, we keep that user name in the value of $key1. However, if $key2 is not present (i.e. the user has gone away), or the value of $key2 equals some other post, we remove his user name from $value1. Confused 😛? Read it twice and the picture will become clear.

    Can you think of a better solution? Please let me and others know about it. (Remember we are trying to detect only active/inactive users, which is not same as online/offline users)

  3. Building scalable web services : Another application of memcached is in building scalable web services and widgets. Gtalkbots offers a cool widget which you can put on your blog or site to show off your aggregated status messages (see it live in the right hand sidebar). While building this widget, one thing I kept in mind was: what if someone with a million hits per day puts my widget on his site? Even though Gtalkbots itself gets only a few thousand hits per day, it would crash, mainly because of this one widget being embedded on a high traffic site. So as a memcached user I did the following:

    Ordinary user’s solution: Deprecated

    Your solution: I simply cache the widget data in memcache with $TTL = 1 hour. So every time this million-hits-per-day site is accessed, loading my widget a million times a day, the data is returned from cache, saving my server from crashing. Fetch the Gtalkbots widget from here and try putting it on your site.
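The friend-limit idea from point 1 can be sketched end to end. The version below is illustrative only: it models memcached's add()/increment() behaviour and TTL expiry with a plain array plus explicit timestamps, so the flow can be followed without a running daemon:

```php
<?php
// Illustrative model: array + expiry timestamps in place of memcached.
$store = array(); // key => array(value, expires_at)

function rl_add($key, $value, $ttl, $now) {
	global $store;
	if (isset($store[$key]) && $store[$key][1] > $now) return FALSE;
	$store[$key] = array($value, $now + $ttl);
	return TRUE;
}

function rl_increment($key, $now) {
	global $store;
	if (!isset($store[$key]) || $store[$key][1] <= $now) return FALSE;
	return ++$store[$key][0];
}

// Allow at most $limit actions per $window seconds for $user.
function allowAction($user, $limit, $window, $now) {
	$key = "user_" . $user;
	$count = rl_increment($key, $now);
	if ($count === FALSE) { // first action, or the window expired
		rl_add($key, 1, $window, $now);
		$count = 1;
	}
	return $count <= $limit;
}

// 20 friend-adds go through, the 21st is refused...
for ($i = 0; $i < 20; $i++) allowAction("x", 20, 600, 100);
var_dump(allowAction("x", 20, 600, 101)); // bool(false)

// ...but once the 10 minute window expires, 'x' may add friends again.
var_dump(allowAction("x", 20, 600, 800)); // bool(true)
```

With real memcached you would drop the $now parameter entirely and let the daemon expire $key = "user_x" on its own after the TTL.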

Alright, by now you can impress your bosses with your cool memcache implementations. But wait, there is a lot more to know. I could list hundreds of memcached applications here, but the main point is to set your mind thinking as a memcached user. I personally have the habit of relating everything to memcached while designing a system, and if it suits the need, bingo!

  1. Versioning your cache keys : One disadvantage of using a cache is that if something goes wrong with your data and that buggy data gets cached, your users will keep seeing it until your cache expires or you manually clear it. Suppose you clear your cache, and then one of your engineers comes running to say the data was fine all along.

    Ordinary user’s solution: Stop it, No more plz

    Your solution: As a memcached user, I like to keep a versioning system for my caches. Nothing complex: simply prepend a version number to all your keys, i.e. $key = "user_x" becomes $key = "v1:user_x", and somewhere in your code you keep $current_cache_version = "v1". Now suppose you are told your data is buggy: while your engineers are investigating, change $current_cache_version = "v2". This keeps your old caches, which you may want to recover after the investigation, while at the same time showing new data to your users.

  2. Not-so-frequent updates of the DB for trivial data : This one is site dependent. Suppose you run a site where you are not so serious about database columns like "last_update_date" or "last_logged_in", but you still track them for analysis purposes and don't mind if they are slightly inaccurate.

    Your solution: One simple answer is to keep such trivial data in memcached, and set up a cron job that runs every 10 minutes and flushes it to the database. 🙂
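The versioning trick from point 1 fits in a couple of lines. A sketch (the helper name is my own):

```php
<?php
// Prepend the current cache version to every key. Bumping
// $current_cache_version effectively invalidates everything at once,
// while the old v1:* entries stay recoverable until they expire.
$current_cache_version = "v1";

function versionedKey($key) {
	global $current_cache_version;
	return $current_cache_version . ":" . $key;
}

echo versionedKey("user_x") . "\n"; // v1:user_x

// Bad data shipped? Flip the version and every lookup misses cleanly.
$current_cache_version = "v2";
echo versionedKey("user_x") . "\n"; // v2:user_x
```

Every cache read and write goes through versionedKey(), so a single assignment switches the whole site to a fresh namespace without deleting anything.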

I will leave you with a presentation on memcache which I gave sometime back at office. I hope it will help you gain more understanding of memcache.

Memcache


I hope after reading this you are better equipped to design scalable systems. All the best! Do leave a comment or suggestions if any. If you liked this post, do subscribe for the next post in the memcache series.

MySQL Query Cache, WP-Cache, APC, Memcache – What to choose

Hello Cache Freaks,

Ever since I changed jobs (from business intelligence to web development) and started working with my present employer, I have had the chance to work on a lot of scalable projects. From scaling my project from 20 million PVs to 100 million PVs, to developing an internal tool, the answer for every scalable application has been caching.

There are a lot of caching techniques employed by sites worldwide.

  1. WP-Cache – a file-system based caching mechanism used in WordPress
  2. APC – an opcode caching system for PHP
  3. Memcache – an in-memory caching system
  4. Query Cache – the caching mechanism built into MySQL

Here in this post I would like to pen down my experiences while working with all of these caching mechanisms: their pros and cons, what you need to take care of while working with them, and every little tidbit which comes to my mind while writing this post.

Query Cache – inbuilt cache mechanism in MySQL
Query cache is an inbuilt cache mechanism in MySQL. Basically, when you fire a query against a MySQL database, it goes through a lot of core modules: Connection Manager, Thread Manager, Connection Thread, User Authentication Module, Parse Module, Command Dispatcher, Optimizer Module, Table Manager, Query Cache Module and so on. Discussing these modules is out of the scope of this blog post. The only module we are interested in here is the Query Cache Module.

Suppose I fire a query:
$query = “Select * from users”;

MySQL fetches the result from the database the first time and caches it. The next time you fire the same query (the match is byte-for-byte, so even a difference in case or whitespace counts as a different query), it picks the result from the cache and delivers it back to you.

However, there are a few drawbacks with Query Cache:

  • If there is any write (INSERT, UPDATE or DELETE) to the users table, the cached result is invalidated.
  • Secondly, even if the result of your query is cached, MySQL has to go through a number of core modules before it gives the cached result back to you.
  • Thirdly, even if your results are cached, you still need to connect to your MySQL database, which generally has a bottleneck in the number of connections allowed.

One thing you should take care of while relying on the Query Cache is that your query must not contain any non-deterministic or frequently changing parameter. For example, if you wish to fetch the URLs of 100 photos from the database and present them in a random order every time, you might be tempted to use RAND() somewhere in your MySQL query. However, by using RAND() you ensure that the query is never served from the cache: MySQL refuses to cache queries that contain non-deterministic functions such as RAND() or NOW().

Similarly, if you have a case where you need to show data not older than 15 days and, by mistake, you build the cutoff timestamp down to second precision into your SQL string, then the query text changes every second and will never be served from the cache.
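Both pitfalls have the same cure: keep the SQL text deterministic and push the variability into application code. A minimal sketch in Python (the query strings and table names are illustrative):

```python
import random
from datetime import date, timedelta

def photos_query():
    # Deterministic SQL, so the query cache can serve it. Shuffle the
    # rows in application code instead of using ORDER BY RAND() in SQL.
    return "SELECT photo_url FROM photos LIMIT 100"

def shuffled(rows):
    rows = list(rows)
    random.shuffle(rows)
    return rows

def recent_query(today):
    # Truncate the cutoff to day granularity so the query text only
    # changes once per day, not every second.
    cutoff = today - timedelta(days=15)
    return f"SELECT * FROM posts WHERE created_at >= '{cutoff.isoformat()}'"
```

Since `recent_query` produces the same string all day long, the query cache stays warm for up to 24 hours instead of being defeated on every request.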

Hence for a scalable site with 100 million PVs, you can’t really survive on the query cache provided by MySQL alone.

WP-Cache – Caching mechanism for WordPress blogs
WP-Cache is a file-system based caching mechanism, i.e. it caches your blog posts in the form of simple text files saved on your file system. You can have a look at these cached files in the wp-content/cache folder inside your blog directory. Generally you will find two sets of files for a single blog post: one .html and another .meta file.

The .html file contains the static HTML content of your blog post. Once published, a post’s content is static, so instead of fetching its data from the database, WP-Cache serves it from the cache directory.

The .meta file contains serialized information such as Last-Modified, Content-Type and URI, which WP-Cache uses to manage cache expiry on your blog.

WP-Cache works really well. However, if the traffic on your blog starts increasing, the bottleneck becomes the maximum number of child processes Apache can create (for starters, you can think of each user connecting to your blog as one Apache process, so there is a limit on the number of users who can connect at a given time). In that case the solution is either to have multiple front-end servers with a load balancer distributing the traffic among them, or to move to a better cache solution such as a memory-based caching mechanism (e.g. Memcache). Since a memory read is far faster than a file read, a memory-based cache system is the way to go.

APC Cache – An opcode based cache for PHP
APC stands for Alternative PHP Cache. In one of my previous posts, How does PHP echo’s a “Hello World”? – Behind the scene, I talk about how PHP churns out “Hello World” for you.

PHP takes your code through various steps:

  1. Scanning – The human readable source code is turned into tokens.
  2. Parsing – Groups of tokens are collected into simple, meaningful expressions.
  3. Compilation – Expressions are translated into instructions (opcodes)
  4. Execution – Opcode stacks are processed (one opcode at a time) to perform the scripted tasks.

Opcode caches let the Zend engine perform the first three of these steps, then store the compiled form (opcodes) so that the next time a given script is requested, the stored version can be used without redoing those steps only to arrive at the same result.

However, the problem with APC is that it is not a distributed cache system. By distributed cache I mean: if you have 3 front-end servers, each one has to build and hold its own copy of the opcodes, since APC stores them in shared memory local to that server. And even with APC, PHP still has to go through the last step described above (execution).

Memcache – In memory based cache mechanism
Memcache is the solution when you talk about million PV’s on your site. It is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

For starters, Memcache is not PHP, nor Python, nor any other such thing as you may think. It’s a daemon which runs in the background on your server. Your code connects to it and caches query results, JS, CSS and other cacheable data in the server’s memory (RAM). Since it’s an in-memory caching system, it is faster than any of the caching systems discussed above. However, it is unreliable and insecure, though there are ways to tackle both of these aspects of memcache.

Memcache is unreliable because your cache resides in system memory. An event like a system reboot or power failure will wipe out your entire cache, and memcache provides no mechanism to take backups. Hence once the cache is lost, you need to warm it up programmatically.

Memcache is insecure because it doesn’t require any authentication mechanism. There is no username or password with which your code connects to it (which is partly why it is super fast, unlike the Query Cache, which has to go through the auth module even when the query result is cached). It usually runs on port 11211 on your server, and if you don’t take care (for example, by binding it to localhost or firewalling the port), anyone can telnet to port 11211 and steal your caches.

Below are the steps followed on a memcache-enabled website:

  1. A user enters your site’s URL in his browser, say http://localhost
  2. There are about 6 queries which drive your opening page
  3. Let’s assume one of the queries for this page is $query = “SELECT photo_url from photos LIMIT 0,10”
  4. When the user visits http://localhost, your code first connects to the memcache daemon on port 11211
  5. If the memcache daemon is running, the code checks whether the result of this query is already cached. Data is cached in memcache as (key, value) pairs
  6. Since this is the first visit to your site, of course no query has been cached yet. Hence your code now connects to your MySQL database and fetches the resultset, something like: $resultset = mysql_query($query);
  7. After successfully fetching the resultset, your code caches it in memcache. It connects to the memcache daemon and saves the (key, value) pair in memory, where $key = md5($query) and $value = serialize($resultset)
  8. After caching the (key, value) pair, your code returns the fetched resultset to the front page, where it is displayed to the user
  9. Similarly, all 6 queries which drive your front page are cached one by one
  10. The next time another user visits this page, i.e. http://localhost, your code first checks whether $key = md5($query) is already present in the cache. It finds the $key cached, fetches the serialized resultset from memory, unserializes it and hands it back to the front page, where it is displayed as intended
  11. While caching a (key, value) pair you can also specify a TTL (time to live), after which memcache will automatically expire your cached result
  12. Suppose you specified a TTL of 15 minutes for all the above queries, and now a visitor visits http://localhost after 30 minutes
  13. Your code again connects to the memcache daemon and checks whether $key = md5($query) is present in the cache. The memcache daemon sees that the $key is present but has expired, returns a result saying the $key is not cached, and internally flushes it out. Your code then connects to the MySQL database again, fetches the resultset and caches the results back in memcache for further use

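The whole flow above boils down to a read-through cache with lazy expiry. Here is a minimal sketch in Python, with a dict standing in for the memcache daemon and a stub standing in for MySQL (`fetch_from_db` and its made-up resultset are illustrative, not a real database call):

```python
import hashlib
import json
import time

cache = {}  # stands in for the memcache daemon: key -> (value, expires_at)

def fetch_from_db(query):
    # Stub for mysql_query(); returns a made-up resultset.
    return [{"photo_url": f"/photos/{i}.jpg"} for i in range(10)]

def cached_query(query, ttl=15 * 60, now=None):
    now = time.time() if now is None else now
    key = hashlib.md5(query.encode()).hexdigest()    # $key = md5($query)
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if now < expires_at:
            return json.loads(value)                 # cache hit
        del cache[key]                               # lazy expiry, like memcache
    resultset = fetch_from_db(query)                 # cache miss: hit MySQL
    cache[key] = (json.dumps(resultset), now + ttl)  # $value = serialize($resultset)
    return resultset
```

The `now` parameter exists only to make the TTL behaviour easy to exercise in a test; with a real memcache client you would simply pass the TTL to its set() call and let the daemon handle expiry.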
I will leave you with a presentation on memcache which I gave sometime back at office. I hope it will help you gain more understanding of memcache.

(Embedded SlideShare presentation: “Memcache”, tags: caching, memcache)

In my next posts, I will be covering a few code samples and use cases of memcache which you wouldn’t even have heard of. If you liked the post, don’t forget to promote it on social networking sites, and do subscribe to my blog. 🙂