How to use locks in PHP cron jobs to avoid cron overlaps

Cron jobs are the hidden building blocks of most websites. They are generally used to process or aggregate data in the background. However, as a website grows and every cron job has gigabytes of data to process, chances are that successive runs of a cron job will overlap and possibly corrupt our data. In this blog post, I will demonstrate how we can avoid such overlaps using a simple locking technique. I will also discuss a few edge cases we need to consider while using locks to avoid overlap.

Cron job helper class
Here is a helper class (cron.helper.php) which will help us avoid cron job overlaps (see the usage example below).

<?php

	define('LOCK_DIR', '/Users/sabhinav/Workspace/cronHelper/');
	define('LOCK_SUFFIX', '.lock');

	class cronHelper {

		private static $pid;

		function __construct() {}

		function __clone() {}

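		// Returns TRUE if the process id stored in self::$pid is still alive
		private static function isrunning() {
			$pid = trim((string) self::$pid);
			if($pid === '')
				return FALSE; // empty lock file: treat the lock as stale
			// First column of `ps -e` is the process id of every running process
			$pids = array_map('trim', explode(PHP_EOL, `ps -e | awk '{print $1}'`));
			return in_array($pid, $pids, TRUE);
		}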

		// Lock file is named after the script itself, so "php job.php" and
		// "php /path/to/job.php" share the same lock file
		private static function lockfile() {
			global $argv;
			return LOCK_DIR.basename($argv[0]).LOCK_SUFFIX;
		}

		public static function lock() {
			$lock_file = self::lockfile();

			if(file_exists($lock_file)) {
				// Lock file found: is the job that created it still running?
				self::$pid = trim(file_get_contents($lock_file));
				if(self::isrunning()) {
					error_log("==".self::$pid."== Already in progress...");
					return FALSE;
				}
				else {
					error_log("==".self::$pid."== Previous job died abruptly...");
				}
			}

			// Note: the check above and the write below are not one atomic step
			self::$pid = getmypid();
			file_put_contents($lock_file, self::$pid);
			error_log("==".self::$pid."== Lock acquired, processing the job...");
			return self::$pid;
		}

		public static function unlock() {
			$lock_file = self::lockfile();

			if(file_exists($lock_file))
				unlink($lock_file);

			error_log("==".self::$pid."== Releasing lock...");
			return TRUE;
		}

	}

?>

Using cron.helper.php
Here is how the helper class can be integrated in your current cron job code:

  • Save cron.helper.php in a folder called cronHelper
  • Update LOCK_DIR as per your needs
  • You might have to set proper permissions on the cronHelper folder, so that the running cron job has write permission
  • Wrap your cron job code as shown below:
    <?php
    
    	require 'cronHelper/cron.helper.php';
    
    	if(($pid = cronHelper::lock()) !== FALSE) {
    
    		/*
    		 * Cron job code goes here
    		*/
    		sleep(10); // Cron job code for demonstration
    
    		cronHelper::unlock();
    	}
    
    ?>

Is it working? Verify
Let's verify that the helper class really takes care of all the edge cases.

  • sleep(10) is our cron job code for this test
  • Run from command line:
    sabhinav$ php job.php
    ==40818== Lock acquired, processing the job...
    ==40818== Releasing lock...
    

    where 40818 is the process id of the currently running cron job

  • Run from the command line and terminate the cron job midway by pressing Ctrl+C:
    sabhinav$ php job.php
    ==40830== Lock acquired, processing the job...
    

    By pressing Ctrl+C, we simulate the cases where a cron job dies midway due to a fatal error or a system shutdown. In such cases, the helper class never gets the chance to release the lock held by this cron job.

  • With the lock in place (ls -l cronHelper | grep lock), run from command line:
    sabhinav$ php job.php
    ==40830== Previous job died abruptly...
    ==40835== Lock acquired, processing the job...
    ==40835== Releasing lock...
    

    As seen, the helper class detects that the previous cron job died abruptly and then allows the current job to run successfully.

  • Run the cron job from two command line windows; one of them will not proceed, as shown below:
    centurydaily-lm:cronHelper sabhinav$ php job.php
    ==40856== Already in progress...
    

    One of the cron jobs exits immediately, since a cron job with $pid=40856 is already in progress.

Working of cron.helper.php
The helper class creates a lock file inside LOCK_DIR. For our test cron job above, the lock file name will be job.php.lock. The lock file suffix can be configured using LOCK_SUFFIX.

cronHelper::lock() places the process id of the currently running cron job inside the lock file. Upon job completion, cronHelper::unlock() deletes the lock file.

If cronHelper::lock() finds that the lock file already exists, it extracts the process id of the previous cron job from the lock file and checks whether that job is still running. If the previous job is still in progress, we abort the current job. If the previous job is not in progress, i.e. it died abruptly, the current cron job acquires the lock.
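The "is the previous job still running?" check can also be done without parsing ps output. The following is a minimal sketch, assuming the posix extension is available; pid_is_alive() is a hypothetical helper and not part of cron.helper.php. It sends signal 0, which delivers nothing and only asks the operating system whether the process exists.

```php
<?php
// Hypothetical helper (not part of cron.helper.php): is a process with this id alive?
function pid_is_alive($pid) {
    $pid = (int) $pid;
    if ($pid <= 0) {
        return false;   // empty or garbage lock file content
    }
    if (!function_exists('posix_kill')) {
        // Assumption: fall back to a ps based check when ext-posix is missing
        $pids = array_map('trim', explode(PHP_EOL, (string) shell_exec("ps -e | awk '{print \$1}'")));
        return in_array((string) $pid, $pids, true);
    }
    // Signal 0 sends nothing; it only performs the existence and permission check
    if (posix_kill($pid, 0)) {
        return true;
    }
    // errno 1 is EPERM on Linux and macOS: the process exists but belongs to another user
    return posix_get_last_error() === 1;
}

var_dump(pid_is_alive(getmypid()));  // our own process is certainly running
var_dump(pid_is_alive(''));          // an empty lock file never counts as running
```

Note that a process id can be reused by an unrelated process after the original job has died, so a check like this (or the ps based one) can occasionally report a stale lock as live.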

This is the classic method for avoiding cron overlaps. However, there are various other ways of achieving the same thing. If you know of any, do let me know through your comments.
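One such alternative is an advisory file lock taken with PHP's flock(). The following is only a sketch under stated assumptions: acquire_job_lock() and release_job_lock() are hypothetical names, and the lock file location in the system temp directory is chosen purely for the demo. The operating system drops the lock when the process exits, even if it dies abruptly, so no process id check is needed.

```php
<?php
// Hypothetical helper: returns an open handle while the lock is held, FALSE otherwise
function acquire_job_lock($lock_file) {
    $fp = fopen($lock_file, 'c');          // 'c': create if missing, do not truncate
    if ($fp === false) {
        return false;
    }
    if (!flock($fp, LOCK_EX | LOCK_NB)) {  // LOCK_NB: fail at once instead of waiting
        fclose($fp);
        return false;
    }
    ftruncate($fp, 0);
    fwrite($fp, (string) getmypid());      // the pid is informational only
    fflush($fp);
    return $fp;
}

// Hypothetical helper: release the lock and close the handle
function release_job_lock($fp) {
    flock($fp, LOCK_UN);
    fclose($fp);
}

// Assumed lock location for this demo; a real job would use its own lock directory
$lock_file = sys_get_temp_dir() . '/job.php.flock';

$lock = acquire_job_lock($lock_file);
if ($lock !== false) {
    // cron job code goes here
    release_job_lock($lock);
}
```

The handle must stay open for as long as the job runs, because closing it releases the lock.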

How to build a custom static file serving HTTP server using Libevent in C

Libevent is an event notification library which lays the foundation for immensely successful open source projects like Memcached. As the web moves toward real time, more and more websites are using a mix of technologies like HTTP Pub-Sub, HTTP long-polling and Comet, with custom lightweight HTTP servers in the backend, to create a real time user experience. In this blog post, I will start with the prerequisites for setting up the development environment. Then I will demonstrate how to build an HTTP server capable of serving static pages. Finally, I will list a few use cases of custom HTTP servers in today’s world.

Setting up Environment
Follow these steps to install the latest version of libevent (version 2.0.3-alpha):

  • $ wget http://www.monkey.org/~provos/libevent-2.0.3-alpha.tar.gz
  • $ tar -xvzf libevent-2.0.3-alpha.tar.gz
  • $ cd libevent-2.0.3-alpha
  • $ ./configure
  • $ make
  • $ sudo make install

Check the environment by running the following piece of C code (event2.cpp):

#include <stdio.h>
#include <event2/event.h>

int main(int argc, char **argv) {
	const char *version;
	version = event_get_version();
	printf("%s\n", version);
	return 0;
}

Compile and run as following:

$ g++ -arch x86_64 -Wall -levent event2.cpp -o event2
$ ./event2
2.0.3-alpha

I had to pass the -arch x86_64 flag on Mac OS X 10.5.8. This can vary depending upon your operating system.

Libsrvr: Static file serving HTTP Server
Below is the code for “Libsrvr”, a static file serving HTTP server built on libevent. It is written in C style but compiled as C++ with g++:

libsrvr.h

// General purpose header files
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <getopt.h>
#include <sys/stat.h>

// Libevent header files
#include <event2/event.h>
#include <event2/http.h>
#include <event2/buffer.h>

// Libsrvr configuration settings
#define LIBSRVR_SIGNATURE "libsrvr v 0.0.1"
#define LIBSRVR_HTDOCS "/Users/sabhinav/libsrvr/www"
#define LIBSRVR_INDEX "/index.html"

// Libsrvr http server and base struct
struct evhttp *libsrvr;
struct event_base *libbase;

// Libsrvr options
struct _options {
	int port;
	const char *address; // points to a string literal, hence const
	int verbose;
} options;
  • LIBSRVR_SIGNATURE is the server signature sent as a response header for all incoming requests
  • LIBSRVR_HTDOCS is the path to the DocumentRoot for libsrvr
  • LIBSRVR_INDEX is similar to the DirectoryIndex directive of apache

libsrvr.cpp

#include "libsrvr.h"

void router(struct evhttp_request *r, void *arg) {
	const char *uri = evhttp_request_get_uri(r);

	// Room for document root + uri + index file name + terminating NUL
	char *static_file = (char *) malloc(strlen(LIBSRVR_HTDOCS) + strlen(uri) + strlen(LIBSRVR_INDEX) + 1);
	stpcpy(stpcpy(static_file, LIBSRVR_HTDOCS), uri);

	bool file_exists = true;
	struct stat st;
	if(stat(static_file, &st) == -1) {
		file_exists = false;
		evhttp_send_error(r, HTTP_NOTFOUND, "NOTFOUND");
	}
	else {
		if(S_ISDIR(st.st_mode)) {
			// Directory requested: serve its index file instead
			strcat(static_file, LIBSRVR_INDEX);

			if(stat(static_file, &st) == -1) {
				file_exists = false;
				evhttp_send_error(r, HTTP_NOTFOUND, "NOTFOUND");
			}
		}
	}

	if(file_exists) {
		size_t file_size = (size_t) st.st_size;

		struct evbuffer *buffer;
		buffer = evbuffer_new();

		if(file_size != 0) {
			// Heap buffer: alloca() could overflow the stack for large files
			char *html = (char *) malloc(file_size);
			FILE *fp = fopen(static_file, "rb");
			if(html != NULL && fp != NULL) {
				size_t bytes_read = fread(html, 1, file_size, fp);
				// The bytes read are not NUL-terminated, so add them by length
				// instead of printing them through a "%s" format
				evbuffer_add(buffer, html, bytes_read);
			}
			if(fp != NULL) fclose(fp);
			free(html);
		}

		struct evkeyvalq *headers = evhttp_request_get_output_headers(r);
		evhttp_add_header(headers, "Content-Type", "text/html; charset=UTF-8");
		evhttp_add_header(headers, "Server", LIBSRVR_SIGNATURE);

		evhttp_send_reply(r, HTTP_OK, "OK", buffer);
		evbuffer_free(buffer);

		if(options.verbose) fprintf(stderr, "%s\t%lu\n", static_file, (unsigned long) file_size);
	}
	else {
		if(options.verbose) fprintf(stderr, "%s\t%s\n", static_file, "404 Not Found");
	}

	free(static_file);
}

int main(int argc, char **argv) {
	int opt;

	options.port = 4080;
	options.address = "0.0.0.0";
	options.verbose = 0;

	while((opt = getopt(argc,argv,"p:vh")) != -1) {
		switch(opt) {
			case 'p':
				options.port = atoi(optarg);
				break;
			case 'v':
				options.verbose = 1;
				break;
			case 'h':
				printf("Usage: ./libsrvr -p port -v[erbose] -h[elp]\n");
				exit(1);
		}
	}

	libbase = event_base_new();
	libsrvr = evhttp_new(libbase);
	evhttp_bind_socket(libsrvr, options.address, options.port);
	evhttp_set_gencb(libsrvr, router, NULL);
	event_base_dispatch(libbase);

	return 0;
}

Here is some explanation for the above code:

  • Command line options are parsed using the GNU getopt library
  • libbase is the event base for the HTTP server libsrvr
  • The HTTP server is bound to port 4080 (by default)
  • A callback function is registered for incoming HTTP requests to libsrvr. The function router is invoked every time an HTTP request is received
  • Finally, libbase is dispatched. The event loop keeps running, so the code normally never reaches return 0

The working of the router function is as follows:

  • The incoming request uri is converted to an absolute file path on the system
  • The path is checked for the existence of a file or directory
  • If the absolute path is a directory, LIBSRVR_INDEX is served out of that directory

Launching Libsrvr:
Compile and run libsrvr as follows:

$ g++ -arch x86_64 -Wall -levent libsrvr.cpp -o libsrvr
$ ./libsrvr -v
/Users/sabhinav/libsrvr/www//index.html	538
/Users/sabhinav/libsrvr/www/assets/style.css	35
/Users/sabhinav/libsrvr/www/assets/script.js	27
/Users/sabhinav/libsrvr/www/dummy	404 Not Found
/Users/sabhinav/libsrvr/www/index.html	538
/Users/sabhinav/libsrvr/www/assets/style.css	35

If started in verbose mode (-v), libsrvr will output each requested file path on the console as shown above.

Use cases
Below are a few use cases of custom HTTP servers as seen on the web today:

  • Facebook Chat: Uses a custom http server based on mochiweb framework
  • Yahoo finance: Uses a custom http streaming server based on libevent

Generally, the iframe technique is combined with javascript hacks for streaming data from custom HTTP servers. Read “How to make cross-sub-domain ajax (XHR) requests using mod_proxy and iframes” for details.

Conclusion
Though a static file server finds little place in today’s world, the idea was to show how easily you can create your own HTTP server which is lightweight, fast and scalable (all thanks to Niels for his libevent). Couple libsrvr with memcached for caching static files, and benchmarks will show libsrvr handling over 10,000 req/sec.

Share if you like it and also let me know your thoughts through comments.

How to add content verification using hmac in PHP

Many times a requirement arises where we need to expose an API that intended users can call to GET/POST data on our servers. But how do we verify that only the intended users are using these APIs, and not a hacker or attacker? In this blog post, I will show you an elegant way of adding content verification using hash_hmac (Hash-based Message Authentication Code) in PHP. This will allow us to restrict possible misuse of our API by simply issuing an API key to the intended users.

Here are the steps for adding content verification using hmac in PHP:

  • Issue a $private_key and a $public_key to each user allowed to post data using our API. You can use a method similar to the one described here for generating public and private keys.
  • Users having these keys can now use the following sample script (hmac-sender.php) to submit data:
            // User Public/Private Keys
            $private_key = 'private_key_user_id_9999';
            $public_key = 'public_key_user_id_9999';
    
            // Data to be submitted
            $data = 'This is a HMAC verification demonstration';
    
            // Generate content verification signature
            $sig = base64_encode(hash_hmac('sha1', $data, $private_key, TRUE));
    
            // Prepare json data to be submitted
            $json_data = json_encode(array('data'=>$data, 'sig'=>$sig, 'pubKey'=>$public_key));
    
            // Finally submit to api end point
            submit_to_api_end_point("http://yoursite.com/hmac-receiver.php?data=".urlencode($json_data));
  • At hmac-receiver.php, we validate the incoming data as follows:
            function get_private_key_for_public_key($public_key) {
                    // extract private key from database or cache store
                    return 'private_key_user_id_9999';
            }
    
            // Data submitted
            $data = $_GET['data'];
            $data = json_decode(stripslashes($data), TRUE);
    
            // User hit the end point API with $data, $signature and $public_key
            $message = $data['data'];
            $received_signature = $data['sig'];
            $private_key = get_private_key_for_public_key($data['pubKey']);
            $computed_signature = base64_encode(hash_hmac('sha1', $message, $private_key, TRUE));
    
            // hash_equals() compares in constant time and avoids the type juggling of ==
            if(hash_equals($computed_signature, (string) $received_signature)) {
                    echo "Content Signature Verified";
            }
            else {
                    echo "Invalid Content Verification Signature";
            }
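
The signing and verification steps above can be condensed into two small functions. This is a minimal sketch, not the scripts above: sign_message() and verify_message() are hypothetical names, and it assumes SHA-256 in place of the SHA-1 used above and hash_equals() (PHP 5.6 or later) for a constant time comparison.

```php
<?php
// Hypothetical helpers; the names are illustrative and not part of the scripts above

// Sender side: sign the message with the user's private key
function sign_message($message, $secret) {
    return base64_encode(hash_hmac('sha256', $message, $secret, true));
}

// Receiver side: recompute the signature and compare it with the received one
function verify_message($message, $signature, $secret) {
    if (!is_string($signature)) {
        return false;   // missing or malformed signature
    }
    // hash_equals() takes the same time wherever the strings first differ
    return hash_equals(sign_message($message, $secret), $signature);
}

$secret  = 'private_key_user_id_9999';
$message = 'This is a HMAC verification demonstration';
$sig     = sign_message($message, $secret);

var_dump(verify_message($message, $sig, $secret));                  // untouched message
var_dump(verify_message($message . ' (tampered)', $sig, $secret));  // modified message
```

The private key never travels with the request; only the message, the signature and the public key do.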
    

Where to use such verification?
This is an age old method for content verification which is widely used in a variety of applications. Below are a few places where hmac verification is useful:

  • If you have exposed an API for your vendors to submit requested data
  • If you are looking to enable third party applications on your website, similar to the developer application model of facebook.

Hope you liked the post. Do leave your comments.
Enjoy!

How to use locks for assuring atomic operation in Memcached?

Memcached provides atomic increment and decrement commands to manipulate integer (key, value) pairs. However, special care should be taken over application performance and possible race conditions while using memcached. In this blog post, I will first build a facebook style “like” application using the atomic increment command of memcached. I will also discuss the technical difficulties one faces while ensuring atomicity in this application. Finally, I will demo how to ensure atomicity over a requested process using custom locks in memcached.

Where should I care about it?
Let's consider a sample application as depicted by the flow diagram below:
[Figure: Facebook style "like" demo architecture using memcached]

The above application is similar to the facebook “like” feature. In brief, we maintain one key per post, e.g. $key="post_id_1234_likes_count", storing the count of users who liked this post. Another key, $key="post_id_1234_user_id_9999", stores the relationship of user_id_9999 with post_id_1234: for example “liked”, which is set to 1 if the user liked the post, and “timestamp”, which is the time when the user liked it.

Since this application is going to reside on a high traffic website, an early design decision was made to put memcached in front of the MySQL database, where it acts as the primary storage medium with periodic syncs to the database. For me, like/dislike functionality is not as important as the other social functionality on my website.

Here is sample code for the above functionality:

	$mem = new Memcache;
	$mem->addServer("127.0.0.1", 11211);

	function incrementLikeCount($post_id) {
		global $mem;

		// prepare post key
		$key = "post_id_".$post_id."_likes_count";

		// get old count
		$old_count = $mem->get($key);

		// false means no one liked this post before
		if($old_count === FALSE) $old_count = 0;

		// increment count
		$new_count = $old_count+1;

		// set new count value
		if($mem->set($key, $new_count, 0, 0)) {
			error_log("Incremented key ".$key." to ".$new_count);
			return TRUE;
		}
		else {
			error_log("Error occurred in incrementing key ".$key);
			return FALSE;
		}
	}

	// get incoming parameters
	$post_id = $_GET['post_id'];

	// take action
	incrementLikeCount($post_id);

Why should I care about it?
Save the above code sample in a file called memcached_no_lock.php and hit the url http://localhost/memcached_no_lock.php?post_id=1234 five times. Verify the key value in memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 2
5
END

Alright, the application seems to give the expected results. Next, let's test this application under high traffic using Apache Benchmark (ab):

centurydaily-lm:Workspace sabhinav$ ab -n 100 -c 10 http://localhost/memcached_no_lock.php?post_id=1234
Concurrency Level:      10
Time taken for tests:   0.090 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      22200 bytes
HTML transferred:       0 bytes
Requests per second:    1112.03 [#/sec] (mean)

Verify the key value in memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 2
36
END

What happened? We expected the value of $key="post_id_1234_likes_count" to reach 100, but it is actually 36. What went wrong? This behavior can be explained by simply looking at the Apache error log file:

[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 1
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 1
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 1
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 2
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 2
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 3
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 3
[Sat Dec 05 14:32:08 2009] [error] [client ::1] Incremented key post_id_1234_likes_count to 3

OK, from the above log we understand that concurrency killed our application: we see $key being incremented to the same value by more than one incoming request.
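
The interleaving behind those repeated log lines can be reproduced deterministically. The sketch below is only an illustration: it uses a plain PHP array as a stand-in for memcached and replays by hand two requests that both read the counter before either of them writes it back.

```php
<?php
// In-memory stand-in for memcached, used only to make the interleaving visible
$store = array();

// Mirrors the get() step of the script: a missing key counts as zero likes
function read_count(array $store, $key) {
    return isset($store[$key]) ? $store[$key] : 0;
}

$key = 'post_id_1234_likes_count';

// Two concurrent requests both read the counter before either one writes it back
$seen_by_a = read_count($store, $key);
$seen_by_b = read_count($store, $key);

$store[$key] = $seen_by_a + 1;   // request A writes 1
$store[$key] = $seen_by_b + 1;   // request B also writes 1, so A's like is lost

echo "likes after two requests: " . $store[$key] . PHP_EOL;
```

Two likes arrived, yet the stored count is 1. With ten concurrent clients the same thing happens again and again, which is how 100 requests ended up as 36.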

How should I take care of this?
Below is the modified code sample, which gives us atomic increments:

	$mem = new Memcache;
	$mem->addServer("127.0.0.1", 11211);

	function incrementLikeCount($post_id) {
		global $mem;

		// prepare post key
		$key = "post_id_".$post_id."_likes_count";

		$new_count = $mem->increment($key, 1);
		if($new_count === FALSE) {
			$new_count = $mem->add($key, 1, 0, 0);
			if($new_count === FALSE) {
				error_log("Someone raced us for first count on key ".$key);
				$new_count = $mem->increment($key, 1);
				if($new_count === FALSE) {
					error_log("Unable to increment key ".$key);
					return FALSE;
				}
				else {
					error_log("Incremented key ".$key." to ".$new_count);
					return TRUE;
				}
			}
			else {
				error_log("Initialized key ".$key." to ".$new_count);
				return TRUE;
			}
		}
		else {
			error_log("Incremented key ".$key." to ".$new_count);
			return TRUE;
		}

	}

	// get incoming parameters
	$post_id = $_GET['post_id'];

	// take action
	incrementLikeCount($post_id);

To ensure atomicity, we start by incrementing $key="post_id_1234_likes_count". Since memcached increment() is atomic by itself, we need not put any locking mechanism in here. However, memcached increment() returns FALSE if the $key doesn’t already exist.

Hence, if we get a FALSE response from the first increment, we try to initialize $key using the memcached add() command. The good thing about memcached add() is that it returns FALSE if the $key is already present. Hence, if more than one thread is trying to initialize $key, only one of them will succeed. All the other threads will get FALSE from the add command. Finally, if the response from add is FALSE, we try to increment the $key again.

Let's test this modified code with Apache Benchmark. This time we will also increase the concurrency from 10 to 100 threads. Save the above modified code in a file called memcached_lock.php and issue the following ab command:

centurydaily-lm:Workspace sabhinav$ ab -n 10000 -c 100 http://localhost/memcached_lock.php?post_id=1234
Concurrency Level:      100
Time taken for tests:   11.006 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      2224884 bytes
HTML transferred:       0 bytes
Requests per second:    908.61 [#/sec] (mean)

Let's verify the key value inside memcached:

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 5
10000
END

Bingo! As desired we have a value of 10000 for $key inside memcached.

Using custom locks for atomicity:
There can be many instances where you SHOULD process a request atomically using locks, for example while fetching the result of a query from the database, or while regenerating a requested page template in your custom template caching engine.

In the example below, I will modify the memcached_lock.php script to ensure atomic increments without using the increment command. Instead, I will use custom locks built on memcached:

	$mem = new Memcache;
	$mem->addServer("127.0.0.1", 11211);

	function incrementLikeCount($post_id) {
		global $mem;

		// prepare post key
		$key = "post_id_".$post_id."_likes_count";

		// initialize lock
		$lock = FALSE;

		// initialize configurable parameters
		$tries = 0;
		$max_tries = 1000;
		$lock_ttl = 10;

		$new_count = $mem->get($key); // fetch older value
		while($lock === FALSE && $tries < $max_tries) {
			if($new_count === FALSE) $new_count = 0;
			$new_count = $new_count + 1;

			// add() will return false if someone raced us for this lock
			// ALWAYS USE add() FOR CUSTOM LOCKS
			// The lock name includes $key so that different posts never share a lock
			$lock = $mem->add("lock_".$key."_".$new_count, 1, 0, $lock_ttl);

			$tries++;
			// sleep grows linearly with $tries and wraps around every $max_tries/10 tries
			if($lock === FALSE) usleep(100*($tries%($max_tries/10)));
		}

		if($lock === FALSE && $tries >= $max_tries) {
			error_log("Unable to increment key ".$key);
			return FALSE;
		}
		else {
	    	$mem->set($key, $new_count, 0, 0);
			error_log("Incremented key ".$key." to ".$new_count);
			return TRUE;
		}

	}

	// get incoming parameters
	$post_id = $_GET['post_id'];

	// take action
	incrementLikeCount($post_id);

Try testing it using apache benchmark as above and then verify it with memcached.

centurydaily-lm:Workspace sabhinav$ telnet localhost 11211
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get post_id_1234_likes_count
VALUE post_id_1234_likes_count 0 3
100
END
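
The sleep inside the retry loop of the custom lock code grows linearly with $tries and wraps around every $max_tries/10 tries. If a true exponential backoff is wanted instead, the delay can be computed as in the sketch below. backoff_delay_us() is a hypothetical helper, and the 100 microsecond base and 100000 microsecond cap are assumed values, not tuned ones.

```php
<?php
// Hypothetical helper: delay in microseconds before retry number $attempt (0-based).
// The delay doubles on every attempt until it reaches $cap_us.
function backoff_delay_us($attempt, $base_us = 100, $cap_us = 100000) {
    $exponent = min(max((int) $attempt, 0), 20);   // clamp so the shift cannot overflow
    return (int) min($cap_us, $base_us * (1 << $exponent));
}

foreach (array(0, 1, 2, 3, 12) as $attempt) {
    echo $attempt . " -> " . backoff_delay_us($attempt) . " us" . PHP_EOL;
}

// Inside a retry loop one would typically add jitter so that waiting clients
// do not all wake up together, for example:
//     usleep(mt_rand(0, backoff_delay_us($tries)));
```

The loop prints 100, 200, 400 and 800 microseconds for the first four attempts, and the capped value of 100000 for attempt 12.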

Benchmarks:
We see a drop in performance from 1112 hits/sec (memcached_no_lock) to 908 hits/sec (memcached_lock using increment). This is mainly because of the increased concurrency. At the same concurrency level of 10, I measured 1128 hits/sec with our thread protected code. However, for the custom lock code above, I measured only 275 hits/sec.

Conclusion:
Always use memcached increment/decrement for atomic updates on integer valued keys. For locking around a process, use custom locks built on the memcached add command, as demoed above. Custom locks also depend on configurable options such as $max_tries and $lock_ttl.
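
For locking around a whole process rather than a single counter, the add() based lock can be wrapped in a small helper that also releases the lock when the work is done. This is a minimal sketch under stated assumptions: with_lock() is a hypothetical helper, and ArrayCache is an in-memory stand-in that exists only so the example runs without a memcached server. A Memcache object can be passed instead, since Memcache::add() and Memcache::delete() are called the same way.

```php
<?php
// In-memory stand-in exposing the two calls the sketch needs (demo only)
class ArrayCache {
    private $data = array();

    // Like memcached add(): fails if the key is already present
    public function add($key, $value, $flag = 0, $expire = 0) {
        if (array_key_exists($key, $this->data)) {
            return false;
        }
        $this->data[$key] = $value;
        return true;
    }

    public function delete($key) {
        if (!array_key_exists($key, $this->data)) {
            return false;
        }
        unset($this->data[$key]);
        return true;
    }
}

// Hypothetical helper: run $critical while holding the lock named $lock_key.
// Returns FALSE if the lock could not be acquired within $max_tries attempts.
function with_lock($cache, $lock_key, $lock_ttl, $max_tries, $critical) {
    for ($tries = 0; $tries < $max_tries; $tries++) {
        if ($cache->add($lock_key, 1, 0, $lock_ttl)) {
            try {
                return $critical();
            } finally {
                $cache->delete($lock_key);   // always release, even on exceptions
            }
        }
        usleep(100);                         // someone else holds the lock: wait and retry
    }
    return false;
}

$cache = new ArrayCache();
$count = 0;
with_lock($cache, 'lock_post_id_1234', 10, 1000, function () use (&$count) {
    $count++;                                // the protected read-modify-write
});
echo "count = " . $count . PHP_EOL;
```

$lock_ttl acts as a safety net: if the lock holder dies before calling delete(), memcached expires the lock by itself after that many seconds.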

Hope you enjoyed reading.
Do let me know through your comments.