Hello Cache Freaks,
Ever since I changed my job (from Business Intelligence to Web development) and started working with my present employer, I have had a chance to work on a lot of scalable projects. From making my project to scale from 20 Million PV’s to 100 Million PV’s to development of an internal tool, the answer to all scalable applications have been caching.
There are a lot of caching techniques which are being employed by sites worldwide.
- WP-Cache used in wordpress – a file system based caching mechanism
- APC Cache – an opcode based caching system
- Memcache – an in memory caching system
- Query Cache – caching mechanism employed in MySQL
Here in this post I would like to pen down my experiences while working with all the caching mechanism. Their pros and cons. What things you need to take care while working with them and every little tit bit which comes to my mind while writing this post.
Query Cache – inbuilt cache mechanism in MySQL
Query cache is an inbuilt cache mechanism for MySQL. Basically when you fire a query against a MySQL database, it goes through a lot of core modules. e.g. Connection Manager, Thread Manager, Connection Thread, User Authentication Module, Parse Module, Command Dispatcher, Optimizer Module, Table Manager, Query Cache Module and blah blah. Discussing these modules is out of scope of this blog post. The only module we are interested here is Query Cache Module.
Suppose I fire a query:
$query = “Select * from users”;
MySQL fetches it for the first time from the database and caches the result for further similar query. Next time when you fire a similar query, it picks the result from the cache and deliver it back to you.
However, there are a few drawbacks with Query Cache:
- If there is a new entry in the users table, the cache is cleared.
- Secondly, even if the result of your query is cached, MySQL has to go through a number of core modules before it give back the cached result to you.
- Thirdly, even if your results are caches, you need to connect to your MySQL database, which generally have a bottleneck with number of connections allowed.
One thing which you should take care while replying on Query Cache is that, your query must not have any parameter which is random or changes quite often. For e.g. If you wish to fetch url’s of 100 photo from the database, and then you want to present them in a random fashion every time, you might be tempted to use rand() somewhere in your MySQL queries. However by using rand() you ensure that the query is never returned from cache, since rand() parameter always makes the query look different.
Similarly, If you have a case where you need to show data not older than 15 days, and by mistake you also include the seconds parameter in your SQL query, then the query will never return from cache.
Hence for a scalable site, with 100 Million PV’s you can’t really survive with a simple query cache provided by MySQL database.
WP-Cache – Caching mechanism for WordPress blogs
WP-Cache is a file system based caching mechanism i.e. it caches your blog posts in form of simple text files which are saved on your file system. You can have a look at these cached files by visiting wp-content/cache folder inside your blog directory. Generally you will find two set of files for a single blog post. One .html and another .meta file.
.html file generally contains the static html content for your blog post. Once published, the blog post content is static, hence instead of fetching it’s data from the database, WP-Cache serves it from the cache directory.
.meta file contains serialized information such as Last Modified, Content-Type, URI which WP-Cache uses to maintain cache expiry on your blog.
WP-Cache works really well, however if the traffic starts increasing on your blog, then the bottleneck will be maximum number of child processes which apache can create (For starters you can think, each user connecting to your blog as one apache process, hence there is a restriction on number of users who can connect to your blog at a particular time). In such case the solution can be either to have multiple front end servers with a load balancer to distribute the traffic among front end servers, or to have a better cache solution such as a memory based caching mechanism (e.g. Memcache). Also since memory read is always faster than file read, you must go for a memory based cache system.
APC Cache – An opcode based cache for PHP
APC stands for Alternative PHP Cache. In one of my previous post How does PHP echo’s a “Hello World”? – Behind the scene, I talk about how PHP churns out “Hello World” for you.
PHP takes your code through various steps:
- Scanning – The human readable source code is turned into tokens.
- Parsing – Groups of tokens are collected into simple, meaningful expressions.
- Compilation – Expressions are translated into instruction (opcodes)
- Execution – Opcode stacks are processed (one opcode at a time) to perform the scripted tasks.
Opcode caches let the ZEND engine perform the first three of these steps, then store that compiled form (opcodes) so that the next time a given script is used, it can use the stored version without having to redo those steps only to come to the same result.
However the problem with APC cache is that it is not a distributed cache system. By distributed cache I mean, if you have 3 frontend server then you need to have a copy of this opcode on all the three fronend server.
Also like WP-Cache APC is again a file system driven cache system, which is not the optimal solution. Also with APC cache, PHP still has to go through the last step as described above.
Memcache – In memory based cache mechanism
Memcache is the solution when you talk about million PV’s on your site. It is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
For starters, Memcache is not PHP, nor Python or any other thing as you may think. It’s a deamon which runs in the background on your server. Your code connects to it and cache query results, JS, CSS and other cachable data in the server’s memory (RAM). Since its an in-memory caching system, it is faster than any of the discussed caching system above. However it is unreliable and unsecure but then there are ways to tackle this unreliable and unsecure nature of memcache.
Memcache is unreliable because it resides in your system memory. So an event like system reboot or power failure will result in loss of all your cache. Also memcache provides no mechanism to take backup of your caches. Hence once lost, you need to programmatically warmup your caches.
Memcache is unsecure because it doesn’t require any authentication mechanism. There is no username or password with which your code connects to it. (Hence it is super fast, unlike Query cache which has to go through auth module even if the query result is cached). It usually runs at port 11211 on your server and if not taken care, anyone can telnet to port 11211 on your server and steal your caches.
Below are the steps which are being followed on a memcache enabled website:
- User enter your site’s url in his browser, say http://localhost
- There are about 6 queries which drives your opening page
- Lets assume one of the query for this page is $query = “SELECT photo_url from photos LIMIT 0,10″
- When the user visit http://localhost, your code will first connect to memcache deamon on port 11211
- If memcache deamon is running, it checks if the result of this query are already cached. Generally data is cached in memcache as (key,value) pair
- Since this is the first visit on your site, ofcourse there is no query being cached till now. Hence your code now connect to your MySQL database and fetched the resultset. Something like this probably. $resultset = mysql_query($query);
- After successfully fetching the resultset, your code will cache this resultset in memcache. The code connects to memcache deamon and saves this (key,value) pair in memory, where $key = md5($query) and $value = serialize($resultset)
- After caching the (key,value) pair, your code returns back the fetched resultset to the frontpage where it is being displayed to the user
- Similarly all the 6 queries which drives your front page are being cached one by one
- Now the next time when another user visit this page i.e. http://localhost, your code will first see if $key = md5($query) is already present in cache. It will find the $key being cached, fetches the serialized resultset from memory, unserialize it and throws back to the front page where it is displayed as intended
- While caching (key,value) pair you also have a provision to specify a TTL (time to live) after which memcache will automatically expire your cached result
- Suppose you specifies a TTL = 15 Minutes for all the above queries and now a visitor visit http://localhost after 30 minutes
- Your code will again first connect to memcache deamon and check if $key = md5($query) is present in cache. Memcache deamon will see that this $key is present but it has expired. It will return a result saying $key is not cached and internally flushes out $key. Your code then again connect to MySQL database, fetches the resultset and cache back the results in memcache for further use
I will leave you with a presentation on memcache which I gave sometime back at office. I hope it will help you gain more understanding of memcache.
In my next posts, I will be covering a few code samples and use cases of memcache which you wouldn’t even heard of. If you liked the post don’t forget to promote it on social networking sites and do subscribe to my blog.