PHP tokens and opcodes: 3 useful extensions for understanding how the Zend Engine works

When a PHP script is executed, it goes through several stages before the final result is displayed: lexing, parsing, compiling and executing. In this blog post, I will walk you through each stage using a sample script. At the end, I will list some useful PHP extensions that can be used to inspect the result of every intermediate stage.

Let's take a sample PHP script, saved as debug.php, as an example:

<?php
	function increment($a) {
		return $a+1;
	}
	$a = 3;
	$b = increment($a);
	echo $b;
?>

Try running this script from the command line:

~ sabhinav$ php debug.php
4

This PHP script goes through the following stages before outputting the result:

  • Lexing: The PHP code inside debug.php is converted into tokens
  • Parsing: The tokens are processed to arrive at meaningful expressions
  • Compiling: The derived expressions are compiled into opcodes
  • Execution: The opcodes are executed to arrive at the final result

Let's see how a PHP script passes through each of these steps.

Lexing:
During this stage, the human-readable PHP script is converted into tokens. For the first two lines of our PHP script:

<?php
	function increment($a) {

the tokens look like this (match them line by line against the two lines of PHP code above to get a feel for the mapping):

~ sabhinav$ php -r 'print_r(token_get_all(file_get_contents("debug.php")));';
Array
(
    [0] => Array
        (
            [0] => 368             // 368 is the token number and its symbolic name is T_OPEN_TAG, see below
            [1] => <?php

            [2] => 1
        )

    [1] => Array
        (
            [0] => 371
            [1] =>
            [2] => 2
        )

    [2] => Array
        (
            [0] => 334
            [1] => function
            [2] => 2
        )

    [3] => Array
        (
            [0] => 371
            [1] =>
            [2] => 2
        )

    [4] => Array
        (
            [0] => 307
            [1] => increment
            [2] => 2
        )

    [5] => (
    [6] => Array
        (
            [0] => 309
            [1] => $a
            [2] => 2
        )

    [7] => )
    [8] => Array
        (
            [0] => 371
            [1] =>
            [2] => 2
        )

    [9] => {
    [10] => Array
        (
            [0] => 371
            [1] =>

            [2] => 2
        )

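A list of parser tokens can be found here: http://www.php.net/manual/en/tokens.php. The numeric values differ between PHP versions, so rely on the symbolic names rather than on numbers such as 368.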
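
Because each array token has the shape [token number, text, line number], a script can compare the first element against the T_* constants instead of hard-coded numbers. The following is a minimal sketch of that idea: it scans the token stream for T_FUNCTION followed by T_STRING to list declared function names. The helper name find_function_names() is hypothetical and not part of the tokenizer extension.

```php
<?php
// Sketch only: find_function_names() is a hypothetical helper,
// built on the real tokenizer function token_get_all().
function find_function_names($source) {
    $names = array();
    $expectName = false;
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens have the shape [id, text, line number].
            if ($token[0] === T_FUNCTION) {
                $expectName = true;
            } elseif ($expectName && $token[0] === T_STRING) {
                $names[] = $token[1];
                $expectName = false;
            }
        } elseif ($token === '(') {
            // A "(" before any name means the function is anonymous.
            $expectName = false;
        }
    }
    return $names;
}

$source = '<?php function increment($a) { return $a+1; } $b = increment(3);';
print_r(find_function_names($source));
```

Comparing against T_FUNCTION and T_STRING keeps the sketch independent of the numeric token values of a particular PHP build.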

Every token number has a symbolic name attached to it. Below is our PHP script with each token's text prefixed by its symbolic name:

~ sabhinav$ php -r '$tokens = token_get_all(file_get_contents("debug.php")); foreach ($tokens as $token) { if (is_array($token)) { echo token_name($token[0]); echo $token[1]; } }';
T_OPEN_TAG<?php
T_WHITESPACE	T_FUNCTIONfunctionT_WHITESPACE T_STRINGincrementT_VARIABLE$aT_WHITESPACE T_WHITESPACE
		T_RETURNreturnT_WHITESPACE T_VARIABLE$aT_LNUMBER1T_WHITESPACE
	T_WHITESPACE
	T_VARIABLE$aT_WHITESPACE T_WHITESPACE T_LNUMBER3T_WHITESPACE
	T_VARIABLE$bT_WHITESPACE T_WHITESPACE T_STRINGincrementT_VARIABLE$aT_WHITESPACE
	T_ECHOechoT_WHITESPACE T_VARIABLE$bT_WHITESPACE
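
The one-liner above prints only array tokens. Single-character tokens such as (, =, + and ; are returned by token_get_all() as plain strings and are skipped. Here is a rough sketch of a dump that keeps them and drops whitespace instead. The helper name describe_tokens() and its $skipWhitespace parameter are assumptions made for illustration.

```php
<?php
// Sketch only: describe_tokens() is a hypothetical helper.
// It returns one [name, text] row per token.
function describe_tokens($source, $skipWhitespace = true) {
    $rows = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            if ($skipWhitespace && $token[0] === T_WHITESPACE) {
                continue;
            }
            $rows[] = array(token_name($token[0]), $token[1]);
        } else {
            // Single-character tokens have no id; use the character itself as the name.
            $rows[] = array($token, $token);
        }
    }
    return $rows;
}

foreach (describe_tokens('<?php $a = 3;') as $row) {
    echo str_pad($row[0], 12), ' ', trim($row[1]), "\n";
}
```

Returning rows rather than echoing directly makes the result easy to filter or count, which is usually the first step of a source analysis tool.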

Parsing and Compiling:
By generating the tokens in the above step, the Zend Engine is able to recognize every detail in the script: where the spaces are, where the newline characters are, where a user-defined function starts, and so on. Over the next two stages, the generated tokens are parsed and then compiled into opcodes. Below are the compiled opcodes for our complete sample script:

~ sabhinav$ php -r '$op_codes = parsekit_compile_file("debug.php", $errors, PARSEKIT_SIMPLE); print_r($op_codes); print_r($errors);';
Array
(
    [0] => ZEND_EXT_STMT UNUSED UNUSED UNUSED
    [1] => ZEND_NOP UNUSED UNUSED UNUSED
    [2] => ZEND_EXT_STMT UNUSED UNUSED UNUSED
    [3] => ZEND_ASSIGN T(0) T(0) 3
    [4] => ZEND_EXT_STMT UNUSED UNUSED UNUSED
    [5] => ZEND_EXT_FCALL_BEGIN UNUSED UNUSED UNUSED
    [6] => ZEND_SEND_VAR UNUSED T(0) 0x1
    [7] => ZEND_DO_FCALL T(1) 'increment' 0x83E710CA
    [8] => ZEND_EXT_FCALL_END UNUSED UNUSED UNUSED
    [9] => ZEND_ASSIGN T(2) T(0) T(1)
    [10] => ZEND_EXT_STMT UNUSED UNUSED UNUSED
    [11] => ZEND_ECHO UNUSED T(0) UNUSED
    [12] => ZEND_RETURN UNUSED 1 UNUSED
    [function_table] => Array
        (
            [increment] => Array
                (
                    [0] => ZEND_EXT_NOP UNUSED UNUSED UNUSED
                    [1] => ZEND_RECV T(0) 1 UNUSED
                    [2] => ZEND_EXT_STMT UNUSED UNUSED UNUSED
                    [3] => ZEND_ADD T(0) T(0) 1
                    [4] => ZEND_RETURN UNUSED T(0) UNUSED
                    [5] => ZEND_EXT_STMT UNUSED UNUSED UNUSED
                    [6] => ZEND_RETURN UNUSED NULL UNUSED
                )

        )

    [class_table] =>
)

As we can see above, the Zend Engine is able to recognize the flow of our PHP script. For instance, [3] => ZEND_ASSIGN T(0) T(0) 3 corresponds to $a = 3; in our PHP code. Read on to understand what the T(0) entries in the opcodes mean.

Executing the opcodes:
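The generated opcodes are executed one by one. The vld output below lists every opcode together with its source line, return value and operands (vld.execute=0 tells vld to dump the opcodes without running the script):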

~ sabhinav$ php -d vld.active=1 -d vld.execute=0 -f debug.php
Branch analysis from position: 0
Return found
filename:       /Users/sabhinav/Workspace/interview/facebook/peaktraffic/debug.php
function name:  (null)
number of ops:  13
compiled vars:  !0 = $a, !1 = $b
line     #  op                           fetch          ext  return  operands
-------------------------------------------------------------------------------
   2     0  EXT_STMT
         1  NOP
   5     2  EXT_STMT
         3  ASSIGN                                                   !0, 3
   6     4  EXT_STMT
         5  EXT_FCALL_BEGIN
         6  SEND_VAR                                                 !0
         7  DO_FCALL                                      1          'increment'
         8  EXT_FCALL_END
         9  ASSIGN                                                   !1, $1
   7    10  EXT_STMT
        11  ECHO                                                     !1
   8    12  RETURN                                                   1

Function increment:
Branch analysis from position: 0
Return found
filename:       /Users/sabhinav/Workspace/interview/facebook/peaktraffic/debug.php
function name:  increment
number of ops:  7
compiled vars:  !0 = $a
line     #  op                           fetch          ext  return  operands
-------------------------------------------------------------------------------
   2     0  EXT_NOP
         1  RECV                                                     1
   3     2  EXT_STMT
         3  ADD                                              ~0      !0, 1
         4  RETURN                                                   ~0
   4     5* EXT_STMT
         6* RETURN                                                   null

End of function increment.

The first table represents the main script, while the second table represents the user-defined function in the PHP script. The line compiled vars: !0 = $a tells us that internally the engine refers to $a as !0, which helps us relate [3] => ZEND_ASSIGN T(0) T(0) 3 to the statement $a = 3;.

The dump also reports the number of operations (number of ops: 13), which can serve as a rough indicator when benchmarking or optimizing your PHP script.

If the APC cache is enabled, it caches the compiled opcodes, avoiding repeated lexing, parsing and compiling every time the same PHP script is called.
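
The idea behind such a cache can be illustrated with a toy example. This is an analogy only, not how APC is implemented: it caches the token array (not opcodes) in a plain PHP array keyed by a hash of the source, and counts how often the lexing step actually runs. The function lex_with_cache() and the cache layout are hypothetical.

```php
<?php
// Toy illustration only: NOT APC's implementation.
// lex_with_cache() is a hypothetical helper that memoizes token_get_all().
function lex_with_cache($source, array &$cache, &$lexCount) {
    $key = md5($source);
    if (!isset($cache[$key])) {
        // The expensive step runs once per distinct source text.
        $cache[$key] = token_get_all($source);
        $lexCount++;
    }
    return $cache[$key];
}

$cache = array();
$lexCount = 0;
$source = '<?php echo 1 + 2;';
for ($i = 0; $i < 3; $i++) {
    lex_with_cache($source, $cache, $lexCount);
}
echo "lexed $lexCount time(s) for 3 requests\n";
```

A real opcode cache stores the compiled result across requests rather than inside one script run, but the saving comes from the same place: the front-end work is done once and reused.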

3 PHP extensions providing an interface to the Zend Engine:
Below are 3 very useful PHP extensions for geeky PHP developers (especially helpful for PHP extension developers).

  • Tokenizer: The tokenizer functions provide an interface to the PHP tokenizer embedded in the Zend Engine. Using these functions you may write your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.
  • Parsekit: The parsekit functions allow runtime analysis of opcodes compiled from PHP scripts.
  • Vulcan Logic Disassembler (vld): Provides functionality to dump the internal representation of PHP scripts. See the homepage of the VLD project for download instructions.
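
Before trying the commands in this post, it may help to check which of these extensions your PHP build actually has. The sketch below uses the built-in extension_loaded(). The helper name extension_status() is hypothetical, and the extension names 'tokenizer', 'parsekit' and 'vld' are assumed to be the names under which the extensions register themselves.

```php
<?php
// Sketch only: extension_status() is a hypothetical helper
// wrapping the built-in extension_loaded().
function extension_status(array $names) {
    $status = array();
    foreach ($names as $name) {
        $status[$name] = extension_loaded($name);
    }
    return $status;
}

// Extension names below are assumptions; adjust them to match your installation.
foreach (extension_status(array('tokenizer', 'parsekit', 'vld')) as $name => $loaded) {
    echo str_pad($name, 10), $loaded ? 'loaded' : 'not loaded', "\n";
}
```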

Hope this is of some help to the PHP geeks out there.
Enjoy!