I'm always curious as to which technologies could do the job better. I've read that the V8 JavaScript engine performs best in regular expression benchmarks, so I had to see for myself. I wanted the test to be closer to a real-world example, so I used a 7.2MB Apache access log.

The tests cover the following technologies: PHP, HHVM, Python, Node.js, and C#. Now that C# is open source, I was able to test all of them on the same Ubuntu Linux system. Note that none of the tests produced any output.
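To make it clearer what the regular expression used in the tests below actually pulls out of a log line, here is a minimal sketch in Python. The sample line is made up (it is not from the real access log) and only assumes the Apache "combined" log format; the names LOG_PATTERN and SAMPLE are just for illustration.

```python
import re

# Same pattern as the tests below, written as a raw string.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$'
)

# Made-up sample line in Apache "combined" log format (not from the real log).
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2014:13:55:36 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"'
)

match = LOG_PATTERN.match(SAMPLE)
if match:
    # Group order: client address, timestamp, HTTP method, user agent.
    ip, timestamp, method, user_agent = match.groups()
    print(ip, timestamp, method, user_agent)
```

The four capture groups are the client address, the timestamp, the HTTP method, and the user agent. Lines where the byte count is logged as "-" instead of a number would not match this pattern.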


PHP v5.5

<?php

$file = file('../data/access.log');
$regex = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m';

for ($i = 0; $i < count($file); $i++) {
    if (preg_match($regex, $file[$i], $matches)) {

    }
}

time php accesslog.php

real	0m0.137s
user	0m0.123s
sys	0m0.012s

HipHop VM v3.4.0

Using the same PHP code snippet as above.

time hhvm accesslog.php

real	0m0.152s
user	0m0.103s
sys	0m0.048s

Python v2.7.6

import re

r = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$')
f = file('../data/access.log', 'r')
for line in f:
    matches = r.match(line)
    # print matches.group(1)

f.close()

time python accesslog.py

real	0m0.060s
user	0m0.060s
sys	0m0.000s

Node.js v0.10.25

Yes, I used the "readline" module, which is currently marked "unstable" in the Node.js documentation. I may try a different approach, but I wanted to use the built-in modules.

var readline = require('readline'),
    fs = require('fs');

var rd = readline.createInterface({
    input: fs.createReadStream('../data/access.log'),
    output: process.stdout,
    terminal: false
});

rd.on('line', function(line) {
    var matches = line.match(/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m);
    // console.log(matches[1], matches[2], matches[3]);
});

Considering this technology has been around for only a few years, this is awesome!

time node accesslog.js

real	0m0.086s
user	0m0.086s
sys	0m0.004s

Update! Instead of using the readline module, I did a simple split.

var fs = require('fs');

fs.readFile('../data/access.log', {encoding: 'utf8'}, function (err, data) {
    if (err) throw err;

    var lines = data.split("\n");
    for (var i = 0, ii = lines.length; i < ii; i++) {
        var matches = lines[i].match(/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m);
    }
});

time node accesslog.js

real	0m0.063s
user	0m0.055s
sys	0m0.004s

C# on Mono JIT Compiler v3.2.8

I was interested to see how a "compiled" language performs. C# is compiled ahead of time into bytecode/IL (intermediate language), which the CLR (Common Language Runtime, i.e. the virtual machine; Mono in this case) then compiles to native code at run time, much like Java. In short, it relies on a JIT (just-in-time) compiler.

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class AccessLog
{
    public static void Main()
    {
        string[] lines = System.IO.File.ReadAllLines("../data/access.log");
        string pattern = "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([A-Z]+)[^\"]*\" \\d+ \\d+ \"[^\"]*\" \"([^\"]*)\"$";

        foreach (string line in lines) {
            // Note: RegexOptions.IgnoreCase is not used by the other tests.
            Match match = Regex.Match(line, pattern, RegexOptions.IgnoreCase);
            // Console.WriteLine(match.Result("$1"));
        }
    }
}

time mono accesslog.exe

real	0m0.559s
user	0m0.539s
sys	0m0.020s

The Results

Language/Runtime Environment        Real Time   User Time   System Time
Python v2.7.6                       0.060s      0.060s      0.000s
Node.js v0.10.25 (split new line)   0.063s      0.055s      0.004s
Node.js v0.10.25 (readline)         0.086s      0.086s      0.004s
PHP v5.5                            0.137s      0.123s      0.012s
HipHop VM v3.4.0                    0.152s      0.103s      0.048s
C# on Mono JIT Compiler v3.2.8      0.559s      0.539s      0.020s

At the end of the day, this is really comparing apples to oranges. However, in this specific task, Python and Node.js are the top performers... for now, until I add database connections using each language's client libraries and actually do something with this data.
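One reason the comparison is uneven: the `time` command measures the whole process, so runtime startup and reading the file are counted along with the regex matching. A rough sketch of timing only the match loop from inside the process could look like this. It is written for Python 3, the helper name `time_match_loop` is hypothetical, and the lines are made-up stand-ins rather than the real access log.

```python
import re
import time

LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$'
)


def time_match_loop(lines, pattern=LOG_PATTERN):
    """Return (matched_count, seconds) for matching every line once."""
    matched = 0
    start = time.perf_counter()
    for line in lines:
        if pattern.match(line):
            matched += 1
    return matched, time.perf_counter() - start


# Made-up stand-in data; a real run would read ../data/access.log instead.
sample = (
    '203.0.113.7 - - [10/Oct/2014:13:55:36 -0700] '
    '"GET / HTTP/1.1" 200 512 "-" "curl/7.35.0"\n'
)
lines = [sample] * 1000 + ["not a log line\n"]

matched, seconds = time_match_loop(lines)
print("matched %d of %d lines in %.6fs" % (matched, len(lines), seconds))
```

Each runtime would need its own equivalent of this timer, so the numbers are still not directly comparable, but they would at least exclude startup cost.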


TODO

  • Test Java.
  • Test C++.
  • Run each test multiple times and report the mean, median, and mode.
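The last item could be sketched roughly like this in Python 3. This is only an outline under a few assumptions: the helper names `time_command` and `summarize` are hypothetical, it measures wall-clock time only (the "real" column), and the mode is taken over samples rounded to the millisecond, since raw floating-point timings almost never repeat exactly.

```python
import subprocess
import sys
import time
from collections import Counter
from statistics import mean, median

# Stand-in command that does nothing; swap in e.g. ["node", "accesslog.js"].
DEFAULT_CMD = [sys.executable, "-c", "pass"]


def time_command(cmd, runs=10):
    """Run cmd `runs` times; return each run's wall-clock duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,  # fail loudly if the benchmark itself errors out
        )
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return mean, median, and mode (of samples rounded to the millisecond)."""
    rounded = [round(s, 3) for s in samples]
    mode = Counter(rounded).most_common(1)[0][0]
    return {"mean": mean(samples), "median": median(samples), "mode": mode}
```

For example, `summarize(time_command(["python", "accesslog.py"], runs=20))` would return a dict with the three statistics for twenty runs of the Python test.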
