I'm always curious about which technologies do a given job better. I've read that the V8 JavaScript engine performs best in regular expression benchmarks, so I had to see for myself. I wanted the test to be closer to a real-world example, so I used a 7.2MB Apache access log.
The tests cover the following technologies: PHP, HHVM, Python, Node.js, and C#. Now that C# is open source, I was able to test all of them on the same Ubuntu Linux system. Note that none of the tests produce any output.
    <?php
    $file = file('../data/access.log');
    $regex = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m';

    for ($i = 0; $i < count($file); $i++) {
        if (preg_match($regex, $file[$i], $matches)) {
            // intentionally empty: the test only matches, it prints nothing
        }
    }
    time php accesslog.php
    real 0m0.137s
    user 0m0.123s
    sys 0m0.012s
Using the same PHP code snippet as above.
    time hhvm accesslog.php
    real 0m0.152s
    user 0m0.103s
    sys 0m0.048s
    import re

    r = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$')

    f = open('../data/access.log', 'r')
    for line in f:
        matches = r.match(line)
        # print matches.group(1)
    f.close()
    time python accesslog.py
    real 0m0.060s
    user 0m0.060s
    sys 0m0.000s
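To show what the pattern actually captures, here is a minimal sketch. The sample line is made up for illustration (it is not taken from the benchmark log) and follows the Apache combined log format, and the helper name `parse_line` is my own. It is written for Python 3, unlike the Python 2 script above.

```python
import re

# Same pattern as the benchmark scripts, written as a raw string.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$'
)

# Hypothetical sample line in Apache combined log format.
SAMPLE = (
    '127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.com/" "Mozilla/5.0"'
)


def parse_line(line):
    """Return (client, timestamp, method, user_agent), or None if no match."""
    match = LOG_PATTERN.match(line)
    return match.groups() if match else None


if __name__ == "__main__":
    # Prints the four captured groups for the sample line.
    print(parse_line(SAMPLE))
```

The four groups are the client address, the bracketed timestamp, the HTTP method, and the user agent; everything else in the line is matched but not captured.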
Yes, I used the "readline" module, which is currently marked "unstable" in the Node.js documentation. I may try a different approach, but I wanted to use the built-in modules.
    var readline = require('readline'),
        fs = require('fs');

    var rd = readline.createInterface({
        input: fs.createReadStream('../data/access.log'),
        output: process.stdout,
        terminal: false
    });

    rd.on('line', function(line) {
        var matches = line.match(/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m);
        // console.log(matches[1], matches[2], matches[3]);
    });
Considering this technology has been around for only a few years, this is awesome!
    time node accesslog.js
    real 0m0.086s
    user 0m0.086s
    sys 0m0.004s
Update! Instead of using the readline module, I did a simple split.
    var fs = require('fs');

    fs.readFile('../data/access.log', {encoding: 'utf8'}, function (err, data) {
        if (err) throw err;
        var lines = data.split("\n");
        for (var i = 0, ii = lines.length; i < ii; i++) {
            var matches = lines[i].match(/^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$/m);
        }
    });
    time node accesslog.js
    real 0m0.063s
    user 0m0.055s
    sys 0m0.004s
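For comparison, the same two reading strategies (line-by-line streaming versus reading everything and splitting on newlines) can be sketched in Python 3. This is only an illustration and was not part of the timed runs; the function names are my own.

```python
import re

LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$'
)


def count_matches_streaming(path):
    """Read the file one line at a time, like the readline version."""
    count = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if LOG_PATTERN.match(line):
                count += 1
    return count


def count_matches_split(path):
    """Read the whole file into memory, then split on newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        data = handle.read()
    # split("\n") yields a trailing empty string when the file ends with a
    # newline; that entry simply fails to match.
    return sum(1 for line in data.split("\n") if LOG_PATTERN.match(line))
```

Both functions should return the same count for the same file. The split version holds the whole log in memory at once, which is fine for a 7.2MB file but worth keeping in mind for much larger logs.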
I was interested to see how a "compiled" language performs. C# is compiled to bytecode/IL (intermediate language), which the CLR (Common Language Runtime, i.e. the virtual machine) then compiles to native code at run time with a JIT (just-in-time) compiler, much like Java.
    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    public class AccessLog
    {
        public static void Main()
        {
            string[] lines = System.IO.File.ReadAllLines("../data/access.log");
            string pattern = "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([A-Z]+)[^\"]*\" \\d+ \\d+ \"[^\"]*\" \"([^\"]*)\"$";

            foreach (string line in lines)
            {
                Match match = Regex.Match(line, pattern, RegexOptions.IgnoreCase);
                // Console.WriteLine(match.Result("$1"));
            }
        }
    }
    time mono accesslog.exe
    real 0m0.559s
    user 0m0.539s
    sys 0m0.020s
Language/Runtime Environment | Real Time | User Time | System Time |
---|---|---|---|
Python v2.7.6 | .060s | .060s | .000s |
Node.js v0.10.25 (split new line) | .063s | .055s | .004s |
Node.js v0.10.25 (readline) | .086s | .086s | .004s |
PHP v5.5 | .137s | .123s | .012s |
HipHop VM v3.4.0 | .152s | .103s | .048s |
C# on Mono JIT Compiler v3.2.8 | .559s | .539s | .020s |
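One caveat about these numbers: the `time` command measures the whole process, so each figure includes runtime startup and reading the file, not just the regex matching. A rough sketch of how the matching loop alone could be timed in-process is below. It is written for Python 3, the helper name `time_matching` and the repeat count are my own choices, and I have not run it against the benchmark log.

```python
import re
import time

LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([A-Z]+)[^"]*" \d+ \d+ "[^"]*" "([^"]*)"$'
)


def time_matching(lines, repeats=5):
    """Return the best wall-clock time in seconds for matching every line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            LOG_PATTERN.match(line)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Same path as the scripts above; the file is read before timing starts.
    with open("../data/access.log", encoding="utf-8", errors="replace") as handle:
        lines = handle.readlines()
    print("best of 5: %.3fs" % time_matching(lines))
```

Taking the best of several runs reduces noise from disk caching and other processes. An equivalent loop would be needed in each of the other languages for the comparison to be like for like.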
At the end of the day, this is really comparing apples to oranges. However, in this specific task, Python and Node.js are the top performers, at least for now, until I add database connections using each platform's client libraries and actually do something with this data.