I offer a version written in C for inclusion in the benchmark:
https://gist.github.com/nkurz/5e49ba0ddb04e23de03f
I've only tested on Haswell under Linux, but results are good when compiled with gcc-4.8 -O2:
// 8981 LANGUAGE C 623
// 8981 LANGUAGE C++/clang 734
// 8981 LANGUAGE C++/gcc 755
I haven't tested this, but I think the implementation should continue to perform well with larger graph sizes.