VTable Benchmarking

To evaluate the impact of the various vtable modifications I proposed in my last post about D, I built a test program implementing them so I could run some benchmarks. And I got some strange and surprising results.

The Test Program

The test program performs 3 laps, each doing 400 000 000 function calls for each supported type of call. The supported call types are:

Static
A direct call to a function, no function pointer involved, no virtual call.
C++
D Class
C++ or D virtual dispatch built into the language.
Standard
A replica in C++ or D code of what the compiler does when creating a virtual call.
Dynamic
Accessing the virtual table by checking the function offset in a global variable.
Dynamic Indirect
Accessing the virtual table hidden behind a second layer of indirection and by checking the function offset in a global variable.
Objective-C
A dynamically dispatched call to a method of an Objective-C object instance.
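
To make these variants concrete, here's a minimal C++ sketch of how I picture the three hand-rolled mechanisms. This is my own illustration with made-up names, not code taken from the test program:

```cpp
#include <cassert>

// One vtable slot: a function taking the object pointer and returning int.
typedef int (*Fn)(void*);

struct Object {
    Fn* vtable;  // first word of the object points at its vtable
};

static int forty_two(void*) { return 42; }
static Fn the_vtable[] = { forty_two };

// "Standard": the slot index is a compile-time constant -- essentially
// what the compiler itself emits for a virtual call.
int call_standard(Object* o) {
    return o->vtable[0](o);
}

// "Dynamic": the slot index is read from a global variable, so the vtable
// layout could change without recompiling the caller (a non-fragile ABI).
static int slot_forty_two = 0;

int call_dynamic(Object* o) {
    return o->vtable[slot_forty_two](o);
}

// "Dynamic Indirect": the vtable itself sits behind one more pointer.
struct ObjectIndirect {
    Fn** vtable_ptr;
};

int call_dynamic_indirect(ObjectIndirect* o) {
    return (*o->vtable_ptr)[slot_forty_two](o);
}
```

Each variant adds one more memory access to the call sequence, which is what the benchmarks below try to measure.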

If you want to check my results, or just the program, you can download it from here: VTable Benchmarks.zip

Unzip it, and you’ll have a folder containing the source code and five script files named test-dmd.sh, test-dmd.bat, test-g++.sh, test-g++.bat, and test-gdc.sh. Those scripts compile the C++ or D code using the compiler in the script’s name and then run the test (the compiler must be available in your path). The .bat files are for Windows, the .sh files for Linux and Mac OS X. Any arguments you pass to a script are passed on to the compiler, making it easy to test with various optimisation flags.

Feel free to experiment, improve the scripts and modify them to work with whatever compiler and platform that suits you, and please post your thoughts and own results in the comments.

Following are my own results.

Mac OS X (PowerPC G4)

./test-g++.sh -Os

== 3-lap average; totaling 1200000000 function calls ==
Static:           2349424 ticks; 1.000000 (base time)
C++:              5688451 ticks; 2.421211
Standard:         5687754 ticks; 2.420914
Dynamic:          6022387 ticks; 2.563346
Dynamic Indirect: 6691103 ticks; 2.847975
Objective-C:      10912837 ticks; 4.644898

Pretty much what I expected: C++ and Standard yield the same results, Dynamic is a little slower, and Dynamic Indirect slower still. Note that this is the only test in which I can compare with Objective-C’s dynamic dispatch. It’s interesting to note that method dispatching in Objective-C takes 465% of the time of a static call and 192% of the time of a C++ virtual call.

./test-gdc.sh -Os -frelease -finline

== 3-lap average; totaling 1200000000 function calls ==
Static:           234 ticks; 1.000000 (base time)
D class:          568 ticks; 2.423295
Standard:         568 ticks; 2.423295
Dynamic:          602 ticks; 2.565341
Dynamic Indirect: 702 ticks; 2.994318

The overall result for gdc is about the same as for g++, but the tick counts are way off. It’s almost as if you took the tick counts from the C++ test and divided them by 10000. I’m not sure why the clock function in std.c.time is lying, but there’s obviously a bug causing this strange behaviour in GDC. Because of this, we can’t compare ticks from those two tests directly. I think we can safely assume they’re just divided by 10000, because the ratios fit pretty well with those above if you do that, but this may not be very reliable.
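
For what it's worth, C's clock() only counts in a runtime-chosen unit: raw ticks mean nothing until you divide by CLOCKS_PER_SEC, and two runtimes are free to pick different values, which would produce exactly this kind of constant-factor discrepancy. A small sketch of the normalization (function names are mine):

```cpp
#include <cassert>
#include <ctime>

// clock() counts in a runtime-chosen unit: raw ticks are meaningless until
// divided by CLOCKS_PER_SEC, and different runtimes may pick different
// values. Normalizing to seconds makes results comparable across compilers.
double ticks_to_seconds(std::clock_t ticks) {
    return static_cast<double>(ticks) / CLOCKS_PER_SEC;
}

// Example: time a busy loop and report seconds rather than raw ticks.
double time_loop(long iterations) {
    std::clock_t start = std::clock();
    volatile long sink = 0;
    for (long i = 0; i < iterations; ++i) sink += i;
    return ticks_to_seconds(std::clock() - start);
}
```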

Windows (AMD Duron)

test-g++.bat -Os

== 3-lap average; totaling 1200000000 function calls ==
Static:           2373 ticks; 1.000000 (base time)
C++:              3040 ticks; 1.281000
Standard:         3048 ticks; 1.284230
Dynamic:          3310 ticks; 1.394748
Dynamic Indirect: 3595 ticks; 1.514675

Once again, g++ gives me the results I’d expect: C++ virtual calls being almost the same as Standard, Dynamic being a little slower and Dynamic Indirect even more. It’s interesting to note that the gap between Standard and Dynamic is a lot smaller than between Dynamic and Dynamic Indirect.

test-dmd.bat -O -release -inline

== 3-lap average; totaling 1200000000 function calls ==
Static:           674 ticks; 1.000000 (base time)
D class:          6766 ticks; 10.034108
Standard:         2370 ticks; 3.514582
Dynamic:          8679 ticks; 12.871478
Dynamic Indirect: 9653 ticks; 14.315867

Ok… it seems we’ve got a few problems here. Comparing the time for a static call with the one from g++, I’d say the static call was inlined despite my efforts to make that impossible. I’d also guess that Standard has been turned into a static call by the optimizer (the time is almost exactly the same as g++’s static call). So I’d venture that the only reliable data are the tick counts for D class, Dynamic, and Dynamic Indirect. When you compare them, you find that they’re all way higher than for g++, so I’m not sure what to think. Is DMD that bad at doing virtual calls?

I also tested DMD without the -inline switch, but it isn’t much better for comparisons.

test-dmd.bat -O -release

== 3-lap average; totaling 1200000000 function calls ==
Static:           6760 ticks; 1.000000 (base time)
D class:          7100 ticks; 1.050296
Standard:         7107 ticks; 1.051331
Dynamic:          8549 ticks; 1.264694
Dynamic Indirect: 2933 ticks; 0.433974

I’m not sure why the static call is so slow (compared to g++), nor why Dynamic Indirect is faster than them all, but at least the other results seem consistent in their ratios: D virtual dispatch is pretty much equivalent to Standard, and Dynamic is a little slower.

Linux (AMD Duron)

./test-g++.sh -Os

== 3-lap average; totaling 1200000000 function calls ==
Static:           2693333 ticks; 1.000000 (base time)
C++:              3023333 ticks; 1.122525
Standard:         3026666 ticks; 1.123762
Dynamic:          3023333 ticks; 1.122525
Dynamic Indirect: 3363333 ticks; 1.248762

The overall results look good, but there’s one oddity: Dynamic is as fast as Standard and C++ virtual dispatch. Also note that the overhead of a virtual call seems smaller than with g++ on Windows (1.122525 times the time of a static call, compared to 1.281000 on Windows).

./test-gdc.sh -Os -frelease -finline

== 3-lap average; totaling 1200000000 function calls ==
Static:           2690000 ticks; 1.000000 (base time)
D class:          3030000 ticks; 1.126394
Standard:         3023333 ticks; 1.123916
Dynamic:          3030000 ticks; 1.126394
Dynamic Indirect: 3370000 ticks; 1.252788

Pretty much the same as with g++. Dynamic keeps pace with Standard and the built-in virtual dispatch for some unknown reason.

./test-dmd.sh -O -release -inline

== 3-lap average; totaling 1200000000 function calls ==
Static:           1010000 ticks; 1.000000 (base time)
D class:          2650000 ticks; 2.623762
Standard:         2690000 ticks; 2.663366
Dynamic:          2690000 ticks; 2.663366
Dynamic Indirect: 3590000 ticks; 3.554456

Again, we have some interesting results. First, the static call is a lot faster than with g++ or gdc, so it probably got inlined like in the results for Windows. Comparing tick counts, virtual dispatch (D class, Standard, and Dynamic) seems to run at the same speed as the static call made by g++ and gdc. Perhaps the tick counts are in a different unit than the ones from g++ and gdc. If that’s the case, we can still compare the relative scale, which shows that Standard and Dynamic run at the same speed (just like with g++ and gdc) but that the D virtual dispatch mechanism is a little faster for some reason.

Condensed Results

Percent of Static call time

                           Static  Virtual  Dynamic  Dyn. Indirect
Mac G4 g++                   100%     242%     256%           285%
Mac G4 gdc                   100%     243%     257%           300%
Win Duron g++                100%     128%     139%           151%
Win Duron dmd                100%    1004%    1288%          1432%
Win Duron dmd (no-inline)    100%     105%     126%            43%
Linux Duron g++              100%     112%     112%           125%
Linux Duron gdc              100%     113%     113%           125%
Linux Duron dmd              100%     262%     266%           355%

Percent of Virtual call time

                           Static  Virtual  Dynamic  Dyn. Indirect
Mac G4 g++                    41%     100%     106%           118%
Mac G4 gdc                    41%     100%     106%           124%
Win Duron g++                 78%     100%     109%           118%
Win Duron dmd                 10%     100%     128%           143%
Win Duron dmd (no-inline)     95%     100%     120%            41%
Linux Duron g++               89%     100%     100%           111%
Linux Duron gdc               89%     100%     100%           111%
Linux Duron dmd               38%     100%     102%           135%

Conclusions

All those tests seem to show that the overhead of the Dynamic dispatch method is pretty small when repeatedly calling the same function in a small loop. In the C++ benchmarks, it’s 6% more than virtual dispatch on a PowerPC G4, 9% more on Windows on an AMD Duron, and, strangely, exactly the same on Linux. I think that’s fast enough to justify having a non-fragile ABI for class functions in D.

But those results have some limitations. They don’t take into account the initialization overhead of populating, before any object is constructed, the static variables that hold each function’s offset. It could also be argued that repeatedly calling the same function in a tight loop isn’t realistic and gives more optimisation opportunities than you’d normally have in a real program.
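
As a rough sketch of what that one-time initialization could look like, here is a C++ illustration (the names and layout are purely hypothetical, assuming slots are resolved by name once at startup):

```cpp
#include <cassert>
#include <cstring>

typedef int (*Fn)(void*);

static int method_a(void*) { return 1; }
static int method_b(void*) { return 2; }

// The class's vtable, plus the slot names it exports (hypothetical layout).
static const char* slot_names[] = { "method_a", "method_b" };
static Fn vtable[] = { method_a, method_b };

// Global offset variable, filled in once before the first object is built;
// every subsequent call just reads it.
static int offset_method_b = -1;

static void init_offsets() {
    for (int i = 0; i < 2; ++i)
        if (std::strcmp(slot_names[i], "method_b") == 0)
            offset_method_b = i;
}

// A caller compiled against a non-fragile ABI: it never hard-codes the
// slot, so the class can reorder its vtable without breaking this code.
int call_method_b(void* obj) {
    return vtable[offset_method_b](obj);
}
```

Only startup pays the name-lookup cost; the per-call cost is the extra load of the offset variable, which is what the Dynamic benchmark measures.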

It also remains to be investigated why there’s no overhead for Dynamic on Linux. My only guess up to now is that the linker performs the optimisation; that’d explain why it does the same for DMD, GDC and G++ on Linux, when the former doesn’t share the same optimiser as the others. But right now that’s only a guess.

All this calls for a dig into the generated assembly code. I’ve done it on PowerPC to make sure there was no unexpected inlining, but given the results it’d be more interesting to disassemble the x86 executables.


Comments

dhasenan

You can ensure that a function is not inlined by declaring a prototype as extern and then defining it in another module. Then build them separately and link. If a class method is exhibiting the same behavior, use an interface.

Michel Fortin

Thanks Dhasenan, but declaring something as extern in D just changes the linkage, in other words the name mangling and the argument-passing conventions. I doubt it’ll have any impact, and besides, I want to test the default D linkage, which means I’d have to use extern (D), and such a function can’t belong to another module (because the name mangling would be different).

My current theory is that I should build object files separately instead of telling DMD to compile all the files in one go. I’ll have to give that a try.

As for using an interface: calling a function through an interface adds some overhead to get the actual object pointer, so I’d rather avoid it. The inlining problem only appears for static calls anyway, so I won’t bother with classes.