In order to evaluate the impacts of the various vtable modifications I proposed in my last post about D, I built a test program implementing them so I could do some benchmarks. And I got strange and surprising results.
The test program performs 3 laps, each doing 400 000 000 function calls for each supported type of call. The supported types of calls are: Static (a plain function call, used as the base time); C++, or D class in the D version (the language’s own virtual dispatch); Standard (the standard vtable dispatch mechanism); Dynamic (vtable dispatch where the function’s offset is read from a static variable); Dynamic Indirect (the same, with one extra indirection); and, in the Mac C++ test only, Objective-C (Objective-C message dispatch).
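To give an idea of the methodology, here’s a minimal sketch of what one lap of the measurement loop might look like. This is not the actual test code; lap, target, and N_CALLS are names I made up for illustration, and the real program does this once per call type:

    import core.stdc.time : clock, clock_t; // std.c.time in the Phobos of the time
    import std.stdio : writefln;

    enum N_CALLS = 400_000_000;

    // Time one lap: N_CALLS calls through a function pointer taken at
    // runtime, which is meant to keep the compiler from inlining the call.
    clock_t lap(void function() call)
    {
        auto start = clock();
        foreach (i; 0 .. N_CALLS)
            call();
        return clock() - start;
    }

    void target() { } // stand-in for one of the call types

    void main()
    {
        writefln("%s ticks", lap(&target));
    }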
If you want to check my results, or just the program, you can download it from here: VTable Benchmarks.zip
Unzip it, and you’ll have a folder containing the source code and five script files named test-dmd.sh, test-dmd.bat, test-g++.sh, test-g++.bat, and test-gdc.sh. Each script compiles the C++ or D code with the compiler named in its file name and then runs the test (the compiler must be available in your path). The .bat files are for Windows, the .sh files for Linux and Mac OS X. Any arguments you pass to a script are passed on to the compiler too, making it easy to test with various optimisation flags.
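For example, to benchmark at a different optimisation level, you could run something like this (the flag here is just an illustration):

    ./test-g++.sh -O3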
Feel free to experiment, improve the scripts and modify them to work with whatever compiler and platform that suits you, and please post your thoughts and own results in the comments.
Following are my own results.
On the Mac (PowerPC G4):

    ./test-g++.sh -Os
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            2349424 ticks; 1.000000 (base time)
    C++:               5688451 ticks; 2.421211
    Standard:          5687754 ticks; 2.420914
    Dynamic:           6022387 ticks; 2.563346
    Dynamic Indirect:  6691103 ticks; 2.847975
    Objective-C:      10912837 ticks; 4.644898
Pretty much what I expected: C++ and Standard yield the same results, Dynamic is a little slower, and Dynamic Indirect slower still. Note that this is the only test in which I can compare with Objective-C’s dynamic dispatch. It’s interesting to note that method dispatching in Objective-C takes 465% the time of a static call and 192% the time of a C++ virtual call.
    ./test-gdc.sh -Os -frelease -finline
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            234 ticks; 1.000000 (base time)
    D class:           568 ticks; 2.423295
    Standard:          568 ticks; 2.423295
    Dynamic:           602 ticks; 2.565341
    Dynamic Indirect:  702 ticks; 2.994318
The overall result for gdc is about the same as for g++, but the tick counts are way off: it’s almost as if you took the tick counts from the C++ test and divided them by 10000. I’m not sure why the clock function in std.c.time is lying, but there’s obviously a bug causing this strange behaviour in GDC. Because of it, we can’t compare ticks from these two tests directly. I think we can safely assume they’re just divided by 10000, since the results fit pretty well with those above once you do that, but this may not be very reliable.
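If I had to guess, the factor of 10000 looks suspiciously like a tick-rate mismatch: 100 ticks per second instead of the 1 000 000 that CLOCKS_PER_SEC usually is on these platforms. A quick way to check would be to compare what the runtime reports against a busy-wait of known length; the sketch below is only that hypothesis, not something I’ve verified:

    import core.stdc.time : clock, time, CLOCKS_PER_SEC; // std.c.time back then
    import std.stdio : writefln;

    void main()
    {
        // If this prints 1000000 but the busy-wait below only advances
        // clock() by about 200 ticks, the units are mismatched, and that
        // alone accounts for the factor of 10000.
        writefln("CLOCKS_PER_SEC = %s", CLOCKS_PER_SEC);

        auto t0 = time(null);
        auto c0 = clock();
        while (time(null) < t0 + 2) { } // burn two wall-clock seconds of CPU
        writefln("%s clock ticks in 2 seconds", clock() - c0);
    }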
On Windows (AMD Duron):

    test-g++.bat -Os
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            2373 ticks; 1.000000 (base time)
    C++:               3040 ticks; 1.281000
    Standard:          3048 ticks; 1.284230
    Dynamic:           3310 ticks; 1.394748
    Dynamic Indirect:  3595 ticks; 1.514675
Once again, g++ gives me the results I’d expect: C++ virtual calls take almost the same time as Standard, Dynamic is a little slower, and Dynamic Indirect slower still. It’s interesting to note that the gap between Standard and Dynamic is a lot smaller than the one between Dynamic and Dynamic Indirect.
    test-dmd.bat -O -release -inline
    == 3-lap average; totaling 1200000000 function calls ==
    Static:             674 ticks;  1.000000 (base time)
    D class:           6766 ticks; 10.034108
    Standard:          2370 ticks;  3.514582
    Dynamic:           8679 ticks; 12.871478
    Dynamic Indirect:  9653 ticks; 14.315867
Ok… it seems we’ve got a few problems here. Comparing the time for a static call with the one from g++, I’d say the static call was inlined despite my efforts to make that impossible. I’d also guess that Standard has been turned into a static call by the optimizer (its time is almost exactly the same as g++’s static call). So I’d say the only reliable data points are the tick counts for D class, Dynamic and Dynamic Indirect. Compare them and you’ll find they’re all way higher than for g++, so I’m not sure what to think. Is DMD really that bad at doing virtual calls?
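As an aside, the usual way to make inlining impossible is to hide the target behind something the optimizer can’t see through, for example a mutable global function pointer assigned at run time. Here’s a rough sketch of that idea (the names are mine, and as these results suggest, a compiler determined to prove the pointer’s value may defeat it anyway):

    // Hiding a call behind a mutable global function pointer is one
    // common way to keep it from being inlined; whether it works depends
    // on how hard the optimizer tries to prove the pointer's value.
    void staticTarget() { }

    void function() opaque; // mutable, so its value isn't provably constant

    static this() { opaque = &staticTarget; } // assigned at run time

    void main()
    {
        opaque(); // should compile to an indirect call
    }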
I also tested DMD without the -inline switch, but it isn’t much better for comparisons.
    test-dmd.bat -O -release
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            6760 ticks; 1.000000 (base time)
    D class:           7100 ticks; 1.050296
    Standard:          7107 ticks; 1.051331
    Dynamic:           8549 ticks; 1.264694
    Dynamic Indirect:  2933 ticks; 0.433974
I’m not sure why the static call is so slow (compared to g++), nor why Dynamic Indirect is faster than everything else, but at least the other results seem consistent in their ratios: D virtual dispatch is pretty much equivalent to Standard, and Dynamic is somewhat slower.
On Linux (AMD Duron):

    ./test-g++.sh -Os
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            2693333 ticks; 1.000000 (base time)
    C++:               3023333 ticks; 1.122525
    Standard:          3026666 ticks; 1.123762
    Dynamic:           3023333 ticks; 1.122525
    Dynamic Indirect:  3363333 ticks; 1.248762
The overall results look good, but there’s one oddity: Dynamic is as fast as Standard and as C++ virtual dispatch. Also note that the overhead of a virtual call seems smaller than with g++ on Windows (1.122525 times the time of a static call, compared to 1.281000 on Windows).
    ./test-gdc.sh -Os -frelease -finline
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            2690000 ticks; 1.000000 (base time)
    D class:           3030000 ticks; 1.126394
    Standard:          3023333 ticks; 1.123916
    Dynamic:           3030000 ticks; 1.126394
    Dynamic Indirect:  3370000 ticks; 1.252788
Pretty much the same as with g++. Dynamic keeps being as fast as Standard and C++ virtual dispatch, for some unknown reason.
    ./test-dmd.sh -O -release -inline
    == 3-lap average; totaling 1200000000 function calls ==
    Static:            1010000 ticks; 1.000000 (base time)
    D class:           2650000 ticks; 2.623762
    Standard:          2690000 ticks; 2.663366
    Dynamic:           2690000 ticks; 2.663366
    Dynamic Indirect:  3590000 ticks; 3.554456
Again, we have some interesting results. First, the static call is a lot faster than with g++ or gdc, so it probably got inlined, like in the results for Windows. Comparing tick counts, virtual dispatch (D class, Standard, and Dynamic) seems to run at the same speed as the static call made by g++ and gdc; perhaps the tick counts are just in a different unit than the ones from g++ and gdc. If that’s the case, we can still compare the relative scales, which show that Standard and Dynamic run at the same speed (just like with g++ and gdc) but that the D virtual dispatch mechanism is a little faster for some reason.
Here’s a summary of all the results, relative to the time of a static call:

| | Static | Virtual (C++/D class) | Dynamic | Dynamic Indirect |
| --- | --- | --- | --- | --- |
| Mac G4 g++ | 100% | 242% | 256% | 285% |
| Mac G4 gdc | 100% | 243% | 257% | 300% |
| Win Duron g++ | 100% | 128% | 139% | 151% |
| Win Duron dmd | 100% | 1004% | 1288% | 1432% |
| Win Duron dmd (no-inline) | 100% | 105% | 126% | 43% |
| Linux Duron g++ | 100% | 112% | 112% | 125% |
| Linux Duron gdc | 100% | 113% | 113% | 125% |
| Linux Duron dmd | 100% | 262% | 266% | 355% |
And the same numbers rescaled so that the virtual call is the base time (for instance, Mac G4 g++’s Dynamic column is 256% / 242% ≈ 106%):

| | Static | Virtual (C++/D class) | Dynamic | Dynamic Indirect |
| --- | --- | --- | --- | --- |
| Mac G4 g++ | 41% | 100% | 106% | 118% |
| Mac G4 gdc | 41% | 100% | 106% | 124% |
| Win Duron g++ | 78% | 100% | 109% | 118% |
| Win Duron dmd | 10% | 100% | 128% | 143% |
| Win Duron dmd (no-inline) | 95% | 100% | 120% | 41% |
| Linux Duron g++ | 89% | 100% | 100% | 111% |
| Linux Duron gdc | 89% | 100% | 100% | 111% |
| Linux Duron dmd | 38% | 100% | 102% | 135% |
All these tests seem to show that the overhead of the Dynamic dispatch method is pretty small when repeatedly calling the same function in a small loop. In the C++ benchmarks, it’s 6% more than virtual dispatch on a PowerPC G4, 9% more on Windows on an AMD Duron, and, strangely, exactly the same on Linux. I think that’s fast enough to justify having a non-fragile ABI for class functions in D.
But those results have some limitations. They don’t take into account the initialization overhead: the static variables giving the offset of each function must be populated before any object is constructed. It could also be argued that repeatedly calling the same function in a tight loop isn’t realistic and gives the optimizer more opportunities than it would normally have in a real program.
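To make the dispatch variants concrete, here is my rough sketch in D of what the three call forms do. Big caveats apply: it assumes the usual D object layout (vtable pointer in the first word) and that a method’s this can be passed like an ordinary first argument, which is ABI-specific; the names (Slot, fooSlot, fooSlotPtr) are invented, and none of this is the actual test code:

    class C { void foo() { } }

    alias Slot = void function(C); // pretend a method is a plain pointer

    // "Dynamic": the vtable index of foo is not a compile-time constant
    // but sits in a static variable, populated once at startup (the
    // initialization overhead mentioned above).
    size_t fooSlot;

    // "Dynamic Indirect": one extra load to reach the index.
    // (Both variables must be set to foo's real slot before any call.)
    size_t* fooSlotPtr;

    // Standard / D class: the slot index is a constant baked into the call.
    void callStandard(C o) { o.foo(); }

    // Dynamic: the same vtable lookup, but the index is read from a global.
    void callDynamic(C o)
    {
        auto vtbl = *cast(Slot**) cast(void*) o; // vtable ptr is the first word
        vtbl[fooSlot](o);
    }

    // Dynamic Indirect: the index itself is reached through a pointer.
    void callDynamicIndirect(C o)
    {
        auto vtbl = *cast(Slot**) cast(void*) o;
        vtbl[*fooSlotPtr](o);
    }

If this reading is right, the extra cost of Dynamic over Standard is a single load of fooSlot, and Dynamic Indirect pays one more, which matches the small gaps measured above.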
It also remains to be investigated why there’s no overhead for Dynamic on Linux. My only guess so far is that the linker performs the optimisation; that would explain why the same thing happens for DMD, GDC and G++ on Linux, even though the first doesn’t share an optimiser with the other two. But right now that’s only a guess.
All this deserves a dig into the generated assembly code. I’ve done that on PowerPC to make sure there was no unexpected inlining, but given the results it would be more interesting to disassemble the x86 executables.