Timing CPU Code

Best practices for timing CPU execution seem to change with the seasons. One technique I always come back to, especially when dealing with short sections of code, is to use the CPU’s time stamp counter (TSC). Now there are a lot of caveats here: we need to be on x86/x64, and we need to make sure our CPU supports “constant_tsc” (and that the TSC is synchronized among cores). But let’s leave all that discussion for another day and get straight to the technique.

For consistent TSC results with minimal overhead, Intel recommends¹ the following sequence:

1. Issue a serializing instruction (CPUID)
2. Read the time stamp counter, store it (RDTSC)
3. Execute any code we are interested in timing (YOUR CODE HERE)
4. Read the time stamp counter again with a read that waits for all preceding instructions to finish, store it (RDTSCP)
5. Issue a serializing instruction (CPUID)

And here’s the code to do so –

#include <stdint.h>

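/* x86-64 only; GCC/Clang inline assembly syntax. */
static inline __attribute__((always_inline)) uint64_t start_clock(void)
{
        uint64_t hi, lo;

        asm volatile (
                "cpuid\n\t"             /* serialize: earlier instructions finish first */
                "rdtsc\n\t"             /* TSC -> EDX:EAX */
                "mov %%rax, %0\n\t"
                "mov %%rdx, %1\n\t"
                : "=r"(lo), "=r"(hi)
                :
                /* "memory" keeps the compiler from moving loads/stores across this point */
                : "%rax", "%rbx", "%rcx", "%rdx", "memory"
        );
        return hi << 32 | lo;
}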

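static inline __attribute__((always_inline)) uint64_t end_clock(void)
{
        uint64_t hi, lo;

        asm volatile (
                "rdtscp\n\t"            /* waits for earlier instructions, TSC -> EDX:EAX */
                "mov %%rax, %0\n\t"     /* save the reading before CPUID overwrites it */
                "mov %%rdx, %1\n\t"
                "cpuid\n\t"             /* serialize: later instructions cannot start early */
                : "=r"(lo), "=r"(hi)
                :
                /* "memory" keeps the compiler from moving loads/stores across this point */
                : "%rax", "%rbx", "%rcx", "%rdx", "memory"
        );
        return hi << 32 | lo;
}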

Stick this in a .h file, include it, and sandwich your code between these two function calls. Subtract the start time from the end time (elapsed = end - start) and we’ve got a pretty good estimate of the execution time in TSC ticks. On a constant_tsc CPU the counter runs at a fixed rate, so these are reference cycles rather than cycles at whatever frequency the core happens to be running. The overhead of the timing code itself is something like 35 cycles on my machine. Test it in your own environment.
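Here’s a rough sketch of what that sandwich might look like, along with one way to estimate the overhead on your own machine: time an empty region many times and keep the smallest reading. The helpers `measure_overhead`, `time_work` and `spin` are names I made up for illustration, and the minimum-of-N approach is an assumption on my part rather than anything from the whitepaper. The two clock functions are restated so the block compiles on its own (x86-64, GCC or Clang).

```c
#include <stdint.h>

/* start_clock/end_clock as in the post, restated so this sketch compiles
   on its own. __asm__ is used so it also builds under strict -std=c11. */
static inline uint64_t start_clock(void)
{
        uint64_t hi, lo;
        __asm__ volatile ("cpuid\n\t"
                          "rdtsc\n\t"
                          "mov %%rax, %0\n\t"
                          "mov %%rdx, %1\n\t"
                          : "=r"(lo), "=r"(hi)
                          :
                          : "%rax", "%rbx", "%rcx", "%rdx", "memory");
        return hi << 32 | lo;
}

static inline uint64_t end_clock(void)
{
        uint64_t hi, lo;
        __asm__ volatile ("rdtscp\n\t"
                          "mov %%rax, %0\n\t"
                          "mov %%rdx, %1\n\t"
                          "cpuid\n\t"
                          : "=r"(lo), "=r"(hi)
                          :
                          : "%rax", "%rbx", "%rcx", "%rdx", "memory");
        return hi << 32 | lo;
}

/* Hypothetical helper: time an empty region `runs` times and keep the
   smallest reading as an estimate of the timing overhead.
   Returns UINT64_MAX if runs <= 0. */
static uint64_t measure_overhead(int runs)
{
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < runs; i++) {
                uint64_t start = start_clock();
                uint64_t end = end_clock();
                uint64_t elapsed = end - start;
                if (elapsed < best)
                        best = elapsed;
        }
        return best;
}

/* Hypothetical helper: time one call to `work` and subtract the
   estimated overhead, clamping at zero. */
static uint64_t time_work(void (*work)(void), uint64_t overhead)
{
        uint64_t start = start_clock();
        work();
        uint64_t end = end_clock();
        uint64_t elapsed = end - start;
        return elapsed > overhead ? elapsed - overhead : 0;
}

/* Hypothetical workload: a short busy loop the compiler cannot remove. */
static void spin(void)
{
        for (volatile int i = 0; i < 1000; i++)
                ;
}
```

Calling `measure_overhead(1000)` once at startup and passing the result to `time_work(spin, overhead)` gives a corrected reading for a single run. Taking the minimum is meant to filter out interrupts and cache misses, but it only gives a lower bound on the overhead, so treat very small corrected values with suspicion.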

It compiles on GCC and Clang. I haven’t tried it with Visual Studio, but it almost certainly does not work there, since MSVC does not accept GCC-style inline assembly (port it!).
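Since we’re in GCC/Clang territory anyway, their `<cpuid.h>` header offers a quick way to check two of the caveats from the top of this post before trusting any numbers. This is a minimal sketch, not a complete check: `has_invariant_tsc` and `has_rdtscp` are hypothetical helper names, the invariant TSC bit is related to but not the same thing as the Linux “constant_tsc” flag, and nothing here tells you whether the TSC is synchronized among cores.

```c
#include <cpuid.h>   /* GCC/Clang helper header for the CPUID instruction */

/* Hypothetical helper: returns 1 if the CPU advertises an invariant TSC
   (CPUID leaf 0x80000007, EDX bit 8), 0 otherwise. */
static int has_invariant_tsc(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* __get_cpuid returns 0 when the requested leaf is not supported */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
                return 0;
        return (edx >> 8) & 1;
}

/* Hypothetical helper: returns 1 if the RDTSCP instruction is available
   (CPUID leaf 0x80000001, EDX bit 27), 0 otherwise. */
static int has_rdtscp(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
                return 0;
        return (edx >> 27) & 1;
}
```

If `has_rdtscp()` returns 0, end_clock will fault with an illegal instruction, so it’s worth checking once at startup. Virtual machines may hide or alter these bits, so a 0 from `has_invariant_tsc()` there does not necessarily describe the underlying hardware.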



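¹ Gabriele Paoloni described this in his Intel whitepaper, “How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures”.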