What is branch miss and how it can hurt

What exactly compiler optimizations are for?
What happens if compiler fails to optimize the code properly?
How to NOT defeat compiler optimizations?
How to find the problematic code spots?

Outline

Simple experiment
Pipelining and out of the order execution
Finding branch misses with perf
Dealing with branch misses
Conclusion

Simple experiment

Compute sum of integers 1, 2, … 5000
Compute partial sum: use only elements >= 2500

import numpy as np

N = 5000
a = np.arange(1, N+1)
a.sum()
a.sum(where=a>=N/2)

np.random.shuffle(a)
a.sum()
a.sum(where=a>=N/2)

Questions:

Which operation is faster?
Does the order of elements matter?
a.sum(): N additions
a.sum(where=a>=N/2): N/2 additions + N comparisons

Expectations:

sum(where=a>=N/2) is 1.5x slower
The order of elements is irrelevant

python3 intro.py

Reality: (spoiler)

ordered all     : 1+2+...+5000     : 0.613 sec
ordered partial : >= 2500          : 1.184 sec
shuffled all    : 1839+1022+...+1299: 0.639 sec
shuffled partial: >= 2500          : 3.647 sec

Conditional sum is ~ 3 – 6x slower with shuffled array. Why?

Maybe it’s a Python bug?

complete source

#define LOOPS 100000

long long rand_partsum(unsigned N, unsigned limit) {
	long long sum = 0;
	unsigned vec[N];
	for (unsigned i = 0; i < N; i++) {
		vec[i] = i + 1;
	}
	random_shuffle(vec, N);

	for (int k = 0; k < LOOPS; k++) {
		for (unsigned i = 0; i < N; i++) {
			if (vec[i] >= limit) {
				sum += vec[i];
			}
		}
	}
	return sum;
}

It's not very different with C

$ gcc-8 -O0 -static -o rand_partsum rand_partsum.c
$ time ./rand_partsum
real	0m3,457s
user	0m3,411s
sys	0m0,004s

$ gcc-8 -O0 -static -o partsum partsum.c 
$ time ./partsum
real	0m1,319s
user	0m1,302s
sys	0m0,000s

More powerful tool: perf stat

Run the program and collect basic performance metrics

$ sudo sysctl -w kernel.perf_event_paranoid=0
$ perf stat -r 3 ./partsum
$ perf stat -r 3 ./rand_partsum

 Performance counter stats for './partsum' (3 runs):
      1 272,01 msec task-clock                #    0,985 CPUs utilized            ( +-  0,08% )
             5      context-switches          #    0,004 K/sec                    ( +- 12,50% )
             0      cpu-migrations            #    0,000 K/sec                    ( +-100,00% )
            28      page-faults               #    0,022 K/sec                    ( +-  1,18% )
 3 553 269 510      cycles                    #    2,793 GHz                      ( +-  0,08% )
 5 753 454 730      instructions              #    1,62  insn per cycle           ( +-  0,00% )
 1 000 692 747      branches                  #  786,701 M/sec                    ( +-  0,00% )
       272 136      branch-misses             #    0,03% of all branches          ( +-  2,12% )

       1,29187 +- 0,00355 seconds time elapsed  ( +-  0,27% )

 Performance counter stats for './rand_partsum' (3 runs):
      3 203,43 msec task-clock                #    0,995 CPUs utilized            ( +-  0,16% )
            11      context-switches          #    0,003 K/sec                    ( +-  5,25% )
             0      cpu-migrations            #    0,000 K/sec                    ( +-100,00% )
            28      page-faults               #    0,009 K/sec                  
 8 948 550 001      cycles                    #    2,793 GHz                      ( +-  0,16% )
 5 757 979 488      instructions              #    0,64  insn per cycle           ( +-  0,01% )
 1 001 672 449      branches                  #  312,688 M/sec                    ( +-  0,01% )
   179 510 980      branch-misses             #   <b>17,92%</b> of all branches          ( +-  0,09% )

       3,21938 +- 0,00910 seconds time elapsed  ( +-  0,28% )

Pipelining and out of the order execution

CPUs cheat: they don’t execute the (machine code) program as is.

Out of the order execution: run the instruction as soon as the necessary data is available
Pipelining: split instruction into stages. Separate blocks handle each stage. Thus several instructions can make progress within a cycle.

These tricks are good if the CPU can guess which instructions will be executed soon. Hence the branch prediction.

If the branch got mispredicted the pipeline must be reset (expensive!)

Finding branch misses with perf

Often it’s far from obvious where exactly branch misses occur. Use a sampling profiler: perf record. Sample branch misses, and visualize the result as a flame graph

What’s a sampling profiler?
How do I sample branch misses?
What’s a flame graph?

gcc-8 -O0 -g -static -o partsum_both partsum_both.c
perf record --event=branch-miss --call-graph=fp ./partsum_both -o randsum_both.data
perf script --header -i randsum_both.data > randsum_both.perfscript
stackcollapse-perf.pl randsum_both.perfscript > randsum_both.stacks
flamegraph.pl --title='Branch misses, no optimization' randsum_both.stacks > randsum_both.svg

Sampling: interrupt program on an event, save program stack trace
Flame graph: representation of collected stack traces

branch misses flamegraph

How sampling works (bird’s-eye view)

Hardware: Performance monitoring unit (PMU). Programmable event counters. Raise an interrupt on N events (CPU cycles, branch misses, cache misses, etc)
Software: kernel: set counters, handle interrupts, etc
Software: perf: further event processing (which process was on CPU, walk stack trace, record the sample)

Obtaining stack traces

Option 1: frame pointers

perf record --event=branch-miss --call-graph=fp ./partsum_both -o randsum_both.data

Debug info (compile with -g)
Frame pointers (-fno-omit-frame-pointer)

Pros:

Low run time overhead
Does not depend on advanced PMU features

Cons:

Compilers don’t preserve frame pointers by default
All libraries used by app should be compiled with frame pointers

Option 2: DWARF debug info

perf record --event=branch-miss --call-graph=dwarf ./partsum_both -o randsum_both.data

Debug info (compile with -g)

Pros:

Distro software can be profiled
Does not depend on advanced PMU features

Cons:

Extreme run time overhead
Truncated stack traces

Option 3: advanced PMU features

perf record --event=branch-miss --call-graph=lbr ./partsum_both -o randsum_both.data

Pros:

Negligible run time overhead
Does not need debug info

Cons:

Available only in some CPUs (Intel Haswell)
Only userspace stack can be recorded

Kernel part of stack trace

System.map should be readable
/proc/kallsyms should be readable (kernel.kptr_restrict=0)

Sort the input

Branch miss happens only once (approximately after N/2 elements)

Swap the loops

The same branch is taken 100000 in a row

for (unsigned i = 0; i < size; i++) {
	for (int k = 0; k < LOOPS; k++) {
		if (vec[i] >= limit) {
			sum += vec[i];
		}
	}
}

Rewrite the code without branches

unsigned step2(unsigned x, unsigned y) {
    return (1U ^ ((x - y) >> 31);
}

sum += vec[i]*step2(vec[i], limit);

This makes more calculations (including multiplication), but without branch misses the CPU runs 3x instructions per a second

Enable optimizations

$ gcc-8 -O2 -static -o rand_partsum_opt rand_partsum.c
$ perf stat -r3 ./rand_partsum_opt

Performance counter stats for ‘./rand_partsum_opt’ (3 runs):

        370,13 msec task-clock                #    0,971 CPUs utilized            ( +-  1,24% )
             5      context-switches          #    0,014 K/sec                    ( +- 16,54% )
             0      cpu-migrations            #    0,001 K/sec                    ( +-100,00% )
            28      page-faults               #    0,077 K/sec                    ( +-  1,18% )
 1 032 315 732      cycles                    #    2,789 GHz                      ( +-  1,31% )
 4 002 103 254      instructions              #    3,88  insn per cycle           ( +-  0,00% )
   500 476 292      branches                  # 1352,179 M/sec                    ( +-  0,00% )
       109 957      branch-misses             #    0,02% of all branches          ( +-  0,14% )

       0,38122 +- 0,00428 seconds time elapsed  ( +-  1,12% )

 Performance counter stats for './partsum_opt' (3 runs):
        373,75 msec task-clock                #    0,973 CPUs utilized            ( +-  0,57% )
             4      context-switches          #    0,012 K/sec                    ( +- 15,38% )
             0      cpu-migrations            #    0,000 K/sec                  
            28      page-faults               #    0,076 K/sec                    ( +-  1,18% )
 1 038 527 951      cycles                    #    2,779 GHz                      ( +-  0,21% )
 4 001 546 307      instructions              #    3,85  insn per cycle           ( +-  0,00% )
   500 316 183      branches                  # 1338,640 M/sec                    ( +-  0,00% )
       109 015      branch-misses             #    0,02% of all branches          ( +-  0,23% )

       0,38415 +- 0,00238 seconds time elapsed  ( +-  0,62% )

Almost 10x faster than non-optimized variant. Why? if (vec[i] >= limit) -> cmovge (conditional move instruction)

<+104>:	mov    0xc(%rsp),%r15d      # %r15: N
<+109>:	mov    $0x186a0,%r9d        # 0x186a0 == 100000
<+115>:	xor    %r13d,%r13d          # %r13: sum = 0
<+118>:	lea    0x4(%rbp,%r15,4),%r8 # %rbp: &vec[0], %r8: %rbp + 4*%r15 + 4 = &vec[0] + N
<+123>:	nopl   0x0(%rax,%rax,1)

<+128>:	mov    %rbp,%rdx
<+131>:	test   %r12d,%r12d
<+134>:	je     0x400cd8 <rand_partsum+168>
<+136>:	nopl   0x0(%rax,%rax,1)

<+144>:	movslq (%rdx),%rcx   # %rdx: &vec[i], %rcx <- vec[i] 
<+147>:	mov    %rcx,%rsi     # %rsi <- vec[i]
<+150>:	add    %r13,%rcx     # %r13: sum, %rcx = sum + vec[i]
<+153>:	cmp    %ebx,%esi     # %ebx: limit (= N/2); limit > vec[i] ?
<+155>:	cmovge %rcx,%r13     # if yes write %rcx to %r13
<+159>:	add    $0x4,%rdx     # %rdx points to next element
<+163>:	cmp    %r8,%rdx      # %r8: &vec[0] + N
<+166>:	jne    0x400cc0 <rand_partsum+144>

<+168>:	sub    $0x1,%r9d
<+172>:	jne    0x400cb0 <rand_partsum+128>

Conclusion

More is less: just because the program exectues more instructions does NOT mean it’s slower
- Measure instructions per cycle, branch misses
perf stat is your friend
Don’t benchmark Debug builds, ever
Better not use Debug builds at all
Premature optimization is root of all evil

P.S. Processor performance is not measured in GHz

The following table lists the run time of non-optimized (compiled with -O0) partsum and rand_partsum measured with perf stat on different CPUs/boards:

	Intel Core i7-7700HQ	AMD Ryzen 5 2600	Baikal-T1 (MIPS P5600)	U740 (RISC-V)	U540 (RISC-V)
instructions	5 756 616 461	5 755 856 610	12 522 218 790	10 554 022 119	10 533 849 293
seconds user	0.96 / 2.40	1.09 / 1.66	6.20 / 11.35	9.46 / 10.82	16.10 / 17.32
insns per cycle	1.63 / 0.66	1.40 / 0.90	1.69 / 0.85	0.93 / 0.81	0.65 / 0.61
branch miss, %	0.03% / 17,32%	0.02% / 8.20%	0.03% / 23.93%	– N/A –	– N/A –
clock rate	2.8 GHz	3.4 GHz	1.2 GHz	1 GHz	1 GHz

(branch miss counts are not available for RISC-V boards due to kernel issues)

	Loongson 3A5000
instructions	10 262 426 188
seconds user	1.85 / 4.26
insns per cycle	2.22 / 0.96
branch miss, %	0.10% / 23.51%
clock rate	2.5 GHz

Baikal-T1 is indeed superscalar and out-of-order. Thus partsum is almost 2x faster than rand_partsum on this SoC.
U740 is superscalar, but in-order. So superscalar execution makes partsum only 15% faster
U540 is in-order with a classical pipeline. No superscalar execution here.
The simple the core the less the difference between partsum and rand_partsum run time is
U740, U540 have the same clock rate, however the partsum benchmark runs 1.7x faster on U740
Baikal-T1 runs partsum 1.5x faster than U740, however the clock rate of Baikal-T1 is only 1.2x faster
The clock rate of Core i7-7700HQ is 2.3 faster than that of Baikal-T1, however Core i7 runs partsum almost 10x faster.

What is branch miss and how it can hurt

Outline

Simple experiment

Maybe it’s a Python bug?

More powerful tool: perf stat

Pipelining and out of the order execution

Finding branch misses with perf

How sampling works (bird’s-eye view)

Obtaining stack traces

Option 1: frame pointers

Option 2: DWARF debug info

Option 3: advanced PMU features

Kernel part of stack trace

Dealing with branch misses

Sort the input

Swap the loops

Rewrite the code without branches

Enable optimizations

Conclusion

P.S. Processor performance is not measured in GHz

Описание

Конвейеры