Branch prediction: Why CPUs can’t wait
Recently, I have had some free time and started learning low-level computer fundamentals, trying to practice and understand the concepts in greater detail. Along the way, I learned about a modern compiler infrastructure named LLVM, a target-independent optimizer and code generator that is used as part of many compilers for programming languages like Rust, Zig, and Haskell. While diving into LLVM’s aggressive optimization features, I had a humbling realization: even the most advanced compiler optimizations cannot save you from a fundamental limitation at the hardware level. If a program has unpredictable branches, all of the clever LLVM optimizations become secondary to something called “branch prediction”: poor branch patterns can easily negate the benefits of multiple LLVM optimizations.
There are some latency numbers that every programmer should know, according to Jeff Dean, and one of them is the branch misprediction penalty: around 5 ns as of 2012, and roughly the same at the time of writing this post. So what is branch prediction, what happens when a branch is mispredicted, and why is it costly? Before diving into the explanation, it’s worth looking at the instruction execution pattern the CPU ideally expects. Memory layout is linear: when a program is compiled and stored in memory, its instructions form a contiguous sequence:
| Memory Address | Instruction | PC Value |
| --- | --- | --- |
| 0x1000 | MOV EAX, 5 | PC = 0x1000 |
| 0x1003 | ADD EAX, 10 | PC = 0x1003 |
| 0x1006 | MOV [0x2000], EAX | PC = 0x1006 |
| 0x100A | RET | PC = 0x100A |
The natural way to read this is from top to bottom, just like reading a book. In a CPU, there is a special register named Program Counter (PC), and its job is to record the address of the current instruction. The CPU’s program counter naturally increments to the next sequential instruction address:
[math]PC = PC + \text{instruction size}[/math]
The instruction size is the size of the current instruction. For example, after executing MOV EAX, 5 (3 bytes), the CPU automatically sets:
[math]PC = \text{0x1000} + 3 = \text{0x1003}[/math]
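To make the mechanism concrete, here is a tiny sketch (my own toy model, not real hardware behavior) of a sequential fetch loop, using the addresses and instruction sizes from the table above:
// A toy model of sequential fetch: each entry is (address, size in bytes),
// mirroring the example table above.
fn main() {
    let program: [(u32, u32); 4] = [
        (0x1000, 3), // MOV EAX, 5
        (0x1003, 3), // ADD EAX, 10
        (0x1006, 4), // MOV [0x2000], EAX
        (0x100A, 1), // RET
    ];
    let mut pc: u32 = 0x1000;
    for (addr, size) in program {
        assert_eq!(pc, addr); // sequential code: PC always matches the next address
        pc += size;           // PC = PC + instruction size
    }
    println!("final PC = {pc:#06x}");
}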
So the CPU can determine the address of the next instruction and then execute it. But one important detail is that completing an instruction involves multiple internal stages, which raises a question: does the CPU simply wait for all of the stages of one instruction to finish before starting the next one? To answer this, let’s first look at the internal stages that a typical processor performs for each instruction:
- Instruction Fetch (IF): Get the instruction from memory
- Instruction Decode (ID): Figure out what it means
- Execute (EX): Perform the operation
- Memory Access (MEM): Access memory if needed
- Write Back (WB): Store the result in a register
Modern CPUs employ a technique called instruction pipelining: even though one instruction hasn’t finished all of its stages, the next instruction can be started immediately, the idea being to exploit all of the hardware the CPU has. Each stage is handled by a different hardware unit (fetch units, decode units, execute units, and so on), so while an execute unit works on the execute stage of the current instruction, a fetch unit can work on the fetch stage of the next instruction in parallel.
┌─────────────────────────────────────────────────────────────┐
│ Multiple Fetch Units │ Multiple Decode Units │
│ ┌─────┐ ┌─────┐ │ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ IF1 │ │ IF2 │ │ │ ID1 │ │ ID2 │ │ ID3 │ │
│ └─────┘ └─────┘ │ └─────┘ └─────┘ └─────┘ │
├─────────────────────────┼───────────────────────────────────┤
│ Multiple Execute Units │ Multiple Memory Units │
│ ┌─────┐ ┌─────┐ ┌─────┐│ ┌─────┐ ┌─────┐ │
│ │ALU1 │ │ALU2 │ │ FPU ││ │LD1 │ │ST1 │ │
│ └─────┘ └─────┘ └─────┘│ └─────┘ └─────┘ │
└─────────────────────────┴───────────────────────────────────┘
Timeline with pipelining for independent instructions (modern CPUs):
Time:   1  2  3  4  5  6  7  8  9  10
Inst1:  F  D  E  M  W
Inst2:     F  D  E  M  W
Inst3:        F  D  E  M  W
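Here is a rough sketch (not cycle-accurate, just an illustration) that prints the chart above for independent instructions, each entering the pipeline one cycle after the previous:
// Prints a pipeline diagram: instruction i starts at cycle i and
// occupies one stage per cycle (F, D, E, M, W).
fn main() {
    const STAGES: [&str; 5] = ["F", "D", "E", "M", "W"];
    let num_insts: usize = 3;
    let total_cycles = num_insts - 1 + STAGES.len();

    print!("Time: ");
    for t in 1..=total_cycles {
        print!("{t:>3}");
    }
    println!();

    for i in 0..num_insts {
        print!("Inst{}:", i + 1);
        for t in 0..total_cycles {
            // Instruction i is in stage (t - i) at cycle t, if in range.
            if t >= i && t < i + STAGES.len() {
                print!("{:>3}", STAGES[t - i]);
            } else {
                print!("   ");
            }
        }
        println!();
    }
}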
Pipelining works perfectly with sequential code because the CPU can confidently predict what comes next. When processing instruction N, it knows that instruction N + 1 is located at PC + instruction_size. This predictability allows the pipeline to stay full and efficient.
But there is a problem when the sequential flow is broken and the CPU needs to jump to a location that is not simply the next instruction. For example, when it encounters an instruction like JE target_address, the CPU faces uncertainty: should it fetch the instruction at target_address, or continue sequentially with the very next instruction? The pipeline needs to know NOW, but the condition won’t be resolved for at least several cycles. Look at this code snippet:
0x1000: CMP EAX, 10 ; Compare EAX with 10 (result available a few cycles later)
0x1003: JE 0x2000 ; Jump if equal (2 bytes instruction)
0x1005: MOV EBX, 1 ; Path A: Set EBX = 1
0x1008: ADD ECX, EBX ; Path A: Add to ECX
...
0x2000: MOV EBX, 99 ; Path B: Set EBX = 99
After fetching the JE 0x2000 instruction at address 0x1003, the Fetch Hardware Unit needs to fetch the next instruction again, but this time it doesn’t yet know whether that will be the instruction at 0x1005 or the one at 0x2000, because the result of the preceding comparison isn’t available yet. We can visualize this as:
Cycle 1:
Fetch: Reading CMP EAX, 10 from address 0x1000
Decode: [empty]
Execute: [empty]
Memory: [empty]
Write: [empty]
PC = 0x1003 (already moved to next instruction)
Cycle 2:
Fetch: Reading JE 0x2000 from address 0x1003
Decode: Decoding CMP EAX, 10
Execute: [empty]
Memory: [empty]
Write: [empty]
PC = 0x1005 (moved again!)
Cycle 3: THE CRITICAL MOMENT
Fetch: ??? MUST decide what to fetch next ???
Decode: Decoding JE 0x2000
Execute: Executing CMP EAX, 10 (comparison in progress...)
Memory: [empty]
Write: [empty]
PC = ??? Should it be 0x1005 or 0x2000 ???
What should the CPU do now? Theoretically, it can wait until the result of the previous instruction is available; at that point, it will know for certain which address should be fetched and executed next. But while it waits, some of the hardware units just sit idle, and how many cycles are wasted depends heavily on how long the condition takes to resolve, from a few cycles up to hundreds if the comparison depends on a memory load.
Instead of stalling the pipeline (which would waste cycles), the CPU makes a prediction about which direction the branch will take and immediately starts fetching instructions from the predicted path. This creates 2 possible outcomes:
- Correct prediction: The pipeline continues at full speed and no penalty
- Incorrect prediction: The CPU must flush all speculatively executed instructions and restart from the correct path, incurring a significant penalty.
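Back-of-the-envelope, the average cost per branch is simply the misprediction rate times the flush penalty. A minimal sketch, assuming a 17-cycle penalty (the mid-range figure used later in this post):
// Expected cycles lost per branch = miss_rate * flush_penalty.
// A correct prediction costs (approximately) nothing.
fn expected_branch_cost(miss_rate: f64, flush_penalty_cycles: f64) -> f64 {
    miss_rate * flush_penalty_cycles
}

fn main() {
    // 50% accuracy is roughly what a purely random branch gives you.
    for accuracy in [0.99, 0.95, 0.50] {
        let wasted = expected_branch_cost(1.0 - accuracy, 17.0);
        println!("{:>4.0}% accuracy -> {wasted:.2} wasted cycles per branch", accuracy * 100.0);
    }
}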
It’s hard to imagine why a few wasted cycles would matter, but the effect becomes more noticeable as the number of branches in a program grows. Several kinds of branch instructions break the natural sequential flow; the most obvious one is the if-else statement. Here is a performance challenge: two identical algorithms process the same e-commerce data, one with randomly arranged products, the other with sorted products. Which do you think runs faster?
fn process_predictable_data(&self) -> f64 {
    let mut total_revenue = 0.0;
    for product in &self.predictable_products {
        if product.price < 100.0 { // PREDICTABLE: mostly true early on
            if product.in_stock {
                total_revenue += product.price;
                if product.is_premium {
                    total_revenue += 5.0; // premium bonus
                }
            }
        } else if product.price < 300.0 { // PREDICTABLE: mostly true in the middle
            if product.in_stock {
                total_revenue += product.price * 1.1; // markup
                if product.is_premium {
                    total_revenue += 15.0;
                }
            }
        } else { // PREDICTABLE: mostly true at the end
            if product.in_stock {
                total_revenue += product.price * 1.2; // higher markup
                if product.is_premium {
                    total_revenue += 25.0;
                }
            }
        }
    }
    total_revenue
}
fn process_unpredictable_data(&self) -> f64 {
    let mut total_revenue = 0.0;
    // Each condition could be true or false at random
    for product in &self.random_products {
        if product.price < 100.0 { // UNPREDICTABLE: random true/false
            if product.in_stock {
                total_revenue += product.price;
                if product.is_premium {
                    total_revenue += 5.0;
                }
            }
        } else if product.price < 300.0 { // UNPREDICTABLE: random true/false
            if product.in_stock {
                total_revenue += product.price * 1.1;
                if product.is_premium {
                    total_revenue += 15.0;
                }
            }
        } else { // UNPREDICTABLE: random true/false
            if product.in_stock {
                total_revenue += product.price * 1.2;
                if product.is_premium {
                    total_revenue += 25.0;
                }
            }
        }
    }
    total_revenue
}
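For context, here is one way the surrounding harness could look. This is a sketch under assumptions: the Product struct and its fields are inferred from the snippets above, data generation uses a hand-rolled LCG to avoid external crates, and the actual benchmark behind the numbers below may differ:
use std::time::Instant;

// Product fields inferred from the processing functions above (an assumption).
struct Product {
    price: f64,
    in_stock: bool,
    is_premium: bool,
}

// Same branchy logic as the two functions above, shared by both runs.
fn process(products: &[Product]) -> f64 {
    let mut total_revenue = 0.0;
    for p in products {
        if p.price < 100.0 {
            if p.in_stock {
                total_revenue += p.price;
                if p.is_premium { total_revenue += 5.0; }
            }
        } else if p.price < 300.0 {
            if p.in_stock {
                total_revenue += p.price * 1.1;
                if p.is_premium { total_revenue += 15.0; }
            }
        } else if p.in_stock {
            total_revenue += p.price * 1.2;
            if p.is_premium { total_revenue += 25.0; }
        }
    }
    total_revenue
}

// Tiny LCG so the sketch needs no external crates.
fn lcg(state: &mut u64) -> u64 {
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    *state >> 33
}

fn main() {
    let mut rng: u64 = 42;
    let mut products: Vec<Product> = (0..100_000)
        .map(|_| Product {
            price: (lcg(&mut rng) % 500) as f64, // spread across all three price branches
            in_stock: lcg(&mut rng) % 10 < 8,
            is_premium: lcg(&mut rng) % 10 < 3,
        })
        .collect();

    // Sorted by price: each price check flips outcome only once over the scan.
    products.sort_by(|a, b| a.price.partial_cmp(&b.price).unwrap());
    let t = Instant::now();
    println!("sorted:   revenue = {:.2}, took {:?}", process(&products), t.elapsed());

    // Fisher-Yates shuffle: the same checks become unpredictable coin flips.
    for i in (1..products.len()).rev() {
        let j = (lcg(&mut rng) as usize) % (i + 1);
        products.swap(i, j);
    }
    let t = Instant::now();
    println!("shuffled: revenue = {:.2}, took {:?}", process(&products), t.elapsed());
}
Because both runs see the same multiset of products, the revenues match exactly; only the branch behavior differs.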
Before running the benchmark, let’s think about it:
- Both functions process the same 100K products
- Both use identical logic and calculations
- Both access the same amount of memory
- The only difference is data ordering
Here are the results:
🎯 Test 1: Predictable Branch Pattern
Data: Products sorted by price (low → high)
Branch behavior: Price checks are predictable
Revenue: $21068967.18
Result: 89.572224s total
🎲 Test 2: Unpredictable Branch Pattern
Data: Same products, randomly shuffled
Branch behavior: Price checks are unpredictable
Revenue: $21068967.18
Result: 142.038094625s total
What just happened? Both tests processed identical data and produced the exact same revenue of $21068967.18, yet the unpredictable version took roughly 60% longer. The only difference is the order of the products in memory. This dramatic gap reveals the hidden cost of branch prediction. In the code, each if-else block creates a conditional jump where the CPU must predict which path to take and execute speculatively; it makes an educated guess about which branch to follow.
With sorted data (Test 1): The CPU’s predictions are highly accurate because the price comparisons are predictable: cheap products come early, expensive products later.
With random data (Test 2): Since the products are randomly ordered with different prices mixed together, this unpredictable pattern makes it much harder for the CPU to guess correctly. The same price comparisons become unpredictable coin flips. The CPU frequently guesses wrong, forcing it to discard speculatively executed instructions and restart from the correct path.
The penalty grows with the number of products and branches, turning a few wasted cycles per misprediction into a significant performance cost that can mean real money in production systems.
Now we have a better idea of which data patterns make it easier for the CPU to guess branch directions. But what if there were no branch prediction at all, and the CPU just waited for the result of each branch condition? Let’s first count the branches in the revenue-calculation function. For simplicity, we only count the conditional jumps (i.e., the if-else-if-else statements) and ignore the loop overhead (it matters for branch counting, but we skip it for now):
- Branch 1: should we jump into the < 100 price block? (YES/NO decision)
- Branch 2: should we jump into the 100-300 price block? (YES/NO decision)
- The final else statement doesn’t jump anywhere; it just continues execution sequentially
- Branch 3: the stock check if product.in_stock (always executed; exactly one of the 3 stock checks runs, depending on the price range)
- Branch 4: the premium check if product.is_premium (only executed if product.in_stock is true)
100K products with 4 branches each gives us 400K branches in total. To make the calculation more concrete, let’s pin down some CPU specifications as well:
- Pipeline depth: 14 stages [Intel Optimization Manual 2021]
- Frequency: [math]f_{CPU} = 3\ \text{GHz} = 3 \times 10^9\ \text{cycles/second}[/math]
- Branch misprediction penalty: 15-22 cycles (Intel 12th gen: ~16-20, AMD Zen 4: ~18-22)
- Branch resolution time: 2-3 cycles for register comparison [Intel document]
We first need to estimate the base execution time (all instructions except branches), which includes:
- Memory load/store for product data
- Arithmetic operations (price calculations, revenue additions)
- Loop iteration overhead
- Function call overhead
Estimation for 100K products:
Instructions per product: ~8-12 (loads, arithmetic, stores)
Total instructions: 100,000 × 10 = 1,000,000 instructions
CPI (Cycles Per Instruction): ~1.2 (modern superscalar CPU)
Base cycles: 1,000,000 × 1.2 = 1,200,000 cycles
[math]\large{T_{base} = \frac{1{,}200{,}000}{3 \times 10^9}\ \text{s} = 0.4\ \text{ms}}[/math]
Let’s assume that each branch resolution takes 3 cycles; then we can derive the pipeline-stall formula for the case with no branch prediction:
[math]\large{T_\text{no_pred} = \frac{N_\text{branches} \times C_\text{resolution}}{f_{CPU}} = \frac{400{,}000 \times 3}{3 \times 10^9}\ \text{s} = 400 \times 10^{-6}\ \text{s} = 0.40\ \text{ms}}[/math]
How much time does this waste compared with branch prediction enabled? As of the time of writing this post, the branch prediction accuracy of modern CPUs is around 90-97%, depending on workload characteristics:
- Highly regular patterns (loop, sorted data): 95 – 97%
- Mixed patterns (typical applications): 90 – 95%
- and other patterns…
So the branch overhead formula will be:
[math]\large{T_\text{branch_pred} = \frac{N_\text{branches} \times (R_\text{miss} \times P_\text{miss} + R_\text{hit} \times P_\text{hit})}{f_{CPU}}}[/math]
Where:
- [math]N_\text{branches} = 400{,}000[/math]
- [math]R_\text{miss} = 0.05\ (\text{5% chance of guessing wrong})[/math]
- [math]P_\text{miss} = 17\ (\text{each failed prediction costs 17 cycles})[/math]
- [math]R_\text{hit} = 0.95\ (\text{95% chance of guessing right})[/math]
- [math]P_\text{hit} = 0\ (\text{a successful prediction carries no penalty})[/math]
We just need to plug these numbers into the formula to see how long it theoretically takes a typical machine to handle the branches of 100K products with a 95% prediction success rate:
[math]\large{T_\text{branch_pred} = \frac{400{,}000 \times (0.05 \times 17 + 0.95 \times 0)}{3 \times 10^9}\ \text{s} = \frac{400{,}000 \times 0.85}{3 \times 10^9}\ \text{s} \approx 0.113\ \text{ms}}[/math]
Now we can calculate the total execution time when there is no branch prediction, and when there is branch prediction:
[math]\large{T_\text{total_no_pred}\ =\ T_\text{base}\ +\ T_\text{no_pred} = 0.4\ +\ 0.4\ =\ 0.8\text{ms}}\\\large{T_\text{total_pred}\ =\ T_\text{base}\ +\ T_\text{branch_pred} = 0.4\ +\ 0.113\ =\ 0.513\text{ms}}[/math]
Finally, we can calculate the slowdown factor:
[math]\large{\text{Slowdown Factor}\ =\ \frac{0.8}{0.513}\ \approx\ 1.56\ \text{times}}[/math]
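To double-check the arithmetic, here is a small sketch that plugs the same assumed numbers (400K branches, 3-cycle resolution, 17-cycle flush penalty, 95% accuracy, 3 GHz) into the formulas above:
// Recomputes the worked example above from its assumptions.
fn main() {
    let f_cpu: f64 = 3.0e9;            // 3 GHz, in cycles per second
    let n_branches: f64 = 400_000.0;   // 100K products x 4 branches each
    let base_cycles: f64 = 1_000_000.0 * 1.2; // ~1M instructions at CPI 1.2

    let t_base = base_cycles / f_cpu;

    // No prediction: every branch stalls for the full resolution time (3 cycles).
    let t_no_pred = n_branches * 3.0 / f_cpu;

    // With prediction: only the 5% of mispredictions pay the 17-cycle flush penalty.
    let t_pred = n_branches * (0.05 * 17.0 + 0.95 * 0.0) / f_cpu;

    println!("T_base        = {:.3} ms", t_base * 1e3);
    println!("T_no_pred     = {:.3} ms", t_no_pred * 1e3);
    println!("T_branch_pred = {:.3} ms", t_pred * 1e3);
    println!("slowdown      = {:.2}x", (t_base + t_no_pred) / (t_base + t_pred));
}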
| | With Prediction | Without Prediction | Difference |
| --- | --- | --- | --- |
| Base Execution | 0.40 ms | 0.40 ms | 0 ms |
| Branch Overhead | 0.113 ms | 0.40 ms | +0.287 ms |
| Total Time | 0.513 ms | 0.80 ms | +56% slower |
The key insight is that without branch prediction, this workload would run about 56% slower, and that saving comes for free. This is just a small example; in the real world, billions of predictions happen every second on every smart device we use, completely transparently.
Conclusion
Every if statement is a question that the CPU needs to answer. What we can do is make the question easy to predict by:
- Process sorted data when possible
- Batch similar operations together
- Avoiding random conditional logic in performance-critical paths
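As a minimal sketch of the first point (assuming the sort cost is amortized by a processing loop that dominates it):
// When the hot loop branches on price, sorting once up front turns
// random branch outcomes into one long predictable run.
fn total_cheap_revenue(prices: &mut [f64]) -> f64 {
    prices.sort_by(|a, b| a.partial_cmp(b).unwrap()); // one-time cost
    prices
        .iter()
        .take_while(|&&p| p < 100.0) // true, true, ..., true, then always false
        .sum()
}

fn main() {
    let mut prices = vec![250.0, 42.0, 310.0, 99.0, 7.5];
    println!("cheap revenue: {}", total_cheap_revenue(&mut prices)); // 148.5
}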
As we’ve seen in the e-commerce example, sometimes the most powerful optimization isn’t changing the algorithm; it’s simply organizing the data to work with the hardware instead of against it.