Branch prediction: Why CPUs can’t wait
Recently, I have had some free time and started learning low-level computer fundamentals, trying to practice and understand the concepts in greater detail. Along the way, I learned about a modern compiler infrastructure named LLVM, a target-independent optimizer and code generator that is used as part of many compilers for programming languages like Rust, Zig, and Haskell. While diving into LLVM’s aggressive optimization features, I had a humbling realization: even the most advanced compiler optimizations cannot save you from a fundamental limitation at the hardware level. If a program has unpredictable branches, all of the clever LLVM optimizations become secondary to something called “branch prediction”: poor branch patterns can easily negate the benefits of multiple LLVM optimizations.
There are some latency numbers that every programmer should know, according to Jeff Dean, and one of them is the branch misprediction penalty: around 5 ns as of 2012, and roughly the same at the time of writing this post. So what is branch prediction, what happens when a branch is mispredicted, and why is it costly? Before diving into the explanation, it’s worth looking at the instruction execution pattern the CPU ideally expects. Memory layout is linear: when a program is compiled and stored in memory, its instructions form a contiguous sequence:
| Memory Address | Instruction | PC Value |
| --- | --- | --- |
| 0x1000 | MOV EAX, 5 | PC = 0x1000 |
| 0x1003 | ADD EAX, 10 | PC = 0x1003 |
| 0x1006 | MOV [0x2000], EAX | PC = 0x1006 |
| 0x100A | RET | PC = 0x100A |
The natural way to read this is from top to bottom, just like reading a book. In a CPU, there is a special register named Program Counter (PC), and its job is to record the address of the current instruction. The CPU’s program counter naturally increments to the next sequential instruction address:
[math]PC = PC + \text{instruction size}[/math]
The instruction size is the size of the current instruction. For example, after executing MOV EAX, 5 (3 bytes), the CPU automatically sets:
[math]PC = \text{0x1000} + 3 = \text{0x1003}[/math]
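To make the mechanism concrete, here is a tiny sketch (my own toy model, not real hardware behavior) of a sequential fetch loop, using the addresses and instruction sizes from the table above:
// A toy model of sequential fetch: each entry is (address, size in bytes),
// mirroring the example table above.
fn main() {
    let program: [(u32, u32); 4] = [
        (0x1000, 3), // MOV EAX, 5
        (0x1003, 3), // ADD EAX, 10
        (0x1006, 4), // MOV [0x2000], EAX
        (0x100A, 1), // RET
    ];
    let mut pc: u32 = 0x1000;
    for (addr, size) in program {
        assert_eq!(pc, addr); // sequential code: PC always matches the next address
        pc += size;           // PC = PC + instruction size
    }
    println!("final PC = {pc:#06x}");
}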
So the CPU can determine the address of the next instruction and then execute it. But one important detail is that completing an instruction involves multiple internal stages, which raises a question: does the CPU simply wait for all of the stages of one instruction to finish before starting the next one? To answer this, let’s first look at the internal stages that a typical processor performs for each instruction:
- Instruction Fetch (IF): Get the instruction from memory
- Instruction Decode (ID): Figure out what it means
- Execute (EX): Perform the operation
- Memory Access (MEM): Access memory if needed
- Write Back (WB): Store the result in a register
Modern CPUs employ a technique called instruction pipelining: even though one instruction hasn’t finished all of its stages, the next instruction can be started immediately, the idea being to exploit all of the hardware the CPU has. Each stage is handled by a different hardware unit (fetch units, decode units, execute units, and so on), so while an execute unit works on the execute stage of the current instruction, a fetch unit can work on the fetch stage of the next instruction in parallel.
┌─────────────────────────────────────────────────────────────┐
│ Multiple Fetch Units │ Multiple Decode Units │
│ ┌─────┐ ┌─────┐ │ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ IF1 │ │ IF2 │ │ │ ID1 │ │ ID2 │ │ ID3 │ │
│ └─────┘ └─────┘ │ └─────┘ └─────┘ └─────┘ │
├─────────────────────────┼───────────────────────────────────┤
│ Multiple Execute Units │ Multiple Memory Units │
│ ┌─────┐ ┌─────┐ ┌─────┐│ ┌─────┐ ┌─────┐ │
│ │ALU1 │ │ALU2 │ │ FPU ││ │LD1 │ │ST1 │ │
│ └─────┘ └─────┘ └─────┘│ └─────┘ └─────┘ │
└─────────────────────────┴───────────────────────────────────┘
Timeline with pipelining for independent instructions (modern CPUs):
Time:   1  2  3  4  5  6  7  8  9  10
Inst1:  F  D  E  M  W
Inst2:     F  D  E  M  W
Inst3:        F  D  E  M  W
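Here is a rough sketch (not cycle-accurate, just an illustration) that prints the chart above for independent instructions, each entering the pipeline one cycle after the previous:
// Prints a pipeline diagram: instruction i starts at cycle i and
// occupies one stage per cycle (F, D, E, M, W).
fn main() {
    const STAGES: [&str; 5] = ["F", "D", "E", "M", "W"];
    let num_insts: usize = 3;
    let total_cycles = num_insts - 1 + STAGES.len();

    print!("Time: ");
    for t in 1..=total_cycles {
        print!("{t:>3}");
    }
    println!();

    for i in 0..num_insts {
        print!("Inst{}:", i + 1);
        for t in 0..total_cycles {
            // Instruction i is in stage (t - i) at cycle t, if in range.
            if t >= i && t < i + STAGES.len() {
                print!("{:>3}", STAGES[t - i]);
            } else {
                print!("   ");
            }
        }
        println!();
    }
}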
Pipelining works perfectly with sequential code because the CPU can confidently predict what comes next. When processing instruction N, it knows that instruction N + 1 is located at PC + instruction_size. This predictability allows the pipeline to stay full and efficient.
But there is a problem when the sequential flow is broken and the CPU needs to jump to a location that is not simply the next instruction. For example, when it encounters an instruction like JE target_address, the CPU faces uncertainty: should it fetch the instruction at target_address, or continue sequentially with the very next instruction? The pipeline needs to know NOW, but the condition won’t be resolved for at least several cycles. Look at this code snippet:
0x1000: CMP EAX, 10 ; Compare EAX with 10 (result available a few cycles later)
0x1003: JE 0x2000 ; Jump if equal (2 bytes instruction)
0x1005: MOV EBX, 1 ; Path A: Set EBX = 1
0x1008: ADD ECX, EBX ; Path A: Add to ECX
...
0x2000: MOV EBX, 99 ; Path B: Set EBX = 99
After fetching the JE 0x2000 instruction at address 0x1003, the Fetch Hardware Unit needs to fetch the next instruction again, but this time it doesn’t yet know whether that will be the instruction at 0x1005 or the one at 0x2000, because the result of the preceding comparison isn’t available yet. We can visualize this as:
Cycle 1:
Fetch: Reading CMP EAX, 10 from address 0x1000
Decode: [empty]
Execute: [empty]
Memory: [empty]
Write: [empty]
PC = 0x1003 (already moved to next instruction)
Cycle 2:
Fetch: Reading JE 0x2000 from address 0x1003
Decode: Decoding CMP EAX, 10
Execute: [empty]
Memory: [empty]
Write: [empty]
PC = 0x1005 (moved again!)
Cycle 3: THE CRITICAL MOMENT
Fetch: ??? MUST decide what to fetch next ???
Decode: Decoding JE 0x2000
Execute: Executing CMP EAX, 10 (comparison in progress...)
Memory: [empty]
Write: [empty]
PC = ??? Should it be 0x1005 or 0x2000 ???
What should the CPU do now? Theoretically, it can wait until the result of the previous instruction is available; at that point, it will know for certain which address should be fetched and executed next. But while it waits, some of the hardware units just sit idle, and how many cycles are wasted depends heavily on how long the condition takes to resolve, from a few cycles up to hundreds if the comparison depends on a memory load.
Instead of stalling the pipeline (which would waste cycles), the CPU makes a prediction about which direction the branch will take and immediately starts fetching instructions from the predicted path. This creates 2 possible outcomes:
- Correct prediction: The pipeline continues at full speed and no penalty
- Incorrect prediction: The CPU must flush all speculatively executed instructions and restart from the correct path, incurring a significant penalty.
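Back-of-the-envelope, the average cost per branch is simply the misprediction rate times the flush penalty. A minimal sketch, assuming a 17-cycle penalty (the mid-range figure used later in this post):
// Expected cycles lost per branch = miss_rate * flush_penalty.
// A correct prediction costs (approximately) nothing.
fn expected_branch_cost(miss_rate: f64, flush_penalty_cycles: f64) -> f64 {
    miss_rate * flush_penalty_cycles
}

fn main() {
    // 50% accuracy is roughly what a purely random branch gives you.
    for accuracy in [0.99, 0.95, 0.50] {
        let wasted = expected_branch_cost(1.0 - accuracy, 17.0);
        println!("{:>4.0}% accuracy -> {wasted:.2} wasted cycles per branch", accuracy * 100.0);
    }
}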
It’s hard to imagine why a few wasted cycles would matter, but the effect becomes more noticeable as the number of branches in a program grows. Several kinds of branch instructions break the natural sequential flow; the most obvious one is the if-else statement. Here is a performance challenge: two identical algorithms process the same e-commerce data, one with randomly arranged products, the other with sorted products. Which do you think runs faster?
fn process_predictable_data(&self) -> f64 {
    let mut total_revenue = 0.0;
    for product in &self.predictable_products {
        if product.price < 100.0 { // PREDICTABLE: mostly true early on
            if product.in_stock {
                total_revenue += product.price;
                if product.is_premium {
                    total_revenue += 5.0; // premium bonus
                }
            }
        } else if product.price < 300.0 { // PREDICTABLE: mostly true in the middle
            if product.in_stock {
                total_revenue += product.price * 1.1; // markup
                if product.is_premium {
                    total_revenue += 15.0;
                }
            }
        } else { // PREDICTABLE: mostly true at the end
            if product.in_stock {
                total_revenue += product.price * 1.2; // higher markup
                if product.is_premium {
                    total_revenue += 25.0;
                }
            }
        }
    }
    total_revenue
}
fn process_unpredictable_data(&self) -> f64 {
    let mut total_revenue = 0.0;
    // Each condition could be true or false at random
    for product in &self.random_products {
        if product.price < 100.0 { // UNPREDICTABLE: random true/false
            if product.in_stock {
                total_revenue += product.price;
                if product.is_premium {
                    total_revenue += 5.0;
                }
            }
        } else if product.price < 300.0 { // UNPREDICTABLE: random true/false
            if product.in_stock {
                total_revenue += product.price * 1.1;
                if product.is_premium {
                    total_revenue += 15.0;
                }
            }
        } else { // UNPREDICTABLE: random true/false
            if product.in_stock {
                total_revenue += product.price * 1.2;
                if product.is_premium {
                    total_revenue += 25.0;
                }
            }
        }
    }
    total_revenue
}
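For context, here is one way the surrounding harness could look. This is a sketch under assumptions: the Product struct and its fields are inferred from the snippets above, data generation uses a hand-rolled LCG to avoid external crates, and the actual benchmark behind the numbers below may differ:
use std::time::Instant;

// Product fields inferred from the processing functions above (an assumption).
struct Product {
    price: f64,
    in_stock: bool,
    is_premium: bool,
}

// Same branchy logic as the two functions above, shared by both runs.
fn process(products: &[Product]) -> f64 {
    let mut total_revenue = 0.0;
    for p in products {
        if p.price < 100.0 {
            if p.in_stock {
                total_revenue += p.price;
                if p.is_premium { total_revenue += 5.0; }
            }
        } else if p.price < 300.0 {
            if p.in_stock {
                total_revenue += p.price * 1.1;
                if p.is_premium { total_revenue += 15.0; }
            }
        } else if p.in_stock {
            total_revenue += p.price * 1.2;
            if p.is_premium { total_revenue += 25.0; }
        }
    }
    total_revenue
}

// Tiny LCG so the sketch needs no external crates.
fn lcg(state: &mut u64) -> u64 {
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    *state >> 33
}

fn main() {
    let mut rng: u64 = 42;
    let mut products: Vec<Product> = (0..100_000)
        .map(|_| Product {
            price: (lcg(&mut rng) % 500) as f64, // spread across all three price branches
            in_stock: lcg(&mut rng) % 10 < 8,
            is_premium: lcg(&mut rng) % 10 < 3,
        })
        .collect();

    // Sorted by price: each price check flips outcome only once over the scan.
    products.sort_by(|a, b| a.price.partial_cmp(&b.price).unwrap());
    let t = Instant::now();
    println!("sorted:   revenue = {:.2}, took {:?}", process(&products), t.elapsed());

    // Fisher-Yates shuffle: the same checks become unpredictable coin flips.
    for i in (1..products.len()).rev() {
        let j = (lcg(&mut rng) as usize) % (i + 1);
        products.swap(i, j);
    }
    let t = Instant::now();
    println!("shuffled: revenue = {:.2}, took {:?}", process(&products), t.elapsed());
}
Because both runs see the same multiset of products, the revenues match exactly; only the branch behavior differs.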
Before running the benchmark, let’s think about it:
- Both functions process the same 100K products
- Both use identical logic and calculations
- Both access the same amount of memory
- The only difference is data ordering
Here are the results:
🎯 Test 1: Predictable Branch Pattern
Data: Products sorted by price (low → high)
Branch behavior: Price checks are predictable
Revenue: $21068967.18
Result: 89.572224s total
🎲 Test 2: Unpredictable Branch Pattern
Data: Same products, randomly shuffled
Branch behavior: Price checks are unpredictable
Revenue: $21068967.18
Result: 142.038094625s total
What just happened? Both tests processed identical data and produced the exact same revenue of $21068967.18, yet the unpredictable version took roughly 60% longer. The only difference is the order of the products in memory. This dramatic gap reveals the hidden cost of branch prediction. In the code, each if-else block creates a conditional jump where the CPU must predict which path to take and execute speculatively; it makes an educated guess about which branch to follow.
With sorted data (Test 1): The CPU’s predictions are highly accurate because the price comparisons are predictable: cheap products come early, expensive products later.
With random data (Test 2): Since the products are randomly ordered with different prices mixed together, this unpredictable pattern makes it much harder for the CPU to guess correctly. The same price comparisons become unpredictable coin flips. The CPU frequently guesses wrong, forcing it to discard speculatively executed instructions and restart from the correct path.
The penalty grows with the number of products and branches, turning a few wasted cycles per misprediction into a significant performance cost that can mean real money in production systems.
Now we have a better idea of which data patterns make it easier for the CPU to guess branch directions. But what if there were no branch prediction at all, and the CPU just waited for the result of each branch condition? Let’s first count the branches in the revenue-calculation function. For simplicity, we only count the conditional jumps (i.e., the if-else-if-else statements) and ignore the loop overhead (it matters for branch counting, but we skip it for now):
- Branch 1: should we jump into the < 100 price block? (YES/NO decision)
- Branch 2: should we jump into the 100-300 price block? (YES/NO decision)
- The final else statement doesn’t jump anywhere; it just continues execution sequentially
- Branch 3: the stock check if product.in_stock (always executed; exactly one of the 3 stock checks runs, depending on the price range)
- Branch 4: the premium check if product.is_premium (only executed if product.in_stock is true)
100K products with 4 branches each gives us 400K branches in total. To make the calculation more concrete, let’s pin down some CPU specifications as well:
- Pipeline depth: 14 stages [Intel Optimization Manual 2021]
- Frequency: [math]f_{CPU} = 3\ \text{GHz} = 3 \times 10^9\ \text{cycles/second}[/math]
- Branch misprediction penalty: 15-22 cycles (Intel 12th gen: ~16-20, AMD Zen 4: ~18-22)
- Branch resolution time: 2-3 cycles for register comparison [Intel document]
We first need to estimate the base execution time (all instructions except branches), which includes:
- Memory load/store for product data
- Arithmetic operations (price calculations, revenue additions)
- Loop iteration overhead
- Function call overhead
Estimation for 100K products:
Instructions per product: ~8-12 (loads, arithmetic, stores)
Total instructions: 100,000 × 10 = 1,000,000 instructions
CPI (Cycles Per Instruction): ~1.2 (modern superscalar CPU)
Base cycles: 1,000,000 × 1.2 = 1,200,000 cycles
[math]\large{T_{base} = \frac{1{,}200{,}000}{3 \times 10^9}\ \text{s} = 0.4\ \text{ms}}[/math]
Let’s assume that each branch resolution takes 3 cycles; then we can derive the pipeline-stall formula for the case with no branch prediction:
[math]\large{T_\text{no_pred} = \frac{N_\text{branches} \times C_\text{resolution}}{f_{CPU}} = \frac{400{,}000 \times 3}{3 \times 10^9}\ \text{s} = 400 \times 10^{-6}\ \text{s} = 0.40\ \text{ms}}[/math]
How much time does this waste compared with branch prediction enabled? As of the time of writing this post, the branch prediction accuracy of modern CPUs is around 90-97%, depending on workload characteristics:
- Highly regular patterns (loop, sorted data): 95 – 97%
- Mixed patterns (typical applications): 90 – 95%
- and other patterns…
So the branch overhead formula will be:
[math]\large{T_\text{branch_pred} = \frac{N_\text{branches} \times (R_\text{miss} \times P_\text{miss} + R_\text{hit} \times P_\text{hit})}{f_{CPU}}}[/math]
Where:
- [math]N_\text{branches} = 400{,}000[/math]
- [math]R_\text{miss} = 0.05\ (\text{5% chance of guessing wrong})[/math]
- [math]P_\text{miss} = 17\ (\text{each failed prediction costs 17 cycles})[/math]
- [math]R_\text{hit} = 0.95\ (\text{95% chance of guessing right})[/math]
- [math]P_\text{hit} = 0\ (\text{a successful prediction carries no penalty})[/math]
We just need to plug these numbers into the formula to see how long it theoretically takes a typical machine to handle the branches of 100K products with a 95% prediction success rate:
[math]\large{T_\text{branch_pred} = \frac{400{,}000 \times (0.05 \times 17 + 0.95 \times 0)}{3 \times 10^9}\ \text{s} = \frac{400{,}000 \times 0.85}{3 \times 10^9}\ \text{s} \approx 0.113\ \text{ms}}[/math]
Now we can calculate the total execution time when there is no branch prediction, and when there is branch prediction:
[math]\large{T_\text{total_no_pred}\ =\ T_\text{base}\ +\ T_\text{no_pred} = 0.4\ +\ 0.4\ =\ 0.8\text{ms}}\\\large{T_\text{total_pred}\ =\ T_\text{base}\ +\ T_\text{branch_pred} = 0.4\ +\ 0.113\ =\ 0.513\text{ms}}[/math]
Finally, we can calculate the slowdown factor:
[math]\large{\text{Slowdown Factor}\ =\ \frac{0.8}{0.513}\ \approx\ 1.56\ \text{times}}[/math]
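To double-check the arithmetic, here is a small sketch that plugs the same assumed numbers (400K branches, 3-cycle resolution, 17-cycle flush penalty, 95% accuracy, 3 GHz) into the formulas above:
// Recomputes the worked example above from its assumptions.
fn main() {
    let f_cpu: f64 = 3.0e9;            // 3 GHz, in cycles per second
    let n_branches: f64 = 400_000.0;   // 100K products x 4 branches each
    let base_cycles: f64 = 1_000_000.0 * 1.2; // ~1M instructions at CPI 1.2

    let t_base = base_cycles / f_cpu;

    // No prediction: every branch stalls for the full resolution time (3 cycles).
    let t_no_pred = n_branches * 3.0 / f_cpu;

    // With prediction: only the 5% of mispredictions pay the 17-cycle flush penalty.
    let t_pred = n_branches * (0.05 * 17.0 + 0.95 * 0.0) / f_cpu;

    println!("T_base        = {:.3} ms", t_base * 1e3);
    println!("T_no_pred     = {:.3} ms", t_no_pred * 1e3);
    println!("T_branch_pred = {:.3} ms", t_pred * 1e3);
    println!("slowdown      = {:.2}x", (t_base + t_no_pred) / (t_base + t_pred));
}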
| | With Prediction | Without Prediction | Difference |
| --- | --- | --- | --- |
| Base Execution | 0.40 ms | 0.40 ms | 0 ms |
| Branch Overhead | 0.113 ms | 0.40 ms | +0.287 ms |
| Total Time | 0.513 ms | 0.80 ms | +56% slower |
The key insight is that without branch prediction, this workload would run about 56% slower, and that saving comes for free. This is just a small example; in the real world, billions of predictions happen every second on every smart device we use, completely transparently.
Conclusion
Every if statement is a question that the CPU needs to answer. What we can do is make the question easy to predict by:
- Process sorted data when possible
- Batch similar operations together
- Avoiding random conditional logic in performance-critical paths
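As a minimal sketch of the first point (assuming the sort cost is amortized by a processing loop that dominates it):
// When the hot loop branches on price, sorting once up front turns
// random branch outcomes into one long predictable run.
fn total_cheap_revenue(prices: &mut [f64]) -> f64 {
    prices.sort_by(|a, b| a.partial_cmp(b).unwrap()); // one-time cost
    prices
        .iter()
        .take_while(|&&p| p < 100.0) // true, true, ..., true, then always false
        .sum()
}

fn main() {
    let mut prices = vec![250.0, 42.0, 310.0, 99.0, 7.5];
    println!("cheap revenue: {}", total_cheap_revenue(&mut prices)); // 148.5
}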
As we’ve seen in the e-commerce example, sometimes the most powerful optimization isn’t changing the algorithm; it’s simply organizing the data to work with the hardware instead of against it.