37  Compiler Optimizations

What Happens to Your Code


You write code. The compiler transforms it, sometimes into something 100× faster.

Understanding what compilers do—and what they can’t do—is essential for writing fast code. This chapter opens the black box.

37.1 The Compiler’s Job

An optimizing compiler has one overriding goal: generate the fastest machine code it can while preserving program semantics.

Input: your source code (C++, Rust, etc.)
Output: machine code (x86, ARM, etc.)

Between input and output: hundreds of transformation passes.

Source Code
    ↓
Frontend (parsing, type checking)
    ↓
Intermediate Representation (IR)
    ↓
Optimization Passes (100+ transformations)
    ↓
Backend (instruction selection, register allocation)
    ↓
Machine Code

We’ll focus on the optimization passes—the transformations that make code fast.

37.2 Loop Transformations: The Core of Performance

Most compute-intensive code spends time in loops. Compilers have an arsenal of loop transformations.

37.2.1 Loop Unrolling

Original:

for (int i = 0; i < 100; i++) {
    a[i] = b[i] + c[i];
}

After unrolling (factor 4):

// Trip count (100) is a multiple of 4; otherwise a scalar remainder loop is needed
for (int i = 0; i < 100; i += 4) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
}

Why faster:

  1. Fewer loop-overhead instructions (less branching)
  2. More instruction-level parallelism (the four adds can execute simultaneously)
  3. Easier to vectorize

Tradeoff: Larger code size, potential instruction cache pressure.

Compiler flag:

gcc -O3 -funroll-loops source.c

37.2.2 Loop Interchange

Changes loop nesting order for better cache behavior.

Original (bad for cache):

for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        C[i][j] = A[i][j] + B[i][j];  // Inner loop varies i: stride-N accesses
    }
}

After interchange (good for cache):

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i][j] = A[i][j] + B[i][j];  // Inner loop varies j: stride-1 accesses
    }
}

Why faster: Sequential memory access → cache-friendly.

When compilers do this: When they can prove it’s legal (no dependencies broken).
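
For intuition, here is a hedged sketch of a nest where interchange is illegal: iteration (i, j) reads the value written by iteration (i-1, j+1), so swapping the loops would execute the read before the write it depends on.

// NOT interchangeable: the dependence distance is (1, -1), which would
// become (-1, 1) after swapping the loops, i.e. a use before its definition
for (int i = 1; i < N; i++) {
    for (int j = 0; j < N - 1; j++) {
        A[i][j] = A[i-1][j+1] + 1.0f;
    }
}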

37.2.3 Loop Tiling (Blocking)

Divides loops into tiles that fit in cache.

Original:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

After tiling:

#define TILE 32
#define min(a, b) ((a) < (b) ? (a) : (b))  // helper; not provided by C's standard headers
for (int ii = 0; ii < N; ii += TILE) {
    for (int jj = 0; jj < N; jj += TILE) {
        for (int kk = 0; kk < N; kk += TILE) {
            // Process TILE×TILE block
            for (int i = ii; i < min(ii+TILE, N); i++) {
                for (int j = jj; j < min(jj+TILE, N); j++) {
                    for (int k = kk; k < min(kk+TILE, N); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}

Why faster: Tiles fit in cache; each element reused many times before eviction.

Compiler limits: polyhedral tools (Pluto, LLVM’s Polly) and ICC can tile automatically, but proving legality and choosing tile sizes is hard. Hot kernels often need manual tiling.

37.2.4 Loop Fusion

Combines multiple loops to reduce memory traffic.

Original:

for (int i = 0; i < N; i++) {
    a[i] = b[i] + 1;
}
for (int i = 0; i < N; i++) {
    c[i] = a[i] * 2;
}

After fusion:

for (int i = 0; i < N; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
}

Why faster:

  1. One pass over the data instead of two
  2. a[i] stays in a register between the two operations
  3. Better cache locality

This is operator fusion (Chapter 8) at the compiler level!
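
Fusion has legality constraints of its own. A small example where fusing the loops would be illegal, because the second loop reads a value the first loop has not yet produced at that iteration:

for (int i = 0; i < N; i++) {
    a[i] = b[i] + 1;
}
for (int i = 0; i < N - 1; i++) {
    c[i] = a[i+1] * 2;  // fused, this would read a[i+1] before it is written
}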

37.2.5 Loop Fission (Distribution)

Opposite of fusion: splits loops to improve parallelization or vectorization.

Original:

for (int i = 0; i < N; i++) {
    a[i] = expensive_computation(b[i]);
    c[i] = cheap_computation(d[i]);  // Independent!
}

After fission:

for (int i = 0; i < N; i++) {
    a[i] = expensive_computation(b[i]);
}
for (int i = 0; i < N; i++) {
    c[i] = cheap_computation(d[i]);
}

Why faster: Second loop can be vectorized independently, or parallelized differently.

37.3 Auto-Vectorization (SIMD)

Modern CPUs have SIMD (Single Instruction, Multiple Data) units:

  • AVX-512: 512-bit vectors (16 floats, 8 doubles)
  • AVX2: 256-bit vectors (8 floats, 4 doubles)
  • NEON (ARM): 128-bit vectors (4 floats)

Scalar code:

for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

Vectorized (AVX2 intrinsics, from <immintrin.h>):

// Assumes N is a multiple of 8 and 32-byte-aligned arrays; otherwise use
// _mm256_loadu_ps/_mm256_storeu_ps plus a scalar remainder loop
for (int i = 0; i < N; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]);   // load 8 floats
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);   // 8 adds in one instruction
    _mm256_store_ps(&c[i], vc);          // store 8 floats
}

Speedup: up to 8× for FP32 (8 operations per instruction); in practice, memory bandwidth often caps the gain.

37.3.1 When Auto-Vectorization Works

Good (compiler vectorizes):

for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];  // Simple, independent operations
}

Bad (compiler can’t vectorize):

for (int i = 1; i < N; i++) {
    a[i] = a[i-1] + b[i];  // Loop-carried dependency!
}

Why? Each iteration depends on the previous one. Can’t compute 8 iterations in parallel.

37.3.2 Helping the Compiler Vectorize

Use restrict (the C99 keyword) or the __restrict__ extension (GCC/Clang) to rule out aliasing:

// Compiler doesn't know if a and b overlap
void add(float* a, float* b, float* c, int N) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

// With restrict: compiler knows they don't overlap → can vectorize
void add(float* __restrict__ a, float* __restrict__ b,
         float* __restrict__ c, int N) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

Align data:

// malloc guarantees only modest alignment (typically 16 bytes); loads
// that straddle cache lines are slower
float* a = (float*)malloc(N * sizeof(float));

// 32-byte aligned allocation (C11: size must be a multiple of the alignment)
float* a = (float*)aligned_alloc(32, N * sizeof(float));

Use pragmas:

#pragma omp simd
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
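
GCC and Clang honor #pragma omp simd when OpenMP is enabled; -fopenmp-simd enables just the SIMD directives without pulling in the OpenMP runtime:

gcc -O3 -fopenmp-simd source.c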

37.4 Common Compiler Optimizations

37.4.1 Constant Folding

Before:

int x = 3 + 4;
int y = x * 2;

After:

int x = 7;
int y = 14;

Computes at compile time instead of runtime.

37.4.2 Dead Code Elimination

Before:

int x = 5;
int y = x + 1;  // Never used
return x;

After:

int x = 5;
return x;

Removes unused computations.
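
A practical gotcha: dead code elimination can silently delete the work in a microbenchmark whose result is never used. A minimal sketch of the usual defense, writing the result to a volatile sink:

// If 'sum' were never read, the compiler could delete the entire loop.
// Storing to a volatile forces the result to be treated as observable.
volatile float sink;
float sum = 0.0f;
for (int i = 0; i < N; i++) {
    sum += a[i];
}
sink = sum;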

37.4.3 Common Subexpression Elimination

Before:

int a = b * c + d;
int e = b * c + f;

After:

int temp = b * c;
int a = temp + d;
int e = temp + f;

Computes b * c once instead of twice.

37.4.4 Function Inlining

Before:

inline int square(int x) { return x * x; }

int main() {
    int a = square(5);
}

After:

int main() {
    int a = 5 * 5;  // constant folding then reduces this to 25
}

Eliminates function call overhead.

Tradeoff: Can increase code size significantly.

37.4.5 Strength Reduction

Replaces expensive operations with cheaper ones.

Before:

for (int i = 0; i < N; i++) {
    a[i] = i * 8;  // Multiply
}

After:

for (int i = 0; i < N; i++) {
    a[i] = i << 3;  // Bit shift (faster)
}
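
The classic form targets loop induction variables: a multiplication by the counter becomes a running sum that is updated by addition. Conceptually:

// The multiply i * 8 becomes an induction variable bumped each iteration
for (int i = 0, v = 0; i < N; i++, v += 8) {
    a[i] = v;  // equivalent to a[i] = i * 8
}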

37.5 Polyhedral Optimization

Advanced technique for complex loop nests.

Polyhedral compilers (Pluto, Polly in LLVM) represent loop iterations as points in a polyhedron, then apply mathematical transformations.

Example:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        D[i][j] = A[i][j] * 2;
    }
}

After polyhedral optimization (fusion + tiling):

#define TILE 32
#define min(a, b) ((a) < (b) ? (a) : (b))  // helper, as in Section 37.2.3
for (int ii = 0; ii < N; ii += TILE) {
    for (int jj = 0; jj < N; jj += TILE) {
        for (int i = ii; i < min(ii+TILE, N); i++) {
            for (int j = jj; j < min(jj+TILE, N); j++) {
                A[i][j] = B[i][j] + C[i][j];
                D[i][j] = A[i][j] * 2;
            }
        }
    }
}

Benefit: Automatic tiling + fusion for cache efficiency.

Limitation: Only works for “affine” loops (array indices are linear functions of loop variables).
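
Indirect indexing is the typical way a loop becomes non-affine and falls outside the model:

// Non-affine: the access pattern depends on runtime data in idx[],
// so the iteration space cannot be described as a polyhedron
for (int i = 0; i < N; i++) {
    a[idx[i]] += b[i];
}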

37.6 When Compilers Fail

Compilers can’t optimize everything. Here’s what stops them:

37.6.1 1. Pointer Aliasing

Problem:

void func(int* a, int* b, int N) {
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + 1;
    }
}

Compiler thinks: “What if a and b overlap? I can’t vectorize.”

Solution: Use __restrict__ to promise no aliasing.

37.6.2 2. Complex Control Flow

Problem:

for (int i = 0; i < N; i++) {
    if (condition_depending_on_data(a[i])) {
        result += expensive_function(a[i]);
    } else {
        result += cheap_function(a[i]);
    }
}

Compiler thinks: “Too complex to analyze. Can’t vectorize.”

Solution: restructure the branch into a select where possible (see the sketch below), or fall back to manual SIMD.
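
When each arm reduces to a simple value, rewriting the branch as a select often lets the vectorizer use masked or blended operations. A hedged sketch, assuming the per-element work is cheap and side-effect-free:

// A data-dependent branch such as 'if (a[i] > 0) result += a[i];'
// written as a select, which vectorizes as a masked add
for (int i = 0; i < N; i++) {
    result += (a[i] > 0.0f) ? a[i] : 0.0f;
}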

37.6.3 3. Function Calls in Loops

Problem:

for (int i = 0; i < N; i++) {
    a[i] = external_function(b[i]);  // Unknown function
}

Compiler thinks: “Don’t know what this function does. Can’t optimize.”

Solution: make the definition visible for inlining (in a header, or via LTO), or mark the function __attribute__((const)) or __attribute__((pure)) so the compiler knows it has no side effects.

37.6.4 4. Virtual Functions (C++)

Problem:

for (int i = 0; i < N; i++) {
    result += objects[i]->virtual_method();  // Virtual dispatch
}

Compiler thinks: “Don’t know which implementation at compile time.”

Solution: Use template-based polymorphism when possible.
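
A minimal sketch of the template alternative (the Kernel type and method() are illustrative names): the concrete type is fixed at compile time, so the call is statically dispatched and can be inlined.

template <typename Kernel>
float accumulate(const Kernel* objects, int N) {
    float result = 0.0f;
    for (int i = 0; i < N; i++) {
        result += objects[i].method();  // static dispatch: inlinable, vectorizable
    }
    return result;
}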

37.7 Reading Compiler Output

Compilers can tell you what they optimized (or didn’t).

37.7.1 GCC/Clang: Optimization Reports

gcc -O3 -fopt-info-vec -fopt-info-vec-missed source.c

Output:

source.c:10:5: optimized: loop vectorized using 16 byte vectors
source.c:25:5: missed: couldn't vectorize loop (data dependency)

Or more verbose:

gcc -O3 -fopt-info-all source.c

37.7.2 Clang: Remark System

clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize source.c

Output:

source.c:10:5: remark: vectorized loop (vectorization width: 8)
source.c:25:5: remark: loop not vectorized: cannot prove it is safe

37.7.3 Intel Compiler: Optimization Report

icc -O3 -qopt-report=5 -qopt-report-phase=vec source.c

Generates detailed .optrpt files showing every optimization decision.

37.8 Inspecting Generated Assembly

Sometimes you need to see the actual machine code.

gcc -O3 -S -fverbose-asm source.c
# Creates source.s with annotated assembly

Example output:

# Loop at source.c:10
.L3:
    vmovups (%rsi,%rax), %ymm0    # Load 8 floats (AVX2)
    vaddps  (%rdx,%rax), %ymm0, %ymm0
    vmovups %ymm0, (%rdi,%rax)    # Store 8 floats
    addq    $32, %rax              # Advance offset by 32 bytes (8 floats)
    cmpq    %rcx, %rax
    jne     .L3

Interpretation: Compiler vectorized with AVX2 (8 floats at a time).
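
If you only have a compiled binary, disassembling works too; built with -g, objdump can interleave the original source:

objdump -d -S program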

37.9 Profile-Guided Optimization (PGO)

Use runtime data to guide compiler optimizations.

Step 1: Compile with instrumentation

gcc -O3 -fprofile-generate source.c -o program

Step 2: Run with representative workload

./program < typical_input.txt
# Generates .gcda profile files

Step 3: Recompile with profile

gcc -O3 -fprofile-use source.c -o program_optimized

What PGO enables:

  • Better inlining decisions (inline hot functions)
  • Better code layout (hot code contiguous in memory)
  • Better branch prediction

Speedup: 10-30% for some workloads.
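
Clang’s instrumentation-based PGO follows the same three steps with different flags (llvm-profdata ships with LLVM):

clang -O3 -fprofile-instr-generate source.c -o program
LLVM_PROFILE_FILE=prof.raw ./program < typical_input.txt
llvm-profdata merge -output=prof.profdata prof.raw
clang -O3 -fprofile-instr-use=prof.profdata source.c -o program_optimized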

37.11 How torch.compile Works

PyTorch 2.0’s torch.compile is a just-in-time compiler that applies many of these optimizations to PyTorch models.

Workflow:

  1. Trace the model to capture the computation graph
  2. Optimize the graph (fusion, CSE, etc.)
  3. Generate code (Triton or C++)
  4. Compile the generated code

Example optimizations:

  • Operator fusion (like loop fusion)
  • Constant folding
  • Dead code elimination
  • Memory layout optimization

Effect: up to 2-10× speedup on some models, without code changes.

37.12 Practical Guidelines

37.12.1 1. Use Optimization Flags

# Release builds
gcc -O3 -march=native source.c

# Debug builds (don't optimize)
gcc -O0 -g source.c

Never profile -O0 builds!

37.12.2 2. Help the Compiler

  • Use restrict for non-aliasing pointers
  • Mark functions inline or const
  • Align data structures
  • Avoid complex control flow in hot loops

37.12.3 3. Check What Happened

# See what got vectorized
gcc -O3 -fopt-info-vec source.c

# See assembly
gcc -O3 -S -fverbose-asm source.c

37.12.4 4. Consider PGO for Hot Paths

If 90% of time is in one function, PGO can help.

37.12.5 5. Use LTO for Final Builds

gcc -O3 -flto *.c -o program

Small compile-time cost, often 5-15% runtime gain.

37.13 Case Study: Matrix Multiply

Let’s trace compiler optimizations on matrix multiply.

Original:

void matmul(float* C, float* A, float* B, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                C[i*N + j] += A[i*N + k] * B[k*N + j];
            }
        }
    }
}

What compilers do:

GCC -O3:

  • Loop interchange (sometimes reorders the j and k loops)
  • Partial unrolling
  • No vectorization (can’t prove safety with potential aliasing and complex indexing)

ICC -O3:

  • Loop interchange
  • Unrolling
  • Vectorization (AVX2)
  • Speedup: ~4× over GCC

ICC -O3 -xHost -qopt-report:

  • Loop blocking (tiling)
  • Aggressive unrolling
  • Prefetching hints
  • Speedup: ~8× over GCC

But still 10-20× slower than OpenBLAS or MKL (which use hand-tuned assembly and advanced tiling).

Lesson: Compilers are good, but expert libraries are better for critical kernels.
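
For comparison, the same update (C += A×B, row-major, single precision) through the portable CBLAS interface, linking against OpenBLAS or MKL:

#include <cblas.h>

// beta = 1.0f preserves the accumulate-into-C semantics of the loop version
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            N, N, N, 1.0f, A, N, B, N, 1.0f, C, N);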

37.14 Connections

Chapter 7 (Locality): Loop tiling is a locality optimization the compiler sometimes does.

Chapter 8 (Fusion): Loop fusion at the compiler level.

Chapter 19 (Profiling): Use profiling to find optimization opportunities, then check compiler reports.

37.15 Key Takeaways

  1. Compilers apply 100+ transformations: Loop optimizations, vectorization, inlining, CSE.

  2. Loop transformations are critical: Interchange, tiling, unrolling, fusion.

  3. Auto-vectorization gives 4-16× speedup: When it works. Help it with restrict, alignment, simple loops.

  4. Aliasing is the enemy: Use __restrict__ to promise non-overlapping pointers.

  5. Read compiler reports: -fopt-info and friends tell you what happened.

  6. PGO can help hot paths: 10-30% gains with profile-guided optimization.

  7. Compilers aren’t magic: For critical kernels (BLAS, FFT), use expert libraries.


Tip: Compiler Comparison

  • GCC: good all-around, excellent standards compliance
  • Clang: fast compilation, good diagnostics
  • ICC (Intel): best for x86 (vectorization, loop optimizations)
  • MSVC: Windows standard, improving

For numerical code on Intel CPUs, ICC often wins.


37.16 Further Reading

  • LLVM documentation: https://llvm.org/docs/
  • GCC optimization options: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
  • “Engineering a Compiler” by Cooper & Torczon
  • Polyhedral compilation: https://polyhedral.info/