37 Compiler Optimizations
What Happens to Your Code
You write code. The compiler transforms it. Sometimes by 100×.
Understanding what compilers do—and what they can’t do—is essential for writing fast code. This chapter opens the black box.
37.1 The Compiler’s Job
A compiler has one goal: generate the fastest possible machine code while preserving program semantics.
Input: your source code (C++, Rust, etc.)
Output: machine code (x86, ARM, etc.)
Between input and output: hundreds of transformation passes.
Source Code
↓
Frontend (parsing, type checking)
↓
Intermediate Representation (IR)
↓
Optimization Passes (100+ transformations)
↓
Backend (instruction selection, register allocation)
↓
Machine Code
We’ll focus on the optimization passes—the transformations that make code fast.
37.2 Loop Transformations: The Core of Performance
Most compute-intensive code spends time in loops. Compilers have an arsenal of loop transformations.
37.2.1 Loop Unrolling
Original:
for (int i = 0; i < 100; i++) {
    a[i] = b[i] + c[i];
}

After unrolling (factor 4):
for (int i = 0; i < 100; i += 4) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
}

Why faster:
1. Fewer loop-overhead instructions (less branching)
2. More instruction-level parallelism (the four adds can execute simultaneously)
3. Easier to vectorize
Tradeoff: Larger code size, potential instruction cache pressure.
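You can also request unrolling for a single loop rather than globally; GCC 8+ accepts an unroll pragma directly before the loop (a sketch, GCC-specific):

    #pragma GCC unroll 4 // ask GCC to unroll this loop by a factor of 4
    for (int i = 0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }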
Compiler flag:
gcc -O3 -funroll-loops source.c

37.2.2 Loop Interchange
Changes loop nesting order for better cache behavior.
Original (bad for cache):
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        C[i][j] = A[i][j] + B[i][j]; // stride-N accesses: column-major order!
    }
}

After interchange (good for cache):

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[i][j] = A[i][j] + B[i][j]; // stride-1 accesses: row-major order
    }
}

Why faster: Sequential memory access → cache-friendly.
When compilers do this: When they can prove it’s legal (no dependencies broken).
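A minimal sketch of an illegal case: iteration (i, j) reads the value written by iteration (i-1, j+1), so the original i-outer order runs the writer first, while the interchanged order would run the reader first.

    for (int i = 1; i < N; i++) {
        for (int j = 0; j < N - 1; j++) {
            // Reads A[i-1][j+1], written one i-iteration earlier.
            // Interchanging i and j would break this dependence.
            A[i][j] = A[i-1][j+1] + 1.0f;
        }
    }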
37.2.3 Loop Tiling (Blocking)
Divides loops into tiles that fit in cache.
Original:
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

After tiling:
#define TILE 32
for (int ii = 0; ii < N; ii += TILE) {
    for (int jj = 0; jj < N; jj += TILE) {
        for (int kk = 0; kk < N; kk += TILE) {
            // Process one TILE×TILE block
            for (int i = ii; i < min(ii+TILE, N); i++) {
                for (int j = jj; j < min(jj+TILE, N); j++) {
                    for (int k = kk; k < min(kk+TILE, N); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}

Why faster: Tiles fit in cache; each element is reused many times before eviction.
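A quick sanity check on the tile size: with TILE = 32 and FP32 data, the three active tiles of A, B, and C occupy 3 × 32 × 32 × 4 B = 12 KB, which fits comfortably in a typical 32 KB L1 data cache.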
Compiler limits: Advanced compilers (ICC, polyhedral tools like Pluto) can tile automatically, but proving legality and choosing good tile sizes is difficult, so manual tiling is often needed.
37.2.4 Loop Fusion
Combines multiple loops to reduce memory traffic.
Original:
for (int i = 0; i < N; i++) {
    a[i] = b[i] + 1;
}
for (int i = 0; i < N; i++) {
    c[i] = a[i] * 2;
}

After fusion:
for (int i = 0; i < N; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
}

Why faster:
1. One pass over the data instead of two
2. a[i] stays in a register between the two operations
3. Better cache locality
This is operator fusion (Chapter 8) at the compiler level!
37.2.5 Loop Fission (Distribution)
Opposite of fusion: splits loops to improve parallelization or vectorization.
Original:
for (int i = 0; i < N; i++) {
    a[i] = expensive_computation(b[i]);
    c[i] = cheap_computation(d[i]); // Independent!
}

After fission:
for (int i = 0; i < N; i++) {
    a[i] = expensive_computation(b[i]);
}
for (int i = 0; i < N; i++) {
    c[i] = cheap_computation(d[i]);
}

Why faster: The second loop can now be vectorized independently, or parallelized differently.
37.3 Auto-Vectorization (SIMD)
Modern CPUs have SIMD (Single Instruction, Multiple Data) units:
- AVX-512: 512-bit vectors (16 floats, 8 doubles)
- AVX2: 256-bit vectors (8 floats, 4 doubles)
- NEON (ARM): 128-bit vectors (4 floats)
Scalar code:
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

Vectorized (AVX2):
for (int i = 0; i < N; i += 8) {
    __m256 va = _mm256_load_ps(&a[i]); // aligned load of 8 floats
    __m256 vb = _mm256_load_ps(&b[i]);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_store_ps(&c[i], vc);
}

Speedup: 8× for FP32 (8 operations per instruction). Note that _mm256_load_ps requires 32-byte-aligned data; use _mm256_loadu_ps for unaligned pointers.
37.3.1 When Auto-Vectorization Works
Good (compiler vectorizes):
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i]; // Simple, independent operations
}

Bad (compiler can't vectorize):
for (int i = 1; i < N; i++) {
    a[i] = a[i-1] + b[i]; // Loop-carried dependency!
}

Why? Each iteration depends on the previous one, so the compiler can't compute 8 iterations in parallel.
37.3.2 Helping the Compiler Vectorize
Use restrict to eliminate aliasing:
// Compiler doesn't know whether a, b, and c overlap
void add(float* a, float* b, float* c, int N) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

// With restrict: compiler knows they don't overlap → can vectorize
void add(float* __restrict__ a, float* __restrict__ b,
         float* __restrict__ c, int N) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

Align data:
// Misaligned loads are slower
float* a = (float*)malloc(N * sizeof(float));

// Aligned loads are fast (C11 aligned_alloc: size must be a multiple of the alignment)
float* a = (float*)aligned_alloc(32, N * sizeof(float));

Use pragmas:
#pragma omp simd
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

37.4 Common Compiler Optimizations
37.4.1 Constant Folding
Before:
int x = 3 + 4;
int y = x * 2;

After:

int x = 7;
int y = 14;

Computes values at compile time instead of runtime.
37.4.2 Dead Code Elimination
Before:
int x = 5;
int y = x + 1; // Never used
return x;

After:

int x = 5;
return x;

Removes unused computations.
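DCE is also why naive microbenchmarks mislead: if a timed loop's result is never used, the compiler may delete the whole loop. A common hedge is to sink the result into a volatile variable; a minimal sketch (benchmark and sink are illustrative names):

    volatile float sink; // a volatile store cannot be eliminated

    void benchmark(const float* a, int N) {
        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            sum += a[i] * a[i];
        }
        sink = sum; // keeps the loop observable, defeating DCE
    }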
37.4.3 Common Subexpression Elimination
Before:
int a = b * c + d;
int e = b * c + f;

After:

int temp = b * c;
int a = temp + d;
int e = temp + f;

Computes b * c once instead of twice.
37.4.4 Function Inlining
Before:
inline int square(int x) { return x * x; }
int main() {
    int a = square(5);
}

After:

int main() {
    int a = 5 * 5;
}

Eliminates function call overhead.
Tradeoff: Can increase code size significantly.
37.4.5 Strength Reduction
Replaces expensive operations with cheaper ones.
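The classic loop form turns a multiply by the loop index into a running addition on an induction variable. For the example below (a[i] = i * 8), the compiler effectively generates something like this sketch:

    int offset = 0;
    for (int i = 0; i < N; i++) {
        a[i] = offset; // same value as i * 8
        offset += 8;   // one cheap add per iteration instead of a multiply
    }

At the source level, the simpler instance is replacing a multiply by a power of two with a shift.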
Before:
for (int i = 0; i < N; i++) {
    a[i] = i * 8; // Multiply
}

After:

for (int i = 0; i < N; i++) {
    a[i] = i << 3; // Bit shift (faster)
}

37.5 Polyhedral Optimization
Advanced technique for complex loop nests.
Polyhedral compilers (Pluto, Polly in LLVM) represent loop iterations as points in a polyhedron, then apply mathematical transformations.
Example:
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j] + C[i][j];
    }
}
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        D[i][j] = A[i][j] * 2;
    }
}

After polyhedral optimization (fusion + tiling):
#define TILE 32
for (int ii = 0; ii < N; ii += TILE) {
    for (int jj = 0; jj < N; jj += TILE) {
        for (int i = ii; i < min(ii+TILE, N); i++) {
            for (int j = jj; j < min(jj+TILE, N); j++) {
                A[i][j] = B[i][j] + C[i][j];
                D[i][j] = A[i][j] * 2;
            }
        }
    }
}

Benefit: Automatic tiling + fusion for cache efficiency.
Limitation: Only works for “affine” loops (array indices are linear functions of loop variables).
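A sketch of a non-affine loop (idx is an illustrative runtime index array): because the subscript of a depends on data rather than on loop variables, the polyhedral model cannot describe which elements are touched.

    for (int i = 0; i < N; i++) {
        a[idx[i]] += b[i]; // non-affine subscript: idx[i] is data-dependent
    }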
37.6 When Compilers Fail
Compilers can’t optimize everything. Here’s what stops them:
37.6.1 1. Pointer Aliasing
Problem:
void func(int* a, int* b, int N) {
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + 1;
    }
}

Compiler thinks: "What if a and b overlap? I can't vectorize."
Solution: Use __restrict__ to promise no aliasing.
37.6.2 2. Complex Control Flow
Problem:
for (int i = 0; i < N; i++) {
    if (condition_depending_on_data(a[i])) {
        result += expensive_function(a[i]);
    } else {
        result += cheap_function(a[i]);
    }
}

Compiler thinks: "Too complex to analyze. Can't vectorize."
Solution: Sometimes manual SIMD or restructuring.
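One restructuring sketch, applicable when both arms reduce to cheap expressions (not heavyweight calls like the ones above): computing both values and selecting lets the compiler emit a vector compare-and-blend with no per-element branch.

    for (int i = 0; i < N; i++) {
        // Both arms are simple expressions, so this compiles to a
        // SIMD compare + blend rather than a branch per element.
        c[i] = (a[i] > 0.0f) ? a[i] : 0.0f;
    }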
37.6.3 3. Function Calls in Loops
Problem:
for (int i = 0; i < N; i++) {
    a[i] = external_function(b[i]); // Unknown function
}

Compiler thinks: "Don't know what this function does. Can't optimize."
Solution: Mark function as inline or __attribute__((const)).
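A minimal sketch of the attribute (scale is an illustrative function; the attribute is a GCC/Clang extension): const promises the result depends only on the argument values, so calls can be hoisted out of loops or combined by CSE.

    __attribute__((const)) float scale(float x) {
        return x * 0.5f + 1.0f; // no global reads, no side effects
    }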
37.6.4 4. Virtual Functions (C++)
Problem:
for (int i = 0; i < N; i++) {
    result += objects[i]->virtual_method(); // Virtual dispatch
}

Compiler thinks: "Don't know which implementation at compile time."
Solution: Use template-based polymorphism when possible.
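A sketch of the template alternative (apply_all and SquareOp are illustrative names): the operation type is fixed at compile time, so the compiler can inline the call and vectorize the loop.

    // Static polymorphism: Op is resolved at compile time.
    template <typename Op>
    float apply_all(Op op, const float* a, int N) {
        float result = 0.0f;
        for (int i = 0; i < N; i++) {
            result += op(a[i]); // inlinable, hence vectorizable
        }
        return result;
    }

    struct SquareOp {
        float operator()(float x) const { return x * x; }
    };

    // Usage: float s = apply_all(SquareOp{}, data, N);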
37.7 Reading Compiler Output
Compilers can tell you what they optimized (or didn’t).
37.7.1 GCC/Clang: Optimization Reports
gcc -O3 -fopt-info-vec -fopt-info-vec-missed source.c

Output:
source.c:10:5: optimized: loop vectorized using 16 byte vectors
source.c:25:5: missed: couldn't vectorize loop (data dependency)
Or more verbose:
gcc -O3 -fopt-info-all source.c

37.7.2 Clang: Remark System
clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize source.c

Output:
source.c:10:5: remark: vectorized loop (vectorization width: 8)
source.c:25:5: remark: loop not vectorized: cannot prove it is safe
37.7.3 Intel Compiler: Optimization Report
icc -O3 -qopt-report=5 -qopt-report-phase=vec source.c

Generates detailed .optrpt files showing every optimization decision.
37.8 Inspecting Generated Assembly
Sometimes you need to see the actual machine code.
gcc -O3 -S -fverbose-asm source.c
# Creates source.s with annotated assembly

Example output:
# Loop at source.c:10
.L3:
    vmovups (%rsi,%rax), %ymm0       # Load 8 floats (AVX2)
    vaddps  (%rdx,%rax), %ymm0, %ymm0
    vmovups %ymm0, (%rdi,%rax)       # Store 8 floats
    addq    $32, %rax                # Advance offset by 32 bytes
    cmpq    %rcx, %rax
    jne     .L3

Interpretation: The compiler vectorized with AVX2 (8 floats at a time).
37.9 Profile-Guided Optimization (PGO)
Use runtime data to guide compiler optimizations.
Step 1: Compile with instrumentation
gcc -O3 -fprofile-generate source.c -o program

Step 2: Run with representative workload
./program < typical_input.txt
# Generates .gcda profile files

Step 3: Recompile with profile
gcc -O3 -fprofile-use source.c -o program_optimized

What PGO enables:
- Better inlining decisions (inline hot functions)
- Better code layout (hot code contiguous in memory)
- Better branch prediction
Speedup: 10-30% for some workloads.
37.10 Link-Time Optimization (LTO)
Optimize across entire program, not just individual files.
Without LTO:
gcc -O3 -c file1.c -o file1.o
gcc -O3 -c file2.c -o file2.o
gcc file1.o file2.o -o program
# Each file optimized independently

With LTO:
gcc -O3 -flto -c file1.c -o file1.o
gcc -O3 -flto -c file2.c -o file2.o
gcc -flto file1.o file2.o -o program
# Optimized as one unit

Enables:
- Cross-file inlining
- Better dead code elimination
- Global constant propagation
Downside: Slower compilation.
37.11 How torch.compile Works
PyTorch 2.0’s torch.compile is a just-in-time compiler that applies many of these optimizations to PyTorch models: wrapping a model in torch.compile(model) triggers compilation on the first execution.
Workflow:
1. Trace the model to capture the computation graph
2. Optimize the graph (fusion, CSE, etc.)
3. Generate code (Triton or C++)
4. Compile the generated code
Example optimizations:
- Operator fusion (like loop fusion)
- Constant folding
- Dead code elimination
- Memory layout optimization
Effect: 2-10× speedup without code changes.
37.12 Practical Guidelines
37.12.1 1. Use Optimization Flags
# Release builds
gcc -O3 -march=native source.c

# Debug builds (don't optimize)
gcc -O0 -g source.c

Never profile -O0 builds!
37.12.2 2. Help the Compiler
- Use restrict for non-aliasing pointers
- Mark functions inline or __attribute__((const))
- Align data structures
- Avoid complex control flow in hot loops
37.12.3 3. Check What Happened
# See what got vectorized
gcc -O3 -fopt-info-vec source.c
# See assembly
gcc -O3 -S -fverbose-asm source.c

37.12.4 4. Consider PGO for Hot Paths
If 90% of time is in one function, PGO can help.
37.12.5 5. Use LTO for Final Builds
gcc -O3 -flto *.c -o program

Small compile-time cost, often 5-15% runtime gain.
37.13 Case Study: Matrix Multiply
Let’s trace compiler optimizations on matrix multiply.
Original:
void matmul(float* C, float* A, float* B, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                C[i*N + j] += A[i*N + k] * B[k*N + j];
            }
        }
    }
}

What compilers do:
GCC -O3:
- Loop interchange (sometimes reorders k and j)
- Partial unrolling
- No vectorization (can't prove safety with the complex indexing)

ICC -O3:
- Loop interchange
- Unrolling
- Vectorization (AVX2)
- Speedup: ~4× over GCC

ICC -O3 -march=native -qopt-report:
- Loop blocking (tiling)
- Aggressive unrolling
- Prefetching hints
- Speedup: ~8× over GCC
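Before reaching for a library, a small manual rewrite often unlocks vectorization even in GCC. A sketch, assuming the matrices don't alias (the restrict qualifiers encode that promise) and using the i-k-j loop order so the inner loop is unit-stride:

    void matmul(float* __restrict__ C, const float* __restrict__ A,
                const float* __restrict__ B, int N) {
        for (int i = 0; i < N; i++) {
            for (int k = 0; k < N; k++) {
                float aik = A[i*N + k];              // invariant in the j loop
                for (int j = 0; j < N; j++) {
                    C[i*N + j] += aik * B[k*N + j];  // unit-stride, vectorizable
                }
            }
        }
    }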
But still 10-20× slower than OpenBLAS or MKL (which use hand-tuned assembly and advanced tiling).
Lesson: Compilers are good, but expert libraries are better for critical kernels.
37.14 Connections
Chapter 7 (Locality): Loop tiling is a locality optimization the compiler sometimes does.
Chapter 8 (Fusion): Loop fusion at the compiler level.
Chapter 19 (Profiling): Use profiling to find optimization opportunities, then check compiler reports.
37.15 Key Takeaways
Compilers apply 100+ transformations: Loop optimizations, vectorization, inlining, CSE.
Loop transformations are critical: Interchange, tiling, unrolling, fusion.
Auto-vectorization gives 4-16× speedup: When it works. Help it with restrict, alignment, and simple loops.
Aliasing is the enemy: Use __restrict__ to promise non-overlapping pointers.
Read compiler reports: -fopt-info and friends tell you what happened.
PGO can help hot paths: 10-30% gains with profile-guided optimization.
Compilers aren’t magic: For critical kernels (BLAS, FFT), use expert libraries.
A quick compiler comparison:
- GCC: Good all-around, excellent standards compliance
- Clang: Fast compilation, good diagnostics
- ICC (Intel): Best for x86 (vectorization, loop opts)
- MSVC: Windows standard, improving
For numerical code on Intel CPUs, ICC often wins.
37.16 Further Reading
- LLVM documentation: https://llvm.org/docs/
- GCC optimization options: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- “Engineering a Compiler” by Cooper & Torczon
- Polyhedral compilation: https://polyhedral.info/