Performance Optimization
C++ performance fundamentals: cache efficiency, move semantics, allocation strategies, compiler optimizations, and profiling techniques.
Key Facts
- Measure first, optimize second. Use a profiler (perf, VTune, Instruments) rather than guessing
- Cache misses often dominate: sequential memory access is far faster than random access (the gap can reach 100x+ once the working set no longer fits in cache)
- `std::vector` beats `std::list` for almost everything due to cache locality
- Move semantics eliminate copies: pass-by-value + move for sink parameters
- `reserve()` on vectors prevents reallocation in loops
- Small Buffer Optimization (SBO): `std::string` and `std::function` typically avoid the heap for small data (implementation-dependent)
- `inline` is a linkage hint, not a performance hint - the compiler decides inlining
- `[[likely]]`/`[[unlikely]]` (C++20) - branch prediction hints
- `-O2`/`-O3` for release; `-Og` for debug with optimizations
- Link-Time Optimization (LTO): `-flto` enables cross-TU inlining
- Profile-Guided Optimization (PGO): compile, run, recompile with profile data
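To make the locality claim measurable on your own machine, here is a minimal timing sketch that runs the same summation over a `std::vector` and a `std::list`. The helper names `sum_container` and `time_sum_ns` are illustrative, not part of any library, and a single timed pass is only an indication: results depend on the compiler flags, the allocator, and how fragmented the list nodes are.

```cpp
#include <chrono>
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Sum any container of ints; the work is identical for both containers,
// so any timing difference comes from the memory access pattern.
template <typename Container>
std::int64_t sum_container(const Container& c) {
    return std::accumulate(c.begin(), c.end(), std::int64_t{0});
}

// Hypothetical helper: time one traversal in nanoseconds and return the sum
// through out_sum so the compiler cannot discard the loop.
template <typename Container>
std::int64_t time_sum_ns(const Container& c, std::int64_t& out_sum) {
    const auto start = std::chrono::steady_clock::now();
    out_sum = sum_container(c);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

A freshly built list often has its nodes allocated close together, so the gap tends to widen in long-running programs where nodes end up scattered across the heap. For serious measurements, repeat the run several times or use a profiler, as the first bullet advises.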
Patterns
Cache-Friendly Data Layout
// BAD: Array of Structs with unused fields (AoS)
struct ParticleBad {
    float x, y, z; // position (hot)
    float r, g, b, a; // color (cold)
    std::string name; // metadata (cold)
    float vx, vy, vz; // velocity (hot)
};
std::vector<ParticleBad> particles; // iterating position touches cold data
// GOOD: Struct of Arrays (SoA) - separate hot from cold
struct Particles {
    std::vector<float> x, y, z; // positions together
    std::vector<float> vx, vy, vz; // velocities together
    std::vector<float> r, g, b, a; // colors together (separate cache lines)
};
// Physics update only touches x,y,z,vx,vy,vz - no cache pollution from colors
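The physics update mentioned in the last comment could look like the sketch below. It reuses the `Particles` layout from above; `add_particle` and `integrate` are hypothetical helper names, and the explicit Euler step is only an assumed example of a hot loop.

```cpp
#include <cstddef>
#include <vector>

// Struct of Arrays: each field lives in its own contiguous array.
struct Particles {
    std::vector<float> x, y, z;     // positions (hot)
    std::vector<float> vx, vy, vz;  // velocities (hot)
    std::vector<float> r, g, b, a;  // colors (cold, untouched by integrate)
};

// Hypothetical helper: append one particle with an opaque white color so
// that all ten arrays stay the same length.
void add_particle(Particles& p, float px, float py, float pz,
                  float vx, float vy, float vz) {
    p.x.push_back(px);  p.y.push_back(py);  p.z.push_back(pz);
    p.vx.push_back(vx); p.vy.push_back(vy); p.vz.push_back(vz);
    p.r.push_back(1.0f); p.g.push_back(1.0f);
    p.b.push_back(1.0f); p.a.push_back(1.0f);
}

// Explicit Euler step: reads and writes only the six hot arrays, so every
// cache line that is loaded contains data the loop actually needs.
void integrate(Particles& p, float dt) {
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}
```

The trade-off is that code needing every field of a single particle now touches ten arrays, so SoA pays off mainly when the hot loops use a small subset of the fields.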
Avoiding Copies
// Return by value - NRVO typically elides the copy
std::vector<int> generate() {
    std::vector<int> result;
    result.reserve(1000);
    for (int i = 0; i < 1000; ++i) result.push_back(i);
    return result; // NRVO: usually constructed directly in the caller; otherwise moved, never copied
}
// Sink parameter: by value + move
class Registry {
    std::vector<std::string> items_;
public:
    void add(std::string item) { // copy/move into param
        items_.push_back(std::move(item)); // move into container
    }
};
// Pass large read-only by const ref
void process(const std::vector<int>& data); // no copy
void process(std::span<const int> data); // non-owning view over any contiguous range (C++20)
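To see what the sink-parameter pattern costs, the sketch below counts copies and moves with an instrumented type. `Tracked` and `TrackedRegistry` are hypothetical names introduced for illustration; the registry reserves capacity up front so that vector reallocation does not add extra moves to the counts. It assumes C++17 for the `static inline` counters.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Instrumented type that counts how often it is copied or moved.
struct Tracked {
    static inline int copies = 0;
    static inline int moves = 0;
    std::string payload;

    explicit Tracked(std::string s) : payload(std::move(s)) {}
    Tracked(const Tracked& other) : payload(other.payload) { ++copies; }
    Tracked(Tracked&& other) noexcept : payload(std::move(other.payload)) { ++moves; }
    Tracked& operator=(const Tracked&) = default;
    Tracked& operator=(Tracked&&) noexcept = default;
};

// Same shape as Registry above: take by value, then move into the container.
class TrackedRegistry {
    std::vector<Tracked> items_;
public:
    TrackedRegistry() { items_.reserve(16); }  // keep reallocation out of the counts
    void add(Tracked item) { items_.push_back(std::move(item)); }
    std::size_t size() const { return items_.size(); }
};
```

Passing an lvalue costs one copy plus one move, and passing an rvalue (`std::move(x)`) costs two moves and no copy. A single by-value overload therefore serves both call styles at the price of one extra move compared with a dedicated `const T&` / `T&&` overload pair.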
Reserve and Shrink
std::vector<int> v;
v.reserve(10000); // single allocation up front
for (int i = 0; i < 10000; ++i) {
    v.push_back(i); // no reallocations
}
// Release excess memory (non-binding request; may reallocate)
v.shrink_to_fit();
// For hash maps: reserve buckets for the expected element count to avoid rehashing during inserts
std::unordered_map<int, int> m;
m.reserve(1000); // pre-allocate buckets for at least 1000 elements
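The effect of `reserve()` can be observed directly by watching `capacity()` while appending. The function below is a small sketch; `count_reallocations` is a hypothetical helper, and the exact count without `reserve()` depends on the standard library's growth factor.

```cpp
#include <cstddef>
#include <vector>

// Count how many times the vector's capacity changes while appending n ints.
// A capacity change during push_back means the elements were reallocated.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);

    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With `reserve_first` set, the function returns 0 because the standard guarantees no reallocation until the reserved capacity is exceeded. Without it, growth is geometric, so the count stays small, but each reallocation moves or copies every existing element.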
constexpr Computation
#include <array>
#include <cstdint>
// Move computation to compile time
constexpr auto lookup_table = [] {
    std::array<int, 256> table{};
    for (int i = 0; i < 256; ++i) {
        table[i] = (i * i) % 256;
    }
    return table;
}();
// Table is built at compile time and embedded in the binary; the runtime cost is a single array load
int fast_square_mod(std::uint8_t x) {
    return lookup_table[x];
}
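One way to confirm that a table really is computed during compilation is to check it with `static_assert`, which only accepts constant expressions. The sketch below rebuilds the same `(i * i) % 256` table with a named function; `make_square_mod_table`, `square_mod_table`, and `square_mod` are illustrative names, and C++17 or later is assumed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build the (i * i) % 256 table; constexpr lets the compiler run this loop.
constexpr std::array<int, 256> make_square_mod_table() {
    std::array<int, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<int>((i * i) % 256);
    }
    return table;
}

// Initializing a constexpr variable forces compile-time evaluation.
inline constexpr auto square_mod_table = make_square_mod_table();

// These checks fail the build, not the program, if the table is wrong.
static_assert(square_mod_table[3] == 9);
static_assert(square_mod_table[16] == 0);    // 256 % 256
static_assert(square_mod_table[255] == 1);   // 65025 % 256

// Runtime lookup: one indexed load, no multiplication or modulo.
int square_mod(std::uint8_t x) { return square_mod_table[x]; }
```

For an operation this cheap, the lookup is not necessarily faster than computing `(x * x) % 256` directly; the pattern pays off when the table entries are expensive to compute.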
Compiler Hints
// Branch prediction (C++20): the attribute goes on the statement, after the condition
if (ptr != nullptr) [[likely]] {
    process(ptr);
} else [[unlikely]] {
    handle_error();
}
// Assume (C++23)
void process(int x) {
    [[assume(x > 0)]]; // compiler may optimize based on this; undefined behavior if it is false
    // ...
}
// Restrict aliasing (compiler extension)
void add_arrays(float* __restrict a, const float* __restrict b, size_t n) {
    for (size_t i = 0; i < n; ++i) a[i] += b[i]; // no-alias promise helps auto-vectorization
}
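As a complete, compilable instance of the branch hints, the sketch below applies `[[likely]]` and `[[unlikely]]` inside a loop. `sum_valid` is a hypothetical function, and the hint rests on the stated assumption that negative samples are rare in real input; if that assumption is wrong, the hint can hurt rather than help.

```cpp
#include <cstddef>
#include <vector>

// Sum the non-negative samples and count the negative ones as errors.
// Assumption: negatives are rare, which is what justifies the hints.
long long sum_valid(const std::vector<int>& samples, std::size_t& error_count) {
    long long total = 0;
    error_count = 0;
    for (int s : samples) {
        if (s >= 0) [[likely]] {
            total += s;          // expected path
        } else [[unlikely]] {
            ++error_count;       // rare path
        }
    }
    return total;
}
```

The attributes do not change the result, only how the compiler may arrange the generated code. Modern branch predictors already learn stable patterns at run time, so it is worth confirming any gain with a profiler before keeping the hints.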
Custom Allocator (Pool)
#include <memory_resource>
// Stack-based buffer for temporary allocations
char buffer[4096];
std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
std::pmr::vector<int> fast_vec(&pool); // allocates from stack buffer
// No heap allocation until the buffer is exhausted; after that the pool falls back to its upstream resource (by default the heap)
for (int i = 0; i < 100; ++i) {
    fast_vec.push_back(i);
}
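To verify that allocations really come from the local buffer, one option is to pass `std::pmr::null_memory_resource()` as the upstream resource, so that running out of buffer space throws `std::bad_alloc` instead of silently using the heap. The sketch below assumes C++17 with `<memory_resource>` support; `sum_with_stack_buffer` is a hypothetical function name.

```cpp
#include <cstddef>
#include <memory_resource>
#include <new>
#include <vector>

// Fill a pmr vector from a 4 KiB local buffer and return the sum of its
// elements. With null_memory_resource() as upstream, any request that does
// not fit in the buffer throws std::bad_alloc.
int sum_with_stack_buffer(int count) {
    alignas(std::max_align_t) std::byte buffer[4096];
    std::pmr::monotonic_buffer_resource pool(
        buffer, sizeof(buffer), std::pmr::null_memory_resource());

    std::pmr::vector<int> values(&pool);
    values.reserve(static_cast<std::size_t>(count));  // one allocation from the buffer
    for (int i = 0; i < count; ++i) values.push_back(i);

    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}
```

A monotonic resource never reuses memory it has handed out, so a vector that grows without `reserve()` leaves each old block stranded in the buffer. Reserving once, as here, keeps the usage to a single block.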
Gotchas
- Issue: Premature optimization without profiling -> optimizing the wrong code path -> Fix: Profile first. As a rule of thumb, about 90% of the time is spent in 10% of the code. Optimize the hot path.
- Issue: `std::endl` flushes the stream on every line -> can be 10-100x slower than `'\n'` in output-heavy code -> Fix: Use `'\n'` for newlines, `std::endl` only when a flush is needed
- Issue: Passing `std::string` by value when only reading -> unnecessary copy -> Fix: Use `std::string_view` or `const std::string&` for read-only access
- Issue: `shared_ptr` atomic ref-count overhead in single-threaded code -> Fix: Use `unique_ptr` when shared ownership is not needed. In most implementations the `shared_ptr` ref count is updated atomically even if the pointer never crosses threads.
- Issue: Virtual function call in a tight loop prevents inlining -> Fix: Use CRTP for static polymorphism, or help the compiler devirtualize with `final`
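The CRTP fix from the last gotcha can be sketched as follows. `Shape`, `Square`, `Rect`, and `total_area` are hypothetical names; the point is that `area()` dispatches through a `static_cast` to the derived type instead of a vtable, so the call can be resolved and inlined at compile time.

```cpp
#include <vector>

// CRTP base: the derived type is a template parameter, so area() forwards
// to Derived::area_impl() without a virtual call.
template <typename Derived>
struct Shape {
    double area() const {
        return static_cast<const Derived&>(*this).area_impl();
    }
};

struct Square : Shape<Square> {
    double side;
    explicit Square(double s) : side(s) {}
    double area_impl() const { return side * side; }
};

struct Rect : Shape<Rect> {
    double width;
    double height;
    Rect(double w, double h) : width(w), height(h) {}
    double area_impl() const { return width * height; }
};

// One instantiation per concrete type; no vtable lookup inside the loop.
template <typename Derived>
double total_area(const std::vector<Derived>& shapes) {
    double total = 0.0;
    for (const auto& s : shapes) total += s.area();
    return total;
}
```

The cost of this design is that `Square` and `Rect` no longer share a common runtime base, so they cannot be stored in one container. When a heterogeneous collection is required, keeping the virtual interface and marking leaf classes `final` is usually the simpler option.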