Optimizing C++ Code with CRTP

Introduction

In C++, the choice between static and dynamic dispatch can have a significant impact on performance, especially in hot paths. Understanding when to use each approach and how to optimize for maximum efficiency is key to writing high-performance code. This post will explore the Curiously Recurring Template Pattern (CRTP) as a way to optimize dynamic dispatch, focusing on practical examples and comparing the resulting assembly.

Static vs Dynamic Dispatch

Dispatch refers to the mechanism that determines which function implementation is executed when a function call is made. In C++, we have two types of dispatch:

Static dispatch happens at compile time and generally results in more efficient code since the compiler knows exactly which function to call. Function inlining and optimizations are possible with static dispatch.
Dynamic dispatch occurs at runtime and is usually implemented using virtual functions. While dynamic dispatch allows for greater flexibility through polymorphism, it introduces an overhead due to the indirection of the vtable lookup. Dynamic dispatch also makes it harder for the compiler to inline the function for better optimization.

Dynamic Dispatch Example with a Virtual Function

Let’s consider a simple example that uses dynamic dispatch with virtual functions. Imagine we have a blending program, and each blending mode has a blendSingleColor() method that operates on a single color channel. We want to blend two images in a hot path (e.g., inside a tight loop). Here’s a basic implementation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44


#include <vector>
#include <memory>

class BlendingMode {
public:
    virtual ~BlendingMode() = default;
    virtual int blendSingleColor(int src, int dest) const = 0;
    void apply(const std::vector<int>& src, std::vector<int>& dest) const {
        for (size_t i = 0; i < src.size(); ++i) {
            dest[i] = blendSingleColor(src[i], dest[i]);
        }
    }
};

class NormalBlending : public BlendingMode {
public:
    int blendSingleColor(int src, int dest) const override {
        return src;
    }
};

class MultiplyBlending : public BlendingMode {
public:
    int blendSingleColor(int src, int dest) const override {
        return (src * dest) / 255;
    }
};

void blendWithMode(const BlendingMode& blendingMode, const std::vector<int>& src, std::vector<int>& dest) {
    blendingMode.apply(src, dest);
}

int main() {
    NormalBlending normalBlending;
    MultiplyBlending multiplyBlending;

    std::vector<int> src(1000, 150);
    std::vector<int> dest(1000, 50);

    blendWithMode(normalBlending, src, dest);
    blendWithMode(multiplyBlending, src, dest);

    return 0;
}

In this example, each call to blendSingleColor() involves a vtable lookup, which adds overhead. This overhead can be significant if blendSingleColor() is called repeatedly in a hot path with many pixels to process.

What is CRTP?

The Curiously Recurring Template Pattern (CRTP) is a technique in C++ where a class inherits from a template instantiation of itself. This allows the compiler to resolve function calls at compile time, effectively replacing dynamic polymorphism with static polymorphism.

Here is a simple CRTP framework:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


template <typename Derived>
class Base {
public:
    void interface() const {
        static_cast<const Derived*>(this)->implementation();
    }
};

class DerivedA : public Base<DerivedA> {
public:
    void implementation() const {
        // DerivedA specific implementation
    }
};

class DerivedB : public Base<DerivedB> {
public:
    void implementation() const {
        // DerivedB specific implementation
    }
};

In this framework, Base is a class template that uses the derived class as a template parameter. This allows Base to call methods in Derived at compile time, enabling the compiler to inline the functions and optimize them effectively.

With CRTP, the Base class can always be statically cast to the derived class (Derived). This is because the Base class knows the exact derived type it is working with at compile time. The use of static_cast<const Derived*>(this) enables the compiler to treat the current instance (this) as an instance of the derived class. This means that every call to a derived method can be resolved at compile time, providing concrete type information for optimization, such as inlining the derived implementation directly into the caller, avoiding any overhead associated with dynamic dispatch.

Moreover, this compile-time resolution ensures that all polymorphic behaviors are determined early, allowing the compiler to generate the most efficient code possible. Unlike dynamic polymorphism, which requires runtime type checks and vtable lookups, CRTP eliminates these runtime checks by relying on the known derived type during compilation.

Applying CRTP to the Blending Example

Now, let’s modify our original blending example to use CRTP instead of virtual functions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


template <typename Derived>
class BlendingModeCRTP {
public:
    void apply(const std::vector<int>& src, std::vector<int>& dest) const {
        for (size_t i = 0; i < src.size(); ++i) {
            dest[i] = static_cast<const Derived*>(this)->blendSingleColorImpl(src[i], dest[i]);
        }
    }
};

class NormalBlendingCRTP : public BlendingModeCRTP<NormalBlendingCRTP> {
public:
    int blendSingleColorImpl(int src, int dest) const {
        return src;
    }
};

class MultiplyBlendingCRTP : public BlendingModeCRTP<MultiplyBlendingCRTP> {
public:
    int blendSingleColorImpl(int src, int dest) const {
        return (src * dest) / 255;
    }
};

template <typename BlendingMode>
void blendWithMode(const BlendingMode& blendingMode, const std::vector<int>& src, std::vector<int>& dest) {
    blendingMode.apply(src, dest);
}

int main() {
    std::vector<int> src(1000, 150);
    std::vector<int> dest(1000, 50);

    NormalBlendingCRTP normalBlending;
    MultiplyBlendingCRTP multiplyBlending;

    blendWithMode(normalBlending, src, dest);
    blendWithMode(multiplyBlending, src, dest);

    return 0;
}

In this version, the apply function now uses CRTP to call blendSingleColorImpl() on the derived class. The compiler can resolve these calls at compile time, allowing for inlining and eliminating the runtime overhead of dynamic dispatch.

Comparing the Assembly

To see the performance impact, we can compare the assembly generated by the dynamic dispatch version with that of the CRTP version. Here’s an example on godbolt. In this test, both examples are compiled using gcc 14.2 with -O2 optimization. You may notice that the CRTP version, all the function calls are already optimized out, and even uses SIMD instructions for batched operation. However, in the virtual function version, the function calls still cannot be optimized, and the resulting program are doing function calls as-is. You may see that in this case, the virtual functions are slow not only because the cost of vtable, but more importantly losing the chance for further optimization.

Conclusion

CRTP can be a powerful tool when you need to optimize performance in C++ by replacing dynamic polymorphism with static polymorphism. By eliminating vtable lookups, CRTP enables the compiler to inline functions, resulting in faster code execution, particularly in performance-critical sections.

If you have scenarios where runtime polymorphism isn’t strictly necessary, CRTP is a great way to achieve polymorphic behavior without the runtime cost.

Introduction#

Static vs Dynamic Dispatch#

Dynamic Dispatch Example with a Virtual Function#

What is CRTP?#

Applying CRTP to the Blending Example#

Comparing the Assembly#

Conclusion#