The signals library achieves clean API design, standard conformity, the best possible runtime performance defeating virtual function, while also securing recursion- and modification-safety, all with just below 400 lines of code.

I finally got some time to clean up my library based on a previous research. Despite having implemented observer patterns of all different flavors, every time I learned a bit more - a new pitfall in the use case, a new optimization opportunity that was overlooked, or trade-offs I haven’t previously thought of. Learning Boost::signals2 lets me rethink the usability design; Despite its (relatively) high latency, it does a great job hiding the differences between signal types into a plain connection class, and thus the function signature is repeated to the bare minimum - exactly once in the callback body. I’ve also realized that a performant library does not necessarily need to be absolutely fastest (in the sense of overhead, not algorithmic complexity) in every interface it provides, because those not on the critical path don’t matter to the overall performance; They are the perfect opportunities to trade speed for easier use and flexibility.

Virtual Function Observer Pattern

Signal-slots is a form of observer pattern that dates back to GoF design patterns. If you have worked with any "event driven" codebase, the following code may look familiar:

void CStreamEngine::OnSystemEvent(ESystemEvent event, UINT_PTR wparam, UINT_PTR lparam)
	switch (event)
			// unpause the streaming engine, when init phase is done
			PauseStreaming(false, -1);
    ... 10 other cases ...

The well-known implementation of observer pattern stores a set (or alternatively a list or array) of objects, all inheriting from the same interface. This straightforward approach has many issues, but let me only focus on the performance aspect for now:

  • Bad iteration performance (hash table or list): with the typical unordered_set or list implementations, the cache misses would add up to the invocation overhead.
  • Bad removal performance (array): Using arrays, iteration would be optimal but removal becomes the new problem. Every time the listener unregisters itself from the source, which often happens in a dynamic world, it has to linear-search and erase the object, which leads to worst-case O(N^2) runtime. Swapping the element to be removed with the end would help reduce the complexity, but the linear lookup remains a bottleneck.
  • Unnecessary calls. The observer pattern is quite verbose considering that it relies on the collaboration of many delicate code snippets: to actually call OnSystemEvent(), there is also the need of ISystemEvent, std::vector<ISystemEvent*>, and typically also RegisterSystemEventListener(), RemoveSystemEventListener() and SendSystemEvent(), while keeping in mind that the names shall match and not conflict with any other event. The whole pattern is so tedious that at some point, people realize they could as well embed a convenient enum ESystemEvent to control the internal flow. When there are 20 commonly used ESystemEvents, instead of storing the objects into 20 different, much smaller lists, people usually would save them into a large list and call everyone for every event that ever happend in the game, even if they are only interested in one or two cases.

Fast Delegate, Revisited

Don Clugston’s Fast Delegate Library was probably the fastest approach of invoking an object’s member function bound at runtime, though the library involved compiler- and platform-specific hacks. Those were necessary because there was no standard data structure, or even a unified size to describe all possible kinds of member function pointers. The simple reason why it is the fastest possible way is that under the hood, the processor doesn’t care about the type but only the addresses of the function and its arguments, thus the fewer steps it needs to acquire the instructions and data, the better prefetching and the shorter delay. If the calling convention is consistent within the program being executed, there is no reason why casting the pointer back to the function wouldn’t work. Using all these compiler tricks, it allows storing a type-erased, “fat” pointer to the bound member function. Then a typical interface would take an object and a member method:

signal.connect(obj, &my_class::callback); //signal and callback have identical signatures

This is where my exploration started. After more than a decade, there must be a better, standard-conform solution to create a 0-overhead delegate with the help of all the modern C++ features. std::function is nice, but is at least as slow as virtual functions because of its virtual-based type erasure implementation. (As a side topic, I believe std::function doesn't have to be virtual-based, but I'm not aware of alternative implementation that focuses on reducing the overhead except my implementation of CRYENGINE small function)

Edit: I was corrected that

Gcc doesn’t use virtual functions in std::function (as I remember). There is a union of fn pointer + buffer.

Kudos to DerShokus!

Note that the signal needs to store a fat pointer to the actual member function, which might or might not conform to the same format. Actually, it’s not quite true that the bound member function is only known at runtime. The programmer’s intention is clearly known when the code was written - in most cases, the second argument &my_class::callback is simply a constant. The issue here is the lack of enforced compile-time argument evaluation, even with the newest standard. Fortunately, C++17 has provided a new way: taking the function pointer as a non-type template parameter with template<auto>.

But NTTP is nothing new? You might ask. Indeed, it has been there all the time and has always supported member function pointers (PMF), but PMFs that belong to distinct classes are of different types, which means they wouldn’t fit into the template parameter of a fixed type.

template<void(???::*)()> void connect(); // there was no way to define the type of this NTTP
// we need to support all of the following:

For a long time, the only workaround was asking the user to substitute the class type, so that the PMF can be defined based on the now-known class type:

template<typename A...> struct signal<void(A...)> {
    template<class C, void(C::*ptr)(A...)> //2nd parameter is based on the 1st
    void connect(C*); //usage: sig.connect<my_class, &my_class::callback>(obj);

Now with C++17 template<auto> the following code becomes possible -

template<typename A...> struct signal<void(A...)> {
    template<auto ptr, class C>
    void connect(C*); //usage: sig.connect<&my_class::callback>(obj);

I’ve explained the details of this technique in a previous blog Partial Specializing Template Auto. The user would still deal with the cryptic PMF syntax at the interface, but that’s about as good as a member function callback can get without using a macro. I’ll explain further below how lambdas are able to significantly improve the usability without adding overhead like it does in std::function or other functor containers.
Taking the PMF by template, rather than function parameter, allows storing a generated static function or capture-less lambda that internally calls the PMF with the correct type. This technique is often used to store class constructors or destructors, whose addresses cannot be directly taken, or when the type information needs to be preserved while the difference at the interface level is erased. Now, instead of storing the PMF of unknown size, we could save the static / capture-less function pointer void(*)(void*, A...) which has a predefined signature and a fixed size. No more undefined behaviors!

//PMF does not need to be captured because it’s a template argument
func = [](void* obj, A ... args) {((*reinterpret_cast<C**>(obj))->*PMF)(args...); }); 

Another important advantage is optimization. The invocation (obj.*pmf)(args...) only extracts the actual function at the time of calling, since it needs to support virtual functions. However, when pmf is a compile-time constant, such as &my_class::callback, the compiler would be able to fold it as if it was directly calling obj.callback(args...). It usually emits the optimal assembly that either inlines the function call, or contains a statically predictable jump, which approximately reaches the performance of inlined-calls through instruction prefetching.


Clearly after C++11, any library dealing with functions would need to support lambda. In the context of signal-slots, lambdas (or pre-C++11 functors) are quite different from member functions:

  • Lambda has its own lifetime, whereas a member function’s scope depends on this object. This means we must copy or move the lambda to somewhere in the signal.
  • Lambda’s operator() is known at compile time because the function is tied to the unique class representing the lambda. In other words, we know which member function to call without having to mention it in the template arguments. This leads to usability improvements:
class game { /*...*/ };
fteng::signal<void(const game& instance)> game_created;

class subsystem{
  //Connects a signal with a lambda capturing 'this'
  fteng::connection on_game_created = game_created.connect([this](const game& instance){
    std::cout << "Game is created.\n";

A lambda object capturing this has exactly the size of a pointer, which fits perfectly into the delegate. The callback in the example above should run as fast as a member function. However, if the lambda is larger than the size of pointer, the object has to be stored separately. In contrast, the call operator can still be directly saved and thus benefits from triggering fewer cache misses.

In my next blog I'm going to talk about the design choices of containers, the RAII connection object, move-semantic support and modification safety!

The code is available on Github and can be directly tested here.