Dynamic Memory Allocation: Justifiably Taboo?
In today’s world of multi-core processors and multi-threaded applications, developers need to constantly think about how to harness the power of multiple CPU cores. Increasing application performance depends on the proper use of multiple application threads, which in turn hinges on the right approach to memory management.
In particular, multi-threaded applications running on multi-core systems can slow down markedly when performing many memory allocations. Often an application will run fine with a single CPU, but placing it on a system with two or more processors or processor cores yields a slowdown in performance, not the expected doubling of performance. This performance impact is very easy to miss at the application’s design stage as it is hidden deep inside the C runtime library.
Why does the standard list allocator perform so poorly in a multi-core environment? It stems from the fact that this allocator’s chain of pointers is a shared resource that must be protected. Chaos would ensue if a thread in the middle of breaking the chain to insert a new link, was interrupted by a context switch, and another thread tried to walk the (now broken) chain. So, the chain is protected by a mutex to prevent concurrent access and preserve the allocator’s consistency (picture 1).
Locking a mutex presents minor overhead when there are no, or few, conflicts. However, as the number of malloc and free calls increases, contention for this shared resource increases, creating a lock conflict. To resolve the conflict, the operating system imposes a context switch, suspending the thread that attempted to access the allocator and inserting it into the kernel’s waiting queue. When the allocator is released, the thread is allowed to run and access the allocator.
Even if each thread accesses only objects that it created, and so otherwise requires no synchronization, there is still only one memory allocator; the allocator doesn’t “know” that no synchronization is required, and acts to protect its meta-data, which results in a lot of conflicts between threads. As a result, the same application would perform better on a single CPU because the CPU can be kept busy with other tasks (it does not try to schedule tasks that are in the waiting queue). Conversely, in a multi-core setting, all but one core can be idled, with respect to dynamic memory management, because of the serialization of access to the heap (picture 2).
The ideal solution, from a performance standpoint, would be for each thread to allocate objects on the stack rather than in dynamic memory (in other words, with local variables declared in the function body). However, this simplified approach is rarely viable: thread stack size is limited and therefore allocating large objects or a large number of smaller objects on the stack is impossible.
A more practical approach is to provide a separate memory allocator for each thread, so that each allocator manages memory independently of the others. This approach is called a thread-local allocator.
The thread-local allocator is a custom allocator that avoids creating locking conflicts when objects are allocated and released within a single task. When allocating in one task and de-allocating in another, a lock is, of course, required. However, this allocator takes measures to minimize those locking conflicts.