Dynamic Memory Allocation: Justifiably Taboo?

Implementing the Thread-Local Memory Manager

The implementation of the thread-local memory manager rests on two concepts: the block allocator algorithm discussed earlier, and thread-local storage (TLS).

Thread-local storage provides a means to map global memory to a local thread’s memory. Normally, a global variable refers to the same memory location no matter which thread of a process accesses it. Sometimes, however, it is advantageous for different threads to refer to the same global variable while actually accessing different memory locations; thread-local storage accomplishes this. Likewise, the thread-local allocator maps portions of the global heap to individual threads.
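
As a minimal sketch of the concept (this example is purely illustrative and not part of the memory manager itself), C11’s _Thread_local storage class gives each thread its own copy of a “global” variable:

    #include <stdio.h>
    #include <pthread.h>

    /* Each thread that touches this "global" gets its own copy. */
    static _Thread_local int tls_counter = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        /* Increments the calling thread's own instance; no lock needed. */
        for (int i = 0; i < 1000; i++)
            tls_counter++;
        printf("this thread's tls_counter = %d\n", tls_counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Each thread prints 1000, not 2000: the variable is per-thread. */
        return 0;
    }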

As noted, the thread-local memory manager is based on the block allocator algorithm discussed earlier. The allocator creates and maintains a number of chains of same-size small blocks that are carved out of large pages. To allocate memory, the allocator simply unlinks a block from the appropriate chain and returns the pointer to that block to the application. When a new large page is necessary, the allocator can use a general-purpose memory manager (standard malloc) to allocate the page.
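
To make the mechanics concrete, here is a minimal sketch of such a block allocator. The names (chain_t, chain_alloc and so on) are this article’s illustrations, not the actual implementation, and error handling is reduced to the essentials:

    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* Free blocks are linked through their own storage, so the block
       size must be at least sizeof(block_t). */
    typedef struct block { struct block *next; } block_t;

    typedef struct {
        size_t   block_size;  /* size of each block in this chain */
        block_t *free_list;   /* head of the chain of free blocks */
    } chain_t;

    /* Carve a fresh page (obtained from the general-purpose allocator)
       into same-size blocks and link them onto the chain. */
    static int chain_grow(chain_t *c)
    {
        char *page = malloc(PAGE_SIZE);  /* standard malloc, as in the text */
        if (!page) return -1;
        size_t n = PAGE_SIZE / c->block_size;
        for (size_t i = 0; i < n; i++) {
            block_t *b = (block_t *)(page + i * c->block_size);
            b->next = c->free_list;
            c->free_list = b;
        }
        return 0;
    }

    /* Allocation: unlink the head block. No locks are needed when the
       chain belongs to exactly one thread. */
    static void *chain_alloc(chain_t *c)
    {
        if (!c->free_list && chain_grow(c) != 0)
            return NULL;
        block_t *b = c->free_list;
        c->free_list = b->next;
        return b;
    }

    /* Local free: relink the block onto the chain. */
    static void chain_free(chain_t *c, void *p)
    {
        block_t *b = p;
        b->next = c->free_list;
        c->free_list = b;
    }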

As long as all objects are allocated and de-allocated locally (by the same thread), this algorithm requires no synchronization at all: each thread has its own allocator and therefore never contends with another thread when allocating or de-allocating its own memory.

What happens when objects are not local? The memory manager maintains a Pending Free Requests List (PRL) for each thread. When an object allocated in one thread is de-allocated by another thread, the de-allocating thread simply links the object into the owning thread’s PRL. Of course, PRL access is protected by a mutex. Each thread periodically drains its PRL, de-allocating all pending objects at once. When does this occur? It could be based on a timer, or on a certain number of pending requests, or on a certain amount of memory having accumulated in the PRL, or on any other application-specific criteria.
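
Building on the earlier sketch (again with invented names; prl_free and prl_drain are not a documented API), the cross-thread path might look like this: a remote free just links the object into the owner’s PRL under the PRL mutex, and the owner later drains the whole list and frees each object locally:

    #include <pthread.h>

    typedef struct prl_node { struct prl_node *next; } prl_node_t;

    typedef struct {
        pthread_mutex_t lock;     /* protects the pending list      */
        prl_node_t     *pending;  /* objects freed by other threads */
    } prl_t;

    /* Called by a thread that does NOT own the object: link it into
       the owner's PRL. The only lock taken is that PRL's mutex, so
       only threads sharing this PRL can contend. */
    static void prl_free(prl_t *prl, void *obj)
    {
        prl_node_t *n = obj;
        pthread_mutex_lock(&prl->lock);
        n->next = prl->pending;
        prl->pending = n;
        pthread_mutex_unlock(&prl->lock);
    }

    /* Called periodically by the owning thread (on a timer, after N
       pending requests, etc.): detach the whole list in one short
       critical section, then free each object locally without locks. */
    static void prl_drain(prl_t *prl, chain_t *c)
    {
        pthread_mutex_lock(&prl->lock);
        prl_node_t *list = prl->pending;
        prl->pending = NULL;
        pthread_mutex_unlock(&prl->lock);

        while (list) {
            prl_node_t *next = list->next;
            chain_free(c, list);  /* local free from the earlier sketch */
            list = next;
        }
    }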

It’s important to note that regardless of the criteria, this approach significantly reduces the number of synchronization requests. First, objects are often freed by the same thread that allocated them. Second, even when an object is de-allocated by a different thread, the operation does not interfere with all other threads, only with those that use the same PRL. For example, assume you have eight threads, and you know from your application’s logic flow that memory allocated by thread #1 will only ever be de-allocated by threads #4 or #7. Locking that PRL from any one of those threads will therefore interfere with at most the other two threads, not with all seven other threads, as would be the case with the default allocator. In this way, locking conflicts are reduced even when allocation/de-allocation is not “local.”

Following is a diagram of the internal structures of thread-local allocators (Diagram 4).

The thread-local allocator can create an arbitrary number of block chains, each holding blocks of a different size; the diagram shows just one possible configuration.
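
Continuing the sketch (again with invented names and arbitrary size classes, not the product’s actual configuration), a per-thread allocator might keep one chain per size class and round each request up to the nearest class:

    /* Hypothetical per-thread allocator state: one chain per size class. */
    #define NUM_CLASSES 4
    static const size_t class_sizes[NUM_CLASSES] = { 16, 32, 64, 128 };

    typedef struct {
        chain_t chains[NUM_CLASSES];  /* chain_t from the earlier sketch   */
        prl_t   prl;                  /* pending frees from other threads  */
    } tl_allocator_t;

    /* Round a request up to the smallest class that fits; fall back to
       the general-purpose allocator for oversized requests. */
    static void *tl_alloc(tl_allocator_t *a, size_t size)
    {
        for (int i = 0; i < NUM_CLASSES; i++)
            if (size <= class_sizes[i])
                return chain_alloc(&a->chains[i]);
        return malloc(size);  /* oversized: delegate to standard malloc */
    }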

To applications, the allocator exports three functions with syntax similar to the standard C runtime allocation API: thread_malloc(), thread_realloc() and thread_free(). For applications written in C++, the memory manager’s interface also includes a simple way to redefine the default new and delete operators.
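
The article does not give the exact signatures; assuming they mirror the standard malloc()/realloc()/free(), application code might look like this:

    #include <stddef.h>

    /* Hypothetical signatures, mirroring the standard C API. */
    void *thread_malloc(size_t size);
    void *thread_realloc(void *ptr, size_t size);
    void  thread_free(void *ptr);

    void example(void)
    {
        char *buf = thread_malloc(64);   /* local, lock-free allocation   */
        buf = thread_realloc(buf, 128);  /* grow, possibly re-chained     */
        thread_free(buf);                /* local free, or via the PRL    */
    }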

We developed two tests to examine the impact of the thread-local memory manager. The first test compares performance of the thread-local allocator and the standard C runtime allocator when the allocation pattern is thread-local: all de-allocations are performed by the same thread as the original allocations. This is a “best case” scenario.

The second test compares performance when objects are allocated by a “producer” thread and freed by a “consumer” thread (Pictures 3 and 4). This is a “worst case” scenario.
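
The benchmark source is not published; purely as an illustration of the worst-case test’s shape, a producer/consumer pair with a small lock-based hand-off queue might look like this, using thread_malloc()/thread_free() as declared above:

    #include <pthread.h>

    #define N_OPS 10000000  /* 10 million allocation/free pairs, as in the test */
    #define Q_CAP 1024

    static void *queue[Q_CAP];
    static int q_head, q_tail, q_count;
    static pthread_mutex_t q_lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N_OPS; i++) {
            void *obj = thread_malloc(64);  /* allocated by this thread */
            pthread_mutex_lock(&q_lock);
            while (q_count == Q_CAP)
                pthread_cond_wait(&q_not_full, &q_lock);
            queue[q_tail] = obj;
            q_tail = (q_tail + 1) % Q_CAP;
            q_count++;
            pthread_cond_signal(&q_not_empty);
            pthread_mutex_unlock(&q_lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N_OPS; i++) {
            pthread_mutex_lock(&q_lock);
            while (q_count == 0)
                pthread_cond_wait(&q_not_empty, &q_lock);
            void *obj = queue[q_head];
            q_head = (q_head + 1) % Q_CAP;
            q_count--;
            pthread_cond_signal(&q_not_full);
            pthread_mutex_unlock(&q_lock);
            thread_free(obj);  /* freed by a different thread: goes via the PRL */
        }
        return NULL;
    }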

We ran these tests on a Sun Fire X4450 system with four 6-core Xeon processors and 24 GB of memory. The first test performed 10 million allocation/free pairs in each of 24 threads (for a total of 240 million allocation/free pairs). Because all the allocations were local and required no synchronization, the thread-local allocator utilized all 24 cores. Standard malloc, of course, could only utilize a single core, and this accounts for the dramatic performance difference.

The second test also performed 10 million allocation/free pairs, but with only two threads. In this case performance improved, but only by about 20%, due to three factors: (1) allocation requires no synchronization; (2) there was some benefit from reduced synchronization on the PRL (though minimal, because there were only two threads); and (3) the block allocator underlying the thread-local allocator is simply a superior allocator compared to standard malloc.

The results show that significant performance improvements can be obtained by replacing the standard allocation mechanism with a thread-local allocator, especially as the number of cores increases. This is a classic case of “your mileage may vary” (YMMV): the benefit any given application will experience is a function of (1) the number of cores, (2) the ratio of local to global allocations, and (3) the logic flow that determines the number of synchronizations on the PRLs.

Conclusion

Approaches to memory management significantly affect embedded code safety, performance and predictability, as well as prospects for DO-178B airborne software certification. Dynamic memory allocation is inherently risky, and it can and should be eliminated from safety-critical processes. General-purpose C language memory allocators are not optimized for embedded systems; where possible, they should be replaced with custom allocators that deliver safety, predictability and reduced overhead when allocating memory. A number of algorithms can be considered, including bitmap allocators, block allocators and stack-based allocators. Finally, the performance of multi-threaded applications on multi-core systems can be improved with a custom thread-local allocator.

* Steven Graves co-founded McObject in 2001. As the company’s president and CEO, he has spearheaded McObject’s growth as a provider of embedded database system software and worked with customers across a wide range of embedded systems market segments. Prior to McObject, Mr. Graves was president and chairman of Centura Solutions Corporation, and VP/worldwide consulting for Centura Software Corporation (NASDAQ: CNTR); he also served as President and Chief Operating Officer of Raima Corporation.
