Multicore Manycore Design – Going Beyond 8 Cores

Author / Editor: Masaki Gondo* / Martina Hafner

This paper introduces eMCOS, a new manycore real-time OS. eMCOS is a distributed micro-kernel OS based on a message-passing model.

Each micro-kernel is very small, yet it gives applications an SMP view that abstracts the physical cores. eMCOS has a hierarchical server architecture, which includes its schedulers. The basic design of eMCOS is presented, and a novel semi-priority-based scheduling algorithm is explained.


In experiments, the new algorithm outperforms the others significantly as the deviation in the amount of work performed by application threads becomes larger, a condition that resembles real systems.


It is clear that the number of cores embedded in a single chip is increasing. Many of these are heterogeneous cores, but we are also seeing more homogeneous cores. A heterogeneous approach is often considered an optimal way to attain better performance per watt. However, this is often a consequence of targeting a specific application, so it may not scale to another application.

This becomes an issue as process technology advances, because it is more and more difficult to find a market large enough to consume an application-specific chip. In general, this calls for more homogeneous cores, which are more scalable.

To be precise, a typical chip will be heterogeneous as a whole. A potentially typical configuration has a (cluster of) large general-purpose cores, possibly a (cluster of) graphics cores, several special-purpose accelerator cores, and one or more clusters of many small homogeneous cores. Further, a variety of memory subsystems can be expected.

Each cluster may feature a cluster-local shared memory and an external memory that is not directly accessible from the cores. Alternatively, each core may feature a core-local memory that other cores can access, forming a distributed shared memory. Or each core may host a multi-level cache that is kept coherent with the others, together with a globally shared memory.

The heterogeneous nature of a chip as a whole is already common, and many such chips are in production. The significant change will be the many homogeneous cores and their clusters, and the management of the heterogeneous system as a whole, including the memory.

Note that in this paper, heterogeneous cores that share the same ISA but differ in gate count, and therefore in performance characteristics (clock, pipeline depth, execution order, etc.), are treated as homogeneous cores with different, fixed DVFS points.

The rise in the number of cores calls for more software to run on them and for management of that software. The most popular approach to managing runtime software is to use an operating system, in this case a manycore OS. For multicore processors, where the number of cores is typically four or not much greater, either the Asymmetric Multiprocessing (AMP) or the Symmetric Multiprocessing (SMP) model is used.

AMP is widely used in embedded systems because it makes real-time processing easier to secure by avoiding hard-to-predict interference such as thread migration and cache-related issues. Its design cost is quite significant, however: all threads must be statically allocated to cores, and core-to-core thread communication, device sharing, and service sharing must all be dealt with.

Likewise, a hypervisor-based partitioning model suffers from the same problem, simply because it is still the AMP model in principle and cannot escape this architectural property. Therefore, although some OSes attempt to solve this by blending AMP into SMP, such as eT-Kernel Multi-Core Edition developed by eSOL, the increasing number of cores generally calls for an SMP-model manycore OS from the software system point of view.

This is particularly important because the cost of software development is paramount and the software must be reused. Reuse has a similar impact on software quality, which is a significant concern in safety-critical embedded systems.

Implementing an SMP OS on a manycore processor raises new issues for the OSes currently deployed in embedded systems. One is the lack, or the overhead, of a cache coherency mechanism. Although some manycore processors offer cache coherency, quite a few do not offer hardware cache coherency, mostly because of the various side effects it introduces.

Most SMP OSes used in embedded systems require hardware cache coherency and a globally shared memory. Even on manycore processors with cache coherency, maintaining coherency between distant caches on a chip is much more costly than on, for example, a current dual-core processor.

As discussed, the increasing number of cores, the variety of memory subsystems, and the management of heterogeneity present architectural implementation challenges. Yet another challenge for a manycore OS is scheduling.

In principle, load balancing of tasks or threads is an efficient mechanism, given that the variation in software is growing and that the increasingly dynamic nature of processing makes it difficult to assess scheduling statically at design time.

However, it is well known that while load balancing achieves better average throughput, it degrades real-time capability because of the inherent overhead and uncertainty of thread migration. Therefore, a manycore real-time OS needs to strike a balance that achieves higher average throughput while still assuring real-time processing.
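
The trade-off between static allocation and load balancing can be illustrated with a toy model. The sketch below is not the eMCOS scheduler and not any algorithm from this paper; it is a plain greedy "least-loaded core" placement compared against a fixed round-robin pinning. The function names, the workloads, and the `dispatch_cost` parameter (a fixed overhead charged for every dynamic placement, standing in for migration cost) are all assumptions made for illustration.

```python
import heapq


def static_makespan(work, n_cores):
    """AMP-style model: thread i is pinned to core i % n_cores at design time.

    Returns the finishing time of the busiest core (the makespan).
    """
    load = [0] * n_cores
    for i, w in enumerate(work):
        load[i % n_cores] += w
    return max(load)


def balanced_makespan(work, n_cores, dispatch_cost=0):
    """Load-balanced model: each thread goes to the currently least-loaded core.

    dispatch_cost is a hypothetical fixed overhead added per placement,
    a crude stand-in for thread-migration cost.
    """
    # Min-heap of (accumulated load, core id); all cores start idle.
    heap = [(0, core) for core in range(n_cores)]
    for w in work:
        load, core = heapq.heappop(heap)
        heapq.heappush(heap, (load + w + dispatch_cost, core))
    return max(load for load, _ in heap)


if __name__ == "__main__":
    skewed = [8, 1, 1, 1, 8, 1, 1, 1]   # large deviation in per-thread work
    uniform = [3] * 8                   # no deviation

    print(static_makespan(skewed, 4))        # 16
    print(balanced_makespan(skewed, 4))      # 9
    print(balanced_makespan(skewed, 4, 1))   # 11
    print(static_makespan(uniform, 4))       # 6
    print(balanced_makespan(uniform, 4, 1))  # 8
```

With the skewed workload, dynamic placement finishes well ahead of the static pinning even when every placement pays the assumed overhead. With the uniform workload, the same overhead makes dynamic placement slower than the static allocation. The model captures only throughput; the uncertainty in response time that migration introduces is not represented.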