Implementing Efficient C Code for ARM Devices

Author / Editor: Chris Shore* / Martina Hafner

In any development, some degree of “optimization” is almost inevitable in order to develop software which is performant and efficient. When optimizing software, it is crucial to establish your optimization goals and then work within the capabilities and constraints of the tools, the language, the processor and the target system to realize the best possible outcome.



Software engineers don’t just eat pizza and drink coffee! They spend a lot of time developing software. But how do they really spend that time? Surveys show that the greatest portion is spent on optimizing, reviewing and maintaining existing code. The smallest part is spent actually writing new code. So optimization is a major activity for the majority of engineers.

Optimization is a combination of several activities, working towards a variety of goals. You may care about speed, or code size, or data size, or the amount of data processing, or a number of other criteria. These goals are often mutually exclusive. For instance, faster code is typically larger code. The key is to decide on a goal and work consistently towards that.

Coding standards do not often mention optimization. Instead, they concentrate on things like reliability, readability, portability and so on. These are all very admirable goals but none of them address the efficiency of your code. When asked what they themselves really care about, however, engineers report “performance” as their top priority.

There is a key difference between the priorities of coding standards and the priorities of the engineers who actually write the code. So how do we resolve this?

I believe that an engineer should always prioritize the coding standard rules, for example readability, over considerations of performance. Sometimes, though, it is necessary to break those rules in order to achieve some optimization goal. This means that it is extremely important to identify very carefully those areas in which you are to concentrate your optimization efforts.

A good simple rule is that 90% of the execution time is spent in 10% of the code. So, use whatever tools you have to identify that 10% and spend 90% of your time optimizing it. That way, you get the biggest payback.

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%,” says Donald Knuth (to whom all software engineers should pay serious attention!).

We now consider several areas in which you can optimize your code: Language, Hardware, Tools and Platform.

Language

When writing in almost any high level language, and here we concentrate on C as it is most widely used by far, it is important to remember that “fewer lines of code does not make it faster”. For instance, writing complex expressions on one line is no faster than splitting the same expression over multiple lines, possibly making use of temporary variables to split it over several statements. In fact, short code is sometimes slower.

What is undeniable is that shorter, compressed code is always harder to read, harder to review and harder to maintain.
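As a minimal sketch of this point (the function names and the arithmetic are illustrative, not from the original article), the two versions below compute the same thing; a modern compiler generates essentially the same instructions for both, but only one of them is easy to review:

```c
#include <stdint.h>

/* Compact one-liner: no faster, just harder to read. */
int32_t energy_compact(int32_t a, int32_t b, int32_t c)
{
    return (a * a + b * b + c * c) - ((a + b + c) * (a + b + c)) / 3;
}

/* Same expression split over several statements with temporaries.
   The compiler's optimizer sees through the temporaries, so the
   generated code is equivalent -- but each step is now reviewable. */
int32_t energy_split(int32_t a, int32_t b, int32_t c)
{
    int32_t sum_sq = a * a + b * b + c * c;
    int32_t sum    = a + b + c;
    int32_t mean_corr = (sum * sum) / 3;
    return sum_sq - mean_corr;
}
```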

Language ambiguity

Be careful of any ambiguity in the language. For instance, compiler writers are free to choose whether the char type is signed or unsigned. And ambiguity sometimes extends to the underlying machine as well. For instance, different hardware architectures will behave differently when asked to shift a 32-bit variable left by more than 32 bits.


The following example results in unpredictable behavior unless the signed-ness of the char type is known at compile-time.
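A minimal sketch of the kind of code in question (function names are illustrative): the AAPCS makes plain char unsigned on ARM, while x86 ABIs typically make it signed, so the first function below answers differently on each platform.

```c
#include <limits.h>

/* Whether plain 'char' is signed or unsigned is an implementation-
   defined choice (C11 6.2.5). The ARM ABI makes plain char unsigned;
   x86 ABIs usually make it signed, so this test gives different
   answers depending on the target. */
int char_is_negative(char c)
{
    return c < 0;
}

/* Portable fix: state the signedness you actually mean. */
int schar_is_negative(signed char c)
{
    return c < 0;
}
```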


The following results in undefined behavior. It is syntactically valid C, but the standard leaves the result undefined, so what actually happens depends on the processor on which the code is executed.
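A sketch of the oversized-shift case (names are illustrative): shifting a 32-bit value by 32 or more is undefined behavior in C (C11 6.5.7), and the hardware genuinely differs. An ARM register-specified LSL uses the bottom eight bits of the shift amount, so shifts of 32 or more produce zero; an x86 SHL masks the amount to five bits, so a shift by 33 behaves like a shift by 1.

```c
#include <stdint.h>

/* Undefined if n >= 32: ARM gives 0, x86 gives x << (n & 31). */
uint32_t shift_left(uint32_t x, unsigned n)
{
    return x << n;
}

/* Defensive version with a defined result for any shift amount. */
uint32_t shift_left_safe(uint32_t x, unsigned n)
{
    return (n < 32) ? (x << n) : 0;
}
```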

Language features

Consider also the constraints which are placed on compiler optimization by features which are built into the language. Pointers in C are extremely powerful but, because of that power, the phenomenon of pointer aliasing greatly restricts the freedom of the compiler in carrying out optimizations.


In the following sequence, each input value must be loaded twice. This is because the compiler must assume that there is some possibility that the output array and the input array overlap.

Sometimes, languages provide extra keywords (in this case the ‘restrict’ keyword) which can help here.
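The sequence described above can be sketched as follows (function names are illustrative). In the first version, the compiler must assume the output and input arrays may overlap, so each shared input element is reloaded after every store; the restrict-qualified version (C99) promises no overlap and frees the compiler to keep values in registers:

```c
/* Without restrict: after the store to out[0], the compiler cannot
   prove out and in do not overlap, so in[1] must be loaded again
   for the second statement. */
void add_pairs(int *out, const int *in)
{
    out[0] = in[0] + in[1];
    out[1] = in[1] + in[2];   /* in[1] reloaded from memory */
}

/* With restrict: the programmer promises the arrays are disjoint,
   so in[1] can stay in a register across both statements. */
void add_pairs_restrict(int * restrict out, const int * restrict in)
{
    out[0] = in[0] + in[1];
    out[1] = in[1] + in[2];   /* in[1] reused from a register */
}
```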

Language definition

Consider also where the precise definition of the language may work against you. For instance, most integer data types default to a signed representation. This means that most arithmetic operations are carried out to generate a signed result, often requiring extra instructions to normalize the result for correct size and sign.

If you actually want unsigned arithmetic, make sure that your types are defined correctly to avoid these unnecessary operations.


The following simple division (which, with an unsigned variable, could be implemented using a simple shift) results in quite a complex sequence of instructions, as the compiler needs to maintain the correct sign and rounding of the result.
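A minimal sketch of the two cases (names are illustrative): signed division by a power of two cannot be a plain arithmetic shift, because an arithmetic shift rounds toward minus infinity while C requires rounding toward zero, so the compiler must emit fix-up instructions. The unsigned version is a single logical shift right.

```c
#include <stdint.h>

/* Signed: shift plus sign fix-up code, e.g. adding a bias derived
   from the sign bit before shifting, so that -33/32 == -1 and not -2. */
int32_t div32_signed(int32_t x)
{
    return x / 32;
}

/* Unsigned: compiles to a single LSR #5. */
uint32_t div32_unsigned(uint32_t x)
{
    return x / 32;
}
```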

Meta-information

Most languages incorporate features which allow you to pass “meta-information” to the compiler which will assist in code generation. A common example is the ‘const’ modifier which allows you to inform the compiler that a particular data item will not change.

First, this means that a ‘const’ variable can be allocated to ROM rather than occupying valuable RAM. Secondly, when used to modify a function parameter, the compiler is informed that the function will not change this parameter. This serves two functions: firstly, the compiler can warn you if you attempt to change it; secondly, the compiler knows that it can optimize in the knowledge that this item will not change.


In the following example, the compiler does not need to reload ‘foo’ in the if statement as it is able to assume that the function call will not change it.
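A sketch of that situation (the names `foo` and `bar` follow the article's text; `bar`'s body is a placeholder): because `foo` is a const object, the compiler may assume the function call does not modify it, so the value does not need to be reloaded for the if test.

```c
/* A const object: legally, no function call may modify it, so the
   compiler may keep its value in a register across calls (and can
   place it in ROM rather than RAM). */
static const int foo = 100;

/* Placeholder for some unrelated work. */
static void bar(void)
{
}

int check(void)
{
    int x = foo;
    bar();            /* cannot (legally) change foo */
    if (foo > 50)     /* no reload needed: foo is known not to change */
        return x;
    return 0;
}
```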

Hardware

When developing for an embedded processor, you are typically writing within several constraints. It is important to know what these limits are and to work within them.

Instruction set

Any processor supports a well-defined instruction set. You should familiarize yourself with that instruction set and make sure that you use it effectively. In the case of ARM processors, most support at least two instruction sets, and sometimes three.

It is a compile-time choice which instruction set to use. Usually, the Thumb-2 instruction set is used as it provides the best balance between performance, functionality and code density. In some circumstances, it can be beneficial to select the ARM instruction set for high performance in critical code regions.

Assembly code

Traditionally, assembly code has been used to extract maximum performance from machines by optimizing data access and by re-ordering instructions to minimize pipeline hazards and load-use penalties. For two reasons, this is less useful in a modern system.

Firstly, compilers are much better at exploiting the architecture. Secondly, modern processors often incorporate out-of-order or superscalar execution hardware which works to minimize pipeline hazards at execution time.

Branches

The area of branches, though, is one in which it is worth understanding how the underlying hardware behaves. In processors which support even simple branch prediction, the behavior of branches is crucial to execution performance. On an ARM11, for instance, a predicted branch can take as little as zero cycles, while a mis-predicted branch can take as many as seven.

Understanding how that branch prediction works can allow you to structure your software to work with it rather than against it. It is also worth knowing that certain kinds of branch are inherently non-predictable. On an ARM processor, these include branches which change state, instructions which set the condition codes at the same time as modifying the PC, and instructions which change mode.

Most ARM prediction schemes tend to predict that backward branches will be taken and that forward branches will not (more modern schemes start with this default before rapidly training themselves on the way in which the code actually behaves). This leads to the easy conclusion that loops should have the test and conditional branch at the bottom of the loop rather than at the top. In C source, ‘do’ loops are an example of the former, while ‘for’ and ‘while’ loops test at the top (though an optimizing compiler will often rotate them into the bottom-test form).
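A minimal sketch of the two loop shapes (names are illustrative):

```c
/* Top-test loop: the entry test compiles to a forward conditional
   branch, which a static predictor assumes not-taken. */
int sum_while(const int *p, int n)
{
    int s = 0;
    int i = 0;
    while (i < n) {
        s += p[i];
        i++;
    }
    return s;
}

/* Bottom-test loop: the test and conditional branch sit at the bottom
   as a backward branch, which a static predictor assumes taken -- the
   right guess for every iteration except the last. The caller must
   guarantee n >= 1. */
int sum_do(const int *p, int n)
{
    int s = 0;
    int i = 0;
    do {
        s += p[i];
        i++;
    } while (i < n);
    return s;
}
```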

By the same logic, a sequence of ‘if’ statements will be more efficient if the most commonly taken block is placed at the top, with less common blocks placed lower down.

On most ARM systems, the branch range should also be considered. In Thumb-2, for instance, the maximum branch distance of a single branch instruction is 16MB. A branch longer than this will need extra code to be inserted, either by the compiler or the linker, to extend the branch. This will cost time and space.

Division

Recall that ARM processors do not, in general, have division hardware. This means that most integer divide operations will call a runtime library function and will consume several tens or hundreds of cycles. If the divisor is a constant, the compiler can use much more efficient algorithms and avoid using the library.


Remember that a modulo operation is really a division, so test-and-reset is often quicker than using modulo to implement a cyclic counter. The following examples illustrate this.
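A sketch of the two cyclic counters (names are illustrative):

```c
/* Modulo version: '%' compiles to a divide -- on ARM cores without
   divide hardware, that means a call into the runtime library costing
   tens of cycles. */
unsigned next_mod(unsigned i, unsigned limit)
{
    return (i + 1) % limit;
}

/* Test-and-reset version: one compare plus a conditional move or
   branch -- typically far cheaper than a division. */
unsigned next_wrap(unsigned i, unsigned limit)
{
    i++;
    if (i >= limit)
        i = 0;
    return i;
}
```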

Memory systems

Embedded systems generally have very constrained memory systems. Many do not have cache; those that do often only have a very small amount. It is important to know the memory and cache architecture of your system and develop code which is sensitive to it.

For instance, working with a dataset which is significantly larger than the available cache can actually make your code run more slowly due to data being evicted before it can be re-used. Using cache-friendly access patterns and making use of pre-load instructions can make a big difference.
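As a sketch of what a cache-friendly access pattern means in practice (the functions are illustrative, not from the original article): C stores two-dimensional arrays row-major, so traversing by row uses every byte of each fetched cache line, while traversing by column strides through memory and may miss on every access once the matrix exceeds the cache.

```c
#include <stddef.h>

#define N 64

/* Column-major traversal of a row-major array: consecutive accesses
   are N * sizeof(int) bytes apart, so each touch may be a cache miss
   and lines can be evicted before their other elements are used. */
long sum_by_column(int m[N][N])
{
    long s = 0;
    for (size_t col = 0; col < N; col++)
        for (size_t row = 0; row < N; row++)
            s += m[row][col];
    return s;
}

/* Row-major traversal: accesses are sequential, so every cache line
   fetched is fully consumed before it can be evicted. */
long sum_by_row(int m[N][N])
{
    long s = 0;
    for (size_t row = 0; row < N; row++)
        for (size_t col = 0; col < N; col++)
            s += m[row][col];
    return s;
}
```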

When using virtual memory, be aware of the size and limitations of the Translation Lookaside Buffers (TLB) in the Memory Management Unit (MMU). These are used to cache details of virtual-physical address translations. In an ARM system, a 64-entry TLB can cache sufficient translations to cover anywhere between 256KB and 4MB, depending on the translation granularity.

In some applications, this can be a severe limitation and careful data allocation in memory can make a big difference.

Data alignment

Almost all ARM processors can access aligned data more efficiently than unaligned data. Sometimes it is advantageous to pack data structures to remove unnecessary padding and therefore reduce data memory size. This can result in unaligned structure fields which can be slow to access.

Tools

Know your compiler well and know what features it supports. A few examples of features which can help the compiler in optimization are ‘pure’, ‘restrict’ and ‘packed’. These can allow the programmer to convey meta-information about the program behavior and data layout which can make for much more efficient code.

All of them, however, rely on the programmer giving correct information to the compiler. If you do not, your code may break and it will be your fault!

Platform

Knowledge of the platform, its features and capabilities, is crucial. In areas like power management, inter-processor communication, coprocessor hardware and floating point, it is important to know what capabilities your hardware has and to exploit them well.

* Chris has worked at ARM for over 16 years, currently Director of Technical Marketing. For 15 years, he was responsible for ARM’s customer training activity – delivering over 200 training courses every year to ARM’s customers and end users all over the world. He is also a regular speaker at conferences and industry events.