Efficient C Code for ARM Devices
The area of branches, though, is one in which it is worth understanding how the underlying hardware behaves. In processors which support even simple branch prediction, the behavior of branches is crucial to execution performance. On an ARM11, for instance, a predicted branch can take as little as zero cycles, while a mis-predicted branch can take as many as seven.
Understanding how that branch prediction works can allow you to structure your software to work with it rather than against it. It is also worth knowing that certain kinds of branch are inherently non-predictable. On an ARM processor, these include branches which change state, instructions which set the condition codes at the same time as modifying the PC, and instructions which change mode.
Most ARM prediction schemes tend to predict that backward branches will be taken and forward branches will not (more modern schemes will start with this default before rapidly training themselves on the way in which the code actually behaves). This leads to the easy conclusion that loops should have the test and conditional branch at the bottom of the loop rather than at the top. In general ‘for’ and ‘do’ loops are examples of the former, ‘while’ loops are an example of the latter.
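As a minimal sketch of the loop-structure point, the two functions below sum an array, once with the test at the top and once with the test at the bottom. The function names are hypothetical, and an optimizing compiler may well turn the first form into the second on its own, so treat this as an illustration of the shape rather than a guaranteed speed-up.

```c
#include <stddef.h>
#include <stdint.h>

/* Test at the top: in a straightforward translation, each iteration
   runs a forward conditional branch (the exit test) plus a backward
   unconditional branch to return to the test. */
uint32_t sum_while(const uint32_t *data, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;
    while (i < n) {
        total += data[i];
        i++;
    }
    return total;
}

/* Test at the bottom: the loop closes with a single backward
   conditional branch, which a static predictor assumes is taken.
   Assumption: the caller guarantees n >= 1, because the body always
   runs at least once. */
uint32_t sum_do(const uint32_t *data, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;
    do {
        total += data[i];
        i++;
    } while (i < n);
    return total;
}
```

The do form is only a drop-in replacement when the loop is known to execute at least once; otherwise it needs a guarding test in front of it.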
By the same logic, a sequence of ‘if’ statements will be more efficient if the most commonly taken block is placed at the top, with less common blocks placed lower down.
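A small sketch of that ordering follows. The packet types and the handler are hypothetical, and the claim that data packets dominate is an assumption made purely for illustration; in real code the ordering should follow measured frequencies.

```c
/* Hypothetical packet categories, for illustration only. */
enum packet_kind { PKT_DATA, PKT_CONTROL, PKT_ERROR };

/* Returns a code identifying which block ran. The common case is
   tested first, so most calls fall straight into the first block
   without taking a forward branch past it. */
int handle_packet(enum packet_kind kind)
{
    if (kind == PKT_DATA) {            /* assumed most common */
        return 1;
    } else if (kind == PKT_CONTROL) {  /* assumed less common */
        return 2;
    } else {                           /* assumed rare */
        return -1;
    }
}
```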
On most ARM systems, the branch range should also be considered. In Thumb-2, for instance, the maximum branch distance of a single branch instruction is 16MB. A branch longer than this will need extra code to be inserted, either by the compiler or the linker, to extend the branch. This will cost time and space.
Recall that ARM processors do not, in general, have division hardware. This means that most integer divide operations will call a runtime library function and will consume several tens or hundreds of cycles. If the divisor is a constant, the compiler can use much more efficient algorithms and avoid using the library.
Remember that a modulo operation is really a division, so test-and-reset is often quicker than using a modulo to implement a cyclic counter. The following examples illustrate this.
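A minimal sketch of the two ways to advance a cyclic counter is given here. The function names are hypothetical. Both return the same value provided the incoming index is already within range (i < size); outside that range they differ.

```c
#include <stdint.h>

/* Modulo version: with a divisor that is not a compile-time constant,
   a core without divide hardware has to call a library divide routine
   on every step. */
uint32_t next_index_modulo(uint32_t i, uint32_t size)
{
    return (i + 1u) % size;
}

/* Test-and-reset version: an increment, a compare, and a conditional
   reset. No division is involved.
   Assumption: i < size on entry. */
uint32_t next_index_reset(uint32_t i, uint32_t size)
{
    i++;
    if (i >= size) {
        i = 0;
    }
    return i;
}
```

If the size is a compile-time constant, the compiler can often avoid the library call for the modulo form as well, so the difference matters most when the size is only known at run time.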
Embedded systems generally have very constrained memory systems. Many do not have cache; those that do often only have a very small amount. It is important to know the memory and cache architecture of your system and develop code which is sensitive to it.
For instance, working with a dataset which is significantly larger than the available cache can actually make your code run more slowly due to data being evicted before it can be re-used. Using cache-friendly access patterns and making use of pre-load instructions can make a big difference.
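The sketch below illustrates what a cache-friendly access pattern and a pre-load hint can look like. The array dimensions, function names and prefetch distance are hypothetical. `__builtin_prefetch` is a GCC and Clang builtin, used here on the assumption that one of those compilers is in use; other toolchains spell the hint differently, and the best distance can only be found by measurement.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS 64u
#define COLS 64u
#define PREFETCH_AHEAD 16u  /* hypothetical distance in elements; tune by measurement */

/* Row by row: consecutive accesses touch consecutive addresses, so
   every word of a fetched cache line is used before the line can be
   evicted. */
uint32_t sum_by_rows(const uint32_t *m)
{
    uint32_t total = 0;
    for (size_t r = 0; r < ROWS; r++) {
        for (size_t c = 0; c < COLS; c++) {
            total += m[r * COLS + c];
        }
    }
    return total;
}

/* Column by column: same result, but each access jumps COLS elements
   ahead, so a small cache may evict a line before its neighbouring
   words are used. */
uint32_t sum_by_columns(const uint32_t *m)
{
    uint32_t total = 0;
    for (size_t c = 0; c < COLS; c++) {
        for (size_t r = 0; r < ROWS; r++) {
            total += m[r * COLS + c];
        }
    }
    return total;
}

/* Linear walk that hints at data needed a few iterations from now.
   The bounds check keeps the hinted address inside the array. */
uint32_t sum_with_preload(const uint32_t *data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&data[i + PREFETCH_AHEAD]);
        }
        total += data[i];
    }
    return total;
}
```

All three functions compute the same sum; only the order of memory accesses, and therefore the cache behavior, differs.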
When using virtual memory, be aware of the size and limitations of the Translation Lookaside Buffers (TLB) in the Memory Management Unit (MMU). These are used to cache details of virtual-to-physical address translations. In an ARM system, a 64-entry TLB can cache sufficient translations to cover anywhere between 256KB and 4MB, depending on the translation granularity.
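The range quoted above is consistent with each TLB entry mapping a single page of either 4KB or 64KB. Those page sizes are a reading of the figures rather than something the text states, and the helper below is a hypothetical illustration of the arithmetic only.

```c
#include <stdint.h>

/* Memory reachable without a TLB miss, assuming every entry holds one
   translation for a page of the given size. */
uint32_t tlb_coverage_bytes(uint32_t entries, uint32_t page_bytes)
{
    return entries * page_bytes;
}
```

With 64 entries, 4KB pages give 64 x 4KB = 256KB of coverage and 64KB pages give 64 x 64KB = 4MB, which matches the two ends of the range.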
In some applications, this can be a severe limitation and careful data allocation in memory can make a big difference.
Almost all ARM processors can access aligned data more efficiently than unaligned data. Sometimes it is advantageous to pack data structures to remove unnecessary padding and therefore reduce data memory size. This can result in unaligned structure fields which can be slow to access.
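The trade-off can be sketched with two versions of the same structure. The structure and function names are hypothetical, and `__attribute__((packed))` is the GCC and Clang spelling; other compilers use a different keyword for the same purpose. The sizes in the comments are what a typical 32-bit or 64-bit ABI produces, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler typically inserts three padding bytes
   after 'flag' so that 'value' starts on a 4-byte boundary. */
struct sample_natural {
    uint8_t  flag;
    uint32_t value;
};

/* Packed layout: no padding, so 'value' starts at offset 1 and is
   unaligned. The structure is smaller but the field may be slower to
   access. */
struct sample_packed {
    uint8_t  flag;
    uint32_t value;
} __attribute__((packed));

/* Reading through the packed type lets the compiler generate whatever
   access sequence the target needs for an unaligned field. */
uint32_t read_packed_value(const struct sample_packed *s)
{
    return s->value;
}
```

Always access a packed field through the packed type, as above. Taking a plain `uint32_t *` to the field hides the misalignment from the compiler.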
Know your compiler well and know what features it supports. A few examples of features which can help the compiler in optimization are ‘pure’, ‘restrict’ and ‘packed’. These can allow the programmer to convey meta-information about the program behavior and data layout which can make for much more efficient code.
All of them, however, rely on the programmer giving correct information to the compiler. If you do not, your code may break and it will be your fault!
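As a hedged illustration of two of these promises, the sketch below uses the GCC and Clang spelling `__attribute__((pure))` and the C99 keyword `restrict`. The function names are hypothetical, and other compilers may spell the equivalent features differently.

```c
#include <stddef.h>
#include <stdint.h>

/* 'pure' promise: the result depends only on the arguments and the
   memory they point to, and the function has no side effects. The
   compiler may then merge or remove repeated calls. */
__attribute__((pure))
uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}

/* 'restrict' promise: dst and src never overlap, so the compiler need
   not reload src after each store to dst. If the caller passes
   overlapping buffers, the behavior is undefined. */
void scale(uint32_t *restrict dst, const uint32_t *restrict src,
           size_t n, uint32_t k)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}
```

Neither promise is checked by the compiler. A function marked pure that writes to a global, or a call to `scale` with overlapping buffers, can compile cleanly and still misbehave.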
Knowledge of the platform, its features and capabilities, is crucial. In areas like power management, inter-processor communication, coprocessor hardware, floating point, etc., it is important to know what capabilities your hardware has and to exploit them well.