
This tutorial explains how you can maximize the efficiency of your applications by writing your C source in such a way as to make the compiler generate fast and compact machine code. It gives advice on which command line switches to use with the C compiler to optimize the resulting program, and shows you how to identify and eliminate unused sections of code. It describes how to compile and link C code for deeply embedded applications, using the components of the standalone C run time system.

Writing Efficient C for the ARM

The ARM C compiler can generate very good machine code if you present it with the right sort of input. In this section we explain:

☞What the C compiler compiles well and why.

☞How to help the C compiler generate efficient machine code.

Some of the rules presented here are quite general; some are quite specific to the ARM or the ARM C compiler. It should be clear from context which rules are portable.

When writing C, there are a number of considerations which, if handled intelligently, will result in more compact and efficient ARM code:

☞The way functions are written, their size, and the way in which they call each other.

☞The distribution of variables within functions, and their scoping. This affects the register allocation of variables, and the frequency with which they are spilled to memory.

☞The use of alternatives to the switch () statement

Under certain circumstances, reductions in code size can be achieved by avoiding the use of switch ().

Function design

Function call overhead on the ARM is small, and is often in proportion to the work done by the called function. Several features contribute to this:

☞The minimal ARM call-return sequence is BL... MOV pc, lr, which is extremely economical.

☞The multiple load and store instructions, STM and LDM, which reduce the cost of entry to and exit from functions that must create a stack frame and/or save registers.

☞The ARM Procedure Call Standard, which has been carefully designed to allow two very important types of function call to be optimized so that the entry and exit overheads are minimal.

In general, it is a good idea to keep functions small, because this will help keep function calling overheads low.

This section describes the conditions under which function call overhead is minimized, how small functions help the ARM C compiler, and explains how to assist the C compiler when functions cannot be kept small.

Leaf functions

In 'typical' programs, about half of all function calls made are to leaf functions (a leaf function is one that makes no calls from within its body).

Often, a leaf function is rather simple. On the ARM, if it is simple enough to compile using just five registers (a1-a4 and ip), it will carry no function entry or exit overhead. A surprising proportion of useful leaf functions can be compiled within this constraint.

Once registers have to be saved, it is efficient to save them using STM. In fact, the more you can save at one go the better. In a leaf function, all the registers which need to be saved will be saved by a single STMFD sp!, {regs, lr} on entry and a matching LDMFD sp!,{regs, pc} on exit.

In general, the cost of pushing some registers on entry and popping them on exit is very small compared to the cost of the useful work done by a leaf function that is complicated enough to need more than five registers.

Overall, you should expect a leaf function to carry virtually no function entry and exit overhead, and at worst, a small overhead, most likely in proportion to the useful work done by it.
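As an illustration, a leaf function of the following shape (an invented example, not taken from the toolkit) has few enough live values to fit in the five scratch registers, so it typically compiles with no entry or exit code:

```c
/* Invented example of a simple leaf function: it calls nothing, and
 * its live values (p, n, sum) fit in the five scratch registers
 * (a1-a4 and ip), so no registers need to be saved on entry. */
unsigned byte_sum(const unsigned char *p, unsigned n)
{
    unsigned sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}
```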

Veneer Functions (Simple Tail Continuation Functions)

Historically, abstraction veneers have been relatively expensive. The kind of veneer function which merely changes the types of its arguments, or which calls a low-level implementation with an extra argument (say), has often cost much more in entry and exit overhead than it was worth in useful work.

If a function ends with a call to another function, that call can be converted to a tail continuation. In functions that don’t need to save any registers, the effect can be dramatic.

Consider, for example:

extern void *__sys_alloc(unsigned type, unsigned n_words);

#define NOTGCable  0x80000000
#define NOTMovable 0x40000000

void *malloc(unsigned n_bytes)
{
    return __sys_alloc(NOTGCable + NOTMovable, n_bytes / 4);
}


From this input, armcc generates:


MOV a2,a1,LSR #2

MOV a1,#&c0000000

B |__sys_alloc|

Here there is no function entry or exit overhead, and the function return has disappeared entirely —return is direct from __sys_alloc to malloc's caller. In this case, the basic call-return cost for the function pair has been reduced from:

BL + BL + MOV pc,lr + MOV pc,lr

to:

BL + B + MOV pc,lr

This works out as a saving of 25%.

More complicated functions in which the only function calls are immediately before a return, collapse equally well.

An artificial example is:

extern int f1(int), f2(int, int);

int f(int a, int b)
{
    if (b == 0)
        return a;
    else if (b < 0)
        return f2(a, -b);
    return f2(b, a); /* argument order swapped */
}
armcc generates the following, extremely efficient code (the version of armcc supplied with your release may produce slightly different output):

f CMP a2,#0

MOVEQS pc,lr

RSBLT a2,a2,#0

BLT f2

MOV a3,a1

MOV a1,a2

MOV a2,a3

B f2
Function arguments and argument passing

The final aspect of function design which influences low-level efficiency is argument passing.

Under the ARM Procedure Call Standard, up to four argument words can be passed to a function in registers.

Functions of up to four integral (not floating point) arguments are particularly efficient and incur very little overhead beyond that required to compute the argument expressions themselves (there may be a little register juggling in the called function, depending on its complexity).

If more arguments are needed, then the 5th, 6th, etc., words will be passed on the stack. This incurs the cost of an STR in the calling function and an LDR in the called function for each argument word beyond four.
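To make the cost concrete, here is a small invented pair of functions: in the first, all four integral arguments travel in registers under the APCS; in the second, the fifth argument is passed on the stack, costing an STR in the caller and an LDR in the callee:

```c
/* Invented illustration of APCS argument passing.  In sum4 all four
 * arguments arrive in registers a1-a4; in sum5 the fifth argument
 * ("e") is stored to the stack by the caller and loaded by the
 * callee. */
int sum4(int a, int b, int c, int d)
{
    return a + b + c + d;
}

int sum5(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}
```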

To minimize argument passing:

☞Try to ensure that small functions take four or fewer arguments. These will compile particularly well.

☞If a function needs many arguments, try to ensure that it does a significant amount of work on every call, so that the cost of passing arguments is amortized.

☞Factor out read-mostly global control state and make this static. If it has to be passed as an argument, wrap it up in a struct and pass a pointer to it.

Such control state is:

☞Logically global to the compilation unit or program.

☞Read-mostly: often read-only except in response to user input, and for almost all functions it can be changed neither by them nor by any function they call.

☞Referenced throughout the program, but relatively rarely in any given function.

Frequent references inside a function should be replaced by references to a local, non-static copy.

Note Don't confuse control state with computational arguments, the values of which differ on every call.

☞Collect related data into a struct. Decide whether to pass pointers or struct values based on the use of each struct in the called function:

☞If few fields are read or written, passing a pointer is best.

☞The cost of passing a struct via the stack is typically a share in an LDM-STM pair for each word of the struct. This can be better than passing a pointer if on average each field is used at least once and the register pressure in the function is high enough to force a pointer to be repeatedly re-loaded.

As a general rule, you cannot lose much efficiency if you pass pointers to struct rather than struct values. To gain efficiency by passing struct values rather than pointers usually requires careful study of a function's machine code.
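A minimal sketch of the control-state advice above (the struct and its field names are invented for illustration):

```c
/* Invented sketch: read-mostly control state gathered into one
 * struct.  A function that needs several of these flags receives a
 * single pointer instead of several scalar arguments. */
typedef struct Options {
    int verbose;   /* set from user input, read-mostly thereafter */
    int width;     /* output width */
} Options;

static Options opts = { 0, 80 };   /* logically global control state */

static int fits_width(const Options *o, int len)
{
    return len <= o->width;
}
```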

Register allocation and how to help it

It is well known that register allocation is critical to the efficiency of code compiled for RISC processors. It is particularly critical for the ARM, which has only 16 registers rather than the 'traditional' 32.

The C compiler has a highly efficient register allocator which operates on complete functions and which tries to allocate the most frequently used variables to registers.

It produces very good results unless the demand for registers seriously outstrips supply.

As code generation proceeds for a function, new variables are created for expression temporaries. These are never reused in later expressions and cannot be spilled to memory.

Usually, this causes no problems. However, a particularly pathological expression could, in principle, occupy most of the allocatable registers, forcing almost all program variables to memory.

Because the number of registers required to evaluate an expression is a logarithmic function of the number of terms in it, it takes an expression of more than 32 terms to threaten the use of any variable register.

As a general rule, avoid very large expressions (more than 30 terms).

Static and extern variables

A variable in a register costs nothing to access: it is simply there, waiting to be used. A local (auto) variable is addressed via the sp register, which is always available for the purpose.

A static variable, on the other hand, can only be accessed after the static base for the compilation unit has been loaded. So the first such use in a function always costs two LDR instructions, or an LDR and an STR. However, if there are many uses of static variables within a function, there is a good chance that the static base will become a global common subexpression (CSE) and that, overall, access to static variables will be no more expensive than access to auto variables on the stack.

Extern variables are fundamentally more expensive: each has its own base pointer. Thus each access to an extern is likely to cost two LDR instructions or an LDR and an STR.

It is much less likely that a pointer to an extern will become a global CSE—and almost certain that there cannot be several such CSEs—so if a function accesses lots of extern variables, it is bound to incur significant access costs.

A further cost occurs when a function is called: the compiler has to assume—in the absence of inter-procedural data flow analysis—that any non- const static or extern variable could be side-effected by the call. This severely limits the scope across which the value of a static or extern variable can be held in a register.

Sometimes a programmer can do better than a compiler could, even a compiler that did inter-procedural data flow analysis. An example in C is given by the standard streams: stdin, stdout and stderr. These are not pointers to const objects (the underlying FILE structs are modified by I/O operations), nor are they necessarily const pointers (they may be assignable in some implementations). Nonetheless, a function can almost always safely slave a reference to a stream in a local FILE * variable.

It is a common practice to mimic the standard streams in applications.

Consider, for example, the shape of a typical non-leaf printing function, first in its original form and then rewritten with the stream slaved in a local variable:

Before:

extern FILE *out;    /* the output stream */

void print_it(Thing *t)
{
    fprintf(out, ...);
    print_1(t->first);
    fprintf(out, ...);
    print_2(t->second);
    fprintf(out, ...);
    ...
}

After:

extern FILE *out;    /* the output stream */

void print_it(Thing *t)
{
    FILE *f = out;
    fprintf(f, ...);
    print_1(t->first);
    fprintf(f, ...);
    print_2(t->second);
    fprintf(f, ...);
    ...
}

In the first form, out has to be re-computed or re-loaded after each call to print_... (and after each call to fprintf). In the second form, f can be held in a register throughout the function (and probably will be).

Uniform application of this transformation to the disassembly module of the ARM C compiler saved more than 5% of its code space.

In general, it is difficult and potentially dangerous to assert that no function you call (or any functions they in turn call) can affect the value of any static or extern variables of which you currently have local copies.

However, the rewards can be considerable so it is usually worthwhile to work out at the program design stages which global variables are slavable locally and which are not.

Trying to retrofit this improvement to existing code is usually hazardous, except in very simple cases like the above.

The switch () statement

The switch () statement can be used:

☞To transfer control to one of several destinations—effectively implementing an indexed transfer of control.

☞To generate a value related to the controlling expression—in effect computing an in-line function of the controlling expression.

In the first role, switch () is hard to improve upon: the ARM C compiler does a good job of deciding when to compile jump tables and when to compile trees of if-then-else’s. It is rare for a programmer to be able to improve upon this by writing if-then-else trees explicitly in the source.

In the second role, however, use of switch () is often mistaken. You can probably do better by being more aware of what is being computed and how.

The example below is a simplified version of a function taken from an early version of the ARM C compiler’s disassembly module. Its purpose is to map a 4-bit field of an ARM instruction to a 2-character condition code mnemonic. We will use it to demonstrate:

☞The cost of implementing an in-line function using switch ().

☞How to implement the same function more economically.

Here is the source:

char *cond_of_instr(unsigned instr)
{
    char *s;
    switch (instr & 0xf0000000)
    {
    case 0x00000000: s = "EQ"; break;
    case 0x10000000: s = "NE"; break;
    ...
    case 0xF0000000: s = "NV"; break;
    }
    return s;
}

The compiler handles this code fragment well, generating 276 bytes of code and string literals.

But we could do better. If performance is not critical (and in disassembly it never is) we could look up the code in a table, using something like:

char *cond_of_instr(unsigned instr)
{
    static struct {char name[3]; unsigned code;}
    conds[] = {
        "EQ", 0x00000000,
        "NE", 0x10000000,
        ...
        "NV", 0xf0000000,
    };
    int j;
    for (j = 0; j < sizeof(conds)/sizeof(conds[0]); ++j)
        if ((instr & 0xf0000000) == conds[j].code)
            return conds[j].name;
    return "";
}

This fragment compiles to 68 bytes of code and 128 bytes of table data. Already this is a 30% improvement on the switch () case, but this schema has other advantages: it copes well with a random code-to-string mapping and, if the mapping is not random, admits further optimization.

Another advantage of table lookup is that it is possible to share the same table between a disassembler and an assembler: the assembler looks up the mnemonic to obtain the code value, rather than the code value to obtain the mnemonic. Where performance is not critical, the symmetric property of lookup tables can sometimes be exploited to yield significant space savings.

Finally, by exploiting the denseness of the indexing and the uniformity of the returned value it is possible to do better again, both in size and performance, by direct indexing:

char *cond_of_instr(unsigned instr)
{
    return "EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
}


This expression of the problem causes a miserly 16 bytes of code and 64 bytes of string literal to be generated and is probably close to what an experienced assembly language programmer would naturally write if asked to code this function.

This was the solution finally adopted in the ARM C compiler's disassembly module.

The uniform application of this transformation to the disassembly module of the ARM C compiler saved between 5% and 10% of its code space.

You should therefore think hard before using switch () to compute an in-line function, especially if code size is an important consideration. Although switch () compiles to high-performance code, table lookup will often be smaller; where the function's domain is dense (or piecewise dense) direct indexing into a table will often be both faster and smaller.
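For reference, a self-contained, runnable rendition of the direct-indexing scheme might look like the following (using const char * and the standard ARM condition-code ordering):

```c
/* Runnable sketch of direct indexing: each two-character mnemonic is
 * padded with NULs to 4 bytes, so (instr >> 28) * 4 indexes straight
 * into the string.  The condition order is the standard ARM one. */
const char *cond_name(unsigned instr)
{
    return "EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0"
           "HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV"
           + (instr >> 28) * 4;
}
```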

Improving Code Size and Performance

Once you have optimized your source code, you may be able to obtain further performance benefits by using the appropriate command line options when you come to compile and link your program.

This section gives advice on which options to use and which to avoid, and explains how to identify and remove unused functions from your C source.

Compiler options

The ARM C compiler has a number of command line options which control the way in which code is generated. This section describes those options in the ARM Software Development Toolkit which can affect the size and/or the performance of the generated code.


–g

-g severely impacts the size and performance of generated code, since it turns off all compiler optimizations. You should use it only when actually debugging your code; it should never be enabled for a release build.

–Ospace -Otime

These options are complementary:

-Ospace optimizes for code size at the expense of performance; -Otime optimizes for performance at the expense of size.

They can be used together on different parts of a build. For example, -Otime could be used on time critical source files, with -Ospace being used on the remainder.

If neither is specified, the compiler will attempt a balance between optimizing for code size and optimizing for performance.
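A mixed build of the kind described above might look like the following sketch (the file names are invented for illustration):

```shell
# Sketch of a mixed build -- file names are invented for illustration.
armcc -c -Otime inner_loop.c    # time-critical code: optimize for speed
armcc -c -Ospace ui.c           # everything else: optimize for size
armlink -o app inner_loop.o ui.o armlib.32l
```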


–zpj0

This option disables cross-jump optimization. Cross-jump optimization is a space-saving optimization whereby common sections of code at the end of each element of a switch () statement are identified and commoned together, each occurrence of the code section being replaced with a branch to the common code section.

However, this optimization can lead to extra branches being executed which may decrease performance, especially in interpreter-like applications which typically have large switch () statements.

Use the -zpj0 option to disable this optimization if you have a time-critical switch () statement.

Alternatively, you can use:

#pragma nooptimise_crossjump

before the function containing the switch (), and:

#pragma optimise_crossjump

after the function, to re-enable the optimization for the remainder of the functions in the file.

–apcs /nofp

By default, armcc generates code which uses a dedicated frame pointer register.

This register holds a pointer to the stack frame and is used by the generated code to access a function’s arguments.

A dedicated frame pointer can make the code slightly larger.

By specifying -apcs /nofp on the command line, you can force armcc to generate code which does not use a frame pointer, but which accesses the function's arguments via offsets from the stack pointer instead.

Note tcc never uses a frame pointer, so this option does not apply when compiling Thumb code.

–apcs /noswst

By default, armcc generates code which checks that the stack has not overflowed at the head of each function. This code can contribute several percent to the code size, so it may be worthwhile disabling this option with -apcs /noswst.

Be careful, however, to ensure that your program’s stack is not going to overflow, or that you have an alternative stack checking mechanism such as an MMU-based check.

Note tcc has stack checking disabled by default.


–ARM7T

This option applies to armcc only.

By default, armcc generates code which is suitable for running on processors that implement ARM Architecture 3 (e.g. ARM6, ARM7). If you know that the code is going to be run on a processor with halfword support, you can use the -ARM7T option to instruct the compiler to use the ARM Architecture 4 halfword and signed byte instructions. This can result in significantly improved code density and performance when accessing 16-bit data.


–pcc

The code generated by the compiler can be slightly larger when compiling with the -pcc switch.

This is because the ANSI standard places extra restrictions on the C language which the compiler can take advantage of when compiling in ANSI mode, but not in -pcc mode.

Note If your code will compile in ANSI mode, do not use the -pcc switch.

Identifying and eliminating unused code sections

During program development it can happen that functions which were used at an earlier stage of development are no longer called, and may therefore be eliminated.

The compiler and linker provide facilities to enable you to identify and eliminate such functions.

This section explains how these facilities work, taking as an example the unused.c program in directory examples/unused.

Stage 1—allocating a code segment to each function

The first stage in eliminating unused functions is to compile all your sources with the -zo option.

This instructs the compiler to place each function in a separate code segment:

armcc -c -zo unused.c

Stage 2—removing unreferenced segments

The second stage is to instruct the linker to remove those segments which are unreferenced by code in any other segment. This is done with the -remove command line option. You can also specify -info unused to get the linker to tell you which code segments are unused:

armlink -info unused -remove -o unused unused.o armlib.32l

Note If armlib.32l is not in the current directory, you will need to specify the full pathname.

In this instance, the linker will produce the following output:

ARM Linker: Unreferenced AREA unused.o(C$$code) (file unused.o) omitted from output.

Stage 3—identifying unused functions

The linker has removed the unused function from the output file.

If you wish to find the name of this function so that you can remove it from your source, instruct the compiler to generate assembler output with the -S option.

armcc -c -zo -S unused.c

Edit the assembler output file unused.s and search for the AREA name which was given in the linker output (in this case C$$code). This will give the name of the unused function, as shown in the following extract from the file:


|x$codeseg| DATA

unused_function <<<< Name of unused function

ADD a1,pc,#L000008-.-8

B _printf


Choosing a Division Implementation

In some applications it is important that a general-purpose divide executes as quickly as possible.

This section describes:

☞How to select a divide implementation for the C library.

☞How to use the fast divide routines from the examples directory.

☞A comparison of the speeds of the divide routines.

Divide implementations in the C library

The C library offers a choice between two divide variants, 'unrolled' and 'small'. This choice is basically a speed vs. space tradeoff.



In the C library build directory (e.g. directory semi for the semi-hosted library), the options file is used to select variants of the C library.

The supplied file contains the following:

memcpy = fast

divide = unrolled

stdfile_redirection = off

fp_type = module

stack = contiguous

backtrace = off

thumb = false

Unrolled divide

The default divide implementation 'unrolled' is fast, but occupies a total of 416 bytes. This is an appropriate default for most Toolkit users who are interested in obtaining maximum performance.

Small divide

Alternatively you can change this file to select the 'small' divide, which is more compact at 136 bytes (20 instructions for signed plus 14 instructions for unsigned) but somewhat slower, as there is considerable looping overhead.

Signed division example timings

Cycle times are F-cycles on a cached ARM6 series processor, excluding call and return. If you have a specific requirement, you can modify the supplied routines to suit your application.


[Table: Signed division example timings, comparing unrolled and rolled cycle counts]

For instance, you could write an unrolled-2-times version or you could combine the signed and unsigned versions to save more space.
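To see where the looping overhead of the 'small' variant comes from, here is an illustrative C rendition of a rolled (one-quotient-bit-per-iteration) unsigned divide. This is a sketch of the technique, not the library routine itself, and it assumes a non-zero divisor:

```c
/* Illustrative rolled divide: a restoring shift-and-subtract loop
 * producing one quotient bit per iteration (32 iterations in all),
 * which is exactly the looping overhead the unrolled version avoids.
 * Assumes den != 0. */
unsigned rolled_udiv(unsigned num, unsigned den, unsigned *rem)
{
    unsigned quot = 0, r = 0;
    int i;
    for (i = 31; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1u);  /* bring down next bit */
        quot <<= 1;
        if (r >= den) {                    /* subtract when it fits */
            r -= den;
            quot |= 1u;
        }
    }
    *rem = r;
    return quot;
}
```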

Divide routines for real-time applications

The C library also contains two fully unwound divide routines. These have been carefully implemented for maximum speed.

They are useful when a guaranteed performance is required, e.g. for real-time feedback control routines or DSP. The long maximum division time of the standard C library functions may make them unsuitable for some real-time applications.

The supplied routines implement signed division only; it would be possible to modify them for unsigned division if required.

The routines are described by the standard header file "stdlib.h" which contains the following declarations:

typedef struct __sdiv32by16 { int quot, rem; } __sdiv32by16;

/* used int so that values return in regs, although 16-bit */

typedef struct __sdiv64by32 { int rem, quot; } __sdiv64by32;

__value_in_regs extern __sdiv32by16 __rt_sdiv32by16(
    int /*numer*/,
    short int /*denom*/);
/*
 * Signed div: (16-bit quot), (16-bit rem) = (32-bit) / (16-bit)
 */

__value_in_regs extern __sdiv64by32 __rt_sdiv64by32(
    int /*numer_h*/, unsigned int /*numer_l*/,
    int /*denom*/);
/*
 * Signed div: (32-bit quot), (32-bit rem) = (64-bit) / (32-bit)
 */

These routines both have guaranteed performance:

108 cycles for __rt_sdiv64by32 (excluding call & return)

44 cycles for __rt_sdiv32by16

Currently the C compiler does not automatically use these routines, as the default routines have early-termination which yields good performance for small values.

In order to use these divide routines, you should explicitly call the relevant function. Use of the __rt_sdiv64by32 function is complicated by the fact that the C compiler does not currently support 64-bit integers, so you have to split a 64-bit value between two 32-bit variables.
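With a compiler that does have a 64-bit integer type, the required split can be illustrated as follows (split64 is an invented helper for illustration, not part of the library):

```c
#include <stdint.h>

/* Invented helper showing how a 64-bit numerator is split into the
 * high word (numer_h) and low word (numer_l) that __rt_sdiv64by32
 * expects as its first two arguments. */
void split64(int64_t n, int32_t *hi, uint32_t *lo)
{
    *hi = (int32_t)(n >> 32);            /* signed high 32 bits  */
    *lo = (uint32_t)(n & 0xFFFFFFFFu);   /* unsigned low 32 bits */
}
```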

The following example program shows how to use these routines. This is available as dspdiv.c in directory examples/progc. Once the program has been compiled and linked, type the following line to calculate 1000/3:

armsd dspdiv 1000 3

dspdiv.c source code





#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc != 3)
        puts("needs 2 numeric args");
    else
    {
        __sdiv32by16 result;
        result = __rt_sdiv32by16(atoi(argv[1]), atoi(argv[2]));
        printf("quotient  %d\n", result.quot);
        printf("remainder %d\n", result.rem);
    }
    return 0;
}
The standard division routine used by the C library can be selected by using the options file in the C library build area.

If the supplied variants are not suitable, you can write your own.

For real-time applications, the maximum division time must be as short as possible to ensure that the calculation can complete in time. In this case, the functions __rt_sdiv64by32 and __rt_sdiv32by16 are useful.

Using the C Library in Embedded Applications

This section discusses the Toolkit’s standalone runtime support system for C programming in deeply embedded applications. In particular it explains:

☞What rtstand.s supports

☞How to make use of it

☞How to extend it by adding extra functionality from the C library

☞The size of the standalone run time library

The standalone runtime system

The semi-hosted ANSI C library provides all the standard C library facilities and is thus quite large. This is acceptable when running under emulation with plenty of memory, or when running on development hardware with access to a real debugging channel. However, in a deeply embedded application many of its facilities—file access functions or time and date functions, for example—may no longer be relevant, and its size may be prohibitive if memory is severely limited.

For deeply embedded applications a minimal C runtime system is needed which takes up as little memory as possible, is easily portable to the target hardware, and only supports those functions that are required for such an application.

The ARM Software Development Toolkit comes with a minimal runtime system in source form.

The ’behind the scenes’ jobs which it performs are:

☞Setting up the initial stack and heap, and calling main ().

☞Program termination—either automatic (returning from main ()) or forced (explicitly calling __rt_exit).

☞Simple heap allocation (__rt_alloc).

☞Stack limits checking.

☞setjmp and longjmp support.

☞Divide and remainder functions (calls to which can be generated by armcc).

☞High level error handler support (__err_handler).

☞A means to detect whether optional floating point support is available (__rt_fpavailable).

The source code rtstand.s documents the options which you may want to change for your target. These are not covered here. The header file rtstand.h documents the functions which rtstand.s provides to the C programmer.

A Thumb version of this file is located in thumb/rtstand.s.

Using the standalone runtime system

In this section the main features of the standalone runtime system are demonstrated by example programs.

Before attempting any of the demonstrations below, proceed as follows:

☞Create a working directory, and make this your current directory.

☞Copy the contents of directory examples/clstand into your working directory.

☞Copy the files fpe*.o from directory cl/fpe into your working directory.

You are now ready to experiment with the C standalone runtime system.

In the examples below, the following options are passed to armcc, armasm, and in the first case armsd:

-li specifies that the target is a little endian ARM.

-apcs 3/32bit/hardfp specifies that the 32-bit variant of APCS 3 should be used, and that ARM FPA instructions are used for floating point operations. For armasm this is used to set the built-in variable {CONFIG} to 32.

These arguments can be changed if the target hardware differs from this configuration, or omitted if your tools have been configured to have these options by default.

You may find it useful to study the sources to rtstand.s, errtest.c and memtest.c while working through the example programs.

A simple program

Let us first compile the example program errtest.c, and assemble the standalone runtime system. These can then be linked together to provide an executable image, errtest:

armcc       -c    errtest.c -li -apcs 3/32bit/hardfp

armasm   rtstand.s -o rtstand.o -li -apcs 3/32bit

armlink    -o    errtest errtest.o rtstand.o

We can then execute this image using armsd as follows:

> armsd -li -size 512K errtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992] ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file errtest

armsd: go

(the floating point instruction-set is not available)

Using integer arithmetic ...

10000 / 0X0000000A = 0X000003E8

10000 / 0X00000009 = 0X00000457

10000 / 0X00000008 = 0X000004E2

10000 / 0X00000007 = 0X00000594

10000 / 0X00000006 = 0X00000682

10000 / 0X00000005 = 0X000007D0

10000 / 0X00000004 = 0X000009C4

10000 / 0X00000003 = 0X00000D05

10000 / 0X00000002 = 0X00001388


10000 / 0X00000001 = 0X00002710

Program terminated normally at PC = 0x00008550

0x00008550: 0xef000011 .... : > swi 0x11

armsd: quit


The > prompt is the Operating System prompt, and the armsd: prompt is output by armsd to indicate that user input is required.

Already several of the standalone runtime system's facilities have been demonstrated:

☞The C stack and heap have been set up.

☞main () has clearly been called.

☞The fact that floating point support is not available has been detected.

☞The integer division functions have been used by the compiler.

☞Program termination was successful.

Error handling

The same program, errtest, can also be used to demonstrate error handling, by recompiling errtest.c and predefining the DIVIDE_ERROR macro:

armcc -c errtest.c -li -apcs 3/32bit/hardfp -DDIVIDE_ERROR

armlink -o errtest errtest.o rtstand.o

Again, we can now execute this image under the armsd as follows:

> armsd -li -size 512K errtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992] ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file errtest

armsd: go

(the floating point instruction-set is not available)

Using integer arithmetic ...

10000 / 0X0000000A = 0X000003E8

10000 / 0X00000009 = 0X00000457

10000 / 0X00000008 = 0X000004E2


10000 / 0X00000007 = 0X00000594

10000 / 0X00000006 = 0X00000682

10000 / 0X00000005 = 0X000007D0

10000 / 0X00000004 = 0X000009C4

10000 / 0X00000003 = 0X00000D05

10000 / 0X00000002 = 0X00001388

10000 / 0X00000001 = 0X00002710

10000 / 0X00000000 = __err_handler called:

code = 0X00000001: divide by 0

caller's pc = 0X00008304


run time error: divide by 0

program terminated

Program terminated normally at PC = 0x0000854c

0x0000854c: 0xef000011 .... : > swi 0x11

armsd: quit



This time an integer division by zero has been detected by the standalone runtime system, which called __err_handler(). __err_handler() produced the first set of error messages in the output above.

Control was then returned to the runtime system, which output the second set of error messages and terminated execution.

longjmp and setjmp

To demonstrate error recovery with longjmp and setjmp, recompile errtest.c with the LONGJMP macro also defined:

armcc -c errtest.c -li -apcs 3/32bit/hardfp -DDIVIDE_ERROR -DLONGJMP

armlink -o errtest errtest.o rtstand.o

Then rerun errtest under armsd. We expect the integer divide by zero to occur once again:

> armsd -li -size 512K errtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992] ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file errtest


armsd: go

(the floating point instruction-set is not available)

Using integer arithmetic ...

10000 / 0X0000000A = 0X000003E8

10000 / 0X00000009 = 0X00000457

10000 / 0X00000008 = 0X000004E2

10000 / 0X00000007 = 0X00000594

10000 / 0X00000006 = 0X00000682

10000 / 0X00000005 = 0X000007D0

10000 / 0X00000004 = 0X000009C4

10000 / 0X00000003 = 0X00000D05

10000 / 0X00000002 = 0X00001388

10000 / 0X00000001 = 0X00002710

10000 / 0X00000000 = __err_handler called:

code = 0X00000001: divide by 0

caller's pc = 0X00008310


Returning from __err_handler() with

errnum = 0X00000001

Program terminated normally at PC = 0x00008558

0x00008558: 0xef000011 .... : > swi 0x11

armsd: quit




The runtime system detected the integer divide by zero and, as before, called __err_handler(), which produced the error messages. This time, however, __err_handler() used longjmp to return control to the program rather than to the runtime system.

Floating point support

Using errtest we can also demonstrate floating point support. You should already have copied the appropriate floating point emulator object code into your working directory; for the configuration used in this example, fpe_32l.o is the correct object file.

In addition, it is necessary to link with an FPE stub, which must be compiled from the supplied source (fpestub.s).

> armsd -li -size 512K errtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992] ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file errtest

armsd: go

(the floating point instruction-set is available)

Using Floating point, but casting to int ...

10000 / 0X0000000A = 0X000003E8

10000 / 0X00000009 = 0X00000457

10000 / 0X00000008 = 0X000004E2

10000 / 0X00000007 = 0X00000594

10000 / 0X00000006 = 0X00000682

10000 / 0X00000005 = 0X000007D0

10000 / 0X00000004 = 0X000009C4

10000 / 0X00000003 = 0X00000D05


10000 / 0X00000002 = 0X00001388

10000 / 0X00000001 = 0X00002710

10000 / 0X00000000 = __err_handler called:

code = 0X80000202: Floating Point Exception : Divide By Zero

caller's pc = 0XE92DE000


Returning from __err_handler() with

errnum = 0X80000202

Program terminated normally at PC = 0x00008558 (__rt_exit + 0x10)

+0010 0x00008558: 0xef000011 .... : > swi 0x11

armsd: quit



This time the floating point instruction set is found to be available, and when a floating point division by zero is attempted, __err_handler() is called with the details of the floating point divide-by-zero exception.

Running out of heap

A second example program, memtest.c, demonstrates how the standalone runtime system copes with allocating stack space, and also demonstrates the simple memory allocation function __rt_alloc. Let us first compile this program so that it repeatedly requests more memory until none is left:

armcc -li -apcs 3/32bit memtest.c -c

armlink -o memtest memtest.o rtstand.o

This can be run under armsd in the usual way:

> armsd -li -size 512K memtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992] ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file memtest

armsd: go



kernel memory management test

force stack to 4KB

request 0 words of heap - allocate 256 words at 0X000085A0

force stack to 8KB


force stack to 60KB

request 33211 words of heap - allocate 33211 words at 0X00049388

force stack to 64KB

request 49816 words of heap - allocate 5739 words at 0X00069A74

memory exhausted, 105376 words of heap, 64KB of stack

Program terminated normally at PC = 0x0000847c

0x0000847c: 0xef000011 .... : > swi 0x11

armsd: quit



This demonstrates that allocating space on the stack is working correctly, and also that the __rt_alloc() routine is working as expected. The program terminated because in the end __rt_alloc() could not allocate the requested amount of memory.

Stack overflow checking

memtest can also be used to demonstrate stack overflow checking by recompiling with the macro STACK_OVERFLOW defined. In this case the amount of stack required is increased until there is not enough stack available, and stack overflow detection causes the program to be aborted.

To recompile and link memtest.c, issue the following commands:

armcc -li -apcs 3/32bit memtest.c -c -DSTACK_OVERFLOW

armlink -o memtest memtest.o rtstand.o

Running this program under armsd produces the following output:


> armsd -li -size 512K memtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992] ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file memtest

armsd: go

kernel memory management test

force stack to 4KB


force stack to 256KB

request 1296 words of heap - allocate 1296 words at 0X0000AE20

force stack to 512KB

run time error: stack overflow

program terminated

Program terminated normally at PC = 0x0000847c

0x0000847c: 0xef000011 .... : > swi 0x11

armsd: quit



Stack overflow checking caught the case where too much stack was required, and the runtime system terminated the program after giving an appropriate diagnostic.

Extending the standalone runtime system

For many applications it may be desirable to have access to more of the standard C library than is provided by the minimal runtime system. This section demonstrates how to extract part of the standard C library and plug it into the standalone runtime system.

The function which we will add to rtstand is memmove(). Although this is small and easily extracted from the C library source, the same methodology can be applied to larger sections of the C library, e.g. the dynamic memory allocation system (malloc(), free(), etc.).

The source of the C library can be found in directory cl. The source for the memmove() function is in cl/string.c.

The extracted source for memmove() has been put into memmove.c, and the compile time option _copywords has been removed.

The function declaration for memmove() and a typedef for size_t (extracted from include/stddef.h) have been put into examples/clstand/memmove.h.

Our module can be compiled using:

armcc -c memmove.c -li -apcs 3/32bit

The output, memmove.o can be linked with the user’s other object modules together with rtstand.o in the normal way (see previous examples in this section).

The files rtstand1.s and rtstand1.h are modified versions of rtstand.s and rtstand.h respectively. rtstand1.s has the assembler code generated for __rt_memmove included in it, and memmove.h has been merged with rtstand.h to produce rtstand1.h.

The size of the standalone runtime library

rtstand.s has been separated into several code Areas. This allows armlink to detect any unreferenced Areas and eliminate them from the output image.

The table below shows the typical size of the Areas in rtstand.o:

Area                                                          Size in bytes
__main, __rt_exit
longjmp, setjmp
__rt_sdiv, __rt_udiv, __rt_udiv10, __rt_divtest

Table: Typical Area sizes in rtstand.o

If floating point support is definitely not required, then the EnsureNoFPSupport variable can be set to {TRUE}, and some extra space will be saved. After making any modifications to rtstand.s, the size of the various areas can be found using one of the following commands:

decaof -b rtstand.o

decaof -q rtstand.o

From the above table it is clear that for many applications the standalone runtime library will occupy roughly 0.5KB.