Programming with ARM Processors


Introduction of ARM Processors

ARM7 is a family of 32-bit RISC (Reduced Instruction Set Computer) processor cores developed by ARM Limited. The company was previously known as Advanced RISC Machines, and the architecture's name originally stood for Acorn RISC Machine. ARM cores are used in mobile phones, handheld organizers (PDAs), and portable consumer devices.

The ARM architecture has been designed to allow very small, yet high-performance implementations. The architectural simplicity of ARM processors leads to very small implementations, and small implementations allow devices with very low power consumption.

The ARM7 families provide a wide range of performance, from 100 MIPS to 1000 MIPS. ARM7 has produced architectural families that are compatible, flexible, and encompass the full range of embedded requirements.

ARM processors possess a unique combination of features that makes ARM the most popular embedded architecture today. First, ARM cores are very simple compared to most other general-purpose processors, which means that they can be manufactured using a comparatively small number of transistors, leaving plenty of space on the chip for application-specific macro cells.

A typical ARM chip can contain several peripheral controllers, a digital signal processor, and some amount of on-chip memory, along with an ARM core.

Second, both ARM ISA and pipeline design are aimed at minimizing energy consumption — a critical requirement in mobile embedded systems.

Third, the ARM architecture is highly modular: the only mandatory component of an ARM processor is the integer pipeline; all other components, including caches, MMU, floating point and other co-processors are optional, which gives a lot of flexibility in building application-specific ARM-based processors.

Finally, while being small and low-power, ARM processors provide high performance for embedded applications.

ASSEMBLER ARM Introduction

The ARM Assembler is a program that translates symbolic code (assembly language) into executable object code. This object code can be executed with an ARM7-based or ARM9-based microcontroller. If you have ever written a computer program directly in machine-recognizable form, such as binary or hexadecimal code, you will appreciate the advantages of programming in symbolic assembly language.

Assembly language operation codes (mnemonics) are designed to be easy to remember. You may symbolically express addresses and values referenced in the operand field of instructions. Since you assign these names, you can make them as meaningful as the mnemonics for the instructions.

An assembly program consists of three types of constructs:

  • Machine instructions which are code the machine can execute. Detailed discussion of the machine instructions is in the hardware manuals of the ARM microcontroller.
  • Assembler directives which define the program structure and symbols, and generate non-executable code (data, messages, and so on).
  • Assembler controls which set assembly modes and direct assembly flow.

C ARM Introduction

The C programming language is a general-purpose programming language that provides code efficiency, elements of structured programming, and a rich set of operators. C is not a big language and is not designed for any one particular area of application.

Its generality, combined with its absence of restrictions, makes C a convenient and effective programming solution for a wide variety of software tasks. Many applications can be solved more easily and efficiently with C than with other more specialized languages.

The C ARM Compiler is not a universal C compiler adapted for the ARM target. It is a ground-up implementation, dedicated to generating extremely fast and compact code for ARM microcontrollers. The C ARM Compiler provides you with the flexibility of programming in C and the code efficiency and speed of assembly language.

The C language on its own is not capable of performing operations (such as input and output) that would normally require intervention from the operating system. Instead, these capabilities are provided as part of the standard library. Because these functions are separate from the language itself, C is especially suited for producing code that is portable across a wide number of platforms.

Since the C ARM Compiler is a cross compiler, some aspects of the C programming language and standard libraries are altered or enhanced to address the peculiarities of an embedded target processor.

ARM Instruction Set

The ARM instruction set has the following key features, some of which are common to other processors, and some of which are not.

Load/store architecture

Only load and store instructions can access memory. This means that data processing operations have to use intermediate registers, loading the data from memory beforehand and storing it back again afterwards. Most operations actually require several instructions to carry out the required calculation, and each instruction will run as fast as possible instead of being slowed down by external memory accesses.

32-bit instructions

All instructions are of the same length, so the processor can fetch every instruction from memory in one cycle. In addition, all instructions are stored word-aligned in memory, which means that the bottom two bits of the program counter (r15) are always zero.

32-bit and 8-bit data

All ARM processors have load and store instructions that handle data as 32-bit words or 8-bit bytes. Words are always aligned on 4-byte boundaries.

32-bit addresses

All ARM processors that implement Architecture 3 or later have 32-bit addressing. Those implementing ARM Architectures 3 and 4 have retained the ability to perform 26-bit addressing for backwards compatibility.

37 registers

These comprise:

  • 30 general purpose registers, 15 of which are accessible at any one time.
  • 6 status registers, of which either one or two are accessible at any one time.
  • A program counter.

Flexible load and store multiple instructions

The ARM’s multiple load and store instructions allow any set of registers from a single bank to be transferred to and from memory by a single instruction.

No single instruction to move an immediate 32-bit value to a register

In general, a literal value must be loaded from memory. However, a large set of common 32-bit values can be generated in a single instruction.

Conditional execution

All instructions are executed conditionally on the state of the Current Program Status Register (CPSR).

Only data processing operations with the S bit set change the state of the current program status register.

Powerful barrel shifter

The second argument to all data-processing and single data-transfer operations can be shifted in quite a general way before the operation is performed.

This supports, but is not limited to, scaled addressing, multiplication by a small constant, and the construction of constants, within a single instruction.

Co-processor instructions

These support a general way to extend the ARM’s architecture in a customer-specific manner.

Structure of an Assembler Module

This section describes the structure of an assembler module in simplified form, giving you the basics required for writing simple assembler programs.

The following is a simple example which illustrates some of the core constituents of an ARM assembler module:

        AREA    Example, CODE, READONLY ; name this block of code
        ENTRY                           ; mark first instruction
                                        ; to execute
start
        MOV     r0, #15                 ; Set up parameters
        MOV     r1, #20
        BL      firstfunc               ; Call subroutine
        SWI     0x11                    ; terminate
firstfunc                               ; Subroutine firstfunc
        ADD     r0, r0, r1              ; r0 = r0 + r1
        MOV     pc, lr                  ; Return from subroutine
                                        ; with result in r0
        END                             ; mark end of file

The AREA directive

Areas are chunks of data or code that are manipulated by the linker. A complete application will consist of one or more areas.

The example above consists of a single area which contains code and is marked as being read-only. A single CODE area is the minimum required to produce an application.

The ENTRY directive

The first instruction to be executed within an application is marked by the ENTRY directive. An application can contain only a single entry point and so in a multi-source-module application, only a single module will contain an ENTRY directive.

Note that when an application contains C code, the entry point will usually be contained within the C library.

Conditional Execution

The ARM’s ALU status flags

The ARM’s Program Status Register contains, among other flags, copies of the ALU status flags:

N - Negative result from ALU flag

Z - Zero result from ALU flag

C - ALU operation Carried out

V - ALU operation oVerflowed

Data processing instructions change the state of the ALU’s N, Z, C and V status outputs, but these are latched in the PSR’s ALU flags only if a special bit (the S bit) is set in the instruction.
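How the four flags follow from an addition can be sketched in Python. This is only an illustrative model of a 32-bit ADDS, not ARM code; the function name adds_flags is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def adds_flags(a, b):
    """Model a 32-bit ADDS: return (result, N, Z, C, V)."""
    a &= MASK32
    b &= MASK32
    full = a + b                      # unbounded sum, may need 33 bits
    result = full & MASK32            # what ends up in the destination register
    n = (result >> 31) & 1            # N: top bit of the result
    z = int(result == 0)              # Z: result is zero
    c = int(full > MASK32)            # C: carry out of bit 31
    # V: both operands have the same sign, and the result's sign differs
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, n, z, c, v

print(adds_flags(0x7FFFFFFF, 1))      # signed overflow: N and V set
print(adds_flags(0xFFFFFFFF, 1))      # wraps to zero: Z and C set
```

An ADD without the S suffix would compute the same result but leave the flags in the PSR untouched.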

Execution conditions

Every ARM instruction has a 4-bit field that encodes the conditions under which it will be executed.

Field mnemonic    Condition

EQ                Z set (equal)
NE                Z clear (not equal)
CS/HS             C set (unsigned >=)
CC/LO             C clear (unsigned <)
MI                N set (negative)
PL                N clear (positive or zero)
VS                V set (overflow)
VC                V clear (no overflow)
HI                C set and Z clear (unsigned >)
LS                C clear or Z set (unsigned <=)
GE                N and V the same (signed >=)
LT                N and V differ (signed <)
GT                Z clear, N and V the same (signed >)
LE                Z set, or N and V differ (signed <=)
AL                Always execute (the default if none is specified)

If the condition field indicates that a particular instruction should not be executed given the current settings of the status flags, the instruction will simply soak up one cycle but will have no other effect.
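The condition check can be sketched as a small Python predicate over the four flags. It uses the standard ARM condition mnemonics (EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL); the function name condition_passed is hypothetical, and the sketch only models the decision, not the instruction itself.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with condition `cond` would execute."""
    tests = {
        "EQ": z == 1,             "NE": z == 0,
        "CS": c == 1,             "CC": c == 0,
        "MI": n == 1,             "PL": n == 0,
        "VS": v == 1,             "VC": v == 0,
        "HI": c == 1 and z == 0,  "LS": c == 0 or z == 1,
        "GE": n == v,             "LT": n != v,
        "GT": z == 0 and n == v,  "LE": z == 1 or n != v,
        "AL": True,
    }
    return tests[cond]

# Flags left by CMP r0, r1 with r0 = 1 and r1 = 2 (1 - 2 is negative, with a borrow)
n, z, c, v = 1, 0, 0, 0
print(condition_passed("LT", n, z, c, v))   # signed less-than holds
print(condition_passed("GT", n, z, c, v))   # signed greater-than does not
```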

If the current instruction is a data processing instruction and it is to update the flags, its mnemonic must be postfixed with an S. The exceptions to this are CMP, CMN, TST and TEQ, which always update the flags (since this is their only effect).


ADD     r0, r1, r2      ; r0 = r1 + r2, don’t update flags
ADDS    r0, r1, r2      ; r0 = r1 + r2 and UPDATE flags
ADDEQS  r0, r1, r2      ; If Z flag set then r0 = r1 + r2,
                        ; and UPDATE flags
CMP     r0, r1          ; Update flags based on r0 - r1

Using conditional execution

Most non-ARM processors only allow conditional execution of branch instructions. This means that small sections of code that should only be executed under certain conditions have to be skipped by means of a branch statement.

Consider Euclid’s Greatest Common Divisor algorithm:

function gcd (integer a, integer b) : result is integer
    while (a <> b) do
        if (a > b) then
            a = a - b
        else
            b = b - a
        endif
    endwhile
    result = a

This might be coded as:


gcd
        CMP     r0, r1
        BEQ     end
        BLT     less
        SUB     r0, r0, r1
        BAL     gcd
less
        SUB     r1, r1, r0
        BAL     gcd
end


This will work correctly on an ARM, but every time a branch is taken, three cycles will be wasted in refilling the pipeline and continuing execution from the new location.

Also, because of the number of branches in the code, the code will occupy seven words of memory. Using conditional execution, ARM code can improve both its execution time and code density:


gcd
        CMP     r0, r1
        SUBGT   r0, r0, r1
        SUBLT   r1, r1, r0
        BNE     gcd

Not only has code size been reduced from seven words to four, but execution time has also decreased. In this case, replacing branches with conditional execution of all instructions has given a saving of three cycles. With all inputs to the gcd algorithm, the conditional version of the code will execute in the same number of cycles (when both inputs are the same), or fewer cycles.


r0: a    r1: b    Instruction         Cycles

1        2        CMP r0, r1          1
1        2        BEQ end             Not executed - 1
1        2        BLT less            3
1        2        SUB r1, r1, r0      1
1        1        BAL gcd             3
1        1        CMP r0, r1          1
1        1        BEQ end             3

                                      Total = 13

Table. Only branches conditional


r0: a    r1: b    Instruction         Cycles

1        2        CMP r0, r1          1
1        2        SUBGT r0, r0, r1    Not executed - 1
1        1        SUBLT r1, r1, r0    1
1        1        BNE gcd             3
1        1        CMP r0, r1          1
1        1        SUBGT r0, r0, r1    Not executed - 1
1        1        SUBLT r1, r1, r0    Not executed - 1
1        1        BNE gcd             Not executed - 1

                                      Total = 10

Table. All instructions conditional
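The two totals can be reproduced with a rough Python cycle model. The cost model is an assumption made for illustration: a taken branch costs 3 cycles and every other instruction, executed or skipped, costs 1 cycle. The function names are hypothetical. For the inputs a = 1 and b = 2 the model gives 13 and 10 cycles, matching the totals in the tables.

```python
TAKEN_BRANCH = 3   # assumed cost of a taken branch (pipeline refill)
OTHER = 1          # assumed cost of any other instruction, executed or skipped

def cycles_branch_version(a, b):
    """Cycle count of the gcd loop that uses conditional branches only."""
    cycles = 0
    while True:
        cycles += OTHER                    # CMP r0, r1
        if a == b:
            return cycles + TAKEN_BRANCH   # BEQ end (taken)
        cycles += OTHER                    # BEQ end (not taken)
        if a < b:
            cycles += TAKEN_BRANCH         # BLT less (taken)
            b -= a                         # SUB r1, r1, r0
        else:
            cycles += OTHER                # BLT less (not taken)
            a -= b                         # SUB r0, r0, r1
        cycles += OTHER                    # the SUB itself
        cycles += TAKEN_BRANCH             # BAL gcd

def cycles_conditional_version(a, b):
    """Cycle count of the gcd loop that uses conditional execution."""
    cycles = 0
    while True:
        cycles += 3 * OTHER                # CMP, SUBGT, SUBLT (one may be skipped)
        not_equal = a != b                 # flags come from CMP; the SUBs lack S
        if a > b:
            a -= b                         # SUBGT r0, r0, r1
        elif a < b:
            b -= a                         # SUBLT r1, r1, r0
        if not not_equal:
            return cycles + OTHER          # BNE gcd (not taken)
        cycles += TAKEN_BRANCH             # BNE gcd (taken)

print(cycles_branch_version(1, 2), cycles_conditional_version(1, 2))
```

With equal inputs both functions return the same count, which is the one case where the conditional version brings no saving.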

Running the gcd examples

Both assembler versions of the gcd algorithm can be found in directory examples/basic asm.

To assemble them, first copy them to your current work directory and then issue the commands:

armasm gcd1.s

armlink gcd1.o -o gcd1

This produces the version with conditional execution of branch statements only.

To produce the version with full conditional execution of all instructions use:

armasm gcd2.s

armlink gcd2.o -o gcd2

Run these using the debugger and examine the difference in the way they execute.

The ARM’s Barrel Shifter

The ARM core contains a barrel shifter which takes a value to be shifted or rotated, an amount to shift or rotate by, and the type of shift or rotate. This can be used by various classes of ARM instructions to perform comparatively complex operations in a single instruction. Instructions take no longer to execute by making use of the barrel shifter, unless the amount to be shifted is specified by a register, in which case the instruction will take an extra cycle to complete.

The barrel shifter can perform the following types of operation:

  • LSL shift left by n bits (multiplication by 2^n).
  • LSR logical shift right by n bits (unsigned division by 2^n).
  • ASR arithmetic shift right by n bits (signed division by 2^n). The bits fed into the top end of the operand are copies of the original top (sign) bit.
  • ROR rotate right by n bits.
  • RRX rotate right extended by 1 bit. This is a 33-bit rotate, where the 33rd bit is the PSR Carry flag.
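The five operations can be modelled on 32-bit values in Python. This is a minimal sketch for checking results by hand; the function names simply mirror the shift mnemonics and are not part of any ARM tool.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left: zeros enter at the bottom."""
    return (x << n) & MASK32

def lsr(x, n):
    """Logical shift right: zeros enter at the top."""
    return (x & MASK32) >> n

def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter at the top."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32                  # reinterpret as a negative number
    return (x >> n) & MASK32

def ror(x, n):
    """Rotate right: bits leaving the bottom re-enter at the top."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def rrx(x, carry):
    """33-bit rotate right by one through carry: return (result, new_carry)."""
    x &= MASK32
    return ((carry & 1) << 31) | (x >> 1), x & 1

print(hex(asr(0xFFFFFFF0, 2)))        # -16 shifted right by 2 is -4
print(hex(ror(0xFF, 6)))
```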

The barrel shifter can be used in several of the ARM’s instruction classes. The options available in each case are described below.

Data processing operations

The last operand (the second for binary operations and the first for unary operations) may be:

An 8 bit constant rotated right (ROR) through an even number of positions

For example:

ADD r0, r1, #0xC5, 10

MOV r5, #0xFC000003

Note that in the second example the assembler is left to work out how to split the constant 0xFC000003 into an 8-bit constant and an even rotate (in this case #0xFC000003 could be replaced by #0xFF, 6).

A register (optionally) shifted or rotated either by a 5-bit constant or by another register

For example:

ADD r0, r1, r2

SUB r0, r1, r2, LSR #10

CMP r1, r2, ROR r5

MVN r3, r2, RRX

Note that in the last example, the rotate right extended does not take a parameter, but rather rotates right by only a single bit. RRX is actually encoded by the assembler as ROR #0.
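The immediate form described above (an 8-bit constant rotated right through an even number of positions) can be checked numerically. The sketch below expands the two immediates used in the examples; decode_immediate is a hypothetical helper name.

```python
def decode_immediate(imm8, rotate):
    """Expand an 8-bit constant rotated right by an even amount to 32 bits."""
    assert 0 <= imm8 <= 0xFF, "constant must fit in 8 bits"
    assert rotate % 2 == 0 and 0 <= rotate <= 30, "rotate must be even, 0..30"
    return ((imm8 >> rotate) | (imm8 << (32 - rotate))) & 0xFFFFFFFF

print(hex(decode_immediate(0xC5, 10)))   # operand of ADD r0, r1, #0xC5, 10
print(hex(decode_immediate(0xFF, 6)))    # the split chosen for #0xFC000003
```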

Example: Constant multiplication

The ARM core provides a powerful multiplication facility in the MUL and MLA instructions (plus UMULL, UMLAL, SMULL and SMLAL on processors that implement ARM Architectures 3M and 4M). Those instructions make use of Booth’s Algorithm to perform integer multiplication, taking up to 17 cycles to complete for MUL and MLA and up to 6 or 7 cycles to complete for UMULL, UMLAL, SMULL and SMLAL. In cases where the multiplication is by a constant, it can be quicker to make use of the barrel shifter, as the operations it provides are effectively multiply / divide by powers of two.

For example:

r0 = r1 * 4                                           MOV r0, r1, LSL #2

r0 = r1 * 5 => r0 = r1 + (r1 * 4)     ADD r0, r1, r1, LSL #2

r0 = r1 * 7 => r0 = (r1 * 8) - r1      RSB r0, r1, r1, LSL #3

Using a move, add or reverse subtract combined with a shift, any multiplication by a constant that is a power of two, or a power of two +/– 1, can be carried out in a single cycle.
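The three shift-based multiplications can be verified with a short Python model. Each function mirrors one of the instructions above; the function names are hypothetical and results are truncated to 32 bits, as they would be in a register.

```python
MASK32 = 0xFFFFFFFF

def times4(r1):
    return (r1 << 2) & MASK32            # MOV r0, r1, LSL #2

def times5(r1):
    return (r1 + (r1 << 2)) & MASK32     # ADD r0, r1, r1, LSL #2

def times7(r1):
    return ((r1 << 3) - r1) & MASK32     # RSB r0, r1, r1, LSL #3

for value in (0, 1, 6, 1000):
    assert times4(value) == value * 4
    assert times5(value) == value * 5
    assert times7(value) == value * 7
```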

Single data transfer instructions

The single data transfer instructions LDR and STR load and store the contents of a single register to and from memory. They make use of a base register (r0 in the examples below) plus an index (or offset) which can be a register shifted by any 5-bit constant or an un-shifted 12-bit constant.

STR     r7, [r0], #24               ; Post-indexed
LDR     r2, [r0], r4, ASR #4        ; Post-indexed
STR     r3, [r0, r5, LSL #3]        ; Pre-indexed
LDR     r6, [r0, r1, ROR #6]!       ; Pre-indexed + Writeback

In pre-indexed instructions, the offset is calculated and added to the base, and the resulting address is used for the transfer. If writeback is selected, the transfer address is written back into the base register.

In post-indexed instructions the offset is calculated and added to the base after the transfer. The base register is always updated by post-indexed instructions.
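The difference between the addressing modes comes down to which address is used for the transfer and what the base register holds afterwards. The following Python sketch models just that; the function names and the sample base address 0x1000 are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF

def pre_indexed(base, offset, writeback=False):
    """[base, offset] or [base, offset]! : return (transfer_address, new_base)."""
    address = (base + offset) & MASK32
    return address, (address if writeback else base)

def post_indexed(base, offset):
    """[base], offset : transfer at base, then always update the base."""
    return base, (base + offset) & MASK32

print(pre_indexed(0x1000, 24))                  # base register unchanged
print(pre_indexed(0x1000, 24, writeback=True))  # base register updated
print(post_indexed(0x1000, 24))                 # transfer uses the old base
```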

Example: Addressing an entry in a table of words

The following fragment of code calculates the address of an entry in a table of words and then loads the desired word:

; r0 holds the entry number [0,1,2,...]
        LDR     r1, =StartOfTable
        MOV     r3, #4
        MLA     r1, r0, r3, r1
        LDR     r2, [r1]

StartOfTable
        DCD     <table data>

It first loads the start address of the table, then moves the immediate constant 4 into a register, using the multiply and accumulate instruction to calculate the address, and finally loads the entry.

However, this operation can be performed more efficiently with the barrel shifter, as follows:

; r0 holds the entry number [0,1,2,...]
        LDR     r1, =StartOfTable
        LDR     r2, [r1, r0, LSL #2]

StartOfTable
        DCD     <table data>

Here, the barrel shifter shifts r0 left 2 bits (multiplying it by 4). This intermediate value is then used as the index for the LDR instruction. Thus a single instruction is used to perform the whole operation. Such significant savings can frequently be made by utilizing the barrel shifter.
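The address arithmetic behind the scaled load can be sketched in Python. The table contents and the function name load_word are made up for illustration; the point is only that the entry address is the base plus the index shifted left by 2.

```python
import struct

# Hypothetical stand-in for the DCD table: four little-endian 32-bit words
table = struct.pack("<4I", 10, 20, 30, 40)

def load_word(memory, base, index):
    """Model LDR r2, [r1, r0, LSL #2]: read the word at base + index * 4."""
    address = base + (index << 2)
    return struct.unpack_from("<I", memory, address)[0]

print(load_word(table, 0, 2))   # third entry of the table
```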

Program status register transfer instructions

It is possible to modify the N, Z, C and V flags of the PSRs by use of an MSR instruction of the form

MSR     cpsr_flg, #expression       ; or spsr_flg in privileged mode

The assembler will attempt to generate a shifted 8-bit value to match the expression, the top four bits of which can be loaded into the top four bits of the PSR. This will not disturb the control bits. The flag bits are the only part of the CPSR which can be modified while in User mode.
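The effect on the PSR can be modelled in Python as replacing the top four bits and leaving the rest alone. This is a sketch of the behaviour described above, not of the instruction encoding; msr_flags and the sample PSR values are hypothetical.

```python
FLAG_MASK = 0xF0000000   # N, Z, C and V occupy the top four bits of the PSR

def msr_flags(psr, expression):
    """Load the top four bits of `expression` into the flag bits of `psr`."""
    control_bits = psr & ~FLAG_MASK & 0xFFFFFFFF   # everything except the flags
    return control_bits | (expression & FLAG_MASK)

print(hex(msr_flags(0x000000D3, 0xF0000000)))   # all four flags set
print(hex(msr_flags(0xF0000010, 0x40000000)))   # only Z left set
```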

Loading Constants into Registers

Why is loading constants an issue?

Since all ARM instructions are 32 bits long and since they don’t use the instruction stream as data, there is no single instruction which loads a 32-bit immediate constant into a register without performing a data load from memory.

Although a data load will place any 32-bit value in a register, there are more direct—and therefore more efficient—ways to load many commonly used constants.

Direct loading with MOV/MVN

The MOV instruction allows 8-bit constant values to be loaded directly into a register, giving a range of 0x0 to 0xFF (255). The bitwise complement of these values can be constructed using MVN, giving the added ability to load values in the range 0xFFFFFF00 to 0xFFFFFFFF.

We can construct even more constants by using MOV and MVN in conjunction with the barrel shifter. These particular constants are 8-bit values rotated right through an even number of positions (giving rotate rights of 0, 2, 4...28, 30):

Decimal values                    Equivalent hexadecimal    Step    Rotate

0 - 255                           0 - 0xFF                  1       No rotate
256, 260, 264, ..., 1016, 1020    0x100 - 0x3FC             4       Right by 30 bits
1024, 1040, 1056, ..., 4080       0x400 - 0xFF0             16      Right by 28 bits
4096, 4160, 4224, ..., 16320      0x1000 - 0x3FC0           64      Right by 26 bits

And so on, plus their bitwise complements. We can therefore load constants directly into registers using instructions such as:

MOV     r0, #0xFF           ; r0 = 255
MOV     r0, #0xFF, 30       ; r0 = 1020
MOV     r0, #0xFF, 28       ; r0 = 4080
MOV     r0, #0x40, 26       ; r0 = 4096

However, converting a constant into this form is an onerous task. The assembler therefore attempts the conversion itself. If the supplied constant cannot be expressed as a shifted 8-bit value or its bitwise complement, the assembler will report this as an error.

The following example illustrates how this works. The left-hand column lists the ARM instructions entered by the user, while the right-hand column shows the assembler’s attempts to convert the supplied constants to an acceptable form.

MOV     r0, #0xFF           ; => MOV r0, #0xFF
MOV     r0, #1020           ; => MOV r0, #0xFF, 30
MOV     r0, #4080           ; => MOV r0, #0xFF, 28
MOV     r0, #4096           ; => MOV r0, #0x40, 26

The above code is available as loadcon1.s in directory examples/basic asm. To assemble it, first copy it into your current working directory and then issue the command:

armasm loadcon1.s -o loadcon1.o

To confirm that the assembler has produced the correct code, you can disassemble it using the ARM Object format decoder:

decaof -c loadcon1.o
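The conversion that the assembler attempts can be sketched in Python: try every even rotation to find an 8-bit value, then try the bitwise complement, and otherwise report an error. This is an illustrative search, not the assembler's actual algorithm, and the function names are hypothetical. Where several encodings exist it returns the one with the smallest rotation, which may differ from the encoding a real assembler picks.

```python
MASK32 = 0xFFFFFFFF

def rotate_left(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def encode_rotated(value):
    """Return (imm8, rotate) with imm8 ROR rotate == value, or None."""
    for rotate in range(0, 32, 2):
        imm8 = rotate_left(value & MASK32, rotate)   # undo a rotate right
        if imm8 <= 0xFF:
            return imm8, rotate
    return None

def encode_constant(value):
    """Return (mnemonic, imm8, rotate) using MOV, or MVN for complements."""
    encoding = encode_rotated(value)
    if encoding is not None:
        return ("MOV",) + encoding
    encoding = encode_rotated(~value & MASK32)
    if encoding is not None:
        return ("MVN",) + encoding
    raise ValueError(f"{value:#x} is not a rotated 8-bit value or its complement")

print(encode_constant(1020))
print(encode_constant(0xFFFFFFFF))
```

A value such as 0x55555555 makes encode_constant raise ValueError, which corresponds to the assembler reporting an error.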

Direct loading with LDR Rd, =numeric constant

The assembler provides a mechanism which, unlike MOV and MVN, can construct any 32-bit numeric constant, but which may not result in a data processing operation to do it. This is the LDR Rd, = instruction.

If the constant which is specified in an LDR Rd, = instruction can be constructed with either MOV or MVN, the assembler will use the appropriate instruction, otherwise it will produce an LDR instruction with a PC-relative address to read the constant from a literal pool.

Literal pools

A literal pool is a portion of memory set aside for constants. By default, a literal pool is placed at every END directive.

However, for large programs, this may not be accessible throughout the program (due to the LDR offset being a 12-bit value, giving a 4Kbyte range), so further literal pools can be placed using the LTORG directive.

When an LDR Rd, = instruction needs to access a constant in a literal pool, the assembler first checks previously encountered literal pools to see whether the desired constant is already available and addressable.

If so, it addresses the existing constant; otherwise it will attempt to place the constant in the next available literal pool. If that pool is not addressable, because it does not exist or is further than 4Kbytes away, an error will result, and an additional LTORG should be placed close to (but after) the failed LDR Rd, = instruction.

To see how this works in practice, consider the following example. The instructions listed as comments are the ARM instructions which are generated by the assembler.

        AREA    Loadcon2, CODE
        ENTRY                           ; Mark first instruction
        BL      func1                   ; Branch to first subroutine
        BL      func2                   ; Branch to second subroutine
        SWI     0x11                    ; Terminate
func1
        LDR     r0, =42                 ; => MOV R0, #42
        LDR     r1, =0x55555555         ; => LDR R1, [PC, #offset to
                                        ; Literal Pool 1]
        LDR     r2, =0xFFFFFFFF         ; => MVN R2, #0
        MOV     pc, lr
        LTORG                           ; Literal Pool 1 contains
                                        ; literal &55555555
func2
        LDR     r3, =0x55555555         ; => LDR R3, [PC, #offset to
                                        ; Literal Pool 1]
        ; LDR   r4, =0x66666666         ; If this is uncommented it
                                        ; will fail, as Literal Pool 2
                                        ; is not accessible (out of reach)
        MOV     pc, lr
LargeTable % 4200                       ; Clears a 4200 byte area of
                                        ; memory starting at the
                                        ; current location to zero.
        END                             ; Literal Pool 2 is empty


Note that the literal pools must be placed outside sections of code, since otherwise they would be executed by the processor as instructions. This will typically mean placing them between subroutines as is done here if more pools than the default one at END are required.

The above code is available as loadcon2.s in directory examples/basic asm. To assemble this, first copy it into your current working directory and then issue the command:

armasm loadcon2.s

To confirm that the assembler has produced the correct code, the code area can be disassembled using the ARM Object format decoder:

decaof -c loadcon2.o
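The 4Kbyte reach limit discussed above can be sketched as a simple range check in Python. The sketch assumes that the PC reads as the address of the LDR plus 8, as it does in ARM state on these cores, and that the offset is a 12-bit magnitude that can be added or subtracted. The function name and the addresses are hypothetical.

```python
def literal_reachable(ldr_address, literal_address):
    """Can a PC-relative LDR at ldr_address reach a literal at literal_address?"""
    pc = ldr_address + 8                 # PC is two instructions ahead
    offset = literal_address - pc
    return -4095 <= offset <= 4095       # 12-bit offset, either direction

# A literal a few words away is reachable.
print(literal_reachable(0x20, 0x2C))
# A literal placed after a 4200-byte table is not.
print(literal_reachable(0x30, 0x30 + 8 + 4200))
```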

Loading Addresses into Registers

It will often be necessary to load a register with an address—the location of a string constant within the code segment or the start location of a jump table, for example.

However, because ARM code is inherently re-locatable and because there are limitations on the values that can be directly moved into a register, absolute addressing cannot be used for this purpose. Instead, addresses must be expressed as offsets from the current PC. A register can either be directly set by combining the current PC with the appropriate offset, or the address can be loaded from a literal pool.

The ADR and ADRL pseudo instructions

Sometimes it is important for the purposes of efficiency that loading an address does not perform a memory access. The assembler provides two pseudo instructions, ADR and ADRL, which make it easier to do this. ADR and ADRL accept a PC-relative expression (a label within the same code area) and calculate the offset required to reach that location.

ADR will attempt to produce a single instruction to load an address into a register in the same way that the LDR Rd, = mechanism produces instructions.

If the desired address cannot be constructed in a single instruction, an error will be raised. In typical usage the offset range is 255 bytes for an offset to a non word-aligned address, and 1020 bytes (255 words) for an offset to a word-aligned address.

ADRL will attempt to produce two data processing instructions to load an address into a register.

Even if it is possible to produce a single data processing instruction to load the address, a second, redundant instruction will be produced (this is a consequence of the strict two-pass nature of the assembler). In cases where it is not possible to construct the address using two data processing instructions, ADRL will produce an error, and in such cases the LDR Rd, = mechanism is probably the best alternative. In typical usage the range of an ADRL is 64Kbytes for a non-word aligned address and 256Kbytes for a word-aligned address.

The following example shows how this works. The instruction listed in the comment is the ARM instruction which is generated by the assembler.

        AREA    Loadcon3, CODE
        ENTRY                           ; Mark first instruction
Start
        ADR     r0, Start               ; => SUB r0, PC, #offset to Start
        ADR     r1, DataArea            ; => ADD r1, PC, #offset to DataArea
        ADRL    r2, DataArea+4300       ; => ADD r2, PC, #offset1
                                        ;    ADD r2, r2, #offset2
        SWI     0x11                    ; Terminate
DataArea % 8000
        END


The above code is available as loadcon3.s in directory examples/basic asm. To assemble this, first copy it into your current working directory and then issue the command:

armasm loadcon3.s

To confirm that the assembler produced the correct code, the code area can be disassembled using the ARM Object format decoder:

decaof -c loadcon3.o

LDR Rd, =PC-relative expression

As well as numeric constants, the LDR Rd, = mechanism can cope with PC-relative expressions such as labels. Even if a PC-relative ADD or SUB could be constructed, an LDR will be generated to load the PC-relative expression.

If a PC-relative ADD or SUB is desired, ADR should be used instead.

If no suitable literal is already available, the literal placed into the next literal pool will be the offset into the AREA, and an AREA-relative relocation directive will be added to ensure that the constant is appropriate wherever the containing AREA gets located by the linker.

The following example illustrates how this works. The instruction listed in the comment is the ARM instruction which is generated by the assembler.

        AREA    Loadcon4, CODE
        ENTRY                           ; Mark first instruction
Start
        BL      func1                   ; Branch to first subroutine
        BL      func2                   ; Branch to second subroutine
        SWI     0x11                    ; Terminate
func1
        LDR     r0, =Start              ; => LDR R0, [PC, #offset to
                                        ; Litpool 1]
        LDR     r1, =Darea+12           ; => LDR R1, [PC, #offset to
                                        ; Litpool 1]
        LDR     r2, =Darea+6000         ; => LDR R2, [PC, #offset to
                                        ; Litpool 1]
        MOV     pc, lr                  ; Return
        LTORG                           ; Literal Pool 1 contains 3 literals
func2
        LDR     r3, =Darea+6000         ; => LDR r3, [PC, #offset to
                                        ; Litpool 1]
                                        ; (sharing with previous literal)
        ; LDR   r4, =Darea+6004         ; If uncommented will produce an
                                        ; error as Litpool 2 is out of range
        MOV     pc, lr                  ; Return
Darea   % 8000
        END                             ; Literal Pool 2 is out of
                                        ; range of the LDR instructions
                                        ; above

The above code is available as loadcon4.s in directory examples/basic asm. To assemble this, first copy it into your current working directory and then issue the command:

armasm loadcon4.s

To confirm that the assembler produced the correct code, the code area can be disassembled using the ARM Object format decoder:

decaof -c loadcon4.o

Example: String copy


The following program contains a function, strcopy, which copies a string from one memory location to another. Two arguments are passed to the function: the address of the source string and the address of the destination. The last character in the string is a zero, and it is copied as well.

        AREA    StrCopy, CODE
        ENTRY                           ; mark the first instruction
main
        ADR     r1, srcstr              ; pointer to first string
        ADR     r0, dststr              ; pointer to second string
        BL      strcopy                 ; copy the first into second
        SWI     0x11                    ; and exit
srcstr  DCB     "This is my first (source) string",0
dststr  DCB     "This is my second (destination) string",0
        ALIGN                           ; realign address to word boundary
strcopy
        LDRB    r2, [r1], #1            ; load byte, then update address
        STRB    r2, [r0], #1            ; store byte, then update address
        CMP     r2, #0                  ; check for zero terminator
        BNE     strcopy                 ; keep going if not
        MOV     pc, lr                  ; return
        END


ADR is used to load the addresses of the two strings into registers r0 and r1, for passing to the strcopy function. The two strings have been stored in memory using the assembler directive DCB (Define Constant Byte). The first string is 33 bytes long, so the offset to the second string, which is not word-aligned and is therefore limited to 255 bytes, is well within reach of ADR.

Notice the use of an auto-indexing address mode to update the address registers in the LDRB and STRB instructions. Thus:

LDRB r2, [r1], #1

Replaces a sequence like:

LDRB r2, [r1]

ADD r1, r1, #1

But it needs only one instruction rather than two, so the cycle spent on the separate ADD is saved.

The above code is available as strcopy1.s in directory examples/basic asm. Copy this into your current working directory and assemble it, with debug information included:

armasm strcopy1.s -g

Then link it and load it into the debugger

armlink strcopy1.o -o strcopy1 -d

armsd strcopy1

You can now view the source and destination strings using:

print/s @srcstr

print/s @dststr

Run the program and check that the destination string has been updated:


print/s @srcstr

print/s @dststr

Also in the examples directory is a version of this program called strcopy2.s, which uses LDR Rd,=PC-relative expression rather than ADR. Assemble this and compare the code and the code size with that of strcopy1.s, using the ARM Object format decoder:

armasm strcopy2.s

decaof -c strcopy2.o

decaof -c strcopy1.o

It is preferable to use ADR wherever possible, both because it results in shorter code (no storage space is required for addresses to be placed in the literal pool) and because the resulting code will run more quickly (a non-sequential fetch from memory to get the address from the literal pool is not required).

Jump Tables

Often it is necessary for an application to carry out one of a number of actions dependent upon a certain condition.

In C, for instance, this will often be implemented as a switch () statement.

In assembly language this can be done using a jump table.

Suppose we have a function, arithfunc, that implements a simple set of arithmetic operations. Its first argument (passed in r0) is a code that selects the operation, and its second and third arguments (r1 and r2) are the operands, referred to below as argument1 and argument2. The result is returned in r0.

The operation codes the function responds to are:

0 result = argument1

1 result = argument2

2 result = argument1 + argument2

3 result = argument1 – argument2

4 result = argument2 – argument1

Values outside this range will have the same effect as value 0.
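As a point of comparison, the same behaviour could be written in C with a switch statement. This is a minimal sketch; the function name arith_switch and its parameter names are illustrative and do not come from the ARM examples.

```c
/* Hypothetical C counterpart of the operation codes listed above. */
int arith_switch(unsigned code, int argument1, int argument2)
{
    switch (code) {
    case 1:  return argument2;
    case 2:  return argument1 + argument2;
    case 3:  return argument1 - argument2;
    case 4:  return argument2 - argument1;
    default: return argument1;   /* code 0 and every value above 4 */
    }
}
```

Taking the code as unsigned means a negative value becomes a large positive one and falls into the default case, which is the same trick the assembly version relies on.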

AREA          ArithGate, CODE  ; name this block of code

ENTRY                                       ; mark the first instruction to call

main           MOV r0, #2           ; set up three parameters

MOV          r1, #5

MOV          r2, #15

BL               arithfunc               ; call the function

SWI             0x11                               ; terminate

arithfunc                                  ; label the function

CMP           r0, #4                    ; Treat code as unsigned integer

BHI             ReturnA1              ; If code > 4 then return first argument

ADR           r3, JumpTable      ; Load address of the jump table

LDR             pc,[r3,r0,LSL #2]  ; Jump to appropriate routine


JumpTable

DCD           ReturnA1

DCD            ReturnA2

DCD            DoAdd

DCD            DoSub

DCD            DoRsb

ReturnA1

MOV          r0, r1                    ; Operation 0, >4

MOV          pc,lr

ReturnA2

MOV          r0, r2                    ; Operation 1

MOV          pc,lr

DoAdd

ADD            r0, r1, r2              ; Operation 2

MOV          pc,lr

DoSub

SUB            r0, r1, r2              ; Operation 3

MOV          pc,lr

DoRsb

RSB             r0, r1, r2              ; Operation 4

MOV          pc,lr

END                                         ; mark the end of this file

The ADR pseudo instruction loads the address of the jump table into r3.

The following LDR then multiplies the function code in r0 by 4 (using the barrel shifter) and adds this onto the address of the jump table to give the address of the required entry within the jump table. The jump table itself is set up using the DCD directive, which stores the address of the relevant routine (placed there by the linker).
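The closest C analogue of this dispatch is an array of function pointers indexed by the operation code. The sketch below is an illustration under that assumption; the names arith_table, jump_table and the helper routines are invented here. The array indexing does the job of the LSL #2 scaling.

```c
#include <stddef.h>

typedef int (*arith_op)(int a1, int a2);

static int return_a1(int a1, int a2) { (void)a2; return a1; }
static int return_a2(int a1, int a2) { (void)a1; return a2; }
static int do_add(int a1, int a2)    { return a1 + a2; }
static int do_sub(int a1, int a2)    { return a1 - a2; }
static int do_rsb(int a1, int a2)    { return a2 - a1; }

/* Counterpart of the DCD entries: one routine address per operation code. */
static const arith_op jump_table[] = {
    return_a1, return_a2, do_add, do_sub, do_rsb
};

int arith_table(unsigned code, int a1, int a2)
{
    const size_t entries = sizeof jump_table / sizeof jump_table[0];

    if (code >= entries)                /* CMP r0, #4 ; BHI ReturnA1 */
        return return_a1(a1, a2);
    return jump_table[code](a1, a2);    /* LDR pc, [r3, r0, LSL #2]  */
}
```

Calling arith_table(2, 5, 15) mirrors the parameters set up in main and yields 20 (0x14).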

The above code is available as jump.s in directory examples/basic asm. Copy this into your current working directory and assemble and link it:

armasm jump.s

armlink jump.o -o jump

Then load the resulting program into the debugger:

armsd jump


Run the program and display the registers:

go

registers

The value of r0 should be 0x14 (5 + 15, because main passes operation code 2).

Using the Load and Store Multiple Instructions

Multiple versus single transfers

The load and store multiple instructions LDM and STM provide an efficient way of moving the contents of several registers to and from memory. The advantages of using a single load or store multiple instruction over a series of single data transfer instructions are:

  • Smaller code size.
  • There is only a single instruction fetch overhead, rather than many instruction fetches.
  • Only one register write back cycle is required for a load multiple, as opposed to one for every load single.
  • On un-cached ARM processors, the first word of data transferred by a load or store multiple is always a non-sequential memory cycle, but all subsequent words transferred can be sequential (faster) memory cycles.

The register list

The registers transferred by the load and store multiple instructions are encoded into the instruction by one bit for each of the registers r0 to r15.

A set bit indicates that the register will be transferred, and a clear bit indicates that it will not be transferred. Thus it is possible to transfer any subset of the registers in a single instruction.

The subset of registers to be transferred is specified by listing them in curly brackets. For example:

{r1, r4-r6, r8, r10}
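The bit-per-register encoding can be modelled in C as a 16-bit mask in which bit n stands for register rn. The helpers below are hypothetical names introduced for illustration, not part of any ARM tool.

```c
#include <stdint.h>

/* Hypothetical helper: set bits lo..hi (inclusive) in a 16-bit register list. */
uint16_t reglist_range(unsigned lo, unsigned hi)
{
    uint16_t mask = 0;
    for (unsigned r = lo; r <= hi && r <= 15; r++)
        mask |= (uint16_t)(1u << r);
    return mask;
}

/* Number of registers (set bits) a transfer with this list would move. */
unsigned reglist_count(uint16_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= (uint16_t)(mask - 1))   /* clear lowest set bit */
        n++;
    return n;
}
```

For the list {r1, r4-r6, r8, r10}, OR-ing reglist_range(1, 1), reglist_range(4, 6), reglist_range(8, 8) and reglist_range(10, 10) gives 0x0572, and reglist_count reports 6 registers.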

Increment/decrement, before/after

The base address for the transfer can either be incremented or decremented between register transfers, and this can happen either before or after each register transfer. For example:

STMIA r10, {r1, r3-r5, r8}

The suffix IA could also have been IB, DA or DB, where I indicates increment, D decrement, A after and B before.

In all cases the lowest numbered register is transferred to or from the lowest memory address and the highest numbered register to or from the highest address.

The order in which the registers appear in the register list makes no difference. Also, the ARM always performs sequential memory accesses in increasing memory address order. A ‘decrementing’ transfer therefore subtracts the total transfer size from the base address first and then steps the transfer address upward, one register at a time.
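To make the four suffixes concrete, the following C sketch computes the memory address used for each register in a list, lowest-numbered register first. It is a model for illustration only: the names stm_mode and stm_addresses are assumptions, words are taken to be 4 bytes, and base write-back is not modelled.

```c
#include <stdint.h>

enum stm_mode { STM_IA, STM_IB, STM_DA, STM_DB };

/*
 * Fill addr[] with the address used for each register in reglist
 * (bit n = rn), lowest-numbered register first. Returns the number
 * of registers transferred.
 */
unsigned stm_addresses(enum stm_mode mode, uint32_t base,
                       uint16_t reglist, uint32_t addr[16])
{
    unsigned n = 0;
    for (unsigned r = 0; r < 16; r++)
        if (reglist & (1u << r))
            n++;

    uint32_t first;
    switch (mode) {
    case STM_IA: first = base;             break;
    case STM_IB: first = base + 4;         break;
    case STM_DA: first = base - 4 * n + 4; break;  /* subtract first ...   */
    default:     first = base - 4 * n;     break;  /* STM_DB               */
    }
    for (unsigned i = 0; i < n; i++)               /* ... then step upward */
        addr[i] = first + 4 * i;
    return n;
}
```

With a base of 0x1000 and the list {r1, r3-r5, r8} (mask 0x013A), IA uses 0x1000 to 0x1010, IB 0x1004 to 0x1014, DA 0x0FF0 to 0x1000 and DB 0x0FEC to 0x0FFC. In every case r1 goes to the lowest address and r8 to the highest.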

