Exploring ARM Assembly Language

Introduction

The ARM Assembler is a program that translates symbolic code (assembly language) into executable object code. This object code can be executed on an ARM7-based or ARM9-based microcontroller. If you have ever written a computer program directly in machine-recognizable form, such as binary or hexadecimal code, you will appreciate the advantages of programming in symbolic assembly language.


Assembly language operation codes (mnemonics) are designed to be easy to remember. You may symbolically express addresses and values referenced in the operand field of instructions. Since you assign these names, you can make them as meaningful as the mnemonics for the instructions.

 
Integer to String Conversion

This section explains how to:


  • Convert an integer to a string in ARM assembly language.
  • Use a stack in an ARM assembly language program.
  • Write a recursive function in ARM assembly language.

The example used can be found in file utoa1.s in directory examples/explasm. Its dtoa entry point converts a signed integer to a string of decimal digits (possibly with a leading '-'); its utoa entry point converts an unsigned integer to a string of decimal digits.


Algorithm

To convert a signed integer to a decimal string, generate a '-' and negate the number if it is negative; then convert the remaining unsigned value.


To convert a given unsigned integer to a decimal string, divide it by 10, yielding a quotient and a remainder. The remainder is in the range 0-9 and is used to create the last digit of the decimal representation.


If the quotient is non-zero it is dealt with in the same way as the original number, creating the leading digits of the decimal representation; otherwise the process has finished.
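Before turning to the ARM code, the algorithm can be sketched in Python. This is only an illustration: the function names are borrowed from the entry points mentioned above, divmod stands in for a divide-by-10 routine, and returning a Python string is an assumption of the sketch rather than a model of the character buffer used by the assembly version.

```python
def utoa(n: int) -> str:
    """Decimal digits of an unsigned integer, most significant first."""
    quotient, remainder = divmod(n, 10)    # remainder is in the range 0-9
    # A non-zero quotient is handled like the original number and
    # supplies the leading digits; a zero quotient ends the recursion.
    leading = utoa(quotient) if quotient != 0 else ""
    return leading + chr(ord("0") + remainder)


def dtoa(n: int) -> str:
    """Signed variant: emit '-' and negate, then convert the unsigned value."""
    return "-" + utoa(-n) if n < 0 else utoa(n)
```

The recursion is one level deep per decimal digit, which is why the last digit is produced first but written last.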


Implementation
utoa
        STMFD   sp!, {v1, v2, lr}       ; function entry: save some v-registers
                                        ; and the return address
        MOV     v1, a1                  ; preserve arguments over following
        MOV     v2, a2                  ; function calls
        MOV     a1, a2
        BL      udiv10                  ; a1 = a1 / 10
        SUB     v2, v2, a1, LSL #3      ; number - 8*quotient
        SUB     v2, v2, a1, LSL #1      ; - 2*quotient = remainder
        CMP     a1, #0                  ; quotient non-zero?
        MOVNE   a2, a1                  ; quotient to a2...
        MOV     a1, v1                  ; buffer pointer unconditionally to a1
        BLNE    utoa                    ; conditional recursive call to utoa
        ADD     v2, v2, #'0'            ; final digit
        STRB    v2, [a1], #1            ; store digit at end of buffer
        LDMFD   sp!, {v1, v2, pc}       ; function exit

The implementation of utoa employs the register naming and usage conventions of the ARM Procedure Call Standard:


  • a1-a4 are argument or scratch registers (a1 is the function result register).
  • v1-v5 are 'variable' registers (preserved across function calls).
  • sp is the stack pointer.
  • lr holds the subroutine call return address at routine entry.
  • pc is the program counter.

Explanation

On entry, a2 contains the unsigned integer to be converted and a1 addresses a buffer to hold the character representation of it.


On exit, a1 points immediately after the last digit written. Both the buffer pointer and the original number have to be saved across the call to udiv10. This could be done by saving the values to memory.


However, it turns out to be more efficient to use two 'variable' registers, v1 and v2 (which, in turn, have to be saved to memory).


Because utoa calls other functions, it must save its return link address passed in lr. The function therefore begins by stacking v1, v2 and lr using STMFD sp!, {v1,v2,lr}.


In the next block of code, a1 and a2 are saved (across the call to udiv10) in v1 and v2 respectively and the given number (a2) is moved to the first argument register (a1) before calling udiv10 with a BL (Branch with Link) instruction.


On return from udiv10, 10 times the quotient is subtracted from the original number (preserved in v2) by two SUB instructions. The remainder (in v2) is ready to be converted to character form (by adding ASCII '0') and to be stored into the output buffer.


But first, utoa has to be called to convert the quotient, unless that is zero.


The next four instructions do this, comparing the quotient (in a1) with 0, moving the quotient to the second argument register (a2) if not zero, moving the buffer pointer to the first argument/result register (a1), and calling utoa if the quotient is not zero.


Note that the buffer pointer is moved to a1 unconditionally:


If utoa is called recursively, a1 will be updated but will still identify the next free buffer location; if utoa is not called recursively, the next free buffer location is still needed in a1 by the following code which plants the remainder digit and returns the updated buffer location (via a1).


The remainder (in v2) is converted to character form by adding '0' and is then stored in the location addressed by a1. A post-incrementing STRB is used which stores the character and increments the buffer pointer in a single instruction, leaving the result value in a1.


Finally, the function is exited by restoring the saved values of v1 and v2 from the stack, loading the stacked link address into pc and popping the stack using a single multiple-load instruction:


LDMFD    sp!, {v1,v2,pc}


Stacks in assembly language

In this example, three words are pushed on to the stack on entry to utoa and popped off again on exit. By convention, ARM software uses r13, usually called sp, as a stack pointer pointing to the last-used word of a downward growing stack (a so-called 'full, descending' stack). However, this is only a convention and the ARM instruction set supports equally all four stacking possibilities: FD, FA, ED, EA.


The instruction used to push values on the stack was:


STMFD sp!, {v1, v2, lr}


The action of this instruction is as follows:


  • Subtract 4 * number-of-registers from sp.
  • Store the registers named in {...} in ascending register number order to memory at [sp], [sp,#4], [sp,#8] ...

The matching pop instruction was:


LDMFD     sp!, {v1, v2, pc}


Its action is:


  • Load the registers named in {...} in ascending register number order from memory at [sp], [sp,#4], [sp,#8] ...
  • Add 4 * number-of-registers to sp.
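
The push and pop actions listed above can be modelled in a few lines of Python. This is a rough sketch under stated assumptions: the class and method names are invented for illustration, memory is indexed in words rather than bytes (so sp moves by 1 per register instead of 4), and no stack-limit checking is attempted.

```python
class FullDescendingStack:
    """Word-indexed model of a full, descending stack (sp names the last-used word)."""

    def __init__(self, size_words: int = 64):
        self.mem = [0] * size_words
        self.sp = size_words              # empty stack: sp is just above the top word

    def stmfd(self, values):
        """Push: lower sp first, then store in ascending register-number order."""
        self.sp -= len(values)
        for offset, value in enumerate(values):
            self.mem[self.sp + offset] = value

    def ldmfd(self, count: int):
        """Pop: load from sp upwards, then raise sp."""
        values = self.mem[self.sp:self.sp + count]
        self.sp += count
        return values
```

Pushing three values and popping three values leaves sp where it started, which is the property utoa relies on at function exit.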

Many, if not most, register-save requirements in simple assembly language programs can be met using this approach to stacks.


A more complete treatment of run-time stacks requires a discussion of:


  • Stack-limit checking (and extension).
  • Local variables and stack frames.

In the utoa program, you must assume the stack is big enough to deal with the maximum depth of recursion, and in practice this assumption will be valid. The biggest 32-bit unsigned integer is about four billion, which has ten decimal digits, so utoa is entered at most 10 times. Each entry stacks 3 registers of 4 bytes each, so at most 10 x 3 x 4 = 120 bytes have to be stacked. Because the ARM Procedure Call Standard guarantees that there are at least 256 bytes of stack available when a function is called, and because we can guess (or know) that udiv10 uses no stack space, we can be confident that utoa is quite safe if called by an APCS-conforming caller such as a compiled C test harness.


The stacking technique illustrated here conforms to the ARM Procedure Call Standard only if the function using it makes no function calls.


However, when writing a whole program in assembly language you often know much more than when writing a program fragment for general, robust service. This allows you to gently break the APCS in the following way:


  • Any chain of function/subroutine calls can be considered compatible with the APCS provided it uses less than 256 bytes of stack space.

So the utoa example is APCS compatible, even though it is not APCS conforming.


Multiplication by a Constant

This section explains how to construct a sequence of ARM instructions to multiply by a constant.


For some applications in which speed is essential (Digital Signal Processing, for example), multiply is used extensively.


In many cases where a multiply is used, one of the values is a constant (e.g. weeks*7).


This section demonstrates how to improve the speed of multiply-by-constant by using a sequence of arithmetic instructions instead of the general-purpose multiplier.


Introduction

The MUL instruction has the following syntax:


MUL Rd, Rm, Rs


The timing of this instruction depends on the value in Rs. The ARM6 datasheet specifies that, for Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive, the multiply takes 1S + mI cycles.


Note: ARM7M family processors have a different implementation of MUL, which leads to a different relationship between cycle counts and values of Rs.


When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts that has the same effect. For instance, multiply by 5 could be achieved using a single instruction:


ADD Rd, Rm, Rm, LSL #2 ; Rd = Rm + (Rm * 4) = Rm * 5


This is obviously better than the MUL version:


MOV Rs, #5

MUL Rd, Rm, Rs


The cost of the general multiply includes the instructions needed to load the constant into a register (up to four may be needed, or an LDR from a literal pool) as well as the multiply itself.


Finding the optimum sequence

The difficulty in using a sequence of arithmetic instructions is that the constant must be decomposed into a set of operations which can be done by one instruction each. Consider multiply by 105. This could be achieved by decomposing as follows:

105 == 128 - 23
    == 128 - (16 + 7)
    == 128 - (16 + (8 - 1))

RSB     Rd, Rm, Rm, LSL #3      ; Rd = Rm*7
ADD     Rd, Rd, Rm, LSL #4      ; Rd = Rm*7 + Rm*16 = Rm*23
RSB     Rd, Rd, Rm, LSL #7      ; Rd = -Rm*23 + Rm*128 = Rm*105

Or as follows:


105 == 15 * 7
    == (16 - 1) * (8 - 1)

RSB     Rt, Rm, Rm, LSL #4      ; Rt = Rm*15 (tmp reg)
RSB     Rd, Rt, Rt, LSL #3      ; Rd = Rt*7 = Rm*105

The second method is the optimal solution, and it is fairly easy to find for small values such as 105.
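
Both decompositions of 105 can be checked with a small Python model of the instruction sequences. This is a sketch only: the helper names are invented for illustration, and each register is modelled as a Python integer masked to 32 bits after every operation.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def lsl(value: int, amount: int) -> int:
    """Logical shift left within a 32-bit register."""
    return (value << amount) & MASK


def mul105_three_instructions(rm: int) -> int:
    rd = (lsl(rm, 3) - rm) & MASK    # RSB Rd, Rm, Rm, LSL #3  -> Rm*7
    rd = (rd + lsl(rm, 4)) & MASK    # ADD Rd, Rd, Rm, LSL #4  -> Rm*23
    rd = (lsl(rm, 7) - rd) & MASK    # RSB Rd, Rd, Rm, LSL #7  -> Rm*105
    return rd


def mul105_two_instructions(rm: int) -> int:
    rt = (lsl(rm, 4) - rm) & MASK    # RSB Rt, Rm, Rm, LSL #4  -> Rm*15
    rd = (lsl(rt, 3) - rt) & MASK    # RSB Rd, Rt, Rt, LSL #3  -> Rt*7
    return rd
```

Both functions agree with (rm * 105) modulo 2^32 for any 32-bit input, because every step is ordinary modular arithmetic.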


However, the problem of finding the optimum becomes much more difficult for larger constant values. A program can be written to search exhaustively for the optimum, but it may take a long time to execute.


There are no known algorithms which solve this problem quickly.


Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a large constant, more than one temporary may be needed; otherwise the sequence will be longer.


The C compiler restricts the amount of searching it performs in order to minimize the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N.


This is to avoid the sequence becoming too long.


Division by a Constant

The ARM instruction set was designed following a RISC philosophy. One of the consequences of this is that the ARM core has no divide instruction, so divides must be performed using a subroutine.


This means that divides can be quite slow, but this is not a major issue as divide performance is rarely critical for applications.


It is possible to do better than the general divide in the special case when the divisor is a constant.


This section shows how the divide-by-constant technique works, and how to generate ARM assembler code for divide-by-constant.


This section explains:


  • How to improve on the general divide code for the case when the divisor is a constant.
  • The simple case for divide-by-2^n using the barrel shifter.
  • How to use divc.c to generate ARM code for divide-by-constant.

Special case for divide-by-2^n

In the special case when dividing by 2^n, a simple right shift is all that is required.


There is a small caveat which concerns the handling of signed and unsigned numbers.


For signed numbers, an arithmetic right shift is required, as this performs sign extension (to handle negative numbers correctly). In contrast, unsigned numbers require a 0-filled logical shift right:


MOV a2, a1, lsr #5 ; unsigned division by 32

MOV a2, a1, asr #10 ; signed division by 1024
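
The difference between the two shifts can be seen in a short Python model. It is a sketch with invented function names; registers are modelled as 32-bit patterns, and the signed helper returns a Python integer so that the result is easy to read.

```python
MASK = 0xFFFFFFFF


def udiv_pow2(x: int, n: int) -> int:
    """Unsigned divide by 2**n: logical shift right (zero fill)."""
    return (x & MASK) >> n


def sdiv_pow2(x: int, n: int) -> int:
    """Signed divide by 2**n: arithmetic shift right (sign fill)."""
    x &= MASK
    if x & 0x80000000:        # reinterpret the bit pattern as a negative value
        x -= 1 << 32
    return x >> n             # Python's >> on a negative int is arithmetic
```

One observation from the model: an arithmetic shift rounds towards minus infinity, so sdiv_pow2(-1, 10) gives -1, whereas a truncating division would give 0. Code that needs truncating behaviour for negative values has to adjust for this.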


Explanation of divide-by-constant ARM code

The divide-by-constant technique basically does a multiply in place of the divide, but is somewhat more complicated than the multiply technique.


Given that:

x/y == x * (1/y)

consider the factor (1/y) as a 0.32 fixed-point number, truncating any bits past the most significant 32 (0.32 means 0 bits before the binary point and 32 after it). Scaling it by 2^32 gives:

x/y == (x * (2^32/y)) / 2^32

where (2^32/y) is now a 32.0 fixed-point number, i.e. an ordinary 32-bit integer. Dividing by 2^32 is a right shift by 32 bits:

x/y == (x * (2^32/y)) >> 32


This is effectively returning the top 32-bits of the 64-bit product of x and (2^32/y).


If y is a constant, then (2^32/y) is also a constant.
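
The identity can be tried out in Python. This is a minimal sketch, not the code that divc.c generates: the function name is invented, the 64-bit product is formed with Python's unbounded integers, and the final correction step is an addition of this sketch, needed because truncating (2^32/y) can leave the estimate one too small. It assumes 2 <= y < 2^32 and 0 <= x < 2^32.

```python
def udiv_by_constant(x: int, y: int) -> int:
    """Unsigned x / y via multiplication by a truncated 32-bit reciprocal."""
    reciprocal = (1 << 32) // y        # (2^32/y), a constant when y is constant
    quotient = (x * reciprocal) >> 32  # top 32 bits of the 64-bit product
    # Truncating the reciprocal makes the estimate either exact or one short,
    # so a single compare-and-increment on the remainder repairs it.
    if x - quotient * y >= y:
        quotient += 1
    return quotient
```

For example, with y = 10 the reciprocal is 429496729, and udiv_by_constant(99, 10) first estimates 9 and needs no correction, while udiv_by_constant(10, 10) estimates 0 and is corrected to 1.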


Using 16-bit Data on the ARM

Note: This section will only be of interest to designers working with Architecture 3 ARM cores and devices (e.g. ARM6, ARM60, ARM610).


ARM processors designed using ARM Architecture 4 have instructions for loading and storing half word values. ARM processors designed using version 3 of the architecture, while lacking half word support, are still capable of handling 16-bit data efficiently, as this section will demonstrate.


This section covers several different approaches to 16-bit data manipulation on ARM processors which do not have half word support:


  • Converting the 16-bit data to 32-bit data, and from then on treating it as 32-bit data.
  • Converting 16-bit data into 32-bit data when loading and storing, but using 32-bit data within ARM's registers.
  • Loading 16-bit data into the top 16-bits of ARM registers, and processing it as 16-bit data (i.e. keeping the bottom 16-bits clear at all times).

Useful code fragments are given which can be used to help implement these different approaches efficiently.


16-bit values in 32-bit words

The fact that data is 16 bits in size does not mean that it cannot be treated as 32-bit data and manipulated using the ARM instruction set in the normal way.


Any unsigned 16-bit value can be held as a 32-bit value in which the top 16 bits are all zero.


Similarly any signed 16-bit value can be held as a 32-bit value with the top 16 bits sign extended (i.e. copied from the top bit of the 16-bit value).


The disadvantage of storing 16-bit data as 32-bit data for ARM-based systems is that it takes up twice as much space in memory or on disk. If the amount of memory taken up by the 16-bit data is small, simply treating it as 32-bit data is likely to be the easiest and most efficient technique.


When the space taken by 16-bit data in memory or on disk is not small, an alternative method can be used. The 16-bit data is loaded and converted to be 32-bit data for use within the ARM, and then when processed, can either be output as 32-bit or 16-bit data.


Detecting 16-bit data

An issue which may arise when 16-bit data is converted to 32-bit data for use in the ARM and then stored back out as 16-bit data is detecting whether the data is still 16-bit data, i.e. whether it has 'overflowed' into the top 16 bits of the ARM register.


Another approach which avoids having to use explicit code to check whether results have overflowed into the top 16-bits is to keep 16-bit data as 16-bit data all the time, by loading it into the top half of ARM registers, and ensuring that the bottom 16 bits are always 0.


Little-endian loading

The code fragments in this section that transfer a single 16-bit data item place it in the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. For example, the address 0x4321 has a byte offset of 1.


One data item - any alignment (byte offsets 0,1,2 or 3)


The following code fragment loads a 16-bit value into a register, whether the data is byte, half word or word-aligned in memory, by using the ARM's load byte instruction.


This code is also optimal for the common case where the 16-bit data is half word-aligned, i.e. at either byte offset 0 or 2 (but the same code is required to deal with both cases). Optimizations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

LDRB      R0, [R2, #0]   ; 16-bit value is loaded from the

LDRB      R1, [R2, #1]   ; address in R2, and put in R0

ORR        R0, R0, R1, LSL #8 ; R1 is required as a

; MOV     R0, R0, LSL #16     ; temporary register

; MOV     R0, R0, ASR #16

The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data-processing operation by specifying the second argument as R0, ASR #16 rather than just R0.
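
The byte-by-byte load can be modelled in Python as follows. This is an illustrative sketch: the function names are invented, memory is a Python bytes object, and the register is returned as a 32-bit pattern so that the effect of sign extension is visible.

```python
MASK = 0xFFFFFFFF


def asr(value: int, amount: int) -> int:
    """Arithmetic shift right of a 32-bit register pattern."""
    if value & 0x80000000:
        value -= 1 << 32
    return (value >> amount) & MASK


def load16_le(mem: bytes, addr: int, signed: bool = False) -> int:
    r0 = mem[addr]                 # LDRB R0, [R2, #0]  (low byte)
    r1 = mem[addr + 1]             # LDRB R1, [R2, #1]  (high byte)
    r0 = r0 | (r1 << 8)            # ORR  R0, R0, R1, LSL #8
    if signed:                     # the two optional MOVs
        r0 = (r0 << 16) & MASK     # MOV  R0, R0, LSL #16
        r0 = asr(r0, 16)           # MOV  R0, R0, ASR #16
    return r0
```

Because only byte loads are used, the same code works at every byte offset.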


One data item - byte offset 2


If the data is aligned on a half word boundary, but not a word boundary (i.e. the byte offset is 2), then the following code fragment can be used (which is clearly much more efficient than the general case given above):

LDR   R0, [R2, #-2]                  ; 16-bit data is loaded from

; address in R2 into R0

MOV          R0, R0, LSR #16  ; (R2 has byte offset 2)

The LSR should be replaced with ASR if the data is signed. Note that as in the previous example it may be possible to combine the MOV with another data processing operation.


One data item - byte offset 0


If the data is on a word boundary, the following code fragment will load a 16-bit value (again a significant improvement over the general case):

LDR            R0, [R2, #0]         ; 16-bit value is loaded from the

MOV         R0, R0, LSL #16 ; word-aligned address in R2

MOV         R0, R0, LSR #16 ; into R0

As before, LSR should be replaced with ASR if the data is signed. Also, it may be possible to combine the second MOV with another data processing operation.


This code can be further optimized if non-word-aligned word loads are permitted (i.e. alignment faults are not enabled). This makes use of the way the ARM rotates data into a register for non-word-aligned word loads:

LDR            R0, [R2, #2]                ; 16-bit value is loaded from the

MOV         R0, R0, LSR #16         ; word-aligned address in R2

; into R0.

Two data items - byte offset 0

Two 16-bit values stored in one word can be loaded more efficiently than two separate values.

The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

LDR     R0, [R2, #0]            ; 2 unsigned 16-bit values are
MOV     R1, R0, LSR #16         ; loaded from one word of memory
BIC     R0, R0, R1, LSL #16     ; [R2]. The 1st is put in R0, and
                                ; the 2nd in R1.

The version of this for signed data is:

LDR     R0, [R2, #0]            ; 2 signed 16-bit values are
MOV     R1, R0, ASR #16         ; loaded from one word of memory
MOV     R0, R0, LSL #16         ; [R2]. The 1st is put in R0, and
MOV     R0, R0, ASR #16         ; the 2nd in R1.

The address in R2 should be word-aligned (byte offset 0), in which case these code fragments load the data item in bytes 0-1 into R0, and the data item in bytes 2-3 into R1.
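
As a cross-check, the two unpacking sequences can be modelled in Python. This is a sketch with invented function names; the argument is the 32-bit word already loaded from [R2] on a little-endian system, and results are 32-bit register patterns.

```python
MASK = 0xFFFFFFFF


def asr(value: int, amount: int) -> int:
    """Arithmetic shift right of a 32-bit register pattern."""
    if value & 0x80000000:
        value -= 1 << 32
    return (value >> amount) & MASK


def unpack_unsigned(word: int):
    r1 = word >> 16                       # MOV R1, R0, LSR #16
    r0 = word & ~(r1 << 16) & MASK        # BIC R0, R0, R1, LSL #16
    return r0, r1


def unpack_signed(word: int):
    r1 = asr(word, 16)                    # MOV R1, R0, ASR #16
    r0 = asr((word << 16) & MASK, 16)     # MOV LSL #16, then MOV ASR #16
    return r0, r1
```

In the unsigned case the BIC clears exactly the bits that were copied into R1, which saves one instruction compared with the shift-left, shift-right pair.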


Little-endian storing

The code fragment in this section transfers a single 16-bit data item from the least-significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word of the store address. For example, the address 0x4321 has a byte offset of 1.


One data item - any alignment (byte offsets 0,1,2 or 3)


The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address, by using the ARM's byte-saving instructions:

STRB         R0, [R2, #0]                ; 16-bit value is stored to the

MOV         R0, R0, ROR #8          ; address in R2.

STRB         R0, [R2, #1]

; MOV       R0, R0, ROR #24


The second MOV instruction can be omitted if the data is no longer needed after being stored.

Unlike load operations, knowing the alignment of the destination address does not make optimizations possible.
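
The store sequence, including the optional restoring rotate, can be modelled in Python. This is a sketch only: the function names are invented and memory is a bytearray standing in for a little-endian memory system.

```python
MASK = 0xFFFFFFFF


def ror(value: int, amount: int) -> int:
    """Rotate a 32-bit register pattern right by amount bits (0 < amount < 32)."""
    return ((value >> amount) | (value << (32 - amount))) & MASK


def store16_le(mem: bytearray, addr: int, r0: int) -> int:
    mem[addr] = r0 & 0xFF          # STRB R0, [R2, #0]  (low byte)
    r0 = ror(r0, 8)                # MOV  R0, R0, ROR #8
    mem[addr + 1] = r0 & 0xFF      # STRB R0, [R2, #1]  (high byte)
    r0 = ror(r0, 24)               # optional: 8 + 24 = 32 restores R0
    return r0
```

The two rotates add up to a full 32-bit rotation, which is why the second one returns the register to its original value.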

Two data items - byte offset 0

Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

ORR                    R3, R0, R1, LSL #16 ; Two unsigned 16-bit values

STR            R3, [R2, #0]                ; in R0 and R1 are packed into

; the word addressed by R2

; R3 is a temporary register

If the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register (one of R0 or R1 can be used instead).

The version for signed data is:

MOV         R3, R0, LSL #16          ; Two signed 16-bit values

MOV         R3, R3, LSR #16         ; in R0 and R1 are packed into

ORR                    R3, R3, R1, LSL #16  ; the word addressed by R2

STR            R3, [R2, #0]                ; R3 is a temporary register

Again, if the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register (R0 can be used instead).
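
The signed packing sequence can be checked with a short Python model. It is a sketch with an invented function name; the inputs are 32-bit register patterns holding sign-extended 16-bit values.

```python
MASK = 0xFFFFFFFF


def pack_signed(r0: int, r1: int) -> int:
    r3 = (r0 << 16) & MASK             # MOV R3, R0, LSL #16  (drop sign-extension bits)
    r3 = r3 >> 16                      # MOV R3, R3, LSR #16  (back to the bottom half)
    r3 = (r3 | (r1 << 16)) & MASK      # ORR R3, R3, R1, LSL #16
    return r3                          # value written by STR R3, [R2, #0]
```

The first two shifts are what distinguish this from the unsigned version: without them, the sign-extension bits of a negative R0 would be ORed over the value from R1.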

Big-endian loading

Code fragments in this section transfer a single 16-bit data item to the least-significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. For example, the address 0x4321 has a byte offset of 1.

One data item - any alignment (byte offsets 0,1,2 or 3)

The following code fragment loads a 16-bit value into a register using the load byte instruction (LDRB). The data may be byte, half word or word-aligned.

This code is also optimal for the common case where the 16-bit data is half word-aligned; i.e. at either byte offset 0 or 2 (but the same code is required to deal with both cases).

Optimizations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

LDRB         R0, [R2, #0]                ; 16-bit value is loaded from the

LDRB         R1, [R2, #1]                ; address in R2, and put in R0

ORR                    R0, R1, R0, LSL #8     ; R1 is a temporary register

; MOV R0, R0, LSL #16

; MOV R0, R0, ASR #16

The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data-processing operation by specifying the second argument as R0, ASR #16 rather than simply R0.

One data item - byte offset 0

If the data is aligned on a word boundary, the following code fragment can be used (which is clearly much more efficient than the general case given above):

LDR            R0, [R2, #0]                ; 16-bit value is loaded from the

MOV         R0, R0, LSR #16         ; word-aligned address in R2 into R0

The LSR should be replaced with ASR if the data is signed. Note that as in the previous example it may be possible to combine the MOV with another data-processing operation.

One data item - byte offset 2

If the data is aligned on a half word boundary, but not a word boundary (i.e. the byte offset is 2) the following code fragment can be used (again a significant improvement over the general case):

LDR            R0, [R2, #-2]               ; 16-bit value is loaded from the

MOV         R0, R0, LSL #16          ; address in R2 into R0. R2 is

MOV         R0, R0, LSR #16         ; aligned to byte offset 2

As before, LSR should be replaced with ASR if the data is signed. Also, it may be possible to combine the second MOV with another data-processing operation.

This code can be further optimized if non-word-aligned word-loads are permitted (i.e. alignment faults are not enabled). This makes use of the way in which the ARM rotates data into a register for non-word-aligned word loads:

LDR            R0, [R2, #0]                ; 16-bit value is loaded from the

MOV         R0, R0, LSR #16         ; address in R2 into R0. R2 is

; aligned to byte offset 2

Two data items - byte offset 0

Two 16-bit values stored in one word can be loaded more efficiently than two separate values.

The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

LDR            R0, [R2, #0]         ; 2 unsigned 16-bit values are

MOV         R1, R0, LSR #16 ; loaded from one word of memory.

BIC             R0, R0, R1, LSL #16 ; The 1st in R0, the 2nd in R1.

The version of this for signed data is:

LDR            R0, [R2, #0]          ; 2 signed 16-bit values are

MOV         R1, R0, ASR #16 ; loaded from one word of memory.

MOV         R0, R0, LSL #16  ; The 1st in R0, the 2nd in R1.

MOV         R0, R0, ASR #16

Big-endian storing

The code fragment in this section transfers a single 16-bit data item from the least-significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word of the store address. For example, the address 0x4321 has a byte offset of 1.

One data item - any alignment (byte offsets 0,1,2 or 3)

The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address:

STRB         R0, [R2, #1]                ; 16-bit value is stored to the

MOV         R0, R0, ROR #8          ; address in R2.

STRB         R0, [R2, #0]

; MOV       R0, R0, ROR #24

The second MOV instruction can be omitted if the data is no longer needed after being stored.

Unlike load operations, knowing the alignment of the destination address does not make optimizations possible.

Two data items - byte offset 0

Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

ORR     R3, R0, R1, LSL #16     ; Two unsigned 16-bit values in
STR     R3, [R2, #0]            ; R0 and R1 are packed into the
                                ; word addressed by R2
                                ; R3 is used as a temporary register

If the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register (one of R0 or R1 can be used instead).

The version for signed data is:

MOV         R3, R0, LSL #16          ; Two signed 16-bit values in

MOV         R3, R3, LSR #16         ; R0 and R1 are packed into the

ORR                    R3, R3, R1, LSL #16 ; word addressed by R2.

STR            R3, [R2, #0]                ; R3 is a temporary register

If the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register.

Detecting overflow into the top 16 bits

If 16-bit data is converted to 32-bit data for use in the ARM, it may sometimes be necessary to check explicitly whether the result of a calculation has 'overflowed' into the top 16 bits of an ARM register.

This is likely to be necessary because the ARM does not set its processor status flags when this happens.

The following instruction sets the Z flag if the value in R0 is a 16-bit unsigned value. R1 is used as a temporary register:

MOVS R1, R0, LSR #16

The following instructions set the Z flag if the value in R0 is a valid 16-bit signed value (i.e. bit 15 is the same as the sign-extended bits above it). R1 is used as a temporary register:

MOVS R1, R0, ASR #15

CMNNE R1, #1
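
The two range checks can be expressed in Python to make the flag logic easier to follow. This is a sketch with invented function names; each function returns True where the ARM sequence would leave the Z flag set.

```python
MASK = 0xFFFFFFFF


def asr(value: int, amount: int) -> int:
    """Arithmetic shift right of a 32-bit register pattern."""
    if value & 0x80000000:
        value -= 1 << 32
    return (value >> amount) & MASK


def is_unsigned16(r0: int) -> bool:
    return (r0 >> 16) == 0            # MOVS R1, R0, LSR #16 sets Z if the top half is clear


def is_signed16(r0: int) -> bool:
    r1 = asr(r0, 15)                  # MOVS R1, R0, ASR #15
    if r1 == 0:                       # Z already set: non-negative value that fits
        return True
    return ((r1 + 1) & MASK) == 0     # CMNNE: Z set only if R1 is -1 (negative value that fits)
```

After the arithmetic shift by 15, a value that fits in 16 signed bits leaves either all zeros or all ones, so only those two cases need to be recognised.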

Using ARM registers as 16-bit registers

The final method of handling 16-bit data is to load it into the top 16 bits of the ARM registers, effectively making them 16-bit registers. This approach has several advantages:

  • Some 16-bit data load instruction sequences are shorter. The loading and storing sequences shown above will have to be modified, and in some cases shorter instruction sequences will be possible. In particular, handling signed data will often be more efficient, as the top bit does not have to be copied into the top 16 bits of the register. However, note that the bottom 16 bits must be clear at all times.
  • The ARM processor status flags will be set if the 'S' bit of a data processing instruction is set and overflow or carry occurs out of the 16-bit value. Thus, explicit 'overflow' checking instructions are not needed.
  • Pairs of signed 16-bit integers can be saved more efficiently than in the previous approach, since the sign-extended bits do not have to be cleared out before the two values are combined.

There are also disadvantages:

  • Instructions such as add with carry cannot be used. For example, the instruction to increment R0 if Carry is set:

ADC R0, R0, #0

must be replaced by:

ADDCS R0, R0, #&10000

Using this form of instruction reduces the chances of being able to combine several data-processing operations into one by making use of the barrel shifter.

  • Before two 16-bit values can be multiplied, they must be shifted into the bottom half of the register.
  • Before combining a 16-bit value with a 32-bit value, the 16-bit value must be shifted into the bottom half of the register.

Note, however, that this may cost nothing if the barrel shifter can be used in parallel.

Pseudo Random Number Generation

This section describes a 32-bit pseudo random number generator implemented efficiently in ARM Assembly Language.

It is often necessary to generate pseudo random numbers, and the most efficient algorithms are based on shift generators with exclusive-or feedback (rather like a cyclic redundancy check generator). Unfortunately, the sequence of a 32-bit generator needs more than one feedback tap to be maximal length (i.e. 2^32-1 cycles before repetition), so this example uses a 33 bit register with taps at bits 33 and 20.

The basic algorithm is:

• newbit:=bit33 EOR bit20

• shift left the 33 bit number

• put in newbit at the bottom

This operation is performed for all the new bits needed (i.e. 32 bits). The entire operation can be coded compactly by making maximal use of the ARM's barrel shifter:

; enter with seed in R0 (32 bits), R1 (1 bit in least significant bit)

; R2 is used as a temporary register.

; on exit the new seed is in R0 and R1 as before

; Note that a seed of 0 will always produce a new seed of 0.

; All other values produce a maximal length sequence.

;

TST R1, R1, LSR #1 ; top bit into Carry

MOVS R2, R0, RRX ; 33 bit rotate right

ADC R1, R1, R1 ; carry into lsb of R1

EOR R2, R2, R0, LSL #12 ; (involved!)

EOR R0, R2, R2, LSR #20 ; (similarly involved!)
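
To follow what the five instructions do, each one can be mirrored in Python. This is a sketch under stated assumptions: the function name is invented, registers are masked to 32 bits, and the carry flag is modelled as a local variable. It is meant for tracing the data flow and makes no claim to reproduce the sample output shown later, since the seed used there is not given.

```python
MASK = 0xFFFFFFFF


def random_step(r0: int, r1: int):
    """One call of the generator: returns the new (R0, R1) pair."""
    carry = r1 & 1                               # TST  R1, R1, LSR #1   (top bit into Carry)
    r2 = ((carry << 31) | (r0 >> 1)) & MASK      # MOVS R2, R0, RRX      (33-bit rotate right)
    carry = r0 & 1                               #   ...RRX shifts bit 0 of R0 out into Carry
    r1 = (r1 + r1 + carry) & MASK                # ADC  R1, R1, R1       (carry into lsb of R1)
    r2 ^= (r0 << 12) & MASK                      # EOR  R2, R2, R0, LSL #12
    r0 = r2 ^ (r2 >> 20)                         # EOR  R0, R2, R2, LSR #20
    return r0, r1
```

Only the least significant bit of R1 is part of the seed; the higher bits accumulate earlier carries and are ignored by the TST on the next call. As the comments in the assembly source state, a seed of all zeros maps to itself.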

Using this example

This random number generation code is provided as random.s in directory examples/explasm. It is provided as ARM Assembly Language source which can be assembled and then linked with C modules.

The C test program randtest.c (also in directory examples/explasm) can be used to demonstrate this.

In the following commands:

-li indicates that the target ARM is little-endian

-apcs 3/32bit specifies that the 32-bit variant of the ARM Procedure Call Standard should be used.

These options can be omitted if the tools have already been configured appropriately.

First copy random.s and randtest.c from directory examples/explasm to your current directory, and enter the following commands to build an executable suitable for armsd:

armasm random.s -o random.o -li

armcc -c randtest.c -li -apcs 3/32bit

armlink randtest.o random.o -o randtest libpath/armlib.32l

where libpath is the path to the toolkit’s lib directory on your system.

armsd can be used to run this program as follows:

> armsd -li randtest

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]

ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file randtest

armsd: go

randomnumber() returned 0b3a9965

randomnumber() returned ac0b1672

randomnumber() returned 6762ad4f

 

randomnumber() returned 1965a731

randomnumber() returned d6c1cef4

randomnumber() returned f78fa802

randomnumber() returned 8147fc15

randomnumber() returned 3f62adfc

randomnumber() returned b56e9da8

randomnumber() returned b36dc5e2

Program terminated normally at PC = 0x000082c8

0x000082c8: 0xef000011 .... : > swi 0x11

armsd: quit

Quitting

>

This section describes a code sequence which loads a word from memory at any byte alignment. Although loading 32-bit data from non-word-aligned addresses should be avoided whenever possible, it is sometimes necessary.

Aligned and misaligned data

The ARM Load and Store (single and multiple) instructions are designed to load word-aligned data. Unless there is a very good reason for doing so, it is best not to place data at non-word-aligned addresses, as neither the Load nor the Store instruction can access such data unaided.

To deal with misaligned word fetches, two words must be read and the required data extracted from these two words. The code below performs this operation for a little-endian ARM, making good use of the barrel shifter:

; enter with address in R0

; R2 and R3 are used as temporary registers

; the word is loaded into R1

;

BIC     R2, R0, #3          ; Get word-aligned address

LDMIA   R2, {R1, R3}        ; Get 64 bits containing data

AND     R2, R0, #3          ; Get offset in bytes

MOVS    R2, R2, LSL #3      ; Get offset in bits

MOVNE   R1, R1, LSR R2      ; Extract data from bottom 32 bits

RSBNE   R2, R2, #32         ; Get 32 - offset in bits

ORRNE   R1, R1, R3, LSL R2  ; Extract data from top 32 bits

                            ; and combine with the other data

This code can easily be modified for use on a big-endian ARM; the LSR R2 and LSL R2 must be swapped over.
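The same extraction can be modelled in C, which may help when checking the shift amounts. This is a sketch only: load_word_unaligned, mem and addr are hypothetical names, and memory is represented as an array of little-endian words instead of real byte addresses.

```c
#include <stdint.h>

/* Hypothetical model: 'mem' is word-aligned memory viewed as little-endian
 * words, 'addr' is a byte offset from the start of 'mem'.
 * Like the LDMIA above, this always reads two words, so mem[addr/4 + 1]
 * must exist even when 'addr' is already word-aligned. */
uint32_t load_word_unaligned(const uint32_t *mem, uint32_t addr)
{
    const uint32_t *p = mem + (addr >> 2);  /* BIC   : word-aligned address    */
    uint32_t lo = p[0];                     /* LDMIA : 64 bits containing data */
    uint32_t hi = p[1];
    uint32_t shift = (addr & 3u) << 3;      /* AND, MOVS : offset in bits      */

    if (shift != 0) {                       /* the NE condition in the listing */
        lo = (lo >> shift) | (hi << (32u - shift));
    }
    return lo;
}
```

The explicit test for a zero shift plays the role of the NE condition codes. It also avoids a 32-bit shift by 32, which is undefined in C.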

Byte Order Reversal

Changing the endianness of a word can be a common operation in certain applications, for example when communicating word-sized data as a stream of bytes to a receiver of the opposite endianness.

This operation can be performed efficiently on the ARM, using just four instructions. The word to be reversed is held in a1 both on entry to and on exit from this instruction sequence. ip is used as a temporary register:

EOR           ip, a1, a1, ror #16

BIC             ip, ip, #&ff0000

MOV         a1, a1, ror #8

EOR           a1, a1, ip, lsr #8
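The four instructions can be followed step by step in C. The sketch below is a host-side model only; byte_reverse and ror32 are illustrative names, not part of bytedemo.c.

```c
#include <stdint.h>

/* Rotate right by n bits, for n in the range 1 to 31. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Hypothetical C model of the four-instruction byte reversal above. */
uint32_t byte_reverse(uint32_t a1)
{
    uint32_t ip;

    ip  = a1 ^ ror32(a1, 16);   /* EOR ip, a1, a1, ror #16 */
    ip &= ~0x00ff0000u;         /* BIC ip, ip, #&ff0000    */
    a1  = ror32(a1, 8);         /* MOV a1, a1, ror #8      */
    a1 ^= ip >> 8;              /* EOR a1, a1, ip, lsr #8  */
    return a1;
}
```

For example, byte_reverse(0x11223344) yields 0x44332211. The rotate by 8 already puts two of the bytes in their final positions, and the final EOR swaps the remaining two.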

A demonstration program which should help explain how this works has been provided in source form in directory examples/explasm. To compile this program and run it under armsd, first copy bytedemo.c from directory examples/explasm to your current working directory, and then use the following commands:

>armcc bytedemo.c -o bytedemo -li -apcs 3/32bit

>armsd -li bytedemo

A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]

ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.

Object program file bytedemo

armsd: go

Note: This program uses ANSI control codes, so it should work on most terminal types under UNIX and also on the PC. It will not work on HP-UX if the terminal emulator used is an HPTERM; use an XTERM to run this program on HP-UX.

ARM Assembly Programming Performance Issues

This section outlines many performance-related issues of which the ARM Assembly Language Programmer should be aware. It also provides useful background for C programmers using armcc, as some of these issues can also apply to programming in C.

LDM / STM

Use LDM and STM instead of a sequence of LDR or STR instructions wherever possible.

This provides several benefits:

  • The code is smaller.
  • An instruction fetch cycle and a register copy back cycle are saved for each LDR or STR eliminated.
  • On an un-cached ARM processor (for LDM) or an un-buffered ARM processor (for STM), non-sequential memory cycles can be turned into faster sequential memory cycles.

Conditional execution

In many situations, branches around short pieces of code can be avoided by using conditionally executed instructions. This reduces the size of code and may avoid a pipeline break.

Using the barrel shifter

Combining shift operations with other operations can significantly increase the code density (and thus performance) of much ARM code.

Addressing modes

The ARM instruction set provides a useful selection of addressing modes, which can often be used to improve the performance of code. For example, an LDR or STR with a non-zero offset that is either pre-indexed with writeback or post-indexed both performs the data transfer and updates the base register in a single instruction.

Multiplication

Be aware of the time taken by the ARM multiply and multiply accumulate instructions. When multiplying by a constant value note that using the multiply instruction is often not the optimal solution.
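As an illustration of the alternative, many constants can be reached with one or two shift-and-add or shift-and-subtract steps, each of which maps onto a single data-processing instruction using the barrel shifter. The C functions below are a sketch with hypothetical names (times10, times7); the instruction shown in each comment is one possible encoding, not the only one.

```c
#include <stdint.h>

/* x * 10 computed as (x + x*4) * 2.
 * One possible ARM form: ADD Rd, Rx, Rx, LSL #2  then  MOV Rd, Rd, LSL #1 */
uint32_t times10(uint32_t x)
{
    return (x + (x << 2)) << 1;
}

/* x * 7 computed as x*8 - x.
 * One possible ARM form: RSB Rd, Rx, Rx, LSL #3 */
uint32_t times7(uint32_t x)
{
    return (x << 3) - x;
}
```

Both functions wrap modulo 2^32 in the same way as a 32-bit multiply would.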

Optimizing register usage

Examine your code and see if registers can be reused for another value during parts of a long calculation which uses many registers. Doing this will reduce the amount of 'register spillage'.

Because much can be achieved in a single data-processing instruction, keeping a calculated result in a register for use a considerable time later may be less efficient than recalculating it when it is next needed. This is because it may allow the freed register to be used for another purpose in the meantime, thus reducing the amount of register spillage.

Loop unrolling

Loop unrolling can be a useful technique, but detailed analysis is often necessary before using it. In some situations it can reduce performance.

Loop unrolling involves using more than one copy of the inner loop of an algorithm. The following benefits may be gained by loop unrolling:

  • The branch back to the beginning of the loop is executed less frequently.
  • It may be possible to combine some of one iteration with some of the next iteration, and thereby significantly reduce the cost of each iteration.

A common case of this is combining LDR or STR instructions from two or more iterations into single LDM or STM instructions. This reduces code size, the number of instruction fetches, and in the case of LDM, the number of register write back cycles.

Consider calculating the following over an array: x[i] = y[i] - y[i+1]. Below is a code fragment which performs this:

LDR R2, [R0] ; Preload y[i]

Loop

LDR R3, [R0, #4]! ; Load y[i+1]

SUB R2, R2, R3 ; x[i] = y[i] - y[i+1]

STR R2, [R1], #4 ; Store x[i]

MOV R2, R3 ; y[i+1] is the next y[i]

CMP R0, R4 ; Finished ?

BLT Loop

Key to cycle annotations: IF = Instruction Fetch, WB = Register Write Back, R = Read, W = Write.
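The structure of an unrolled version can be sketched in C. The functions below are illustrative only (diff_simple and diff_unrolled are hypothetical names, and the unrolling factor of two is an assumption). In assembly language the two loads and two stores in the unrolled body are the candidates for a single LDM and a single STM.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: x[i] = y[i] - y[i+1] for i = 0 .. n-1.
 * y must hold n + 1 elements. */
void diff_simple(int32_t *x, const int32_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] - y[i + 1];
}

/* Same result, unrolled by two: one loop branch per two elements. */
void diff_unrolled(int32_t *x, const int32_t *y, size_t n)
{
    size_t i = 0;
    int32_t cur;

    if (n == 0)
        return;
    cur = y[0];                      /* preload y[i] */
    for (; i + 1 < n; i += 2) {
        int32_t y1 = y[i + 1];       /* two loads: candidates for one LDM  */
        int32_t y2 = y[i + 2];
        x[i]     = cur - y1;         /* two stores: candidates for one STM */
        x[i + 1] = y1 - y2;
        cur = y2;                    /* y[i+2] is the next y[i] */
    }
    if (i < n)                       /* odd n: one element left over */
        x[i] = cur - y[i + 1];
}
```

The extra handling for an odd element count is part of the cost of unrolling, which is one reason the technique needs analysis before it is applied.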

The floating-point emulator

Note: This advice is not applicable to systems which use the ARM FPA co-processor or to code using the software floating-point library.

If the software-only floating-point emulator is being used, floating-point instructions should be placed sequentially. The emulator detects when the next instruction is also a floating-point instruction and emulates it without leaving the undefined instruction code.

Stalling the write buffer

On ARM processors with a write buffer, performance can be maximized by writing code which avoids stalls caused by a full write buffer. For a write buffer with 2 tags and 8 words, such as that of the ARM610, no three STR or STM instructions should be close together. Similarly, no two STR or STM instructions which together store more than 8 words should be close together, as the second will be stalled until there is space in the write buffer.

16-bit data

Note: The following applies to ARM Architecture 3 only.

If possible treat 16-bit data as 32-bit data. However, if this cannot be done, be aware that you can make use of the barrel shifter and non-word-aligned LDRs in order to make working with 16-bit data more efficient.

8-bit data

When processing a sequence of Byte-sized objects (e.g. strings), the number of loads and stores can be reduced if the data is loaded a word at a time and then processed a byte at a time by extracting the bytes using the barrel shifter.
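The idea can be sketched in C as follows. The function count_byte is a hypothetical example which counts occurrences of one byte value in data that has been packed four bytes to a word; each word is loaded once and the individual bytes are then obtained by shifting and masking.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many bytes in 'words' equal 'c', loading a word at a time. */
size_t count_byte(const uint32_t *words, size_t nwords, uint8_t c)
{
    size_t count = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = words[i];           /* one load fetches four bytes */
        for (int b = 0; b < 4; b++) {
            if ((w & 0xffu) == c)        /* examine the low byte */
                count++;
            w >>= 8;                     /* bring the next byte down */
        }
    }
    return count;
}
```

This does one load per four bytes instead of one load per byte; the shifts correspond to barrel shifter operations that can be folded into other instructions.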

Minimizing non-sequential cycles

Note: This technique is only appropriate to un-cached ARM processors, and is intended for memory systems in which non-sequential memory accesses take longer than sequential memory accesses.

Consider a system where the length of memory bursts is B. That is, if executing a long sequence of data operations, the memory accesses which result are: one non-sequential memory cycle followed by B – 1 sequential memory cycles. An example of this is DRAM controlled by the ARM memory manager MEMC1a.

This sequence of memory accesses will be broken up by several ARM instruction types:

  • Load or Store (single or multiple)
  • Data Swap, Branch instructions
  • SWIs
  • Other instructions which modify the PC

By placing these instructions carefully, so that they break up the normal sequence of memory cycles only where a non-sequential cycle was about to occur anyway, the number of sequential cycles which are turned into longer non-sequential cycles can be minimized.

For a memory system which has memory bursts of length B, the optimal position for instructions which break up the memory cycle sequence is 3 words before the next B-word boundary.
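As a worked form of this rule, the sketch below computes the address of that position for the burst containing a given instruction address. It assumes 32-bit (4-byte) instructions and bursts that start at multiples of B words; burst_break_slot and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of the word that lies 3 words before the end of the
 * B-word burst containing 'addr'. Assumes bursts begin at multiples of
 * burst_words * 4 bytes, and burst_words >= 3. */
uint32_t burst_break_slot(uint32_t addr, uint32_t burst_words)
{
    uint32_t burst_bytes = burst_words * 4u;
    uint32_t burst_start;

    assert(burst_words >= 3u);
    burst_start = addr - (addr % burst_bytes);   /* start of this burst */
    return burst_start + burst_bytes - 3u * 4u;  /* 3 words before the boundary */
}
```

With B = 4 and code starting at 0x8000, for instance, the result is 0x8004. If the returned address is lower than addr, the position in the current burst has already been passed and the corresponding position in the next burst applies.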