ASSEMBLY LANGUAGE
An assembly language, abbreviated asm, is a low-level programming language for a computer, or other
programmable device, in which there is a very strong correspondence between the
language and the architecture's machine code instructions.
Why Assembly for DSP?
In real-time applications, the execution time of a program plays an important
role in determining the stability and failure rate of the system. DSP algorithms
such as audio processing are performance critical because they involve a large
number of floating-point operations. To achieve the required performance, we
optimize the DSP algorithms in ARM-NEON SIMD assembly language: the C code of
the algorithm is converted to assembly language using optimization methods such
as SIMD vectorization and loop unrolling.
In the following sections I describe the main instructions of ARM and NEON
assembly.
Instruction Set for ARM Assembly Language
Properties of the ARM instruction set:
1. All instructions are 32 bits long.
2. Most instructions execute in a single cycle.
3. Most instructions can be conditionally executed.
4. Load/store architecture: data processing instructions act only on registers.
5. Three-operand format.
6. Combined ALU and shifter for high-speed bit manipulation.
7. Specific memory access instructions with powerful auto-indexing addressing modes.
8. Flexible multiple-register load and store instructions.
I am only listing the ARM and NEON instruction sets here. For detailed
information, please refer to the ARM technical reference manual.
Basic Operational Instructions:
Arithmetic Operations
· ADD, ADC, SUB, SBC, RSB, RSC.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>, <Rm>
Rd: Destination Register, Rn & Rm: Source Registers, c: Condition code, mnemonic: Operation.
If S is present, the instruction updates the flags.
Logical Operations
· AND, ORR, EOR, BIC, ORN.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>, <Rm>
Rd: Destination Register, Rn & Rm: Source Registers, c: Condition code, mnemonic: Operation.
If S is present, the instruction updates the flags.
Multiplication and Division Operations
· MUL, MLA, UMULL, SMULL, UMLAL, SMLAL; SDIV and UDIV on cores that implement integer divide.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>, <Rm>
Rd: Destination Register, Rn & Rm: Source Registers, c: Condition code, mnemonic: Operation.
If S is present, the instruction updates the flags.
Data Movement and Stack Operations
· MOV, MVN, PUSH, POP, MSR, MRS.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>
Rd: Destination Register, Rn: Source Register, c: Condition code, mnemonic: Operation.
If S is present, the instruction updates the flags.
Compare, Conditional and Convert Instructions
· TST, TEQ, CMP, CMN.
Syntax: <mnemonic><c><q> <Rn>, <Rm>
Rn & Rm: Source Registers, c: Condition code, mnemonic: Operation.
These instructions always update the flags and do not write a destination register.
Load and Store Instructions
· LDR, STR, LDM, STM, LDMIA, LDMIB, STMIA, STMIB.
Syntax: <mnemonic>{<cond>}{<size>} Rd, <address>
Rd: Destination Register (the source register for a store), address: Data address,
cond: Condition code, mnemonic: Operation, size: Size of data.
Other Instructions
· SWI, REV, ROR, LSL, LSR, ASR, RRX, B, BL, BX, BLX.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>
Rd: Destination Register, Rn: Source Register, c: Condition code, mnemonic: Operation.
If S is present, the instruction updates the flags.
Single Register Data Transfer:
· The basic load and store instructions are Load and Store Word or Byte: LDR / STR / LDRB / STRB.
· ARMv4 adds support for halfwords and signed data: LDRH / STRH.
· Load Signed Byte or Halfword loads the value and sign-extends it to 32 bits: LDRSB / LDRSH.
· LDR / STR can be executed conditionally by adding the appropriate condition code, e.g. LDREQB.
· Syntax: <LDR|STR>{<cond>}{<size>} Rd, <address>
Block Register Data Transfer
· Load and Store Multiple instructions (LDM / STM) allow multiple registers to be transferred to or from memory.
· The transferred registers can be any subset of the current bank of registers (default).
· The whole register bank or a subset can be copied with a single instruction.
· Appending '!' updates the base register.
· Operates in little endian.
· Very efficient for saving and restoring context and for moving large blocks of data.
Instruction Set for NEON Assembly
Arithmetic Operations:
· VABA, VABD, VABS, VADD, VADDHN, VHADD, VADDL, VADDW, VSUB, VSUBHN, VHSUB, VSUBL, VSUBW, VPADD, VPADDL.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: Destination Register, Qn & Qm: Source Registers, c: Condition code, mnemonic: Operation,
dt: Datatype.
Logical Operations
· VAND, VORR, VORN, VEOR, VNEG, VBIC, VBIF, VBIT, VBSL.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: Destination Register, Qn & Qm: Source Registers, c: Condition code, mnemonic: Operation,
dt: Datatype.
Multiplication and Division Operations
· VMUL, VMULL, VMLA, VMLAL, VMLS, VMLSL, VDIV.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: Destination Register, Qn & Qm: Source Registers, c: Condition code, mnemonic: Operation,
dt: Datatype.
Data Movement and Stack Operations
· VMOV, VMOVL, VMOVN, VMVN, VMSR, VMRS.
· VPUSH, VPOP.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: Destination Register, Qn & Qm: Source Registers, c: Condition code, mnemonic: Operation,
dt: Datatype.
Compare, Conditional and Convert Instructions
· VMAX, VMIN, VCMP, VCGE, VCGT, VCLT, VCLE, VCLZ, VCLS, VCVT.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: Destination Register, Qn & Qm: Source Registers, c: Condition code, mnemonic: Operation,
dt: Datatype.
Load and Store Instructions
· VLDM, VLDR, VLD1, VLD2, VLD3, VLD4, VSTM, VSTR, VST1, VST2, VST3, VST4.
Syntax (form used by VLDn / VSTn): <mnemonic><c><q>.<dt> <list>, [<Rn>]{!}
list: List of NEON registers to load or store, Rn: ARM register holding the base address,
c: Condition code, mnemonic: Operation, dt: Datatype. Appending '!' updates the base register.
Other Instructions
· VTRN, VREV, VSHL, VSHR, VSLI, VSRI, VSRA, VTBL, VTBX, VZIP, VUZP, VSWP, VEXT.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: Destination Register, Qn & Qm: Source Registers, c: Condition code, mnemonic: Operation,
dt: Datatype.
For NEON, depending on the requirement, we can also use the Dn (64-bit) and
Sn (32-bit) registers instead of the Qn (128-bit) registers.
OPTIMIZATION TECHNIQUES
· Loop unrolling
· Avoiding unnecessary loads/stores
· Moving out loop invariants
· Conditional execution
· Using shifts to multiply/divide
· Multiple-load/store instructions
· Auto-vectorization using NEON
· Instruction scheduling
· Cross jump elimination
· Inlining
· Constant folding
· Avoiding unnecessary loop iterations
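Two of the techniques above, loop unrolling and using shifts to multiply or divide, can be sketched in C before moving to assembly. This is only a minimal illustration: the function names (sum_simple, sum_unrolled4, times8, div4) are made up for this example, and a real DSP kernel would unroll by whatever factor suits the target pipeline.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: one element, one counter update and one branch per iteration. */
int32_t sum_simple(const int32_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Unrolled by 4: one counter update and one branch per four elements.
 * Four independent accumulators avoid a single long dependency chain. */
int32_t sum_unrolled4(const int32_t *x, size_t n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        a0 += x[i];

    return a0 + a1 + a2 + a3;
}

/* Shift instead of multiply/divide by a power of two (unsigned values). */
uint32_t times8(uint32_t v) { return v << 3; }
uint32_t div4(uint32_t v)   { return v >> 2; }
```

Both sum functions return the same result for the same input; the unrolled version only changes how much loop overhead is paid per element.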
Supported Data Types:
In ARM and NEON, for example:
· 4x 32-bit integer add
· 4x 32-bit integer add with signed saturation
· 4x 32-bit integer add with unsigned saturation
· Element sizes of 8, 16, 32 and 64 bits are available (not 128)
· Also F32 and F64 for floating point
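As a rough sketch of what "4x 32-bit integer add with signed saturation" means, the C code below adds four lanes and clamps each result to the int32 range instead of letting it wrap. The names sat_add_s32 and qadd4_s32 are made up for this example. When the compiler defines __ARM_NEON, the sketch uses the vld1q_s32, vqaddq_s32 and vst1q_s32 intrinsics from arm_neon.h; otherwise it falls back to a plain scalar loop with the same behaviour.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Signed saturating add of one 32-bit lane: clamp instead of wrapping. */
int32_t sat_add_s32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Four lanes at once: out[i] = saturate(a[i] + b[i]) for i = 0..3. */
void qadd4_s32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
#if defined(__ARM_NEON)
    int32x4_t va = vld1q_s32(a);          /* load 4 x int32 into a Q register */
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(out, vqaddq_s32(va, vb));   /* saturating add, then store       */
#else
    for (int i = 0; i < 4; i++)           /* portable fallback, same result   */
        out[i] = sat_add_s32(a[i], b[i]);
#endif
}
```

With a = {INT32_MAX, INT32_MIN, 1, -5} and b = {1, -1, 2, 5}, the first two lanes saturate at INT32_MAX and INT32_MIN, and the other two give 3 and 0.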
FIXED-POINT AND FLOATING-POINT NUMBERS
Fixed-Point Numbers:
· Real data type with a fixed number of digits after the binary point.
· Can provide improved performance or accuracy for the application.
· A fixed-point value is an integer scaled by an implicit, fixed factor.
· Fixed-point representation:
· 1 sign bit.
· N1 (integer word length, iwl): bits before the binary point.
· N2 (fractional word length, fwl): bits after the binary point.
· In a 16-bit DSP:
· 1 sign bit, N1 = 0, N2 = 15.
- For a particular Q-format (here iwl is counted including the sign bit):
· q, the step between adjacent values = 2^-(fwl).
· [-L:L], the range of values that can be represented = [-2^(iwl-1) : 2^(iwl-1) - q].
- For the Q15.1 format (iwl = 15, fwl = 1):
· q = 2^-1.
· [-L:L] = [-2^14 : 2^14 - q].
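The step and range formulas above can be checked with a few lines of C. This is a minimal sketch under one assumption that the formulas imply but do not state: iwl is taken to include the sign bit, so that Q15.1 has iwl = 15 and fwl = 1. The type qrange and the function q_format_range are names made up for this example.

```c
/* Step size and representable range of a Q-format.
 * Assumption: iwl counts the sign bit, as in the Q15.1 example,
 * and 1 <= iwl <= 62, 0 <= fwl <= 62 so the shifts stay in range. */
typedef struct {
    double step;   /* q    = 2^-(fwl)          */
    double min;    /* -L   = -2^(iwl-1)        */
    double max;    /* L    = 2^(iwl-1) - q     */
} qrange;

qrange q_format_range(int iwl, int fwl)
{
    qrange r;
    double half_span = (double)(1ULL << (iwl - 1));   /* 2^(iwl-1) */

    r.step = 1.0 / (double)(1ULL << fwl);             /* 2^-(fwl)  */
    r.min  = -half_span;
    r.max  = half_span - r.step;
    return r;
}
```

For Q15.1, q_format_range(15, 1) gives a step of 0.5 and a range of -16384 to 16383.5. For a 16-bit value with 15 fractional bits, q_format_range(1, 15) gives a range of -1 to 1 - 2^-15.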
Floating-Point Numbers:
· Real number with a fractional part.
· No fixed number of digits after the point.
· Can be slower and less accurate than fixed point, but handles a much larger range of numbers.
· Floating-point representation:
· N = (-1)^S × M × 2^E.
· S is the sign bit.
· M is the fractional mantissa part.
· E is the (biased) exponent.
- For 32-bit systems (IEEE 754 formats; Ne = exponent bits, Nm = mantissa bits):
· Single precision (32 bits): Ne = 8, Nm = 23
· Double precision (64 bits): Ne = 11, Nm = 52
- The range is:
· Single precision = ±1.18×10^-38 to ±3.4×10^38
· Double precision = ±2.23×10^-308 to ±1.80×10^308
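To make the sign, exponent and mantissa fields concrete, the sketch below splits a single-precision value into its three bit fields (1 sign bit, 8 exponent bits, 23 mantissa bits). It assumes that the C type float is an IEEE 754 32-bit value, which holds on ARM and most other targets but is not guaranteed by the C standard. The type f32_fields and the function f32_decode are names made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit fields of a single-precision number. */
typedef struct {
    uint32_t sign;      /* S: 1 bit                               */
    uint32_t exponent;  /* biased exponent: 8 bits (bias is 127)  */
    uint32_t mantissa;  /* fraction bits of M: 23 bits            */
} f32_fields;

/* Assumption: float is an IEEE 754 binary32 value. */
f32_fields f32_decode(float x)
{
    uint32_t bits;
    f32_fields f;

    memcpy(&bits, &x, sizeof bits);      /* reinterpret the bits safely */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For 1.0f the fields are sign 0, exponent 127 and mantissa 0. For -2.5f, which is -1.25 × 2^1, they are sign 1, exponent 128 and mantissa 0x200000 (the 0.25 fraction bit).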
FIXED TO FLOAT CONVERSION AND VICE VERSA
Float to Fixed Conversion
- Multiply the floating-point number by 2^n.
- Round the result to the nearest integer.
Fixed to Float Conversion
- Multiply the fixed-point value by 2^(-n).
Here n is the number of bits in the fractional part.
- Example (n = 2)
· 4.25, float to fixed conversion: 4.25 x 2^2 = 4.25 x 4 = 17.00
· Rounding to the nearest integer: 17.00 → 17
· 17, fixed to float conversion: 17 / 2^2 = 17 / 4 = 4.25
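The two conversions can be sketched in C as follows. This is a minimal illustration, not a production routine: the function names float_to_fixed and fixed_to_float are made up for this example, n is assumed to be between 0 and 30, and overflow of the 32-bit result is not checked.

```c
#include <stdint.h>

/* Float to fixed: multiply by 2^n, then round to the nearest integer.
 * Assumption: 0 <= n <= 30 and the scaled value fits in an int32_t. */
int32_t float_to_fixed(float x, int n)
{
    float scaled = x * (float)(1 << n);

    /* Round half away from zero; the cast itself truncates toward zero. */
    return (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Fixed to float: multiply by 2^(-n), i.e. divide by 2^n. */
float fixed_to_float(int32_t v, int n)
{
    return (float)v / (float)(1 << n);
}
```

With n = 2, float_to_fixed(4.25f, 2) returns 17 and fixed_to_float(17, 2) returns 4.25, matching the example above.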


