Monday, January 29, 2018

ARM ASSEMBLY INTRODUCTION AND OPTIMIZATION TECHNIQUES

ASSEMBLY LANGUAGE
An assembly language, abbreviated asm, is a low-level programming language for a computer, or other programmable device, in which there is a very strong correspondence between the language and the architecture's machine code instructions.
Why Assembly for DSP?             
In real-time applications, the execution time of a program plays an important role in determining the stability and failure rate of the system. DSP algorithms such as audio processing are performance critical because they involve a large number of floating-point operations. To achieve the required performance, we optimize these algorithms by converting the C code to ARM NEON SIMD assembly, using methods such as SIMD vectorization and loop unrolling.

The following sections list the main instructions related to ARM and NEON assembly.
Instruction Set for ARM Assembly Language
Properties of ARM Instruction set:
1.      All instructions are 32 bits long (in ARM state; Thumb instructions are 16 or 32 bits).
2.      Most instructions execute in a single cycle.
3.      Most instructions can be conditionally executed.
4.      Load/store architecture: data processing instructions act only on registers.
5.      Three operand format.
6.      Combined ALU and shifter for high speed bit manipulation.
7.      Specific memory access instructions with powerful autoindexing addressing modes.
8.      Flexible multiple register load and store instructions.
This is only a list of the ARM and NEON instructions. For detailed information, please refer to the ARM technical reference manual.
Basic Operational instructions:
Arithmetic Operations
·         ADD, ADC, SUB, SBC, RSB, RSC.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>, <Rm>
Rd: destination register; Rn and Rm: source registers; c: condition code; q: optional instruction width qualifier; mnemonic: operation. If S is present, the instruction updates the condition flags.
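To make the effect of the S suffix concrete, here is a minimal C sketch of what ADDS Rd, Rn, Rm computes. The type flags_t and the function adds32 are hypothetical names used only for illustration; they are not part of any ARM toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the four condition flags N, Z, C, V. */
typedef struct {
    unsigned n; /* negative: bit 31 of the result        */
    unsigned z; /* zero: result is 0                     */
    unsigned c; /* carry: unsigned overflow out of bit 31 */
    unsigned v; /* overflow: signed overflow             */
} flags_t;

/* Emulates ADDS Rd, Rn, Rm: returns Rd and stores the flags in *out. */
uint32_t adds32(uint32_t rn, uint32_t rm, flags_t *out)
{
    assert(out != NULL);
    uint32_t rd = rn + rm; /* wraps modulo 2^32, like the hardware */
    out->n = rd >> 31;
    out->z = (rd == 0);
    out->c = (rd < rn); /* the sum wrapped around */
    /* Signed overflow: operands share a sign that the result does not. */
    out->v = ((~(rn ^ rm) & (rn ^ rd)) >> 31) & 1u;
    return rd;
}
```

For example, adding 1 to 0x7FFFFFFF sets N and V, while adding 1 to 0xFFFFFFFF sets Z and C. A plain ADD would produce the same Rd but leave the flags untouched.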
Logical Operations
·         AND, ORR, EOR, BIC, ORN.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>, <Rm>
Rd: destination register; Rn and Rm: source registers; c: condition code; mnemonic: operation. If S is present, the instruction updates the condition flags.

Multiplication and division Operations.
·         MUL, MLA, UMULL/SMULL, UMLAL/SMLAL
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>, <Rm>
Rd: destination register; Rn and Rm: source registers; c: condition code; mnemonic: operation. If S is present, the instruction updates the condition flags.

Data Movement and Stack Operations.
·         MOV, MVN, PUSH, POP, MSR, MRS.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>
Rd: destination register; Rn: source register; c: condition code; mnemonic: operation. If S is present, the instruction updates the condition flags.
Compare, Conditional and Convert Instructions.
·         TST, TEQ, CMP, CMN.
Syntax: <mnemonic><c><q> <Rn>, <Rm>
Rn and Rm: registers being compared or tested; c: condition code; mnemonic: operation. These instructions write no destination register and always update the condition flags.
Load and Store Instructions.
·         LDR, STR, LDM, STM, LDMIA, LDMIB, STMIA, STMIB.
Syntax: <mnemonic>{<cond>}{<size>} Rd, <address>.
Rd: destination register; address: data address; cond: condition code; mnemonic: operation; size: size of the data.
Other Instructions.
·         SWI, REV, ROR, LSL, LSR, ASR, RRX, B, BL, BX, BLX.
Syntax: <mnemonic>{S}<c><q> {<Rd>,} <Rn>
Rd: destination register; Rn: source register; c: condition code; mnemonic: operation. If S is present, the instruction updates the condition flags.
Single Register data transfer:
·         The basic load and store instructions are: Load and Store Word or Byte LDR / STR / LDRB / STRB.
·         ARMv4 adds support for Half words and signed data: LDRH / STRH.
·         Load Signed Byte or Halfword Value and sign extend to 32 bits: LDRSB / LDRSH.
·         Any of these can be executed conditionally by adding the appropriate condition code to LDR / STR, e.g. LDREQB (written LDRBEQ in unified syntax).
·         Syntax: <LDR|STR>{<cond>}{<size>} Rd, <address>.
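The difference between the plain and the signed loads is only in how the narrow value is widened to 32 bits. The following C sketch shows that widening; the function names load_u8, load_s8 and sign_extend16 are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LDRB-style load: the byte is zero-extended to 32 bits. */
uint32_t load_u8(const uint8_t *addr)
{
    assert(addr != NULL);
    return (uint32_t)*addr;
}

/* LDRSB-style load: the byte is sign-extended to 32 bits. */
int32_t load_s8(const uint8_t *addr)
{
    assert(addr != NULL);
    int32_t v = *addr;
    return (v & 0x80) ? v - 0x100 : v; /* bit 7 set means negative */
}

/* LDRSH-style sign extension of an already loaded 16-bit value. */
int32_t sign_extend16(uint16_t half)
{
    int32_t v = half;
    return (v & 0x8000) ? v - 0x10000 : v; /* bit 15 set means negative */
}
```

With the byte 0xF0 in memory, the LDRB-style load yields 240 while the LDRSB-style load yields -16.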

Block Register data transfer.
·         Load and Store Multiple instructions (LDM / STM) allow transfer to or from memory.
·         Any subset of current bank of registers (default).
·         Whole Register bank or subset copied with single Instruction.
·         Appending '!' to the base register writes the updated address back to it.
·         Operated in Little Endian.
·         Very efficient for saving and restoring context and for moving large blocks of data.

Instruction Set for NEON Assembly
Arithmetic Operations:
·         VABA, VABD, VABS, VADD, VADDHN, VHADD, VADDL, VADDW, VSUB, VSUBHN, VHSUB, VSUBL, VSUBW, VPADD, VPADDL.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
Logical Operations
·         VAND, VORR, VORN, VEOR, VNEG, VBIC, VBIF, VBIT, VBSL.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
Multiplication and division Operations.
·         VMUL, VMULL, VMLA, VMLAL, VMLS, VMLSL, VDIV (VDIV is a VFP instruction; NEON has no vector divide).
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
Data Movement and Stack Operations.
·         VMOV, VMOVL, VMOVN, VMVN, VMSR, VMRS.
·         VPUSH, VPOP
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
Compare, Conditional and Convert Instructions.                                                                              
·         VMAX, VMIN, VCMP, VCGE, VCGT, VCLT, VCLE, VCLZ, VCLS, VCVT.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
Load and Store Instructions.
·         VLDM, VLDR, VLD1, VLD2, VLD3, VLD4, VSTM, VSTR, VST1, VST2, VST3, VST4.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
Other Instructions.
·         VTRN, VREV, VSHL, VSHR, VSLI, VSRI, VSRA, VTBL, VTBX, VZIP, VUZP, VSWP, VEXT.
Syntax: <mnemonic><c><q>.<dt> {<Qd>,} <Qn>, <Qm>
Qd: destination register; Qn and Qm: source registers; c: condition code; mnemonic: operation; dt: data type.
For NEON, Dn or Sn registers can be used in place of Qn registers, depending on the requirement.
OPTIMIZATION TECHNIQUES
·         Loop Unrolling
·         Avoiding Unnecessary loads/stores
·         Moving out Loop invariants
·         Conditional execution
·         Use Shift to multiply/divide
·         Multiple-load/store instructions
·         Auto Vectorization using NEON
·         Instruction Scheduling
·         Cross Jump Elimination
·         In-lining
·         Constant Folding
·         Avoiding Unnecessary Loop Iterations.
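As an illustration of the first technique in the list, here is a minimal C sketch of loop unrolling by a factor of four. The function name sum_unrolled4 is hypothetical. The four independent accumulators are an assumption of this sketch: they give the scheduler independent work and map naturally onto the four 32-bit lanes of a NEON Q register.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n floats with the loop body unrolled four times. */
float sum_unrolled4(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one loop test per four adds. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

Note that unrolling with separate accumulators changes the order of the floating-point additions, so the result can differ in the last bits from a simple left-to-right sum.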
Supported Data Types:
In ARM and NEON, a 128-bit register supports operations such as:
·         4x 32-bit integer add
·         4x 32-bit integer add with signed saturation
·         4x 32-bit integer add with unsigned saturation
·         Element sizes of 8, 16, 32 and 64 bits are available (not 128)
·         F32 and F64 are also available for floating point
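The saturating adds in the list above can be sketched in portable C as follows. This is only a model of the behaviour, not NEON code, and the function names sat_add_s32, sat_add_u32 and sat_add_s32x4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturating add: clamps to INT32_MIN..INT32_MAX instead of wrapping. */
int32_t sat_add_s32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b; /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Unsigned saturating add: clamps to UINT32_MAX instead of wrapping. */
uint32_t sat_add_u32(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return (sum < a) ? UINT32_MAX : sum; /* sum < a means it wrapped */
}

/* Four lanes at once, modelling a 4x 32-bit signed saturating add. */
void sat_add_s32x4(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t lane = 0; lane < 4; lane++) {
        out[lane] = sat_add_s32(a[lane], b[lane]);
    }
}
```

Saturation matters in audio code because a wrapped sample turns a loud peak into a value of the opposite sign, which is heard as a click, whereas a clamped sample is only clipped.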

FIXED-POINT AND FLOATING-POINT NUMBERS
Fixed Point Numbers:
·         Real data type with a fixed number of digits after the binary (radix) point.
·         Can provide improved performance or accuracy for the application.
·         A fixed-point data type is an integer scaled by an implicit, fixed factor.
·         Fixed-point representation:
·         1 sign bit.
·         N1 (integer word length, iwl): bits before the binary point.
·         N2 (fractional word length, fwl): bits after the binary point.
·         In a 16-bit DSP:
·         1 sign bit, N1 = 0, N2 = 15.
  • For a particular Q-format
·         q, the step size = 2^-(fwl).
·         [-L:L], the range of values that can be represented = [-2^(iwl-1) : 2^(iwl-1) - q], where iwl here includes the sign bit.
  • For Q15.1 format
·         q = 2^-1.
·         [-L:L] = [-2^14 : 2^14 - q].
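The step and range formulas can be checked with a few lines of C. This sketch assumes, as the Q15.1 example implies, that iwl counts the sign bit; the function names q_step, q_min and q_max are hypothetical.

```c
#include <assert.h>

/* Step size q = 2^-(fwl). */
double q_step(int fwl)
{
    assert(fwl >= 0 && fwl < 31);
    return 1.0 / (double)(1u << fwl);
}

/* Lower bound of the range: -2^(iwl-1), with the sign bit counted in iwl. */
double q_min(int iwl)
{
    assert(iwl >= 1 && iwl <= 31);
    return -(double)(1u << (iwl - 1));
}

/* Upper bound of the range: 2^(iwl-1) - q. */
double q_max(int iwl, int fwl)
{
    assert(iwl >= 1 && iwl <= 31);
    return (double)(1u << (iwl - 1)) - q_step(fwl);
}
```

For Q15.1 this gives a step of 0.5 and a range of -16384 to 16383.5. For the 16-bit DSP format with only a sign bit before the point and 15 fractional bits, q_min(1) and q_max(1, 15) give -1 and 1 - 2^-15.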

Floating Point Numbers:
·         Real numbers with a fractional part.
·         No fixed number of digits after the point.
·         Slower and less accurate, but can handle a large range of numbers.
·         Floating-point representation:
·         N = (-1)^S × M × 2^E.
·         S is the sign bit.
·         M is the fractional mantissa part.
·         E is the (biased) exponent.
  • Field widths (Ne exponent bits, Nm mantissa bits)
·         Single Precision - Ne = 8, Nm = 23
·         Double Precision - Ne = 11, Nm = 52
  • Range is
·         Single Precision = ±1.18×10^-38 to ±3.4×10^38
·         Double Precision = ±2.23×10^-308 to ±1.80×10^308
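To see the S, M and E fields in practice, the following C sketch splits a single-precision value into its three bit fields. It assumes that float is the 32-bit IEEE 754 single-precision format, which holds on ARM targets; the type f32_fields and the function f32_decode are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the raw fields of a single-precision float. */
typedef struct {
    uint32_t sign;     /* S: 1 bit                              */
    uint32_t exponent; /* E: 8 bits, stored with a bias of 127  */
    uint32_t fraction; /* M: 23 stored bits, the leading 1 is implicit */
} f32_fields;

f32_fields f32_decode(float x)
{
    assert(sizeof(float) == sizeof(uint32_t)); /* assumption: 32-bit float */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* copy the bit pattern without aliasing issues */

    f32_fields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For example, 4.25 is 1.0001 (binary) × 2^2, so the decoded exponent field is 2 + 127 = 129 and the fraction field has only the bit for 2^-4 set.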
FIXED TO FLOAT CONVERSION AND VICE VERSA
Float to Fixed Conversion
  • Multiply the floating-point number by 2^n.
  • Round the result to the nearest integer.
Fixed to Float Conversion
  • Multiply the fixed-point value by 2^(-n).
Here n is the number of bits in the fractional part.
  • Example (n = 2)
·         4.25 - Float to Fixed conversion : 4.25 x 2^2 = 4.25 x 4 = 17.0
                  ·         Rounding to integer = 17
·         17 - Fixed to Float conversion : 17 / 2^2 = 17 / 4 = 4.25
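The two conversions can be sketched in C as below. The function names float_to_fixed and fixed_to_float are hypothetical, and rounding halves away from zero is a choice made for this sketch; other rounding modes are equally valid.

```c
#include <assert.h>
#include <stdint.h>

/* Float to fixed: multiply by 2^n, then round to the nearest integer. */
int32_t float_to_fixed(double x, unsigned n)
{
    assert(n < 31);
    double scaled = x * (double)(1u << n);
    /* Round half away from zero (an assumption of this sketch). */
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Fixed to float: multiply by 2^(-n), i.e. divide by 2^n. */
double fixed_to_float(int32_t q, unsigned n)
{
    assert(n < 31);
    return (double)q / (double)(1u << n);
}
```

With n = 2, float_to_fixed(4.25, 2) gives 17 and fixed_to_float(17, 2) gives 4.25 back. A value such as 4.3 also maps to 17, so the round trip returns 4.25 and the difference of 0.05 is the quantization error.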

Friday, January 5, 2018

NEON ARCHITECTURE

NEON is an extension to the ARM architecture that supports parallel operations for large-scale data processing, such as audio, video and other multimedia workloads. Parallel operations were initially introduced under the name ARM SIMD.
Before moving on to NEON, here is a brief introduction to ARM SIMD. Modern software such as codecs and graphics accelerators operates on large amounts of data whose elements are smaller than a word; digital audio and video samples, for example, are typically 16 and 8 bits wide respectively. When these operations are performed on a 32-bit microprocessor, parts of the computation units go unused but continue to consume power. To make better use of the hardware, SIMD divides the 32-bit register into up to four parts and performs the operations on them in parallel. SIMD technology uses a single instruction to perform the same operation in parallel on multiple data elements of the same type and size.
ARMv6 (there are many versions of the ARM architecture; ARMv7 and ARMv8 are the current mainstream ones) introduced SIMD instructions that operate on 16-bit or 8-bit data units packed into 32-bit general purpose registers. This allows certain operations to run two or four times faster.
The following figure shows four parallel 8-bit addition operations. Four lanes of 8-bit data are packed into each of the general purpose registers R1 and R2, and the result is placed in R0.
The instruction for this is UADD8 R0, R1, R2.
Fig N.1: 4-way 8-bit unsigned integer add operation
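The behaviour of UADD8 can be modelled in portable C as shown below. This is a sketch of the semantics rather than code for an ARM target; the function name uadd8 and the way the four GE flags are returned through a pointer are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models UADD8 Rd, Rn, Rm: four independent unsigned 8-bit additions.
 * Bit i of *ge is set when lane i produced a carry (sum >= 0x100). */
uint32_t uadd8(uint32_t rn, uint32_t rm, unsigned *ge)
{
    assert(ge != NULL);
    uint32_t rd = 0;
    *ge = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = lane * 8;
        uint32_t a = (rn >> shift) & 0xFFu;
        uint32_t b = (rm >> shift) & 0xFFu;
        uint32_t sum = a + b;
        if (sum >= 0x100u) {
            *ge |= 1u << lane; /* the carry is recorded, not propagated */
        }
        rd |= (sum & 0xFFu) << shift; /* each lane wraps modulo 256 */
    }
    return rd;
}
```

The key point is that a carry out of one lane never leaks into the next one, which is what distinguishes UADD8 from an ordinary 32-bit ADD on the same registers.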
The ARMv7 architecture introduced the Advanced SIMD extension. It extends SIMD to operate on 64-bit doubleword and 128-bit quadword vector registers by defining a group of instructions that operate on them. The implementation of the Advanced SIMD extension used in ARM processors is called NEON.
NEON technology is implemented on all current ARM Cortex-A series processors. NEON instructions are executed as part of the ARM or Thumb instruction stream, which simplifies software development, debugging, and integration. Traditional ARM or Thumb instructions manage all program flow and synchronization. The NEON instructions perform:
·         memory accesses
·         data copying between NEON and general purpose registers
·         data type conversion
·         data processing.
NEON provides standardized acceleration for media and signal processing applications.

The following figure shows an eight-lane NEON addition on 128-bit quadword registers holding 16-bit data. The instruction for this is VADD.I16 Q0, Q1, Q2.
Fig N.2: 8-way 16-bit integer add operation

SUPPORTED DATA TYPES
NEON instructions support 8-bit, 16-bit, 32-bit, and 64-bit signed and unsigned integers.
NEON also supports 32-bit single-precision floating point elements, and 8-bit and 16-bit polynomials.
The VCVT instruction converts elements between single-precision floating-point and:
• 32-bit integer
• Fixed-point
• Half-precision floating point, if the processor implements the half-precision extensions.

NEON REGISTERS
The NEON register bank consists of thirty-two 64-bit registers. These registers are shared between SIMD and VFP (Vector Floating Point) operations. The bank can also be viewed as
  • sixteen 128-bit quadword registers, Q0-Q15
  • Thirty-two 64-bit doubleword registers, D0-D31.

The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers, and each of the Q0-Q15 registers maps onto a pair of D registers. The following figure shows the different views of the shared NEON and VFP register bank. All of these views are accessible at any time.
Fig N.3: NEON and VFP register set
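The Q and D views of the register bank can be modelled with a C union. This is a sketch for illustration only: the type neon_bank_t and the helper functions are hypothetical, and the model relies on the architectural mapping in which Qn occupies D(2n) as its low half and D(2n+1) as its high half.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the shared register bank: the same 256 bytes
 * seen either as 32 doubleword or as 16 quadword registers. */
typedef union {
    uint64_t d[32]; /* D0-D31 */
    struct {
        uint64_t lo; /* low 64 bits:  D(2n)     */
        uint64_t hi; /* high 64 bits: D(2n + 1) */
    } q[16];         /* Q0-Q15 */
} neon_bank_t;

/* Index of the D register that holds the low half of Qn. */
unsigned q_to_d_low(unsigned qn)
{
    assert(qn < 16);
    return 2 * qn;
}

/* Index of the D register that holds the high half of Qn. */
unsigned q_to_d_high(unsigned qn)
{
    assert(qn < 16);
    return 2 * qn + 1;
}
```

For example, writing to D2 and D3 in this model changes the low and high halves of Q1, which mirrors why code that uses Q1 must treat D2 and D3 as occupied.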