Chapter 3 Arithmetic for Computers 1 FP Example: Array Multiplication C = C +A B All 32 32 matrices, 64-bit double-precision elements C code: void mm (double c[][],double a[][], double b[][]) { size_t i, j, k; for (i = 0; i < 32; i = i + 1) for (j = 0; j < 32; j = j + 1) for (k = 0; k < 32; k = k + 1) c[i][j] = c[i][j] + a[i][k] * b[k][j]; } Addresses of c, a, b in X0, X1, X2, and i, j, k in X19, X20, X21 2 FP Example: Array Multiplication
LEGv8 code: mm:... LDI X10, 32 // X10 = 32 (row size/loop end) LDI X19, 0 // i = 0; initialize 1st for loop L1: LDI X20, 0 // j = 0; restart 2nd for loop L2: LDI X21, 0 // k = 0; restart 3rd for loop LSL X11, X19, 5

// X11 = i * 25 (size of row of c) ADD X11, X11, X20// X11 = i * size(row) + j LSL X11, X11, 3 // X11 = byte offset of [i][j] ADD X11, X0, X11 // X11 = byte address of c[i][j] LDURD D4, [X11,#0] L3: LSL X9, X21, 5 // D4 = 8 bytes of c[i][j] // X9 = k * 25 (size of row of b) ADD X9, X9, X20 // X9 = k * size(row) + j LSL X9, X9, 3

// X9 = byte offset of [k][j] ADD X9, X2, X9 // X9 = byte address of b[k][j] LDURD D16, [X9,#0] // D16 = 8 bytes of b[k][j] LSL X9, X19, 5 // X9 = i * 25 (size of row of a) 3 FP Example: Array Multiplication ADD X9, X9, X21 // X9 = i * size(row) + k

LSL X9, X9, 3 // X9 = byte offset of [i][k] ADD X9, X1, X9 // X9 = byte address of a[i][k] LDURD D18, [X9,#0] // D18 = 8 bytes of a[i][k] FMULD D16, D18, D16 // D16 = a[i][k] * b[k][j] FADDD D4, D4, D16 // f4 = c[i][j] + a[i][k] * b[k][j]

ADDI X21, X21, 1 // $k = k + 1 CMP X21, X10 // test k vs. 32 B.LT L3 // if (k < 32) go to L3 STURD D4, [X11,0] // = D4 ADDI X20, X20, #1 // $j = j + 1 CMP X20, X10
// test j vs. 32 B.LT L2 // if (j < 32) go to L2 ADDI X19, X19, #1 // $i = i + 1 CMP X19, X10 // test i vs. 32 B.LT L1 // if (i < 32) go to L1 4
Accurate Arithmetic IEEE Std 754 specifies additional rounding control Extra bits of precision (guard, round, sticky) Choice of rounding modes Allows programmer to fine-tune numerical behavior of a computation Not all FP units implement all options Most programming languages and FP libraries just use defaults Trade-off between hardware complexity, performance, and market requirements 5 Rounding with Guard Digits 2.56ten 100 + 2.34ten 102, with three significant decimal digits Align the exponents: 0.0256 x 102, guard digit = 5, round digit = 6 Add significands: 0.0256 + 2.3400 = 2.3656 Sum: 2.3656 x 102, two digits to round gives 2.37 x 102 Without guard and round digits: 2.34 + 0.02 = 2.36 Sum = 2.36 x 102, off by 1 in the last digit

number of units in the last place (ulp): measure of accuracy in floating point in terms of the number of bits in error in the least significant bits of the significand If a number were off by 2 in the least significant bits off by 2 ulps IEEE 754 guarantees that the computer uses the number that is within one-half ulp (provided no overflow, underflow, or invalid operation exceptions) 6 Graphics and audio applications can take advantage of performing simultaneous operations on short vectors Example: 128-bit adder: Sixteen 8-bit adds Eight 16-bit adds Four 32-bit adds Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD) 7 3.6 Parallelism and Computer Arithmetic: Subword Parallelism

Subword Parallellism ARMv8 SIMD 32 128-bit registers (V0, , V31) and more than 500 machine-language instructions to support subword parallelism Works with integer and FP ARMv8 assembler uses different suffixes for the SIMD registers to represent different widths, which allows it to pick the proper machine language opcode. The suffixes are B (byte) for 8-bit operands, H (half) for 16-bit operands, S (single) for 32-bit operands, D (double) for 64-bit operands, and Q (quad) for 128-bit operands. Examples: 16 parallel 8-bit integer adds: ADD V1.16B, V2.16B, V3.16B 4 parallel 32-bit FP adds: FADD V1.4S, V2.4S, V3.4S 8 ARMv8 SIMD Instructions

9 Streaming SIMD Extension 2 (SSE2) 8 x 64-bit registers that can be used for floating-point operands. expanded to 16 registers, called XMM, as part of AMD64, relabeled by Intel as EM64T The 16 floating-point registers for SSE2 are actually 128 bits wide Multiple floating-point operands can be packed into a single 128-bit SSE2 register: 2 64-bit double precision 4 32-bit double precision Packed floating-point format is supported by arithmetic operations that can compute simultaneously on four singles (PS) or two doubles (PD). 10 Streaming SIMD Extension 2 (SSE2) The SSE/SSE2 floating-point instructions of the x86

xmm: one operand is a 128-bit SSE2 register, {mem|xmm}: the other operand is either in memory or it is an SSE2 register. MOV[AU]{SS|PS|SD|PD} - MOVASS, MOVAPS, MOVASD, MOVAPD, MOVUSS, MOVUPS, MOVUSD, MOVUPD. A means the 128-bit operand is aligned in memory; U means the 128-bit operand is unaligned in memory; H means move the high half of the 128-bit operand; and L means move the low half of the 128-bit operand. SS: Scalar Single precision floating point, or one 32-bit operand in a 128-bit register; PS: Packed Single precision floating point, or four 32-bit operands in a 128-bit register; SD: Scalar Double precision floating point, or one 64-bit operand in a 128-bit register; PD: Packed Double precision floating point, or two 64-bit operands in a 128-bit register. 11 Streaming SIMD Extension 2 (SSE2) YMM (256-bit registers), with Advanced Vector Extensions (AVX). Single operation can specify 8 x 32-bit floating-point operations or 4 x 64-bit floatingpoint operations. The legacy SSE/SSE2 instructions now operate on the lower 128 bits of the YMM registers. 256-bit operations: prepend the letter v (for vector) in front of the SSE2 assembly language operations and use the YMM register names instead of the XMM register name. For example, the SSE2 instruction to perform two 64-bit floating-point additions

addpd %xmm0, %xmm4 becomes vaddpd %ymm0, %ymm4 which now produces four 64-bit floating-point adds. Intel plans to widen the AVX registers to first 512 bits and to 1024 bits in later editions of the x86 architecture. 12 Full ARMv8 Integer and Floating-point Arithmetic Instructions 63 assembly-language integer arithmetic and floating-point instructions in the full ARMv8 instruction set, 15 ARMv8 arithmetic core instructions highlighted in bold. Pseudoinstructions are italicized 13

Other ARMv8 Features 245 SIMD instructions, including: Square root Fused multiply-add, multiply-subtract Convertion and scalar and vector round-to-integral Structured (strided) vector load/stores Saturating arithmetic Many SIMD instructions offer three versions: Wide means the width of the elements in the destination register and the first source registers is twice the width of the elements in the second source register. Long means the width of the elements in the destination register is twice the width of the elements in all source registers. Narrow means the width of the elements in the destination register is half the width of the elements in all source registers. 14 3.7 Real Stuff: Streaming SIMD Extensions and AVX in x86 x86 FP Architecture

Originally based on 8087 FP coprocessor 8 80-bit extended-precision registers Used as a push-down stack Registers indexed from TOS: ST(0), ST(1), FP values are 32-bit or 64 in memory Converted on load/store of memory operand Integer operands can also be converted on load/store Very difficult to generate and optimize code Result: poor FP performance 15 x86 FP Instructions Data transfer Arithmetic Compare

Transcendental FILD mem/ST(i) FISTP mem/ST(i) FLDPI FLD1 FLDZ FIADDP FISUBRP FIMULP FIDIVRP FSQRT FABS FRNDINT FICOMP FIUCOMP FSTSW AX/mem FPATAN

F2XMI FCOS FPTAN FPREM FPSIN FYL2X mem/ST(i) mem/ST(i) mem/ST(i) mem/ST(i) Optional variations I: integer operand P: pop operand from stack R: reverse operand order But not all combinations allowed 16 3.8 Going Faster: Subword Parallelism and Matrix Multiply

Subword Parallelism and Matrix Multiply Double precision GEneral Matrix Multiply (DGEMM) Performance benefit of adapting software to the underlying hardware, in this case the Sandy Bridge version of the Intel Core i7 microprocessor Unoptimized code: 1. void dgemm (int n, double* A, double* B, double* C) 2. { 3. for (int i = 0; i < n; ++i) 4. for (int j = 0; j < n; ++j) 5. { 6. double cij = C[i+j*n]; /* cij = C[i][j] */ 7. for(int k = 0; k < n; k++ ) 8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */ 9.
C[i+j*n] = cij; /* C[i][j] = cij */ 10. } 11. } 17 x86 assembly code for the body of the nested loops generated by compiling the unoptimized C code : 1. vmovsd (%r10),%xmm0 2. mov %rsi,%rcx 3. xor %eax,%eax 4. vmovsd (%rcx),%xmm1 5. add %r9,%rcx 6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 7. add $0x1,%rax 8. cmp %eax,%edi 9. vaddsd %xmm1,%xmm0,%xmm0 10. jg 30 11. add $0x1,%r11d 12. vmovsd %xmm0,(%r10)

# # # # # # # # # # # # Load 1 element of C into %xmm0 register %rcx = %rsi register %eax = 0 Load 1 element of B into %xmm1 register %rcx = %rcx + %r9 Multiply %xmm1, element of A register %rax = %rax + 1

compare %eax to %edi Add %xmm1, %xmm0 jump if %eax > %edi register %r11 = %r11 + 1 Store %xmm0 into C element 18 3.8 Going Faster: Subword Parallelism and Matrix Multiply Subword Parallelism and Matrix Multiply 3.8 Going Faster: Subword Parallelism and Matrix Multiply Subword Parallelism and Matrix Multiply Optimized C code using C intrinsics to generate the AVX subword-parallel instructions for the x86: 1. #include 2. void dgemm (int n, double* A, double* B, double* C) 3. { 4. for ( int i = 0; i < n; i+=4 ) 5.
for ( int j = 0; j < n; j++ ) { 6. __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */ 7. for( int k = 0; k < n; k++ ) 8. c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */ 9. _mm256_mul_pd(_mm256_load_pd(A+i+k*n), 10. _mm256_broadcast_sd(B+k+j*n))); 11. _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */ 12. } 13. } 19 3.8 Going Faster: Subword Parallelism and Matrix Multiply Subword Parallelism and Matrix Multiply Optimized x86 assembly code:

1. vmovapd (%r11),%ymm0 # 2. mov %rbx,%rcx # 3. xor %eax,%eax # 4. vbroadcastsd (%rax,%r8,1),%ymm1 # 5. add $0x8,%rax # 6. vmulpd (%rcx),%ymm1,%ymm1 # 7. add %r9,%rcx # 8. cmp %r10,%rax # 9. vaddpd %ymm1,%ymm0,%ymm0 # 10. jne 50 # 11. add $0x1,%esi

# 12. vmovapd %ymm0,(%r11) # Load 4 elements of C into %ymm0 register %rcx = %rbx register %eax = 0 Make 4 copies of B element register %rax = %rax + 8 Parallel mul %ymm1,4 A elements register %rcx = %rcx + %r9 compare %r10 to %rax Parallel add %ymm1, %ymm0 jump if not %r10 != %rax register % esi = % esi + 1 Store %ymm0 into 4 C elements 20 Subword Parallelism and Matrix Multiply For 32X32 matrices, the unoptimized DGEMM runs at 1.7 GigaFLOPS (FLoating point Operations Per Second) on one

core of a 2.6 GHz Intel Core i7 (Sandy Bridge). The optimized code performs at 6.4 GigaFLOPS. The AVX version is 3.85 times as fast, which is very close to the factor of 4.0 increase from performing 4 times as many operations at a time by using subword parallelism. 21 3.9 Fallacies and Pitfalls Fallacy: Right Shift same as an integer division by a power of 2 Left shift by i places multiplies an integer by 2i Right shift divides by 2i? Only for unsigned integers For signed integers Arithmetic right shift: replicate the sign bit e.g., 5 / 4 111110112 >> 2 = 111111102 = 2 Rounds toward

c.f. 111110112 >>> 2 = 001111102 = +62 22 Pitfall: Floating-point addition is not associative Floating-point numbers are approximations of real numbers and computer arithmetic has limited precision Assume c=1.5ten 1038, a = 1.5ten 1038, and b = 1.0, all single precision numbers c + (a + b) = 1.5ten 1038 + (1.5ten 1038 + 1.0) = 1.5ten 1038 + (1.5ten 1038) = 0.0 (c + a) + b = (1.5ten 1038 + 1.5ten 1038) + 1.0 = (0.0ten) + 1.0 = 1.0 Since 1.5ten 1038 is so much larger than 1.0ten that 1.5ten 1038 + 1.0 is 23 still 1.5ten 1038. Fallacy: Parallel execution strategies that work for

integer data types also work for floating-point data types Since floating-point addition is not associative, using one processor against using 1000 processors will not give the same results Operating system scheduler may use a different number of processors every time an addition is to be done Varying number of processors from each run would cause the floating-point sums to be calculated in different orders, getting slightly different answers each time despite running identical code with identical input Parallel code with floating-point numbers need to be verified whether the results are credible, even if they dont give the exact same answer as the sequential code 24 Fallacy: Only theoretical mathematicians care about floating-point accuracy Newspaper headlines of November 1994 prove this statement is a fallacy

25 Fallacy: Only theoretical mathematicians care about floating-point accuracy Important for scientific code But for everyday consumer use? My bank balance is out by 0.0002! The Pentium uses a standard floating-point divide algorithm that generates multiple quotient bits per step, using the most significant bits of divisor and dividend to guess the next 2 bits of the quotient. The guess is taken from a lookup table containing 2, 1, 0, +1, or +2. The guess is multiplied by the divisor and subtracted from the remainder to generate a new remainder. Like nonrestoring division, if a previous guess gets too large a remainder, the partial remainder is adjusted in a subsequent pass. Evidently, there were five elements of the table from the 80486 that Intel engineers thought could never be accessed, and they optimized the PLA to return 0 instead of 2 in these situations on the Pentium. Intel was wrong: while the first 11 bits were always correct, errors would show up

occasionally in bits 12 to 52, or the 4th to 15th decimal digits. 26 3.9 Concluding Remarks Concluding Remarks Bits have no inherent meaning Interpretation depends on the instructions applied Computer representations of numbers Finite range and precision Need to account for this in programs 27 Concluding Remarks ISAs support arithmetic Signed and unsigned integers Floating-point approximation to reals Bounded range and precision

Operations can overflow and underflow 28 The LEGv8 instruction set 29 The frequency of the LEGv8 instructions for SPEC CPU2006 integer and floating point 30