

## Write-Avoiding Algorithms

Erin Carson, James Demmel, Laura Grigori, Nicholas Knight, Penporn Koanantakool, Oded Schwartz, Harsha Vardhan Simhadri

## Write-Avoiding Linear Algebra and N-Body Algorithms

 There are sequential W-A algorithms **N** Particles for classical matrix multiplication, Triangular Solve, Cholesky factorization and direct N-body algos. Based on appropriate loop reordering All use Explicit blocking based on cache size extend to multiple levels of memory > assume explicit control over data movement special cases of CA algorithms Direct N-body Not all communication avoiding algorithms are W-A Loads to fast memory: Conjecture: should work for many other familiar algorithms Stores to slow memory: Hardware Perf Counters & **Cache Replacement Policy**  W-A algorithms above assume explicit control over data movement (Ex: User accesses Flash over PCIe) Do cache policies like LRU enable WA algorithms? Experimental Data for 4000 x n x 4000 matrix mult ♦ Intel Nehalem-EX Xeon 7560, MESIF coherence, pseudo-LRU L3 ♦ 24MB L3, 256KB L2, 32KB L1 (Output: 122MB = 2.0mln cache lines) hugectl used to allocate huge pages, avoid TLB misses C-box events accessed via customized Intel PCM 2.4 ♦ L3\_VICTIMS.M = #stores from L3 to DRAM (Modified L3 evictions)  $\diamond$  L3\_VICTIMS.E = #L3 evictions without DRAM writes (Exclusive lines) LLC\_S\_FILLS.E = #loads from DRAM to L3 in Exclusive state Blocking Sizes 

0.4
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0 X-Axis : n (middle dim.) Y-Axis: millions of cache lines (64Bytes each) Measurements for Cache-Oblivious, MKL and W-A Cache-Oblivious optimizes for reads, but not writes to DRAM MKL optimizes for neither (better time though) W-A at L3-DRAM (with b=1023, 3 blocks fit in L3) + MKL for L1 and L2 does fewer writes, but still not very close to optimal. Gap between lower bound and writes due to replacement policy. **Proposition 6.1**: If 5 blocks fit in L3 cache (not just 3) blocks required in explicit blocking), LRU matches Iower bound for any instruction order of block multiplication within L3 cache. If we do not require W-A at each cache level, there exists an instruction order for which 3 blocks can fill L3 and be W-A Blocking Sizes locking Sizes L2: 10

-L3\_VICTIMS.M 2 2.1 1.9 2 2.5 3 3.7 4.8 8

L3\_VICTIMS.E 0.3 0.8 1.7 4 11.4 25.6 51.4 102.7 207.4

LLC\_S\_FILLS.E 2.4 2.8 3.9 6.3 14.2 29 55.5 107.9 216.1

 Write L.B.
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 <th2

Multi-level Write-Avoiding

(left column)

L3\_VICTIMS.E 0.4 0.8 1.8 4 11.3 25.3 50.9 101.9 204.5 LLC\_S\_FILLS.E 2.4 2.9 4 6.3 13.8 28 53.7 105.2 208.4 
 Write L.B.
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 <th2 W-A at L3-DRAM only, different order at L2. LRU is W-A with 1023 and 700 block sizes.

L3\_VICTIMS.M 1.9 2 1.9 1.9 2.1 2.3 2.4 2.8 3.3



When does 'single node + NVM' beat 'cluster + DRAM'?