

# Emerging memory technologies for improved energy efficiency

Martin Wenzel, Advanced Seminar, WS 2015

#### **Memory Bandwidth**





| Technology    | Bandwidth (GB/s) |
|---------------|------------------|
| DDR3-1333 2GB | 10.66            |
| DDR4-2667 4GB | 21.34            |

Sources:

- Hennessy, Patterson: Computer Architecture: A Quantitative Approach
- http://www.extremetech.com/computing/197720-beyond-ddr4-understand-the-differences-between-wide-io-hbm-and-hybrid-memory-cube
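The bandwidth figures in the table can be reproduced with a short calculation. This is a minimal sketch that assumes a single channel with the standard 64-bit (8-byte) DDR data bus and 1 GB = 10^9 bytes; the function name is chosen for illustration only.

```python
# Peak bandwidth of one DDR channel: transfers per second times bytes per transfer.
BUS_WIDTH_BYTES = 8  # assumption: standard 64-bit DDR data bus


def peak_bandwidth_gb_s(mega_transfers_per_s: int) -> float:
    """Return the theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return mega_transfers_per_s * BUS_WIDTH_BYTES / 1000


for name, rate in [("DDR3-1333", 1333), ("DDR4-2667", 2667)]:
    print(f"{name}: {peak_bandwidth_gb_s(rate):.2f} GB/s")
```

These are theoretical peaks; sustained bandwidth in a real system is lower.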

#### **Power Consumption**





# Stacking

Figure label: Bump






# **Stacked Memory Hybrid Memory Cube**





- 32 Vaults
  - Vertical Memory partitions
  - Vault Logic
    - DRAM Controller
    - Packetized Interconnect
    - Support for Atomics
      - Arithmetic
      - Bitwise swap / write
      - Boolean
      - Compare and Swap
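To make the Compare and Swap atomic concrete, the following toy model sketches its semantics in software. The `Vault` class, the word-addressed memory, and the convention of returning the old value are assumptions made for illustration; they do not describe the actual HMC command format.

```python
import threading


class Vault:
    """Toy model of one vault: a word-addressed memory plus its vault logic."""

    def __init__(self, size_words: int) -> None:
        self.mem = [0] * size_words
        # The lock stands in for the vault logic serializing requests.
        self._lock = threading.Lock()

    def compare_and_swap(self, addr: int, expected: int, new: int) -> int:
        """Write `new` only if the stored word equals `expected`; return the old word."""
        with self._lock:
            old = self.mem[addr]
            if old == expected:
                self.mem[addr] = new
            return old


vault = Vault(size_words=16)
first = vault.compare_and_swap(addr=3, expected=0, new=42)   # succeeds: word was 0
second = vault.compare_and_swap(addr=3, expected=0, new=7)   # fails: word is now 42
```

The caller detects success by comparing the returned value with the expected one.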

# Hybrid Memory Cube Interconnect





| Technology    | Bandwidth (GB/s) |
|---------------|------------------|
| DDR3-1333 2GB | 10.66            |
| DDR4-2667 4GB | 21.34            |


# **Processing in Memory (PIM) Instruction Offloading**

Compare and Swap, conventional (CPU with DDR memory): ReadCacheline(PTR) → CAS(PTR, CompVal, New) → WriteCacheline(PTR), 64 B data.



- **Problematic Workload**
  - Low Computation Intensity
  - Low Locality
- Expectation
  - Efficient Bandwidth Usage

Atomic CAS (offloaded to the HMC): Request\_CAS(PTR, CompVal, New) with 16 B data, response with 16 B data.
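The data volumes on this slide allow a rough comparison of link traffic per compare-and-swap. The sketch below assumes that the conventional path moves one 64 B cache line in each direction and that the offloaded path moves one 16 B request and one 16 B response; packet headers and protocol overhead are ignored.

```python
CACHE_LINE_BYTES = 64     # from the slide: 64 B data per cache line
ATOMIC_PACKET_BYTES = 16  # from the slide: 16 B request data, 16 B response data


def conventional_cas_bytes() -> int:
    # Assumption: one cache-line read plus one cache-line write-back.
    return 2 * CACHE_LINE_BYTES


def offloaded_cas_bytes() -> int:
    # One atomic request to the memory plus one response back.
    return 2 * ATOMIC_PACKET_BYTES


reduction = conventional_cas_bytes() / offloaded_cas_bytes()
print(f"conventional: {conventional_cas_bytes()} B, "
      f"offloaded: {offloaded_cas_bytes()} B, reduction: {reduction:.0f}x")
```

Under these assumptions the offloaded atomic needs a quarter of the data traffic, which is where the expected gain in bandwidth efficiency comes from.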

# Example Workload: Graph Computing Graph Search





- Breadth-first Search
  - Check all Neighbors
  - Move to the next level
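The two steps above correspond to a level-synchronous breadth-first search. A minimal sketch, assuming the graph is given as an adjacency dictionary (the function and variable names are illustrative):

```python
def bfs_levels(adjacency, source):
    """Return the BFS level of every vertex reachable from `source`."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for vertex in frontier:
            # Check all neighbors of the current vertex.
            for neighbor in adjacency.get(vertex, []):
                if neighbor not in level:
                    level[neighbor] = depth
                    next_frontier.append(neighbor)
        # Move to the next level.
        frontier = next_frontier
    return level


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = bfs_levels(graph, 0)
```

Each neighbor check is a small, data-dependent memory access with little computation attached, which is why this workload has low computation intensity and low locality.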

# Processing in Memory Offloading









# Processing in Memory Application Offloading – Tesseract





- Problematic Workload
  - Low Computation Intensity
  - Low Locality
- Expectation
  - Efficient Bandwidth Usage
  - High Energy Efficiency
  - Scalability

#### Processing in Memory Tesseract





- Single HMC
  - Max Interconnect Bandwidth: 160 GB/s
  - Max Memory Bandwidth: 256 GB/s
- Tesseract
  - PU in every Vault
  - 16 HMC in Network
  - Max Interconnect Bandwidth: 160 GB/s
  - Max Memory Bandwidth: 4 TB/s
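The aggregate numbers follow from the per-cube values. The sketch below takes 16 cubes and 256 GB/s from this slide and 32 vaults per cube from the Hybrid Memory Cube slide; that the internal bandwidth adds up linearly across cubes is an assumption of this calculation.

```python
NUM_HMCS = 16             # from the slide: 16 HMC in network
VAULTS_PER_HMC = 32       # from the Hybrid Memory Cube slide
HMC_MEMORY_BW_GB_S = 256  # from the slide: max memory bandwidth of a single HMC

# One processing unit (PU) per vault.
processing_units = NUM_HMCS * VAULTS_PER_HMC

# Assumption: internal memory bandwidth adds up linearly over all cubes.
aggregate_bw_gb_s = NUM_HMCS * HMC_MEMORY_BW_GB_S

print(f"{processing_units} PUs, {aggregate_bw_gb_s} GB/s "
      f"(about {aggregate_bw_gb_s / 1000:.1f} TB/s)")
```

The in-memory cores see the summed internal bandwidth of all cubes, while a host processor is limited by the interconnect bandwidth.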

#### Processing in Memory Tesseract Core Architecture





- Distributed Memory Architecture
  - No Cache Coherence
  - Remote Function Call
- List Prefetcher
  - Prefetch Stride (Cache Lines)
- Message Triggered Prefetcher
  - Preload Data before Message handling
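The interplay of remote function calls and message-triggered prefetching can be sketched with a small software model. Everything here is a toy under stated assumptions: the class names, the modulo-based data placement, and the method names `put` and `barrier` are illustrative, and the prefetch is modeled only as a counter rather than as real cache behavior.

```python
from collections import deque


class VaultCore:
    """Toy model of one in-memory core that owns a slice of the data."""

    def __init__(self, vault_id):
        self.vault_id = vault_id
        self.data = {}               # local memory partition: key -> value
        self.queue = deque()         # incoming messages
        self.prefetches_issued = 0   # stands in for real prefetch hardware

    def receive(self, func, key, arg):
        # Message-triggered prefetch: the target data is requested when the
        # message arrives, before the message is handled.
        self.prefetches_issued += 1
        self.queue.append((func, key, arg))

    def drain(self):
        # Handle all queued messages on local data only.
        while self.queue:
            func, key, arg = self.queue.popleft()
            self.data[key] = func(self.data[key], arg)


class System:
    """Distributed memory without cache coherence: every key has one owner."""

    def __init__(self, num_vaults, values):
        self.cores = [VaultCore(i) for i in range(num_vaults)]
        for key, value in values.items():
            self.owner(key).data[key] = value

    def owner(self, key):
        # Assumption: simple modulo placement of data across vaults.
        return self.cores[key % len(self.cores)]

    def put(self, func, key, arg):
        """Non-blocking remote function call: enqueue at the owner and return."""
        self.owner(key).receive(func, key, arg)

    def barrier(self):
        # Handlers in this toy send no further messages, so one pass suffices.
        for core in self.cores:
            core.drain()


system = System(num_vaults=4, values={k: 0 for k in range(8)})
for key in [1, 5, 5, 6]:
    system.put(lambda old, inc: old + inc, key, 1)
system.barrier()
result = {k: system.owner(k).data[k] for k in range(8)}
```

Because every update runs on the core that owns the data, no other core ever holds a copy, so no coherence traffic is needed.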

# Processing in Memory Tesseract – Speedup





Figure 6: Performance comparison between conventional architectures and Tesseract (normalized to DDR3-OoO).

- HMC-OoO Architecture
  - 32 Performance Cores
  - 16 HMCs
  - 320 GB/s Memory Bandwidth
- HMC-MC Architecture
  - 512 low-power Cores
  - 16 HMCs
  - 320 GB/s Memory Bandwidth
- Tesseract
  - 512 low-power Cores
  - 16 HMCs
  - 4TB/s Memory Bandwidth

# Processing in Memory Tesseract – Energy Efficiency





#### Processing in Memory Tesseract – Scalability





#### **Conclusion Processing in Memory**



- High Speedup
- Highly Energy Efficient
- Scales proportional to Memory Capacity
- Currently usable via Instruction Offloading
- Current Designs optimized for Graph Computing

#### **Future Work**



- Additional Workloads
- Processing Units
  - Internode Communication
  - Application specific
  - General Purpose
  - FPGA technology?

Further Information

- MEMSYS International Symposium on Memory Systems

#### Through-Silicon Via





#### **Processing in Memory Tesseract Core Architecture**





- Distributed Memory Architecture
  - No Coherence Traffic
  - Message / Instruction Passing
- Optional List Prefetcher
  - Optimize Locality
- Message Triggered Prefetcher
  - Preload Data before Message handling

#### Processing in Memory Tesseract – Latency



