IBM has detailed its next-generation Telum chip, part of the Z processor lineup, at Hot Chips 33. The Telum chip features a brand-new core architecture that is geared toward AI acceleration.
IBM’s Next-Gen Z Processor: 7nm Telum Chip With 22.5 Billion Transistors, 8 Cores, 5 GHz+ Clocks & 6+ TFLOPs AI Acceleration
According to IBM, the newly optimized Z core, together with its brand-new cache and multi-chip fabric hierarchy, enables over 40% per-socket performance growth. The Telum chip comprises a total of 8 cores, each with its own dedicated L2 cache. The chip supports SMT2, which gives 16 threads per chip, while a maximum configuration of 32 chips, or 256 cores and 512 threads, is possible with a 4-drawer system.
Clock speeds are said to exceed 5 GHz, while the Telum Z chip comes with a redesigned branch predictor with an integrated 1st/2nd-level BTB, dynamic BTB entry reconfiguration, and more than 270K branch target table entries. The private L2 cache has a size of 32 MB and features a 19-cycle load-use latency (~3.8 ns, including TLB access).
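The quoted L2 latency follows from the clock speed. As a quick sanity check, here is a minimal sketch of the cycle-to-nanosecond conversion, assuming a clock of exactly 5 GHz (IBM only states that clocks exceed 5 GHz); the function name is illustrative.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


# Assumption: exactly 5.0 GHz; the article only says "more than 5 GHz".
l2_latency_ns = cycles_to_ns(19, 5.0)
print(f"L2 load-use latency: {l2_latency_ns:.1f} ns")
```

At a higher real clock the same 19 cycles would take slightly less time, which fits the approximate "~3.8 ns" figure.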
Moving over to the L3 and L4 caches, which are shared across the 8 cores, the IBM Z Telum chip packs a virtual on-chip 256 MB L3 cache and a virtual 2 GB L4 cache across up to 8 chips. The L2 caches use a 320 GB/s dual-direction ring interconnect topology, while the L3 cache is distributed through L2 cooperation and has an average latency of 12 ns. The virtual L3 and L4 caches provide 1.5x the cache per core.
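The virtual cache sizes are consistent with pooling the private L2 caches. The sketch below reproduces that arithmetic, assuming the virtual L3 is simply the sum of the eight 32 MB L2 caches on one chip and the virtual L4 is the sum of the virtual L3 caches across eight chips. The constant names are illustrative, not IBM terminology.

```python
L2_MB_PER_CORE = 32       # private L2 per core
CORES_PER_CHIP = 8        # cores on one Telum chip
CHIPS_PER_L4_DOMAIN = 8   # "up to 8 chips" share the virtual L4

# Assumption: virtual capacity is the plain sum of the pooled L2 caches.
virtual_l3_mb = L2_MB_PER_CORE * CORES_PER_CHIP
virtual_l4_mb = virtual_l3_mb * CHIPS_PER_L4_DOMAIN

print(f"Virtual L3: {virtual_l3_mb} MB")
print(f"Virtual L4: {virtual_l4_mb // 1024} GB")
```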
Performance in AI acceleration is rated at over 6 TFLOPs per chip and over 200 TFLOPs in a 4-drawer system that packs 32 IBM Z chips. The internal Matrix Array features 128 tiles with 8-way FP16 SIMD and high-density multiply-accumulate FPUs, while the Activation Array consists of 32 tiles with 8-way FP16/FP32 SIMD. A dual-chip configuration yields 116,000 inferences per second (1.1 ms latency), while a 32-chip configuration yields 3,600,000 inferences per second (1.2 ms latency).
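The accelerator figures can be cross-checked with simple arithmetic. The sketch below counts SIMD lanes per array and scales the per-chip rating to the 32-chip configuration mentioned above. It assumes linear scaling and uses 6.0 TFLOPs as a lower bound, since IBM only says "over 6"; the helper name is illustrative.

```python
def simd_lanes(tiles: int, ways: int) -> int:
    """Total SIMD lanes in an array of identical tiles."""
    return tiles * ways


matrix_lanes = simd_lanes(tiles=128, ways=8)      # Matrix Array
activation_lanes = simd_lanes(tiles=32, ways=8)   # Activation Array

# Assumption: 6.0 TFLOPs is a lower bound and scaling across chips is linear.
PER_CHIP_TFLOPS = 6.0
CHIPS = 32
aggregate_tflops = PER_CHIP_TFLOPS * CHIPS

print(f"Matrix Array lanes: {matrix_lanes}")
print(f"Activation Array lanes: {activation_lanes}")
print(f"Lower-bound aggregate: {aggregate_tflops:.0f} TFLOPs")
```

Under these assumptions the lower bound lands just under 200 TFLOPs, so the "over 200" system figure implies a per-chip rating somewhat above 6 TFLOPs.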
IBM Z Telum chips can be scaled up for even more performance, as there are both single-chip and dual-chip modular designs. The dual-chip configuration features a chiplet design with 2 Telum chips and offers 16 cores, 32 threads, and 512 MB of cache.
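The single-chip and dual-chip totals follow from the per-chip figures. A minimal sketch, assuming cores, threads, and L2 cache scale linearly with chip count; the `TelumConfig` class is a hypothetical helper for illustration, not an IBM interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelumConfig:
    """Hypothetical model of a Telum module built from identical chips."""
    chips: int
    cores_per_chip: int = 8
    threads_per_core: int = 2   # SMT2
    l2_mb_per_core: int = 32

    @property
    def cores(self) -> int:
        return self.chips * self.cores_per_chip

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    @property
    def cache_mb(self) -> int:
        # Assumption: total cache is the sum of all private L2 caches.
        return self.cores * self.l2_mb_per_core


single = TelumConfig(chips=1)
dual = TelumConfig(chips=2)
for name, cfg in (("single-chip", single), ("dual-chip", dual)):
    print(f"{name}: {cfg.cores} cores, {cfg.threads} threads, {cfg.cache_mb} MB cache")
```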
The AI accelerator on the IBM Z Telum chip offers:
- Very low and constant inference latency
- Compute capability for utilization at scale
- Variety of AI models ranging from traditional ML to RNNs and CNNs
- Security: enterprise-grade memory virtualization and protection
- Extensibility with future firmware and hardware updates
The IBM Z Telum chip will be fabricated on Samsung's 7nm process node and will feature a die size of 530 mm². The chip will house 22.5 billion transistors and will be aimed at enterprise and embedded workloads.