World’s Fastest Training Accelerator for Machine Intelligence and Deep Learning 1


World’s Most Advanced GPU Memory Architecture and Next-Generation Compute Engine

Powered by the "Vega" Architecture

 

  • 64 nCU Compute Units (4,096 Stream Processors)
  • 24.6 / 12.3 TFLOPS Peak FP16 / FP32 Performance
  • 16GB HBM2 Memory
  • 484 GB/s Memory Bandwidth
PERFORMANCE

Unmatched Half and Single Precision Floating-Point Performance 1

  • 24.6 TFLOPS FP16 or 12.3 TFLOPS FP32 peak GPU compute performance.

    With 24.6 TFLOPS FP16 or 12.3 TFLOPS FP32 peak GPU compute performance on a single board, the Radeon Instinct MI25 server accelerator provides single-precision performance leadership for compute-intensive machine intelligence and deep learning training applications. 1 The MI25 also provides a powerful solution for the most parallel HPC workloads, delivering 768 GFLOPS peak double-precision (FP64) performance at 1/16th rate.
  • 16GB ultra high-bandwidth HBM2 ECC 2 GPU memory.

    With a 2X data-rate improvement over the previous generation on a 2048-bit memory interface, a next-generation High Bandwidth Cache and controller, and ECC memory reliability, the Radeon Instinct MI25’s 16GB of HBM2 GPU memory provides a professional-level accelerator solution capable of handling the most demanding data-intensive machine intelligence and deep learning training applications. 3

  • Up to 82 GFLOPS/watt FP16 or 41 GFLOPS/watt FP32 peak GPU compute performance.

    With up to 82 GFLOPS/watt FP16 or 41 GFLOPS/watt FP32 peak GPU compute performance, the Radeon Instinct MI25 server accelerator provides unmatched performance per watt for machine intelligence and deep learning training applications in the datacenter, where performance and efficient power usage are crucial to ROI. 4 The MI25 also provides 2.5 GFLOPS/watt of FP64 peak performance.

  • 64 Compute Units each with 64 Stream Processors.

    The Radeon Instinct™ MI25 server accelerator has 64 Compute Units, each consisting of 64 stream processors, for a total of 4,096 stream processors. It is based on the next-generation “Vega” architecture, whose newly designed compute engine is built on flexible next-generation compute units (nCUs) that allow 16-bit, 32-bit and 64-bit processing at higher frequencies to supercharge today’s emerging dynamic workloads. The Radeon Instinct MI25 provides superior single-precision performance and flexibility for the most demanding compute-intensive parallel machine intelligence and deep learning applications in an efficient package (see the worked calculation after this list).
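The peak figures quoted above follow from the stream processor count, the FLOPS-per-clock rates described in footnote 1, the 300W typical board power, and the memory interface specifications. A minimal C++ sketch that reproduces them; note the roughly 1.5 GHz peak engine clock is inferred from the published TFLOPS numbers, not stated on this page:

    #include <cstdio>

    int main() {
        // Assumption: ~1.5 GHz peak engine clock, derived from
        // 12.3e12 FLOPS / (4096 stream processors * 2 FLOPS/clock).
        const double clock_hz     = 1.5e9;
        const double stream_procs = 64.0 * 64;  // 64 nCUs x 64 stream processors

        double fp32 = clock_hz * stream_procs * 2;  // 2 FLOPS/clock   -> ~12.3 TFLOPS
        double fp16 = clock_hz * stream_procs * 4;  // 4 FLOPS/clock   -> ~24.6 TFLOPS
        double fp64 = fp32 / 16;                    // 1/16th FP32 rate -> ~768 GFLOPS

        // Performance per watt at the 300W typical board power.
        printf("FP32: %.1f TFLOPS, %.0f GFLOPS/watt\n", fp32 / 1e12, fp32 / 1e9 / 300);
        printf("FP16: %.1f TFLOPS, %.0f GFLOPS/watt\n", fp16 / 1e12, fp16 / 1e9 / 300);
        printf("FP64: %.0f GFLOPS\n", fp64 / 1e9);

        // Memory bandwidth: 1.89 Gbps per pin x 2048-bit interface / 8 bits per byte.
        printf("Memory bandwidth: %.0f GB/s\n", 1.89 * 2048 / 8);
        return 0;
    }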

FEATURES

Built on AMD’s Next-Generation “Vega” Architecture with World’s Most Advanced GPU Memory

  • Passively cooled GPU server accelerator based on the next-generation “Vega” architecture using a 14nm FinFET process. The Radeon Instinct MI25 server accelerator, based on the new “Vega” architecture with a 14nm FinFET process, is a professional-grade accelerator designed for compute density and optimized for datacenter server deployments. The MI25 server accelerator is the ideal solution for single-precision, compute-intensive training applications in machine intelligence and deep learning, and for other HPC-class workloads where performance per watt is important.
  • 300W TDP board power, full-height, dual-slot, 10.5” PCIe® Gen 3 x16 GPU server card. The Radeon Instinct MI25 server PCIe® Gen 3 x16 GPU card is a full-height, dual-slot card designed to fit most standard server designs, providing a performance-driven server solution for heterogeneous machine intelligence and deep learning training and HPC-class system deployments.
  • Ultra high-bandwidth HBM2 ECC memory with up to 484 GB/s memory bandwidth. The Radeon Instinct MI25 server accelerator is designed with 16GB of the latest high-bandwidth HBM2 memory for efficiently handling the larger data set requirements of the most demanding machine intelligence and deep learning neural network training systems. The MI25 accelerator’s 16GB of ECC HBM2 memory also makes it an ideal solution for data-intensive HPC-class workloads. 2
  • MxGPU SR-IOV Hardware Virtualization. The Radeon Instinct MI25 server accelerator is designed with support for AMD’s MxGPU SR-IOV hardware virtualization technology to drive greater utilization and capacity in the data center.
  • Updated Remote Manageability Capabilities. The Radeon Instinct MI25 accelerator has advanced out-of-band manageability circuitry for simplified GPU monitoring in large-scale systems. The MI25’s manageability capabilities provide accessibility via I2C, regardless of what state the GPU is in, enabling advanced monitoring of a range of static and dynamic GPU information, including board part detail, serial numbers, GPU temperature, and power, using PMCI-compliant data structures.

USE CASES

Machine Intelligence & Deep Learning Neural Network Training

Training techniques used today on neural networks in machine intelligence and deep learning applications in data centers have become very complex and require the handling of massive amounts of data when training those networks to recognize patterns within that data. This requires lots of floating-point computation spread across many cores, and traditional CPUs can’t handle this type of computation as efficiently as GPUs. What can take CPUs weeks to compute can be handled in days with GPUs. The Radeon Instinct MI25, combined with AMD’s new Epyc server processors and our ROCm open software platform, delivers superior performance for machine intelligence and deep learning applications.

The MI25’s superior 24.6 TFLOPS of native half-precision (FP16) or 12.3 TFLOPS of single-precision (FP32) peak floating-point performance running across 4,096 stream processors, combined with its advanced High Bandwidth Cache (HBC) and controller and 16GB of high-bandwidth HBM2 memory, brings customers a new level of compute capable of meeting today’s demanding system requirements for efficiently handling the large data involved in training these complex neural networks. 1 The MI25 accelerator, based on AMD’s next-generation “Vega” architecture with the world’s most advanced memory architecture, is optimized for handling large data sets and delivers vast improvements in throughput per clock over previous generations: up to 82 GFLOPS per watt of FP16 or 41 GFLOPS per watt of FP32 peak GPU compute performance, for outstanding performance per watt in machine intelligence and deep learning training deployments in the data center, where performance and efficiency are mandatory. 4 A packed FP16 kernel sketch follows below.
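The doubled FP16 rate comes from packed math: each stream processor can execute a two-wide FP16 fused multiply-add per clock in place of a single FP32 operation. A minimal sketch of such a kernel, written against HIP (ROCm’s C++ dialect) and assuming its CUDA-style FP16 intrinsics (__floats2half2_rn, __hfma2, and friends) are available:

    #include <hip/hip_runtime.h>
    #include <hip/hip_fp16.h>

    // y[i] = a * x[i] + y[i] computed in packed FP16: each __hfma2 performs
    // two half-precision fused multiply-adds in one instruction.
    __global__ void haxpy(int n2, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) {
            __half2 ha = __float2half2_rn(a);                       // broadcast scalar
            __half2 hx = __floats2half2_rn(x[2 * i], x[2 * i + 1]); // pack two inputs
            __half2 hy = __floats2half2_rn(y[2 * i], y[2 * i + 1]);
            __half2 r  = __hfma2(ha, hx, hy);                       // 2 x FP16 FMA
            y[2 * i]     = __low2float(r);                          // unpack results
            y[2 * i + 1] = __high2float(r);
        }
    }

Packing and unpacking through float keeps the sketch simple; a production kernel would keep the data in __half2 form end to end.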

Benefits for Machine Intelligence & Deep Learning Neural Network Training:

  • Unmatched FP16 and FP32 Floating-Point Performance 1
  • Open Software ROCm Platform for HPC-Class Rack Scale
  • Optimized MIOpen Deep Learning Framework Libraries
  • Large BAR Support for Multi-GPU Peer-to-Peer
  • Configuration advantages with Epyc server processors
  • Superior compute density and performance per node when combining new AMD Epyc™ processor-based servers and Radeon Instinct “Vega” based products
  • MxGPU SR-IOV Hardware Virtualization enabling greater utilization and capacity in the data center

 

HPC Heterogeneous Compute

The HPC industry is creating immense amounts of unstructured data each year and a portion of HPC system configurations are being reshaped to enable the community to extract useful information from that data. Traditionally, these systems were predominantly CPU based, but with the explosive growth in the amount and different types of data being created, along with the evolution of more complex codes, these traditional systems don’t meet all the requirements of today’s data intensive HPC workloads. As these types of codes have become more complex and parallel, there has been a growing use of heterogeneous computing systems with different mixes of accelerators including discrete GPUs and FPGAs. The advancements of GPU capabilities over the last decade have allowed them to be used for a growing number of these parallel codes like the ones being used for training neural networks for deep learning. Scientists and researchers across the globe are now using accelerators to more efficiently process HPC parallel codes across several industries including life sciences, energy, financial, automotive and aerospace, academics, government and defense.

The Radeon Instinct MI25, combined with AMD’s new “Zen”-based Epyc server CPUs and our revolutionary ROCm open software platform, provides a progressive approach to open heterogeneous compute from the metal forward. AMD’s next-generation HPC solutions are designed to deliver maximum compute density and performance per node with the efficiency required to handle today’s massively parallel, data-intensive codes, as well as to provide a powerful, flexible solution for general-purpose HPC deployments. The ROCm software platform brings a scalable HPC-class solution that provides fully open-source Linux drivers, HCC compilers, tools and libraries to give scientists and researchers system control down to the metal. The Radeon Instinct’s open ecosystem approach supports various architectures including x86, Power8 and ARM, along with industry-standard interconnect technologies, providing customers with the ability to design optimized HPC systems for a new era of heterogeneous compute that embraces the HPC community’s open approach to scientific advancement. 5
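As a concrete illustration of this “down to the metal” programming model, here is a minimal, self-contained HIP program of the kind the ROCm stack compiles with hipcc; the kernel is ordinary C++ with CUDA-style launch semantics (a generic sketch, not an AMD sample):

    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    // Elementwise vector addition: one GPU thread per output element.
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

        float *da, *db, *dc;
        hipMalloc(&da, bytes);
        hipMalloc(&db, bytes);
        hipMalloc(&dc, bytes);
        hipMemcpy(da, ha.data(), bytes, hipMemcpyHostToDevice);
        hipMemcpy(db, hb.data(), bytes, hipMemcpyHostToDevice);

        // 256 threads per block, enough blocks to cover all n elements.
        hipLaunchKernelGGL(vadd, dim3((n + 255) / 256), dim3(256), 0, 0, da, db, dc, n);

        hipMemcpy(hc.data(), dc, bytes, hipMemcpyDeviceToHost);
        printf("c[0] = %.1f\n", hc[0]);  // expect 3.0

        hipFree(da);
        hipFree(db);
        hipFree(dc);
        return 0;
    }

Because HIP code is portable C++, the same source can also be compiled for CUDA targets, which is part of how the ROCm ecosystem stays open across vendors.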

Key Benefits for HPC Heterogeneous Compute:

  • Outstanding Compute Density and Performance Per Node
  • Open Software ROCm Platform for HPC-Class Rack Scale
  • Open Source Linux Drivers, HCC Compiler, Tools and Libraries from the Metal Forward
  • Open Industry Standard Support of Multiple Architectures and Industry Standard Interconnect Technologies 5

Download the Radeon Instinct™ MI25 Data Sheet

Discover the Radeon Instinct™ MI Series

Radeon Instinct™ MI25 DETAILS

In-Depth Look at the Specifications

Compute Units: 64 nCU
Peak Half Precision (FP16) Compute Performance: 24.6 TFLOPS
Peak Single Precision (FP32) Compute Performance: 12.3 TFLOPS
Peak Double Precision (FP64) Compute Performance: 768 GFLOPS
Stream Processors: 4,096
Typical Board Power: 300W
Required PCI Slots: 2
Memory Data Rate: 1.89 Gbps
Memory Speed: 945 MHz
Memory Size: 16GB
Memory Type: HBM2
Memory Interface: 2048-bit
Memory Bandwidth: 484 GB/s
AMD PowerTune Technology: Yes
Error-Correcting Code Memory (ECC): Yes
Product Family: Radeon Instinct™
Product Line: Radeon Instinct MI Series
Model: MI25
Platform: Server
Form Factor and Cooling: Passive, Dual Slot
OS Support: Linux® 64-bit
Software Platform: ROCm Software Ecosystem Compatible
  1. Measurements conducted by AMD Performance Labs as of June 2, 2017 on the Radeon Instinct™ MI25 “Vega” architecture based accelerator. Results are estimates only and may vary. Performance may vary based on use of latest drivers. PC/system manufacturers may vary configurations, yielding different results. The results calculated for the Radeon Instinct MI25 were 24.6 TFLOPS peak half precision (FP16) and 12.3 TFLOPS peak single precision (FP32) floating-point performance. AMD TFLOPS calculations are conducted with the following equation: the engine clock from the highest DPM state is multiplied by xx CUs per GPU, then by the xx stream processors in each CU, then by 2 FLOPS per clock for FP32. To calculate TFLOPS for FP16, 4 FLOPS per clock were used. The FP64 TFLOPS rate is calculated using 1/16th rate. External results on the NVIDIA Tesla P100-16 (16GB card) GPU accelerator were 18.7 TFLOPS peak half precision (FP16) and 9.3 TFLOPS peak single precision (FP32) floating-point performance. Results found at: https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf. External results on the NVIDIA Tesla P100-SXM2 GPU accelerator were 21.2 TFLOPS peak half precision (FP16) and 10.6 TFLOPS peak single precision (FP32) floating-point performance. Results found at: http://www.nvidia.com/object/tesla-p100.html. AMD has not independently tested or verified external/third-party results/data and bears no responsibility for any errors or omissions therein. RIV-1
  2. ECC support is limited to the HBM2 memory and ECC protection is not provided for internal GPU structures.
  3. HBM2 has 2X memory bandwidth per pin performance over HBM increasing from 1GB/s to 2GB/s per pin. HBM2 also doubles the capacity per die providing increased performance with the use of less power. http://wccftech.com/amd-vega-gpu-pictures-hbm2-official/
  4. Measurements conducted by AMD Performance Labs as of June 2, 2017 on the Radeon Instinct™ MI25 “Vega” architecture based accelerator. Results are estimates only and may vary. Performance may vary based on use of latest drivers. PC/system manufacturers may vary configurations, yielding different results. The results calculated for the Radeon Instinct MI25 were 82 GFLOPS/watt peak half precision (FP16) and 41 GFLOPS/watt peak single precision (FP32) floating-point performance. AMD GFLOPS-per-watt calculations are conducted with the following equation: the engine clock from the highest DPM state is multiplied by xx CUs per GPU, then by the xx stream processors in each CU, then by 2 FLOPS per clock for FP32. To calculate TFLOPS for FP16, 4 FLOPS per clock were used. The FP64 TFLOPS rate is calculated using 1/16th rate. Once the TFLOPS are calculated, the number is divided by the xxx watts TDP power and multiplied by 1,000 to determine the GFLOPS per watt. Calculations conducted by AMD Performance Labs as of June 2, 2017 on the NVIDIA Tesla P100-16 (16GB card) GPU accelerator, determining GFLOPS/watt by dividing TFLOPS results by 250 watts TDP, resulted in 75 GFLOPS per watt peak half precision (FP16) and 37 GFLOPS per watt peak single precision (FP32) floating-point performance. Source: https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf. Calculations conducted by AMD Performance Labs as of June 2, 2017 on the NVIDIA Tesla P100-SXM2 GPU accelerator, determining GFLOPS/watt by dividing TFLOPS results by 300 watts TDP, resulted in 71 GFLOPS per watt peak half precision (FP16) and 35 GFLOPS per watt peak single precision (FP32) floating-point performance. Source: http://www.nvidia.com/object/tesla-p100.html. AMD has not independently tested or verified external/third-party results/data and bears no responsibility for any errors or omissions therein. RIV-4
  5. Planned support for multiple architectures including x86, Power8 and ARM. AMD also supports current interconnect technologies and has planned support for future industry standard interconnect technologies including GenZ, CCIX, and OpenCAPI™. Timing and availability of supported architectures and industry standard interconnect technologies will vary. Check with your system vendor to see whether your specific system has architecture/technology support.

The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of non-infringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. “Vega” and “Vega10” are AMD internal codenames for the architecture only and not product names. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD’s Standard Terms and Conditions of Sale. GD-18

© 2017 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.