Cost-Sensitive, Scalable Accelerator for Machine and Deep Learning Inference Applications


Ideal for Datacenter Deployments of Inference Applications for Machine Intelligence and Deep Learning

Powered by the "Fiji" Architecture

 

  • 64 Compute Units (4,096 Stream Processors)
  • 8.2 TFLOPS FP16 and FP32 Performance
  • 4GB HBM1
  • 512GB/s Memory Bandwidth
 

PERFORMANCE

8.2 TFLOPS of Peak Half or Single Precision Performance with 4GB HBM1 1

  • 8.2 TFLOPS peak FP16 | FP32 GPU compute performance.

    With 8.2 TFLOPS of peak compute performance on a single board, the Radeon Instinct MI8 server accelerator delivers superior single-precision performance per dollar for machine and deep learning inference applications, and a cost-effective solution for HPC development systems. 1

  • 4GB high-bandwidth HBM1 GPU memory on a 4096-bit memory interface.

    With 4GB of HBM1 GPU memory and up to 512GB/s of memory bandwidth, the Radeon Instinct MI8 server accelerator combines strong single-precision performance with the memory system performance needed for the most demanding machine intelligence and deep learning inference applications, extracting meaningful results from new data applied to trained neural networks in a cost-effective, efficient manner.

  • 47 GFLOPS/watt peak FP16|FP32 GPU compute performance.

    With up to 47 GFLOPS/watt peak FP16|FP32 GPU compute performance, the Radeon Instinct MI8 server accelerator provides superior performance per watt for machine intelligence and deep learning inference applications. 2

  • 64 Compute Units (4,096 Stream Processors).

    The Radeon Instinct MI8 server accelerator has 64 Compute Units, each containing 64 stream processors, for a total of 4,096 stream processors available for running many smaller batches of data simultaneously against a trained neural network to return answers quickly. Single-precision performance is crucial to these types of system installations, and the MI8 accelerator provides superior single-precision performance in a single GPU card.
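The peak-throughput figures above follow from simple arithmetic over the compute-unit layout. A short sketch of that calculation, assuming a peak engine clock of 1000 MHz (the clock is not stated on this page; 1000 MHz is the value that reproduces the quoted 8.2 TFLOPS), with FP16 running at the same rate as FP32 as the identical quoted peaks imply:

```python
# Peak-throughput arithmetic behind the MI8's headline numbers.
# ENGINE_CLOCK_HZ is an assumption, chosen because it reproduces
# the 8.2 TFLOPS figure quoted on this page.

COMPUTE_UNITS = 64
STREAM_PROCESSORS_PER_CU = 64
ENGINE_CLOCK_HZ = 1000e6          # assumed peak clock (highest DPM state)
FLOPS_PER_CLOCK = 2               # one fused multiply-add per SP per clock

stream_processors = COMPUTE_UNITS * STREAM_PROCESSORS_PER_CU   # 4096
peak_flops = ENGINE_CLOCK_HZ * stream_processors * FLOPS_PER_CLOCK
peak_tflops = peak_flops / 1e12

print(f"{stream_processors} stream processors, {peak_tflops:.1f} TFLOPS peak FP32")

# Performance per watt: peak GFLOPS divided by the 175W board TDP.
gflops_per_watt = (peak_flops / 1e9) / 175
print(f"{gflops_per_watt:.0f} GFLOPS/watt")
```

Dividing the resulting 8.2 TFLOPS by the 175W TDP yields the 47 GFLOPS/watt figure quoted in the bullets above.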

FEATURES

Passively Cooled Accelerator Using <175 Watts TDP for Scalable Server Deployments

  • Passively cooled server accelerator based on “Fiji” architecture. The Radeon Instinct MI8 server accelerator is based on the “Fiji” architecture, built on a 28nm HPX process, and is designed for highly-efficient, scalable server deployments for single-precision inference applications in machine intelligence and deep learning. This GPU server accelerator provides customers with great performance while consuming only 175W TDP board power.
  • 175W TDP board power, dual-slot, 6” GPU server card. The Radeon Instinct MI8 server PCIe® Gen 3 x16 GPU card is a full-height, dual-slot card designed to fit in most standard server designs providing a highly-efficient server solution for heterogeneous machine intelligence and deep learning inference system deployments.
  • High Bandwidth Memory (HBM1) with up to 512GB/s memory bandwidth. The Radeon Instinct MI8 server accelerator is designed with 4GB of high-bandwidth HBM1 memory, allowing numerous batches of data to be handled quickly and simultaneously for the most demanding machine intelligence and deep learning inference applications, so that meaningful results can be quickly extracted from new data applied to trained neural networks.
  • MxGPU SR-IOV HW Virtualization. The Radeon Instinct MI8 server accelerator supports AMD’s MxGPU SR-IOV hardware virtualization technology, which is designed to drive greater utilization and capacity in the data center.
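The 512GB/s bandwidth figure follows directly from the memory configuration in the specification table below: a 4096-bit HBM1 interface at a 1 Gbps per-pin data rate (500 MHz DDR memory clock). A minimal sketch of that calculation:

```python
# HBM1 bandwidth from the spec-table figures: a 4096-bit interface
# at a 1 Gbps per-pin data rate.

INTERFACE_WIDTH_BITS = 4096
DATA_RATE_GBPS = 1.0              # 1 Gbps per pin (500 MHz DDR)

bandwidth_gbits = INTERFACE_WIDTH_BITS * DATA_RATE_GBPS   # gigabits/s
bandwidth_gbytes = bandwidth_gbits / 8                    # gigabytes/s
print(f"{bandwidth_gbytes:.0f} GB/s peak memory bandwidth")
```

The very wide, relatively low-clocked interface is what lets HBM1 reach this bandwidth at low power compared with narrower, faster GDDR designs.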

USE CASES

Inference for Deep Learning

Today’s exponential data growth and the dynamic nature of that data have reshaped the requirements of data center system configurations. Data center designers need to build systems capable of running workloads that are more complex and parallel in nature, while continuing to improve system efficiencies. Improvements in the capabilities of discrete GPUs and other accelerators over the last decade are giving data center designers new options for building heterogeneous computing systems that help them meet these new challenges.

 

Datacenter deployments running inference applications, where many smaller data set inputs are run at half precision (FP16) or single precision (FP32) against trained neural networks to discover new knowledge, require parallel compute capable systems that can quickly run data inputs across many smaller cores in a power-efficient manner.

 

The Radeon Instinct™ MI8 accelerator is an efficient, cost-sensitive solution for machine intelligence and deep learning inference deployments in the datacenter, delivering 8.2 TFLOPS of peak half or single precision (FP16|FP32) floating point performance in a single 175 watt TDP card. 1 Based on AMD’s “Fiji” architecture with 4GB of high-bandwidth HBM1 memory and up to 512 GB/s of bandwidth, and combined with the Radeon Instinct open ecosystem approach built on the ROCm platform, the MI8 accelerator provides data center designers with a highly-efficient, flexible solution for inference deployments.

Key Benefits for Inference:

  • 8.2 TFLOPS peak half or single precision compute performance 1
  • 47 GFLOPS/watt peak half or single precision compute performance 2
  • 4GB HBM1 on a 4096-bit memory interface provides high bandwidth memory performance
  • Passively cooled accelerator using under 175 watts TDP for scalable server deployments
  • ROCm software platform provides open source Hyperscale platform
  • Open source Linux drivers, HCC compiler, tools and libraries for full control from the metal forward
  • Optimized MIOpen Deep Learning framework libraries 3
  • Large BAR Support for mGPU peer to peer
  • MxGPU SR-IOV hardware virtualization for optimized system utilizations
  • Open industry standard support of multiple architectures and open standard interconnect technologies 4

 

Heterogeneous Compute for HPC General Purpose and Development

The HPC industry is creating immense amounts of unstructured data each year, and a portion of HPC system configurations are being reshaped to enable the community to extract useful information from that data. Traditionally, these systems were predominantly CPU based, but with the explosive growth in the amount and variety of data being created, along with the evolution of more complex codes, these traditional systems don’t meet all the requirements of today’s data intensive HPC workloads. As these codes have become more complex and parallel, there has been a growing use of heterogeneous computing systems with different mixes of accelerators, including discrete GPUs and FPGAs. The advancement of GPU capabilities over the last decade has allowed them to be used for a growing number of mixed precision parallel codes, like the ones being used for training neural networks for deep learning. Scientists and researchers across the globe are now using accelerators to more efficiently process HPC parallel codes across several industries, including life sciences, energy, financial services, automotive and aerospace, academia, government and defense.

 

The Radeon Instinct™ MI8 accelerator, combined with AMD’s revolutionary ROCm open software platform, is an efficient entry-level heterogeneous computing solution delivering 8.2 TFLOPS peak single precision compute performance in an efficient GPU card with 4GB of high-bandwidth HBM1 memory. 1 The MI8 accelerator is the perfect open solution for cost-effective general purpose and development systems being deployed in the Financial Services, Energy, Life Science, Automotive and Aerospace, Academic (Research & Teaching), Government Labs and other HPC industries.

Key Benefits for HPC:

  • 8.2 TFLOPS peak half or single precision compute performance for a range of HPC workloads 1
  • 47 GFLOPS/watt peak half or single precision compute performance 2
  • 512 GFLOPS peak (FP64) double precision compute performance with 4GB HBM1
  • 2.9 GFLOPS/watt peak FP64 compute performance
  • 4GB HBM1 on a 4096-bit memory interface provides high bandwidth memory performance
  • Passively cooled accelerator using under 175 watts TDP for scalable server deployments
  • ROCm software platform provides open source HPC-Class platform
  • Open source Linux drivers, HCC compiler, tools and libraries for full control from the metal forward
  • MxGPU SR-IOV hardware virtualization for optimized system utilizations
  • Open industry standard support of multiple architectures and industry standard interconnect technologies 4
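The FP64 figures in the list above are consistent with a double-precision rate of 1/16th the FP32 rate; that ratio is an assumption made here because it reproduces both quoted FP64 numbers from the 8.2 TFLOPS (8,192 GFLOPS) FP32 peak and the 175W TDP:

```python
# Relating the quoted FP64 figures to the FP32 peak.
# FP64_RATE_RATIO is an assumption: 1/16 is the ratio that
# reproduces both the 512 GFLOPS and 2.9 GFLOPS/watt numbers.

FP32_PEAK_GFLOPS = 8192           # 8.2 TFLOPS, from the spec table
FP64_RATE_RATIO = 1 / 16          # assumed double-precision ratio
TDP_WATTS = 175

fp64_peak_gflops = FP32_PEAK_GFLOPS * FP64_RATE_RATIO
fp64_gflops_per_watt = fp64_peak_gflops / TDP_WATTS
print(f"{fp64_peak_gflops:.0f} GFLOPS FP64, "
      f"{fp64_gflops_per_watt:.1f} GFLOPS/watt")
```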

Download the Radeon Instinct™ MI8 Data Sheet

Radeon Instinct™ MI8 Data Sheet

Discover the Radeon Instinct™ MI Series

Radeon Instinct™ MI Series

Radeon Instinct™ MI8 DETAILS

In-Depth Look at the Specifications

Compute Units: 64
Thermal (active/passive, #slots): Passive, Dual Slot
Peak Half Precision Compute Performance: 8.2 TFLOPS
Peak Single Precision Compute Performance: 8.2 TFLOPS
Peak Double Precision Compute Performance: 512 GFLOPS
Stream Processors: 4096
Typical Board Power: 175W
Required PCI Slots: 2
Memory Data Rate: 1Gbps
Memory Speed: 500MHz
Memory Size: 4GB
Memory Type: HBM1
Memory Interface: 4096-bit
Memory Bandwidth: 512GB/s
AMD PowerTune Technology
Product Family: Radeon Instinct™
Product Line: Radeon Instinct MI Series
Model: MI8
Platform: Server
OS Support: Linux® 64-bit
Software Platform: ROCm Software Ecosystem Compatible
  1. Measurements conducted by AMD Performance Labs as of June 2, 2017 on the Radeon Instinct™ MI8 “Fiji” architecture based accelerator. Results are estimates only and may vary. Performance may vary based on use of latest drivers. PC/system manufacturers may vary configurations, yielding different results. The results calculated for the MI8 were 8.2 TFLOPS peak half precision (FP16) performance and 8.2 TFLOPS peak single precision (FP32) floating-point performance. AMD TFLOPS calculations were conducted with the following equation: the engine clock from the highest DPM state is multiplied by the number of CUs per GPU, then by the number of stream processors in each CU, then by 2 FLOPS per clock for FP32; FP16 executes at the same rate on this architecture, yielding an identical FP16 peak. Measurements on the Nvidia Tesla P40 resulted in 0.19 TFLOPS peak half precision (FP16) floating-point performance with a 250W TDP GPU card, from an external source. Sources: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/; http://images.nvidia.com/content/pdf/tesla/184427-Tesla-P40-Datasheet-NV-Final-Letter-Web.pdf. Measurements on the Nvidia Tesla P4 resulted in 0.09 TFLOPS peak half precision (FP16) floating-point performance with a 75W TDP GPU card, from an external source. Sources: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/; http://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf. AMD has not independently tested or verified external and/or third party results/data and bears no responsibility for any errors or omissions therein. RIF-1
  2. Measurements conducted by AMD Performance Labs as of June 2, 2017 on the Radeon Instinct™ MI8 “Fiji” architecture based accelerator. Results are estimates only and may vary. Performance may vary based on use of latest drivers. PC/system manufacturers may vary configurations, yielding different results. The results calculated for the Radeon Instinct MI8 were 47 GFLOPS/watt peak half precision (FP16) performance and 47 GFLOPS/watt peak single precision (FP32) floating-point performance. AMD GFLOPS per watt calculations were conducted with the following equation: the engine clock from the highest DPM state is multiplied by the number of CUs per GPU, then by the number of stream processors in each CU, then by 2 FLOPS per clock for FP32; FP16 executes at the same rate on this architecture. Once the TFLOPS are calculated, the number is divided by the 175W TDP and multiplied by 1,000. Measurements on the Nvidia Tesla P40, based on 0.19 TFLOPS peak FP16 with a 250W TDP GPU card, result in 0.76 GFLOPS/watt peak half precision (FP16) performance. Sources for the Nvidia Tesla P40 FP16 TFLOPS number: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/; http://images.nvidia.com/content/pdf/tesla/184427-Tesla-P40-Datasheet-NV-Final-Letter-Web.pdf. Measurements on the Nvidia Tesla P4, based on 0.09 TFLOPS peak FP16 with a 75W TDP GPU card, result in 1.2 GFLOPS/watt peak half precision (FP16) performance. Sources for the Nvidia Tesla P4 FP16 TFLOPS number: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/; http://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf. AMD has not independently tested or verified external and/or third party results/data and bears no responsibility for any errors or omissions therein. RIF-2
  3. Planned support for machine intelligence frameworks. Refer to the www.GPUOpen.com web site for framework availability.
  4. Planned support for multiple architectures including x86, Power8 and ARM. AMD also supports current interconnect technologies and has planned support for future industry standard interconnect technologies, including GenZ, CCIX, and OpenCAPI™. Timing and availability of supported architectures and industry standard interconnect technologies will vary. Check with your system vendor to see whether your specific system has architecture/technology support.

The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of non-infringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. “Fiji” is an AMD internal codename for the architecture only and not a product name. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD’s Standard Terms and Conditions of Sale. GD-18

© 2017 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.