Versatile Training and Inference Accelerator for Machine Intelligence and Deep Learning

Ideal Edge-Training Solution for Machine and Deep Learning Applications

Powered by the "Polaris" Architecture

 

36 COMPUTE UNITS | 2,304 Stream Processors
5.7 TFLOPS | Peak FP16 and FP32 Performance
16GB | GDDR5 Memory
224GB/s | MEMORY BANDWIDTH

PERFORMANCE

5.7 TFLOPS of Peak Half or Single Precision Performance in a Single-Slot Card Using Under 150 Watts TDP 1

  • 5.7 TFLOPS peak FP16 | FP32 GPU compute performance.

    With 5.7 TFLOPS of peak FP16 or FP32 compute performance and 16GB of GDDR5 memory on a single board, the Radeon Instinct MI6 server accelerator provides unmatched single-precision performance with large memory on a single-slot card for machine and deep learning inference and edge-training applications. It also offers a cost-effective option for HPC development systems needing more memory. 1

  • 16GB ultra-fast GDDR5 GPU Memory on 256-bit memory interface.

    With 16GB of GDDR5 GPU memory and up to 224GB/s of memory bandwidth, the Radeon Instinct MI6 server accelerator provides a well-balanced, versatile single-precision compute solution for demanding machine intelligence and deep learning inference applications, and a cost-effective option for edge-training thanks to its large memory and low power requirements.

  • Up to 38 GFLOPS per watt of peak FP16 and FP32 GPU compute performance.

    With up to 38 GFLOPS/watt peak FP16 or FP32 GPU compute performance, the Radeon Instinct MI6 server accelerator provides a versatile, efficient solution for machine intelligence and deep learning inference and edge-training applications. 2

  • 36 Compute Units (2,304 Stream Processors).

    The Radeon Instinct MI6 server accelerator has 36 Compute Units, each containing 64 stream processors, for a total of 2,304 stream processors available for running many smaller batches of data simultaneously against trained deep learning neural networks to get quick results. Single-precision performance in a low-cost, efficient solution is crucial to these types of system installations, and the MI6 accelerator delivers outstanding single-precision performance in a single-slot GPU card, as the arithmetic sketch below illustrates.
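For context on where the headline figure comes from: footnote 1 below describes the peak-FLOPS method as engine clock × compute units × stream processors per CU × FLOPS per clock. A minimal arithmetic sketch, assuming a peak engine clock of about 1237 MHz (not listed on this page; back-derived from the quoted 5.7 TFLOPS):

```python
# Peak-throughput arithmetic for the Radeon Instinct MI6, following the
# method described in footnote 1. The ~1237 MHz peak engine clock is an
# assumption back-derived from the quoted 5.7 TFLOPS, not a published spec.
compute_units = 36
stream_processors_per_cu = 64        # 36 * 64 = 2,304 stream processors
engine_clock_hz = 1237e6             # assumed peak clock (~1237 MHz)
flops_per_clock_fp32 = 2             # one fused multiply-add = 2 FLOPs

peak_fp32 = (compute_units * stream_processors_per_cu
             * engine_clock_hz * flops_per_clock_fp32)
print(f"Peak FP32: {peak_fp32 / 1e12:.1f} TFLOPS")        # ~5.7 TFLOPS
# This page quotes the same 5.7 TFLOPS figure for peak FP16.

# Efficiency per footnote 2: peak TFLOPS divided by 150 W, times 1,000.
print(f"Efficiency: {peak_fp32 / 1e12 / 150 * 1000:.0f} GFLOPS/watt")  # ~38
```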

FEATURES

Passively Cooled Accelerator for Scalable Server Deployments

  • Passively cooled server accelerator based on “Polaris” architecture.
    The Radeon Instinct MI6 server accelerator, based on the “Polaris” architecture with a 14nm FinFET process, is designed for highly-efficient, scalable server deployments for single-precision inference and edge-training applications in machine intelligence and deep learning, along with HPC general purpose and development systems. This GPU server accelerator provides customers with a cost-effective, versatile compute solution while consuming only 150W TDP board power.
  • 150W TDP board power, single-slot, 9.5” GPU server card.
    The Radeon Instinct MI6 server GPU card is a full-height, single-slot card and works with PCIe® Gen 3 compliant motherboards. The MI6 GPU card is designed to fit in most standard server designs, providing a low-cost, highly-efficient server solution for heterogeneous machine intelligence and deep learning inference and edge-training, along with HPC-class system deployments.
  • Ultra-fast GDDR5 with up to 224GB/s memory bandwidth.
    The Radeon Instinct MI6 server accelerator is designed with 16GB of ultra-fast GDDR5 memory, allowing numerous batches of large data to be handled quickly and simultaneously for demanding machine intelligence and deep learning inference and edge-training applications, along with HPC workloads; a rough compute-to-bandwidth balance sketch follows this list.
  • MxGPU SR-IOV HW Virtualization.
    The Radeon Instinct™ MI6 server accelerator is designed with support for AMD’s MxGPU SR-IOV hardware virtualization technology to drive greater utilization and capacity in the data center.
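As referenced above, a rough way to read the compute-to-bandwidth balance of these specifications: dividing peak compute by peak memory bandwidth gives the arithmetic intensity a kernel needs before it stops being limited by the GDDR5 interface. A minimal sketch using only figures quoted on this page:

```python
# Rough roofline-style balance point from the figures quoted on this page.
peak_flops = 5.7e12        # peak FP32 throughput, FLOP/s
peak_bandwidth = 224e9     # peak GDDR5 memory bandwidth, bytes/s

# A kernel needs roughly this many FLOPs per byte of memory traffic
# before peak compute, rather than memory bandwidth, becomes the limit.
balance = peak_flops / peak_bandwidth
print(f"Balance point: ~{balance:.0f} FLOPs per byte")    # ~25
```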

USE CASES

Inference for Deep Learning

Today’s exponential data growth and the dynamic nature of that data have reshaped the requirements of data center system configurations. Data center designers need to build systems capable of running workloads that are more complex and parallel in nature, while continuing to improve system efficiencies. Improvements in the capabilities of discrete GPUs and other accelerators over the last decade are giving data center designers new options for building heterogeneous computing systems that help them meet these new challenges.

Data center deployments running inference applications, where many new, smaller data-set inputs are run at half precision (FP16) or single precision (FP32) against trained neural networks to discover new knowledge, require parallel-compute systems that can process those inputs across many smaller cores in a power-efficient manner.

The Radeon Instinct MI6 accelerator is a powerful, cost-sensitive solution for machine intelligence and deep learning inference deployments in the data center, delivering 5.7 TFLOPS of peak half or single precision floating point performance in a single-slot 150 watt TDP card. 1 The MI6 accelerator, based on AMD’s “Polaris” architecture with 16GB ultra-fast GDDR5 memory and up to 224 GB/s bandwidth, combined with the Radeon Instinct’s open ROCm software platform, provides data center designers with a versatile, highly-efficient solution for inference deployments.
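To make the FP16 inference pattern concrete, here is a minimal sketch, assuming a ROCm build of PyTorch (which exposes the GPU through the torch.cuda device API); the model, weights, and batch size are illustrative placeholders, not anything specific to the MI6:

```python
import torch
from torchvision import models

# Minimal half-precision (FP16) inference sketch. Assumes a ROCm build
# of PyTorch, which exposes the GPU via the torch.cuda device API; the
# model and batch size are placeholders.
device = torch.device("cuda")

model = models.resnet50().to(device).half().eval()   # random weights, cast to FP16
batch = torch.randn(32, 3, 224, 224, device=device, dtype=torch.float16)

with torch.no_grad():
    logits = model(batch)              # forward pass runs in half precision
print(logits.shape)                    # torch.Size([32, 1000])
```

Since this page quotes the same peak TFLOPS for FP16 and FP32, the practical benefit of FP16 here is chiefly the halved memory footprint and bandwidth per value.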

Key Benefits for Inference:

  • 5.7 TFLOPS half or single precision compute performance 1
  • 38 GFLOPS/watt peak FP16|FP32 performance for efficient inference and edge-training deployments 2
  • 358 GFLOPS peak double precision (FP64) compute performance
  • 2.4 GFLOPS/watt peak FP64 performance
  • 16GB GDDR5 on 256-bit memory interface provides ultra-fast memory performance
  • Passively cooled, Single-slot, GPU card for scalable server deployments
  • ROCm software platform provides open source Hyperscale platform
  • Open source Linux® drivers, HCC compiler, tools and libraries for full control from the metal forward
  • Optimized MIOpen deep learning framework libraries
  • Large BAR Support for mGPU peer to peer
  • MxGPU SR-IOV hardware virtualization for optimized system utilization

 

Edge-training for Deep Learning

Data centers running machine intelligence and deep learning applications increasingly use edge-training deployments, where the goal is more cost-effective, efficient training: a very large number of lower-cost edge servers process less compute-intensive training tasks, lowering overall data center costs through higher efficiencies. These systems require accelerators that provide good single-precision performance with larger memory in a dense, low-power package.

The Radeon Instinct MI6 accelerator is a versatile, low-power server accelerator that is a perfect fit for low-cost edge-training deployments for machine intelligence and deep learning applications in the data center, delivering 38 GFLOPS/watt peak half precision (FP16) or single precision (FP32) floating point performance in a single-slot 150 watt TDP GPU card. 2 The Radeon Instinct MI6 accelerator, based on AMD’s “Polaris” architecture with 16GB ultra-fast GDDR5 memory and up to 224 GB/s bandwidth, combined with the Radeon Instinct open ecosystem approach and the ROCm software platform, provides data center designers with a versatile, highly-efficient solution for edge-training deployments.
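As a hedged illustration of a single edge-training step of the kind described above, assuming a ROCm build of PyTorch; the model, data, and sizes are placeholders:

```python
import torch
import torch.nn as nn

# Minimal single-precision (FP32) training step. Assumes a ROCm build of
# PyTorch (GPU exposed via torch.cuda); model, data, and sizes are placeholders.
device = torch.device("cuda")

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(64, 512, device=device)            # placeholder batch
targets = torch.randint(0, 10, (64,), device=device)    # placeholder labels

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()                                         # backpropagate
optimizer.step()                                        # update weights
print(f"loss: {loss.item():.4f}")
```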

Key benefits for Edge-training:

  • 5.7 TFLOPS peak half or single precision compute performance 1
  • 38 GFLOPS/watt peak FP16|FP32 performance in single-slot card 2
  • 358 GFLOPS peak double precision (FP64) compute performance
  • 2.4 GFLOPS/watt peak FP64 performance
  • 16GB GDDR5 on 256-bit memory interface provides ultra-fast memory performance
  • Passively cooled for scalable server deployments
  • ROCm software platform provides open source Hyperscale platform
  • Open source Linux® drivers, HCC compiler, tools and libraries for full control from the metal forward
  • Optimized MIOpen deep learning framework libraries
  • Large BAR Support for mGPU peer to peer
  • MxGPU SR-IOV hardware virtualization for optimized system utilization
  • Open industry standard support of multiple architectures and industry standard interconnect technologies 3

 

Heterogeneous Compute for HPC General Purpose and Development

The HPC industry is creating immense amounts of unstructured data each year, and a portion of HPC system configurations is being reshaped to help the community extract useful information from that data. Traditionally, these systems were predominantly CPU based, but with the explosive growth in the amount and variety of data being created, along with the evolution of more complex codes, these traditional systems don’t meet all the requirements of today’s data-intensive HPC workloads. As these codes have become more complex and parallel, there has been growing use of heterogeneous computing systems with different mixes of accelerators, including discrete GPUs and FPGAs. The advancement of GPU capabilities over the last decade has allowed them to be used for a growing number of these mixed-precision parallel codes, like the ones used for deep learning applications. Scientists and researchers across the globe are now using accelerators to more efficiently process HPC parallel codes across several industries including life sciences, energy, financial, automotive and aerospace, academics, government and defense.

The Radeon Instinct MI6 accelerator, combined with AMD’s revolutionary ROCm open software platform, is a versatile, efficient heterogeneous compute solution, delivering 5.7 TFLOPS of peak half or single precision performance in a single-slot, 150 watt TDP GPU card with 16GB of ultra-fast GDDR5 memory and up to 224 GB/s of memory bandwidth. 1 The Radeon Instinct MI6 accelerator is an ideal heterogeneous compute solution for cost-sensitive general purpose and development systems deployed in the Financial Services, Energy, Life Science, Automotive, Academic (Research & Teaching), Government Labs and other HPC industries.
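For development systems like these, one quick sanity check against the quoted peak is to time a large single-precision matrix multiply. A hedged sketch (again assuming a ROCm build of PyTorch; the 4096 matrix size is arbitrary):

```python
import time
import torch

# Time a large FP32 GEMM and convert to achieved TFLOPS. Assumes a ROCm
# build of PyTorch; the 4096 matrix size is arbitrary.
device = torch.device("cuda")
n = 4096
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

torch.cuda.synchronize()               # wait for setup to finish
start = time.perf_counter()
c = a @ b
torch.cuda.synchronize()               # wait for the multiply to finish
elapsed = time.perf_counter() - start

flops = 2 * n ** 3                     # multiply-adds in an n x n GEMM
print(f"Achieved: {flops / elapsed / 1e12:.2f} TFLOPS (quoted peak: 5.7)")
```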

Key Benefits for HPC:

  • 5.7 TFLOPS peak half or single precision compute performance 1
  • 38 GFLOPS/watt peak FP16|FP32 compute performance for a range of HPC workloads 2
  • 358 GFLOPS peak double precision (FP64) compute performance
  • 2.4 GFLOPS/watt peak double precision compute performance
  • 16GB GDDR5 on 256-bit memory interface provides large ultra-fast memory performance
  • Passively cooled for scalable server deployments
  • ROCm software platform provides open source HPC-Class platform
  • Open source Linux® drivers, HCC compiler, tools and libraries for full control from the metal forward
  • MxGPU SR-IOV hardware virtualization for optimized system utilization

Download the Radeon Instinct™ MI6 Data Sheet

Discover the Radeon Instinct™ MI Series

Radeon Instinct™ MI6 DETAILS

In-Depth Look at the Specifications

Compute Units: 36
Thermal (Active/Passive, # Slots): Passive, Single Slot
Peak Half Precision Compute Performance: 5.7 TFLOPS
Peak Single Precision Compute Performance: 5.7 TFLOPS
Peak Double Precision Compute Performance: 358 GFLOPS
Stream Processors: 2304
Typical Board Power: 150W
Required PCI Slots: 1
Memory Data Rate: 7 Gbps
Memory Speed: 1750 MHz
Memory Size: 16GB
Memory Type: GDDR5
Memory Interface: 256-bit
Memory Bandwidth: 224GB/s
AMD PowerTune Technology: Yes
Product Family: Radeon Instinct™
Product Line: Radeon Instinct MI Series
Model: MI6
Platform: Server
OS Support: Linux® (64-bit)
Software Platform: ROCm Software Ecosystem Compatible
  1. Measurements conducted by AMD Performance Labs as of June 2, 2017 on the Radeon Instinct™ MI6 “Polaris” architecture based accelerator. Results are estimates only and may vary. Performance may vary based on use of latest drivers. PC/system manufacturers may vary configurations yielding different results. The results calculated for Radeon Instinct MI6 resulted in 5.7 TFLOPS peak half precision (FP16) performance and 5.7 TFLOPS peak single precision (FP32) floating-point performance. AMD TFLOPS calculations conducted with the following equation: FLOPS calculations are performed by taking the engine clock from the highest DPM state and multiplying it by xx CUs per GPU. Then, multiplying that number by xx stream processors, which exist in each CU. Then, that number is multiplied by 2 FLOPS per clock for FP32. To calculate TFLOPS for FP16, 4 FLOPS per clock were used. Measurements on the Nvidia Tesla P40 resulted in 0.19 TFLOPS peak half precision (FP16) floating-point performance with 250w TDP GPU card from external source. Source: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/ http://images.nvidia.com/content/pdf/tesla/184427-Tesla-P40-Datasheet-NV-Final-Letter-Web.pdf Measurements on the Nvidia Tesla P4 resulted in 0.09 TFLOPS peak half precision (FP16) floating-point performance with 75w TDP GPU card from external source. Source: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/ http://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf AMD has not independently tested or verified external and/or third party results/data and bears no responsibility for any errors or omissions therein. RIP-1
  2. Measurements conducted by AMD Performance Labs as of June 2, 2017 on the Radeon Instinct™ MI6 “Polaris” architecture based accelerator. Results are estimates only and may vary. Performance may vary based on use of latest drivers. PC/system manufacturers may vary configurations yielding different results. The results calculated for Radeon Instinct MI6 resulted in 38 GFLOPS/watt peak half precision (FP16) performance and 38 GFLOPS/watt peak single precision (FP32) floating-point performance. AMD GFLOPS per watt calculations conducted with the following equation: FLOPS calculations are performed by taking the engine clock from the highest DPM state and multiplying it by xx CUs per GPU. Then, multiplying that number by xx stream processors, which exist in each CU. Then, that number is multiplied by 2 FLOPS per clock for FP32. To calculate TFLOPS for FP16, 4 FLOPS per clock were used. Once the TFLOPs are calculated, the number is divided by the 150w TDP power and multiplied by 1,000. Measurements on the Nvidia Tesla P40 based on 0.19 TFLOPS peak FP16 with 250w TDP GPU card result in 0.76 GFLOPS/watt peak half precision (FP16) performance. Sources: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/ http://images.nvidia.com/content/pdf/tesla/184427-Tesla-P40-Datasheet-NV-Final-Letter-Web.pdf Measurements on the Nvidia Tesla P4 based on 0.09 TFLOPS peak FP16 with 75w TDP GPU card result in 1.2 GFLOPS/watt peak half precision (FP16) performance. Sources: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/ http://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf AMD has not independently tested or verified external/third party results/data and bears no responsibility for any errors or omissions therein. RIP-2
  3. Planned support for multiple architectures including x86, Power8 and ARM. AMD also supports current interconnect technologies and has planned support for future industry standard interconnect technologies including GenZ, CCIX, and OpenCAPI™. Timing and availability of supported architectures and industry standard interconnect technologies will vary. Check with your system vendor to see whether your specific system has architecture/technology support.

The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of non-infringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. “Polaris” is an AMD internal codename for the architecture only and not a product name. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD’s Standard Terms and Conditions of Sale. GD-18

© 2017 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.