Deep Learning, an advanced form of machine learning, has generated a lot of interest because of its wide range of applications on complex data sets. Current technologies and the availability of very large amounts of complex data have made analytics on such data more tractable.1
Deep learning algorithms are built on deep neural networks, and GPUs are now used in deep learning applications because they provide many processing units. These processing units run the simulated neural network that does the computation on the data, allowing neural networks to scale and improve the extraction of information from data.
ROCm and The AMD Deep Learning Stack
The AMD Deep Learning Stack is the result of AMD’s initiative to enable DL applications using their GPUs such as the Radeon Instinct product line. Currently, deep learning frameworks such as Caffe, Torch, and TensorFlow are being ported and tested to run on the AMD DL stack. Supporting these frameworks is MIOpen, AMD’s open-source deep learning library built for the Radeon Instinct line of compute accelerators.2
AMD’s ROCm platform serves as the foundation of this DL stack. ROCm enables the seamless integration of the CPU and GPU for high performance computing (HPC) and ultra-scale class computing. To achieve this, ROCm is built for language independence and takes advantage of the Heterogeneous System Architecture (HSA) Runtime API.3 This is the basis of the ROCr System Runtime, a thin user-mode API providing access to graphics hardware driven by the AMDGPU driver and the ROCk kernel driver.4
For now, OS support for ROCm is limited to Ubuntu 14.04, Ubuntu 16.04, and Fedora 23. For these OSs, AMD provides a modified Linux version 4.6 kernel with patches to the HSA kernel driver (amdkfd) and the AMDGPU (amdgpu) kernel driver currently in the mainline Linux kernel.5
Using Docker With The AMD Deep Learning Stack
Software containers isolate an application and its dependencies from other software installed on the host. They abstract the underlying operating system while keeping each container's resources (filesystem, memory, CPU) and environment separate from other containers.
In contrast to virtual machines, all containers running on the same host share a single operating system without the need to virtualize a complete machine with its own OS. This makes software containers perform much faster than virtual machines because of the lack of overhead from the guest OS and the hypervisor.
Docker is the most popular software container platform today. It is available for Linux, macOS, and Microsoft Windows. Docker containers can run under any OS with the Docker platform installed.6
Installing Docker and The AMD Deep Learning Stack
The ROCm-enabled Linux kernel and the ROCk driver, together with other needed kernel modules, must be installed on all hosts that run Docker containers. This is because the containers do not have the kernel installed inside them. Instead, the containers share the host kernel.7
The installation procedure described here is for Ubuntu 16.04, currently the most tested OS for ROCm.
The next step is to install ROCm and the ROCm kernel on each host. The procedure below is based on the instructions at https://rocm.github.io/install.html.
Grab and install the GPG key for the repository:
wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
You should get the message 'OK'. You can check that the key was added using apt-key:
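For example:

```shell
# Show all trusted apt keys; the ROCm repository key should appear in the list
apt-key list
```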
In /etc/apt/sources.list.d, create a file named rocm.list and place the following line in it:
deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main
Update the repository information by running 'apt update'. If you get a warning about the key signature, you can ignore it; the repository administrator is expected to update the key in the future.
Install the ROCm Runtime software stack using ‘apt install rocm’:
[root@pegasus ~]# apt install rocm
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following packages were automatically installed and are no longer required:
hcblas hcfft hcrng miopengemm
Use ‘sudo apt autoremove’ to remove them.
The following additional packages will be installed:
hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm-dev
rocm-device-libs rocm-profiler rocm-smi rocm-utils
The following NEW packages will be installed:
hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm rocm-dev
rocm-device-libs rocm-profiler rocm-smi rocm-utils
0 upgraded, 10 newly installed, 0 to remove and 0 not upgraded.
Need to get 321 MB of archives.
After this operation, 1,934 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-utils amd64 1.0.0 [30.7 kB]
Get:2 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hcc amd64 1.0.17312 [255 MB]
Get:3 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hip_hcc amd64 1.2.17305 [876 kB]
Get:4 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [10.8 MB]
Get:5 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [46.5 MB]
Get:6 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-device-libs amd64 0.0.1 [587 kB]
Get:7 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-smi amd64 1.0.0-25-gbdb99b4 [8,158 B]
Get:8 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-profiler amd64 5.1.6400 [7,427 kB]
Get:9 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-dev amd64 1.6.148 [902 B]
Get:10 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm amd64 1.6.148 [1,044 B]
Fetched 321 MB in 31s (10.1 MB/s)
Selecting previously unselected package rocm-utils.
(Reading database … 254059 files and directories currently installed.)
Preparing to unpack …/rocm-utils_1.0.0_amd64.deb …
Unpacking rocm-utils (1.0.0) …
Selecting previously unselected package hcc.
Preparing to unpack …/hcc_1.0.17312_amd64.deb …
Unpacking hcc (1.0.17312) …
Selecting previously unselected package hip_hcc.
Preparing to unpack …/hip%5fhcc_1.2.17305_amd64.deb …
Unpacking hip_hcc (1.2.17305) …
Selecting previously unselected package linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148.
Preparing to unpack …/linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …
Unpacking linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
Selecting previously unselected package linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148.
Preparing to unpack …/linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …
Unpacking linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
Selecting previously unselected package rocm-device-libs.
Preparing to unpack …/rocm-device-libs_0.0.1_amd64.deb …
Unpacking rocm-device-libs (0.0.1) …
Selecting previously unselected package rocm-smi.
Preparing to unpack …/rocm-smi_1.0.0-25-gbdb99b4_amd64.deb …
Unpacking rocm-smi (1.0.0-25-gbdb99b4) …
Selecting previously unselected package rocm-profiler.
Preparing to unpack …/rocm-profiler_5.1.6400_amd64.deb …
Unpacking rocm-profiler (5.1.6400) …
Selecting previously unselected package rocm-dev.
Preparing to unpack …/rocm-dev_1.6.148_amd64.deb …
Unpacking rocm-dev (1.6.148) …
Selecting previously unselected package rocm.
Preparing to unpack …/rocm_1.6.148_amd64.deb …
Unpacking rocm (1.6.148) …
Setting up rocm-utils (1.0.0) …
Setting up hcc (1.0.17312) …
Setting up hip_hcc (1.2.17305) …
Setting up linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
Setting up linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
update-initramfs: Generating /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
Generating grub configuration file …
Found linux image: /boot/vmlinuz-4.11.0-kfd-compute-rocm-rel-1.6-148
Found initrd image: /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148
Found linux image: /boot/vmlinuz-4.4.0-93-generic
Found initrd image: /boot/initrd.img-4.4.0-93-generic
Found memtest86+ image: /memtest86+.elf
Found memtest86+ image: /memtest86+.bin
Setting up rocm-device-libs (0.0.1) …
Setting up rocm-smi (1.0.0-25-gbdb99b4) …
Setting up rocm-profiler (5.1.6400) …
Setting up rocm-dev (1.6.148) …
Setting up rocm (1.6.148) …
Reboot the server. Make sure that the Linux ROCm kernel is running:
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.11.0-kfd-compute-rocm-rel-1.6-148 x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
0 packages can be updated.
0 updates are security updates.
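The login banner above already shows the kernel version; you can also confirm the running kernel directly:

```shell
# The kernel release should match the ROCm kernel installed above,
# e.g. 4.11.0-kfd-compute-rocm-rel-1.6-148 on this setup
uname -r
```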
Test if your installation works with this sample program:
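The sample program here is the vector_copy HSA sample that ships with ROCm (its output appears below, and the same sample is used again later in this article). Assuming the rocm-dev package layout, it can be built and run like this:

```shell
# Build and run the vector_copy HSA sample shipped with ROCm
cd /opt/rocm/hsa/sample
sudo make
./vector_copy
```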
You should get an output similar to this:
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
“Obtaining machine model” succeeded.
“Getting agent profile” succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.
We will install the Docker Community Edition (Docker CE) on the host using Docker's apt repository. Our procedure is based on documentation published by Docker,8 though it may differ slightly in places. Note that the installation is done as the superuser; you can also use sudo to install Docker.
First, remove old versions of Docker:
apt remove docker docker-engine
If they are not installed, you will simply get a message that they are missing.
Install the following prerequisite packages using apt:
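Based on Docker's installation documentation for Ubuntu at the time, these are typically:

```shell
# Packages needed to add and use an apt repository over HTTPS
sudo apt install apt-transport-https ca-certificates curl software-properties-common
```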
Add the Docker GPG key to your host:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
The GPG fingerprint should be 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88. Use the command
apt-key fingerprint 0EBFCD88
to verify this.
Now add the repository information:
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
Finally, issue the command ‘apt update’.
Installing Docker CE should be done with ‘apt install docker-ce’. After the installation is complete, verify that Docker is properly configured and installed using the command ‘docker run hello-world’.
Running ROCm Docker Images
AMD provides a Docker image of the ROCm software framework.9 The image can be pulled from the official Docker repository:
sudo docker pull rocm/rocm-terminal
The image is about 1.5 GB in size and contains the necessary libraries to run ROCm-based applications. Create a container out of this image and look at the installed software in /opt/rocm:
sudo docker run -it --rm --device=/dev/kfd rocm/rocm-terminal
You can check for the ROCm libraries using ldconfig:
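For example (run inside the container; sudo may be unnecessary there):

```shell
# Print the shared-library cache; the ROCm libraries should be included
sudo ldconfig -p
```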
The command above should list all the libraries in the library path including the ROCm libraries.
The ROCm-docker source is available from GitHub:
git clone https://github.com/RadeonOpenCompute/ROCm-docker.git
Creating A ROCm Application Docker Image
We can use the rocm/rocm-terminal Docker image to build our own ROCm application Docker image. In the following examples, we use a couple of the sample applications that come with the ROCm development package. One of them is /opt/rocm/hip/samples/1_Utils/hipInfo.
Assuming the host has the complete ROCm development tools, we just do the following:
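Assuming the default sample location from the rocm-dev packages, this amounts to:

```shell
# Build the hipInfo sample in place
cd /opt/rocm/hip/samples/1_Utils/hipInfo
make
```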
The make command produces a binary called hipInfo.
If the compiler complains about a missing shared library called libsupc++, you will need to install it somewhere in the host's library path. In our case, we place the shared library in /usr/local/lib and make sure that ldconfig can find it. You can simply create a shared library from the installed static library /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a:
mkdir -p ~/tmp/libsupc++
cd ~/tmp/libsupc++
ar x /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a
ls -l *.o
gcc -shared -o libsupc++.so *.o
sudo cp -p libsupc++.so /usr/local/lib/
sudo ldconfig -v
Make sure that /usr/local/lib is seen by ldconfig. You may have to specify this directory in /etc/ld.so.conf.d if it is not found. Simply add a file named local_lib.conf with the line /usr/local/lib by itself.
Run hipInfo (./hipInfo) and check its output. You should get something like this (the details will differ depending on your GPU configuration):
compiler: hcc version=1.0.17312-d1f4a8a-19aa706-56b5abe, workweek (YYWWD) = 17312
Name: Device 67df
clockRate: 1303 Mhz
memoryClockRate: 2000 Mhz
clockInstructionRate: 1000 Mhz
totalGlobalMem: 8.00 GB
maxSharedMemoryPerMultiProcessor: 8.00 GB
sharedMemPerBlock: 64.00 KB
memInfo.total: 8.00 GB
memInfo.free: 7.75 GB (97%)
Now that hipInfo is compiled and has been tested, let us create a Docker image with it. Create a directory for building an image with Docker.
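For example (the directory name is an arbitrary choice, mirroring the ~/tmp/my_rocm_vectorcopy directory used later):

```shell
# Working directory that serves as the Docker build context
mkdir -p ~/tmp/my_rocm_hipinfo
cd ~/tmp/my_rocm_hipinfo
```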
Copy the necessary files for the Docker image to run properly:
cp -p /usr/local/lib/libsupc++.so . # If hipInfo needs this
cp -p /opt/rocm/hip/samples/1_Utils/hipInfo/hipInfo .
Create a file named Dockerfile in the current directory. It should contain this:
COPY libsupc++.so /usr/local/lib/
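The COPY line above is the only content shown. A minimal complete Dockerfile might look like the following sketch; the base image, the second COPY, and the CMD are assumptions based on how the container is run later:

```dockerfile
# Base the image on AMD's ROCm terminal image
FROM rocm/rocm-terminal
# Shared library built earlier on the host (only if hipInfo needs it)
COPY libsupc++.so /usr/local/lib/
# The hipInfo binary built on the host
COPY hipInfo /usr/local/bin/
CMD ["/usr/local/bin/hipInfo"]
```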
Build the Docker image:
sudo docker build -t my_rocm_hipinfo .
Create and run a container based on the new image:
sudo docker run --rm --device="/dev/kfd" my_rocm_hipinfo
The device /dev/kfd is the kernel fusion driver. You should get the same output as when running the hipInfo binary directly on the host.
Without the --rm parameter, the container will persist. You can then run the same container again and get the same output:
sudo docker run --device="/dev/kfd" --name nifty_hugle my_rocm_hipinfo
The Docker container persists:
sudo docker ps -a
The output should list the stopped nifty_hugle container.
Now, try this command and you should see the output from hipInfo again:
sudo docker start -i nifty_hugle
The second Docker image we will create contains the sample binary vector_copy. The source is in /opt/rocm/hsa/sample. As with hipInfo, use make to build the binary. Note that this binary also needs the accompanying .brig files to run.
We do the following before we build the image:
mkdir -p ~/tmp/my_rocm_vectorcopy
cd ~/tmp/my_rocm_vectorcopy
cp -p /usr/local/lib/libsupc++.so . # Do this if necessary
cp -p /opt/rocm/hsa/sample/vector_copy .
cp -p /opt/rocm/hsa/sample/vector_copy*.brig .
For our Dockerfile, we have this:
COPY libsupc++.so /usr/local/lib/
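As before, only the COPY line for libsupc++ is shown. A sketch of a complete Dockerfile follows; the base image, the vector_copy COPY lines, and the CMD are assumptions:

```dockerfile
FROM rocm/rocm-terminal
COPY libsupc++.so /usr/local/lib/
# vector_copy and its .brig modules must sit together, since the
# binary loads the .brig files from its working directory
WORKDIR /usr/local/bin
COPY vector_copy vector_copy*.brig /usr/local/bin/
CMD ["./vector_copy"]
```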
Building the Docker image for vector_copy should be familiar by now.
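For instance (the image tag is arbitrary):

```shell
# Build the image from the Dockerfile in the current directory
sudo docker build -t my_rocm_vectorcopy .
```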
As an exercise, run the image and see what output you get. Try with and without --rm, and with the 'docker start' command.
For our last example, we will use a Docker container for the Caffe deep learning framework. We are going to use the HIP port of Caffe, which can target both AMD ROCm and Nvidia CUDA devices.10 HIP enables converting CUDA code to portable C++. For more information on HIP, see https://github.com/ROCm-Developer-Tools/HIP.
Let us pull the hip-caffe image from the Docker registry:
docker pull intuitionfabric/hip-caffe
Test the image by running a device query on the AMD GPUs:
sudo docker run --name my_caffe -it --device=/dev/kfd --rm \
  intuitionfabric/hip-caffe ./build/tools/caffe device_query -gpu all
You should get an output similar to the one below. Note that your output may differ due to your own host configuration.
I0831 19:05:30.814853 1 caffe.cpp:138] Querying GPUs all
I0831 19:05:30.815135 1 common.cpp:179] Device id: 0
I0831 19:05:30.815145 1 common.cpp:180] Major revision number: 2
I0831 19:05:30.815148 1 common.cpp:181] Minor revision number: 0
I0831 19:05:30.815153 1 common.cpp:182] Name: Device 67df
I0831 19:05:30.815158 1 common.cpp:183] Total global memory: 8589934592
I0831 19:05:30.815178 1 common.cpp:184] Total shared memory per block: 65536
I0831 19:05:30.815192 1 common.cpp:185] Total registers per block: 0
I0831 19:05:30.815196 1 common.cpp:186] Warp size: 64
I0831 19:05:30.815201 1 common.cpp:188] Maximum threads per block: 1024
I0831 19:05:30.815207 1 common.cpp:189] Maximum dimension of block: 1024, 1024, 1024
I0831 19:05:30.815210 1 common.cpp:192] Maximum dimension of grid: 2147483647, 2147483647, 2147483647
I0831 19:05:30.815215 1 common.cpp:195] Clock rate: 1303000
I0831 19:05:30.815219 1 common.cpp:196] Total constant memory: 16384
I0831 19:05:30.815223 1 common.cpp:200] Number of multiprocessors: 36
Let us now run Caffe in a container. We begin by creating a container for this purpose.
sudo docker run -it --device=/dev/kfd --rm intuitionfabric/hip-caffe
Once the command above executes, you should be inside the container, where we can run the MNIST example.
First, get the raw MNIST data:
Make sure you format the data for Caffe:
Once that’s done, proceed with training the network:
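In a stock Caffe tree, these three steps correspond to the standard helper scripts; the paths below assume the container's working directory is the Caffe source root:

```shell
# Download the raw MNIST data
./data/mnist/get_mnist.sh
# Convert it into the LMDB format Caffe expects
./examples/mnist/create_mnist.sh
# Train LeNet on the GPU
./examples/mnist/train_lenet.sh
```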
You should get an output similar to this:
I0831 18:43:19.290951 37 caffe.cpp:217] Using GPUs 0
I0831 18:43:19.291165 37 caffe.cpp:222] GPU 0: Device 67df
I0831 18:43:19.294853 37 solver.cpp:48] Initializing solver from parameters:
I0831 18:43:19.294972 37 solver.cpp:91] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
I0831 18:43:19.295145 37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist
I0831 18:43:19.295169 37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0831 18:43:19.295181 37 net.cpp:58] Initializing net from parameters:
I0831 18:43:19.295332 37 layer_factory.hpp:77] Creating layer mnist
I0831 18:43:19.295426 37 net.cpp:100] Creating Layer mnist
I0831 18:43:19.295444 37 net.cpp:408] mnist -> data
I0831 18:43:19.295478 37 net.cpp:408] mnist -> label
I0831 18:43:19.304414 40 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I0831 18:43:19.304760 37 data_layer.cpp:41] output data size: 64,1,28,28
I0831 18:43:19.305835 37 net.cpp:150] Setting up mnist
I0831 18:43:19.305842 37 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0831 18:43:19.305848 37 net.cpp:157] Top shape: 64 (64)
I0831 18:43:19.305851 37 net.cpp:165] Memory required for data: 200960
I0831 18:43:19.305874 37 layer_factory.hpp:77] Creating layer conv1
I0831 18:43:19.305907 37 net.cpp:100] Creating Layer conv1
I0831 18:43:19.305912 37 net.cpp:434] conv1 <- data
I0831 18:43:19.305940 37 net.cpp:408] conv1 -> conv1
I0831 18:43:19.314159 37 cudnn_conv_layer.cpp:259] Before miopenConvolution*GetWorkSpaceSize
I0831 18:43:19.319051 37 cudnn_conv_layer.cpp:295] After miopenConvolution*GetWorkSpaceSize
I0831 18:43:19.319625 37 cudnn_conv_layer.cpp:468] Before miopenFindConvolutionForwardAlgorithm
I0831 18:43:19.927783 37 cudnn_conv_layer.cpp:493] fwd_algo_: 1
I0831 18:43:19.927809 37 cudnn_conv_layer.cpp:494] workspace_fwd_sizes_:57600
I0831 18:43:19.928071 37 cudnn_conv_layer.cpp:500] Before miopenFindConvolutionBackwardWeightsAlgorithm
...
I0831 18:43:23.296785 37 net.cpp:228] mnist does not need backward computation.
I0831 18:43:23.296789 37 net.cpp:270] This network produces output loss
I0831 18:43:23.296799 37 net.cpp:283] Network initialization done.
I0831 18:43:23.296967 37 solver.cpp:181] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt
I0831 18:43:23.296985 37 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist
I0831 18:43:23.296995 37 net.cpp:58] Initializing net from parameters:
I0831 18:44:12.620506 37 solver.cpp:404] Test net output #1: loss = 0.0299084 (* 1 = 0.0299084 loss)
In this article, we provided a guide on using AMD’s ROCm framework with Docker container technology. This should serve as a good jumpstart for deep learning development on AMD’s platform.
Docker has become an essential technology for containing the complexity of deep learning development. Deep learning frameworks and tools have many dependencies, and leveraging Docker to isolate those dependencies within a Linux container leads not only to greater reliability and robustness but also to greater agility and flexibility. Many frameworks and tools are still emerging, and it is best practice to have a robust way to manage these disparate parts. Docker containers have become standard practice in deep learning, and this technology is well supported by AMD’s ROCm framework.