
Deep Learning, an advanced form of machine learning, has generated a lot of interest because of its wide range of applications to complex data sets. Current technologies and the availability of very large amounts of complex data have made analytics on such data sets more tractable.1

Deep learning algorithms are built on deep neural networks, and GPUs are now used in deep learning applications because they provide many parallel processing units. These processing units carry out the computations of the simulated neural network, which lets the networks scale and improves the extraction of information from data.

ROCm and The AMD Deep Learning Stack

The AMD Deep Learning Stack is the result of AMD's initiative to enable DL applications using its GPUs, such as the Radeon Instinct product line. Currently, deep learning frameworks such as Caffe, Torch, and TensorFlow are being ported and tested to run on the AMD DL stack. Supporting these frameworks is MIOpen, AMD's open-source deep learning library built for the Radeon Instinct line of compute accelerators.2

AMD's ROCm platform serves as the foundation of this DL stack. ROCm enables the seamless integration of the CPU and GPU for high performance computing (HPC) and ultra-scale class computing. To achieve this, ROCm is built for language independence and takes advantage of the Heterogeneous System Architecture (HSA) Runtime API.3 This is the basis of the ROCr System Runtime, a thin user-mode API providing access to graphics hardware driven by the AMDGPU driver and the ROCk kernel driver.4

For now, OS support for ROCm is limited to Ubuntu 14.04, Ubuntu 16.04, and Fedora 23. For these operating systems, AMD provides a modified Linux 4.6 kernel with patches to the HSA kernel driver (amdkfd) and to the AMDGPU (amdgpu) kernel driver currently in the mainline Linux kernel.5

Using Docker With The AMD Deep Learning Stack

Docker Containers

Software containers isolate the application and its dependencies from other software installed on the host. They abstract the underlying operating system while keeping its own resources (filesystem, memory, CPU) and environment separate from other containers.

In contrast to virtual machines, all containers running on the same host share a single operating system without the need to virtualize a complete machine with its own OS. This makes software containers perform much faster than virtual machines because of the lack of overhead from the guest OS and the hypervisor.

Docker is the most popular software container platform today. It is available for Linux, macOS, and Microsoft Windows. Docker containers can run under any OS with the Docker platform installed.6

Installing Docker and The AMD Deep Learning Stack

The ROCm-enabled Linux kernel and the ROCk driver, together with other needed kernel modules, must be installed on all hosts that run Docker containers. This is because the containers do not have the kernel installed inside them. Instead, the containers share the host kernel.7
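Once Docker is installed (covered below), you can see this sharing directly: the kernel release reported inside a container matches the host's. A quick sketch, using the rocm/rocm-terminal image introduced later in this article (any Linux image would do):

uname -r                                          # kernel release on the host
sudo docker run --rm rocm/rocm-terminal uname -r  # same release, reported from inside a container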

The installation procedure described here is for Ubuntu 16.04, currently the most tested OS for ROCm.

Installing ROCm

The next step is to install ROCm and the ROCm kernel on each host. The procedure described below is based on the instructions found at https://rocm.github.io/install.html.

Grab and install the GPG key for the repository:

wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -

You should get the message 'OK'. You can verify that the key was added using apt-key:

apt-key list

In /etc/apt/sources.list.d, create a file named rocm.list and place the following line in it:

deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main
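Equivalently, you can create the file in one command with tee:

echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list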

Update the repository information by running ‘apt update’. If you get a warning because of the key signature, you may ignore it since the repository administrator will update this in the future.

Install the ROCm Runtime software stack using ‘apt install rocm’:

[root@pegasus ~]# apt install rocm
Reading package lists… Done
Building dependency tree
Reading state information… Done

The following packages were automatically installed and are no longer required:
hcblas hcfft hcrng miopengemm
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm-dev
rocm-device-libs rocm-profiler rocm-smi rocm-utils

Suggested packages: 
linux-firmware-image-4.11.0-kfd-compute-rocm-rel-1.6-148

The following NEW packages will be installed:
hcc hip_hcc linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 rocm rocm-dev
rocm-device-libs rocm-profiler rocm-smi rocm-utils
0 upgraded, 10 newly installed, 0 to remove and 0 not upgraded.
Need to get 321 MB of archives.
After this operation, 1,934 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-utils amd64 1.0.0 [30.7 kB]
Get:2 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hcc amd64 1.0.17312 [255 MB]
Get:3 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 hip_hcc amd64 1.2.17305 [876 kB]
Get:4 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [10.8 MB]
Get:5 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 amd64 4.11.0-kfd-compute-rocm-rel-1.6-148-1 [46.5 MB]
Get:6 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-device-libs amd64 0.0.1 [587 kB]
Get:7 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-smi amd64 1.0.0-25-gbdb99b4 [8,158 B]
Get:8 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-profiler amd64 5.1.6400 [7,427 kB]
Get:9 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm-dev amd64 1.6.148 [902 B]
Get:10 http://repo.radeon.com/rocm/apt/debian xenial/main amd64 rocm amd64 1.6.148 [1,044 B]
Fetched 321 MB in 31s (10.1 MB/s)
Selecting previously unselected package rocm-utils.
(Reading database … 254059 files and directories currently installed.)
Preparing to unpack …/rocm-utils_1.0.0_amd64.deb …
Unpacking rocm-utils (1.0.0) …
Selecting previously unselected package hcc.
Preparing to unpack …/hcc_1.0.17312_amd64.deb …
Unpacking hcc (1.0.17312) …
Selecting previously unselected package hip_hcc.
Preparing to unpack …/hip%5fhcc_1.2.17305_amd64.deb …
Unpacking hip_hcc (1.2.17305) …
Selecting previously unselected package linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148.
Preparing to unpack …/linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …
Unpacking linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
Selecting previously unselected package linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148.
Preparing to unpack …/linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148_4.11.0-kfd-compute-rocm-rel-1.6-148-1_amd64.deb …
Unpacking linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
Selecting previously unselected package rocm-device-libs.
Preparing to unpack …/rocm-device-libs_0.0.1_amd64.deb …
Unpacking rocm-device-libs (0.0.1) …
Selecting previously unselected package rocm-smi.
Preparing to unpack …/rocm-smi_1.0.0-25-gbdb99b4_amd64.deb …
Unpacking rocm-smi (1.0.0-25-gbdb99b4) …
Selecting previously unselected package rocm-profiler.
Preparing to unpack …/rocm-profiler_5.1.6400_amd64.deb …
Unpacking rocm-profiler (5.1.6400) …
Selecting previously unselected package rocm-dev.
Preparing to unpack …/rocm-dev_1.6.148_amd64.deb …
Unpacking rocm-dev (1.6.148) …
Selecting previously unselected package rocm.
Preparing to unpack …/rocm_1.6.148_amd64.deb …
Unpacking rocm (1.6.148) …
Setting up rocm-utils (1.0.0) …
Setting up hcc (1.0.17312) …
Setting up hip_hcc (1.2.17305) …
Setting up linux-headers-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
Setting up linux-image-4.11.0-kfd-compute-rocm-rel-1.6-148 (4.11.0-kfd-compute-rocm-rel-1.6-148-1) …
update-initramfs: Generating /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
Generating grub configuration file …
Found linux image: /boot/vmlinuz-4.11.0-kfd-compute-rocm-rel-1.6-148
Found initrd image: /boot/initrd.img-4.11.0-kfd-compute-rocm-rel-1.6-148
Found linux image: /boot/vmlinuz-4.4.0-93-generic
Found initrd image: /boot/initrd.img-4.4.0-93-generic
Found memtest86+ image: /memtest86+.elf
Found memtest86+ image: /memtest86+.bin
done
Setting up rocm-device-libs (0.0.1) …
Setting up rocm-smi (1.0.0-25-gbdb99b4) …
Setting up rocm-profiler (5.1.6400) …
Setting up rocm-dev (1.6.148) …
Setting up rocm (1.6.148) …
KERNEL=="kfd", MODE="0666"

Reboot the server. Make sure that the Linux ROCm kernel is running:

Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.11.0-kfd-compute-rocm-rel-1.6-148 x86_64)

* Documentation:  https://help.ubuntu.com
* Management:     https://landscape.canonical.com
* Support:        https://ubuntu.com/advantage

0 packages can be updated.
0 updates are security updates.
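You can also verify the running kernel release directly; the string should match the one in the login banner above:

uname -r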

Test if your installation works with this sample program:

cd /opt/rocm/hsa/sample
make
./vector_copy

You should get an output similar to this:

Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx803.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Freeing in argument memory buffer succeeded.
Freeing out argument memory buffer succeeded.
Shutting down the runtime succeeded.

Installing Docker

We are installing the Docker Community Edition (also called Docker CE) on the host by using Docker’s apt repository. Our procedure is based on documentation published by Docker.8 There may be some slight differences from the original documentation. Note that the installation is done as the superuser. You can also use sudo to install Docker.

First, remove old versions of Docker:

apt remove docker docker-engine

If they are not installed, you will simply get a message that they are missing.

Install the following prerequisite packages using apt (a combined command is shown after the list):

  • apt-transport-https
  • ca-certificates
  • curl
  • software-properties-common
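All four can be installed in one step, consistent with Docker's own instructions for Ubuntu:

sudo apt install apt-transport-https ca-certificates curl software-properties-common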

Add the Docker GPG key to your host:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

The GPG fingerprint should be 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88. Use the command

apt-key fingerprint 0EBFCD88

to verify this.

Now add the repository information:

add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"

Finally, issue the command ‘apt update’.

Install Docker CE with 'apt install docker-ce'. After the installation is complete, verify that Docker is properly installed by running 'docker run hello-world'.
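For reference, the full sequence (run as root, or prefix each command with sudo) is:

apt update
apt install docker-ce
docker run hello-world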

Running ROCm Docker Images

AMD provides a Docker image of the ROCm software framework.9 The image can be pulled from the official Docker repository:

sudo docker pull rocm/rocm-terminal

The image is about 1.5 GB in size and contains the necessary libraries to run ROCm-based applications. Create a container out of this image and look at the installed software in /opt/rocm:

sudo docker run -it --rm --device=/dev/kfd rocm/rocm-terminal

You can check for the ROCm libraries using ldconfig:

ldconfig -NXv

The command above should list all the libraries in the library path including the ROCm libraries.
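To narrow the listing down to the ROCm runtime libraries, you can filter the output; for example (assuming the HSA runtime library keeps its usual libhsa-runtime64 name):

ldconfig -NXv 2>/dev/null | grep -i hsa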

The ROCm-docker source is available from GitHub:

mkdir ~/tmp
cd ~/tmp
git clone https://github.com/RadeonOpenCompute/ROCm-docker.git

Creating A ROCm Application Docker Image

We can use the rocm/rocm-terminal Docker image to build our own ROCm application Docker image. In the following examples, we use a couple of the sample applications that come with the ROCm development package. The first is /opt/rocm/hip/samples/1_Utils/hipInfo.

Assuming the host has the complete ROCm development tools, we just do the following:

cd /opt/rocm/hip/samples/1_Utils/hipInfo
make

The make command produces a binary called hipInfo.

If the compiler complains about a missing shared library called libsupc++, you will need to install one somewhere in the host's library path. In our case, we place the shared library in /usr/local/lib and make sure that ldconfig can find it. You can create a shared library from the installed static library /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a:

mkdir -p ~/tmp/libsupc++
cd ~/tmp/libsupc++
ar x /usr/lib/gcc/x86_64-linux-gnu/4.8/libsupc++.a
ls -l *.o
gcc -shared -o libsupc++.so *.o
sudo cp -p libsupc++.so /usr/local/lib/
sudo ldconfig -v

Make sure that /usr/local/lib is seen by ldconfig. You may have to specify this directory in /etc/ld.so.conf.d if it is not found. Simply add a file named local_lib.conf with the line /usr/local/lib by itself.
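The two steps can be done as follows (local_lib.conf is the file name suggested above):

echo '/usr/local/lib' | sudo tee /etc/ld.so.conf.d/local_lib.conf
sudo ldconfig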

Check the output of hipInfo by running it. You should get something like the listing below; the exact values will differ depending on your GPU configuration:

$ ./hipInfo
compiler: hcc version=1.0.17312-d1f4a8a-19aa706-56b5abe, workweek (YYWWD) = 17312
--------------------------------------------------------------------------------
device#                           0
Name:                             Device 67df
pciBusID:                         1
pciDeviceID:                      0
multiProcessorCount:              36
maxThreadsPerMultiProcessor:      2560
isMultiGpuBoard:                  1
clockRate:                        1303 Mhz
memoryClockRate:                  2000 Mhz
memoryBusWidth:                   256
clockInstructionRate:             1000 Mhz
totalGlobalMem:                   8.00 GB
maxSharedMemoryPerMultiProcessor: 8.00 GB
totalConstMem:                    16384
sharedMemPerBlock:                64.00 KB
regsPerBlock:                     0
warpSize:                         64
l2CacheSize:                      0
computeMode:                      0
maxThreadsPerBlock:               1024
maxThreadsDim.x:                  1024
maxThreadsDim.y:                  1024
maxThreadsDim.z:                  1024
maxGridSize.x:                    2147483647
maxGridSize.y:                    2147483647
maxGridSize.z:                    2147483647
major:                            2
minor:                            0
concurrentKernels:                1
arch.hasGlobalInt32Atomics:       1
arch.hasGlobalFloatAtomicExch:    1
arch.hasSharedInt32Atomics:       1
arch.hasSharedFloatAtomicExch:    1
arch.hasFloatAtomicAdd:           0
arch.hasGlobalInt64Atomics:       1
arch.hasSharedInt64Atomics:       1
arch.hasDoubles:                  1
arch.hasWarpVote:                 1
arch.hasWarpBallot:               1
arch.hasWarpShuffle:              1
arch.hasFunnelShift:              0
arch.hasThreadFenceSystem:        0
arch.hasSyncThreadsExt:           0
arch.hasSurfaceFuncs:             0
arch.has3dGrid:                   1
arch.hasDynamicParallelism:       0
peers:
non-peers:                        device#0
memInfo.total:                    8.00 GB
memInfo.free:                     7.75 GB (97%)

Now that hipInfo is compiled and has been tested, let us create a Docker image with it. Create a directory for building an image with Docker.

mkdir ~/tmp/my_rocm_hipinfo
cd ~/tmp/my_rocm_hipinfo

Copy the necessary files for the Docker image to run properly:

cp -p /usr/local/lib/libsupc++.so .   # If hipInfo needs this
cp -p /opt/rocm/hip/samples/1_Utils/hipInfo/hipInfo .

Create a file named Dockerfile in the current directory. It should contain this:

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/
COPY hipInfo /usr/local/bin/
RUN sudo ldconfig

USER rocm-user
WORKDIR /home/rocm-user
ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/bin"

ENTRYPOINT ["hipInfo"]

Build the Docker image:

sudo docker build -t my_rocm_hipinfo .

Create and run a container based on the new image:

sudo docker run --rm --device="/dev/kfd" my_rocm_hipinfo

The device /dev/kfd is the interface to the kernel fusion driver. You should get output similar to running the hipInfo binary directly on the host.
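If the container complains that the device is missing, confirm that the ROCm kernel created the device node on the host:

ls -l /dev/kfd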

Without the --rm parameter, the container will persist after it exits. For example, create and run a named container:

sudo docker run --device="/dev/kfd" --name nifty_hugle my_rocm_hipinfo

The Docker container persists after it exits:

sudo docker ps -a

The listing should include the nifty_hugle container with a status of Exited.

Now start the container again and you should see the output from hipInfo once more:

sudo docker start -i nifty_hugle
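When you no longer need the container, remove it by name:

sudo docker rm nifty_hugle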

The second Docker image we create will contain the sample binary vector_copy, whose source is in /opt/rocm/hsa/sample. As with hipInfo, use make to build the binary. Note that this binary also needs the accompanying .brig files at runtime.

We do the following before we build the image:

mkdir ~/tmp/my_rocm_vectorcopy
cd ~/tmp/my_rocm_vectorcopy
mkdir vector_copy
cp -p /usr/local/lib/libsupc++.so . # Do this if necessary
cd vector_copy
cp -p /opt/rocm/hsa/sample/vector_copy .
cp -p /opt/rocm/hsa/sample/vector_copy*.brig .
cd ..   # Back to ~/tmp/my_rocm_vectorcopy

For our Dockerfile, we have this:

FROM rocm/rocm-terminal:latest

COPY libsupc++.so /usr/local/lib/
RUN sudo mkdir /usr/local/vector_copy
COPY vector_copy/* /usr/local/vector_copy/
RUN sudo ldconfig

USER rocm-user
ENV PATH "${PATH}:/opt/rocm/bin:/usr/local/vector_copy"

WORKDIR /usr/local/vector_copy
ENTRYPOINT ["vector_copy"]

Building the Docker image for vector_copy should be familiar by now.
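The build command mirrors the hipInfo image; my_rocm_vectorcopy is simply our chosen tag:

sudo docker build -t my_rocm_vectorcopy .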

As an exercise, run the Docker image and see what output you get. Try it with and without --rm, and with the 'docker start' command.

For our last example, we will use a Docker container for the Caffe deep learning framework. We are going to use the HIP port of Caffe, which can target both AMD ROCm and Nvidia CUDA devices.10 HIP enables converting CUDA code to portable C++. For more information on HIP, see https://github.com/ROCm-Developer-Tools/HIP.

Let us pull the hip-caffe image from the Docker registry:

sudo docker pull intuitionfabric/hip-caffe

Test the image by running a device query on the AMD GPUs:

sudo docker run --name my_caffe -it --device=/dev/kfd --rm \
intuitionfabric/hip-caffe ./build/tools/caffe device_query -gpu all

You should get an output similar to the one below. Note that your output may differ due to your own host configuration.

I0831 19:05:30.814853     1 caffe.cpp:138] Querying GPUs all
I0831 19:05:30.815135     1 common.cpp:179] Device id:                     0
I0831 19:05:30.815145     1 common.cpp:180] Major revision number:         2
I0831 19:05:30.815148     1 common.cpp:181] Minor revision number:         0
I0831 19:05:30.815153     1 common.cpp:182] Name:                          Device 67df
I0831 19:05:30.815158     1 common.cpp:183] Total global memory:           8589934592
I0831 19:05:30.815178     1 common.cpp:184] Total shared memory per block: 65536
I0831 19:05:30.815192     1 common.cpp:185] Total registers per block:     0
I0831 19:05:30.815196     1 common.cpp:186] Warp size:                     64
I0831 19:05:30.815201     1 common.cpp:188] Maximum threads per block:     1024
I0831 19:05:30.815207     1 common.cpp:189] Maximum dimension of block:    1024, 1024, 1024
I0831 19:05:30.815210     1 common.cpp:192] Maximum dimension of grid:     2147483647, 2147483647, 2147483647
I0831 19:05:30.815215     1 common.cpp:195] Clock rate:                    1303000
I0831 19:05:30.815219     1 common.cpp:196] Total constant memory:         16384
I0831 19:05:30.815223     1 common.cpp:200] Number of multiprocessors:     36

Let us now run Caffe in a container. We begin by creating a container for this purpose:

sudo docker run -it --device=/dev/kfd --rm intuitionfabric/hip-caffe

Once the above command executes, you should be inside the container, where we will run the MNIST example.

First, get the raw MNIST data:

./data/mnist/get_mnist.sh

Make sure you format the data for Caffe:

./examples/mnist/create_mnist.sh

Once that’s done, proceed with training the network:

./examples/mnist/train_lenet.sh

You should get an output similar to this:

I0831 18:43:19.290951    37 caffe.cpp:217] Using GPUs 0
I0831 18:43:19.291165    37 caffe.cpp:222] GPU 0: Device 67df
I0831 18:43:19.294853    37 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
solver_mode: GPU
device_id: 0
net: "examples/mnist/lenet_train_test.prototxt"
train_state {
level: 0
stage: ""
}
I0831 18:43:19.294972    37 solver.cpp:91] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
I0831 18:43:19.295145    37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer mnist
I0831 18:43:19.295169    37 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0831 18:43:19.295181    37 net.cpp:58] Initializing net from parameters:
name: "LeNet"
state {
phase: TRAIN
level: 0
stage: ""
}
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
...
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0831 18:43:19.295332    37 layer_factory.hpp:77] Creating layer mnist
I0831 18:43:19.295426    37 net.cpp:100] Creating Layer mnist
I0831 18:43:19.295444    37 net.cpp:408] mnist -> data
I0831 18:43:19.295478    37 net.cpp:408] mnist -> label
I0831 18:43:19.304414    40 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I0831 18:43:19.304760    37 data_layer.cpp:41] output data size: 64,1,28,28
I0831 18:43:19.305835    37 net.cpp:150] Setting up mnist
I0831 18:43:19.305842    37 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0831 18:43:19.305848    37 net.cpp:157] Top shape: 64 (64)
I0831 18:43:19.305851    37 net.cpp:165] Memory required for data: 200960
I0831 18:43:19.305874    37 layer_factory.hpp:77] Creating layer conv1
I0831 18:43:19.305907    37 net.cpp:100] Creating Layer conv1
I0831 18:43:19.305912    37 net.cpp:434] conv1 <- data
I0831 18:43:19.305940    37 net.cpp:408] conv1 -> conv1
I0831 18:43:19.314159    37 cudnn_conv_layer.cpp:259] Before miopenConvolution*GetWorkSpaceSize
I0831 18:43:19.319051    37 cudnn_conv_layer.cpp:295] After miopenConvolution*GetWorkSpaceSize
I0831 18:43:19.319625    37 cudnn_conv_layer.cpp:468] Before miopenFindConvolutionForwardAlgorithm
I0831 18:43:19.927783    37 cudnn_conv_layer.cpp:493] fwd_algo_[0]: 1
I0831 18:43:19.927809    37 cudnn_conv_layer.cpp:494] workspace_fwd_sizes_[0]:57600
I0831 18:43:19.928071    37 cudnn_conv_layer.cpp:500] Before miopenFindConvolutionBackwardWeightsAlgorithm
...
I0831 18:43:23.296785    37 net.cpp:228] mnist does not need backward computation.
I0831 18:43:23.296789    37 net.cpp:270] This network produces output loss
I0831 18:43:23.296799    37 net.cpp:283] Network initialization done.
I0831 18:43:23.296967    37 solver.cpp:181] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt
I0831 18:43:23.296985    37 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer mnist
I0831 18:43:23.296995    37 net.cpp:58] Initializing net from parameters:
name: "LeNet"
state {
phase: TEST
}
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
...

I0831 18:44:12.620506    37 solver.cpp:404]     Test net output #1: loss = 0.0299084 (* 1 = 0.0299084 loss)
I0831 18:44:12.624415    37 solver.cpp:228] Iteration 9000, loss = 0.011652
I0831 18:44:12.624441    37 solver.cpp:244]     Train net output #0: loss = 0.011652 (* 1 = 0.011652 loss)
I0831 18:44:12.624449    37 sgd_solver.cpp:106] Iteration 9000, lr = 0.00617924
I0831 18:44:13.055759    37 solver.cpp:228] Iteration 9100, loss = 0.0061008
I0831 18:44:13.055778    37 solver.cpp:244]     Train net output #0: loss = 0.0061008 (* 1 = 0.0061008 loss)
I0831 18:44:13.055800    37 sgd_solver.cpp:106] Iteration 9100, lr = 0.00615496
I0831 18:44:13.497696    37 solver.cpp:228] Iteration 9200, loss = 0.00277705
I0831 18:44:13.497715    37 solver.cpp:244]     Train net output #0: loss = 0.00277706 (* 1 = 0.00277706 loss)
I0831 18:44:13.497720    37 sgd_solver.cpp:106] Iteration 9200, lr = 0.0061309
I0831 18:44:13.941920    37 solver.cpp:228] Iteration 9300, loss = 0.0111398
I0831 18:44:13.941941    37 solver.cpp:244]     Train net output #0: loss = 0.0111398 (* 1 = 0.0111398 loss)
I0831 18:44:13.941946    37 sgd_solver.cpp:106] Iteration 9300, lr = 0.00610706
I0831 18:44:14.386647    37 solver.cpp:228] Iteration 9400, loss = 0.0179196
I0831 18:44:14.386667    37 solver.cpp:244]     Train net output #0: loss = 0.0179195 (* 1 = 0.0179195 loss)
I0831 18:44:14.386672    37 sgd_solver.cpp:106] Iteration 9400, lr = 0.00608343
I0831 18:44:14.828459    37 solver.cpp:337] Iteration 9500, Testing net (#0)
I0831 18:44:14.983165    37 solver.cpp:404]     Test net output #0: accuracy = 0.9884
I0831 18:44:14.983183    37 solver.cpp:404]     Test net output #1: loss = 0.0393952 (* 1 = 0.0393952 loss)
I0831 18:44:14.987198    37 solver.cpp:228] Iteration 9500, loss = 0.00496538
I0831 18:44:14.987211    37 solver.cpp:244]     Train net output #0: loss = 0.00496537 (* 1 = 0.00496537 loss)
I0831 18:44:14.987217    37 sgd_solver.cpp:106] Iteration 9500, lr = 0.00606002
I0831 18:44:15.433176    37 solver.cpp:228] Iteration 9600, loss = 0.00308157
I0831 18:44:15.433193    37 solver.cpp:244]     Train net output #0: loss = 0.00308157 (* 1 = 0.00308157 loss)
I0831 18:44:15.433200    37 sgd_solver.cpp:106] Iteration 9600, lr = 0.00603682
I0831 18:44:15.878787    37 solver.cpp:228] Iteration 9700, loss = 0.00220143
I0831 18:44:15.878806    37 solver.cpp:244]     Train net output #0: loss = 0.00220143 (* 1 = 0.00220143 loss)
I0831 18:44:15.878813    37 sgd_solver.cpp:106] Iteration 9700, lr = 0.00601382
I0831 18:44:16.321408    37 solver.cpp:228] Iteration 9800, loss = 0.0108761
I0831 18:44:16.321426    37 solver.cpp:244]     Train net output #0: loss = 0.0108761 (* 1 = 0.0108761 loss)
I0831 18:44:16.321432    37 sgd_solver.cpp:106] Iteration 9800, lr = 0.00599102
I0831 18:44:16.765200    37 solver.cpp:228] Iteration 9900, loss = 0.00478531
I0831 18:44:16.765219    37 solver.cpp:244]     Train net output #0: loss = 0.00478531 (* 1 = 0.00478531 loss)
I0831 18:44:16.765226    37 sgd_solver.cpp:106] Iteration 9900, lr = 0.00596843
I0831 18:44:17.204908    37 solver.cpp:454] Snapshotting to binary proto file examples/mnist/lenet_iter_10000.caffemodel
I0831 18:44:17.208767    37 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_10000.solverstate
I0831 18:44:17.211735    37 solver.cpp:317] Iteration 10000, loss = 0.0044067
I0831 18:44:17.211750    37 solver.cpp:337] Iteration 10000, Testing net (#0)
I0831 18:44:17.364528    37 solver.cpp:404]     Test net output #0: accuracy = 0.9902
I0831 18:44:17.364547    37 solver.cpp:404]     Test net output #1: loss = 0.0303562 (* 1 = 0.0303562 loss)
I0831 18:44:17.364552    37 solver.cpp:322] Optimization Done.
I0831 18:44:17.364555    37 caffe.cpp:254] Optimization Done.

Conclusion

In this article, we provided you with a guide on how to use AMD's ROCm framework with Docker container technology. This should serve as a good jumpstart for beginning your Deep Learning development on AMD's platform.

Docker has become an essential technology for containing the complexity of Deep Learning development. Deep Learning frameworks and tools have many dependencies. Leveraging Docker to isolate these dependencies within a Linux container leads not only to greater reliability and robustness but also to greater agility and flexibility. Many frameworks and tools are still emerging, so it is best practice to have a robust way to manage these disparate parts. Docker containers have become standard practice in Deep Learning, and this technology is well supported by AMD's ROCm framework.

1. import.io. Andrew Ng, Chief Scientist at Baidu, 2015. https://youtu.be/O0VN0pGgBZM.
2. Smith, Ryan. “AMD Announces Radeon Instinct: GPU Accelerators for Deep Learning, Coming In 2017.” AnandTech: Hardware News and Tech Reviews Since 1997, 12 Dec. 2016, http://www.anandtech.com/show/10905/amd-announces-radeon-instinct-deep-learning-2017/.
3. “ROCm. A New Era in GPU Computing.” ROCm, A New Era in Open GPU Computing, 16 Dec. 2016, https://rocm.github.io/index.html.
4. “RadeonOpenCompute/ROCR-Runtime.” GitHub, https://github.com/RadeonOpenCompute/ROCR-Runtime.
5. “ROCK-Kernel-Driver/README.md at Roc-1.6.0.” GitHub.com, 16 Nov. 2016, https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/roc-1.6.x/README.md.
6. “What Is Docker?” Docker - Build, Ship, and Run Any App, Anywhere, https://www.docker.com/what-docker.
7. “ROCm-Docker.” GitHub - ROCM-Docker, https://github.com/RadeonOpenCompute/ROCm-docker. Accessed 24 Mar. 2017.
8. “Get Docker for Ubuntu.” Docker - Build, Ship, and Run Any App, Anywhere, https://docs.docker.com/engine/installation/linux/ubuntu/. Accessed 27 Mar. 2017.
9. “ROCm-Docker.” GitHub - ROCM-Docker, https://github.com/RadeonOpenCompute/ROCm-docker. Accessed 24 Mar. 2017.
10. “hipCaffe: The HIP Port of Caffe.” GitHub.com, https://github.com/ROCmSoftwarePlatform/hipCaffe/blob/hip/README.ROCm.md. Accessed 01 Jun. 2017.