The Minsky boxes and PowerAI
EECS have been provided with a pair of Power8-based servers from IBM on a try-before-you-buy basis. Each server has 256GB of RAM, two Power8 CPUs (a huge but odd 77 cores per CPU) and four Tesla P100 GPUs with 16GB of RAM per GPU, connected to the CPUs using NVLink. The two boxes are named marvin (available to researchers in the vision research group) and gloria (available to all).
Access is via SSH key-based authentication only. Please note – especially if you are an Anaconda user – that these are Power8, not x86, machines: code compiled for the latter will not run without recompilation. Also, both machines have 500GB of scratch space – do use that in preference to your home drive for ephemeral (interim) data, as it will be faster.
Specific frameworks available
- caffe-bvlc – Berkeley Vision and Learning Center (BVLC) upstream Caffe, v1.0.0rc3
- caffe-ibm – IBM optimized version of BVLC Caffe, v1.0.0rc3
- caffe-nv – NVIDIA fork of Caffe, v0.15.13
- chainer – Chainer, v1.18.0
- digits – DIGITS, v5.0.0-rc.1
- tensorflow – Google TensorFlow, v1.0.0
- theano – Theano, v0.8.2
- torch – Torch, v7
Getting started with MLDL Frameworks
General setup
Each framework package provides a shell script to simplify environmental setup.
You can either update your shell rc file (e.g. .bashrc) to source the desired setup script. For example:
source /opt/DL/<framework>/bin/<framework>-activate
or simply execute the activation script in a similar manner to Python’s VirtualEnv after login:
[09:42]tim@gloria(3):~[0]$ source /opt/DL/<framework>/bin/<framework>-activate
Sadly – and unlike VirtualEnv – there is no deactivate command, so you will need to prune your environment settings or log out and back in to change frameworks. Each framework also provides a test script to verify basic function:
[09:42]tim@gloria(3):~[0]$ <framework>-test
Please try the test script before reporting any errors.
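Since there is no deactivate script, it can be handy to check which framework (if any) is already active in your session. A minimal check, on the assumption that the activation scripts work by adding /opt/DL paths to your environment (PATH, LD_LIBRARY_PATH and so on):
echo "$PATH" | tr ':' '\n' | grep /opt/DL    # lists any /opt/DL bin directories currently on your PATH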
Getting started with Caffe
Caffe alternatives
Packages are provided for upstream BVLC Caffe (/opt/DL/caffe-bvlc), IBM optimized BVLC Caffe (/opt/DL/caffe-ibm), and NVIDIA's Caffe (/opt/DL/caffe-nv). The system default is set to the IBM optimized version, so
[09:42]tim@gloria(3):~[0]$ source /opt/DL/caffe/bin/caffe-activate
will enable that. To use a specific version, source its specific activation script (we would be more than happy to receive comparative benchmarks for your jobs). Do be aware that the NVIDIA Caffe version is NCCL-enabled, so you will need to activate both:
[09:42]tim@gloria(3):~[0]$ source /opt/DL/nccl/bin/nccl-activate
[09:42]tim@gloria(3):~[0]$ source /opt/DL/caffe-nv/bin/caffe-activate
Attempting to activate multiple Caffe packages in a single login session will cause unpredictable behavior.
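Once a single Caffe package is active, a quick sanity check (our suggestion rather than anything from the PowerAI documentation) is to ask Caffe to report on the first GPU:
caffe device_query -gpu 0    # should print the details of a Tesla P100 if activation worked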
Caffe samples and examples
Each Caffe package includes example scripts and sample models, etc. A script is provided to copy the sample content into a specified directory:
[09:42]tim@gloria(3):~[0]$ caffe-install-samples <somedir>
More info
Visit Caffe’s website (http://caffe.berkeleyvision.org/) for tutorials and example programs that you can run to get started.
Here are links to a couple of the example programs:
- LeNet MNIST Tutorial – Train a neural network to understand handwritten digits (a minimal run-through using the copied samples is sketched after this list)
- CIFAR-10 tutorial – Train a convolutional neural network to classify small images
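As a minimal run-through of the LeNet MNIST example, assuming the content copied by caffe-install-samples follows the standard BVLC layout (the script names below come from upstream Caffe, so adjust if the copied tree differs):
cd <somedir>
./data/mnist/get_mnist.sh          # download the MNIST data
./examples/mnist/create_mnist.sh   # convert it into LMDB databases
./examples/mnist/train_lenet.sh    # train LeNet on the GPU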
Getting started with Chainer
The Chainer home page at http://chainer.org/ hosts documentation for the Chainer project, including a Quick Start example.
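A quick way to confirm the package is picked up in your session (the activation path simply follows the /opt/DL/<framework> pattern described above):
source /opt/DL/chainer/bin/chainer-activate
python -c "import chainer; print(chainer.__version__)"   # should report 1.18.0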
Getting started with TensorFlow
The TensorFlow homepage (https://www.tensorflow.org/) has a variety of information, including Tutorials, How Tos, and a Getting Started guide.
Additional tutorials and examples are available from the community.
API changes and sample models
Note that the TensorFlow API changed in version 1.0, so programs written for earlier versions of TensorFlow may need to be updated. The TensorFlow v1.0.0 release notes describe the changes and link to a conversion tool; see https://github.com/tensorflow/tensorflow/releases/tag/v1.0.0
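Before deciding whether your scripts need converting, it is worth confirming which API level your session actually picks up, for example:
source /opt/DL/tensorflow/bin/tensorflow-activate
python -c "import tensorflow as tf; print(tf.__version__)"   # expect 1.0.0 on these boxes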
The TensorFlow team provides example models on GitHub at https://github.com/tensorflow/models. Some of the example models may not have been updated for the new API.
For the inception/imagenet_train example:
- For TensorFlow 1.0.0, commit ef84162c from the fork repo https://github.com/ibmsoe/tensorflow-models (i.e. branch inception-imagenet-1.0) should work.
- For TensorFlow 0.12.0, commit 91c7b91f from the upstream repo https://github.com/tensorflow/models should work.
The example may print API warnings as it starts up, but should run to completion.
Additional features
The PowerAI TensorFlow packages include TensorBoard. See: https://www.tensorflow.org/get_started/summaries_and_tensorboard
The TensorFlow 1.0 package also includes NCCL support and experimental support for XLA JIT compilation. See: https://www.tensorflow.org/versions/master/experimental/xla/. As with NVIDIA Caffe above, you will need to source the NCCL library module as well as TensorFlow to get this working.
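A minimal sketch of such a session (the log directory handed to TensorBoard is just a placeholder for wherever your training script writes its summaries):
source /opt/DL/nccl/bin/nccl-activate
source /opt/DL/tensorflow/bin/tensorflow-activate
tensorboard --logdir=<your-summary-dir>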
Getting started with Torch
The Torch Cheatsheet contains lots of info for people new to Torch, including tutorials and examples.
The Torch project has a demos repository at https://github.com/torch/demos
Tutorials can be found at https://github.com/torch/tutorials
Visit Torch’s website for the latest from Torch.
Torch samples and examples
The Torch package includes example scripts and sample models. A script is provided to copy the sample content into a specified directory:
[10:45]tim@gloria(0):~[0]$ torch-install-samples <somedir>
Among these are the Imagenet examples from https://github.com/soumith/imagenet-multiGPU.torch with a few modifications.
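As a rough sketch of launching that ImageNet example, note that the option names below are taken from the upstream repository and may differ slightly in the modified copy shipped here; <somedir> and the dataset path are placeholders:
source /opt/DL/torch/bin/torch-activate
cd <somedir>                               # the directory containing the ImageNet example's main.lua
th main.lua -data <imagenet-folder> -nGPU 4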
Extending Torch with additional Lua rocks
The Torch package includes several Lua rocks useful for creating Deep Learning applications. Additional Lua rocks can be installed locally to extend functionality. For example, a rock providing NCCL bindings can be installed as follows:
[10:45]tim@gloria(0):~[0]$ source /opt/DL/torch/bin/torch-activate
[10:45]tim@gloria(0):~[0]$ source /opt/DL/nccl/bin/nccl-activate
[10:47]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=all "https://raw.githubusercontent.com/ngimel/nccl.torch/master/nccl-scm-1.rockspec"
...
nccl scm-1 is now built and installed in /home/user/.luarocks/ (license: BSD)
[10:47]tim@gloria(0):~[0]$ luajit
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
JIT: OFF
> require 'torch'
> require 'nccl'
>
Getting started with Theano
Visit Theano's website for documentation and the latest from the Theano project.
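A quick check that the package is picked up and, using Theano 0.8.x's old-style device flag, that it can see a GPU:
source /opt/DL/theano/bin/theano-activate
THEANO_FLAGS=device=gpu,floatX=float32 python -c "import theano; print(theano.__version__)"
The import should log a "Using gpu device 0: Tesla P100..." style message if the GPU backend is working.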
Getting started with DIGITS
The first time it is run, digits-activate will create a .digits subdirectory in your home directory containing the DIGITS jobs directory, as well as the digits.log file.
Multiple instances of the DIGITS server can be run at once, including by different users, but each user may need to set a different network port number to avoid conflicts; I would suggest using your uid.
To start the DIGITS server on the default port (5000):
[10:50]tim@gloria(0):~[0]$ digits-devserver
To start the DIGITS server with your uid as the listening port:
[10:50]tim@gloria(0):~[0]$ digits-devserver -p $UID
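Once started, a quick way to confirm the server is up (this assumes you are on the same machine and used your uid as the port; otherwise browse to http://<hostname>:<port>/ from your desktop):
curl -sI http://localhost:$UID/ | head -n 1    # should print an HTTP status line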
NVIDIA’s DIGITS site has more information about DIGITS.
The DIGITS Getting Started guide describes how to train a network model to classify the MNIST hand-written digits dataset.
Additional DIGITS examples are available at https://github.com/NVIDIA/DIGITS/tree/master/examples
Using Torch with DIGITS
Using Torch with DIGITS requires additional packages that are not part of this PowerAI release. Torch can be made to work with DIGITS by installing the additional Lua rocks needed for DIGITS' Torch support:
[10:58]tim@gloria(0):~[0]$ source /opt/DL/torch/bin/torch-activate
[10:58]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=order tds
[10:58]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=order totem
[10:58]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=order "https://raw.github.com/deepmind/torch-hdf5/master/hdf5-0-0.rockspec"
[10:58]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=order "https://raw.github.com/Neopallium/lua-pb/master/lua-pb-scm-0.rockspec"
[10:58]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=order lightningmdb 0.9.18.1-1 LMDB_INCDIR=/usr/include LMDB_LIBDIR=/usr/lib/powerpc64le-linux-gnu
[10:58]tim@gloria(0):~[0]$ luarocks install --local --deps-mode=order "https://raw.githubusercontent.com/ngimel/nccl.torch/master/nccl-scm-1.rockspec"
Legal Notices
© Copyright IBM Corporation 2017
IBM, the IBM logo, ibm.com, POWER, Power, POWER8, and Power systems are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
The TensorFlow package includes code from the BoringSSL project. The following notices may apply:
This product includes software developed by the OpenSSL Project for
use in the OpenSSL Toolkit. (http://www.openssl.org/)
This product includes cryptographic software written by Eric Young
(eay@cryptsoft.com)