Nvidia DGX1 – EECS Support Pages

A subset of the vision group has access to two Nvidia DGX1 supercomputers. To find out whether you have access to one or both of these machines, type “id | grep -c 4104” on any managed machine: the command prints 1 if you have access and 0 if you do not. These machines run various learning frameworks as Docker containers. When running a container, please name it username_identifier (example: --name tim_theanotest3) and map your EECS uidNumber to the running container (example: --user $UID).
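The access check and the naming convention above can be sketched as two small shell helpers. This is only an illustration under stated assumptions: it assumes 4104 is a numeric group ID, the helper names has_gid and dgx_run_cmd are made up for this example, and IMAGE is a placeholder rather than a real image name. The second helper prints the docker command instead of running it, so it is safe to try on any machine.

```shell
#!/bin/sh
# has_gid LIST GID: succeed only if GID is a whole entry in the
# space-separated LIST, so that 14104 is not mistaken for 4104.
has_gid() {
    # $1 is left unquoted on purpose so the shell splits it on spaces.
    printf '%s\n' $1 | grep -qx "$2"
}

# dgx_run_cmd USER UIDNUMBER IDENTIFIER IMAGE: print, but do not execute,
# a docker run command named USER_IDENTIFIER and mapped to UIDNUMBER.
dgx_run_cmd() {
    printf 'docker run --name %s_%s --user %s %s\n' "$1" "$3" "$2" "$4"
}

# Check the current user's groups (assumes 4104 is the relevant group ID).
if has_gid "$(id -G)" 4104; then
    echo "access: yes"
else
    echo "access: no"
fi

# Show the command for a container called <username>_theanotest3.
# IMAGE is a placeholder, not a real image name.
dgx_run_cmd "$(id -un)" "$(id -u)" theanotest3 IMAGE
```

Using “id -u” rather than $UID keeps the sketch portable to shells that do not define $UID.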

dgx1-1

dgx1-1 runs the original Ubuntu 12.02 release of the Nvidia Deep Learning framework, for which little documentation is available.

dgx1-2

dgx1-2 was upgraded on 29th January 2017 and runs the NVIDIA GPU Optimized Deep Learning Frameworks Container Release 16.11. The standard (nvcr.io/nvidia) container images are heavily optimized for P100 GPUs and NVLink and are built on Ubuntu 14.04 with CUDA 8.0.54, cuDNN 5.1.10 and NCCL 1.6.1. The images provided include the following:

Caffe (based on NVIDIA/Caffe 0.16)
- Supports fp32 arithmetic and storage, fp16 arithmetic and storage
- [NEW] Supports full fp16 training of convolutional neural nets (e.g., popular vision networks) – all layers except the Data, DetectNetTransformation, Softmax, SoftmaxWithLoss, and Accuracy layers
- Optimized multi-GPU training
- Seamless NCCL integration
- [IMPROVED] Significant improvements to fp32 performance scaling compared with earlier releases (NVIDIA/Caffe 0.15)
- Prototxt (text file), C++, and Python frontends
Microsoft Cognitive Toolkit / CNTK (based on 2.0.beta2)
- Supports fp32 arithmetic and storage
- Optimized multi-GPU training
- [NEW] NCCL integration for improved multi-GPU scaling
- Supports quantized (1-bit) communication
- Supports recurrent neural networks
- Supports cuDNN RNN layers (note: requires explicit use by model script)
- BrainScript (text file), Python, and C++ front-ends
NVIDIA DIGITS (based on 4.0)
- Web-based graphical user interface for DL training
- Includes Caffe 16.11
- Includes Torch 16.10
TensorFlow (based on 0.11.0rc2)
- Supports fp32 arithmetic and storage
- Supports multi-GPU training
- [BETA] NCCL integration for improved multi-GPU scaling (note: requires explicit use by model script)
- Supports recurrent neural networks
- [NEW] Support for cuDNN RNN layers (note: requires explicit use by model script)
- [IMPROVED] Better I/O throughput via libjpeg-turbo, fast iDCT decoding
- Python front-end
Theano (based on 0.8.2)
- Supports fp32 arithmetic and storage
- Supports recurrent neural networks
- Python front-end
Torch (based on Torch7 from 08 Nov 2016)
- Supports fp32 arithmetic and storage; some support for fp16 arithmetic and/or storage
- [NEW] Pseudo-fp16 support in cunn layers
- Optimized multi-GPU training
- Seamless NCCL integration (opt-in)
- [IMPROVED] Better control over workspace memory usage via support for the new cudnnFindEx() routine; enable in model script by adding “cudnn.useFindEx = true”
- [BETA] New caching memory allocator reduces overheads and improves ease of use; enable via environment variable: “export THC_CACHING_ALLOCATOR=1”
- Supports recurrent neural networks
- Supports cuDNN RNN layers
- [IMPROVED] Tuned RNN performance
- Lua frontend
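
The Torch caching allocator listed above is enabled through an environment variable, and for a containerised job that variable has to be set inside the container. One possible way, sketched here, is docker's -e option. The helper name torch_run_cmd, the container name tim_torchtest1 and the IMAGE placeholder are assumptions made for this example. The helper only prints the command and does not start a container.

```shell
#!/bin/sh
# torch_run_cmd NAME IMAGE: print, but do not execute, a docker run command
# that sets THC_CACHING_ALLOCATOR=1 inside the container via -e and maps
# the caller's numeric user ID with --user.
torch_run_cmd() {
    printf 'docker run --name %s --user %s -e THC_CACHING_ALLOCATOR=1 %s\n' \
        "$1" "$(id -u)" "$2"
}

# IMAGE is a placeholder, not a real image name.
torch_run_cmd tim_torchtest1 IMAGE
```

Running “export THC_CACHING_ALLOCATOR=1” in a shell inside the container before starting Torch should have the same effect.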

Known issues:

Caffe

When running the container as a non-root user, a message about Matplotlib’s font cache may appear. This message is safe to ignore and will be removed in a future release.

CNTK

Bug 1845813: OpenCV is installed from source without disabling the IEEE 1394 camera driver, which could lead to a spurious runtime error (“libdc1394 error: Failed to initialize libdc1394”). This message is safe to ignore and will be removed in a future release.
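
If the spurious message clutters job logs, it can be filtered out while leaving every other line untouched. The following is a minimal sketch, not an official workaround; the function name drop_libdc1394_noise is made up for this example.

```shell
#!/bin/sh
# drop_libdc1394_noise: copy stdin to stdout, dropping only lines that
# contain the spurious libdc1394 message quoted above (fixed-string match).
drop_libdc1394_noise() {
    grep -vF 'libdc1394 error: Failed to initialize libdc1394'
}

# Demonstration with two input lines; only the second one is printed.
printf '%s\n' \
    'libdc1394 error: Failed to initialize libdc1394' \
    'some other log line' | drop_libdc1394_noise
```

In practice the filter would sit at the end of a pipeline such as “your_command 2>&1 | drop_libdc1394_noise”, where your_command stands for whatever is being run. Note that grep exits with a non-zero status when no lines remain after filtering.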

DIGITS

When running the container as a non-root user, a message about Matplotlib’s font cache may appear. This message is safe to ignore and will be removed in a future release.