Nvidia DGX1

A subset of the vision group has access to two Nvidia DGX1 supercomputers. To determine whether you have access to one or both of these machines, type “id | grep -c 4104” on any managed machine: it prints 1 if you have access and 0 if you do not. These machines run various deep learning frameworks as Docker containers. When running a container, please name it username_identifier (example: --name tim_theanotest3) and map your EECS uidNumber to the running container (example: --user $UID).
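The access check and naming convention above can be sketched as a small shell script. This is an illustrative sketch, not a site-provided tool: the helper names has_dgx_access and container_name are hypothetical, IMAGE is a placeholder rather than a real image name, and the script only prints the resulting command instead of running it. It uses “id -u” in place of $UID because $UID is not set in every shell.

```shell
#!/bin/sh
# Group ID taken from the text above; adjust if your site uses a different one.
DGX_GID=4104

# Succeeds if the space-separated group-ID list in $1 contains the DGX group.
# Comparing whole fields avoids false positives such as 14104, which a
# plain "grep -c 4104" over the full id output would also count.
has_dgx_access() {
    for gid in $1; do
        [ "$gid" = "$DGX_GID" ] && return 0
    done
    return 1
}

# Builds the required container name: username_identifier.
container_name() {
    printf '%s_%s\n' "$1" "$2"
}

if has_dgx_access "$(id -G)"; then
    echo "access: yes"
else
    echo "access: no"
fi

# Print (rather than run) the command; IMAGE is a placeholder.
echo "docker run --name $(container_name "$(id -un)" theanotest3) --user $(id -u) IMAGE"
```

If the machines use a wrapper such as nvidia-docker for GPU access, the same --name and --user options should carry over, but check that locally before relying on it.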

dgx1-1

dgx1-1 runs the original Ubuntu 12.02 release of the Nvidia Deep Learning framework, for which little documentation is available.

dgx1-2

dgx1-2 was upgraded on 29th January 2017 and is running NVIDIA GPU Optimized Deep Learning Frameworks Container Release 16.11. The standard (nvcr.io/nvidia) container images are heavily optimized for the P100 GPU and NVLink, and are built on Ubuntu 14.04 with CUDA 8.0.54, cuDNN 5.1.10 and NCCL 1.6.1. The images provided include the following:

  • Caffe (based on NVIDIA/Caffe 0.16)
    • Supports fp32 arithmetic and storage, fp16 arithmetic and storage
    • [NEW] Supports full fp16 training of convolutional neural nets (e.g., popular vision networks) – all layers except the Data, DetectNetTransformation, Softmax, SoftmaxWithLoss, and Accuracy layers
    • Optimized multi-GPU training
    • Seamless NCCL integration
    • [IMPROVED] Significant improvements to fp32 performance scaling compared with earlier releases (NVIDIA/Caffe 0.15)
    • Prototxt (text file), C++, and Python front-ends
  • Microsoft Cognitive Toolkit / CNTK (based on 2.0.beta2)
    • Supports fp32 arithmetic and storage
    • Optimized multi-GPU training
    • [NEW] NCCL integration for improved multi-GPU scaling
    • Supports quantized (1-bit) communication
    • Supports recurrent neural networks
    • Supports cuDNN RNN layers (note: requires explicit use by model script)
    • BrainScript (text file), Python, and C++ front-ends
  • NVIDIA DIGITS (based on 4.0)
    • Web-based graphical user interface for DL training
    • Includes Caffe 16.11
    • Includes Torch 16.10
  • TensorFlow (based on 0.11.0rc2)
    • Supports fp32 arithmetic and storage
    • Supports multi-GPU training
    • [BETA] NCCL integration for improved multi-GPU scaling (note: requires explicit use by model script)
    • Supports recurrent neural networks
    • [NEW] Support for cuDNN RNN layers (note: requires explicit use by model script)
    • [IMPROVED] Better I/O throughput via libjpeg-turbo, fast iDCT decoding
    • Python front-end
  • Theano (based on 0.8.2)
    • Supports fp32 arithmetic and storage
    • Supports recurrent neural networks
    • Python front-end
  • Torch (based on Torch7 from 08 Nov 2016)
    • Supports fp32 arithmetic and storage; some support for fp16 arithmetic and/or storage
    • [NEW] Pseudo-fp16 support in cunn layers
    • Optimized multi-GPU training
    • Seamless NCCL integration (opt-in)
    • [IMPROVED] Better control over workspace memory usage via support for the new cudnnFindEx() routine; enable in the model script by adding “cudnn.useFindEx = true”
    • [BETA] New caching memory allocator reduces overheads and improves ease of use; enable via environment variable: “export THC_CACHING_ALLOCATOR=1”
    • Supports recurrent neural networks
    • Supports cuDNN RNN layers
    • [IMPROVED] Tuned RNN performance
    • Lua front-end
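For the Torch caching allocator mentioned above, the environment variable has to be set inside the container, not only in your login shell. One way to do that is Docker's -e option, sketched below. The helper name torch_run_cmd is hypothetical and TORCH_IMAGE is a placeholder rather than a verified image name; the script prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: prints a docker run command that passes the
# caching-allocator switch into a Torch container via -e.
#   $1 = container identifier (appended to your username)
#   $2 = image name (placeholder here, not a verified image)
torch_run_cmd() {
    printf 'docker run --name %s_%s --user %s -e THC_CACHING_ALLOCATOR=1 %s\n' \
        "$(id -un)" "$1" "$(id -u)" "$2"
}

torch_run_cmd rnntest TORCH_IMAGE
```

The cudnn.useFindEx setting is different: it is a Lua statement and belongs in the model script itself, not on the command line.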

Known issues:

Caffe

When running the container as a non-root user, a message about Matplotlib’s font cache may appear. This message is safe to ignore and will be removed in a future release.

CNTK

Bug 1845813: OpenCV is installed from source without disabling the IEEE 1394 camera driver, which could lead to a spurious runtime error (“libdc1394 error: Failed to initialize libdc1394”). This message is safe to ignore and will be removed in a future release.

DIGITS

When running the container as a non-root user, a message about Matplotlib’s font cache may appear. This message is safe to ignore and will be removed in a future release.