News

DDN Introduces Purpose-Built AI Storage Solution for NVIDIA DGX-1

Oct. 5, 2018

By: Michael Feldman

DataDirect Networks (DDN) has launched a trio of specialized storage appliances that are designed to support the data needs of NVIDIA DGX-1 servers running artificial intelligence and deep learning workloads.

The new products, known as AI200, AI400, and AI7990, are based on a parallel storage platform called DDN A³I (Accelerated, Any-Scale, AI), which has been specifically built for feeding DGX-1 boxes. A single DGX-1 is equipped with eight Tesla V100 GPUs, representing a peak petaflop of deep learning performance. That kind of compute power gives the NVIDIA server a particularly large appetite for data, especially the large volumes of training data needed to build neural networks. However, according to DDN, the platform is able to support the entire AI/DL lifecycle, from data ingestion, through training, verification, and inference.

To speed that dataflow along, the A³I platform uses multiple ports of EDR InfiniBand or 100 Gigabit Ethernet to connect the storage appliance to one or more DGX-1 servers. To make this go even faster, the platform employs Remote Direct Memory Access (RDMA) networking, which delivers data directly to the GPU and bypasses the overhead of CPU mediation. For InfiniBand, RDMA is supported natively, while for Ethernet, the A³I system uses RoCE (RDMA over Converged Ethernet). The goal here is to make sure the GPUs are well fed, so that they spend as much time as possible crunching the data rather than waiting for it.
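For a sense of what "keeping the GPUs fed" looks like from the application side, here is a minimal sketch at the framework level. This is generic PyTorch, not DDN's software: background prefetch workers plus pinned-memory, non-blocking copies overlap data delivery with GPU compute, loosely analogous to what RDMA does at the network layer. The dataset here is synthetic.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Synthetic stand-in for a training set served from fast storage.
    dataset = TensorDataset(torch.randn(512, 3, 64, 64),
                            torch.randint(0, 10, (512,)))

    # num_workers prefetches batches in the background; pin_memory stages
    # them in page-locked host memory for DMA transfers to the GPU.
    loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

    for images, labels in loader:
        # non_blocking=True lets the host-to-device copy overlap GPU compute.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        # ... forward/backward pass would run here ...

if __name__ == "__main__":
    main()
```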

According to the datasheet for the AI200, AI400, and AI7990 appliances: “Performance testing on the DDN A³I architecture has been conducted with all widely-used DL frameworks (TensorFlow, Horovod, Torch, PyTorch, NVIDIA® TensorRT™, Caffe, Caffe2, CNTK, MXNET and Theano).” Using the A³I intelligent client, the company says, containerized applications can engage the full capabilities of the data infrastructure, with the AI servers consistently achieving full GPU saturation for DL workloads.

The AI200 is an all-SSD, NVMe-flavored appliance that can deliver up to one million IOPS, 20 GB/sec of sequential reads, and 16 GB/sec of sequential writes. It uses four EDR InfiniBand or 100GbE ports to communicate with one or more DGX-1 servers. The AI200 comes in three capacities: 40TB, 80TB, and 150TB.

The AI400 is a higher-end version of the AI200, providing up to three million IOPS, 40 GB/sec of reads, and 32 GB/sec of writes. The doubling of the AI200 throughput is a result of supporting twice as many EDR InfiniBand or 100GbE network links (eight). Maximum capacity on the AI400 appears to be 360TB, although the datasheet description is a little fuzzy on this aspect.
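A quick back-of-envelope check makes that scaling plausible, assuming nominal 100 Gb/s per link and ignoring encoding and protocol overhead:

```python
LINK_GBPS = 100           # EDR InfiniBand and 100GbE both signal at 100 Gb/s
link_gbs = LINK_GBPS / 8  # = 12.5 GB/s per link, per direction

for name, links, read_gbs in [("AI200", 4, 20), ("AI400", 8, 40)]:
    wire = links * link_gbs
    print(f"{name}: {links} links -> {wire:.0f} GB/s on the wire; "
          f"{read_gbs} GB/s delivered reads ({read_gbs / wire:.0%} of wire rate)")
```

Delivered read bandwidth works out to the same fraction of raw link bandwidth in both cases, consistent with the link count, rather than the media, being the knob DDN turned.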

The AI7990 is a hybrid SSD/HDD appliance, supporting up to 5.4 PB. A single appliance can deliver up to 700 thousand IOPS, 20 GB/sec of sequential reads, and 16 GB/sec of sequential writes. Like the AI200, it employs four 100G network links. The larger storage capacity is aimed at use cases, such as natural language processing, where access to larger data libraries is needed, rather than just the hot training dataset.

To integrate these appliances with the DGX-1, DDN has developed client-side EXAScaler software that runs on the NVIDIA server to provide parallel file access to the storage. A single DDN appliance can be shared across multiple DGX-1 servers, while at the same time, multiple appliances can be used to increase storage capacity and I/O performance.
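To illustrate the shared-namespace idea, here is a minimal sketch, using a hypothetical mount point and data layout and plain POSIX reads rather than DDN's EXAScaler client, in which worker processes stand in for DGX-1 nodes streaming separate shards from one mount point concurrently:

```python
from multiprocessing import Pool
from pathlib import Path

MOUNT = Path("/mnt/a3i")  # hypothetical mount point for the shared store

def read_shard(path):
    # Stream one shard; a parallel file system serves all clients at once.
    n = 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):  # 1 MiB reads
            n += len(chunk)
    return path.name, n

if __name__ == "__main__":
    # Hypothetical layout: one file per training shard in a shared namespace.
    shards = sorted(MOUNT.glob("train/shard-*.bin"))
    with Pool(processes=8) as pool:
        for name, size in pool.imap_unordered(read_shard, shards):
            print(f"read {name}: {size} bytes")
```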

The integrated solution, consisting of storage appliances, GPU servers, network switches, and the relevant software stack, can be purchased from a handful of authorized resellers. These include Meadowgate Technologies, Microway, and Groupware Technology in the US; and GDEP Solutions, XENON, and E4 Computer Engineering elsewhere.