Category: nvidia

Processors for Deep Learning: Nvidia Ampere GPU, Tesla Dojo, AWS Inferentia, Cerebras

December 27, 2020May 2, 2023 · Leave a comment ·

The NVidia Volta-100 GPU released in Dec 2017 was the first microprocessor with dedicated cores purely for matrix computations called Tensor Cores. The Ampere-100 GPU released May’20 is its successor. Ampere has 84 Streaming Multiprocessors (SMs) with 4 Tensor Cores (TCs) each for a total of 336 TCs. Tensor Cores reduce the cycle time for matrix multiplications, operating on 4×4 matrices of 16bit floating point numbers. These GPUs are aimed at Deep Learning use cases which consist of a pipeline of matrix operations.

Here’s an article on choosing the right EC2 instance type for DL – https://towardsdatascience.com/choosing-the-right-gpu-for-deep-learning-on-aws-d69c157d8c86 (G4 for inferencing, P4 for training).

How did the need for specialized DL chips arise, and why are Tensors important in DL ? In math, we have Scalars and Vectors. Scalars are used for magnitude and Vectors encode magnitude and direction. To transform Vectors, one applies Linear Transformations in the form of Matrices. Matrices for Linear Transformations have EigenVectors and EigenValues which describe the invariants of the transformation. A Tensor in math and physics is a concept that exhibits certain types invariance during transformations. In 3 dimensions, a Stress Tensor has 9 components, which can be representated as a 3×3 matrix; under a change of basis the components of the tensor change however the tensor itself does not.

In Deep Learning applications a Tensor is basically a Matrix. The Generalized Matrix Multiplication (GEMM) operation, D=AxB+C, is at the heart of Deep Learning, and Tensor Cores are designed to speed these up.

In Deep Learning, multilinear maps are interleaved with non-linear transforms to model arbitrary transforms of input to output and a specific model is arrived by a process of error reduction on training of actual data. This PyTorch Deep Learning page is an excellent resource to transition from traditional linear algebra to deep learning software – https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html .

Tesla Dojo is planned to build a processor/computer dedicated for Deep Learning to train on vast amounts of video data. Launched on Tesla AI Day, Aug’20 2021, a video at https://www.youtube.com/watch?v=DSw3IwsgNnc

AWS Inferentia is a chip for deep learning inferencing, with its four Neuron Cores.

AWS Trainium is an ML chip for training.

Generally speaking the desire in deep learning community is to have simpler processing units in larger numbers.

Updates: Cerebras announced a chip which can handle neural networks with 120 trillion parameters, with 850,000 AI optimized cores per chip.

SambaNova, Anton, Cerebras and Graphcore presentations are at https://www.anandtech.com/show/16908/hot-chips-2021-live-blog-machine-learning-graphcore-cerebras-sambanova-anton

SambaNova is building 400,000 AI cores per chip.

NVIDIA GPU	AWS Instance	Azure Instance
M60	G3
T4	G4	NVv4
V100	P3	NCv4
A100	P4, P4d	NDv4

https://lambdalabs.com/blog/nvidia-a100-vs-v100-benchmarks

Omnisci GPU based columnar database

May 24, 2019July 1, 2019 · Leave a comment ·

Omnisci is a columnar database that reads a column into GPU memory, in compressed form, allowing for interactive queries on the data. A single gpu can load 10million to 50million rows of data and allows interactive querying without indexing. A demo was shown at the GTC keynote this year, by Aaron Williams. He gave a talk on vehicle analytics that I attended last month.

In the vehicle telemetry demo, they obtain vehicle telemetry data from an F1 game that has data output as UDP, 10s of thousands of packets a second – take the binary data off of UDP, and convert it to json and use it as a proxy for real telemetry data. The webserver refreshes every 3-4 seconds. The use case is analysis of increasing amounts of vehicle sensor data as discussed in this video and described in the detailed Omnisci blog post here.

Programming of the blocks is done through the UX with StreamSets which is an open source tool to build data pipelines – a cool demo of streamsets for creating pipelines including kafka is here.

The vehicle analytics demo pipeline consisted of UDP to Kafka, Kafka to JSON, then JSON to OmniSci via pymapd . Kafka serves as a message broker and also for playback of data.

Based on the the GPU loaded data, the database allows queries and stats on different vehicles that are running.

The entire system runs in the cloud on a VM supporting Nvidia GPUs, and can also be run on a local GPU box.

NVidia Tiny Linux Kernel and TrustZone

June 10, 2017January 8, 2018 · Leave a comment ·

The NVidia Tiny Linux Kernel (TLK), is 23K lines of BSD licensed code, which supports multi-threading, IPC and thread scheduling and implements TrustZone features of a Trusted Execution Environment (TEE). It is based on the Little Kernel for embedded devices.

The TEE is an isolated environment that runs in parallel with an operating system, providing security. It is more secure than an OS and offers a higher level of functionality than a SE, using a hybrid approach that utilizes both hardware and software to protect data. Trusted applications running in a TEE have access to the full power of a device’s main processor and memory, while hardware isolation protects these from user installed apps running in a main operating system. Software and cryptographic isolation inside the TEE protect the trusted applications contained within from each other. A paper describing the separation with alternatives for virtualizing the TEE appeared at https://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:vtz.pdf

TrustZone was developed by Trusted Foundations Software which was acquired by Gemalto. Giesecke & Devrient developed a rival implementation named Mobicore. In April 2012 ARM, Gemalto and Giesecke & Devrient combined their TrustZone portfolios into a joint venture Trustonic, which was the first to qualify a GlobalPlatform-compliant TEE product in 2013.

A comparison with other hardware based security technologies is found here. Whereas a TPM is exclusively for security functions and does not have access to the CPU, the TEE does have such access.

Attacks against TrustZone on Android are described in this blackhat talk. With a TEE exploit, “avc_has_perm” can be modified to bypass SELinux for Android. By the way, Access Vectors in SELinux are described in this wonderful link. “avc_has_perm” is a function to check the AccessVectors allows permission.

NVidia Volta GPU vs Google TPU

May 14, 2017May 30, 2017 · 1 Comment ·

A Graphics Processing Unit (GPU) allows multiple hardware processors to act in parallel on a single array of data, allowing a divide and conquer approach to large computational tasks such as video frame rendering, image recognition, and various types of mathematical analysis including convolutional neural networks (CNNs). The GPU is typically placed on a larger chip which includes CPU(s) to direct data to the GPUs. This trend is making supercomputing tasks much cheaper than before .

Tesla_v100 is a System on Chip (SoC) which contains the Volta GPU which contains TensorCores, designed specifically for accelerating deep learning, by accelerating the matrix operation D = A*B+C, each input being a 4×4 matrix. More on Volta at https://devblogs.nvidia.com/parallelforall/inside-volta/ . It is helpful to read the architecture of the previous Pascal P100 chip which contains the GP100 GPU, described here – http://wccftech.com/nvidia-pascal-specs/ . Background on why NVidia builds chips this way (SIMD < SIMT < SMT) is here – http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .

Volta GV100 GPU = 6 GraphicsProcessingClusters x 7 TextureProcessingCluster/GraphicsProcessingCluster x 2 StreamingMultiprocessor/TextureProcessingCluster x (64 FP32Units +64 INT32Units + 32 FP64Units +8 TensorCoreUnits +4 TextureUnits)

The FP32 cores are referred to as CUDA cores, which means 84×64 = 5376 CUDA cores per Volta GPU. The Tesla V100 which is the first product (SoC) to use the Volta GPU uses only 80 of the 84 SMs, or 80×64=5120 cores. The frequency of the chip is 1.455Ghz. The Fused-Multiply-Add (FMA) instruction does a multiplication and addition in a single instruction (a*b+c), resulting in 2 FP operations per instruction, giving a FLOPS of 1.455*2*5120=14.9 Tera FLOPs due to the CUDA cores alone. The TensorCores do a 3d Multiply-and-Add with 7x4x4+4×4=128 FP ops/cycle, for a total of 1.455*80*8*128 = 120TFLOPS for deep learning apps.

3D matrix multiplication 3d_matrix_multiply

The Volta GPU uses a 12nm manufacturing process, down from 16nm for Pascal. For comparison the Jetson TX1 claims 1TFLOPS and the TX2 twice that (or same performance with half the power of TX1). The VOLTA will be available on Azure, AWS and platforms such as Facebook. Several applications in Amazon. MS Cognitive toolkit will use it.

For comparison, the Google TPU runs at 700Mhz, and is manufactured with a 28nm process. Instead of FP operations, it uses quantization to integers and a systolic array approach to minimize the watts per matrix multiplication, and optimizes for neural network calculations instead of more general GPU operations. The TPU uses a design based on an array of 256×256 multiply-accumulate (MAC) units, resulting in 92 Tera Integer ops/second.

Given that NVidia is targeting additional use cases such as computer vision and graphics rendering along with neural network use cases, this approach would not make sense.

Miscellaneous conference notes:

Nvidia DGX-1. “Personal Supercomputer” for $69000 was announced. This contains eight Tesla_v100 accelerators connected over NVLink.

Tesla. FHHL, Full Height, Half Length. Inferencing. Volta is ideal for inferencing, not just training. Also for data centers. Power and cooling use 40% of the datacenter.

As AI data floods the data centers, Volta can replace 500 CPUswith 33 GPUs.
Nvidia GPU cloud. Download the container of your choice. First hybrid deep learning cloud network. Nvidia.com/cloud . Private beta extended to gtc attendees.

Containerization with GPU support. Host has the right NVidia driver. Docker from GPU cloud adapts to the host version. Single docker. Nvidiadocker tool to initialize the drivers.

Moores law comes to an end. Need AI at the edge, far from the data center. Need it to be compact and cheap.

Jetson board had a Tegra SoC chip which has 6cpus and a Pascal GPU.

AWS Device Shadows vs GE Digital Twins. Different focus. Availabaility+connectivity vs operational efficiency. Manufacturing perspective vs operational perspective. Locomotive may be simulated when disconnected .

DeepInstinct analysed malware data using convolutional neural networks on GPUs, to better detect malware and its variations.

Omni.ai – deep learning for time series data to detect anomalous conditions on sensors on the field such as pressure in a gas pipeline.

GANS applications to various problems – will be refined in next few years.

GeForce 960 video card. Older but popular card for gamers, used the Maxwell GPU, which is older than Pascal GPU.

Cooperative Groups in Cuda9. More on Cuda9.

Neural Network Training and Inferencing on Nvidia

September 14, 2016October 8, 2016 · Leave a comment ·

Nvidia just announced the Tesla P40 and P4 cards for Neural network inferencing applications. A review is at http://www.anandtech.com/show/10675/nvidia-announces-tesla-p40-tesla-p4. Comparing it to the Tesla P100 released earlier this year, the P40 is targeted to inferencing applications. Whereas the P100 was targeted to more demanding training phase of neural networks. P4o comes with the TensorRT (real time) library for fast inferencing (e.g. real time detection of objects).

Some of the best solutions of hard problems in machine learning come from neural networks, whether in computer vision, voice recognition, games such as Go and other domains. Nvidia and other hardware kits are accelerating AI applications with these releases.

What happens if the neural network draws a bad inference, in a critical AI application ? Bad inferences have been discussed in the literature, for example in the paper: Intriguing properties of neural networks.

There are ways to minimize bad inferences in the training phase, but not foolproof – in fact the paper above mentions that bad inferences are low probabalility yet dense.

Level 5 autonomous driving is where the vehicle can handle unknown terrain. Most current systems are targeting Level 2 or 3 autonomy. The Tesla Model S’ Autopilot is Level 2.

An answer is to pair it with a regular program that checks for certain safety constraints. This would make it safer, but this alone is likely insufficient either for achieving Level 5 operations, or for providing safely for them.

Embedded Neural Nets

August 29, 2016June 24, 2017 · Leave a comment ·

A key problem for embedded neural networks is reduction of size and power consumption.

The hardware on which the neural net runs on can be a dedicated chip, an FPGA, a GPU or a CPU. Each of these consumes about 10x the power of the previous choice. But in terms of upfront cost, the dedicated chip costs the highest, the CPU the lowest. An NVidia whitepaper compares GPU with CPU on speed and power consumption. (It discusses key neural networks like AlexNet. The AlexNet was a breakthrough in 2012 showing a neural network to be superior to other image recognition approaches by a wide margin).

Reducing the size of the neural network also reduces its power consumption. For NN size reduction, pruning of the weak connections in the net was proposed in “Learning both Weights and Connections for Efficient Neural Networks” by Song Han and team at NVidia and Stanford. This achieved a roughly 10x reduction in network size without loss of accuracy. Further work in “Deep Compression” achieved a 35x reduction.

Today I attended a talk on SqueezeNet by Forrest Iandola. His team at Berkeley modified (squeezed) the original architecture, then applied the Deep Compression technique above to achieve a 461x size reduction over the original, to 0.5Mb. This makes it feasible for mobile applications. This paper also references the V.Badrinarayan’s work on SegNet – a different NN architecture, discussed in a talk earlier this year.

The Nervana acquisition by Intel earlier this year was for a low power GPU like SOC chip with very high memory bandwidth.

Nvidia nvGraph and Tesla P100 GPU

April 9, 2016October 5, 2016 · 1 Comment ·

“The latest version of NVIDIA’s parallel computing platform gives developers direct access to powerful new Pascal features, including unified memory and NVLink. Also included in this release is a new graph analytics library — nvGRAPH — which can be used for robotic path planning, cyber security and logistics analysis, expanding the application of GPU acceleration into the realm of big data analytics.”

nvGRAPH supports three widely-used algorithms:

Page Rank is most famously used in search engines, and also used in social network analysis, recommendation systems, and for novel uses in natural science when studying the relationship between proteins and in ecological networks.

Single Source Shortest Path is used to identify the fastest path from A to B through a road network, and can also be used for a optimizing a wide range of other logistics problems.

Single Source Widest Path is used in domains like IP traffic routing and traffic-sensitive path planning.

In addition, the nvGRAPH semiring Sparse Matrix Vector Multiplication (SPMV) operations can be used to build a wide range of innovative graph traversal algorithms.

A paper on how to represent cyber attacks as graphs – http://csis.gmu.edu/noel/pubs/2015_IEEE_HST.pdf references the CAPEC, which is a collection of vulnerabilities for such a graph study.

A graphdb represents data using nodes, edges and properties.

graphdatabase_propertygraph

It allows querying with a graph traversal language such as Gremlin. Blazegraph offers a GPU accelerated graphdb. Here’s a graph of machine learning papers from Research Front Maps.

rfmap1

MapD, a GPU accelerated db recently got funding from Nvidia and others.

Secure Machinery

On the evolution of security and intelligent machinery