NVidia Volta GPU vs Google TPU

A Graphics Processing Unit (GPU) allows many hardware processors to act in parallel on a single array of data, enabling a divide-and-conquer approach to large computational tasks such as video frame rendering, image recognition, and various types of mathematical analysis including convolutional neural networks (CNNs). The GPU is typically placed on a larger chip that includes CPU(s) to direct data to the GPU. This trend is making supercomputing tasks much cheaper than before.

The Tesla V100 is a System on Chip (SoC) containing the Volta GPU, which in turn contains TensorCores designed specifically to accelerate deep learning by speeding up the matrix operation D = A*B + C, where each input is a 4×4 matrix. More on Volta at https://devblogs.nvidia.com/parallelforall/inside-volta/ . It is helpful to first read about the architecture of the previous Pascal P100 chip, which contains the GP100 GPU, described here – http://wccftech.com/nvidia-pascal-specs/ . Background on why NVidia builds chips this way (SIMD < SIMT < SMT) is here – http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .
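
To make the TensorCore primitive concrete, here is a minimal numpy sketch of the D = A*B + C operation on 4×4 tiles, with FP16 inputs and FP32 accumulation. This is an illustration of the math only, not actual CUDA/WMMA code:

```python
import numpy as np

# TensorCore primitive: D = A*B + C on 4x4 tiles.
# Inputs A, B are FP16; C and the accumulator D are FP32.
A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)

# Multiply in higher precision and accumulate in FP32, as the hardware does.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape)  # (4, 4); larger matrix multiplies are tiled from this primitive
```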

Volta GV100 GPU = 6 GraphicsProcessingClusters × 7 TextureProcessingClusters/GraphicsProcessingCluster × 2 StreamingMultiprocessors/TextureProcessingCluster × (64 FP32 units + 64 INT32 units + 32 FP64 units + 8 TensorCore units + 4 Texture units)

The FP32 cores are referred to as CUDA cores, which means 84×64 = 5376 CUDA cores per Volta GPU. The Tesla V100, the first product (SoC) to use the Volta GPU, enables only 80 of the 84 SMs, or 80×64 = 5120 cores. The chip runs at 1.455 GHz. The Fused Multiply-Add (FMA) instruction does a multiplication and an addition in a single instruction (a*b+c), i.e. 2 FP operations per instruction, giving 1.455×2×5120 = 14.9 TeraFLOPS from the CUDA cores alone. The TensorCores do a 3D multiply-and-add on 4×4 matrices at 7×4×4 + 4×4 = 128 FP ops/cycle per core, for a total of 1.455×80×8×128 = 120 TFLOPS for deep learning apps.
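
A quick back-of-the-envelope check of the numbers above (all constants taken from the text):

```python
# Peak-FLOPS arithmetic for the Tesla V100.
CLOCK_GHZ = 1.455            # boost clock
SMS = 80                     # SMs enabled on Tesla V100 (of 84 on GV100)
CUDA_CORES = SMS * 64        # 64 FP32 cores per SM -> 5120
FMA_OPS = 2                  # multiply + add per FMA instruction

fp32_tflops = CLOCK_GHZ * CUDA_CORES * FMA_OPS / 1e3
print(fp32_tflops)           # ~14.9 TFLOPS from the CUDA cores

TENSOR_CORES_PER_SM = 8
OPS_PER_TENSOR_CORE = 7 * 4 * 4 + 4 * 4   # 128 FP ops/cycle for 4x4 A*B+C

tensor_tflops = CLOCK_GHZ * SMS * TENSOR_CORES_PER_SM * OPS_PER_TENSOR_CORE / 1e3
print(tensor_tflops)         # ~119 TFLOPS, i.e. the quoted ~120 TFLOPS
```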

[Figure: 3D matrix multiplication]

The Volta GPU uses a 12 nm manufacturing process, down from 16 nm for Pascal. For comparison, the Jetson TX1 claims 1 TFLOPS and the TX2 twice that (or the same performance at half the power of the TX1). Volta will be available on Azure, AWS, and platforms such as Facebook; several Amazon applications and the Microsoft Cognitive Toolkit will use it.

For comparison, the Google TPU runs at 700 MHz and is manufactured with a 28 nm process. Instead of FP operations it uses quantization to integers and a systolic-array design to minimize the watts per matrix multiplication, optimizing for neural network calculations rather than more general GPU operations. The TPU is built around an array of 256×256 multiply-accumulate (MAC) units, giving 92 Tera integer ops/second.
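
Here is a small sketch of the two ideas in that paragraph: integer quantization of a matrix multiply, and the peak-throughput arithmetic of the MAC array. The quantization scheme shown (symmetric int8 with int32 accumulation) is an illustrative assumption, not Google's exact scheme:

```python
import numpy as np

def quantize(x, scale):
    """Map float values to int8 using a simple symmetric scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

A = np.random.randn(4, 4).astype(np.float32)
B = np.random.randn(4, 4).astype(np.float32)
sa, sb = np.abs(A).max() / 127, np.abs(B).max() / 127

# int8 multiplies with int32 accumulation, then rescale back to float.
C_q = quantize(A, sa).astype(np.int32) @ quantize(B, sb).astype(np.int32)
C_approx = C_q * (sa * sb)
print(np.max(np.abs(C_approx - A @ B)))   # small quantization error

# Peak throughput of the 256x256 MAC array at 700 MHz:
MACS, CLOCK_HZ, OPS_PER_MAC = 256 * 256, 700e6, 2
print(MACS * CLOCK_HZ * OPS_PER_MAC / 1e12, "Tera ops/s")  # ~91.8, i.e. ~92
```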

Given that NVidia is targeting additional use cases such as computer vision and graphics rendering along with neural networks, the TPU's specialized approach would not make sense for it.

Miscellaneous conference notes:

Nvidia DGX-1: a “Personal Supercomputer” for $69,000 was announced. It contains eight Tesla V100 accelerators connected over NVLink.

Tesla: FHHL (Full Height, Half Length) form factor. Volta is ideal for inferencing, not just training, and also for data centers, where power and cooling take up 40% of costs.

As AI data floods the data centers, Volta can replace 500 CPUs with 33 GPUs.

Nvidia GPU Cloud: download the container of your choice. The first hybrid deep learning cloud network (Nvidia.com/cloud). Private beta extended to GTC attendees.

Containerization with GPU support: the host has the right NVidia driver, and Docker images from the GPU cloud adapt to the host version. A single Docker image; the nvidia-docker tool initializes the drivers.

Moore's law comes to an end. We need AI at the edge, far from the data center, and it needs to be compact and cheap.

The Jetson board has a Tegra SoC, which contains 6 CPU cores and a Pascal GPU.

AWS Device Shadows vs GE Digital Twins: different focus. Availability+connectivity vs operational efficiency; a manufacturing perspective vs an operational perspective. A locomotive may be simulated while disconnected.

DeepInstinct analysed malware data using convolutional neural networks on GPUs to better detect malware and its variations.

Omni.ai – deep learning on time-series data to detect anomalous conditions in field sensors, such as pressure in a gas pipeline.

Applications of GANs to various problems – these will be refined in the next few years.

GeForce GTX 960 video card: an older but popular card for gamers, using the Maxwell GPU, which predates the Pascal GPU.

Cooperative Groups in CUDA 9. More on CUDA 9.

Weeping Angel

Sales of hardware camera blockers and similar devices should increase following the Weeping Angel disclosure. Wikileaks detailed how the CIA and MI5 hacked Samsung TVs to silently monitor remote communications. It is interesting to read for the level of technical detail: https://wikileaks.org/vault7/document/EXTENDING_User_Guide/EXTENDING_User_Guide.pdf . The attack was called ‘Weeping Angel’, a term borrowed from Doctor Who.

Other such schemes are described at https://wikileaks.org/vault7/releases/#Weeping%20Angel , including an iPhone implant to get data from your phone – https://wikileaks.org/vault7/document/NightSkies_v1_2_User_Guide/NightSkies_v1_2_User_Guide.pdf .

ICS Threat Landscape

Kaspersky Labs released a report on Industrial Control System (ICS) security trends. The data was reportedly gathered using the Kaspersky Security Network (KSN), a distributed antivirus network.

https://ics-cert.kaspersky.com/reports/2017/03/28/threat-landscape-for-industrial-automation-systems-in-the-second-half-of-2016/

The report is here – https://ics-cert.kaspersky.com/wp-content/uploads/sites/6/2017/03/KL-ICS-CERT_H2-2016_report_FINAL_EN.pdf

The rising trend is partly because the isolation strategy currently followed for ICS network security is no longer effective at protecting industrial networks.

RSA World 2017

I was struck by the large number of vendors offering visibility into various types of network traffic as a key selling point: network monitors, industrial network gateways, SSL inspection, traffic in VM farms, between containers, and on mobile.

More expected were the vendors offering reduced visibility – via encryption, TPMs, Secure Enclaves, mobile remote desktop, one-way links, encrypted storage in the cloud, etc. The variety of these solutions is also remarkable.

Neil deGrasse Tyson's closing talk was so different yet strangely related to these topics: the universe slowly making itself visible to us through light from distant places, with insights from physicists and mathematicians building up to the experiments that recently confirmed gravitational waves – making the invisible, visible. This tenuous connection between security and physics left me misty-eyed. What an amazing time to be living in.

Industrial Robot Safety Systems

Robot safety systems must be concerned with graceful degradation in the face of component failures, bad inputs, and extreme operating conditions. With the increasing complexity and prevalence of robots, one can expect these requirements to grow.

This document, “Robot System Safety”, describes the following safety features of Kuka robots:
— Restricted envelope
— EMERGENCY STOP
— Enabling switches
— Guard interlock

An interlock system, described here for general control systems, is a mechanism for guaranteeing that an undesired combination of states does not occur – e.g. a robot moving when the cell door is open, or an elevator moving when its door is open. The combination of the controlled stop with the guard interlock is described as follows:

“The robot controller features a two–channel safety input, to which the guard interlock can be connected. In the automatic modes, the opening of the guard connected to this input causes a controlled stop, with power to the drives being maintained in order to ensure this controlled stop. The power is only disconnected once the robot has come to a standstill. Motion in Automatic mode is prevented until the guard connected to this input is closed. This input has no effect in Test mode. The guard must be designed in such a way that it is only possible to acknowledge the stop from outside the safeguarded space.”
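
As a minimal sketch of the guard-interlock behavior quoted above, here is the control flow in Python. The class and method names are hypothetical; in practice this logic lives in redundant, safety-rated controller hardware, not application code:

```python
class GuardInterlock:
    """Toy model of the two-channel guard-interlock behavior."""

    def __init__(self):
        self.mode = "AUTOMATIC"        # "AUTOMATIC" or "TEST"
        self.drive_power = True
        self.robot_moving = True

    def on_guard_opened(self):
        if self.mode == "TEST":
            return                     # the input has no effect in Test mode
        # Controlled stop: drive power stays on so braking is controlled...
        while self.robot_moving:
            self.robot_moving = self.decelerate_one_step()
        # ...and power is disconnected only once the robot is at a standstill.
        self.drive_power = False

    def decelerate_one_step(self):
        return False                   # placeholder for one servo-controlled braking step

    def motion_permitted(self, guard_closed, acknowledged_outside_cell):
        # Motion in Automatic mode is prevented until the guard is closed,
        # and the stop must be acknowledged from outside the safeguarded space.
        return guard_closed and acknowledged_outside_cell
```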

This Stanford Linear Accelerator Center (SLAC) paper describes the advantages of PLCs for interlock design as:

— flexible system configuration due to modular hardware and software
— regularly scheduled background tests of PLC system and sensitive I/O
— comprehensive system self-tests
— intelligent fault diagnostics simplify troubleshooting
— easy reconfiguration of the interlock logic
— no mechanical wear and tear
— improved security due to logic encapsulation in firmware

The SLAC use case is to lock the doors unless (a) the power has been shut off, or (b) the power is on but there is an explicit bypass for hot maintenance. They implement it on a Siemens PLC using statement lists (vs ladder logic or control-system flowchart programming).
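
The door-lock rule reduces to a small boolean expression. A sketch in Python with illustrative names (the real system is a Siemens PLC statement list, not Python):

```python
def doors_may_unlock(power_off: bool, bypass_active: bool) -> bool:
    # Unlock only if power is off, or an explicit hot-maintenance
    # bypass is in effect while power is on.
    return power_off or bypass_active

assert doors_may_unlock(power_off=True, bypass_active=False)
assert doors_may_unlock(power_off=False, bypass_active=True)
assert not doors_may_unlock(power_off=False, bypass_active=False)
```
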
GE has a number of industrial interlocks for various safety functions – http://www.clrwtr.com/PDF/GE-Security/Sentrol-Catalog.pdf

A very useful article on interlocking devices – http://machinerysafety101.com/2012/06/01/interlocking-devices-the-good-the-bad-and-the-ugly/

Awesome IOT hacks repost

Mirroring from https://github.com/nebgnahz/awesome-iot-hackss for easy access.

A curated list of hacks in IoT space so that researchers and industrial products can address the security vulnerabilities (hopefully).

Thingbots

RFID

Home Automation

Connected Doorbell

Hub

Smart Coffee

Wearable

Smart Plug

Cameras

Traffic Lights

Automobiles

Airplanes

Light Bulbs

Locks

Smart Scale

Smart Meters

Pacemaker

Thermostats

Fridge

Media Player & TV

Toilet

Toys

Software Defined Networking Security

Software Defined Networking (SDN) seeks to centralize control of a large network. The abstractions around computer networking evolved from connecting nodes via switches, to applications running on top via the OSI model, to controllers that manage the network. The controller abstractions were relatively weak – this had been the domain of telcos and ISPs, and as the networks of software-intensive companies like Google approached the size of telco networks, they moved to reinvent the controller stack. Traffic engineering and security, which had been done in disparate regions, were centralized to better achieve economies of scale. Google adopted OpenFlow for this, developed by Nicira, which was soon after acquired by VMware; Cisco internal discussions concluded that such a centralization wave would cut Cisco revenues in half, so they spun out Insieme Networks for SDN capabilities and quickly acquired it back. This has morphed into the APIC offering.
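
A toy sketch of the controller/switch split described above: the central controller decides policy and installs match-action rules, while the "dumb" switch only does table lookups. The names and structure are illustrative, not the actual OpenFlow API:

```python
class Controller:
    """Centralized policy brain (illustrative)."""
    def decide(self, match):
        src, dst = match
        return "drop" if dst == "10.0.0.99" else "forward"

class Switch:
    """Data-plane element: just a flow table of match -> action."""
    def __init__(self):
        self.flow_table = {}

    def handle_packet(self, match, controller):
        if match in self.flow_table:
            return self.flow_table[match]
        # Table miss: punt to the controller, which decides and installs
        # a rule so future packets stay in the fast data plane.
        action = controller.decide(match)
        self.flow_table[match] = action
        return action

sw, ctl = Switch(), Controller()
print(sw.handle_packet(("10.0.0.1", "10.0.0.2"), ctl))   # forward
print(sw.handle_packet(("10.0.0.1", "10.0.0.99"), ctl))  # drop
```

The security concern in the next paragraph follows directly from this structure: whoever controls the Controller object controls every switch's behavior.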

The centralization wave is somewhat at odds with the security and resilience of networks, given their inherently distributed and heterogeneous nature. Distributed systems provide availability, part of the security CIA triad, and for many systems availability trumps security. Centralized controllers would become attractive targets for compromise. This is despite the intention of SDN, as envisioned by Nicira founder M. Casado, to have security as its cornerstone, as described here. Casado's problem statement is interesting: “We don't have a ubiquitous and horizontal security layer that provides both context and isolation. Where do we normally put security controls? We put it in one of two places. We might put it in the physical infrastructure, which is great because you have isolation. If I have ACLs [access control lists] or a firewall or an IDS [intrusion detection system], I put it in a separate box and I put it away from the applications so that the attack surface is pretty small and it's totally isolated… you have isolation, but you have no context. … Another place we put security is in the end host, an application or operating system. This suffers from the opposite problem. You have all the context that you need — applications, users, the data being accessed — but you have absolutely no isolation.” The centralization imperative comes from the need to isolate and minimize the trusted computing base.

In the short term there may be some advantage to be gained from complexity reduction through centralized administration, but the recommendation of dumb switches responding to a tightly controlled central brain goes against the tenets of compartmentalization of risk; if such networks are put into practice widely, they can result in failures that are catastrophic instead of isolated.

The goal should instead be a distributed system that is also responsive.