
Reinforcement learning

An Agent is in an Environment. a) The Agent reads Input (State) from the Environment. b) The Agent produces Output (Action) that affects its State relative to the Environment. c) The Agent receives a reward (or feedback) for the Output produced. With the reward/feedback it receives, it learns to produce better Output for a given Input.
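The a), b), c) loop above can be sketched in a few lines of Python. This is only a toy under stated assumptions: `ToyEnvironment`, `TableAgent` and `run` are hypothetical names, the environment has two states and rewards the action that equals the state, and the agent learns a running-average reward table rather than a neural network.

```python
import random

class ToyEnvironment:
    """Hypothetical environment: state is 0 or 1, the rewarded action equals the state."""
    def __init__(self, rng):
        self.rng = rng
        self.state = rng.randint(0, 1)

    def observe(self):
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 1)  # move to a new random state
        return reward

class TableAgent:
    """Keeps a running average of the reward seen for each (state, action) pair."""
    def __init__(self, rng, n_states=2, n_actions=2, explore=0.1):
        self.rng = rng
        self.explore = explore
        self.value = [[0.0] * n_actions for _ in range(n_states)]
        self.count = [[0] * n_actions for _ in range(n_states)]

    def act(self, state):
        if self.rng.random() < self.explore:          # occasionally try a random action
            return self.rng.randrange(len(self.value[state]))
        row = self.value[state]
        return row.index(max(row))                    # otherwise use the best known action

    def learn(self, state, action, reward):
        self.count[state][action] += 1
        n = self.count[state][action]
        self.value[state][action] += (reward - self.value[state][action]) / n

def run(steps=2000, seed=0):
    rng = random.Random(seed)
    env, agent = ToyEnvironment(rng), TableAgent(rng)
    for _ in range(steps):
        state = env.observe()               # a) read Input (State)
        action = agent.act(state)           # b) produce Output (Action)
        reward = env.step(action)           # c) receive reward
        agent.learn(state, action, reward)  # improve Output for this Input
    return agent
```

After `run()`, the table should rate the matching action higher than the non-matching one in both states.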

Where do neural networks come in ?

Optimal control theory considers control of a dynamical system such that an objective function is optimized (with applications including stability of rockets and helicopters). In optimal control theory, Pontryagin’s principle gives a necessary condition for solving the optimal control problem: the control input should be chosen to extremize the control Hamiltonian (minimize or maximize it, depending on the sign convention used). This “control Hamiltonian” is inspired by the classical Hamiltonian and the principle of least action. The goal is to find an optimal control policy function u∗(t) and, with it, an optimal trajectory of the state variable x∗(t); by Pontryagin’s maximum principle, the optimal control is the argument that maximizes the Hamiltonian at each point along that trajectory.

Derivatives are needed for the continuous optimizations. Deep learning models are capable of performing continuous linear and non-linear transformations, which in turn can compute derivatives and integrals. They can be trained automatically using real-world inputs, outputs and feedback. So a neural network can provide a system for sophisticated feedback-based non-linear optimization of the map from Input space to Output space.

The above could be accomplished by a feedforward neural network that is trained with a feedback (reward). Additionally a recurrent neural network could encode a memory into the system by making reference to previous states (likely with higher training and convergence costs).

Model-free reinforcement learning does not explicitly learn a model of the environment.

Manifestations of RL: Udacity self-driving course – lane detection. Karpathy’s RL blog post has an explanation of a network structure that can produce policies in a malleable manner, called policy gradients.

Practical issues in Reinforcement Learning –

Raw inputs vs model inputs: There is the problem of mapping inputs from real-world to the actual inputs to a computer algorithm. Volume/quality of information – high vs low requirement.

Exploitation vs exploration dilemma: Simple exploration methods are the most practical. With probability ε, exploration is chosen, and the action is chosen uniformly at random. With probability 1 − ε, exploitation is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). ε is usually a fixed parameter but can be adjusted either according to a schedule (making the agent explore progressively less), or adaptively based on heuristics.
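The ε-greedy rule described above is short enough to write out. This is a minimal sketch: `epsilon_greedy` and `decayed_epsilon` are hypothetical helper names, and the linear schedule with its default values is just one assumed way to make the agent explore progressively less.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated long-term action values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))      # explore: uniform over all actions
    best = max(q_values)
    ties = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(ties)                      # exploit: break ties uniformly at random

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Assumed linear schedule: epsilon falls from start to end, then stays at end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

For example, `epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)` always exploits and returns index 1.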

AWS DeepRacer. Allows exploration of RL. Simplifies the mapping of camera input to computer input, so one can focus more on the reward function and deep learning aspects. The car has a set of possible actions (change heading, change speed). The RL task is to predict the actions based on the inputs.

What are some of the strategies applied to winning DeepRacer ?

Reward function input parameters –

“DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning” –

RL is not a fit for every problem. Alternative approaches with better explainability and determinism include behavior trees, vectorization/VectorNet, …

DeepMind says reinforcement learning is ‘enough’ to reach general AI

Richard Sutton and Andrew Barto’s book on RL: An introduction.

This paper explores incorporating an Attention mechanism with Reinforcement learning – Reinforcement Learning with Attention that Works: A Self-Supervised Approach. A video review of the ‘Attention is all you need’ paper is here, the idea being to replace an RNN with a mechanism to selectively track a few relevant things.

Multi agent Deep Deterministic Policy Gradients – cooperation between agents. Agents learn a centralized critic based on the observations and actions of all agents.

Multi-vehicle RL for multi-lane driving.

Reinforcement learning in chip design

Deep learning is being applied to combinatorial optimization problems. A very intriguing talk by Anna Goldie discussed an application of RL to chip design that cuts down the time for layout optimization and which in turn enables optimizing of the chip design for a target software stack in simulation before the chip goes to production. Here’s a paper – graph placement methodology for fast chip design.

A snippet on how the research direction evolved to a learning problem.

Chip floorplanning as a learning problem

“The underlying problem is a high-dimensional contextual bandits problem but, as in prior work, we have chosen to reformulate it as a sequential Markov decision process (MDP), because this allows us to more easily incorporate the problem constraints as described below. Our MDP consists of four key elements:
(1) States encode information about the partial placement, including the netlist (adjacency matrix), node features (width, height, type), edge features (number of connections), current node (macro) to be placed, and metadata of the netlist graph (routing allocations, total number of wires, macros and standard cell clusters).
(2) Actions are all possible locations (grid cells of the chip canvas) onto which the current macro can be placed without violating any hard constraints on density or blockages.
(3) State transitions define the probability distribution over next states, given a state and an action.
(4) Rewards are 0 for all actions except the last action, where the reward is a negative weighted sum of proxy wirelength, congestion and density, as described below.

We train a policy (an RL agent) modelled by a neural network that, through repeated episodes (sequences of states, actions and rewards), learns to take actions that will maximize cumulative reward (see Fig. 1).
We use proximal policy optimization (PPO) to update the parameters of the policy network, given the cumulative reward for each placement.”
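The sparse reward in element (4) can be illustrated with a small sketch. The function names and the weight values `w_congestion` and `w_density` are assumptions for illustration, not the values or code used in the paper.

```python
def final_reward(wirelength, congestion, density, w_congestion=0.01, w_density=0.01):
    """Negative weighted sum of the proxy costs (weights here are hypothetical)."""
    return -(wirelength + w_congestion * congestion + w_density * density)

def episode_rewards(num_macros, wirelength, congestion, density):
    """One reward per placement action: zero everywhere except the last action."""
    rewards = [0.0] * (num_macros - 1)
    rewards.append(final_reward(wirelength, congestion, density))
    return rewards

def cumulative_reward(rewards):
    """With zero intermediate rewards, the episode return is just the final reward."""
    return sum(rewards)
```

Because only the last action is rewarded, the agent gets no signal about an individual macro placement until the whole placement has been evaluated.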

Their diagram:

“An embedding layer encodes information about the netlist adjacency, node features and the current macro to be placed. The policy and value networks then output a probability distribution over available grid cells and an estimate of the expected reward for the current placement, respectively. id: identification number; fc: fully connected layer; de-conv: deconvolution layer”

A graph placement methodology for fast chip design | Nature

“Fig. 1 | Overview of our method and training regimen. In each training iteration, the RL agent places macros one at a time (actions, states and rewards are denoted by ai, si and ri, respectively). Once all macros are placed, the standard cells are placed using a force-directed method. The intermediate rewards are zero. The reward at the end of each iteration is calculated as a linear combination of the approximate wirelength, congestion and density, and is provided as feedback to the agent to optimize its parameters for the next iteration.”

The references mention a number of applications of ML to chip design. A project exploring these is at

ML for Forecasting

In this paper – “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks”, the authors discuss a method for learning a global model from several individual time series.

Let’s break down some aspects of the approach and design.

“In probabilistic forecasting one is interested in the full predictive distribution, not just a single best realization, to be used in downstream decision making systems.”

The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).
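A short simulation makes the definition concrete. This is a sketch of a classical AR(p) process, not of DeepAR itself; `simulate_ar` is a hypothetical helper, and Gaussian noise and zero start-up values are assumptions.

```python
import random

def simulate_ar(coeffs, n, noise_std=1.0, seed=0):
    """Simulate x_t = c_1*x_(t-1) + ... + c_p*x_(t-p) + noise_t for n steps."""
    rng = random.Random(seed)
    p = len(coeffs)
    series = [0.0] * p                       # assumed start-up values
    for _ in range(n):
        recent = series[-p:][::-1]           # most recent value first
        linear_part = sum(c * x for c, x in zip(coeffs, recent))
        series.append(linear_part + rng.gauss(0.0, noise_std))  # add the stochastic term
    return series[p:]
```

For example, `simulate_ar([0.8], 100)` produces an AR(1) series in which each value is 0.8 times the previous one plus noise.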

A Recurrent Neural Network (RNN) is a neural network with an infinite impulse response; RNNs are used for speech recognition, handwriting recognition and similar tasks involving sequences.

Long Short-Term Memory (LSTM) is a type of RNN that came about to solve the problem of vanishing gradients in previous RNN designs. An LSTM cell can process data sequentially and keep its hidden state through time.

A covariate is an independent random variable, with which the target random variable is assumed to have some covariance.

The approach has distinct features described in this snippet

“In addition to providing better forecast accuracy than previous methods, our approach has a number of key advantages compared to classical approaches and other global methods: (i) As the model learns seasonal behavior and dependencies on given covariates across time series, minimal manual feature engineering is needed to capture complex, group-dependent behavior; (ii) DeepAR makes probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all sub-ranges in the prediction horizon; (iii) By learning from similar items, our method is able to provide forecasts for items with little or no history at all, a case where traditional single-item forecasting methods fail; (iv) Our approach does not assume Gaussian noise, but can incorporate a wide range of likelihood functions, allowing the user to choose one that is appropriate for the statistical properties of the data.
Points (i) and (iii) are what set DeepAR apart from classical forecasting approaches, while (ii) and (iv) pertain to producing accurate, calibrated forecast distributions learned from the historical behavior of all of the time series jointly, which is not addressed by other global methods (see Sec. 2). Such probabilistic forecasts are of crucial importance in many applications, as they—in contrast to point forecasts—enable optimal decision making under uncertainty by minimizing risk functions, i.e. expectations of some loss function under the forecast distribution.”
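Point (ii) can be illustrated with plain Python: given Monte Carlo sample paths over the prediction horizon, a quantile for any sub-range is computed from the per-path totals. This is a sketch, not DeepAR code; the helper names are hypothetical, the sample paths would come from the trained model, and the simple rank-based quantile is an assumed choice.

```python
def empirical_quantile(values, q):
    """Simple rank-based empirical quantile (one of several possible definitions)."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

def subrange_quantile(sample_paths, start, end, q):
    """Quantile of the total over time steps [start, end) across all sample paths."""
    totals = [sum(path[start:end]) for path in sample_paths]
    return empirical_quantile(totals, q)

# Four hypothetical sample paths over a 3-step horizon.
paths = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
high_estimate = subrange_quantile(paths, 0, 2, 0.9)  # 0.9 quantile of the first two steps
```

Summing within each path before taking the quantile is what keeps sub-range estimates consistent; adding up per-step quantiles would not give the same answer in general.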

Facebook Prophet is an open-source library for forecasting –

ARMA – AutoRegressive Moving Average Estimator

ARIMA estimator – AutoRegressive Integrated Moving Average is a generalization of ARMA and can better handle non-stationarity in a time series.
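The “Integrated” part of ARIMA refers to differencing the series before fitting the ARMA terms, which is how it copes with trends. A minimal sketch, with `difference` as a hypothetical helper name:

```python
def difference(series, order=1):
    """Apply first differences `order` times: y_t = x_t - x_(t-1)."""
    for _ in range(order):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series

linear_trend = [1, 3, 5, 7]
flat = difference(linear_trend)                  # a linear trend becomes a constant
quadratic = [1, 4, 9, 16]
flat_twice = difference(quadratic, order=2)      # a quadratic trend needs two passes
```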

Bitcoin market cap reaches $1T

Bitcoin reached a $1T market cap last month.

A Bitcoin halving event is scheduled to take place every 210,000 blocks. This cuts the reward for mining a block in half. Three Bitcoin halvings have taken place so far, in 2012, 2016 and 2020. The next halving is predicted to occur in 2024. The corresponding block reward went from 50 BTC in 2009 to 25 in ‘12, 12.5 in ‘16 and 6.25 in ‘20, and is set to drop to 3.125 in ‘24.

The rate of production of bitcoin over time is shown below. Mining will continue until 21 million BTC are created.
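The halving schedule and the 21 million cap follow from simple integer arithmetic. A minimal sketch, working in satoshis (1 BTC = 100,000,000 satoshis) and halving by integer right shift; the function names are hypothetical.

```python
HALVING_INTERVAL = 210_000                 # blocks between halvings
SATS_PER_BTC = 100_000_000
INITIAL_REWARD_SATS = 50 * SATS_PER_BTC    # 50 BTC at block 0

def block_reward_sats(height):
    """Block reward in satoshis at a given block height."""
    halvings = height // HALVING_INTERVAL
    return INITIAL_REWARD_SATS >> halvings  # integer halving, eventually reaches 0

def total_supply_sats():
    """Sum the rewards of every halving era until the reward rounds down to zero."""
    total, halvings = 0, 0
    while (INITIAL_REWARD_SATS >> halvings) > 0:
        total += HALVING_INTERVAL * (INITIAL_REWARD_SATS >> halvings)
        halvings += 1
    return total
```

Because rewards are rounded down to whole satoshis, the total comes out slightly under 21 million BTC.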

VeChain is a blockchain proposal/implementation for supply chain tracking.

EdgeChain is an architecture for placement of applications on the edge amongst multiple resources from multiple providers. It is built on Vechain for Mobile and Edge Computing use cases.

Disaster Recovery: Understanding and designing for RPO and RTO

Let’s take a disaster scenario where a system loses its data-in-transit (i.e. not yet persisted) at a certain point in time, and some time after this point a recovery process kicks in, which restores the system back to normal functioning.

Recovery Point Objective refers to the amount of tolerable data loss measured in time. It can be measured in time because the in-transit data arrives at some bounded maximum velocity, so bounding the time bounds the amount of data that can be lost. This time figure, the RPO, is used to determine how frequently the data must be persisted and replicated. An RPO of 10 minutes implies the data must be backed up at least every 10 minutes. If there’s a crash, the system can then be restored to a point not more than 10 minutes prior to the time of the crash. RPO determines the frequency of backups, snapshots or transaction logs.

Recovery Time Objective refers to the maximum acceptable amount of time to restore a system to normal behavior after a disaster has happened. This includes restoration of all infrastructure components that provide a service, not just the restoration of data.

Lower RPO/RTO targets mean higher cost.
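A rough sketch of how these two objectives can be checked against a backup plan. The function and parameter names are hypothetical, and the model is deliberately simplified: it assumes backups complete instantly and that the restore time has been measured in a recovery drill.

```python
def worst_case_data_loss_minutes(backup_interval_minutes):
    """If a crash happens just before the next backup, one full interval of data is lost."""
    return backup_interval_minutes

def meets_objectives(backup_interval_minutes, measured_restore_minutes,
                     rpo_minutes, rto_minutes):
    """Compare a backup plan and a measured restore time against RPO and RTO targets."""
    return {
        "rpo_ok": worst_case_data_loss_minutes(backup_interval_minutes) <= rpo_minutes,
        "rto_ok": measured_restore_minutes <= rto_minutes,
    }
```

For example, backing up every 15 minutes cannot satisfy an RPO of 10 minutes, no matter how fast the restore is.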

Matrix of RPO – high/low vs RTO – high/low can be used to categorize applications.

Low RPO, Low RTO. Critical online application like a storefront.

Low RPO, High RTO. Data sensitive application but not online, like analytics.

High RPO, Low RTO. Redundantly available data or no data. Compute clusters that are highly available.

High RPO, High RTO. Non-prod systems – dev/test/qa ?
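The four quadrants can be expressed as a small lookup. This is only a sketch: the 15-minute cut-off between “low” and “high” is an arbitrary assumption, and `categorize` is a hypothetical helper.

```python
EXAMPLES = {
    ("low", "low"): "critical online application, e.g. a storefront",
    ("low", "high"): "data-sensitive but offline application, e.g. analytics",
    ("high", "low"): "redundant or no data, e.g. highly available compute clusters",
    ("high", "high"): "non-prod systems, e.g. dev/test/qa",
}

def categorize(rpo_minutes, rto_minutes, low_threshold_minutes=15):
    """Place an application in the RPO/RTO matrix (threshold is an assumed value)."""
    rpo = "low" if rpo_minutes <= low_threshold_minutes else "high"
    rto = "low" if rto_minutes <= low_threshold_minutes else "high"
    return (rpo, rto), EXAMPLES[(rpo, rto)]
```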

Amount of acceptable data loss <= App (data?) criticality.
One can expect a pyramid of apps – large number with less criticality, small number with high criticality

Repeatability. Backup and recovery procedures. Must be written. Must be tested. Automation.

HA/DR spectrum of solutions:

  • Backups, save transaction logs
  • Snapshots
  • Replication – synchronous, asynchronous
  • Storage only vs in-memory as well. Application level crash consistency of backups.
  • Multiple AZs
  • Hybrid

Tech: S3 versioning and DDB streams, Global tables.

Rules of thumb:

Related terms: RPA and RTA

3 types of disasters.

  • Natural disaster – e.g. floods, earthquakes, fire
  • Technical failure – e.g. loss of power, cable pulled
  • Human error – e.g. delete all files as admin

Replication – works for the first two. Continuous snapshots/backup/versioning – for the last one. With human error, replication will just delete the data on both sides, so one needs the ability to go back in time and restore data.
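The difference between the two approaches shows up in a toy model. Both classes below are hypothetical in-memory stand-ins, not any real storage API: the replicated store copies every operation, including a mistaken delete, while the versioned store keeps history and can read as of an earlier version.

```python
class ReplicatedStore:
    """Every write and delete is applied to both copies."""
    def __init__(self):
        self.primary, self.replica = {}, {}

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value

    def delete(self, key):
        self.primary.pop(key, None)
        self.replica.pop(key, None)    # the mistake is faithfully replicated

class VersionedStore:
    """Keeps every version of a key; a delete is recorded as a None marker."""
    def __init__(self):
        self.history = {}              # key -> list of (version, value)
        self.version = 0

    def put(self, key, value):
        self.version += 1
        self.history.setdefault(key, []).append((self.version, value))
        return self.version

    def delete(self, key):
        return self.put(key, None)

    def get(self, key, as_of=None):
        value = None
        for version, stored in self.history.get(key, []):
            if as_of is None or version <= as_of:
                value = stored         # later versions overwrite earlier ones
        return value
```

Reading with `as_of` set to a version from before the accidental delete returns the old value, which is the “go back in time” ability that replication alone does not provide.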

Cost – how to optimize cost of infrastructure and its maintenance.

Which region to choose ? Key considerations: What types of disasters are the concern (Risk). How much proximity is needed to end-customers and to primary region (Performance). What’s the cost of the region (Cost) ?

Delta Lake and Spark for threat detection and response at scale

Notes on a talk on the data platform for Threat detection and response at scale at Spark+AI Summit, 2018.

The threat detection problem, use-cases and scale.

  • It’s important to focus on and build the data platform first, else one can get siloed into a narrow set of addressable use-cases.
  • we want to detect attacks,
  • contextualize the attacks
  • determine root cause of an attack,
  • determine what the scope of the incident might be
  • determine what we need to contain it
  • Diverse threats require diverse data sets
  • the Threat signal can be concentrated or spread in time
  • Keylines visualization library is used to build a visualization of detection, contextualization, containment

Streaming is a simple pattern that takes us very far for detection

  • Streams are left-joined with context and filtered or inner-joined with indicators
  • Can do a lot with this but not everything
  • Graphs are key. Graphs at scale are super hard.
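The join pattern in the first bullet can be sketched over plain Python iterables standing in for streams. The field names (`src_ip`, `dest_ip`) and the helper names are assumptions for illustration, not the talk's schema.

```python
def enrich(events, context_by_ip):
    """Left join: every event is kept, with context attached when it exists."""
    for event in events:
        enriched = dict(event)
        enriched["context"] = context_by_ip.get(event["src_ip"])  # None if unknown
        yield enriched

def match_indicators(events, indicator_ips):
    """Inner join: only events whose destination is a known indicator survive."""
    for event in events:
        if event["dest_ip"] in indicator_ips:
            yield event

def detect(events, context_by_ip, indicator_ips):
    """Enrich the stream with context, then keep only indicator matches as alerts."""
    return list(match_indicators(enrich(events, context_by_ip), indicator_ips))
```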

Enabling triage and containment with search and query

  • to triage the detection, it comes down to search and query.
  • ETM does 3.5million records/sec. 100TB of data a day. 300B events a day.
  • 11 trillion rows, 0.5PB of data.

Ingestion architecture – tries to solve all these problems and balance issues.

  • data comes into s3 in a consistent json wrapper
  • there’s a single ETL job that takes all the data and writes it into a single staging table which is partitioned by date and event-type, has a long retention
  • table is optimized to stream new data in and stream data out of, but can be queried as well. you can actually go and query it using sql function
  • highest value data – we write parsers, we have discrete parsing streams and put them into a common schema and put it into separate delta tables. well parsed, well structured.
  • use optimizations from delta, z-ordering.. to index over columns that are common predicates. search by IP address, domain names – those are what we order by
  • indexing and z-ordering – take advantage of data skipping
  • sometimes the parser code gets messed up.
  • single staging table.. is great. we just let the fixed parser run forward, we have all the data corrected, then we are back-corrected. don’t have to repackage code and run as a batch job. we literally just fix code and run it in the same model, that’s it.
  • off of these refined tables or parsed data sets, this is where the detection comes in.
  • we have a number of detection streams in batches, that do the logic and analysis. facet-faced or statistical.
  • alerts that come out of this go to their own alert table. goes to delta again. long retention, consistent schema. another streaming job then does de-duplication and whitelisting and writes out alerts to our alert management system. we durably store all the alerts, whether or not de-duped/whitelisted
  • allows us to go back and fix things if things are not quite correct, accidentally.
  • all this gives us operational sanity, and a nice feedback loop

Thanks to z-ordering, it can go from scanning 500TB to 36TB.

  • average case is searching over weeks or months. it makes it usable for ad-hoc refinements.
  • simple, unified platform.
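The idea behind z-ordering can be illustrated with a Morton code, which interleaves the bits of two column values so that rows close in both columns end up close in sort order. This is a generic sketch of the technique; how Delta implements it internally is not covered in these notes, and `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(x, y, bits=16):
    """Morton (z-order) code: bit i of x goes to position 2i, bit i of y to 2i+1."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def z_sorted(rows):
    """Sort (x, y) rows by their z-order code, clustering nearby values together."""
    return sorted(rows, key=lambda row: interleave_bits(row[0], row[1]))
```

Files written in this order cover a narrow range of both columns, so per-file min/max statistics let a query skip most files for a predicate on either column.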

Michael: Demo on interactive queries over months of data

  • first attempt is sql SELECT on raw data. takes too long, cancelled. second attempt uses HMS, still too long, cancelled. why is this so hard ?
  • observation: every data problem is actually two problems 1) data engineering and 2) data science. most projects fail on step 1.
  • doing it with delta – the following command takes 17s and fires off spark job to put the data in a common schema.

CREATE TABLE connections USING delta AS SELECT * FROM json.`/data/connections`


SELECT * FROM connections WHERE dest_port = 666

this is great to query the historical data quickly.. however batch alone is not going to cut it as we may have attacks going on right now. but delta plugs into streaming as well:

INSERT INTO connections SELECT * from kafkaStream

Now we’ve indexed both batch and streaming data.

We can run a python netviz command to visualize the connections.

Here’s a paper on the Databricks Delta Lake approach – .

AWS Builders Library

A few interesting ideas from AWS Builders Library

Toyota’s Five-why’s approach to root cause a problem – is good but not enough to find all other root causes that might also cause a problem.

Couple great talks on serverless

Smithy is an Apache-2.0 licensed, protocol-agnostic IDL for defining APIs, generating clients, servers and documentation.

Lacework Intrusion Detection System – Cloud IDS

Lacework Polygraph is a Host based IDS for cloud workloads. It provides a graphical view of who did what on which system, reducing the time for root cause analysis for anomalies in system behaviors. It can analyze workloads on AWS, Azure and GCP.

It installs a lightweight agent on each target system which aggregates information from processes running on the system into a centralized customer specific (MT) data warehouse (Snowflake on AWS)  and then analyzes the information using machine learning to generate behavioral profiles and then looks for anomalies from the baseline profile. The design allows automating analysis of common attack scenarios using ssh, privilege changes, unauthorized access to files.

The host based model gives detailed process information such as which process talked to which other and over what api. This info is not available to a network IDS. The behavior profiles reduce the false positive rates. The graphical view is useful to drill down into incidents.

OSQuery is a tool for gathering data from hosts, and this is a source of data aggregated for threat detection.

Here’s an agent for libpcap

It does not have an intrusion prevention (IPS) functionality. False positives on an IPS could block network/host access and negatively affect the system being protected, so it’s a harder problem.

Cloud based network isolation tools like Aviatrix might make IPS scenarios feasible by limiting the effect of an IPS.

Stabilizing Cryptocurrency

Before I forget its name, BaseCoin is a project that attempts to stabilize a cryptocurrency, so it does not have wild fluctuations.

Regular (Fiat) currencies are actively managed by central banks to be stable and are also stabilized by being the default currency for labor negotiations, employment contracts, retirement accounts etc., all of which change slowly.

More on crypto stabilization in this post –!/main/articles/post-09-the-intelligent-cryptocurrency-investor . It notes that Basis coin failed. It also makes a distinction between fiat-backed stablecoins and cryptocurrency-backed stablecoins. In the latter the stabilizing algo works on chain.

Avalanche has a paper classifying stablecoins.

Decentralized Identity Based on Blockchain

Sovrin project. Uses a Permissioned blockchain which allows it to do away with mining as an incentive and instead directly build a Distributed Ledger Technology which stores Distributed Identifiers (DIDs) and maps them to claims. Removal of mining frees up resources and increases network throughput. Interesting Key Management aspects, including revocation. Contrasts with Ethereum uPort – which is permissionless and public. Neat design, but will face adoption problem as it is unhitched from bitcoin/ethereum.

Sovrin–digital-identities-in-the-blockchain-era.pdf

The-Technical-Foundations-of-Sovrin.pdf

DPKI – Distributed PKI. Attempts to reduce the weakness of a centralized certificate authority, where compromising that cert authority affects each of its issued certificates. This concept is built out and used in Sovrin.

Remme. Remember me. An approach to SSL based logins. Modifies SSL.
Used an EmerCoin implementation as mvp and Ethereum blockchain. EmerCoin: . Adoption problem here is change in behavior of each browser and mobile app.

Sidechains. Original proposal was to free up resources for when trust is established, to reuse blockchain technology and to establish a two-way peg between the sidechain and the blockchain.

Coco Framework.

HyperLedger – Linux Foundation hosted framework for developing blockchain software. Provides a DLT and uses Intel SGX extensions. (Intel+Microsoft+Linux foundation). Uses a replicated state machine model with each validating peer independently adding to its chain after reaching consensus on order of txns with other peers using Practical Byzantine Fault Tolerance or Proof of Elapsed Time. . Related –

A comparison link –


NVidia Volta GPU vs Google TPU

A Graphics Processing Unit (GPU) allows multiple hardware processors to act in parallel on a single array of data, allowing a divide and conquer approach to large computational tasks such as video frame rendering, image recognition, and various types of mathematical analysis including convolutional neural networks (CNNs). The GPU is typically placed on a larger chip which includes CPU(s) to direct data to the GPUs. This trend is making supercomputing tasks much cheaper than before.

Tesla_v100 is a System on Chip (SoC) which contains the Volta GPU which contains TensorCores, designed specifically for accelerating deep learning, by accelerating the matrix operation D = A*B+C, each input being a 4×4 matrix.  More on Volta at . It is helpful to read the architecture of the previous Pascal P100 chip which contains the GP100 GPU, described here – .  Background on why NVidia builds chips this way (SIMD < SIMT < SMT) is here – .

Volta GV100 GPU = 6 GraphicsProcessingClusters x  7 TextureProcessingCluster/GraphicsProcessingCluster x 2 StreamingMultiprocessor/TextureProcessingCluster x (64 FP32Units +64 INT32Units + 32 FP64Units +8 TensorCoreUnits +4 TextureUnits)

The FP32 cores are referred to as CUDA cores, which means 84×64 = 5376 CUDA cores per Volta GPU. The Tesla V100, which is the first product (SoC) to use the Volta GPU, uses only 80 of the 84 SMs, or 80×64 = 5120 cores. The frequency of the chip is 1.455 GHz. The Fused-Multiply-Add (FMA) instruction does a multiplication and addition in a single instruction (a*b+c), resulting in 2 FP operations per instruction, giving 1.455×2×5120 = 14.9 TFLOPS from the CUDA cores alone. The TensorCores do a 3d Multiply-and-Add (D = A*B+C on 4×4 matrices): each of the 4×4 outputs of A*B takes 4 multiplications and 3 additions, i.e. 7×4×4 ops, plus 4×4 additions for C, so 7×4×4+4×4 = 128 FP ops/cycle, for a total of 1.455×80×8×128 ≈ 120 TFLOPS for deep learning apps.
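The arithmetic above can be reproduced directly from the unit counts. A small sketch using only the figures quoted in this post; the constant and function names are hypothetical.

```python
GPCS, TPCS_PER_GPC, SMS_PER_TPC = 6, 7, 2
FP32_PER_SM, TENSOR_CORES_PER_SM = 64, 8
V100_ENABLED_SMS = 80
CLOCK_GHZ = 1.455

def full_gpu_sms():
    """SMs on the full GV100 die: 6 GPCs x 7 TPCs x 2 SMs."""
    return GPCS * TPCS_PER_GPC * SMS_PER_TPC

def fp32_tflops(sms=V100_ENABLED_SMS):
    """Each CUDA core retires one FMA (2 FP ops) per cycle."""
    cuda_cores = sms * FP32_PER_SM
    return CLOCK_GHZ * 2 * cuda_cores / 1000

def tensor_ops_per_cycle(n=4):
    """D = A*B + C on n x n matrices: n mults and n-1 adds per output, plus one add for C."""
    return n * n * (n + (n - 1)) + n * n

def tensor_tflops(sms=V100_ENABLED_SMS):
    """Peak throughput of all tensor cores on the enabled SMs."""
    return CLOCK_GHZ * sms * TENSOR_CORES_PER_SM * tensor_ops_per_cycle() / 1000
```

These come out to about 14.9 and 119.2 TFLOPS respectively, the latter being the figure usually rounded to 120.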

[Figure: 3D matrix multiplication]

The Volta GPU uses a 12nm manufacturing process, down from 16nm for Pascal. For comparison the Jetson TX1 claims 1TFLOPS and the TX2 twice that (or same performance with half the power of TX1). The VOLTA will be available on Azure, AWS and platforms such as Facebook.  Several applications in Amazon. MS Cognitive toolkit will use it.

For comparison, the Google TPU runs at 700 MHz, and is manufactured with a 28nm process. Instead of FP operations, it uses quantization to integers and a systolic array approach to minimize the watts per matrix multiplication, and optimizes for neural network calculations instead of more general GPU operations. The TPU uses a design based on an array of 256×256 multiply-accumulate (MAC) units, resulting in 92 Tera Integer ops/second.

Given that NVidia is targeting additional use cases such as computer vision and graphics rendering along with neural network use cases, the TPU’s narrowly specialized approach would not make sense for its GPUs.
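The 92 Tera ops/second figure follows from the array size and clock, counting each multiply-accumulate as two operations. The sketch below also shows one simple, assumed form of quantization to 8-bit integers (symmetric scaling by the largest magnitude); the TPU's actual quantization scheme is not described in this post.

```python
def tpu_peak_tera_ops(rows=256, cols=256, clock_mhz=700):
    """Peak integer throughput: one MAC (2 ops) per unit per cycle."""
    mac_units = rows * cols
    return mac_units * 2 * clock_mhz * 1e6 / 1e12

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor (assumed scheme)."""
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Approximate recovery of the original floats."""
    return [q * scale for q in quantized]
```

`tpu_peak_tera_ops()` gives about 91.75, which rounds to the quoted 92.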

Miscellaneous conference notes:

Nvidia DGX-1. “Personal Supercomputer” for $69000 was announced. This contains eight Tesla_v100 accelerators connected over NVLink.

Tesla. FHHL, Full Height, Half Length. Inferencing. Volta is ideal for inferencing, not just training. Also for data centers. Power and cooling use 40% of the datacenter.

As AI data floods the data centers, Volta can replace 500 CPUs with 33 GPUs.
Nvidia GPU cloud. Download the container of your choice. First hybrid deep learning cloud network. Private beta extended to GTC attendees.

Containerization with GPU support. Host has the right NVidia driver. Docker from GPU cloud adapts to the host version. Single docker. Nvidiadocker tool to initialize the drivers.

Moore’s law comes to an end. Need AI at the edge, far from the data center. Need it to be compact and cheap.

Jetson board had a Tegra SoC chip which has 6 CPUs and a Pascal GPU.

AWS Device Shadows vs GE Digital Twins. Different focus. Availability+connectivity vs operational efficiency. Manufacturing perspective vs operational perspective. Locomotive may be simulated when disconnected.

DeepInstinct analysed malware data using convolutional neural networks on GPUs, to better detect malware and its variations. – deep learning for time series data to detect anomalous conditions on sensors on the field such as pressure in a gas pipeline.

GANS applications to various problems – will be refined in next few years.

GeForce 960 video card. Older but popular card for gamers, used the Maxwell GPU, which is older than Pascal GPU.

Cooperative Groups in Cuda9. More on Cuda9.

RSA World 2017

I was struck by the large number of vendors offering visibility as a key selling point. Into various types of network traffic. Network monitors, industrial network gateways, SSL inspection, traffic in VM farms, between containers, and on mobile.

More expected are the vendors offering reduced visibility – via encryption, TPMs, Secure Enclaves, mobile remote desktop, one way links, encrypted storage in the cloud etc. The variety of these solutions is also remarkable.

Neil deGrasse Tyson’s closing talk was so different yet strangely related to these topics, the universe slowly making itself visible to us through light from distant places, with insights from physicists and mathematics building up to the experiments that recently confirmed gravitational waves – making the invisible, visible. This tenuous connection between security and physics left me misty eyed. What an amazing time to be living in.

Software Defined Networking Security

Software Defined Networking seeks to centralize control of a large network. The abstractions around computer networking evolved from connecting nodes via switches, to applications that run on top with the OSI model, to the controllers that manage the network. The controller abstraction was relatively weak – this had been the domain of telcos and ISPs, and as the networks of software intensive companies like Google approached the size of telco networks, they moved to reinvent the controller stack. There was an attempt to centralize traffic engineering and security, which had been done in disparate regions, in order to better achieve economies of scale. Google adopted OpenFlow for this, developed by the team that founded Nicira, which was soon after acquired by VMware; internal discussions at Cisco concluded that such a centralization wave would cut Cisco revenues in half, so they spun out Insieme Networks for SDN capabilities and quickly acquired it back. This has morphed into the APIC offering.

The centralization wave is a bit at odds with the security and resilience of networks because of their inherent distributed and heterogenous nature. Distributed systems provide availability, part of the security CIA triad, and for many systems availability trumps security. The centralized controllers would become attractive targets for compromise. This is despite the intention of SDN, as envisioned by Nicira founder M. Casado, to have security as its cornerstone as described here. Casado’s problem statement is interesting: “We don’t have a ubiquitous and horizontal security layer that provides both context and isolation. Where do we normally put security controls? We put it in one of two places. We might put it in the physical infrastructure, which is great because you have isolation. If I have ACLs [access control lists] or a firewall or an IDS [intrusion detection system], I put it in a separate box and I put it away from the applications so that the attack surface is pretty small and it’s totally isolated… you have isolation, but you have no context. ..  Another place we put security is in the end host, an application or operating system. This suffers from the opposite problem. You have all the context that you need — applications, users, the data being accessed — but you have absolutely no isolation.” The centralization imperative comes from the need to isolate and minimize the trusted computing base.

In the short term, there may be some advantage to be gained by complexity reduction through centralized administration, but the recommendation of dumb switches that respond to a tightly controlled central brain goes against the tenets of compartmentalization of risk, and if such networks are put into practice widely they can result in failures that are catastrophic instead of isolated.

What the goal should be is a distributed system which is also responsive.