Category: Uncategorized

Deep Reinforcement Learning key papers

July 2, 2023July 2, 2024 · Leave a comment ·

Reinforcement Learning (RL) combined with Deep Learning has been termed Deep Reinforcement Learning (DRL). Deep learning provides function approximation techniques that can handle large and complex state and/or action spaces, making it possible to tackle problems that were infeasible with traditional RL techniques. This line of research led to transformers and LLMs. Here’s a brief timeline of key insights and breakthroughs in Deep Reinforcement Learning over the past decade:

1. 2013 – Playing Atari with Deep Reinforcement Learning:

Organization: DeepMind
Breakthrough: This was perhaps the first major work that combined deep learning with Q-learning, resulting in a Deep Q-Network (DQN). The DQN was able to play several Atari 2600 games at or above human-level performance.
Key Insights: Experience replay and fixed Q-targets were used to stabilize learning. The experience replay helped in breaking the temporal correlations, and fixed Q-targets reduced the moving target problem in Q-learning.

2. 2015 – Human-level control through deep reinforcement learning:

Organization: DeepMind
Breakthrough: An extension of the 2013 DQN work, this presented a more robust DQN that achieved human-level performance across a broad range of Atari games.
Key Insights: Further stabilization and scaling of DQNs.

3. 2015 – Continuous control with deep reinforcement learning (DDPG):

Organization: DeepMind
Breakthrough: Introduced the Deep Deterministic Policy Gradient (DDPG) algorithm for continuous action spaces.
Key Insights: It utilized actor-critic architecture where the actor produces a deterministic policy, and the critic evaluates it. The Ornstein-Uhlenbeck process was used to add exploration noise.

4. 2016 – Asynchronous Methods for Deep Reinforcement Learning (A3C):

Organization: DeepMind
Breakthrough: Introduced the Asynchronous Advantage Actor-Critic (A3C) algorithm which combined the actor-critic approach with asynchronous updates.
Key Insights: Multiple agents, each with its own set of model parameters, explored different parts of the environment simultaneously, leading to faster and more robust policy learning. The asynchronous nature also helped in stabilizing learning.

5. 2017 – Proximal Policy Optimization (PPO):

Organization: OpenAI
Breakthrough: Introduced a simpler and more robust method for policy gradient optimization, making training more stable.
Key Insights: PPO constrains the policy updates to ensure the new policy isn’t too different from the old policy, thereby avoiding extreme policy updates that can destabilize training. PPO balances the benefits of both Policy Gradient methods (link) and Trust Region Policy Optimization methods (TRPO link). PPO achieves this by using a clipped surrogate objective that prevents large updates during training, enhancing stability and performance. In the context ofPPO, the term “surrogate objective” refers to an approximation used in place of the actual objective function during optimization. This surrogate function is easier to optimize and ensures more stable and reliable updates to the policy. The clip function ensures that the probability ratiodoes not deviate too far from 1 by clipping it to the range [1−ϵ,1+ϵ][1−ϵ,1+ϵ]. This prevents excessively large policy updates.

6. 2018 – Soft Actor-Critic (SAC):

Organization: UC Berkeley
Breakthrough: SAC is an off-policy actor-critic deep RL algorithm based on the maximum entropy RL framework.
Key Insights: SAC seeks policies that maximize both expected return and entropy, leading to more exploration, smoother policy updates, and generally better performance on continuous control tasks.

7. 2019 and beyond:

Subsequent years have seen the evolution of these methods and the introduction of new algorithms, improvements in sample efficiency, stability, and scalability. Also, there has been a focus on:

Transfer Learning: Using pre-trained models to improve sample efficiency in RL.
Meta-learning: Training agents that can quickly adapt to new tasks.
Model-based RL: Incorporating learned models of the environment dynamics to improve sample efficiency and policy learning.

Feature Vectors, Embeddings, Vector Databases, Feature Stores

April 8, 2023June 19, 2023 · 1 Comment ·

An ML model consists of a set of weights (or a set of numerical values) that transform inputs to outputs (along with a nonlinear transform such as a sigmoid function). The weights are often organized as vectors or matrices. Consider neural networks, decision trees and support vector machines as types of ML models for this discussion.

The weights representing features of the data (input or intermediate data) are also called feature vectors or vectors. They are also called embeddings, that is embeddings of vectors in a vector space. We discussed such vectors in https://securemachinery.com/2019/05/24/transformer-gpt-2/.

The term “embedding” comes from the idea that the vectors “embed” the original data into a lower-dimensional space. The embedding process involves a combination of statistical and computational techniques, such as factorization and neural networks, that learn to map the input data into the vector space in a way that preserves the relevant properties of the original data.

The use of vectors to represent words in machine learning research started in 2013 with the publication of the paper “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al. This paper introduced the word2vec algorithm, which generates dense vector representations of words based on their distributional properties in a large corpus of text. The size of the vector or embedding in a word embedding model is a hyperparameter that needs to be determined before training the model. It is typically chosen based on the size of the vocabulary and the complexity of the task at hand. In practice, the vector size is often set to be between 100 and 300 dimensions, but this can vary depending on the specific application and the available computational resources. The optimal vector size can be determined through experimentation and tuning of hyperparameters.

One difference between embeddings and feature vectors is that embeddings are typically learned automatically from the data, while feature vectors are typically chosen based on domain knowledge or feature engineering. However these two terms are often used interchangeably. Here is a video going over how the embeddings are obtained from words in a sentence with a bag of words approach- https://www.youtube.com/watch?v=viZrOnJclY0 .

Pinecone, Milvus, Facebook AI Similarity Search (FAISS), Google Vertex Matching engine are examples of Vector databases.

The challenge in implementing a vector database is that traditional databases are not optimized for handling high-dimensional vector data, which is often used in machine learning and data science applications.

Vector data is typically represented as arrays of numbers, where each number represents a feature or attribute of the data. For example, an image might be represented as a high-dimensional vector where each dimension represents the color value of a specific pixel. In contrast to traditional databases, where each record consists of a set of fields or columns, vector databases need to store and index large volumes of high-dimensional data in a way that supports efficient similarity search.

In traditional databases, queries are typically based on simple comparisons of scalar values, such as equality or range queries. However, in vector databases, similarity search is the primary operation, which requires specialized algorithms and data structures to efficiently compute the similarity between vectors. These algorithms are designed to handle high-dimensional data and minimize the amount of computation needed to compare vectors, which can be computationally expensive.

There are several specialized algorithms that are commonly used in vector databases to support efficient similarity search. Here are some examples:

Euclidean Distance: This is a distance metric that measures the straight-line distance between two points in Euclidean space. It is commonly used in vector databases to compute the distance or similarity between vectors.
Cosine Similarity: This is a similarity metric that measures the cosine of the angle between two vectors. It is commonly used in text-based applications to measure the similarity between documents or word embeddings.
Locality-Sensitive Hashing (LSH): This is a technique used to hash high-dimensional vectors into lower-dimensional buckets based on their similarity. It is commonly used in vector databases to speed up similarity search by reducing the number of comparisons needed to find similar vectors.
Product Quantization: This is a technique used to divide high-dimensional vectors into smaller subvectors and quantize them separately. It is commonly used in vector databases to reduce the dimensionality of the data and speed up similarity search.
Inverted Indexing: This is a technique used to index the vectors based on the values of their individual dimensions. It is commonly used in text-based applications to speed up search queries by indexing the terms in the document.

Pinecone provides several indexing and search algorithms, including approximate nearest neighbor search, that are selected automatically based on the properties of the data and the search requirements. However, you can also specify a specific algorithm or tuning parameters when creating an index or performing a query by passing in the appropriate arguments. For example, you can use the method parameter when creating an index to specify the indexing method, or the distance parameter when performing a query to specify the distance metric to use.

While OpenSearch is not specifically designed as a vector database like Pinecone, it provides vector search capabilities through its support for nearest neighbor search. OpenSearch uses the K-Nearest Neighbor (K-NN) algorithm to perform nearest neighbor search for vector data. K-NN is a machine learning algorithm that can be used to find the K nearest neighbors of a query vector in a high-dimensional space. OpenSearch also provides support for approximate nearest neighbor search using algorithms such as Annoy and Hnswlib. To use vector search in OpenSearch, you first need to index your vector data using the appropriate data type (e.g., float or double). You can then perform a nearest neighbor search by specifying the query vector and the number of nearest neighbors to return. OpenSearch also provides support for vector scoring, which allows you to rank search results based on their similarity to a query vector. You can use vector scoring to boost or filter search results based on their similarity to a query vector.

What kind of vectorization schemes are useful for log processing ?

When processing log data, the goal is typically to extract useful information from the log entries and transform them into a format that can be easily analyzed and searched. Vectorization is a common technique used for this purpose, and there are several vectorization schemes that are applicable to log processing. Here are some examples:

Bag-of-words: This is a vectorization scheme that represents a document as a bag of words, where each word is represented by a dimension in the vector and the value of the dimension is the frequency of the word in the document. Bag-of-words can be used to represent log entries as a vector of words, which can be used for tasks such as text classification and anomaly detection.
TF-IDF: This is a vectorization scheme that represents a document as a weighted combination of its term frequency and inverse document frequency. TF-IDF can be used to represent log entries as a vector of weighted words, which can be used for tasks such as information retrieval and text mining.
Word embeddings: This is a vectorization scheme that represents words as dense vectors in a high-dimensional space, where the distance between vectors reflects the semantic similarity between the words. Word embeddings can be used to represent log entries as a vector of word embeddings, which can be used for tasks such as text classification and entity recognition.
Sequence embeddings: This is a vectorization scheme that represents a sequence of words as a dense vector in a high-dimensional space, where the distance between vectors reflects the similarity between the sequences. Sequence embeddings can be used to represent log entries as a vector of sequence embeddings, which can be used for tasks such as sequence classification and anomaly detection.
One-hot encoding: This is a vectorization scheme that represents categorical data as binary vectors, where each dimension corresponds to a possible category and the value of the dimension is 1 if the data belongs to that category and 0 otherwise. One-hot encoding can be used to represent log entries as a vector of categorical features, which can be used for tasks such as classification and clustering.

By using a suitable vectorization scheme, log data can be transformed into a format that can be easily analyzed and searched, enabling tasks such as anomaly detection, root cause analysis, and performance optimization.

Vector database versus Feature store – what’s the difference ?

Both vector databases and feature stores are used to manage and serve high-dimensional data, such as embeddings, vectors, and other numerical representations, but there are some key differences between the two.

A vector database is a database optimized for storing and querying high-dimensional vector data. It provides efficient indexing and search algorithms, such as approximate nearest neighbor search, that allow for fast and scalable similarity search. Vector databases are commonly used in machine learning applications, such as recommendation systems and natural language processing, where the goal is to find similar items or entities based on their vector representations.

A feature store, on the other hand, is a centralized repository for machine learning features that provides a way to store, manage, and share feature data across different applications and teams. It is designed to help data scientists and machine learning engineers build, test, and deploy machine learning models more efficiently by providing a unified interface for accessing and managing features.

While both vector databases and feature stores can store and serve high-dimensional data, the main difference is their focus and use case. Vector databases are designed for efficient similarity search, while feature stores are designed for feature management and sharing across different applications and teams. In practice, they can complement each other in many machine learning workflows, with the vector database providing the efficient similarity search capabilities and the feature store providing a centralized and standardized way to manage and share feature data.

Comparison of Milvus Pinecone Vespa Weaviate Vald GSI Qdrant – https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696

Anyscale – Using an embeddings database to train an LLM using Ray – https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray

OpenAI embeddings example – https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

HuggingFace sentence embeddings article – https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

AWS – https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8

Reasoning, Acting and Composing. ReAct and Self-Ask papers

December 3, 2022May 23, 2023 · Leave a comment ·

Reasoning and actions synergize. The ReAct paper interleaves reasoning traces and task-specific actions to achieve a synergy between the two.

A reasoning trace is a record or a description of the mental steps or thought process used to arrive at a particular conclusion or solution. It is a detailed account of how someone reasons through a problem or question, including the assumptions made, the evidence considered, the inferences drawn, and the logical steps taken to reach a conclusion. By examining the reasoning trace, one can identify potential biases, errors in reasoning, or gaps in logic that may have influenced the person’s decision-making process.

A task-specific action is an action that can help a reasoning task. This depends on the task at hand. Some examples

In a mathematical problem-solving task, a task-specific action might be to break down a complex problem into smaller, more manageable parts.
In a critical thinking task, a task-specific action might be to evaluate the evidence provided and identify any biases or assumptions that might be influencing the conclusion.
In a decision-making task, a task-specific action might be to weigh the pros and cons of each available option and consider how each option aligns with one’s goals or values.
In a scientific inquiry task, a task-specific action might be to design a controlled experiment to test a hypothesis and systematically collect and analyze data to draw conclusions.
In a legal reasoning task, a task-specific action might be to interpret and analyze case law and statutes, apply legal principles to the facts of a case, and argue persuasively for a particular legal outcome.

Task-specific actions can vary widely depending on the task and the context, but they generally involve applying relevant knowledge, skills, and strategies to solve a particular problem or achieve a specific goal.

From the ReAct ( paper – “The best approach overall is a combination of ReAct and CoT that allows for the use of both internal knowledge and externally obtained information during reasoning. On ALFWorld and WebShop, two or even one-shot ReAct prompting is able to outperform imitation or reinforcement learning methods trained with 103 ∼ 105 task instances, with an absolute improvement of 34% and 10% in success rates respectively. We also demonstrate the importance of sparse, versatile reasoning in decision making by showing consistent advantages over controlled baselines with actions only. Besides general applicability and performance boost, the combination of reasoning and acting also contributes to model interpretability, trustworthiness, and diagnosability across all domains, as humans can readily distinguish information from model’s internal knowledge versus external environments, as well as inspect reasoning traces to understand the decision basis of model actions.”

The Self-Ask paper discusses compositional reasoning and narrowing the “compositionality gap”.

Compositional reasoning is the ability to combine smaller pieces of knowledge or information to deduce new knowledge or solve a problem. It involves taking a set of facts or ideas and using them to create a new idea or answer a question that cannot be answered by any single fact alone. This type of reasoning is important in many areas, including natural language understanding, problem solving, and decision-making. Compositional reasoning allows us to use our knowledge in a more flexible and adaptive way, and is essential for many advanced cognitive tasks.

The compositionality gap is a metric used to measure the ability of language models to perform compositional reasoning tasks. It is defined as the ratio of the number of compositional questions for which the model answers the sub-questions correctly but not the overall question, to the total number of compositional questions. In other words, it measures how often models can correctly answer all sub-problems but not generate the overall solution. A high compositionality gap indicates that the model is struggling with compositional reasoning, while a low gap indicates that the model is better at composing multiple facts to answer complex questions.

The paper proposes a solution called “self-ask,” a new method of prompting language models to perform compositional reasoning tasks. With self-ask, the model explicitly asks itself follow-up questions before answering the initial question. By breaking down the reasoning process into smaller steps, the model is better able to combine relevant information from different sources and answer multi-hop questions correctly. Additionally, self-ask allows for plugging in a search engine to answer the follow-up questions, which further improves accuracy. The paper shows that self-ask narrows the compositionality gap by reasoning explicitly instead of implicitly.

Apache Yunikorn

October 29, 2022April 14, 2023 · Leave a comment ·

YuniKorn is an alternative scheduler to the default scheduler in kubernetes which benefits complex and mixed workloads. It provides advanced scheduling options like workload queueing and shared quotas. This helps improve the user experience and provides cost savings by providing better resource utilization.

Gang Scheduling refers to a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors. In the distributed computing world, this refers to the mechanism to schedule correlated tasks in an All or Nothing manner.

Bin packing refers to the process of allocation and reallocation of pods to nodes in a way that achieves a high utilization of the nodes. When a node has a low level of utilization, its pods are moved to a node with the highest level of utilization and that has space for the pods available; after which the low utilization node is freed and released.

Yunikorn scheduler talk at ApacheCon’21 – link.

https://yunikorn.apache.org/community/events/#past-conference–meetup-recordings

Pinterest talk on their use of Yunikorn – link.

Airflow and Orchestration

March 26, 2022October 30, 2022 · Leave a comment ·

download_images >> train >> serve

This line sets the sequence of operations for an ML pipeline in Airflow. source

A metaphor to think of Airflow is that of an air-traffic controller that is orchestrating, sequencing, mediating, managing the flow of the flights of airplanes (source). It is an example of the mediator pattern which decouples dependencies in a complex system. The airplanes do not talk directly to each other, they talk to the air-traffic controller.

A functional alternative to Airflow is to use a bunch of cron jobs to schedule bash scripts. Airflow instead defines pipelines as Directed Acyclic Graphs (DAGs) in python code. This critical talk on “Don’t use Apache Airflow” describes it as cron on steroids.

A complete example of an ML pipeline built with airflow that outputs the results to a streamlit app – https://github.com/bhlr/docker-airflow-streamlit

Each operation calls an operator to do the job locally or remotely.

How does it perform an operation remotely on another node ? ssh/remote execution ? docker daemon ? k8s operator ? There can be many different ways – this logic is encapsulated by an Executor.

Local Executors

Remote Executors

A thread on airflow and alternatives- https://news.ycombinator.com/item?id=23349507 .

https://github.com/pditommaso/awesome-pipeline – A number of pipeline tools for ETL

Intro talk on Airflow by Astronomer – https://www.youtube.com/watch?v=GIztRAHc3as ,

and on an ETL use case with Snowflake – https://www.youtube.com/watch?v=3-XGY0bGJ6g

How can one compose these DAGs further and manage cross-DAG depedencies ? An approach is discussed in https://medium.com/quintoandar-tech-blog/effective-cross-dags-dependency-in-apache-airflow-1885dc7ece9f to define an explicit mediator between multiple DAGs.

Security of Solidity Smart Contracts using DistilBERT

February 26, 2022June 10, 2023 · Leave a comment ·

Smart Contracts are relatively short blocks of code that run on the Ethereum Virtual Machine (EVM), and deal with tokens of value. For example a contract may release funds when certain preconditions such are met, such as time elapsed, or a signed request received. The number of smart contracts and the value of transactions in smart contracts has grown quite a bit in the last few years along with the prices of cryptocurrencies. The code of the Smart Contract is always publicly available as bytecode which can be reverse engineered, and often the source code in solidity language is often publicly available. As a result, bugs in smart contracts have become attractive exploit targets. EVMs are a distributed computing construct that run in parallel on a network of participating nodes, coordinating their actions by a consensus mechanism and protocol that runs between the nodes.

A collection of links on smart contract security –

https://blog.sigmaprime.io/solidity-security.html

https://rekt.news/superfluid-rekt/ newsletter reporting high level analysis recent attacks.

https://secureum.xyz

https://solidity-by-example.org/variables/ Solidity has 3 types of variables 1. local (inside function), 2. state (inside contract, outside function), 3. global (e.g. block.timestamp, msg.sender – chain level. provides info about the blockchain)

https://solidity-by-example.org/data-locations (storage, memory, calldata)

https://solidity-by-example.org/visibility/ (public, private, internal, external)

https://solidity-by-example.org/function-modifier (onlyOwner to restrict access, validAddress to validate address, noReentrancy to prevent reentrancy) Incorrect reentrancy is a source of bugs.

https://www.saurik.com/optimism.html – instrumenting the blockchain to find gaps (EthDenver talk).

Security of Bridges. Bridges are implemented as smart contracts between two different chains.

https://www.bitdefender.com/blog/hotforsecurity/smart-contract-exploit-costs-nomad-crypto-bridge-200-million/

Sequence diagram of a bridge operation in https://blog.harmony.one/introducing-horizon-an-ethereum-harmony-cross-chain-bridge/

Within the last year, bridges have accounted for a majority of the total funds stolen across all of the crypto ecosystem. Massive bridge hacks have occurred on average every few months, and each losing extremely large amounts of user funds. Some bridge hacks in the last couple of years have included the Axie Infinity Ronin bridge hack, losing users $625 million, the Wormhole bridge hack costing users $300 million, the Harmony bridge hack losing users $100 million, and just this last week the Nomad bridge hack, losing users almost $200 million.

Methods for Detecting attacks

Code reviews for reentrancy bugs
Detection of source of a txn as a bad actor
Using ML for code analysis and bad actor detection

https://github.com/DicksonWu654/ethdenverhack – This submission attempts using ML for detecting reentrancy attacks in Solidity code, by using transfer learning on DistilBERT, to train on good and bad smart contract code examples, and use the trained model to detect bad code on new code samples.

“from transformers import TFDistilBertModel, DistilBertTokenizerFast” # using Hugging Face Distilbert model

These guys had a funny presentation – https://www.youtube.com/watch?v=9oLuxJdrZwo

Reinforcement learning – optimal control theory, policies, RLLib, Ray, DeepRacer, OpenAI Gym

November 26, 2021September 16, 2023 · Leave a comment ·

An Agent is in an Environment. a) Agent reads Input (State) from Environment. b) Agent produces Output (Action) that affects its State relative to Environment c) Agent receives Reward (or feedback) for the Output produced. With the reward/feedback it receives it learns to produce better Output for given Input. The map that captures the set of available Actions, consequent Rewards and subsequent States for each State is called the Policy. This is a brief look at RL from the perspective of control theory. This map is actually a map of probabilities of the state transitions and another way of looking at RL is as a Markov Decision Process.

Where do neural networks come in ?

Optimal control theory considers control of a dynamical system such that an objective function is optimized (with applications including stability of rockets, helicopters). In optimal control theory, Pontryagin’s principle says: a necessary condition for solving the optimal control problem is that the control input should be chosen to minimize the control Hamiltonian. This “control Hamiltonian” is inspired by the classical Hamiltonian and the principle of least action. The goal is to find an optimal control policy function u∗(t) and, with it, an optimal trajectory of the state variable x∗(t) which by Pontryagin’s maximum principle are the arguments that maximize the Hamiltonian.

Derivatives are needed for the continuous optimizations. In which direction and by what amount should the weights be adjusted to reduce the observed error in the output ? What is the structure of the input to output map to begin with ? Deep learning models are capable of performing continuous linear and non-linear transformations, which in turn can compute derivatives and integrals. They can be trained automatically using real-world inputs, outputs and feedback. So a neural network can provide a system for sophisticated feedback-based non-linear optimization of the map from Input space to Output space. The structure of the network is being learned empirically. For example this 2017 paper uses 8 layers (5 convolutional and 3 fully connected) to train a neural network on the ImageNet database.

The above could be accomplished by a feedforward neural network that is trained with a feedback (reward). Additionally a recurrent neural network could encode a memory into the system by making reference to previous states (likely with higher training and convergence costs).

Model-free reinforcement learning does not explicitly learn a model of the environment.

The optimal action-value function obeys an identity known as the Bellman equation. If the quality of the action selection were known for every state then the optimal strategy at every state is to select the action that maximizes the (local) quality. [ Playing Atari with Deep Reinforcement Learning, https://arxiv.org/pdf/1312.5602.pdf ]

Manifestations of RL: Udacity self-driving course – lane detection. Karpathy’s RL blog post has an explanation of a network structure that can produce policies in a malleable manner, called policy gradients.

Practical issues in Reinforcement Learning –

Raw inputs vs model inputs: There is the problem of mapping inputs from real-world to the actual inputs to a computer algorithm. Volume/quality of information – high vs low requirement.

Exploitation vs exploration dilemma: https://en.wikipedia.org/wiki/Multi-armed_bandit. Simple exploration methods are the most practical. With probability ε, exploration is chosen, and the action is chosen uniformly at random. With probability 1 − ε, exploitation is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). ε is usually a fixed parameter but can be adjusted either according to a schedule (making the agent explore progressively less), or adaptively based on heuristics.

AWS DeepRacer. Allows exploration of RL. Simplifies the mapping of camera input to computer input, so one can focus more on the reward function and deep learning aspects. The car has a set of possible actions (change heading, change speed). The RL task is to predict the actions based on the inputs.

What are some of the strategies applied to winning DeepRacer ?

Implementation of the pure pursuit tracking problem, used by Scott Pletcher.
Explicit reward based on proximity, distance and speed by Daniel Gonzalez and team in https://towardsdatascience.com/an-advanced-guide-to-aws-deepracer-2b462c37eea
https://medium.com/dbs-tech-blog/an-introduction-to-aws-deepracer-from-a-2020-world-championship-finalist-3a63b5c8d8aa Fully autonomous vs Semi-autonomous. Input parameters for the reward function. Log analysis for optimizing the models.
Faster training vs slower training – https://falktan.medium.com/aws-deepracer-how-to-train-a-model-in-15-minutes-a07ab77fb793 (PPO takes full lap to learn, line of sight learns in sub-lap distances).
Soft-actor-critic algorithm. SAC demystified – https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665 . SAC works to increase entropy (to encourage exploration) and not just maximize rewards.

Reward function input parameters – https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-reward-function-input.html

On-policy vs off-policy methods: This has to do with the policies used for exploration and exploitation. If the same policy is used for both, it’s called on-policy. If there are two different policies, one for exploration and the other for exploitation it is called an off-policy method. PPO is on-policy, SAC is off-policy. A good read on policy optimization – https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html and one on the Actor-Critic model – https://huggingface.co/learn/deep-rl-course/unit6/advantage-actor-critic .

Original paper – Proximal Policy Optimization Algorithms is at https://arxiv.org/pdf/1707.06347.pdf .

PPO – Actor Critic Network.

“DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning” – https://arxiv.org/pdf/1911.01562.pdf

DeepRacer uses RLLib which brings forth a key idea of encapsulating parallelism in the context of AI applications, as described in RLlib: Abstractions for Distributed Reinforcement Learning. RLLib is part of Ray, described in Ray: A Distributed Framework for Emerging AI Applications . Encapsulating parallelism means that individual components specify their own internal parallelism and resources requirements and can be used by other components without any knowledge of these. This allows a larger system to be built from modular components.

OpenAI Gym offers a suite of environments for developing and comparing RL algorithms. It emphasizes environments over agents, complexity over performance, knowledge sharing over competition. https://github.com/openai/gym , Open AI Gym paper. Here’s a code snippet from this paper of how they see an agent interact with the environments over 100 steps of a training episode.

ob0 = env.reset() # sample environment state, return first observation
a0 = agent.act(ob0) # agent chooses first action
ob1, rew0, done0, info0 = env.step(a0) # environment returns observation,
# reward, and boolean flag indicating if the episode is complete.
a1 = agent.act(ob1)
ob2, rew1, done1, info1 = env.step(a1)
...
a99 = agent.act(o99)
ob100, rew99, done99, info2 = env.step(a99)
# done99 == True => terminal

RL is not a fit for every problem. Alternative approaches with better explainability and determinism include behavior trees, vectorization/VectorNet, …

DeepMind says reinforcement learning is ‘enough’ to reach general AI – https://news.ycombinator.com/item?id=27456315

Richard Sutton and Andrew Barto’s book on RL: An introduction.

This paper explores incorporating Attention mechanism with Reinforcement learning – Reinforcement Learning with Attention that Works: A Self-Supervised Approach. A video review of the ‘Attention is all you need’ is here, the idea being to replace an RNN with a mechanism to selectivity track a few relevant things.

Multi agent Deep Deterministic Policy Gradients – cooperation between agents. https://www.youtube.com/watch?v=tZTQ6S9PfkE. Agents learn a centralized critic based on the observations and actions of all agents. https://arxiv.org/pdf/1706.02275.pdf .

Multi-vehicle RL for multi-lane driving. https://arxiv.org/pdf/1911.11699v1.pdf

Reinforcement learning in chip design

October 16, 2021January 3, 2022 · Leave a comment ·

Deep learning is being applied to combinatorial optimization problems. A very intriguing talk by Anna Goldie discussed an application of RL to chip design that cuts down the time for layout optimization and which in turn enables optimizing of the chip design for a target software stack in simulation before the chip goes to production. Here’s a paper – graph placement methodology for fast chip design.

A snippet on how the research direction evolved to a learning problem.

“Chip floorplanning as a learning problem

The underlying problem is a high-dimensional contextual bandits problem but, as in prior work, we have chosen to reformulate it as a sequential Markov decision process (MDP), because this allows us to more easily incorporate the problem constraints as described below. Our MDP consists of four key elements:
(1) States encode information about the partial placement, including the netlist (adjacency matrix), node features (width, height, type), edge features (number of connections), current node (macro) to be placed, and metadata of the netlist graph (routing allocations, total number of wires, macros and standard cell clusters).
(2) Actions are all possible locations (grid cells of the chip canvas) onto which the current macro can be placed without violating any hard constraints on density or blockages.
(3) State transitions define the probability distribution over next states, given a state and an action.
(4) Rewards are 0 for all actions except the last action, where the reward is a negative weighted sum of proxy wirelength, congestion and density, as described below.

We train a policy (an RL agent) modelled by a neural network that, through repeated episodes (sequences of states, actions and rewards), learns to take actions that will maximize cumulative reward (see Fig. 1).
We use proximal policy optimization (PPO) to update the parameters of the policy network, given the cumulative reward for each placement.”

Their diagram:

“An embedding layer encodes information about the netlist adjacency, node features and the current macro to be placed. The policy and value networks then output a probability distribution over available grid cells and an estimate of the expected reward for the current placement, respectively. id: identification number; fc: fullyconnected layer; de-conv: deconvolution layer”

A graph placement methodology for fast chip design | Nature

“Fig. 1 | Overview of our method and training regimen.In each training iteration, the RL agent places macros one at a time (actions, states and rewards are denoted byai, si and ri, respectively). Once all macros are placed, the standard cells are placed using a force-directed method. The intermediate rewards are zero. The reward at the end of each iteration is calculated as a linear combination of the approximate wirelength, congestion and density, and is provided as feedback to the agent to optimize its parameters for the next iteration.”

The references mention a number of applications of ML to chip design. A project exploring these is at https://github.com/The-OpenROAD-Project at https://theopenroadproject.org/wp-content/uploads/2021/11/demo-lounge-slides.pdf

PyTorch

October 9, 2021June 19, 2023 · Leave a comment ·

PyTorch is an open source machine learning framework that is primarily used for building deep learning models. The framework is built on top of the Torch library and is implemented in Python, with support for C++ and CUDA.

The main C++ classes in PyTorch are:

Tensor: This is the core object in PyTorch and represents a multi-dimensional array. Tensors are the basic building blocks of a PyTorch model and are used to store and manipulate data.
Autograd: This is PyTorch’s automatic differentiation engine, which allows developers to compute gradients of tensors with respect to a loss function. The autograd module also provides a set of functions for computing gradients of complex functions.
nn.Module: This is a base class for all neural network modules in PyTorch. It provides a convenient way to define and organize layers of a neural network, as well as a set of useful methods for training and evaluating the model.
Optimizer: This is a class that implements various optimization algorithms, such as stochastic gradient descent (SGD), Adam, and Adagrad. The optimizer is used to update the parameters of a model during training.
DataLoader: This is a utility class that provides an efficient way to load and preprocess large datasets for training a model. The DataLoader class can be used to batch and shuffle data, as well as to apply various transformations to the data.

PyTorch’s autograd engine implements a variant of reverse-mode automatic differentiation, which is also known as backpropagation. This algorithm efficiently calculates the gradients of the output with respect to each input variable by traversing the computational graph in reverse order, propagating the gradients backwards through each operation using the chain rule.

Chainer Variables implements this implementation of the chain-rule of differentiation and their doc explains this well – https://docs.chainer.org/en/latest/guides/variables.html

Step by step example of automatic differentiation – https://stats.stackexchange.com/questions/224140/step-by-step-example-of-reverse-mode-automatic-differentiation

https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html

Getting started with CUDA via pytorch –


>>> import torch
>>> torch.cuda
<module 'torch.cuda' from '/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/cuda/__init__.py'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name()
'Tesla V100-SXM2-16GB'
>>> torch.cuda.memory_allocated()
0
>>> torch.cuda.get_device_properties(0).total_memory
16935419904
>>> import pynvml
>>> from pynvml import *
>>> nvmlInit()
>>> h = nvmlDeviceGetHandleByIndex(0)
>>> torch.cuda.mem_get_info()
(16112549888, 16935419904)
>>> var1=torch.FloatTensor([1.0,2.0,3.0]).cuda()
>>> var1
tensor([1., 2., 3.], device='cuda:0')
>>> var1.device
device(type='cuda', index=0)
>>> import torch.nn as nn

ML for Forecasting

September 19, 2021April 20, 2023 · Leave a comment ·

In this paper – “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks”, the authors discuss a method for learning a global model from several individual time series.

Let’s break down some aspects of the approach and design.

“In probabilistic forecasting one is interested in the full predictive distribution, not just a single best realization, to be used in downstream decision making systems.”

The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).

Recurrent Neural Network is used to refer to NNs with an infinite impulse response, and are used for speech recognition, handwriting recognition and such tasks involving sequences. https://en.wikipedia.org/wiki/Recurrent_neural_network

An LSTM or The Long Short-Term Memory (LSTM) is a type of RNN, that came about to solve a problem of vanishing gradients in previous RNN designs. An LSTM cell can process data sequentially and keep its hidden state through time.

A covariate is an independant random variable, with which the target random variable is assumed to have some covariance.

The approach has distinct features described in this snippet

“In addition to providing better forecast accuracy than previous methods, our approach has a number key advantages compared to classical approaches and other global methods: (i) As the model learns seasonal behavior and dependencies on given covariates across time series, minimal manual feature engineering is needed to capture complex, group-dependent behavior; (ii) DeepAR makes probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all sub-ranges in the prediction horizon; (iii) By learning from similar items, our method is able to provide forecasts for items with little or no history at all, a case where traditional single-item forecasting methods fail; (vi) Our approach does not assume Gaussian noise, but can incorporate a wide range of likelihood functions, allowing the user to choose one that is appropriate for the statistical properties of the data.
Points (i) and (iii) are what set DeepAR apart from classical forecasting approaches, while (ii) and (iv) pertain to producing accurate, calibrated forecast distributions learned from the historical behavior of all of the time series jointly, which is not addressed by other global methods (see Sec. 2). Such probabilistic forecasts are of crucial importance in many applications, as they—in contrast to point forecasts—enable optimal decision making under uncertainty by minimizing risk functions, i.e. expectations of some loss function under the forecast distribution.”

A good introduction to Logistic Regression by looking at why Linear Regression does not work well for prediction of binary outcomes – https://www.youtube.com/watch?v=gNhogKJ_q7U&list=PLzMw_44KVqI9EYeGmNO-W9O1PgXReaanW&index=7

Facebook Prophet is an open-source library for forecasting – https://facebook.github.io/prophet/

ARMA – AutoRegressive Moving Average Estimator

ARIMA estimator – AutoRegressive Integrated Moving Average is a generalization of ARMA and can better handle non-stationarity in a time series.

OpenAI scaling kubernetes cluster to 7500 nodes

June 13, 2021May 2, 2023 · Leave a comment ·

Interesting use of MPI, anti-affinity, gang scheduling, alias IP addresses. some bits I found interesting below.

https://openai.com/research/scaling-kubernetes-to-7500-nodes

“A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. Any NUMA, CPU, or PCIE resource contention aren’t factors for scheduling. Bin-packing or fragmentation is not a common problem. Our current clusters have full bisection bandwidth, so we also don’t make any rack or network topology considerations. All of this means that, while we have many nodes, there’s relatively low strain on the scheduler.”

“We use iptables tagging on the host to track network resource usage per Namespace and pod. This lets researchers visualize their network usage patterns. In particular, since a lot of our experiments have distinct Internet and intra-pod communication patterns, it’s often useful to be able to investigate where any bottlenecks might be occurring.”

Previous blog on scaling to 2500 nodes –

https://openai.com/research/scaling-kubernetes-to-2500-nodes

“the default setting for Fluentd’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again”

“ARP cache size increase..is particularly relevant in Kubernetes clusters since every pod has its own IP address which consumes space in the ARP cache”

“We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly. The default kube-scheduler policy is to spread out load evenly among nodes, but we want the opposite so that unused nodes can be terminated and also so that large pods can be scheduled quickly.”

Disaster Recovery: Understanding and designing for RPO and RTO

January 30, 2021December 5, 2021 · Leave a comment ·

Let’s take a disaster scenario where a system loses its data-in-transit (i.e. not yet persisted) at a certain point in time. and some time after this point, a recovery process kicks in, which restores the system back to normal functioning.

Recovery Point Objective refers to the amount of tolerable data loss measured in time. It can be measured in time based on the fact that it is in-transit data of a certain max velocity, so bounding the time bounds the amount of data that can be lost. This time figure, the RPO, is used to determine how frequently the data must be persisted and replicated. An RPO of 10 minutes implies the data must be backed up every 10 minutes. If there’s a crash the system can be restored to a point not more than 10 minutes prior to the time of crash. RPO determines frequency of backups, snapshots or transaction logs.

Recovery Time Objective refers to the amount of time required to restore a system to normal behavior after a disaster has happened. This includes restoration of all infrastructure components that provide a service, not just the restoration of data.

Lower RPO/RTO is higher cost.

Matrix of RPO – high/low vs RTO – high/low can be used to categorize applications.

Low RPO, Low RTO. Critical online application like a storefront.

Low RPO, High RTO. Data sensitive application but not online, like analytics.

High RPO, Low RTO. Redundantly available data or no data. Compute clusters that are highly available.

High RPO, High RPO. Non-prod systems – dev/test/qa ?

Amount of acceptable data loss <= App (data?) criticality.
One can expect a pyramid of apps – large number with less criticality, small number with high criticality

Repeatability. Backup and recovery procedures. Must be written. Must be tested. Automation.

HA/DR spectrum of solutions:

Backups, save transaction logs
Snapshots
Replication – synchronous, asynchronous
Storage only vs in-memory as well. Application level crash consistency of backups.
Multiple AZs
Hybrid

Tech: S3 versioning and DDB streams, Global tables.

Rules of thumb:

test full recovery regularly, at least once an year.
backup, backup, backup
3-2-1 Rule. https://us-cert.cisa.gov/sites/default/files/publications/data_backup_options.pdf
Keep at least 3 copies of data in at least 2 media types, and at least one off-site backup.

Related terms: RPA and RTA

3 types of disasters.

Natural disaster – e.g. floods, earthquakes, fire
Technical failure – e.g. loss of power, cable pulled
Human error – e.g. delete all files as admin

Replication – works for first two. Continuous snapshots/backup/versioning – for the last one. Replication will just delete the data on both sides. Need the ability to go back in time and restore data.

Cost – how to optimize cost of infrastructure and its maintenance.

Which region to choose ? Key considerations: What types of disasters are the concern (Risk). How much proximity is needed to end-customers and to primary region (Performance). What’s the cost of the region (Cost) ?

Delta Lake and Spark for threat detection and response at scale

October 25, 2020January 18, 2022 · Leave a comment ·

Notes on a talk on the data platform for Threat detection and response at scale at Spark+AI Summit, 2018.

The threat detection problem, use-cases and scale.

It’s important to focus on and build the data platform first else one can get siloed into narrow set of addressable use-cases.
we want to detect attacks,
contextualize the attacks
determine root cause of an attack,
determine what the scope of the incident might be
determine what we need to contain it
Diverse threats require diverse data sets
the Threat signal can be concentrated or spread in time
Keylines visualization library is used to build a visualization of detection, contextualization, containment

Streaming is a simple pattern that takes us very far for detection

Streams are left-joined with context and filtered or inner-joined with indicators
Can do a lot with this but not everything
Graphs are key. Graphs at scale are super hard.

Enabling triage and containment with search and query

to triage the detection, it comes down to search and query.
ETM does 3.5million records/sec. 100TB of data a day. 300B events a day.
11 trillion rows, 0.5PB of data.

Ingestion architecture – tries to solve all these problems and balance issues.

data comes into s3 in a consistent json wrapper
there’s a single ETL job that takes all the data and writes it into a single staging table which is partitioned by date and event-type, has a long retention
table is optimized to stream new data in and stream data out of, but can be queried as well. you can actually go and query it using sql function
highest value data – we write parsers, we have discrete parsing streams and put them into a common schema and put it into separate delta tables. well parsed, well structured.
use optimizations from delta, z-odering.. to index over columns that are common predicates. search by IP address, domain names – those are what we order by
indexing and z-ordering – take advantage of data skipping
sometimes we parser code gets messed up.
single staging table.. is great . we just let the fixed parser run forward, we have all the data corrected, then we are back-corrected. don’t have to repackage code and run as a batch job. we literally just fix code and run it in the same model that’s it.
off of these refined tables or parsed data sets, this is where the detection comes in.
we have a number of detection streams in batches, that do the logic and analysis. facet-faced or statistical.
alerts that come out of this go to their own alert table. goes to delta again. long retection, consistent schema. another streaming job then does de-duplication and whitelisting and writes out alerts to our alert management system. we durably store all the alerts, whether or not de-duped/whitelisted
allows us to go back and fix things if things are not quite correct, accidentally.
all this gives us operational sanity, and a nice feedback loop

Thanks to z-ordering, it can go from scanning 500TB to 36TB.

average case is searching over weeks or months. it makes it usable for ad-hoc refinements.
simple, unified platform.

Michael: Demo on interactive queries over months of data

first attempt is sql SELECT on raw data. takes too long, cancelled. second attempt uses HMS, still too long, cancelled. why is this so hard ?
observation: every data problem is actually two problems 1) data engineering and 2) data science. most projects fail on step 1.
doing it with delta – the following command takes 17s and fires off spark job to put the data in a common schema.

CREATE TABLE connections USING delta AS SELECT * from json.'/data/connections'

then

SELECT * FROM connections WHERE dest_port = 666

this is great to query the historical data quickly.. however batch alone is not going to cut it as we may have attacks going on right now. but delta plugs into streaming as well:

INSERT INTO connections SELECT * from kafkaStream

Now we’ve Indexed both batch and streaming data.

We can run a python netviz command to visualize the connections.

—

Here’s a paper on the Databricks Delta Lake approach – https://databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf .

AWS Builders Library

January 26, 2020November 17, 2021 · Leave a comment ·

A few interesting ideas from AWS Builders Library –

design APIs to be safe for retries or idempotent for resiliency
shuffle sharding to distribute load as system scales
Amazon’s approach to failing successfully.

Toyota’s Five-why’s approach to root cause a problem – is good but not enough to find all other root causes that might also cause a problem.

Couple great talks on serverless

https://www.youtube.com/watch?v=dzU_WjobaRA

https://www.youtube.com/watch?v=Fup5vHEvU50

Smithy is an Apache-2.0 licensed, protocol-agnostic IDL for defining APIs, generating clients, servers and documentation. https://github.com/awslabs/smithy

Secure Machinery

On the evolution of security and intelligent machinery