An Agent is in an Environment:
a) The Agent reads Input (State) from the Environment.
b) The Agent produces Output (Action) that affects its State relative to the Environment.
c) The Agent receives a Reward (feedback) for the Output produced.
With the reward/feedback it learns to produce better Output for a given Input.
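The loop above can be sketched in a few lines. This is a toy illustration, not any particular RL library: the "environment" is just an integer state the agent tries to drive toward zero, and the hand-coded "policy" stands in for a learned one.

```python
class Environment:
    """Toy environment: state is an integer; reward is higher nearer zero."""
    def __init__(self):
        self.state = 5

    def step(self, action):
        self.state += action             # b) the Action affects the State
        reward = -abs(self.state)        # c) feedback: closer to 0 is better
        return self.state, reward

env = Environment()
state = env.state                        # a) read the State
for _ in range(10):
    # A trivial hand-coded "policy"; a learner would improve this from reward.
    action = -1 if state > 0 else (1 if state < 0 else 0)
    state, reward = env.step(action)
```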

Where do neural networks come in?

Optimal control theory considers controlling a dynamical system so that an objective function is optimized (applications: stability of rockets, helicopters). Pontryagin’s principle gives a necessary condition for solving the optimal control problem: the control should be chosen so as to optimize the control Hamiltonian. The control Hamiltonian is inspired by the classical Hamiltonian and the principle of least action.

Derivatives are needed for continuous optimization. Deep learning models implement continuous non-linear transforms that are differentiable end to end, so gradients can be computed through them, and they can be trained automatically using real-world inputs, outputs and feedback. A neural network can therefore provide a system for sophisticated, feedback-based non-linear optimization from Input space to Output space.

The above could be accomplished by a feedforward neural network trained with feedback (a reward). Additionally, a *recurrent* neural network could encode a memory into the system by referencing previous states (likely with higher training and convergence costs).

Model-free reinforcement learning does not explicitly learn a model of the environment; it learns a value function or policy directly from sampled experience.
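Tabular Q-learning is a standard example of a model-free method: it updates action-value estimates from individual transitions (s, a, r, s') without ever learning the environment's dynamics. A minimal sketch (state/action names and the step values here are illustrative):

```python
# One tabular Q-learning update: learn action values directly from a
# sampled transition, with no model of the environment.
import collections

alpha, gamma = 0.5, 0.9                  # learning rate, discount factor
Q = collections.defaultdict(float)       # Q[(state, action)] -> value

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_update("s0", "left", 1.0, "s1", actions=["left", "right"])
```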

Manifestations of RL: the Udacity self-driving course (lane detection); Karpathy’s RL blog post, a good explanation of a network structure that can produce policies in a malleable manner, called policy gradients.
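The core idea behind policy gradients, in the spirit of Karpathy's post, is to nudge parameters so that actions followed by high reward become more probable. A minimal REINFORCE-style step for a two-action softmax policy (everything here is a toy sketch, not code from the post):

```python
import numpy as np

theta = np.zeros(2)                      # one logit per action

def softmax(x):
    e = np.exp(x - x.max())              # shift for numerical stability
    return e / e.sum()

def reinforce_step(action, reward, lr=0.1):
    """Gradient-ascent step on reward * log pi(action)."""
    global theta
    probs = softmax(theta)
    grad_logp = -probs                   # d/dtheta log pi(action | theta)
    grad_logp[action] += 1.0
    theta = theta + lr * reward * grad_logp

before = softmax(theta)[0]
reinforce_step(action=0, reward=1.0)     # action 0 was followed by reward
after = softmax(theta)[0]                # action 0 is now more probable
```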

Raw inputs vs model inputs: the problem of mapping inputs from the real world to the actual inputs of a computer algorithm. Volume/quality of information: high vs low requirements. The exploitation vs exploration dilemma.

AWS DeepRacer. It simplifies the mapping of camera input to computer input, so one can focus more on the reward function and the deep learning aspects. The car has a set of possible actions (change heading, change speed); the task is to predict the actions based on the inputs.
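DeepRacer reward functions take a dict of input parameters and return a float (the parameter names used here, `track_width`, `distance_from_center` and `speed`, are among those in the AWS documentation; the thresholds and weighting are purely illustrative):

```python
def reward_function(params):
    """Illustrative DeepRacer-style reward: stay near the centerline, go fast."""
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]
    speed = params["speed"]

    # Reward proximity to the centerline in bands of the track width.
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    else:
        reward = 0.01                    # nearly off-track

    # Scale the reward up with speed.
    reward *= 1.0 + speed / 4.0
    return float(reward)
```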

What are some strategies applied to winning DeepRacer?

- Implementation of the pure pursuit path tracking algorithm, used by Scott Pletcher.
- Explicit reward based on proximity, distance and speed, by Daniel Gonzalez and team: https://towardsdatascience.com/an-advanced-guide-to-aws-deepracer-2b462c37eea
- https://medium.com/dbs-tech-blog/an-introduction-to-aws-deepracer-from-a-2020-world-championship-finalist-3a63b5c8d8aa Fully autonomous vs Semi-autonomous. Input parameters for the reward function. Log analysis for optimizing the models.
- Faster training vs slower training – https://falktan.medium.com/aws-deepracer-how-to-train-a-model-in-15-minutes-a07ab77fb793 (PPO takes a full lap to learn; line-of-sight learning converges in sub-lap distances).
- Soft actor-critic (SAC) algorithm. SAC demystified – https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665 . SAC works to increase entropy (to encourage exploration) and not just maximize rewards.
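The entropy point can be made precise: SAC maximizes expected return plus a policy-entropy bonus, with a temperature α trading off exploration against reward (standard form of the maximum-entropy objective):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting α = 0 recovers the usual reward-only objective; larger α pushes the policy toward more random (exploratory) behavior.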

Reward function input parameters – https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-reward-function-input.html

“*DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning*” – https://arxiv.org/pdf/1911.01562.pdf

Alternative approaches – behavior trees, vectorization/VectorNet, …

DeepMind says reinforcement learning is ‘enough’ to reach general AI – https://news.ycombinator.com/item?id=27456315

The Multi-Armed Bandit problem is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff. The name comes from imagining a gambler at a row of slot machines (sometimes known as “one-armed bandits”), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. In the problem, each machine provides a random reward from a probability distribution specific to that machine, which is not known a priori. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between “exploitation” of the machine that has the highest expected payoff and “exploration” to get more information about the expected payoffs of the other machines.
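A simple strategy for this problem is epsilon-greedy: with probability ε pull a random arm (explore), otherwise pull the arm with the best estimated payoff so far (exploit). A sketch with three Bernoulli arms whose payoff probabilities are made up for illustration:

```python
import random

random.seed(0)
true_means = [0.2, 0.5, 0.8]             # unknown to the gambler
counts = [0, 0, 0]                       # pulls per arm
estimates = [0.0, 0.0, 0.0]              # running mean reward per arm
epsilon = 0.1

for t in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(3)                # explore
    else:
        arm = estimates.index(max(estimates))    # exploit
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # Incremental running-mean update of the arm's estimated payoff.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

After enough pulls the best arm dominates: exploration keeps every estimate improving, while exploitation funnels most pulls to the arm with the highest estimate.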

Richard Sutton and Andrew Barto’s book *Reinforcement Learning: An Introduction*.

This paper explores incorporating an attention mechanism into reinforcement learning – “Reinforcement Learning with Attention that Works: A Self-Supervised Approach”. A video review of “Attention Is All You Need” is here; the idea is to replace an RNN with a mechanism that selectively tracks a few relevant things.