Reinforcement Learning: An Introduction

Reinforcement learning is a form of machine learning that mimics the way humans and animals learn: building intelligence through trial and error in an unknown environment. It is widely used in robotics. The following is my summary of Section 1.1 of Reinforcement Learning: An Introduction, by Richard Sutton and Andrew Barto.

Reinforcement learning is a form of adaptive intelligence: the system seeks to maximize a reward signal and adjusts its behavior accordingly. It takes actions and receives sensory input that lets it detect the consequences of those actions. Reinforcement learning is a closed-loop system in that past actions influence future options. The agent operates in uncharted surroundings: it senses its current state, takes an action, and then learns from the consequences of that action. This learning rests on the concept of the expectation of a random variable, where the expected consequence of an action is estimated as the average of the previously realized consequences of that action.
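The averaging idea above can be sketched concretely. This is a minimal illustration, not code from the book: it estimates an action's expected reward as the running average of the rewards observed so far, using the incremental update Q_{n+1} = Q_n + (1/n)(R_n - Q_n). The class and variable names are my own.

```python
# Estimate an action's expected reward as the sample average of
# observed rewards, updated incrementally (illustrative sketch).
class ActionValueEstimate:
    def __init__(self):
        self.n = 0        # number of times the action has been taken
        self.value = 0.0  # current estimate of the expected reward

    def update(self, reward):
        # Incremental form of the sample average:
        # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        self.n += 1
        self.value += (reward - self.value) / self.n

est = ActionValueEstimate()
for r in [1.0, 0.0, 1.0, 1.0]:
    est.update(r)
print(est.value)  # 0.75, the average of the four observed rewards
```

The incremental form avoids storing the full history of rewards: only the count and the current estimate are kept.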

Reinforcement learning is distinct from both supervised and unsupervised learning. In supervised learning, the agent is given the correct action for a set of cases and then generalizes to other, similar cases. In unsupervised learning, the agent extracts structure from a set of unlabeled data. Reinforcement learning, on the other hand, involves an agent acting with unknown inputs, unknown consequences, and a known goal. There is also an element of state progression: past actions affect the future state and the set of actions available at later times.
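The notion of state progression can be made concrete with a toy environment. This is a hypothetical sketch of my own, not from the book: the current state determines which actions are available, and each action determines the next state.

```python
# A toy environment illustrating state progression: past actions
# determine the current state, and the current state determines
# which actions are available next (all names are illustrative).
TRANSITIONS = {
    "start":    {"left": "dead_end", "right": "hallway"},
    "dead_end": {"back": "start"},
    "hallway":  {"forward": "goal", "back": "start"},
    "goal":     {},  # terminal state: no actions available
}

def available_actions(state):
    return sorted(TRANSITIONS[state])

state = "start"
for action in ["right", "forward"]:
    state = TRANSITIONS[state][action]
print(state)                          # goal
print(available_actions("dead_end"))  # only "back" is available here
```

Taking "left" at the start would have led to a dead end where the only option is to go back, so early choices constrain what the agent can do later.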

The reinforcement learning agent starts by taking arbitrary actions and observing the reward each one produces. At this point it is exploring, purely collecting knowledge. As it learns that some actions optimize the reward signal better than others, it can start to exploit this knowledge and favor the actions it has found to produce the highest reward. If the agent stops exploring entirely, however, it may miss actions it has never tried that yield a higher reward than those it already knows. This trade-off between exploration and exploitation must be balanced: a system that does not both explore and exploit will fail.
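One simple way to balance this trade-off, which Sutton and Barto develop later in the book, is an epsilon-greedy rule: with small probability epsilon the agent explores a random action, and otherwise it exploits its current best estimate. The sketch below is my own illustration on a multi-armed bandit; the reward probabilities, epsilon value, and function names are assumptions.

```python
import random

# Epsilon-greedy action selection on a Bernoulli multi-armed bandit
# (illustrative sketch; parameters are arbitrary choices).
def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_actions = len(true_means)
    counts = [0] * n_actions    # times each action was taken
    values = [0.0] * n_actions  # sample-average reward estimates

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a uniformly random action.
            a = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the highest estimate so far.
            a = max(range(n_actions), key=lambda i: values[i])
        # Bernoulli reward: 1 with probability true_means[a], else 0.
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]
    return values, counts

values, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
# After many steps the agent should take the best arm (index 2) most often,
# while the small epsilon keeps it occasionally sampling the other arms.
```

Setting epsilon to 0 would make the agent purely greedy, and it could lock onto a mediocre arm forever; setting it too high wastes steps on actions already known to be poor.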