Reinforcement Role in Learning Essay (Critical Writing)


This paper discusses the role of reinforcement in learning and supports the idea that all behavior is the product of reinforcement. Agents such as students in a language class act according to the teaching they have acquired over time. They may not respond visibly to a stimulus presented in the moment, but that observation is not a reason to assume that reinforcement learning is not taking place. This paper shows that learning relies on a person’s reinforcement history. The opposing argument, which this paper does not support, states that reinforcement history has its limits in affecting behavior. A living thing does not have to incur new reinforcement of the same or a different stimulus to develop a memory that helps it make new decisions. The paper shows that reinforcement is continuous and that its effect does not end with time. The conclusion is that learning is a product of reinforcement, although reinforcement can take many forms. This finding implies that researchers will have to incorporate diverse theories into their discussions of research findings to keep their results and interpretations credible. Practitioners, for their part, ought to deepen their awareness of the concepts of learning before choosing particular systems or technologies.

In reinforcement learning, the person or animal receiving and interpreting new information has to evaluate whether to use the new information as it is or to combine it with existing information to make a decision. This paper supports the idea that all observed behavior in a person or an animal is due to reinforcement. The working understanding is that learning involves both the information stored in memory and the incoming information that acts as a stimulus. The learner is usually referred to as an agent, and anything else, whether a person, factor, or condition, that affects or causes learning belongs to the agent’s environment. In the process of interacting with new information and retrieving stored information, the person or animal being observed can exhibit different behaviors and reactions to the same stimulus. As part of establishing literature support for reinforcement learning, this paper asks whether there is anything wrong with the theory, and it also seeks to find out whether alternative methods or theories can help explain the learning process. Although research on different models of reinforcement learning continues, one concept remains valid at all times, as suggested by the literature examined in this paper.

Reward and punishment reinforce behavior and become part of the memory (Shteingart & Loewenstein, 2014). The intensity of the reward and punishment affects the memory of the event, the environment, and its future cognition by the agent. Thus, the total number of reward and punishment events contributes to the overall learned experience of the agent.

People learn continuously because they are always retrieving information from memory and using it to interpret the new information received. At the same time, they rely on new information to understand the previous information stored in memory. While doing so, they gather the experience of different reinforcements that inform their present and future decisions for exploiting stored knowledge or exploring new knowledge to increase understanding. The information that individuals get can be in various forms, like direct instruction or punishment after acting.

Why the Topic is Important

Reinforcements come in different forms, and some may seem to work better than others. As a result, applying reinforcement learning will likely take an experimental outlook as people seek to find the best way to achieve the desired behavior by weighing several options. A person seeking to induce behavior in another individual may use a particular technique that relies on reinforcement learning and still fail to achieve the objective. For example, training a person to associate binge eating with bodily harm can fail to work even when a person follows the right procedures of reinforcing the image of physical injury. If this happens, then the person might end up erroneously thinking that the method or the technology used is more important than the concept of reinforcement. Therefore, people need to understand how reinforcement works so that they do not succumb to false assumptions that the concept does not work in influencing behavior. Any study that helps to clarify the way reinforcement works will be useful to practitioners and future researchers.

The apparent absence of reinforcement does not imply that learning will not take place or is not taking place. People and animals behave according to the historical reinforcements they have experienced. The behavior observed may not always be due to present reinforcement; it can be due to past reinforcement, which overshadows the current reinforcement. Learning, however, will not take place without any reinforcement at all.

The main argument against the concept of reinforcement is that learning can occur without it. Support for this argument comes from several studies on learning and the role of reinforcement. One example is an experiment with rats in which the animals failed to form conditioned responses (Iordanova, Good, & Honey, 2008). The study hypothesized that the results would demonstrate the ability to learn without reinforcement. The researchers wanted to show that rats conditioned to behave in a given way would change their behavior when presented with a new stimulus; they expected this to happen because the new stimulus increased the complexity of the behavior and forced the rats to lose their coordination. The findings were significant because they contradicted the study’s hypothesis and exposed flaws in its intention and objectives. Changes in behavior were expected not to conform to changes in the incentive, yet the study concluded that the rats formed associations that signaled the presence and absence of reinforcement. The research also showed that the rats changed their behavior to conform only partially to the stimulus. Iordanova, Good, and Honey (2008) went on to suggest that rats could form an integral or configural memory that does not require reinforcement, basing this conclusion on the partial rather than the full association. Their argument that reinforcement is not necessary for learning is therefore not valid.

Additional suggestions that dispute the need for reinforcement claim that learning occurs naturally after performing a task and that learners become familiar with a new concept through sufficient repetition. On this view, learning needs only sufficient time and resources to complete tasks for a personal reward that is not associated with reinforcement; the argument is that time matters more than reinforcement. Those who support this argument appear to believe that the brain is capable of generating self-reinforcement and is not dependent on external reinforcement. This implies that some types of reinforcement work while others do not. Supporters of the idea that time is all that is required to influence behavior emphasize the influence of cost, time, and resources in controlling behavior, treating reinforcement as only a byproduct of these other features. They state that learning outcomes can still be achieved without reinforcement in the learning process, so learners do not need to work with tests, use summaries, or engage in class discussions in a typical class setting (Ringbauer & Neumann, 2011). The fault in this argument is that it fails to consider reinforcement history. It considers only present reinforcement, which is the wrong way to approach the principle of reinforcement and its association with behavior.

In summary, the argument is that introducing a new stimulus causes complexity in learning, such that individuals will not show full behavior response to reinforcement. Also, reinforcement only works partially when there is sufficient time to allow it to affect behavior. The counterargument will be presented in the next section, showing that it is not possible to neglect the history of reinforcement when making behavior interpretations.

This paper reiterates that learning is the result of a reinforcement history. Even though some situations may seem to lack reinforcement, they still have a historical association with reinforcement. A person may respond to a stimulus presented to him or her by not showing any action or behavior because this is a responsive option arising from the person’s history of reinforcement. In this case, the presented stimulus is reinforcing other behaviors that were already learned by an individual. Therefore, any stimulus acts as reinforcement for a particular behavior. A person can use past learned behavior to respond to present reinforcement. Therefore, reinforcement does not have to influence a particular kind of behavior; it may influence a few or many behaviors at the same time.

The influence happens in the present or in the future. While in most cases this is true, it is false in others because the agent responds to past reinforcement. The observed behavior of a person can be due to reinforcement introduced in the past that works in support of other reinforcements presented now. While someone may introduce behavior reinforcement to another person, he or she must also realize that the surrounding environment reinforces wanted or unwanted behavior as well. Thus, in complex situations, learning may continue to occur, and the memory involved could be too broad to comprehend. Once practitioners in learning understand this fact, they should not succumb to the false assumption that reinforcement is not necessary.

Learning is an effect that occurs when a person acts in a given way. Whether or not the person is operating within specified control factors in an artificial environment does not significantly affect the learning process (Anderson & Elloumi, 2004). People learn by selecting from the available alternatives and opting for the one that they find most useful to their learning practices or goals. Therefore, a learning method and its effect do not necessarily point to the superiority of that method when it is applied randomly to other subjects.

Short-term memory is a reaction to a stimulus in the environment. The combination of the stimulus with already associated meaning in the mind leads to the development of long-term memory. Everything that a person stores in long-term memory serves as an aid to the subsequent cognition of new stimuli that the person encounters. The preferred pedagogies of teaching today are the results of reinforcement that has yielded positive results and contributed to overall positive learning outcomes. As a consequence, they are now standards that new learners and teachers embrace by virtue of their association with success. In other words, the pedagogies are reinforced schemes that students and teachers can tap into to aid their learning tasks.

In learning, the learner is an agent and the agent interacts with the environment. The presence of a teacher only enriches the agent’s environment, but learning only involves the agent’s interaction with the environment at all times. An important consideration for the agent and environment relationship is the fact that the boundary separating them is not easily definable at all times. Actions and reactions are also not always in sequence. For example, a person can process a stimulus and fail to act on it, until later, after responding to stimulus from additional sources. In this case, the person reacting to a stimulus determines the extent to which the stimuli will influence his or her behavior (Anderson & Elloumi, 2004).

Decision-making becomes a critical skill in any environment that humans or animals comprehend only partially. People will therefore go on to gain rewards, which are positive outcomes of their decisions, or suffer punishments for trying to find solutions and making the wrong decisions. Questions are raised as to whether reinforcement history works for model-based choices, otherwise known as goal-oriented actions, in the same way that it works for model-free choices, otherwise called habits. Goal-oriented actions follow anticipated outcomes (Dayan & Daw, 2008). Agents act so that they limit their deviation from the goal, and they will immediately reverse their previous action when they find that it does not lead toward the desired goal. One may think that reinforcement learning has no significant use when dealing with a goal-oriented action, but that is a false assumption. Even with future goals and present actions, agents continue to act based on their collective knowledge and their ability to control their environments. The fact that an action is a product of previous thoughts and experiences of reward or punishment shows that a goal-oriented action is a result of one’s reinforcement history (Dayan & Daw, 2008).

Another important thing to note is that technologies, just like teachers, enrich the learner’s environment, but they may not necessarily affect the learner’s performance. This happens when the reward or punishment stimuli provided by the technology do not reach a level that would cause the agent to react and learn. When the instructional strategies are right, any medium used for learning will be useful (Anderson & Elloumi, 2004). In agreeing that it is important for the stimulus to reach the required level to cause an appropriate reaction from the agent, Jones et al. (2013) use self-determination theory to explain their point about facilitating behavioral parent training. According to the authors, the major failures in expanding behavioral parent training into real-world therapy settings are due to the failure to follow the correct techniques. Rather than simply increasing the availability of the training, it is important to make practitioners aware of the role that core components, such as positive reinforcement, assigning and reviewing homework, and role-playing, have on the overall results of learning.

As with any learning, the need to control the environment sometimes becomes very costly and forces practitioners to seek affordable ways of meeting the demands of learning. Standardizing known methods and discarding costly interventions are some of the solutions used (Tittle, Antonaccio, & Botchkovar, 2012). Unfortunately, this also removes some of the agent’s control abilities in the learning process, thus hindering learning. When the agent fails to react appropriately, it shows that the agent has learned alternative responses. Thus, instead of assuming that learning has not occurred, it is important to note that it has, but in a different way than what was desired.

In online learning, practitioners have moved toward constructivism and away from behavioral and cognitive psychology. In the behaviorist school, learning is due to external stimuli in the environment. Cognitive psychologists, meanwhile, see learning as the use of memory, motivation, and thinking, with reflection playing a crucial role in the outcome. Thus, a learner with a high processing capacity will be in an excellent position to gain knowledge compared with others who have limited processing capabilities. In constructivist thought, however, learning happens by interpretation followed by personalization. Learners actively interpret their environment and incorporate what they find into what they already know to give rise to new personal knowledge. The new personal experience influences subsequent interpretation and incorporation during future learning.

What emerges from the different thoughts is that some principles persist throughout the independent views. It is a suggestion that there is a comprehensive way in which learners learn. Moreover, one thought alone can only provide a particular explanation, but it cannot fit all learning situations. Additionally, the changes in technology, learning environments, and learning motivations will favor one thought over another, which explains why there are scholars who doubt the significance of the stimulus and reinforcement in the learning process. Nevertheless, the underlying fact is that even without direct observation, the cumulative literature on the subject suggests that reinforcement plays a crucial role in learning. It influences the choice to exploit or explore and can be a differentiating factor between the various decisions made by learners about their reactions to stimuli.

Unfortunately, those arguing against reinforcement have not considered the compounding effects of reinforcement history when observing responses to a stimulus. Research also indicates that internal reinforcement alone is not enough to allow the brain to perform motion and orientation tasks. In fact, in a classroom environment, whether online or offline, students who receive reinforcement perform significantly better than their counterparts who do not receive external reinforcement. Moreover, not all situations and tasks are easy to test experimentally; therefore, one type of test should not be enough to discredit the argument for reinforcement and its cumulative contribution to learning (Seitz, Nanez, Holloway, Tsushima, & Watanabe, 2006).

The learner is usually referred to as an agent. Any other thing, person, factor, or condition that affects or causes learning to occur happens in the agent’s environment. Reinforcement of behavior can be visible or indirect, but it is present in the environment at all times. However, learners choose to exploit or explore when they interact with their environment, based on an accumulation of all their experiences with reinforcements. Thus, the agent will likely remain indifferent when the present stimulus is not high enough to prevent or cause an action. Many might think the indifference is a lack of learning, but it is an increase in the agent’s available knowledge of the stimulus. A counter-argument against total reinforcement over the life of a learner is that some learned actions and reactions are independent of the stimulus presented to the agent.

The empirical evidence using rats can be satisfying when looked at independently. However, learning is an interactive process. Agents may fail to act when presented with a stimulus not because they do not recognize the stimulus, but because there are no options for acting. Agents choose to reserve actions until the environment presents appropriate conditions for them. For example, even though rats associate a stimulus with food, they can still fail to act on it because they do not need the food at that moment or because they have engaged another stimulus that is stronger than the former. Therefore, it is important to interpret experience within the collective literature on reinforcement learning. Even in e-learning, practitioners have already recognized the futility of sticking to a single theory, be it behavioral, cognitive psychology, or constructivism. Instead, there is an emerging school of thought that concentrates on the cumulative principles and an understanding that it is important to look at a learner’s entire reinforcement history.

Implications

Researchers will have to incorporate diverse theories into their discussions of research findings to ensure their results and interpretations remain credible. Practitioners, on the other hand, will mostly resolve to increase their awareness of the concepts of learning before choosing particular systems or technologies for implementation. Those who still emphasize technologies because they worked in the past or elsewhere risk alienating learners and other practitioners whose environments, past reinforcements, and stimuli may act antagonistically to the chosen technology or system.

References

Anderson, T., & Elloumi, F. (Eds.). (2004). Theory and practice of online learning. Athabasca, Canada: Athabasca University.

Dayan, P., & Daw, N. D. (2008). Connections between computational and neurobiological perspectives on decision making. Cognitive, Affective & Behavioral Neuroscience, 8(4), 429-453.

Iordanova, M. D., Good, M. A., & Honey, R. C. (2008). Configural learning without reinforcement: Integrated memories for correlates of what, where, and when. The Quarterly Journal of Experimental Psychology, 61(12), 1785-1792.

Jones, D. J., Forehand, R., Cuellar, J., Kincaid, C., Parent, J., Fenton, N., & Goodrum, N. (2013). Harnessing innovative technologies to advance children’s mental health: Behavioral parent training as an example. Clinical Psychology Review, 33(2), 241-252.

Ringbauer, S., & Neumann, H. (2011). Perceptual learning without awareness: A motion pattern gated reinforcement learner. Journal of Vision, 11(11), 977-977.

Seitz, A. R., Nanez, J. E., Holloway, S., Tsushima, Y., & Watanabe, T. (2006). Two cases requiring external reinforcement in perceptual learning. Journal of Vision, 6, 966-973.

Shteingart, H., & Loewenstein, Y. (2014). Reinforcement learning and human behavior. Current Opinion in Neurobiology, 25, 93-98.

Tittle, C. R., Antonaccio, O., & Botchkovar, E. (2012). Social learning, reinforcement and crime: Evidence from three European cities. Social Forces, 90(3), 863-890.


The Research Scientist Pod

The History of Reinforcement Learning

by Suf | Data Science, Machine Learning, Research

Reinforcement learning (RL) is an exciting and rapidly developing area of machine learning that significantly impacts the future of technology and our everyday lives. RL is distinct from supervised and unsupervised learning: it focuses on solving problems through sequences of decisions, optimized by maximizing the rewards accrued from making correct decisions. RL originates from animal learning in experimental psychology and from optimal control theory, whilst also drawing from and contributing to neuroscience. This article will give a brief overview of the history of RL, from its origins to the modern day. It will not cover the technicalities of RL in depth, which we will cover in a separate article. Through this article, you will gain an insight into the motivation for development in RL, the different philosophies that drive the field, and the fascinating discoveries that have been made and that lie ahead.

Table of contents

  • Origins in Animal Learning
  • Turing’s Unorganised Machines
  • Origins in Optimal Control
  • What Is the Difference Between Reinforcement Learning and Optimal Control?
  • Learning Automata
  • Hedonistic Neurons
  • Overlap of Neurobiology and Reinforcement Learning
  • Temporal Difference
  • Deep Reinforcement Learning and Deep Q-Learning
  • Google DeepMind and Video Games
  • Modern Developments
  • Concluding Remarks


Origins in Animal Learning

The origin of RL is two-pronged, namely, animal learning and optimal control. Starting with animal learning, Edward Thorndike described the essence of trial-and-error learning with the “Law of Effect” in 1911, which, to paraphrase, states that an animal will pursue the repetition of actions if they reinforce satisfaction and will be deterred from actions that produce discomfort. Furthermore, the greater the level of pleasure or pain, the greater the pursuit of or deterrence from the action. The Law of Effect describes the effect of reinforcing behaviour with positive stimuli and is widely regarded as the basic defining principle of later descriptions of behaviour. The Law of Effect combines selectional and associative learning: selectional learning involves trying alternatives and selecting among them based on their outcomes, while associative learning means that the alternatives found by selection are associated with the particular situations in which they were chosen. The term reinforcement was formally used in the context of animal learning in 1927 by Pavlov, who described reinforcement as the strengthening of a pattern of behaviour due to an animal receiving a stimulus – a reinforcer – in a time-dependent relationship with another stimulus or with a response.

Turing’s Unorganised Machines

In 1948, Alan Turing presented a visionary survey of the prospect of constructing machines capable of intelligent behaviour in a report called “Intelligent Machinery”. Turing may have been the first to suggest using randomly connected networks of neuron-like nodes to perform computation, and he proposed the construction of large, brain-like networks of such neurons capable of being trained as one would teach a child. Turing called his networks “unorganised machines”.

Turing described three types of unorganised machines. A-type and B-type unorganised machines consist of randomly connected two-state neurons. The P-type unorganised machines, which are not neuron-like, have “only two interfering inputs, one for pleasure or reward and the other for pain or punishment”. Turing studied P-type machines to try to discover training procedures analogous to how children learn. He stated that by applying “appropriate interference, mimicking education”, a B-type machine can be trained to “do any required job, given sufficient time and provided the number of units is sufficient”.

Trial-and-error learning led to the production of many electro-mechanical machines. In 1933, Thomas Ross built a machine that could find its way through a simple maze and remember the path through the configuration of its switches. In 1952, Claude Shannon demonstrated a maze-running mouse named Theseus that used trial and error to navigate a maze; the maze itself remembered the successive directions using magnets and relays under its floor. In 1954, Marvin Minsky discussed computational reinforcement learning methods and described his construction of an analogue machine composed of components he named SNARCs (Stochastic Neural-Analog Reinforcement Calculators). SNARCs were intended to resemble modifiable synaptic connections in the brain. In 1961, Minsky addressed the credit assignment problem, which asks how to distribute credit for success among the many decisions that may have contributed to it.

Research in computational trial-and-error processes eventually generalized to pattern recognition before being absorbed into supervised learning, where error information is used to update neuron connection weights. Investigation into RL faded throughout the 1960s and 1970s. However, starting in 1963, John Andreae carried out pioneering, although relatively unknown, research, including the STELLA system, which learns through interaction with its environment, machines with an “internal monologue”, and later machines that can learn from a teacher.

Origins in Optimal Control

Optimal control research began in the 1950s as a formal framework for defining optimization methods to derive control policies in continuous-time control problems, as shown by Pontryagin and Neustadt in 1962. Richard Bellman developed dynamic programming as both a mathematical optimization and computer programming method to solve control problems. The approach defines a functional equation using the dynamic system’s state and returns what is referred to as the optimal value function; this functional equation is commonly referred to as the Bellman equation. Bellman also introduced the Markovian Decision Process (MDP), a discrete stochastic version of the optimal control problem, and in 1960 Ronald Howard devised the policy iteration method for MDPs. All of these are essential elements underpinning the theory and algorithms of modern reinforcement learning.
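To make the Bellman backup concrete, here is a minimal value-iteration sketch on a toy two-state MDP. The states, actions, transition probabilities, rewards, and discount factor are all invented for illustration and are not drawn from any of the historical systems described above.

```python
# Minimal value-iteration sketch on a hypothetical two-state MDP.
# Every name and number here is invented purely to illustrate the
# Bellman update V(s) = max_a sum_s' P(s'|s,a) * (r + gamma * V(s')).

GAMMA = 0.9  # discount factor

# P[state][action] -> list of (probability, next_state, reward)
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}

V = {s: 0.0 for s in P}          # initial value estimates

for _ in range(100):             # successive approximation (Bellman backups)
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in P.items()
    }

print(V)  # approaches the optimal value function V*
```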

What Is the Difference Between Reinforcement Learning and Optimal Control?

A common question is what the difference is between optimal control and reinforcement learning. The modern understanding treats the work in optimal control as related work in reinforcement learning. Reinforcement learning problems are closely associated with optimal control problems, particularly stochastic ones such as those formulated as MDPs. Solution methods of optimal control, such as dynamic programming, are also considered reinforcement learning methods: they gradually reach the correct answer through successive approximations. Reinforcement learning can be thought of as generalizing or extending ideas from optimal control to non-traditional control problems.

Learning Automata

In the early 1960s, research in learning automata commenced and can be traced back to Michael Lvovitch Tsetlin in the Soviet Union. A learning automaton is an adaptive decision-making unit situated in a random environment that learns the optimal action through repeated interactions with its environment. Actions are chosen according to a specific probability distribution, which is updated based on the response from the environment. Learning automata are considered policy iterators in RL. Tsetlin devised the Tsetlin Automaton, which is regarded as an even more fundamental and versatile learning mechanism than the artificial neuron. The Tsetlin Automaton is one of the pioneering solutions to the well-known multi-armed bandit problem; it continues to be used for pattern classification and has formed the core of more advanced learning automata designs, including decentralized control, equi-partitioning, and faulty dichotomous search.
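As a rough illustration of the mechanism (a sketch, not Tsetlin's original formulation), the code below implements a two-action Tsetlin automaton on a hypothetical two-armed bandit. The memory depth and reward probabilities are assumptions: rewards push the automaton deeper into its current action's memory, while penalties push it toward the boundary and eventually make it switch actions.

```python
import random

# Sketch of a two-action Tsetlin automaton with N memory states per action.
# The bandit's reward probabilities below are invented for illustration.

N = 6                      # memory depth per action
state = N                  # states 1..N favour action 0, N+1..2N favour action 1
reward_prob = [0.3, 0.7]   # hypothetical two-armed bandit

def choose_action(s):
    return 0 if s <= N else 1

for t in range(10_000):
    a = choose_action(state)
    rewarded = random.random() < reward_prob[a]
    if rewarded:                       # reinforce: move away from the boundary
        if a == 0 and state > 1:
            state -= 1
        elif a == 1 and state < 2 * N:
            state += 1
    else:                              # penalise: move toward / across the boundary
        state += 1 if a == 0 else -1

print("preferred action:", choose_action(state))  # typically the 0.7 arm
```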

Hedonistic Neurons

In the late 1970s and early 1980s, Harry Klopf was dissatisfied with the focus on equilibrium-seeking processes, such as homeostasis and the error-correction learning commonly associated with supervised learning, for explaining natural intelligence and providing a basis for machine intelligence. He argued that systems that try to maximize a quantity are qualitatively different from equilibrium-seeking systems, and that maximizing systems are integral to understanding crucial aspects of natural intelligence and to building artificial intelligence. Klopf hypothesized that neurons are individually hedonistic in that they work to maximize a neuron-local analogue of pleasure while minimizing a neuron-local analogue of pain.

Klopf’s idea of hedonistic neurons was that neurons implement a neuron-local version of the law of effect. He hypothesized that the synaptic weights of neurons change with experience. When a neuron fires an action potential, all the synapses that contributed to the action potential change their efficacies: if the action potential is rewarded, the effectiveness of all eligible synapses increases (or decreases if punished). Synapses therefore change in a way that alters the neuron’s firing patterns to increase the neuron’s probability of being rewarded and reduce its likelihood of being penalized by its environment.

This hypothesis produces a significant distinction between supervised learning, which is essentially an equilibrium-seeking process, and reinforcement learning, which is effectively an evaluation-driven system in which the learner’s decisions evolve in response to its experiences. Both error correction and RL are optimization processes, but error correction is more restricted, whereas RL is more general and is motivated by maximizing rewards through action optimization.

Overlap of Neurobiology and Reinforcement Learning

Neurobiology has explored the different forms of learning, namely unsupervised, supervised, and reinforcement learning, within the cortex-cerebellum-basal ganglia system. Disentangling these learning processes and assigning their implementation to distinct brain areas has been a fundamental challenge for research in the neurosciences. Data presented by Houk and Wise in 1995 and Schultz in 1997 indicate that the neuromodulator dopamine provides basal ganglia target structures with phasic signals that convey a reward prediction error that can influence reinforcement learning processes. It is, therefore, possible to characterize the functionality of the basal ganglia as an abstract search through the space of possible actions guided by dopaminergic feedback.

Recent research has explored how the three brain areas form a highly integrated system combining the different learning mechanisms into a super-learning process allowing for learning of flexible motor behaviour. Super-learning refers to the other learning mechanisms acting in synergy as opposed to in isolation.

Temporal Difference

Temporal difference (TD) learning is inspired by mathematical differentiation and aims to build accurate reward predictions from delayed rewards. TD tries to predict the combination of immediate reward and its reward prediction at the next time step. When the next time step arrives, the latest prediction is compared against what it was expected to be with new information. If there is a difference, the algorithm calculates the error, which is the “temporal difference” to adjust the old prediction towards the latest forecast. The algorithm aims to bring the old and new predictions closer together at every time step, ensuring the entire chain of predictions incrementally becomes more accurate.
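As a rough illustration (not code from any of the works cited here), the sketch below runs the TD update on the classic five-state random-walk prediction task; the step size, discount, and episode count are arbitrary assumptions. Each prediction is nudged toward the immediate reward plus the next state's prediction, exactly the "temporal difference" described above.

```python
import random

# TD(0) sketch: adjust each prediction toward "immediate reward plus the
# next step's prediction". The 5-state random walk and hyperparameters
# are invented for illustration.

ALPHA, GAMMA = 0.1, 1.0
N_STATES = 5                       # states 0..4, terminal beyond either end
V = [0.0] * N_STATES               # value predictions

for episode in range(5000):
    s = N_STATES // 2              # start in the middle of the chain
    while 0 <= s < N_STATES:
        s_next = s + random.choice([-1, 1])      # random walk
        r = 1.0 if s_next == N_STATES else 0.0   # reward only at the right end
        v_next = V[s_next] if 0 <= s_next < N_STATES else 0.0
        td_error = r + GAMMA * v_next - V[s]     # the "temporal difference"
        V[s] += ALPHA * td_error                 # move old prediction toward target
        s = s_next

print([round(v, 2) for v in V])    # approaches the true probabilities 1/6 .. 5/6
```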

TD learning is most closely associated with Sutton, whose 1984 PhD dissertation addressed TD learning and whose 1988 paper, in which the term Temporal Difference was first used, has become the definitive reference.

The origins of temporal difference methods are motivated strongly by theories of animal learning, particularly the concept of secondary reinforcers. A secondary reinforcer is a stimulus that has been paired with a primary reinforcer, for example the presence of food, and that comes to adopt similar reinforcing properties. The temporal difference method became entwined with the trial-and-error method when Klopf, in 1972 and 1975, explored reinforcement learning in large systems decomposed into individual sub-components of a more extensive procedure, each with excitatory inputs as rewards and inhibitory inputs as punishments, and each able to reinforce the others.

Incorporating animal learning theory with methods of learning driven by changes in temporally successive predictions, including the temporal credit assignment problem, led to an explosion in reinforcement learning research. A notable example is the development of the ‘Actor-Critic Architecture’, applied to the pole-balancing problem by Barto et al. in 1983. Actor-critic methods are TD methods with a separate memory structure to explicitly represent the policy independently of the value function. The policy structure, which is used to select actions, is known as the actor, and the estimated value function, which criticizes the actions made by the actor, is known as the critic. The critique takes the form of a TD error, which is the sole output of the critic and drives all learning in both actor and critic. In 1984 and 1986, the Actor-Critic architecture was extended to integrate with backpropagation neural network techniques.
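A minimal tabular actor-critic sketch, on an invented one-state, two-action problem, can make the division of labour concrete: the critic's TD error is the single signal that updates both the critic's value estimate and the actor's action preferences. The reward probabilities and step sizes are assumptions for illustration only.

```python
import math, random

# Tabular actor-critic sketch on a one-state, two-action problem.
# The critic's TD error is the only learning signal for both parts.
# Reward probabilities and step sizes are invented.

ALPHA_ACTOR, ALPHA_CRITIC = 0.05, 0.1
reward_prob = [0.2, 0.8]        # hypothetical environment
prefs = [0.0, 0.0]              # actor: action preferences (policy parameters)
V = 0.0                         # critic: value estimate of the single state

def softmax(p):
    e = [math.exp(x) for x in p]
    total = sum(e)
    return [x / total for x in e]

for t in range(20_000):
    probs = softmax(prefs)
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < reward_prob[a] else 0.0
    td_error = r - V                         # critic's critique of the action
    V += ALPHA_CRITIC * td_error             # update the critic
    for i in range(2):                       # update the actor (policy gradient)
        grad = (1.0 if i == a else 0.0) - probs[i]
        prefs[i] += ALPHA_ACTOR * td_error * grad

print(softmax(prefs))  # policy concentrates on the higher-reward action
```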

In 1992, Gerry Tesauro developed a programme that required little backgammon knowledge yet learned to play the game at the grandmaster level. The learning algorithm combined the TD-lambda algorithm and a non-linear function approximation using a multilayer neural network trained by backpropagating TD errors. Based on TD-Gammon’s success and further analysis, the best human players now play the unconventional opening positions learned by the algorithm.

Chris Watkins introduced Q-learning in 1989 in his PhD thesis “Learning from Delayed Rewards”, which introduced a model of reinforcement learning as incrementally optimizing control of a Markovian Decision Process and proposed Q-learning as a way to learn optimal control directly without modelling the transition probabilities or expected rewards of the Markovian Decision Process. Watkins and Peter Dayan presented a convergence proof in 1992. A Q-value function shows us how good a specific action is, given a state for an agent following a policy. Q-learning is the process of iteratively updating Q-values for each state-action pair using the Bellman Equation until the Q-function eventually converges to Q*. Q-learning is a model-free reinforcement learning algorithm and can handle stochastic transitions and rewards without adaptations.
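The snippet below is a sketch of the tabular algorithm (not Watkins's original formulation or code): it applies the update Q(s,a) ← Q(s,a) + α[r + γ·max Q(s′,·) − Q(s,a)] with epsilon-greedy exploration in an invented corridor environment with a goal at one end.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch: repeatedly apply the Bellman-style update
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# The small corridor environment and hyperparameters are invented.

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
N = 6                                  # corridor states 0..5, goal at state 5
ACTIONS = [-1, +1]                     # step left or step right
Q = defaultdict(lambda: [0.0, 0.0])

for episode in range(2000):
    s = 0
    while s != N - 1:
        if random.random() < EPSILON or Q[s][0] == Q[s][1]:
            a = random.randrange(2)                    # explore / break ties
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1          # exploit
        s_next = min(max(s + ACTIONS[a], 0), N - 1)
        r = 1.0 if s_next == N - 1 else 0.0
        target = r + GAMMA * max(Q[s_next])            # best value from next state
        Q[s][a] += ALPHA * (target - Q[s][a])          # move Q toward the target
        s = s_next

print({s: [round(q, 3) for q in Q[s]] for s in range(N)})
```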

These innovations stimulated significant subsequent research in reinforcement learning.

Deep Reinforcement Learning and Deep Q-Learning

Alongside the rising interest in neural networks beginning in the mid-1980s, interest grew in deep reinforcement learning, where a neural network represents policies or value functions. TD-Gammon was the first successful application of reinforcement learning with neural networks. Around 2012, a deep learning revolution was spurred by fast implementations of convolutional neural networks on graphics processing units for computer vision, which led to increased interest in using deep neural networks as function approximators across various domains. Neural networks are particularly useful for replacing value iteration algorithms that directly update Q-value tables as the agent learns. Value iteration is suitable for tasks with a small state space, but in more complex environments the amount of computation and time needed to traverse new states and modify Q-values becomes prohibitive or infeasible. Instead of computing Q-values directly through value iteration, a function approximator can estimate the optimal Q-function. A neural network receives states from an environment as input and outputs estimated Q-values for each action an agent can choose in those states. The values are compared to target values to calculate the loss, and the network’s weights are updated using backpropagation and stochastic gradient descent to produce Q-values that minimize the error and converge on the agent’s optimal actions.
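The sketch below shows only the core of that idea. A tiny linear "network" stands in for a deep model, a random batch stands in for real transitions, and one gradient step pulls the predicted Q-values toward targets of the form r + γ·max Q(s′, ·). All dimensions, data, and the learning rate are assumptions; this is not a faithful DQN implementation.

```python
import numpy as np

# Sketch of the loss used when a function approximator replaces the Q-table:
# predicted Q-values for the actions taken are pulled toward the targets
# r + gamma * max_a' Q(s', a'). The tiny linear "network" and random batch
# below are stand-ins invented for illustration.

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, GAMMA, LR = 4, 2, 0.99, 1e-2
W = rng.normal(scale=0.1, size=(STATE_DIM, N_ACTIONS))   # Q(s, .) = s @ W

def q_values(states):
    return states @ W

# A hypothetical batch of transitions (s, a, r, s', done)
states = rng.normal(size=(32, STATE_DIM))
actions = rng.integers(N_ACTIONS, size=32)
rewards = rng.normal(size=32)
next_states = rng.normal(size=(32, STATE_DIM))
dones = rng.integers(2, size=32)

targets = rewards + GAMMA * (1 - dones) * q_values(next_states).max(axis=1)
predicted = q_values(states)[np.arange(32), actions]
td_errors = targets - predicted
loss = np.mean(td_errors ** 2)                           # squared TD error

# One gradient step on the linear model (the "backpropagation" step here)
grad_W = np.zeros_like(W)
for i in range(32):
    grad_W[:, actions[i]] += -2 * td_errors[i] * states[i] / 32
W -= LR * grad_W
print("loss:", loss)
```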

Google DeepMind and Video Games

Around 2013, DeepMind developed deep Q-learning, a combination of a convolutional neural network architecture and Q-learning. Deep Q-learning uses experience replay, which stores past transitions and replays them in small random batches so that training is not skewed by correlated, consecutive experience and learning is sped up. DeepMind tested the system on video games such as Space Invaders and Breakout. Without altering the code, the network learns how to play the game and, after several iterations, surpasses human performance. DeepMind published further research on the system surpassing human abilities in other games such as Seaquest and Q*Bert.
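A minimal replay-buffer sketch (with assumed names and sizes) shows the two operations experience replay needs: push a transition into a bounded store, and sample a random minibatch so training does not see only consecutive, highly correlated frames.

```python
import random
from collections import deque

# Sketch of experience replay: transitions are stored in a bounded buffer and
# training draws random minibatches, so consecutive (highly correlated) frames
# are not learned from back-to-back. All sizes here are arbitrary assumptions.

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # columns: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)

# Usage sketch: fill with dummy transitions, then sample one training batch.
buf = ReplayBuffer()
for t in range(1000):
    buf.push(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)
if len(buf) >= 32:
    states, actions, rewards, next_states, dones = buf.sample()
    print(len(states), "transitions sampled for one training step")
```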

In 2014, DeepMind published research on a computer program able to play Go. In October 2015, a computer Go program called AlphaGo beat the European Go champion, Fan Hui. This was the first time artificial intelligence had defeated a professional Go player. In March 2016, AlphaGo beat Lee Sedol, one of the highest-ranked players in the world, with a score of 4-1 in a five-game match. At the 2017 Future of Go Summit, AlphaGo won a three-game match against Ke Jie, who had been the world’s number-one ranked player for two years. Later that year, an improved version, AlphaGo Zero, defeated AlphaGo 100 games to 0. The new version surpassed its predecessor after three days of training and with less processing power than AlphaGo, which by comparison took months to learn how to play.

In Google’s work with AlphaZero in 2017, the system was able to play chess at a superhuman level within four hours of training, using 5,000 first-generation tensor processing units and 64 second-generation tensor processing units.

Modern Developments

The research community is still in the early stages of thoroughly understanding how applicable deep reinforcement learning is to other domains. AlphaFold, developed by DeepMind, applies artificial intelligence to protein folding, one of the most important goals pursued by computational biology and essential for medicinal applications such as drug design and for biotechnology such as novel enzyme design. Deep reinforcement learning has shown extreme proficiency in solving problems within constrained environments. Potential real-life applications include robotics, the processing of structured medical images, and self-driving cars.

Developments have also been made in making deep reinforcement learning more efficient. Google Brain proposed Adaptive Behavior Policy Sharing, an optimization strategy that allows for selective information sharing across a pool of agents. In 2020, DeepMind published research exploring the Never Give Up strategy, which uses k-nearest neighbours over the agent’s recent experience to train directed exploratory policies to solve complex exploration games.

Concluding Remarks

Reinforcement learning has an extensive history with a fascinating cross-pollination of ideas, generating research that sent waves through behavioural science, cognitive neuroscience, machine learning, optimal control, and other fields. The field has evolved rapidly since its inception in the 1950s, when the theory and concepts were fleshed out, to the application of that theory through neural networks, leading to the conquering of electronic video games and the advanced board games backgammon, chess, and Go. The fantastic exploits in gaming have given researchers valuable insights into the applicability and limitations of deep reinforcement learning. Achieving the most acclaimed performance with deep reinforcement learning can be computationally prohibitive. New approaches are being explored, such as multi-environment training and leveraging language modelling to extract high-level abstractions and learn more efficiently. The question of whether deep reinforcement learning is a step toward artificial general intelligence remains open, given that reinforcement learning works best in constrained environments; generalization is the most significant hurdle to overcome. However, artificial general intelligence need not be the ultimate pursuit of this strand of research. Reinforcement learning will continue to augment modern society through robotics, medicine, business, and industry. As computing resources become more available, the barrier to entry for innovation in reinforcement learning will lower, meaning research will not be limited to behemoth tech companies like Google. Reinforcement learning appears to have a long and bright future and will continue to be an area of exciting research in artificial intelligence.

Thank you for reading to the end of this article. I hope that through reading it you can see how complex and multi-faceted the origins and developments of RL are, and how RL as a field of research receives contributions from neighbouring areas and reciprocates insight and innovation in kind. Please explore the links provided throughout to dig further into the rich history of RL. If you are excited to learn more about reinforcement learning, stay tuned for a more technical exploration of its fundamentals. Meanwhile, if you are interested in a more comprehensive history of machine learning and artificial intelligence, please see the article on my site titled “The History of Machine Learning”. For more information on what it takes to be a research scientist and the differences between that role and the roles of a data scientist or data engineer, please see my article titled “Key Differences Between Data Scientist, Research Scientist, and Machine Learning Engineer Roles”.

See you soon, and happy researching!


Proc Natl Acad Sci U S A, 108(Suppl 3); September 13, 2011


Colloquium Paper

Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

Author contributions: P.W.G. wrote the paper.

A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

The theory and data available today indicate that the phasic activity of midbrain dopamine neurons encodes a reward prediction error used to guide learning throughout the frontal cortex and the basal ganglia. Activity in these dopaminergic neurons is now believed to signal that a subject's estimate of the value of current and future events is in error and indicate the magnitude of this error. This is a kind of combined signal that most scholars active in dopamine studies believe adjusts synaptic strengths in a quantitative manner until the subject's estimate of the value of current and future events is accurately encoded in the frontal cortex and basal ganglia. Although some confusion remains within the larger neuroscience community, very little data exist that are incompatible with this hypothesis. This review provides a brief overview of the explanatory synergy between behavioral, anatomical, physiological, and biophysical data that has been forged by recent computational advances. For a more detailed treatment of this hypothesis, refer to Niv and Montague (1) or Dayan and Abbott (2).

Features of Midbrain Dopamine Neurons

Three groups of dopamine secreting neurons send axons along long-distance trajectories that influence brain activity in many areas ( 3 ): the A8 and A10 groups of the ventral tegmental area (VTA) and the A9 group of the substantia nigra pars compacta (SNc). Two remarkable features of these neurons noted at the time of their discovery were their very large cell bodies and very long and complicated axonal arbors that include terminals specialized to release transmitter into the extracellular space, en passant synapses, through which dopamine achieves an extremely broad anatomical distribution ( 4 ). As Cajal ( 5 ) first pointed out, the length and complexity of axonal arbors are often tightly correlated with cell body size; large cell bodies are required to support large terminal fields, and dopaminergic cell bodies are about as large as they can be. The midbrain dopaminergic system, thus, achieves the largest possible distribution of its signal with the minimal possible number of neurons.

The A9 cluster connects to the caudate and putamen, and the A8 and A10 axons make contact with the ventral striatum and the fronto-cortical regions beyond ( 6 , 7 ). There does, however, seem to be some intermixing of the three cell groups ( 8 – 10 ). Classic studies of these cells under conditions ranging from slice preparations to awake behaving primates, however, stress homogeneity in response patterns across these groups. Although knowing that one is actually recording from a dopamine neuron may be difficult in chronic studies ( 11 ), all cells that look like dopamine neurons in the core of the VTA and SNc seem to respond in the same way. Even the structure of the axons of these neurons supports the notion that activity is homogenous across this population of giant cells. Axons of adjacent neurons are electrically coupled to one another in this system ( 12 , 13 ). Modeling studies suggest that this coupling makes it more difficult for individual neurons to fire alone, enforcing highly synchronous and thus, tightly correlated firing across the population ( 14 ).

A final note is that these neurons generate atypically long-duration action potentials, as long as 2–3 ms. This is relevant, because it places a very low limit on the maximal firing rates that these neurons can produce ( 15 ).

What emerges from these many studies is the idea that the dopamine neurons are structurally well-suited to serve as a specialized low-bandwidth channel for broadcasting the same information to large territories in the basal ganglia and frontal cortex. The large size of the cell bodies, the fact that the cells are electrically coupled, and the fact that they fire at low rates and distribute dopamine homogenously throughout a huge innervation territory—all these unusual things mean that they cannot say much to the rest of the brain but what they say must be widely heard. It should also be noted, however, that specializations at the site of release may well serve to filter this common message in ways that tailor it for different classes of recipients. Zhang et al. ( 16 ) have recently shown differences between the time courses of dopamine levels in the dorsal and ventral striata that likely reflect functional specializations for release and reuptake between these areas.

Dopaminergic Targets: Frontal Cortex and Basal Ganglia

It is also important to recognize that the dopamine neurons lie embedded in a large and well-described circuit. At the level of the cortex, the dopamine neurons send whatever signal they carry throughout territories anterior to the central sulcus and send little or no information to parietal, temporal, and occipital cortices ( 6 ). The outputs of the dopamine-innervated frontal cortices, however, also share another commonality; many of the major long-distance outputs of the frontal cortex pass in a topographic manner to the two main input nuclei of the basal ganglia complex, the caudate and the putamen ( 17 ). Both areas also receive dense innervation from the midbrain dopaminergic neurons.

Structurally, the caudate and putamen (and the ventral-most parts of the putamen, known as the ventral striatum) are largely a single nucleus separated during development by the incursion of the fibers of the corona radiata ( 2 , 6 , 18 ) that project principally to two output nuclei, the globus pallidus and the substantia nigra pars reticulata. These nuclei then, in turn, provide two basic outputs. The first and largest of these outputs returns information to the frontal cortex through a thalamic relay. Interestingly, this relay back to the cortex maintains a powerful topographic sorting ( 19 , 20 ). The medial and posterior parts of the cortex that are concerned with planning skeletomuscular movements send their outputs to a specific subarea of the putamen, which sends signals back to this same area of the cortex through the globus pallidus and the ventrolateral thalamus. Together, these connections form a set of long feedback loops that seems to be serially interconnected ( 9 ) and ultimately, generates behavioral output through the skeletomuscular and eye-movement control pathways of the massive frontal-basal ganglia system.

The second principal class of output from the basal ganglia targets the midbrain dopamine neurons themselves and also forms a feedback loop. These outputs pass to the dendrites of the dopamine cells, where they combine with inputs from the brainstem that likely carry signals about rewards being currently consumed ( 21 ). In this way, the broadly distributed dopamine signals sent to the cortex and the basal ganglia likely reflect some combination of outputs from the cortex and places such as the tongue. The combined signals are then, of course, broadcast by the dopamine neurons throughout the basal ganglia and the frontal cortex.

Theory of Reinforcement Learning

From Pavlov to Rescorla and Wagner.

Understanding the functional role of dopamine neurons, however, requires more than a knowledge of brain circuitry; it also requires an understanding of the classes of computational algorithms in which dopamine neurons seem to participate. Pavlov ( 22 ) observed, in his famous experiment on the salivating dog, that if one rings a bell and follows that bell with food, dogs become conditioned to salivate after the bell is rung. This process, where an unconditioned response comes to be elicited by a conditioned stimulus, is one of the core empirical regularities around which psychological theories of learning have been built. Pavlov ( 22 ) hypothesized that this behavioral regularity emerges because a preexisting anatomical connection between the sight of food and activation of the salivary glands comes to be connected to bell-detecting neurons by experience.

This very general idea was first mathematically formalized when Bush and Mosteller ( 23 , 24 ) proposed that the probability of Pavlov's ( 22 ) dog expressing the salivary response on sequential trials could be computed through an iterative equation where ( Eq. 1 )

A_{\text{next trial}} = A_{\text{previous trial}} + \alpha \left( R_{\text{current trial}} - A_{\text{previous trial}} \right)

In this equation, A next_trial is the probability that the salivation will occur on the next trial (or more formally, the associative strength of the connection between the bell and salivation). To compute A next_trial , one begins with the value of A on the previous trial and adds to it a correction based on the animal's experience during the most recent trial. This correction, or error term, is the difference between what the animal actually experienced (in this case, the reward of the meat powder expressed as R current_trial ) and what he expected (simply, what A was on the previous trial). The difference between what was obtained and what was expected is multiplied by α, a number ranging from 0 to 1, which is known as the learning rate. When α = 1, A is always immediately updated so that it equals R from the last trial. When α = 0.5, only one-half of the error is corrected, and the value of A converges in half steps to R . When the value of α is small, around 0.1, then A is only very slowly incremented to the value of R .

What the Bush and Mosteller ( 23 , 24 ) equation does is compute an average of previous rewards across previous trials. In this average, the most recent rewards have the greatest impact, whereas rewards far in the past have only a weak impact. If, to take a concrete example, α = 0.5, then the equation takes the most recent reward, uses it to compute the error term, and multiplies that term by 0.5. One-half of the new value of A is, thus, constructed from this most recent observation. That means that the sum of all previous error terms (those from all trials in the past) has to count for the other one-half of the estimate. If one looks at that older one-half of the estimate, one-half of that one-half comes from what was observed one trial ago (thus, 0.25 of the total estimate) and one-half (0.25 of the estimate) comes from the sum of all trials before that one. The iterative equation reflects a weighted sum of previous rewards. When the learning rate (α) is 0.5, the weighting rule effectively being carried out is ( Eq. 2 )

$$A_{\text{next trial}} = 0.5\,R_{\text{current trial}} + 0.25\,R_{\text{previous trial}} + 0.125\,R_{\text{2 trials back}} + 0.0625\,R_{\text{3 trials back}} + \cdots,$$

an exponential series, the rate at which the weight declines being controlled by α.

When α is high, the exponential function declines rapidly and puts all of the weight on the most recent experiences of the animal. When α is low, it declines slowly and averages together many observations, which is shown in Fig. 1 .

Fig. 1. “Weights determining the effects of previous rewards on current associative strength effectively decline as an exponential function of time” ( 65 ). [Reproduced with permission from Oxford University Press from ref. 65 (Copyright 2010, Paul W. Glimcher).]

The Bush and Mosteller ( 23 , 24 ) equation was critically important, because it was the first use of this kind of iterative error-based rule for reinforcement learning; additionally, it forms the basis of all modern approaches to this problem. This is a fact often obscured by what is known as the Rescorla–Wagner model of classical conditioning ( 25 ). The Rescorla–Wagner model was an important extension of the Bush and Mosteller approach ( 23 , 24 ) to the study of what happens to associative strength when two cues predict the same event. Their findings were so influential that the basic Bush and Mosteller rule is now often mistakenly attributed to Rescorla and Wagner by neurobiologists.

Learning Value Instead of Associations.

The next important point to make is that these early psychological theories were about associative strengths between classically conditioned stimuli and conditioned automatic responses. These models were about learning but not about concepts like value and choice that figure prominently in modern discussions of the function of dopamine. In the worlds of dynamic programming, computer science, and economics, however, these basic equations were easily extended to include a more explicit notion of value. Consider an animal trying to learn the value of pressing a lever that yields four pellets of food with a probability of 0.5. Returning to the Bush and Mosteller ( 23 , 24 ) (or Rescorla–Wagner) ( 25 ) equation ( Eq. 3 ),

$$A_{\text{next trial}} = A_{\text{previous trial}} + \alpha\left(R_{\text{current trial}} - A_{\text{previous trial}}\right)$$

Because, in one-half of all trials, the animal is rewarded and in one-half, he is not and because all rewards have a value of four, we know exactly what this equation will do. If α has a value of one, A will bounce up and down between zero and four; if α is infinitely small, A will converge to two. That is striking, because two is the long-run average, or expected, value of pressing the lever. Therefore, in an environment that does not change, when α is small, this equation converges to the expected value of the action ( Eq. 4 ):

$$A \rightarrow \text{Expected Value} = (0.5 \times 4) + (0.5 \times 0) = 2$$

Today, the Bush and Mosteller ( 23 , 24 ) equation forms the core of how most people think about learning values. The equation provides a way for us to learn expected values. If we face a stable environment and have lots of time, we can even show that this equation is guaranteed to converge to expected value ( 26 ).
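
A short simulation, included here only as an illustration, makes the convergence claim concrete for the lever example: four pellets with probability 0.5 and a small learning rate produce an estimate that settles near two. The seed, the number of trials, and the learning rate are arbitrary choices.

```python
import random

random.seed(0)
alpha = 0.05        # a small learning rate
a = 0.0             # initial value estimate for pressing the lever

for trial in range(5000):
    reward = 4.0 if random.random() < 0.5 else 0.0   # four pellets with probability 0.5
    a += alpha * (reward - a)                        # Bush-Mosteller / Rescorla-Wagner update

print(round(a, 2))  # typically lands close to 2.0, the expected value of the action
```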

Sutton and Barto: Temporal Difference Model.

The story of reinforcement learning described up to this point is a story largely from psychology and mostly focused on associative learning. That story changed abruptly in the 1980s and 1990s when computer scientists Sutton and Barto ( 26 ) began to think seriously about these preexisting theories and noticed two key problems with them:

  • i ) These theories all treated time as passing in unitary fixed epochs usually called trials. In Bush and Mosteller ( 23 , 24 ), trials pass one after another, and updates to the values of actions occur only between trials. In the real world, time is more continuous. Different events in a trial might mean different things or might indicate different things about value.
  • ii ) The second key problem was that these theories dealt in only a rudimentary way with how to link sequential cues (for example, a tone followed by a bell) with a later event of positive or negative value. The theories were good at learning that a tone or a lever predicted a reward but not so good at learning that a light that perfectly predicted the appearance of a lever meant that the later appearance of the lever told you nothing new.

To address these issues, Sutton and Barto ( 26 ) developed what has come to be known as temporal difference (TD) learning. That model has been presented in detail elsewhere ( 26 ). Here, we review the most important advances that they achieved that are critical for understanding dopamine.

Sutton and Barto ( 26 ) began by arguing that, in essence, the Bush and Mosteller ( 23 , 24 ) approach stated the problem that learning systems were trying to solve incorrectly. The Bush and Mosteller ( 23 , 24 ) equation learns the values of previous events. Sutton and Barto ( 26 ) argued that the goal of a learning system should instead be to predict the value of future events. Of course, predictions have to be based on previous experience, and therefore, these two ideas are closely related; however, TD learning was designed with a clear goal in mind: predict the value of the future.

That is an important distinction, because it changes how one has to think about the reward prediction error at the heart of these reinforcement learning models. In Bush and Mosteller ( 23 , 24 ) class models, reward prediction error is the difference between a weighted average of past rewards and the reward that has just been experienced. When those are the same, there is no error, and the system does not learn. Sutton and Barto ( 26 ), by contrast, argued that the reward prediction error term should be viewed as the difference between one's rational expectations of all future rewards and any information (be it an actual reward or a signal that a reward is coming up) that leads to a revision of expectations. If, for example, we predict that we will receive one reward every 1 min for the next 10 min and a visual cue indicates that, instead of these 10 rewards, we will receive one reward every 1 min for 11 min, then a prediction error exists when the visual cue arrives, not 11 min later when the final (and at that point, fully expected) reward actually arrives. This is a key difference between TD class and Bush and Mosteller ( 23 , 24 ) class models.

To accomplish the goal of building a theory that both could deal with a more continuous notion of time and could build a rational (or near-rational) expectation of future rewards, Sutton and Barto ( 26 ) switched away from simple trial-based representations of time to a representation of time as a series of discrete moments extending from now into the infinite future. They then imagined learning as a process that occurred not just at the end of each trial but at each of these discrete moments.

To understand how they did this, consider a simple version of TD learning in which each trial can be thought of as made up of 20 moments. What the TD model is attempting to accomplish is to build a prediction about the rewards that can be expected in each of those 20 moments. The sum of those predictions is our total expectation of reward. We can represent this 20-moment expectation as a set of 20 learned values, one for each of the 20 moments. This is the first critical difference between TD class and Bush and Mosteller ( 23 , 24 ) class models. The second difference lies in how these 20 predictions are generated. In TD, the prediction at each moment indicates not only the reward that is expected at that moment but also the sum of (discounted) rewards available in each of the subsequent moments.

To understand this critical point, consider the value estimate, V 1 , that is attached to the first moment in the 20-moment-long trial. That variable needs to encode the value of any rewards expected at that moment, the value of any reward expected at the next moment decremented by the discount factor, the value of the next moment further decremented by the discount factor, and so on. Formally, that value function at time tick number one is ( Eq. 5 )

$$V_{1} = r_{1} + \gamma r_{2} + \gamma^{2} r_{3} + \cdots + \gamma^{19} r_{20}$$

where γ, the discount parameter, captures the fact that each of us prefers (derives more utility from) sooner rather than later rewards; the size of γ depends on the individual and the environmental context. Because this is a reinforcement learning system, it also automatically takes probability into account as it builds these estimates of r at each time tick. This means that the r values shown here are really expected rewards or average rewards observed at that time tick. Two kinds of events can, thus, lead to a positive prediction error: the receipt of an unexpected reward or the receipt of information that allows one to predict a later (and previously unexpected) reward.

To make this important feature clear, consider a situation in which an animal sits for 20 moments, and at any unpredictable moment, a reward might be delivered with a probability of 0.01. Whenever a reward is delivered, it is almost completely unpredictable, which leads to a large prediction error at the moment that the reward is delivered. This necessarily leads to an increment in the value of that moment. On subsequent trials, however, it is usually the case that no reward is received (because the probability is so low), and thus, on subsequent trials, the value of that moment is repeatedly decremented. If learning rates are low, the result of this process of increment and decrement is that the value of that moment will fluctuate close to zero, and we will observe a large reward prediction error signal after each unpredicted reward. Of course this is, under these conditions, true of all of the 20 moments in this imaginary trial.

Next, consider what happens when we present a tone at any of the first 10 moments that is followed 10 moments later by a reward. The first time that this happens, the tone conveys no information about future reward, no reward is expected, and therefore, we have no prediction error to drive learning. At the time of the reward, in contrast, a prediction error occurs that drives learning in that moment. The goal of TD, however, is to reach a point at which the reward delivered 10 moments after the tone is unsurprising. The goal of the system is to produce no prediction error when the reward is delivered. Why is the later reward unsurprising? It is unsurprising because of the tone. Therefore, the goal of TD is to shift the prediction error from the reward to the tone.

TD accomplishes this goal by attributing each obtained reward not just to the value function for the current moment in time but also to a few of the preceding moments in time (exactly how many is a free parameter of the model). In this way, gradually over time, the unexpected increment in value associated with the reward effectively propagates backward in time to the tone. It stops there simply because there is nothing before the tone that predicts the future reward. If there had been a light fixed before that tone in time, then the prediction would have propagated backward to that earlier light. In exactly this way, TD uses patterns of stimuli and experienced rewards to build an expectation about future rewards.
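
The backward propagation described above can be sketched with a minimal tabular TD(0) simulation. The state representation, the parameter values, and the assumption that the pre-tone prediction is simply zero (because the tone arrives at an unpredictable time) are simplifications made for illustration; this is not the specific model used in the work discussed here.

```python
alpha, gamma = 0.1, 0.98
n = 10                      # moments between the tone (state 0) and the reward
V = [0.0] * (n + 1)         # V[n] is a terminal state whose value stays at 0

for trial in range(1, 3001):
    # The tone is unexpected, so the expectation jumps from 0 (pre-tone) to V[0]:
    rpe_at_tone = V[0] - 0.0
    for t in range(n):
        r = 1.0 if t == n - 1 else 0.0          # reward delivered at the last post-tone moment
        delta = r + gamma * V[t + 1] - V[t]     # moment-by-moment TD prediction error
        V[t] += alpha * delta
        if t == n - 1:
            rpe_at_reward = delta
    if trial in (1, 10, 100, 1000, 3000):
        print(f"trial {trial:4d}  RPE at tone {rpe_at_tone:5.2f}  RPE at reward {rpe_at_reward:5.2f}")
```

Over training, the printed prediction error shrinks at the time of the reward and grows at the time of the tone, which is the migration described in the text.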

Theory and Physiology of Dopamine

With a basic knowledge of both the anatomy of dopamine and the theory of reinforcement learning, consider the following classic experiment by Schultz et al. ( 27 ). A thirsty monkey is seated before two levers. The monkey has been trained to perform a simple instructed choice task. After the illumination of a centrally located start cue, the monkey will receive an apple juice reward if he reaches out and presses the left but not the right lever. While the animal is performing this task repeatedly, Schultz et al. ( 27 ) record the activity of midbrain dopamine neurons. Interestingly, during the early phases of this process, the monkeys behave somewhat erratically, and the neurons are silent when the start cue is presented but respond strongly whenever the monkey receives a juice reward. As the monkey continues to perform the task, however, both the behavior and the activity of the neurons change systematically. The monkey comes to focus all of his lever pressing on the lever that yields a reward, and as this happens, the response of the neurons to the juice reward dies out. This is shown in Fig. 2 .

Fig. 2. “Raster plot of dopamine neuron activity. Upper panel shows response of dopamine neuron to reward before and after training. Lower panel shows response of dopamine neuron to start cue after training” ( 26 ). [Reproduced with permission from ref. 26 (Copyright 1993, Society for Neuroscience).]

At the same time, however, the neurons begin to respond whenever the start cue is illuminated. When Schultz et al. ( 27 ) first observed these responses, they hypothesized that “dopamine neurons are involved with transient changes of impulse activity in basic attentional and motivational processes underlying learning and cognitive behavior” ( 27 ).

Shortly after this report had been published, Montague et al. ( 28 , 29 ) had begun to examine the activity of octopamine neurons in honey bees engaged in learning. They had hypothesized that the activity of these dopamine-related neurons in these insects encoded a reward prediction error of some kind ( 28 , 29 ). When they became aware of the results of Schultz et al. ( 27 ), they realized that it was not simply the reward prediction error (RPE) defined by Bush and Mosteller ( 23 , 24 ) class models, but it was exactly the RPE signal predicted by a TD class model. Recall that the TD model generates an RPE whenever the subject's expected reward changes. For a TD class model, this means that, after an unpredictable visual cue comes to predict a reward, it is the unexpected visual cue that tells you that the world is better than you expected. The key insight here is that the early burst of action potentials after the visual cue is what suggested to Montague et al. ( 28 , 29 ) that Schultz et al. ( 27 ) were looking at a TD class system.

Subsequently, these two groups collaborated ( 26 ) to examine the activity of primate midbrain dopamine neurons during a conditioning task of exactly the kind that Pavlov ( 22 ) had originally studied. In that experiment, thirsty monkeys sat quietly under one of two conditions. In the first condition, the monkeys received, at unpredictable times, a squirt of water into their mouths. They found that, under these conditions, the neurons responded with a burst of action potentials immediately after any unpredicted water was delivered. In the second condition, the same monkey sat while a visual stimulus was delivered followed by a squirt of water. The first time that this happened to the monkey, the neurons responded as before: they generated a burst of action potentials after the water delivery but were silent after the preceding visual stimulus. With repetition, however, two things happened. First, the magnitude of the response to the water declined until, after dozens of trials, the water came to evoke no response in the neurons. Second and with exactly the same time course, the dopamine neurons began responding to the visual stimulus. As the response to the reward itself diminished, the response to the visual stimulus grew. What they had observed were two classes of responses, one to the reward and one to the visual cue, but both were responses predicted by the TD models that Montague et al. ( 28 , 29 ) had been exploring.

Two Dopamine Responses and One Theory.

This is a point about which there has been much confusion, and therefore, we pause for a moment to clarify this important issue. Many scientists who are familiar only with Bush and Mosteller ( 23 , 24 ) class models (like the Rescorla–Wagner model) ( 25 ) have looked at these data (or others like them) and been struck by these two different responses—one at the reward delivery, which happens only early in the session, and a second at the visual cue, which happens only late in the session. The Bush and Mosteller ( 23 , 24 ) algorithm predicts only the responses synchronized to the reward itself, and therefore, these scholars often conclude that dopamine neurons are doing two different things, only one of which is predicted by theory. If, however, one considers the TD class of models (which was defined more than a decade before these neurons were studied), then this statement is erroneous. The insight of Sutton and Barto ( 31 ) in the early 1980s was that reinforcement learning systems should use the reward prediction error signal to drive learning whenever something changes expectations about upcoming rewards. After a monkey has learned that a tone indicates a reward is forthcoming, then hearing the tone at an unexpected time is as much a positive reward prediction error as is an unexpected reward itself. The point here is that the early and late bursts observed in the Schultz et al. ( 27 , 30 ) experiment described above are really the same thing in TD class models. This means that there is no need to posit that dopamine neurons are doing two things during these trials: they seem to be just encoding reward prediction errors in a way well-predicted by theory.

Negative Reward Prediction Errors.

In the same paper mentioned above, Schultz et al. ( 30 ) also examined what happens when an expected reward is omitted and the animal experiences a negative prediction error. To examine this, monkeys were first trained to anticipate a reward after a visual cue as described above and then, on rare trials, they simply omitted the water reward at the end of the trial. Under these conditions, Schultz et al. ( 30 ) found that the neurons responded to the omitted reward with a decrement in their firing rates from baseline levels ( Fig. 3 ).

Fig. 3. “When a reward is cued and delivered, dopamine neurons respond only to the cue. When an expected reward is omitted after a cue the neuron responds with a suppression of activity as indicated by the oval” ( 29 ). [Reproduced with permission from ref. 29 (Copyright 1997, American Association for the Advancement of Science).]

Montague et al. ( 28 , 29 ) realized that this makes sense from the point of view of a TD class—and in this case, a Bush and Mosteller ( 23 , 24 ) class—reward prediction error. In this case, an unexpected visual cue predicted a reward. The neurons produced a burst of action potentials in response to this prediction error. Then, the predicted reward was omitted. This yields a negative prediction error, and indeed, the neurons respond after the omitted reward with a decrease in firing rates. One interesting feature of this neuronal response, however, is that the neurons do not respond with much of a decrease. The presentation of an unexpected reward may increase firing rates to 20 or 30 Hz from their 3- to 5-Hz baseline. Omitting the same reward briefly decreases firing rates to 0 Hz, but this is a decrease of only 3–5 Hz in total rate.

If one were to assume that firing rates above and below baseline were linearly related to the reward prediction error in TD class models, then one would have to conclude that primates should be less influenced in their valuations by negative prediction errors than by positive prediction errors, but we know that primates are much more sensitive to losses below expectation than to gains above expectation ( 32 – 35 ). Thus, the finding of Schultz et al. ( 27 , 30 ) that positive prediction errors shift dopamine firing rates more than negative prediction errors suggests either that the relationship between this firing rate and actual learning is strongly nonlinear about the zero point or that dopamine codes positive and negative prediction errors in tandem with a second system specialized for the negative component. This latter possibility was first raised by Daw et al. ( 36 ), who specifically proposed that two systems might work together to encode prediction errors, one for coding positive errors and one for coding negative errors.

TD Models and Dopamine Firing Rates.

The TD class models, however, predict much more than simply that some neurons must respond positively to positive prediction errors and negatively to negative prediction errors. These iterative computations also tell us about how these neurons must combine recent rewards in their reward prediction. Saying a system recursively estimates value by computing ( Eq. 6 )

$$V_{\text{next trial}} = V_{\text{previous trial}} + \alpha\left(R_{\text{current trial}} - V_{\text{previous trial}}\right)$$

is mathematically equivalent to saying that the computation of value averages recent rewards using an exponential weighting function of ( Eq. 7 )

$$V_{\text{next trial}} = \alpha R_{\text{current trial}} + \alpha(1-\alpha)R_{\text{previous trial}} + \alpha(1-\alpha)^{2}R_{\text{2 trials back}} + \alpha(1-\alpha)^{3}R_{\text{3 trials back}} + \cdots$$

where α, the learning rate, is a number between zero and one. If, for example, α has a value of 0.5, then ( Eq. 8 )

$$V_{\text{next trial}} = 0.5\,R_{\text{current trial}} + 0.25\,R_{\text{previous trial}} + 0.125\,R_{\text{2 trials back}} + 0.0625\,R_{\text{3 trials back}} + \cdots$$

If the dopamine neurons really do encode an RPE, they encode the difference between expected and obtained rewards. In a simple conditioning or choice task, that means that they encode something like ( Eq. 9 )

$$\text{RPE} = R_{\text{obtained}} - \left[\,0.5\,R_{\text{previous trial}} + 0.25\,R_{\text{2 trials back}} + 0.125\,R_{\text{3 trials back}} + \cdots\right]$$

The TD model presented by Sutton and Barto ( 26 ) tells us little about the value α should take under any specific set of conditions (here, it is arbitrarily set to 0.5), but we do know that the decay rate for the weights in the bracketed part of the equation above should decline exponentially for any stationary environment. We also know something else: when the prediction equals the obtained reward, then the prediction error should equal zero. That means that the actual value of R obtained should be exactly equal to the sum of the exponentially declining weights in the bracketed part of the equation.
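
As a quick numerical illustration (not taken from the original papers), the equivalence between the recursive update and the explicit exponentially weighted sum can be checked directly; the reward values below are arbitrary.

```python
import random

random.seed(1)
alpha = 0.5
rewards = [random.random() for _ in range(12)]

v = 0.0
for r in rewards:                                   # the recursive (iterative) update
    v += alpha * (r - v)

weighted = sum(alpha * (1 - alpha) ** k * r
               for k, r in enumerate(reversed(rewards)))   # the explicit weighted sum

print(round(v, 6), round(weighted, 6))              # the two numbers agree
```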

Bayer and Glimcher ( 37 ) tested these predictions by recording from dopamine neurons while monkeys engaged in a learning and choice task. In their experiment, monkeys had to precisely time when in a trial they would make a response for a reward. One particular response time would yield the most reward, but that best time shifted unexpectedly (with a roughly flat hazard function) across large blocks of trials. On each trial, the monkey could accumulate information from previous trials to make a reward prediction. Then, the monkey made his movement and received his reward. The difference between these two should have been the reward prediction error and thus, should be correlated with dopamine firing rates.

To test that prediction, Bayer and Glimcher ( 37 ) performed a linear regression between the history of rewards given to the monkey and the firing rates of dopamine neurons. The linear regression determines the weighting function that combines information about these previous rewards in a way that best predicts dopamine firing rates. If dopamine neurons are an iteratively computed reward prediction error system, then increasing reward on the current trial should increase firing rates. Increasing rewards on trials before that should decrease firing rates and should do so with an exponentially declining weight. Finally, the regression should indicate that the sum of old weights should be equal (and opposite in sign) to the weight attached to the current reward. In fact, this is exactly what Bayer and Glimcher ( 37 ) found ( Fig. 4 ).

Fig. 4. “The linear weighting function which best relates dopamine activity to reward history” ( 65 ). [Reproduced with permission from Oxford University Press from ref. 65 (Copyright 2011, Paul W. Glimcher).]

The dopamine firing rates could be well-described as computing an exponentially weighted sum of previous rewards and subtracting from that value the magnitude of the most recent reward. Furthermore, they found, as predicted, that the integral of the declining exponential weights was equal to the weight attributed to the most recent reward. It is important to note that this was not required by the regression in any way. Any possible weighting function could have come out of this analysis, but the observed weighting function was exactly that predicted by the TD model.
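
The logic of this analysis can be illustrated with simulated data: generate a noisy signal that encodes a reward prediction error computed with α = 0.5 and then regress it on reward history. The sketch below is an illustration of the kind of analysis described, not the published analysis code; the noise level, number of trials, and number of lags are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, alpha, n_lags = 10000, 0.5, 6

rewards = rng.random(n_trials)                # arbitrary reward magnitudes, one per trial
prediction, signal = 0.0, np.zeros(n_trials)
for t in range(n_trials):
    rpe = rewards[t] - prediction             # reward prediction error on trial t
    signal[t] = rpe + rng.normal(0, 0.05)     # "dopamine-like" signal with measurement noise
    prediction += alpha * rpe                 # iterative value update

# Regress the signal on the current reward and the five preceding rewards (plus an intercept).
columns = [np.ones(n_trials - n_lags)] + [rewards[n_lags - k: n_trials - k] for k in range(n_lags)]
X, y = np.column_stack(columns), signal[n_lags:]
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(weights[1:], 3))
# Roughly [ 1.0, -0.5, -0.25, -0.125, -0.062, -0.031 ]: a positive weight on the current
# reward and exponentially declining negative weights that sum to approximately -1.
```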

A second observation that Bayer and Glimcher ( 37 ) made, however, was that the weighting functions for positive and negative prediction errors (as opposed to rewards) were quite different. Comparatively speaking, the dopamine neurons seemed fairly insensitive to negative prediction errors. Although Bayer et al. ( 15 ) later showed that, with a sufficiently complex nonlinearity, it was possible to extract positive and negative reward prediction errors from dopamine firing rates, their data raise again the possibility that negative prediction errors might well be coded in tandem with another unidentified system.

Dopamine Neurons and Probability of Reward.

Following on these observations, Schultz et al. ( 27 , 30 ) observed yet another interesting feature of the dopamine neurons well-described by the TD model. In a widely read paper, Fiorillo et al. ( 38 ) showed that dopamine neurons in classical conditioning tasks seem to show a ramp of activity between cue and reward whenever the rewards are delivered probabilistically, as shown in Fig. 5 .

Fig. 5. “Peri-stimulus time histogram of dopamine neuron activity during a cued and probabilistically rewarded task” ( 37 ). [Reproduced with permission from ref. 37 (Copyright 2003, American Association for the Advancement of Science).]

Recall that TD class models essentially propagate responsibility for rewards backward in time. This is how responses to unexpected rewards move through time and attach to earlier stimuli that predict those later rewards. Of course, the theory predicts that both negative and positive prediction errors should propagate backward in time in the same way.

Now, with that in mind, consider what happens when a monkey sees a visual cue and receives a 1-mL water reward with a probability of 0.5, delivered 1 s after the cue. The average value of the cue is, thus, 0.5 mL. In one-half of all trials, the monkey gets a reward (a positive prediction error of 0.5), and in one-half of all trials, it does not get a reward (a negative prediction error of 0.5). One would imagine that these two signals would work their way backward in trial time to the visual cue. Averaging across many trials, one would expect to see these two propagating signals cancel out each other. However, what would happen if the dopamine neurons responded more strongly to positive than negative prediction errors ( 37 )? Under that set of conditions, the TD class models would predict that average dopaminergic activity would show the much larger positive prediction error propagating backward in time as a ramp—exactly what Schultz et al. ( 27 , 30 ) observed.

This observation of the ramp has been quite controversial and has led to a lot of confusion. Schultz et al. ( 27 , 30 ) said two things about the ramp: that the magnitude and shape of the ramp carried information about the history of previous rewards and that this was a feature suggesting that the neurons encoded uncertainty in a way not predicted by theory. The first of these observations is unarguably true. The second is true only if we assume that positive and negative prediction errors are coded as precise mirror images of one another. If instead, as the Bayer and Glimcher ( 37 ) data indicate, negative and positive prediction errors are encoded differentially in the dopamine neurons, then the ramp is not only predicted by existing theory, it is required. This is a point first made in print by Niv et al. ( 39 ).

Axiomatic Approaches.

How sure are we that dopamine neurons encode a reward prediction error? It is certainly the case that the average firing rates of dopamine neurons under a variety of conditions conform to the predictions of the TD model, but just as the TD class succeeded the Bush and Mosteller ( 23 , 24 ) class, we have every reason to believe that future models will improve on the predictions of TD. Therefore, can there ever be a way to say conclusively that the activity of dopamine neurons meets some absolute criteria of necessity and sufficiency with regard to reinforcement learning? To begin to answer that question, Caplin and Dean ( 40 ) brought a standard set of economic tools to the study of dopamine, asking whether there was a compact, testable, mathematically axiomatic way to state the current dopamine hypothesis.

After careful study, Caplin and Dean ( 40 ) were able to show that the entire class of reward prediction error-based models could be reduced to three compact and testable mathematical statements called axioms—common mathematical features that all reward prediction error-based models must include irrespective of their specific features.

  • i ) Consistent prize ordering. When the probabilities of obtaining specific rewards are fixed and the magnitudes of those rewards are varied, the ordering of obtained reward outcomes by neural activity (e.g., which reward produces more activity, regardless of how much more) must be the same regardless of the environmental conditions under which the rewards were received.
  • ii ) Consistent lottery ordering. When rewards are fixed and the probabilities of obtaining specific rewards are varied, the ordering of rewards by neural activity (e.g., which reward outcome produces more activity) should be the same for all of the reward outcomes that can occur under a given set of probabilities.
  • iii ) No surprise equivalence. The final criterion of necessity and sufficiency identified by Caplin and Dean ( 41 ) was that RPE signals must respond identically to all fully predicted outcomes (whether good or bad), conditions under which the reward prediction error is zero.

Caplin and Dean ( 40 , 41 ) showed that any RPE system, whether a Bush and Mosteller ( 23 , 24 ) class or TD class model, must meet these three axiomatic criteria. Saying that an observed system violated one or more of these axioms, they showed, was the same as saying that it could not, in principle, serve as a reward prediction error system. Conversely, they showed that, for any system that obeyed these three rules, neuronal activity could without a doubt be accurately described using at least one member of the reward prediction error model class. Thus, what was important about the axiomatization of the class of all RPE models by Caplin and Dean ( 40 , 41 ) is that it provided a clear way to test this entire class of hypotheses.
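
To make the logic of the axioms concrete, the sketch below checks them against a small table of hypothetical measured responses. The prizes, probabilities, numbers, and tolerance are invented purely for illustration and are not data from the studies discussed here.

```python
# activity[(prize, p)] = average neural response when `prize` is obtained under a
# lottery that delivers the large prize with probability p (invented numbers).
LARGE, SMALL = 1.0, 0.0
activity = {
    (LARGE, 0.25): 2.4,  (LARGE, 0.50): 1.6,  (LARGE, 0.75): 0.8,  (LARGE, 1.00): 0.1,
    (SMALL, 0.25): -0.5, (SMALL, 0.50): -1.0, (SMALL, 0.75): -1.7, (SMALL, 0.00): 0.1,
}
mixed_lotteries = [0.25, 0.50, 0.75]   # lotteries in which either prize can occur

# Axiom 1 (consistent prize ordering): the prize producing more activity must be the
# same prize under every lottery.
axiom1 = len({activity[(LARGE, p)] > activity[(SMALL, p)] for p in mixed_lotteries}) == 1

# Axiom 2 (consistent lottery ordering): ranking the lotteries by activity must give the
# same order whether the large or the small prize was obtained.
order_given_large = sorted(mixed_lotteries, key=lambda p: activity[(LARGE, p)])
order_given_small = sorted(mixed_lotteries, key=lambda p: activity[(SMALL, p)])
axiom2 = order_given_large == order_given_small

# Axiom 3 (no surprise equivalence): fully predicted outcomes, good or bad, must produce
# the same response (the tolerance here is arbitrary).
axiom3 = abs(activity[(LARGE, 1.00)] - activity[(SMALL, 0.00)]) < 0.2

print(axiom1, axiom2, axiom3)   # True, True, True for these invented numbers
```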

In a subsequent experiment, Caplin et al. ( 42 ) then performed an empirical test of the axioms on brain activations (measured with functional MRI) in areas receiving strong dopaminergic inputs by constructing a set of monetary lotteries and having human subjects play those lotteries for real money. In those experiments, subjects either won or lost $5 on each trial, and the probabilities of winning or losing were systematically manipulated. The axioms indicate that for a reward prediction error encoding system under these conditions, three things will occur.

  • i ) Winning $5 must always give rise to more activity than losing $5, regardless of the probability (from consistent prize ordering).
  • ii ) The more certain you are that you will win, the lower must be the neural activation to winning, and conversely, the more certain you are that you will lose, the higher must be the activity to losing (from consistent lottery ordering).
  • iii ) If you are certain of an outcome, whether it be winning or losing, neural activity should be the same, regardless of whether you win or lose $5 (from no surprise equivalence).

What they found was that activations in the insula violated the first two axioms of the reward prediction error theory. This was an unambiguous indication that the blood oxygen level-dependent (BOLD) activity in the insula could not, in principle, serve as an RPE signal for learning under the conditions that they studied. In contrast, activity in the ventral striatum obeyed all three axioms and thus, met the criteria of both necessity and sufficiency for serving as an RPE system. Finally, activity in the medial prefrontal cortex and the amygdala yielded an intermediate result. Activations in these areas seemed to weakly violate one of the axioms, raising the possibility that future theories of these areas would have to consider the possibility that RPEs either were not present or were only a part of the activation pattern here.

The paper by Caplin et al. ( 42 ) was important, because it was, in a real sense, the final proof that some areas activated by dopamine, the ventral striatum in particular, can serve as a reward prediction error encoder of the type postulated by TD models. The argument that this activation only looks like an RPE signal can now be entirely dismissed. The pattern of activity that the ventral striatum shows is both necessary and sufficient for use in an RPE system. That does not mean that it has to be such a system, but it draws us closer and closer to that conclusion.

Cellular Mechanisms of Reinforcement Learning

In the 1940s and 1950s, Hebb ( 43 ) was among the first to propose that alterations of synaptic strength based on local patterns of activation might serve to explain how conditioned reflexes operated at the biophysical level. Bliss and Lomo ( 44 ) succeeded in relating these two sets of concepts when they showed long-term potentiation (LTP) in the rabbit hippocampus. Subsequent biophysical studies have shown several other mechanisms for altering synaptic strength that are closely related to both the theoretical proposal of Hebb ( 43 ) and the biophysical mechanism of Bliss and Lomo ( 44 ). Wickens ( 45 ) and Wickens and Kotter ( 46 ) proposed the most relevant of these for our discussion, which is often known as the three-factor rule. What Wickens ( 45 ) and Wickens and Kotter ( 46 ) proposed was that synapses would be strengthened whenever presynaptic and postsynaptic activities co-occurred with dopamine, and these same synapses would be weakened when presynaptic and postsynaptic activities occurred in the absence of dopamine. Indeed, there is now growing understanding at the biophysical level of the many steps by which dopamine can alter synaptic strengths ( 47 ).
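
A minimal sketch of a three-factor update of the kind described above may help fix the idea; the learning rate, baseline dopamine level, and binary activity variables are illustrative assumptions, not parameters taken from the cited work.

```python
def three_factor_update(weight, pre_active, post_active, dopamine, baseline=0.1, lr=0.01):
    """Strengthen a synapse when pre- and postsynaptic activity co-occur with dopamine,
    and weaken it when the same coincident activity occurs without dopamine."""
    if pre_active and post_active:
        weight += lr * (dopamine - baseline)   # above-baseline dopamine -> LTP; absent -> LTD
    return weight


w = 0.5
w = three_factor_update(w, pre_active=True, post_active=True, dopamine=1.0)   # strengthened
w = three_factor_update(w, pre_active=True, post_active=True, dopamine=0.0)   # weakened
w = three_factor_update(w, pre_active=False, post_active=True, dopamine=1.0)  # unchanged
print(round(w, 4))
```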

Why is this important for models of reinforcement learning? An animal experiences a large positive reward prediction error: he just earned an unexpected reward. The TD model tells us that, under these conditions, we want to increment the value attributed to all actions or sensations that have just occurred. Under these conditions, we know that the dopamine neurons release dopamine throughout the frontocortical–basal ganglia loops and do so in a highly homogenous manner. The three-factor rule implies that any dopamine receptor-equipped neuron, active because it just participated in, for example, a movement to a lever, will have its active synapses strengthened. Thus, whenever a positive prediction error occurs and dopamine is released throughout the frontal cortices and the basal ganglia, any segment of the frontocortical–basal ganglia loop that is already active will have its synapses strengthened.

To see how this would play out in behavior, consider that neurons of the dorsal striatum form maps of all possible movements into the extrapersonal space. Each time that we make one of those movements, the neurons associated with that movement are active for a brief period and that activity persists after the movement is complete ( 48 , 49 ). If any movement is followed by a positive prediction error, then the entire topographic map is transiently bathed in the global prediction error signal carried by dopamine into this area. What would this combination of events produce? It would produce a permanent increment in synaptic strength only among those neurons associated with recently produced movements. What would that synapse come to encode after repeated exposure to dopamine? It would come to encode the expected value (or perhaps, more precisely, the expected subjective value) of the movement.

What is critical to understand here is that essentially everything in this story is a preexisting observation of properties of the nervous system. We know that neurons in the striatum are active after movements as required of (the eligibility traces of) TD models. We know that a blanket dopaminergic prediction error is broadcast throughout the frontocortical–basal ganglia loops. We know that dopamine produces LTP-like phenomena in these areas when correlated with underlying activity. In fact, we even know that, after conditioning, synaptically driven action potential rates in these areas encode the subjective values of actions ( 48 – 51 ). Therefore, all of these biophysical components exist, and they exist in a configuration that could implement TD class models of learning.

We even can begin to see how the prediction error signal coded by the dopamine neurons could be produced. We know that neurons in the striatum encode, in their firing rates, the learned values of actions. We know that these neurons send outputs to the dopaminergic nuclei—a reward prediction. We also know that the dopaminergic neurons receive fairly direct inputs from sensory areas that can detect and encode the magnitudes of consumed rewards. The properties of sugar solutions encoded by the tongue, for example, have an almost direct pathway through which these signals can reach the dopaminergic nuclei. Given that this is true, constructing a prediction error signal at the dopamine neurons simply requires that excitatory and inhibitory synapses take the difference between predicted and experienced reward in the voltage of the dopamine neurons themselves or their immediate antecedents.

Summary and Conclusion

The basic outlines of the dopamine reward prediction error model seem remarkably well-aligned with data at both the biological and the behavioral level; a wide range of behavioral and physiological phenomena seem well-described in a parsimonious way by this hypothesis. The goal of this presentation has been to communicate the key features of that alignment, which has been mediated by rigorous computational theory. It is important to note, however, that many observations do exist that present key challenges to the existing dopamine reward prediction error model. Most of these challenges are reviewed in Dayan and Niv ( 52 ). * It is also true that the reward prediction error hypothesis has focused almost entirely on the phasic responses of the dopamine neurons. It is unarguably true that the tonic activity of these neurons is also an important clinical and physiological feature ( 55 ) that is only just beginning to receive computational attention ( 56 , 57 ).

One more recent challenge that deserves special mention arises from the work of Matsumoto and Hikosaka ( 58 ), who have recently documented the existence of neurons in the ventro-lateral portion of the SNc that clearly do not encode a reward prediction error. They hypothesize that these neurons form a second physiologically distinct population of dopamine neurons that plays some alternative functional role. Although it has not yet been established that these neurons do use dopamine as their neurotransmitter (which can be difficult) ( 11 ), this observation might suggest the existence of a second group of dopamine neurons whose activity lies outside the boundaries of current theory.

In a similar way, Ungless et al. ( 59 ) have shown that, in anesthetized rodents, some dopamine neurons in the VTA respond positively to aversive stimuli. Of course, for an animal that predicts a very aversive event, the occurrence of an only mildly aversive event would be a positive prediction error. Although it is hard to know what predictions the nervous system of an anesthetized rat might make, the observation that some dopamine neurons respond to aversive stimuli poses another important challenge to existing theory that requires further investigation.

Despite these challenges, the dopamine reward prediction error has proven remarkably robust. Caplin et al. ( 42 ) have shown axiomatically that dopamine-related signals in the ventral striatum can, by definition, be described accurately with models of this class. Montague et al. ( 29 ) have shown that the broad features of dopamine activity are well-described by TD class ( 26 ) models. More detailed analyses like those by Bayer and Glimcher ( 37 ) have shown quantitative agreement between dopamine firing rates and key structural features of the model. Work in humans ( 60 , 61 ) has shown that activity in dopaminergic target areas is also well-accounted for by the general features of the model in this species. Similar work in rats also reveals the existence of a reward prediction error-like signal in midbrain dopamine neurons of that species ( 62 ). Additionally, it is also true that many of the components of larger reinforcement learning circuits in which the dopamine neurons are believed to be embedded have also now been identified ( 48 – 51 , 63 – 65 ). Although it is always true that existing scientific models turn out to be incorrect at some point in the future with new data, there can be little doubt that the quantitative and computational study of dopamine neurons is a significant accomplishment of contemporary integrative neuroscience.


*It is important to acknowledge that there are alternative views of the function of these neurons. Berridge ( 53 ) has argued that dopamine neurons play a role closely related to the one described here that is referred to as incentive salience. Redgrave and Gurney ( 54 ) have argued that dopamine plays a central role in processes related to attention.


New insights into the role of dopamine in reinforcement learning

11 March 2022

A new study from Dr Yanfeng Zhang has uncovered the first evidence that dopamine-dependent long-term potentiation is also gated by the pause of striatal cholinergic interneurons and the depolarisation of striatal spiny projection neurons. This discovery overturns the previous idea that phasic dopamine release is the only factor gating corticostriatal synaptic plasticity, changing our understanding of dopamine's functions in reinforcement learning.

A striatal spiny projection neuron labelled with neurobiotin (scale bar 20 µm; inset 2 µm).

The neurotransmitter dopamine is vital to reinforcement learning. The phasic activity of dopamine neurons, which in part encodes reward prediction error, is thought to reinforce the behaviour of animals to maximise chances of receiving reward in the future. The cellular mechanism of this reinforcement learning is believed to be dopamine-dependent long-term plasticity, or more specifically, potentiation of the efficacy of corticostriatal synapses on spiny projection neurons (SPNs) due to phasic dopamine release into the striatum.

However, previous theories have been unable to explain the heterogeneous effects of widespread dopamine activity on corticostriatal synapses seen in animal studies (in vivo). Not all events that drive phasic activity of dopamine neurons potentiate all corticostriatal synapses on SPNs. Firstly, phasic activity in dopamine neurons encodes information other than reward prediction error (RPE), such as motivation; consequently, long-term plasticity should be induced only by RPE-related dopamine activity and not by the rest. Secondly, out of thousands of active synapses, the dopamine signal should potentiate or depress only the synapses involved in the behaviour that leads to the reward. These issues strike at the very heart of the fundamental principles of reinforcement learning.

Striatal cholinergic interneurons (ChIs) develop an excitation-pause-rebound multiphasic response during learning. This pause phase coincides with phasic activity in dopamine neurons. ChIs have therefore been speculated to provide a time window for differentiating ‘rewarding’ phasic dopamine from the rest. However, this hypothesis has never been appropriately tested because of technical challenges. Simulating a pause in ChIs requires sufficient tonic background firing activity in ChIs, which only occurs in intact brains. It is technically almost impossible to manipulate the sparsely distributed ChIs to form a synchronised pause in vivo, and it is even harder to trigger phasic activity in dopamine neurons at the same time.

In a 2018 study published in Neuron and first authored by Dr Yanfeng Zhang, researchers identified that the local field potential could be used as a proxy for the firing pattern of striatal ChIs after cortical stimulation, or during the slow oscillation in urethane-anaesthetised rats. This finding finally enabled the researchers to trigger phasic activity in dopamine neurons to coincide with either the pause or the excitation of ChIs.

In a new study published today in Nature Communications, researchers revealed that long-term potentiation (LTP) was induced only when the ChI pause coincides with phasic activity of dopamine neurons, regardless of whether the dopamine neurons were activated by a physiologically meaningful visual stimulation or a train of electrical stimulation. This was the first evidence that the ChI pause is required for dopamine-dependent LTP.

In addition to identifying ChIs as being involved in dopamine-dependent LTP, Dr Zhang and colleagues further tested the hypothesis that depolarisation of the postsynaptic SPNs is required. Although it has been proposed for decades on the basis of ex vivo evidence, this hypothesis had not been confirmed in vivo for dopamine-dependent LTP. The team then performed control experiments and found that long-term depression (LTD), not LTP, was induced if the SPNs were not depolarised during the period of phasic dopamine release. They have therefore provided the first in vivo evidence that, in dopamine-dependent plasticity, LTP is induced only at synapses with depolarised SPNs.

By using two distinct in vivo preparations, Dr Zhang and colleagues have now demonstrated that long-term potentiation of corticostriatal synapses on SPNs requires the coincidence of phasic activity of dopamine neurons, the pause of striatal cholinergic interneurons, and the depolarisation of SPNs. 

The full paper "Coincidence of cholinergic pauses, dopaminergic activation and depolarisation of spiny projection neurons drives synaptic plasticity in the striatum" is available to read in Nature Communications.

Text credit to Dr Yanfeng Zhang




The Best Reinforcement Learning Papers from the ICLR 2020 Conference

Last week I had the pleasure of participating in the International Conference on Learning Representations (ICLR), an event dedicated to research on all aspects of representation learning, commonly known as deep learning. The conference went virtual due to the coronavirus pandemic, and thanks to the huge effort of its organizers, the event attracted an even bigger audience than last year. Their goal was for the conference to be inclusive and interactive, and from my point of view as an attendee, it was definitely the case!

Inspired by the presentations from over 1300 speakers, I decided to create a series of blog posts summarizing the best papers in four main areas. You can catch up with the first post about the best deep learning papers here, and today it's time for the 15 best reinforcement learning papers from the ICLR.

The Best Reinforcement Learning Papers

1. Never Give Up: Learning Directed Exploration Strategies

We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.

(TL;DR, from OpenReview.net )


Main authors: 

Adrià Puigdomènech Badia

Pablo Sprechmann

2. Program Guided Agent

We propose a modular framework that can accomplish tasks specified by programs and achieve zero-shot generalization to more complex tasks.

First author: Shao-Hua Sun

3. Model Based Reinforcement Learning for Atari

We use video prediction models, a model-based reinforcement learning algorithm and 2h of gameplay per game to train agents for 26 Atari games.


Main authors:

Łukasz Kaiser

Błażej Osiński

4. Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents

We generate critical states of a trained RL algorithm to visualize potential weaknesses.


First author: Christian Rupprecht

5. Meta-Learning Without Memorization

We identify and formalize the memorization problem in meta-learning and solve this problem with a novel meta-regularization method, which greatly expands the domains to which meta-learning can be applied effectively.

Main authors:

Mingzhang Yin

Chelsea Finn

6. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

Exponential lower bounds for value-based and policy-based reinforcement learning with function approximation.

First author: Simon S. Du

7. The Ingredients of Real World Robotic Reinforcement Learning

System to learn robotic tasks in the real world with reinforcement learning without instrumentation.

First author: Henry Zhu

8. Improving Generalization in Meta Reinforcement Learning using Learned Objectives

We introduce MetaGenRL, a novel meta reinforcement learning algorithm. Unlike prior work, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training.


First author: Louis Kirsch

9. Making Sense of Reinforcement Learning and Probabilistic Inference

Popular algorithms that cast “RL as Inference” ignore the role of uncertainty and exploration. We highlight the importance of these issues and present a coherent framework for RL and inference that handles them gracefully.


First author: Brendan O’Donoghue

10. SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference

SEED RL is a scalable and efficient deep reinforcement learning agent with accelerated central inference. It achieves state-of-the-art results, reduces cost, and can process millions of frames per second.

First author: Lasse Espeholt

11. Multi-agent Reinforcement Learning for Networked System Control

This paper proposes a new formulation and a new communication protocol for networked multi-agent control problems.


First author: Tianshu Chu

12. A Generalized Training Approach for Multiagent Learning

This paper studies and extends Policy-Space Response Oracles (PSRO), a population-based learning method that uses game-theoretic principles. The authors extend the method so that it's applicable to multi-player games, while providing convergence guarantees in multiple settings.


First author: Paul Muller

13. Implementation Matters in Deep RL: A Case Study on PPO and TRPO

Sometimes an implementation detail may play a role in your research. Here, two policy search algorithms were evaluated: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). “Code-level optimizations” should, in principle, be negligible for the learning dynamics. Surprisingly, it turns out that these optimizations have a major impact on agent behavior.

Logan Engstrom

Aleksander Madry

14. A Closer Look at Deep Policy Gradients

This is an in-depth, empirical study of the behavior of deep policy gradient algorithms. The authors analyse SOTA methods with respect to gradient estimation, value prediction, and optimization landscapes.


Andrew Ilyas

Aleksander Madry

15. Meta-Q-Learning

MQL is a simple off-policy meta-RL algorithm that recycles data from the meta-training replay buffer to adapt to new tasks.

Rasool Fakoor

Alexander J. Smola

The depth and breadth of the ICLR publications are quite inspiring. Here, I just presented the tip of the iceberg, focusing on the “reinforcement learning” topic. However, as you can read in this analysis, there were four main areas discussed at the conference:

  • Deep learning (covered in our previous post )
  • Reinforcement learning (covered in this post)
  • Generative models ( here )
  • Natural Language Processing/Understanding ( here )

In order to create a more complete overview of the top papers at ICLR, we are building a series of posts, each focused on one topic mentioned above. You may want to check them out for a more complete overview.

Feel free to share with us other interesting papers on reinforcement learning and we will gladly add them to the list.

Enjoy reading!



Open access | Published: 19 February 2024

The impact of using reinforcement learning to personalize communication on medication adherence: findings from the REINFORCE trial

  • Julie C. Lauffenburger   ORCID: orcid.org/0000-0002-4940-4140 1 ,
  • Elad Yom-Tov 2 ,
  • Punam A. Keller 3 ,
  • Marie E. McDonnell 4 ,
  • Katherine L. Crum   ORCID: orcid.org/0000-0002-0074-0645 1 ,
  • Gauri Bhatkhande 1 ,
  • Ellen S. Sears 1 ,
  • Kaitlin Hanken 1 ,
  • Lily G. Bessette   ORCID: orcid.org/0000-0003-1088-8579 1 ,
  • Constance P. Fontanet 1 ,
  • Nancy Haff 1 ,
  • Seanna Vine 1 &
  • Niteesh K. Choudhry 1  

npj Digital Medicine volume 7, Article number: 39 (2024)


  • Drug therapy
  • Health services
  • Outcomes research
  • Public health
  • Type 2 diabetes

Text messaging can promote healthy behaviors, like adherence to medication, yet its effectiveness remains modest, in part because message content is rarely personalized. Reinforcement learning has been used in consumer technology to personalize content but with limited application in healthcare. We tested a reinforcement learning program that identifies individual responsiveness (“adherence”) to text message content and personalizes messaging accordingly. We randomized 60 individuals with diabetes and glycated hemoglobin A1c [HbA1c] ≥ 7.5% to a reinforcement learning intervention or control (no messages). Both arms received electronic pill bottles to measure adherence. The intervention improved absolute adjusted adherence by 13.6% (95%CI: 1.7%–27.1%) versus control and was more effective in patients with HbA1c 7.5–<9.0% (36.6%, 95%CI: 25.1%–48.2%, interaction p < 0.001). We also explored whether individual patient characteristics were associated with differential response to tested behavioral factors and unique clusters of responsiveness. Reinforcement learning may be a promising approach to improve adherence and personalize communication at scale.


Introduction

Text messages can be delivered at low cost and provide reminders, education, and motivational support for health behaviors on an ongoing basis 1 . They have demonstrated effectiveness for supporting physical activity, medication adherence, and other daily self-management activities that are guideline recommended for managing chronic diseases, like type 2 diabetes 2 , 3 . However, many prior text messaging interventions have used generic message content (i.e., the same messages delivered to all patients) 4 , 5 , 6 .

Yet, a key principle for changing health behaviors is personalization and how information is presented to match an individual’s specific needs, which may also change over time 7 , 8 , 9 , 10 . Personalization can be based upon simple characteristics, such as name, age, or health metrics 11 . More detailed personalization could potentially be achieved by incorporating routines or behavioral barriers, and adjusting frequently 12 , 13 .

A major obstacle to achieving personalization based on underlying behavioral tendencies is the ability to predict what patients will actually respond to. Traditionally, theory-based assessments or expert opinion (like barrier elicitation by clinicians) have been used to tailor behavioral messaging, particularly at the outset and sometimes with updates at intervals 14 , 15 , 16 , 17 , 18 . For example, the REACH trial used interactive texts that asked participants directly about their adherence through weekly feedback, and another recent trial used dynamic tailoring based on patients’ implementation intention plan 17 , 18 . An alternative approach is to observe what content patients actually respond to and use those observations as the basis for predicting what they will respond to in the future. This process is made feasible by mobile health tools (like electronic pill bottles) that passively measure health behaviors on an ongoing basis. Consistent with this, there is emerging interest in just-in-time adaptive interventions (JITAIs), or an intervention design that adapts support (e.g., type, timing, intensity) over time in response to an individual’s changing status and context 19 , 20 .

An efficient approach to achieve such personalized intervention is with the use of reinforcement learning 21 , 22 . This machine learning method trains a statistical model based on the rewards that the model’s actions earn in an environment. In the context of behavior change, the model observes individual behaviors in response to cues it provides (like text messages) and learns to optimize response (like adherence) through systematic trial-and-error 23 , 24 . This technique has technological underpinnings applied in computer gaming and robotics 21 , 25 , 26 , 27 . In contrast to other approaches to achieving personalization, reinforcement learning predicts the effectiveness of different intervention components and can use latently derived estimates for tailoring (rather than end user input); and, as interventions are deployed, it updates the predictions based on their successes and failures (both at the individual and group level) 28 . That is, the algorithm “learns” to personalize as it experiments, or “adapts” 29 .
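
As a rough illustration of this trial-and-error idea (and not the algorithm used in the trial), a minimal epsilon-greedy sketch over hypothetical message types might look like this:

```python
import random

# Hypothetical sketch of learning by trial-and-error over message types.
# 'reward' stands for an observed behaviour (e.g., 1 = dose taken the next day).
message_types = ["reminder", "positive_framing", "social", "reflective"]
value = {m: 0.0 for m in message_types}   # estimated response rate per message type
count = {m: 0 for m in message_types}
epsilon = 0.1                             # fraction of messages kept for exploring

def choose_message():
    if random.random() < epsilon:
        return random.choice(message_types)            # explore
    return max(message_types, key=lambda m: value[m])  # exploit current best guess

def update(message, reward):
    count[message] += 1
    # incremental average: shift the estimate toward the newly observed response
    value[message] += (reward - value[message]) / count[message]

# each day: m = choose_message(); send(m); later observe reward r; update(m, r)
```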

Reinforcement learning has thus far had limited use in health care 27 , 28 , 30 , 31 , 32 and has not been applied to medication adherence, an essential daily activity for most patients with chronic disease, and especially diabetes, which affects 529 million individuals globally 2 , 33 . While machine learning generally has been shown to be helpful in measuring suboptimal adherence 34 , 35 , there remains much opportunity to explore how it and related techniques can improve adherence. Accordingly, we launched the REinforcement learning to Improve Non-adherence For diabetes treatments by Optimizing Response and Customizing Engagement (REINFORCE) trial to evaluate the impact of a text messaging program tailored using reinforcement learning on medication adherence for patients with type 2 diabetes 22 .

The trial design has been published 22 with expanded details in the Methods. In brief, 60 patients with type 2 diabetes (with their latest glycated hemoglobin A1c [HbA1c] lab value ≥ 7.5% in the past 180 days) were randomized to a reinforcement learning intervention or control (no intervention) based on pre-specified power calculations. In both arms, patients received a separate electronic pill bottle for each of their diabetes medications, with bottles that look like those dispensed by retail pharmacies but with an electronic cap that recorded the dates and times in which participants took their medications. A figure of the infrastructure was previously published 22 . The reinforcement learning algorithm personalized daily texts based on adherence, patient characteristics, and message history using the following 5 behavioral factors: (1) how the messages are structured (“Framing”; classified as neutral, positive [invoking positive outcomes of medication use], or negative [invoking consequences of medication non-use]), (2) observed feedback (“History”, i.e., including the number of days in the prior week the patient was adherent), (3) social reinforcement (“Social”, i.e., referring to loved ones), (4) whether content was a reminder or informational (“Content”), and (5) whether the text included a reflective question (“Reflective”). Individual messages contained elements from these different factor sets, examples of which have been published previously 22 . The primary outcome was average pill bottle-measured adherence over a 6-month follow-up. After trial completion, we described the performance of the reinforcement learning algorithm process itself and explored responsiveness to behavioral factors using subgroup analyses and clustering methods, as prior work has suggested that there may be important differences in responsiveness 24 .

Among 60 patients, 29 and 31 were randomized to the intervention and control arms, respectively, of which 1 intervention and 3 control patients did not complete follow-up (Fig. 1 ). All 60 patients were included in the intention-to-treat analysis.

Figure 1: This diagram shows a visual representation of the flow of patients through the trial.

In total, 26 patients (43%) were female and 35 (58%) were White (Table 1 ). Baseline characteristics were slightly different between the arms based on absolute standardized differences but were well-balanced on key metrics including age, sex, baseline HbA1c values, and baseline adherence. Intervention group patients had less formal education (e.g., 24.1% vs. 16.1% having no more than a high school education) and took more oral diabetes medications (31% vs. 19% taking ≥2 medications) versus control patients.

Description of the reinforcement learning algorithm learning process

In total, 5143 text messages were sent to patients in the intervention arm ( n  = 29) during the 6-month study period. Intervention patients received daily messages; an average of 27.7 (SD: 5.9) unique messages were sent to each patient (Table 2 ). In aggregate, 514 (10.0%), 2473 (48.1%), and 2058 (40.0%) of text messages contained ≥3, ≥4, and ≥5 behavioral factors respectively.

The reinforcement learning algorithm also adapted its selection of behavioral factors in the text messages; the proportions of intervention arm patients who received the five factors over the trial are shown descriptively in Supplemental Fig. 1 panels. For example, positive framing as a factor (Supplemental Fig. 1a ) was initially not frequently selected by the algorithm during the first two months of the trial but became more prevalent later. By contrast, negative framing was more commonly selected at first but decreased over time (Supplemental Fig. 1b ). Other plots for receipt of history, social reinforcement, content, and reflection are shown in Supplemental Fig. 1c–f , respectively. More patients were selected to receive social reinforcement, content, and reflection as factors as the trial progressed, while the proportion of patients receiving history (observed feedback) remained relatively equal over time.

Figure 2 shows the change in adjusted R² of the reinforcement learning algorithm over the trial. This statistic, which describes the extent to which adherence is explained by algorithm predictions for behavior following a message sent to participants, increased over time, indicating that the algorithm learned to send more effective messages to patients.

Figure 2: The adjusted R² from trial calendar day 31 to 279 (March 13, 2021–December 19, 2021) is plotted from each day’s model. We calculated adjusted R² from the proportion of variance in daily adherence that is explained by the five intervention factors in the reinforcement learning model. We selected those windows as they each had a minimum of 5 patient observations that day.
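
For readers who want the statistic spelled out, here is a minimal sketch of how an adjusted R² of daily adherence on the five factor indicators could be computed. The data layout and variable names below are assumptions for illustration, not the trial's code.

```python
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R^2 of an OLS fit of y on X (intercept added internally)."""
    X = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape                                 # p counts the intercept too
    r2 = 1 - resid.var() / y.var()
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Synthetic illustration: daily adherence (0..1) regressed on 0/1 indicators
# for the five behavioral factors in that day's message (not trial data).
rng = np.random.default_rng(0)
factors = rng.integers(0, 2, size=(200, 5)).astype(float)
adherence = np.clip(0.5 + 0.1 * factors[:, 0] + 0.1 * rng.standard_normal(200), 0, 1)
print(adjusted_r2(adherence, factors))
```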

The most influential features and interactions from the reinforcement learning algorithm are shown in Fig. 3 . Fixed characteristics that carried the most weight within the model were baseline HbA1c, self-reported level of patient activation, number of medications included in electronic pill bottles, concomitant insulin use, and employment status based on their interactions. The behavioral factors with the largest weight included positive framing, observed feedback, and social reinforcement.

Figure 3: This figure shows the model weights from the feature importance score from the reinforcement learning algorithm, which indicate which features were more or less important to the model. The weights shown are for the 20 most influential features, ranked from highest to lowest. Abbreviations: HbA1c, glycated hemoglobin A1c; SGLT2, sodium-glucose cotransporter-2.

Effect of the reinforcement learning intervention on the primary outcome

Over the 6-month follow-up, average adherence to medication was 74.3% (SD: 30.8%) in the reinforcement learning intervention arm compared with 67.7% (SD: 29.4%) in the control arm (Fig. 4). After adjusting for the block randomized design and baseline characteristics, average adherence among intervention patients was 13.6% (95%CI: 1.7%, 27.1%, p = 0.047) higher than control (shown in Fig. 4). Sensitivity analyses, including omitting the first two weeks of pill bottle data and censoring patients in both arms after 30 days of pill bottle non-use (3 patients and 1 patient in the intervention and control arms, respectively), did not change the results (Supplemental Table 1).

Figure 4: We used generalized estimating equations with an identity link and normally-distributed errors to evaluate the effect of the intervention on adherence to medication measured by pill bottles compared with control. The points on the figure are the point estimates and the error bars are the 95% confidence intervals from the relevant sample sizes. These models were adjusted for baseline characteristics and the block randomized design. The primary outcome is shown at the top. The results of exploratory subgroup analyses by key demographic and clinical characteristics are also shown; these were performed by repeating the same models within each subgroup, using interaction p-values to assess differences between subgroups.

Hypothesis-generating demographic and clinical subgroup analyses that explored interactions between patient characteristics and the intervention’s effectiveness on adherence are also shown in Fig. 4. The strongest interaction between the overall effectiveness of reinforcement learning and adherence was by baseline HbA1c level. Specifically, in patients with HbA1c 7.5–<9.0%, the intervention improved adherence by 33.6% (95%CI: 15.9%, 51.4%) versus control, in contrast with those with baseline HbA1c ≥ 9% (interaction p value: 0.001), in whom there was no significant difference compared with control. In patients who were non-adherent at baseline (i.e., self-reported missing >1 medication dose in the 30 days before enrollment), the intervention improved adherence by 33.0% (95%CI: 13.1%, 52.8%) versus control, but this interaction was not significant (interaction p value: 0.214).

Exploratory analyses of responsiveness to behavioral factors

In hypothesis-generating analyses, we explored whether responsiveness to the tested behavioral factors (determined by optimal adherence) differed by patient baseline characteristics. As shown in Fig. 5, patients who were aged <65 years (compared with ≥65), were of White race/ethnicity (compared with non-White), had HbA1c < 9% (compared with ≥9%), were of other marital status (compared with married/partnered), and were taking multiple medications (compared with 1) responded better than their counterparts to most or almost all behavioral factors. In contrast, women were more responsive to messages reporting their medication-taking history than men but were less responsive to other factors. Finally, patients who were more non-adherent at baseline (self-reported missing >1 dose, compared to those who reported missing ≤1 dose in the last 30 days) were more responsive to positively-framed messages and less responsive to messages reporting their medication-taking history, but had similar responsiveness to all other factors.

Figure 5: This figure shows the results of these exploratory analyses with the outcome being optimal adherence (adherence = 1) for the day after the factor was selected and sent within the text message. We used generalized estimating equations for each behavioral factor with a log link and binary-distributed errors, adjusted for patient baseline characteristics but unadjusted for patient-level clustering. Light red indicates a negative association (relative risk 0.50–0.99); light blue indicates a positive association (relative risk 1.01–1.50); and dark blue indicates a strong positive association (relative risk ≥1.50). Abbreviations: CI, confidence interval; HbA1c, glycated hemoglobin A1c.

Adherence differed based on whether that behavioral factor had been sent the prior day (Fig. 6 ). For instance, adherence was highest when negatively-framed messages and messages containing observed medication feedback were sent two days in a row (i.e., red columns). By contrast, no difference in adherence was observed when text messages including and not including the behavioral factor were alternated.

Figure 6: This figure shows the average daily adherence measured by pill bottle over the course of the trial among the 29 intervention arm participants. These results are stratified by whether the text message sent on the prior day and/or the same day contained that intervention factor (e.g., positive framing). For example, the dark blue bar for “positive framing” indicates the level of adherence when the prior day’s text message contained positive framing but that day’s text message did not.

Using k-means clustering analysis of average adherence given the behavioral factors, we identified three unique patient clusters (Fig. 7). These clusters included: (1) Group 1 (Orange, n = 9) responding best to observed feedback, (2) Group 2 (Yellow, n = 4) responding best to social reinforcement and observed feedback, and (3) Group 3 (Blue, n = 16) responding equally to all message types. Individuals who were married/partnered were more likely to be in Group 1 compared with the other two groups, but most associations were non-significant owing at least in part to small sample size (Table 3).

Figure 7: Each color represents one of the three different patient groups identified from the exploratory k-means clustering analysis for the average pill bottle adherence measured over 6 months (primary outcome). These groups include: (1) Group 1 (Orange, n = 9) was the most adherent in response to observed feedback (“history”), (2) Group 2 (Yellow, n = 4) was the most adherent in response to social reinforcement or observed feedback, and (3) Group 3 (Blue, n = 16) was equally adherent in response to all types of messages. The error bars show the standard error for the cluster based on the underlying sample size (i.e., a threshold of ≥25 observations was applied).

In this randomized-controlled trial of a reinforcement learning intervention that personalized text messaging content for patients with diabetes (and HbA1c ≥ 7.5%, above most guideline targets), we found that the intervention improved adherence to medication over a 6-month follow-up. The intervention was particularly effective among patients with HbA1c between 7.5 and 9.0%. Adherence changes of this magnitude have been associated with differences in patient outcomes and health care spending 36 , 37 .

Numerous trials have demonstrated that text messages support adherence to medication 1 , 4 , 5 , 6 , 11 , 38 , 39 , 40 . However, the effectiveness of many prior approaches has been limited, in part because they have not personalized the content and presentation of the messages patients receive 38 . To our knowledge, no study has personalized text messages for adherence in real-time on a daily basis through latent measurement of adherence and response, especially using reinforcement learning. Some prior work has personalized text messages for adherence based on simple user characteristics, preferences or self-reported adherence, at pre-specified intervals, or through relatively static “if-then” rules, but has not adapted based on observing what patients respond to 11 , 17 , 18 , 19 . Reinforcement learning has indicated early promise for other health behaviors. For example, a 3-arm trial of 27 patients tested the impact on physical activity of different text messaging approaches for individuals with type 2 diabetes, finding that text messaging using reinforcement learning resulted in significantly more physical activity and lower HbA1c values than non-personalized weekly texting strategies 24 , 27 . Reinforcement learning interventions for titrating anti-epilepsy medications and selecting sepsis protocols have also demonstrated effectiveness 31 , 32 , 41 .

The reinforcement learning intervention appears to have learned from patient observations and changed the messages that it selected over time. This was particularly evident in its approach to message framing. The algorithm initially favored negatively framed messages (e.g., highlighting the negative disease consequences of non-adherence to medication) but over time, there was a noticeable shift such that more patients received either a neutral tone or positively framed message (e.g., highlighting positive consequences of adherence). This change was also seen quantitatively with the increasing proportion of variance in daily adherence explained by the behavioral factors in the text messages (i.e., the adjusted R²). By the end of the trial, the adjusted R² was consistently over 0.40, meaning that much of the difference in adherence on a given day could be explained by the five algorithm factors. Additional features, for example the interaction between positive framing and observed feedback, as well as higher HbA1c and patient activation, carried the greatest weight in the model’s predictions, suggesting that the algorithm incorporated learned information. Together, these findings suggest that the reinforcement learning algorithm not only changed its strategy over time but also improved its performance in predicting what types of messages would improve individuals’ adherence.

The intervention was also particularly effective in patients with HbA1c between 7.5 and 9.0%. This can be explained in two ways: first, patients further from guideline targets may need treatment intensification in addition to better adherence to their existing medications 2 , 3 . Second, a prior trial also suggested that individuals have varying preferences for how to escalate diabetes care at different levels of HbA1c values; those with HbA1c between 7.5 and 9.0% were more interested in adherence support than other interventions 16 . While less pronounced and not statistically significant, patients reporting worse adherence at baseline also tended to respond more to the intervention. This may also have been due to the fact that individuals who report missing multiple doses in the last 30 days most likely have substantial non-adherence 42 and are an ideal target population for an adherence intervention.

Supporting the potential benefits of personalization and for generating future hypotheses, we explored characteristics of patients who responded differently to different message types. The most notable was that women responded better to receiving observed feedback about their medication-taking than men but responded less well to positive framing, social messaging, informational messaging rather than reminders, and messages that were intended to provoke reflection. One explanation could be that some women are already aware of how their own health can benefit loved ones and may prefer more straightforward reminders and feedback about their medication-taking performance 24 , 43 , although future work should explore further within larger sample sizes.

In our exploratory analyses, there were clusters of patients who responded to different types of messages. Specifically, one group responded best to observed feedback, and a second group responded best to social reinforcement and observed feedback, while a third responded equally to all types of messages. We also found higher adherence when negatively-framed messages and messages that contained observed feedback were provided two days in a row, perhaps reflecting the need to reinforce these types of messages, but not others. The fact that the algorithm de-prioritized negative framing on average over time but that it was effective in combination with observed feedback is also worthy of consideration. This could in part be explained by underlying heterogeneity of the patient population in their responsiveness, and in how the information was sequenced, emphasizing the potential impact of personalization, but should be explored further.

Future work could extend these findings in several ways. First, researchers should test the added impact of using reinforcement learning with non-personalized text messages. Second, the impact of a reinforcement learning intervention should be tested on long-term clinical outcomes and in a larger and more diverse sample to confirm some of the exploratory analyses about responsiveness to different behavioral factors. Finally, this work could be applied in other ways, for example to other disease states or related guideline-recommended daily activities such as physical activity or diet.

Several limitations should be acknowledged. First, electronic pill bottles could have influenced adherence, especially during the initial period of observation; however, they have been shown to correlate strongly with actual pill consumption 44 , 45 , and we minimized this observer bias by using pill bottles in both arms. While we powered the study to detect a 10% difference in adherence, the standard deviations were wider than anticipated, likely owing to the small sample size and greater overall heterogeneity in medication-taking than previously observed 46 . The findings may also not generalize to patients with pre-diabetes or gestational diabetes or those without reliable access to a smartphone. The subgroup and responsiveness analyses were also limited by small sample sizes and should be considered exploratory. It is also currently technologically less feasible to passively measure adherence to injectable agents in a scalable manner, and oral diabetes medications are the cornerstone of first and second-line type 2 diabetes treatments. Finally, we also chose not to have a “generic” text messaging arm, in part to test the highest possible efficacy of the intervention, so we cannot assess the incremental benefit of personalization versus generic messaging with this design.

In conclusion, the reinforcement learning intervention led to improvements in adherence to oral diabetes medication and was particularly effective in patients with HbA1c between 7.5 and 9.0%. This trial provides insight into how reinforcement learning could be adapted at scale to improve other self-management interventions and provides promising evidence for how it could be improved and tested in a wider population.

Study design

Trial design details have been previously published 22 . The protocol was designed, written, and executed by the investigators (Fig. 1). Study enrollment began in February 2021 and was completed in July 2021. Follow-up of all patients ended in January 2022; the final study database was available in March 2022.

Study population and randomization

The trial was conducted at Brigham and Women’s Hospital (BWH), a large academic medical center in Massachusetts, USA. Potentially-eligible patients were individuals 18–84 years of age diagnosed with type 2 diabetes and prescribed 1–3 daily oral diabetes medications, with their most recent glycated hemoglobin A1c (HbA1c) level ≥7.5% (i.e., above guideline targets) 47 . These criteria were assessed using BWH electronic health record (EHR) data. To be included, patients also had to have a smartphone with the ability to receive text messages, have a working knowledge of English, not be enrolled in another diabetes trial at BWH, not use a pillbox or switch to using electronic pill bottles for their diabetes medications for the study, and be independently responsible for taking medications. Smartphone connectivity was essential to measure daily adherence, but smartphones have been widely adopted, even among patients from socioeconomically disadvantaged backgrounds 48 , 49 . Patients using insulin or other diabetes injectables in addition to their oral medication were allowed to be included to enhance generalizability.

As previously described 22 , potentially eligible patients with a recent or upcoming diabetes clinic visit were identified from the EHR on a biweekly basis. Once identified, the patients’ endocrinologists were contacted for permission to include their patient(s) in the study. Patients approved for enrollment were sent a letter on their endocrinologist’s behalf inviting them to participate and were then contacted by telephone. Patients who agreed provided their written informed consent captured through REDCap electronic data capture tools 50 , 51 , completed a baseline survey containing measures including demographics, self-reported adherence 42 , health activation 52 , and automaticity 53 of medication-taking, and were mailed a separate Pillsy ® electronic pill bottle for each of their eligible diabetes medications (i.e., each patient received between 1–3 pill bottles). Electronic pill bottles have been widely used in prior research and have shown high concordance with other measurement methods 44 , 54 . The data from the pill bottles were transmitted through the patients’ smartphones via an app that otherwise had no features enabled for the app or pill bottles (i.e., any latent adherence reminders through the pill bottle were turned off). A figure of the infrastructure has been previously published 22 .

After receiving the pill bottles, patients were randomized in a 1:1 ratio to intervention or control using block randomization based on (1) baseline level of self-reported adherence (i.e., ≤1 dose or >1 dose missed in the last 30 days 42 ) and (2) baseline HbA1c of <9.0% or ≥9.0% 2 . Patients were asked to use these devices instead of regular pill bottles or pillboxes for their eligible oral diabetes medications. After randomization, patients were followed for 6 months for outcomes. At the end of follow-up, patients were contacted to complete a follow-up survey and ensure complete synchronization of their pill bottles. Both arms received a $50 gift card for participation.

Intervention

The intervention was a reinforcement learning text messaging program that personalized daily text messages based on the electronic pill bottle data. Messages were selected by the Microsoft Personalizer ® algorithm 22 , 24 , a reinforcement learning system that aimed to achieve the highest possible sum of “rewards” and adapted over time by monitoring the success of each message in nudging patients to adhere to their medications.

The messages were based on behavioral science principles of how content influences patient behavior 55 , 56 , 57 . Based on qualitative interviews 58 , we selected 5 behavioral factors for the messages: (1) framing (classified as neutral, positive [invoking positive outcomes of medication use], or negative [invoking consequences of medication non-use]), (2) observed feedback (“History”, i.e., including the number of days in the prior week the patient was adherent), (3) social reinforcement (“Social”), (4) whether content was reminder or informational (“Content”), and (5) whether the text included a reflective question (“Reflective”) 7 , 8 , 24 , 59 , 60 , 61 . We designed ≥2 text messages for each unique set of factors (i.e., 47 unique sets across 128 text messages); examples of the factors sets contributing to the reinforcement learning model have been published 22 .

Every day, adherence from the prior day was measured by the electronic pill bottles, with values ranging from 0 to 1 based on the fraction of daily doses taken across their diabetes medications, averaging across medications if the patient was taking more than one 22 . These served as the “reward” events used to provide feedback to Microsoft Personalizer ® . The algorithm learned to predict which factors should be included in the message on a given day to maximize the rewards that the algorithm received (i.e., adherence).
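
As a worked example of this reward definition (with hypothetical dose counts, not trial data):

```python
# Hypothetical daily reward: fraction of prescribed doses taken per bottle,
# averaged across a patient's diabetes medications (range 0 to 1).
doses_taken = {"metformin": 2, "glipizide": 1}        # recorded by the pill bottles
doses_prescribed = {"metformin": 2, "glipizide": 2}

per_medication = [doses_taken[m] / doses_prescribed[m] for m in doses_prescribed]
reward = sum(per_medication) / len(per_medication)     # 0.75 for this example
```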

The algorithm used several attributes to predict which factors to select. These included patient baseline characteristics (e.g., age, sex, race/ethnicity, number of medications, concomitant insulin use, self-reported patient activation, education level, employment status, marital status, and therapeutic class), the number of days since each factor had last been sent, and whether the medication had already been taken before the algorithm was run for that day. The algorithm was trained to predict whether or not to include each aspect separately using a “contextual bandit” framework 62 , 63 , 64 . The specific message to be sent was randomly selected from messages matching the required aspects.
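
Microsoft Personalizer's internals are not described here, so the following is only a generic, hypothetical sketch of a per-factor contextual bandit in the spirit of the description above: an epsilon-greedy rule with an independent linear reward model for including or excluding each factor. All names, feature choices, and the update rule are illustrative assumptions, not the trial's implementation.

```python
import numpy as np

class PerFactorBandit:
    """Hypothetical per-factor contextual bandit: for each behavioral factor,
    decide include/exclude given a context vector, then learn from the observed
    adherence reward (illustrative sketch, not Microsoft Personalizer)."""

    def __init__(self, factors, n_features, epsilon=0.1, lr=0.05):
        self.factors = factors
        self.epsilon = epsilon
        self.lr = lr
        # one weight vector per (factor, action) pair; action 0 = exclude, 1 = include
        self.w = {(f, a): np.zeros(n_features) for f in factors for a in (0, 1)}

    def select(self, context):
        choice = {}
        for f in self.factors:
            if np.random.rand() < self.epsilon:
                choice[f] = np.random.randint(2)                   # explore
            else:
                include = self.w[(f, 1)] @ context
                exclude = self.w[(f, 0)] @ context
                choice[f] = int(include > exclude)                 # exploit
        return choice

    def update(self, context, choice, reward):
        # nudge each chosen (factor, action) reward model toward the observed reward
        for f, a in choice.items():
            pred = self.w[(f, a)] @ context
            self.w[(f, a)] += self.lr * (reward - pred) * context

# usage (illustrative): context = patient features + days since each factor was last sent
# bandit = PerFactorBandit(["framing_pos", "history", "social", "content", "reflective"], n_features=12)
# choice = bandit.select(context); ...send a message matching choice...; bandit.update(context, choice, reward)
```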

The text messages were sent on a daily basis to patients using a third-party SMS platform, including an introductory text and simple reminder text to synchronize their pill bottles if they had not been connected for ≥7 days.

Patients in the control arm received the same introductory and simple reminder text to synchronize their pill bottles if they had not been connected for ≥7 days but otherwise received no intervention.

Study outcomes

The trial’s primary outcome was medication adherence assessed in the 6 months after randomization using the average daily adherence for each patient (which already averaged across multiple medications) 22 . While other secondary outcomes were measured, we focus on the primary outcome and related analyses in this manuscript.

Statistical analysis

The overall trial was powered to detect a 10% difference in average adherence over the 6-month follow-up, assuming an SD of 12.5%. We reported key sociodemographic and clinical pre-randomization variables separately for intervention and control using absolute standardized differences (with imbalance defined as a difference >0.1) 65 . Intention-to-treat principles were used for all randomized patients, with a two-sided hypothesis tested at α = 0.05. We used SAS 9.4 (Cary, NC) for analyses.

The process and performance of the reinforcement learning algorithm were descriptively examined. The average proportion of patients who received each behavioral factor was estimated and plotted over time for each individual patient. To explore the extent to which adherence was explained by algorithm predictions each day, the adjusted R² based on the algorithm predictions was estimated for each day over the trial. Specifically, we calculated the proportion of variance in daily adherence that was explained by just the five intervention factors. We also explored the most influential features selected by the model when predicting which messages to send for the entire follow-up period; higher scores indicate more influence on the model.

For the primary outcome, we evaluated the effect of the reinforcement learning intervention on adherence using generalized estimating equations with an identity link function and normally distributed errors. These models were adjusted for the block-randomized design, and given imbalances in some important covariates, also controlled for differences in measured baseline characteristics. Some of these imbalances included: more patients in the intervention arm with no more than a high school education (24.1% vs. 16.1%), fewer intervention patients who were married or partnered (44.8% vs. 54.8%), and more intervention patients taking multiple diabetes medications (31.0% vs. 19.4%). Each of these characteristics has been shown previously to influence adherence 66 , 67 . Exploratory subgroup analyses were performed according to key demographic/clinical subgroups including age, sex, race/ethnicity, marital status, baseline HbA1c, number of years using oral diabetes medications, baseline self-reported adherence, and number of pill bottle medications. There were no missing data for the primary outcome. Several other sensitivity analyses were also conducted, including omitting the first 14 days for observer effects 44 and censoring patients in both arms when the pill bottles were not connected for ≥30 days.
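
For illustration only, a roughly equivalent model can be fit in Python with statsmodels; the dataframe below is synthetic, the column names are assumptions, and the exchangeable working correlation is a choice made here for the sketch (the trial's analyses were performed in SAS 9.4).

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per patient-day (column names are illustrative).
df = pd.DataFrame({
    "adherence":      [1.0, 0.5, 1.0, 0.0, 1.0, 1.0, 0.5, 0.0],
    "arm":            [1, 1, 1, 1, 0, 0, 0, 0],       # 1 = intervention, 0 = control
    "baseline_hba1c": [8.1, 8.1, 9.4, 9.4, 7.9, 7.9, 10.2, 10.2],
    "patient_id":     [1, 1, 2, 2, 3, 3, 4, 4],
})

# GEE with identity link and normally distributed errors, clustered by patient.
model = smf.gee(
    "adherence ~ arm + baseline_hba1c",
    groups="patient_id",
    data=df,
    family=sm.families.Gaussian(),            # identity link by default
    cov_struct=sm.cov_struct.Exchangeable(),  # assumed working correlation
)
print(model.fit().summary())
```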

Additional exploratory and descriptive analyses of adherence in response to the intervention factors were also conducted for future hypothesis generation. First, the associations between key baseline characteristics and optimal adherence (i.e., adherence value = 1 for the subsequent day) by behavioral factor for intervention patients were explored. To do so, we used generalized estimating equations for each behavioral factor (e.g., positive framing) with a log link and binary-distributed errors, with optimal adherence as the outcome, including all patient baseline characteristics but unadjusted for patient-level clustering due to sample size. Then, we described adherence to different behavioral factors based on the sequence of delivered text messages. Finally, patients were clustered by their average response to different text message factors using k-means clustering analysis, using a threshold of ≥25 responses; smaller numbers were replaced by average variable values. Using these clusters, we explored the bivariate association between key baseline patient demographic/clinical characteristics and membership in each group using multinomial logistic regression (Referent: Group 3). Together, these findings may provide a more accurate starting point for future programs.
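
A minimal sketch of this kind of clustering, using a synthetic patients-by-factors matrix of average adherence (illustrative values only, not the trial data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: patients; columns: average adherence following messages containing each
# behavioral factor (synthetic illustrative values, not the trial data).
factors = ["framing_pos", "history", "social", "content", "reflective"]
rng = np.random.default_rng(0)
adherence_by_factor = rng.uniform(0.4, 1.0, size=(29, len(factors)))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(adherence_by_factor)

for k in range(3):
    profile = adherence_by_factor[labels == k].mean(axis=0)
    print(f"cluster {k} (n={np.sum(labels == k)}):",
          dict(zip(factors, profile.round(2))))
```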

Data availability

De-identified data necessary to reproduce results reported here are posted on the Harvard Dataverse, an open access repository for research data, at https://dataverse.harvard.edu/ . Some additional data, specifically dates (for example, dates of medication use), will be available upon reasonable request and execution of appropriate data use agreements, because dates are Protected Health Information under 45 CFR §164.154(b).

Code availability

Code necessary to reproduce results reported here is available in the Harvard Dataverse at https://dataverse.harvard.edu/ .

Lauffenburger, J. C. & Choudhry, N. K. Text messaging and patient engagement in an increasingly mobile world. Circulation 133 , 555–556 (2016).

ElSayed, N. A. et al. Glycemic targets: standards of care in diabetes-2023. Diabetes Care 46 , S97–S110 (2023).

ElSayed, N. A. et al. Pharmacologic approaches to glycemic treatment: standards of care in diabetes-2023. Diabetes Care 46 , S140–S157 (2023).

Hamine, S., Gerth-Guyette, E., Faulx, D., Green, B. B. & Ginsburg, A. S. Impact of mHealth chronic disease management on treatment adherence and patient outcomes: a systematic review. J. Med. Internet Res. 17 , e52 (2015).

Hartz, J., Yingling, L. & Powell-Wiley, T. M. Use of mobile health technology in the prevention and management of diabetes mellitus. Curr. Cardiol. Rep. 18 , 130 (2016).

Dobson, R., Whittaker, R., Pfaeffli Dale, L. & Maddison, R. The effectiveness of text message-based self-management interventions for poorly-controlled diabetes: a systematic review. Digit. Health 3 , 2055207617740315 (2017).

Keller, P. A. Affect, framing, and persuasion. J. Mark. Res. 40 , 54–64 (2003).

Gong, J. et al. The framing effect in medical decision-making: a review of the literature. Psychol. Health Med. 18 , 645–653 (2013).

Yokum, D., Lauffenburger, J. C., Ghazinouri, R. & Choudhry, N. K. Letters designed with behavioural science increase influenza vaccination in Medicare beneficiaries. Nat. Hum. Behav. 2 , 743–749 (2018).

Petty, R. E. & Cacioppo, J. T. The Elaboration Likelihood Model of Persuasion (Springer Series in Social Psychology, 1986).

Thakkar, J. et al. Mobile telephone text messaging for medication adherence in chronic disease: a meta-analysis. JAMA Intern. Med. 176 , 340–349 (2016).

Garofalo, R. et al. A randomized controlled trial of personalized text message reminders to promote medication adherence among HIV-positive adolescents and young adults. AIDS Behav. 20 , 1049–1059 (2016).

Sahin, C., Courtney, K. L., Naylor, P. J. & Rhodes, R. E. Tailored mobile text messaging interventions targeting type 2 diabetes self-management: a systematic review and a meta-analysis. Digit. Health 5 , 2055207619845279 (2019).

Choudhry, N. K. et al. Effect of a remotely delivered tailored multicomponent approach to enhance medication taking for patients with hyperlipidemia, hypertension, and diabetes: the STIC2IT cluster randomized clinical trial. JAMA Intern. Med. 178 , 1182–1189 (2018).

Choudhry, N. K. et al. Rationale and design of the Study of a Tele-pharmacy Intervention for Chronic diseases to Improve Treatment adherence (STIC2IT): A cluster-randomized pragmatic trial. Am. heart J. 180 , 90–97 (2016).

Lauffenburger, J. C. et al. Impact of a novel pharmacist-delivered behavioral intervention for patients with poorly-controlled diabetes: the ENhancing outcomes through Goal Assessment and Generating Engagement in Diabetes Mellitus (ENGAGE-DM) pragmatic randomized trial. PloS one 14 , e0214754 (2019).

Kassavou, A. et al. A highly tailored text and voice messaging intervention to improve medication adherence in patients with either or both hypertension and Type 2 diabetes in a UK primary care setting: feasibility randomized controlled trial of clinical effectiveness. J. Med. Internet Res. 22 , e16629 (2020).

Nelson, L. A. et al. Effects of a tailored text messaging intervention among diverse adults with Type 2 diabetes: evidence from the 15-Month REACH randomized controlled trial. Diabetes Care 44 , 26–34 (2021).

Hornstein, S., Zantvoort, K., Lueken, U., Funk, B. & Hilbert, K. Personalization strategies in digital mental health interventions: a systematic review and conceptual framework for depressive symptoms. Front. Digit. Health 5 , 1170002 (2023).

Tong, H. L. et al. Personalized mobile technologies for lifestyle behavior change: a systematic review, meta-analysis, and meta-regression. Prev. Med. 148 , 106532 (2021).

Trella, A. L. et al. Designing reinforcement learning algorithms for digital interventions: pre-implementation guidelines. Algorithms 15 (2022). https://doi.org/10.3390/a15080255 .

Lauffenburger, J. C. et al. REinforcement learning to improve non-adherence for diabetes treatments by Optimising Response and Customising Engagement (REINFORCE): study protocol of a pragmatic randomised trial. BMJ open 11 , e052091 (2021).

Jordan, S. M., Chandak, Y., Cohen, D., Zhang, M. & Thomas, P. S. Evaluating the performance of reinforcement learning algorithms. In Proc. Thirty-Seventh International Conference on Machine Learning (2020). https://proceedings.mlr.press/v119/jordan20a/jordan20a.pdf .

Yom-Tov, E. et al. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. J. Med. Internet Res. 19 , e338 (2017).

Piette, J. D. et al. The potential impact of intelligent systems for mobile health self-management support: Monte Carlo simulations of text message support for medication adherence. Ann. Behav. Med. 49 , 84–94 (2015).

Liu, D., Yang, X., Wang, D. & Wei, Q. Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints. IEEE Trans. Cyber. 45 , 1372–1385 (2015).

Hochberg, I. et al. Encouraging physical activity in patients with diabetes through automatic personalized feedback via reinforcement learning improves glycemic control. Diabetes Care 39 , e59–e60 (2016).

Liao, P., Greenewald, K., Klasnja, P. & Murphy, S. Personalized HeartSteps: a reinforcement learning algorithm for optimizing physical activity. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4 (2020). https://doi.org/10.1145/3381007 .

Liu, X., Deliu, N. & Chakraborty, B. Microrandomized trials: developing just-in-time adaptive interventions for better public health. Am. J. Public Health 113 , 60–69 (2023).

Guez, A., Vincent, R. D., Avoli, M. & Pineau, J. Treatment of epilepsy via batch-mode reinforcement learning. In Proc. Twenty-Third AAAI Conference on Artificial Intelligence . 2008:1671–1678 https://cdn.aaai.org/IAAI/2008/IAAI08-008.pdf .

Klasnja, P. et al. Micro-randomized trials: an experimental design for developing just-in-time adaptive interventions. Health Psychol. 34 , 1220–1228 (2015).

Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C. & Faisal, A. A. The artificial intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24 , 1716–1720 (2018).

GBD 2021 Diabetes Collaborators. Global, regional, and national burden of diabetes from 1990 to 2021, with projections of prevalence to 2050: a systematic analysis for the Global Burden of Disease Study 2021. Lancet (2023). https://doi.org/10.1016/S0140-6736(23)01301-6 .

Kanyongo, W. & Ezugwu, A. E. Feature selection and importance of predictors of non-communicable diseases medication adherence from machine learning research perspectives. Inform. Med. Unlocked. 38 , 101132 (2023).

Kanyongo, W. & Ezugwu, A. E. Machine learning approaches to medication adherence amongst NCD patients: A systematic literature review. Inform. Med. Unlocked. 38 , 101210 (2023).

Cutler, R. L., Fernandez-Llimos, F., Frommer, M., Benrimoj, C. & Garcia-Cardenas, V. Economic impact of medication non-adherence by disease groups: a systematic review. BMJ open 8 , e016982 (2018).

Bitton, A., Choudhry, N. K., Matlin, O. S., Swanton, K. & Shrank, W. H. The impact of medication adherence on coronary artery disease costs and outcomes: a systematic review. Am. J. Med. 126 , 357 e7–357.e27 (2013).

Arambepola, C. et al. The impact of automated brief messages promoting lifestyle changes delivered via mobile devices to people with Type 2 diabetes: a systematic literature review and meta-analysis of controlled trials. J. Med. Internet Res. 18 , e86 (2016).

Bobrow, K. et al. Mobile phone text messages to support treatment adherence in adults with high blood pressure (SMS-Text Adherence Support [StAR]): a single-blind, randomized trial. Circulation 133 , 592–600 (2016).

Pandey, A., Krumme, A., Patel, T. & Choudhry, N. The impact of text messaging on medication adherence and exercise among postmyocardial infarction patients: randomized controlled pilot trial. JMIR Mhealth Uhealth 5 , e110 (2017).

Paredes, P., Gilad-Bachrach, R., Czerwinski, M., Roseway, A., Rowan, K. & Hernandez, J. PopTherapy: coping with stress through pop-culture. In Proc. PervasiveHealth, 109–117 (2014). https://dl.acm.org/doi/10.4108/icst.pervasivehealth.2014.255070 .

Lauffenburger, J. C. et al. Comparison of a new 3-item self-reported measure of adherence to medication with pharmacy claims data in patients with cardiometabolic disease. Am. heart J. 228 , 36–43 (2020).

Shrank, W. H. et al. Are caregivers adherent to their own medications? J. Am. Pharmacists Assoc 51 , 492–498 (2011).

Mehta S. J. et al. Comparison of pharmacy claims and electronic pill bottles for measurement of medication adherence among myocardial infarction patients. Med. care . https://doi.org/10.1097/MLR.0000000000000950 .

Arnsten, J. H. et al. Antiretroviral therapy adherence and viral suppression in HIV-infected drug users: comparison of self-report and electronic monitoring. Clin. Infect. Dis. 33 , 1417–1423 (2001).

Franklin, J. M. et al. Group-based trajectory models: a new approach to classifying and predicting long-term medication adherence. Med. Care 51 , 789–796 (2013).

Garber, A. J. et al. Consensus statement by the american association of clinical endocrinologists and american college of endocrinology on the comprehensive TYPE 2 diabetes management algorithm - 2018 executive summary. Endocr. Pr. 24 , 91–120 (2018).

Baptista, S. et al. User experiences with a Type 2 diabetes coaching app: qualitative study. JMIR Diabetes 5 , e16692 (2020).

Aguilera, A. et al. mHealth app using machine learning to increase physical activity in diabetes and depression: clinical trial protocol for the DIAMANTE Study. BMJ Open 10 , e034723 (2020).

Harris, P. A. et al. The REDCap consortium: building an international community of software platform partners. J. Biomed. Inf. 95 , 103208 (2019).

Harris, P. A. et al. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inf. 42 , 377–381 (2009).

Wolf, M. S. et al. Development and validation of the consumer health activation index. Med. Decis. Mak. 38 , 334–343 (2018).

Gardner, B., Abraham, C., Lally, P. & de Bruijn, G. J. Towards parsimony in habit measurement: testing the convergent and predictive validity of an automaticity subscale of the Self-Report Habit Index. Int. J. Behav. Nutr. Phys. Act. 9 , 102 (2012).

Volpp, K. G. et al. Effect of electronic reminders, financial incentives, and social support on outcomes after myocardial infarction: the heartstrong randomized clinical trial. JAMA Intern. Med. 177 , 1093–1101 (2017).

Bandura, A. Self-efficacy: toward a unifying theory of behavioral change. Psychol. Rev. 84 , 191–215 (1977).

Gintis, H. A framework for the unification of the behavioral sciences. Behav. Brain Sci. 30 , 1–16 (2007).

Tzeng, O. C. & Jackson, J. W. Common methodological framework for theory construction and evaluation in the social and behavioral sciences. Genet. Soc. Gen. Psychol. Monogr. 117 , 49–76 (1991).

Lauffenburger, J. C. et al. Preferences for mHealth technology and text messaging communication in patients with Type 2 diabetes: qualitative interview study. J. Med. Internet Res. 23 , e25958 (2021).

Baron, R. M. Social reinforcement effects as a function of social reinforcement history. Psychol. Rev. 73 , 527–539 (1966).

Lauffenburger, J. C., Khan, N. F., Brill, G. & Choudhry, N. K. Quantifying social reinforcement among family members on adherence to medications for chronic conditions: a US-based retrospective cohort study. J. Gen. Intern. Med. https://doi.org/10.1007/s11606-018-4654-9 .

Viswanathan, M. et al. Interventions to improve adherence to self-administered medications for chronic diseases in the United States: a systematic review. Ann. Intern. Med. 157 , 785–795 (2012).

Krakow, E. F. et al. Tools for the precision medicine era: how to develop highly personalized treatment recommendations from cohort and registry data using Q-learning. Am. J. Epidemiol. 186 , 160–172 (2017).

Laber, E. B., Linn, K. A. & Stefanski, L. A. Interactive model building for Q-learning. Biometrika 101 , 831–847 (2014).

Nahum-Shani, I. et al. Q-learning: a data analysis method for constructing adaptive interventions. Psychol. Methods 17 , 478–494 (2012).

Austin, P. C. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat. Med. 28 , 3083–3107 (2009).

Lauffenburger, J. C. et al. Prevalence and impact of having multiple barriers to medication adherence in nonadherent patients with poorly controlled cardiometabolic disease. Am. J. Cardiol. 125 , 376–382 (2020).

Easthall, C., Taylor, N. & Bhattacharya, D. Barriers to medication adherence in patients prescribed medicines for the prevention of cardiovascular disease: a conceptual framework. Int. J. Pharm. Pract. 27 , 223–231 (2019).

Acknowledgements

Research reported in this publication was supported by the National Institute on Aging of the National Institutes of Health under Award Number P30AG064199 to BWH (N.K.C. PI). J.C.L. was supported by a career development grant (K01HL141538) from the NIH. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors wish to thank the Digital Care Transformation team at BWH, the team responsible for managing Microsoft Dynamics 365 SMS Texting.

Author information

Authors and affiliations.

Center for Healthcare Delivery Sciences, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA

Julie C. Lauffenburger, Katherine L. Crum, Gauri Bhatkhande, Ellen S. Sears, Kaitlin Hanken, Lily G. Bessette, Constance P. Fontanet, Nancy Haff, Seanna Vine & Niteesh K. Choudhry

Microsoft Research, Herzliya, Israel

Elad Yom-Tov

Tuck School of Business, Dartmouth College, Hanover, NH, USA

Punam A. Keller

Division of Endocrinology, Diabetes and Hypertension, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA

Marie E. McDonnell

Contributions

All authors meet International Committee of Medical Journal Editors (ICMJE) criteria. JCL had overall responsibility for the trial design and drafted the trial protocol and manuscript. NKC is the co-principal investigator, had overall responsibility for the trial design and trial protocol, and helped draft the trial protocol and manuscript. EYT, PAK, MEM, LGB, CPF, ESS, KC, GB, KH, and NH contributed meaningfully to trial or intervention design and implementation as well as the manuscript. All authors contributed to the refinement of the study protocol and approved the final manuscript.

Corresponding author

Correspondence to Julie C. Lauffenburger .

Ethics declarations

Competing interests.

At the time this study was conducted, E.Y.-T. was an employee of Microsoft. N.K.C. serves as a consultant to Veracity Healthcare Analytics and holds equity in RxAnte and DecipherHealth; unrelated to the current work, N.K.C. has also received unrestricted grant funding payable to Brigham and Women’s Hospital from Humana. N.H. has received personal fees from Cerebral unrelated to the current work. The remainder of the authors report no conflicts of interest.

Ethical approval

The trial was approved by the institutional review board (IRB) of Mass General Brigham and registered with clinicaltrials.gov (NCT04473326). The authors were responsible for performing study analyses, writing the manuscript, substantive edits, and submitting final contents for publication. Patients were not blinded due to the nature of the interventions. No data monitoring committee was deemed necessary by the IRB.

About this article

Cite this article.

Lauffenburger, J.C., Yom-Tov, E., Keller, P.A. et al. The impact of using reinforcement learning to personalize communication on medication adherence: findings from the REINFORCE trial. npj Digit. Med. 7 , 39 (2024). https://doi.org/10.1038/s41746-024-01028-5

Received: 01 August 2023

Accepted: 05 February 2024

Published: 19 February 2024

DOI: https://doi.org/10.1038/s41746-024-01028-5


Essays etc. on AI, academia, and all the world and his wife; by Thilo Stadelmann.

Lecture notes on Reinforcement Learning

I recently took David Silver’s online class on reinforcement learning (syllabus & slides and video lectures) to get a more solid understanding of his work at DeepMind on AlphaZero (paper and a more explanatory blog post) etc. I enjoyed it as a very accessible yet practical introduction to RL. Here are the notes I took during the class.

Slide numbers refer to the downloadable slides (which might differ slightly from those in the videos), and therein to the page number in parentheses (e.g., “15” for “[15 of 47]”).

Lecture 1: Introduction to Reinforcement Learning

  • RL Hypothesis: all goal achievement can be cast as maximizing cumulative future reward (“all goals can be described by maximizing expected cumulative reward”)
  • History (time series of all actions/rewards/observations until now) is impractical for decision making (if the agent has a long life) -> use state instead as a condensed form of all that matters: s_t = f(H_t)
  • When we talk state, we mean the agent’s state (where we can control the function f), not the environment state (invisible to us, may contain other agents)
  • Our state (information state) has to have the Markov property: the future is independent of the past given the present (the environment state is Markov, and so is the complete history)
  • Great example for how our decision depends on the choice of state: rat & cheese/electricity (order or count as state representation) -> see slide 25
  • If the environment is fully observable (observation = environment state = agent state), the decision process is an MDP
  • For partially observable environments: POMDP -> have to build our own state representation (e.g., Bayesian: have a vector of beliefs/probabilities of what state the environment is in; ML: have a RNN combine the last state and latest observation into a new state)
  • An agent may include one or more of these: policy (behaviour function mapping states to actions), value function (predicted [discounted] future reward, depending on some behaviour), model (predict what the environment will do next) -> see the agent-loop sketch after this list
  • V: State-value function; Q: Action-value function
  • Transition model: predicts environment dynamics; Reward model: predicts next immediate reward
  • Good example for policy/value function/model: simple maze with numbers/arrows -> see slides 35-37
  • Taxonomy: value-based (policy implicit); policy-based (value function implicit); actor-critic (stores both policy and value function)
  • Taxonomy contd.: model-free (policy and/or value function); model-based: has a model and policy/value function
  • RL: env. initially unknown, agent has to interact to find good policy
  • Planning: model of env. is known; agent performs computations with model to look ahead and choose policy (e.g., tree-search)
  • Both are intimately linked, e.g. learn how env. works first (i.e., build model) and then do planning
  • Exploration / exploitation: e.g. go to your favourite / a new restaurant
  • Prediction / control: predict future given policy (find value function) vs. optimise future (find policy) -> usually we need to solve prediction to optimally control
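
To make the vocabulary above concrete, here is a rough, self-invented sketch of an agent-environment loop in Python (not from the lecture): the corridor environment, the Agent class and the simple policy are all assumptions for illustration; the agent keeps its state as a (trivial) function of the observation history, s_t = f(H_t).

```python
import random

class CorridorEnv:
    """Toy environment (assumed for illustration): 5 cells, reward +1 at the right end."""
    def __init__(self, n=5):
        self.n = n
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # observation = true position (fully observable -> MDP)

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.pos = max(0, min(self.n - 1, self.pos + action))
        done = self.pos == self.n - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

class Agent:
    """Agent state is a (trivial) function of the history: s_t = f(H_t) = last observation."""
    def __init__(self):
        self.state = None

    def update_state(self, observation):
        self.state = observation  # condensed form of all that matters

    def policy(self, state):
        # simple behaviour function: mostly go right, sometimes explore
        return random.choice([-1, 1]) if random.random() < 0.1 else 1

env, agent = CorridorEnv(), Agent()
obs, done, ret = env.reset(), False, 0.0
while not done:
    agent.update_state(obs)
    action = agent.policy(agent.state)
    obs, reward, done = env.step(action)
    ret += reward
print("return G =", ret)
```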

Lecture 2: Markov Decision Process

  • MDPs formally describe an environment for RL
  • Almost all RL problems can be formalised as MDPs
  • Def. Markov process: a random sequence of states with the Markov property, defined by the tuple [S, P]: state space S and state-transition probability matrix P
  • Good example: student markov chain (making it through a day at university) -> slide 8
  • A Markov reward process (MRP) is an MP with value judgments (how good it is to be in a state): [S, P, R, gamma] with reward function R (immediate reward for being in a state) and discount factor gamma
  • In RL, we actually care about the total (cumulated, discounted) reward, called the return (or goal) G
  • gamma quantifies the present value of future rewards (because of uncertainty they are not yet fully certain, because our model is not perfect, and because it is mathematically convenient)
  • Value function: the long-term value of being in a state (the thing we care about in RL) V(s)=E[G_t|S_t=s] (expectation because we are not talking here about one concrete sample from the MRP, but about the stochastic process as a whole, i.e. the average over all possible episodes from s to the end)
  • Great example -> slide 17
  • Bellman equation for MRPs: to break up the value function into two parts: immediate reward R_{t+1} and discounted future reward gamma*v(S_{t+1})
  • The Bellman equation is not just for estimating the value function; it is an identity: every proper value function has to obey this decomposition into immediate reward and discounted averaged one-step look-ahead
  • An MDP is an MRP with decisions, i.e. we do not just want to evaluate the return but maximize it: [S, A, P, R, gamma] with action space A
  • What does it mean to make decisions? => policy (distribution over actions given states) completely defines an agent’s behaviour
  • (Given an MDP and a fixed policy, the resulting sequence of states is a Markov process, and the state and reward sequence is an MRP)
  • state-value function v_{pi}(s): how good is it to be in s if I am following pi
  • action-value function q_{pi}(s,a): how good is it to take action a in state s if following pi afterwards
  • Bellman equations can be constructed exactly the same way as above for v_{pi} and q_{pi}: immediate reward plus particular value function of where you end up
  • Bellman equations for both need a 2-step lookahead: over the (stochastic) policy, and over the (stochastic) dynamics of the environment
  • The optimal value function is the maximum v/q over all pi
  • When you know q*, you are done: you have everything to behave optimally within your MDP -> the optimal policy follows directly from it
  • There is always at least one deterministic optimal policy (greater or equal value v(s) for each s, compared to all other policies) -> we don’t need combinations of policies for doing well on different parts of the MDP
  • How to arrive at q*? Take the Bellman equation for q and “work backwards” from the terminal state
  • (before we looked at Bellman expectation equations; what now follows are the Bellman optimality equations, or just “Bellman equations” in the literature)
  • Here, we maximize over the actions we can choose, and average over where the process dynamics send us to (2-step lookahead)
  • Bellman optimality equations (in contrast to the linear version for MRPs) are non-linear (because of the max) -> no direct solving through matrix inversion (the MRP case can be solved directly; see the sketch after this list)
  • => need to solve iteratively (e.g. by dynamic programming: value or policy iteration)
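
As a small illustration of the linear case mentioned above, the following sketch solves the Bellman equation of a made-up 3-state MRP in closed form, v = (I - gamma*P)^{-1} R; the transition matrix and rewards are invented numbers, not the student MRP from the slides.

```python
import numpy as np

# Hypothetical 3-state MRP (states: "class", "pub", "sleep"); all numbers are invented.
P = np.array([[0.5, 0.3, 0.2],   # state-transition probability matrix
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])  # "sleep" is absorbing
R = np.array([-2.0, 1.0, 0.0])   # immediate reward for being in each state
gamma = 0.9

# Bellman equation for an MRP is linear: v = R + gamma * P v  =>  v = (I - gamma P)^{-1} R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(dict(zip(["class", "pub", "sleep"], v.round(2))))
```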

Lecture 3: Planning by Dynamic Programming

  • Dynamic (it is about a sequence/temporal problem), programming (about optimizing a program/policy)
  • Method: solve complex problems by divide&conquer
  • Works if subproblems come up again and again, and their solution tells us something about the optimal overall solution (MDPs satisfy both properties, see Bellman equation [decomposition] and value function [cache for recurring solutions])
  • Prediction: not the full RL problem, but when we are given the full reward function + dynamics of system + policy -> output is the corresponding value function
  • Control: no policy given -> output is optimal value function
  • We care about control, so we use prediction as an inner loop to solve control
  • Each iteration (synchronous update): update every state (we know the dynamics, it is planning!) in the value function using the Bellman expectation equation and the lookahead (just one step, not recursively!)
  • Good example on slides 9/10: the value function helps us find better policies (e.g., greedy according to the value function), even if it was created using a different policy (e.g., random)
  • Evaluate the policy (i.e., compute its value function)
  • Act greedily w.r.t. the computed value function
  • => will always converge to optimal policy (after usually many iterations)
  • This works, because acting greedily for one step using the current q is at least as good (or better) than just following the current policy immediately -> see slide 19
  • If it is only equally good, the current policy is already optimal
  • Acting greedily doesn’t mean to greedily look for instantaneous rewards: we only (greedily) take the best current action and then look at the value function, which sums up all expected future rewards
  • Policy evaluation does not have to be run until convergence -> a few steps suffice to arrive at an estimate that will improve the policy in the next policy-improvement step (stopping after k steps is “modified policy iteration”; with k=1 it is equivalent to value iteration)
  • Value iteration uses the Bellman optimality equation -> see the sketch after this list
  • Intuition: think you have been told the optimal value of the states next to the goal state, and you are figuring out the other states’ values from there on backwards
  • No explicit policy (intermediate value functions might not be achievable by any real policy, only in the end the policy will be optimal)
  • Summary so far: -> slide 30 (using v instead of q so far is less complex, but only possible because we know the dynamics [it is still planning]; and doing value iteration is a simplification of policy iteration)
  • Asynchronous backup: in each iteration, update just one state (saves computation and works as long as all states are still selected for update [in any order])
  • Prioritised sweeping: in which order to update states? those first that change their value the most (as it has largest influence on result)
  • Real-time DP: update only those states that a real agent using the current policy visits
  • Biggest problem with DP are the full-width backups (consider all possible next actions and states) -> use sampling instead
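
The value-iteration loop referenced above fits in a few lines; the 4x4 gridworld below (terminal corner, reward -1 per step) is an assumed toy problem of my own, not an example from the slides.

```python
import numpy as np

# Assumed toy problem: 4x4 gridworld, terminal state in the top-left corner,
# reward -1 per step, deterministic moves that stay in place at the walls.
N, gamma, actions = 4, 1.0, [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = divmod(s, N)
    r2, c2 = min(max(r + a[0], 0), N - 1), min(max(c + a[1], 0), N - 1)
    return r2 * N + c2

V = np.zeros(N * N)
while True:
    V_new = V.copy()
    for s in range(1, N * N):  # state 0 is terminal, its value stays 0
        # Bellman optimality backup: max over actions of reward + discounted successor value
        V_new[s] = max(-1.0 + gamma * V[step(s, a)] for a in actions)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V.reshape(N, N))  # optimal value = minus the number of steps to the terminal corner
```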

Lecture 4: Model-Free Prediction

  • Last lecture was estimating/optimizing the value function of a known MDP; now we estimate for an unknown MDP (no dynamics / reward function given) -> from interaction (with environment) to value function
  • Planning is model-based (dynamics given), RL is model-free (no one tells); prediction is evaluating a known policy, control is finding new policy
  • Monte-Carlo (MC) learning: learn directly from complete episodes (i.e., update every state after the end of an episode)
  • Basic idea: replace the expectation in v_{pi}(s)=E_{pi}[G_t|S_t=s] with the empirical mean
  • Problem: how to deal with getting into a state we already have been in, again (to create several values to average over), and how to visit all states just from trajectories -> by following policy pi
  • Blackjack example: only consider states with an interesting decision to make (i.e., do not learn actions for the sum of cards below 12, as you would always twist then as no risk is attached to it)
  • Slide 11: axes of value function diagrams are two of the three values in the state; the third (usable ace) is displayed by the 2 rows of figures
  • TD learns from incomplete episodes (i.e., online, by “bootstrapping”), replacing the return (used in the MC approach above after the episode has run to the end) with the TD target (immediate reward plus discounted current estimate of v(S_{t+1})) -> see the sketch after this list
  • TD is superior to MC in several respects (e.g., more efficient, it has less variance but is biased); but TD does not always converge to v_{pi} when using function approximation
  • MC converges to minimum MSE between estimated v and return; TD(0) converges to solution of maximum likelihood MDP that best fits the observed episodes (implicitly)
  • TD(0) exploits the Markov property, thus it is more efficient in Markov environments (otherwise MC is more efficient)
  • We can map all of RL on two axes: whether the algorithm does full backups vs. samples (i.e. averages over all possible actions/successor states [e.g., dynamic programming, exhaustive search]), or just uses samples (e.g., TD(0), MC), and whether backups are shallow (i.e., 1-step lookahead [e.g., TD(0)]) or deep (full trajectories [e.g., MC]) -> see Fig. 3 in survey paper by Arulkumaran et al., 2017
  • lambda enables us to target the continuum on the “shallow/deep backups” axis
  • The optimal lookahead depends on the problem, which is dissatisfactory; thus, the lambda-return averages all n-step returns, weighted by look-ahead (more look-ahead, less weight) -> slide 39
  • TD(lambda) comes at the same computational cost as TD(0), thanks to the (memoryless) geometric weighting
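
Here is a rough sketch of both prediction methods on an assumed 5-state random walk (my own toy example, not from the slides): MC updates each visited state towards the full return, TD(0) updates towards the bootstrapped TD target after every step; step size and episode count are arbitrary choices.

```python
import random

# Assumed 5-state random walk: states 0..4, start at 2, terminate left (reward 0) or right (reward 1).
def episode():
    s, traj = 2, []
    while True:
        s2 = s + random.choice([-1, 1])
        r = 1.0 if s2 == 4 else 0.0
        traj.append((s, r))          # reward received when leaving s
        if s2 in (0, 4):
            return traj
        s = s2

alpha, gamma = 0.1, 1.0
V_mc, V_td = [0.0] * 5, [0.0] * 5

for _ in range(5000):
    traj = episode()
    # Monte-Carlo: update every visited state towards the full return G_t
    G = 0.0
    for s, r in reversed(traj):
        G = r + gamma * G
        V_mc[s] += alpha * (G - V_mc[s])
    # TD(0): update towards the TD target r + gamma * V(s') after every step
    for (s, r), nxt in zip(traj, traj[1:] + [None]):
        v_next = V_td[nxt[0]] if nxt else 0.0  # terminal value is 0
        V_td[s] += alpha * (r + gamma * v_next - V_td[s])

print("MC :", [round(v, 2) for v in V_mc[1:4]])  # true values are roughly [0.25, 0.5, 0.75]
print("TD :", [round(v, 2) for v in V_td[1:4]])
```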

Lecture 5: Model-Free Control

  • On-policy (learning on the job) vs. off-policy (learning while following someone else’s policy; looking over someone’s shoulder)
  • Last lecture: evaluate given policy in realistic setting; now: optimize it (find v*)
  • General framework: generalised policy iteration -> slide 6
  • 2 problems with just plugging Monte-Carlo evaluation into this general framework: (1) it is not model-free (we need a model of the environment if we only have V, not Q); (2) we don’t explore if we always greedily follow the policy => so it works with Q instead of V and acting epsilon-greedily instead of just greedily
  • epsilon-greedy is guaranteed to improve (proven)
  • typical RL behaviour (here: with SARSA): it is slow in the beginning, but as soon as it has learned something, it gets better faster and faster
  • MC learning off policy doesn’t work -> have to use TD learning
  • What works best off-policy (gets rid of importance sampling): Q-learning (as it is usually referred to) -> slide 36 and the sketch after this list
  • Summary so far: TD methods are samples of the full updates done by DP methods -> slide 41
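
A hedged sketch of tabular Q-learning with an epsilon-greedy behaviour policy, again on the assumed 4x4 gridworld toy problem (not an example from the slides); note the max over next actions in the target, which is what makes it off-policy.

```python
import random

# Assumed toy problem again: 4x4 gridworld, terminal state 0, reward -1 per step.
N, gamma, alpha, eps = 4, 1.0, 0.5, 0.1
A = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = divmod(s, N)
    return min(max(r + a[0], 0), N - 1) * N + min(max(c + a[1], 0), N - 1)

Q = [[0.0] * len(A) for _ in range(N * N)]

for _ in range(2000):
    s = N * N - 1                      # start in the opposite corner
    while s != 0:
        # epsilon-greedy behaviour policy
        a = random.randrange(len(A)) if random.random() < eps else max(range(len(A)), key=lambda i: Q[s][i])
        s2, r = step(s, A[a]), -1.0
        # Q-learning target: r + gamma * max_a' Q(s', a')  (max, not the action actually taken -> off-policy)
        target = r + gamma * (0.0 if s2 == 0 else max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

print(round(max(Q[N * N - 1]), 1))     # roughly -6.0: six steps from corner to corner
```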

Lecture 6: Value Function Approximation

  • It is not supervised learning: iid training methods usually don’t work well because of the correlation in the samples of the same trajectory
  • How “close” to optimum TD(0) with linear value function approximation converges depends on things like the discount factor -> slide 18
  • In TD we are always pushing things to “later” because we trust in our estimate of later return
  • In continuous control, you often don’t need to account for the differences between (say) maximum and minimum acceleration -> so it becomes discrete again
  • Bootstrapping (using lambda>0 in TD(lambda)) usually helps, need to find a sweet spot (lambda=1 usually is very bad)
  • TD is not stable per se (it isn’t guaranteed to converge) -> slide 30 shows when it is safe to use (for prediction), even though in practice it often works well
  • For control, we basically have no guarantee that we will make progress (in the best case it oscillates around the true q*)
  • Experience replay is an easy way to converge to the least squares solution over the complete data set of experience (that we didn’t have in the online case considered above)
  • DQN is off-policy TD learning with non-linear function approximation; it is nevertheless stable because of experience replay and fixed Q-targets (bootstrapping towards a frozen, saved copy of the Q network that is only refreshed every few thousand steps [a hyperparameter], instead of the constantly changing online network); together these keep the updates from diverging (“spiralling out of control”) -> see the sketch after this list
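
The two stabilising ingredients just mentioned can be sketched without a deep-learning library (this is my own toy illustration, not DeepMind’s implementation): transitions go into a replay buffer, mini-batches are sampled from it, and the TD target is computed with a frozen copy of the weights that is only refreshed occasionally; a linear Q-function on made-up random features stands in for the Q network.

```python
import random
import numpy as np

n_features, n_actions, gamma = 8, 2, 0.99
w = np.zeros((n_actions, n_features))   # online Q weights: Q(s, a) = w[a] @ phi(s)
w_target = w.copy()                     # frozen copy used for the bootstrap target
replay = []                             # replay buffer of (phi_s, a, r, phi_s2, done)

def remember(transition, capacity=10_000):
    replay.append(transition)
    if len(replay) > capacity:
        replay.pop(0)

def train_step(batch_size=32, alpha=0.01):
    # sampling (roughly) i.i.d. mini-batches from the buffer breaks the trajectory correlation
    for phi_s, a, r, phi_s2, done in random.sample(replay, min(batch_size, len(replay))):
        # TD target computed with the *frozen* weights, not the constantly moving online ones
        target = r if done else r + gamma * np.max(w_target @ phi_s2)
        td_error = target - w[a] @ phi_s
        w[a] += alpha * td_error * phi_s        # semi-gradient update of the online weights

# toy usage with made-up random transitions, just to show the mechanics
for t in range(2000):
    phi_s, phi_s2 = np.random.randn(n_features), np.random.randn(n_features)
    remember((phi_s, random.randrange(n_actions), random.random(), phi_s2, random.random() < 0.05))
    train_step()
    if t % 500 == 0:
        w_target = w.copy()             # refresh the frozen copy (the interval is a hyperparameter)
```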

Lecture 7: Policy Gradient Methods

  • Simplest method: policy gradient methods change the policy in the direction that makes it better
  • Policy-based methods tend to be more stable (better convergence properties) and are especially good in continuous or high-dimensional action spaces (because of the max over actions in value-based methods like Q-learning or SARSA)
  • Policy-based methods can learn a stochastic policy, which can find the goal much quicker if there is doubt (aliasing) about the state of the world (i.e., partial observability) -> slide 9
  • The score function is a very familiar term from ML (maximum likelihood) and tells you in which direction to go to get “more” of something -> slide 16
  • The whole point of the likelihood ratio trick is to get an expectation again for the gradient -> slide 19
  • REINFORCE is the most straightforward approach to policy gradient -> see the sketch after this list
  • MC policy gradient methods have nice learning curves but are very slow (very high variance because we plug in samples of the return [that vary a lot]) -> slide 22
  • Actor Critic methods: bring in a critic (estimate of value function) again to retain nice stability properties of policy gradient methods while reducing variance
  • Critic is built using methods from previous lectures for policy evaluation; then, the estimated Q is plugged into the gradient-of-objective-function equation
  • Q-AC is just an instance of generalised policy iteration, just with the gradient step instead of the epsilon-greedy improvement
  • Summary -> slide 41
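
A minimal REINFORCE sketch under assumptions of my own (a two-armed Bernoulli bandit with a softmax policy over two preferences): the update theta += alpha * G * grad log pi(a) plugs in the sampled return G directly, which is exactly where the high variance comes from.

```python
import random
import math

# Assumed problem: two-armed bandit, arm 1 pays 1.0 with prob 0.8, arm 0 with prob 0.2.
def pull(arm):
    return 1.0 if random.random() < (0.8 if arm == 1 else 0.2) else 0.0

theta = [0.0, 0.0]           # action preferences; the policy is softmax(theta)
alpha = 0.05

def policy():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(5000):
    pi = policy()
    a = 0 if random.random() < pi[0] else 1
    G = pull(a)                                   # one-step episode: return = immediate reward
    # score function for softmax: d/d theta_k log pi(a) = 1{k=a} - pi(k)
    for k in range(2):
        theta[k] += alpha * G * ((1.0 if k == a else 0.0) - pi[k])

print([round(p, 2) for p in policy()])            # should strongly prefer arm 1
```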

Lecture 8: Integrating Learning and Planning

  • A model (in RL) is the agent’s understanding of the environment (1. state transitions; 2. how reward is given); that’s why building a model first is a 3rd way (besides value- and policy-based methods) to train an agent
  • Advantage: a model can be trained efficiently by supervised learning (helps in environments with complicated policies [sharp, tactically decisive decisions like in chess, where one move can decide winning or losing], like games that need lookahead) -> it is a more compact/useful representation of the environment
  • Sample-based planning: most simple yet powerful approach, uses learnt model only to sample simulated experience from it
  • It helps because it breaks the curse of dimensionality (or rather the branching factor of successive events): we sacrifice the detailed probabilities given by the learnt model and thus focus on the more likely stuff -> slide 18
  • Slide 19: reasoning for our approach taken in the “Complexity 4.0” project & chapter (use a simulation model to learn an ML model)
  • How to trade off learning the model vs. learning the “real thing” (value function/policy)? You act every time you have to (gives real experience, used to build the best model possible), then plan (simulate trajectories from the model to improve q/pi) as long as you have time to think before you have to act again
  • The Dyna architecture does exactly this and is much more data-efficient (w.r.t. real experience, as more data can be generated), already with 5 (and much more so with 50) sampling (“thinking”) steps between 2 real observations -> slide 28 and the Dyna-Q sketch after this list
  • Forward search: Key idea is to not explore the entire state space, but focus on what matters from the current state onwards (i.e., we only solve the sub-MDP starting from “now”)
  • Simulation-based search: forward search using sample-based planning based on a model (i.e., not build/consider whole tree from now on, but sample trajectories, then apply model-free RL to them) -> slide 33
  • Monte-Carlo tree search: search tree is built from scratch starting from current state and contains all states we visit in the course of action together with the actions we took, together with MC evaluations (q-values)
  • MCTS process: repeat {evaluation (see above); improvement of tree (simulation) policy by methods from last lectures, e.g. epsilon-greedy} => this is just MC control (from previous lectures) applied to simulated experience -> slide 37
  • MCTS converges to q*
  • MCTS advantages: breaks “curse of dimensionality” by sampling; focuses on the “now” and the most likely successful actions through sampling; nice computational properties (parallelization, scaling, efficient)
  • TD search has the advantage to potentially reduce the variance and being more efficient (than MC; more so if choosing lambda well), thanks to bootstrapping
  • Recap on TD/MC: instead of waiting until the end of each simulated episode and taking the final reward to build up statistics of the value of our “now” state by taking the average (MC), we bootstrap a new estimate of the value of each intermediate state by means of current reward plus discounted expected reward according to current q estimate (TD)
  • TD is especially effective in environments where states can be reached via many different paths (so that you might already know something about the next state and have it encoded in your current q estimate) => so the only difference between MC and TD search is in how we update our q values -> slide 51
  • Slide 53: black is MCTS, blue is Dyna-2 (long-term memory from real experience, short-term memory from simulated experience)
  • Final word: tree helps to focus “imagination” (planning) on the relevant part of the state/action space, and thus learning from simulation is highly effective
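
The Dyna idea fits in a short sketch too (toy assumptions again: the 4x4 gridworld from the earlier sketches and a deterministic model stored as a table): after every real step the agent does n extra “thinking” updates on transitions replayed from its learned model.

```python
import random

# Assumed toy problem: 4x4 gridworld, terminal state 0, reward -1 per step (as in the earlier sketches).
N, gamma, alpha, eps, n_planning = 4, 1.0, 0.5, 0.1, 10
A = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = divmod(s, N)
    return min(max(r + a[0], 0), N - 1) * N + min(max(c + a[1], 0), N - 1)

Q = [[0.0] * len(A) for _ in range(N * N)]
model = {}                                   # learned (deterministic) model: (s, a) -> (r, s')

def q_update(s, a, r, s2):
    target = r + gamma * (0.0 if s2 == 0 else max(Q[s2]))
    Q[s][a] += alpha * (target - Q[s][a])

for _ in range(200):
    s = N * N - 1
    while s != 0:
        a = random.randrange(len(A)) if random.random() < eps else max(range(len(A)), key=lambda i: Q[s][i])
        s2, r = step(s, A[a]), -1.0
        q_update(s, a, r, s2)                # learn from real experience
        model[(s, a)] = (r, s2)              # update the model
        for _ in range(n_planning):          # "thinking": replay simulated experience from the model
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            q_update(ps, pa, pr, ps2)
        s = s2

print(round(max(Q[N * N - 1]), 1))           # roughly -6.0, reached with far fewer real episodes
```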

Lecture 9: Exploration and Exploitation

  • Decaying epsilon-greedy is the best exploration strategy so far, but it depends on the schedule (which in turn depends on the unknown optimal value function)
  • Optimism in the face of uncertainty (uncertainty = fat tails of a distribution): if you have 3 distributions for e.g. reward, pick not from the one with the highest mean, but with the fattest tails towards the maximum; if those extend beyond the highest mean, this distribution has the highest potential to have an even higher mean when seeing more examples -> slide 15
  • Upper Confidence Bound (UCB): select the action that maximizes the upper confidence bound on its q value (the higher, the more uncertainty we have; U_t(a) shrinks as we visit this action more often => it characterizes the “tail” from above) -> slide 17
  • The UCB term helps us exploring without knowing more about the true q values except that they are bounded
  • UCB vs. epsilon-greedy: UCB performs really well; epsilon-greedy can do this too but can be a disaster for the wrong epsilon -> slide 21 and the UCB1 sketch after this list
  • If you have prior knowledge about the bandit problem, you can use Thompson sampling, which can be shown to be asymptotically optimal (but is still, as UCB, a heuristic) -> slide 25
  • UCB (and other “optimism in the face of uncertainty” methods like Thompson sampling) will explore forever (accumulating lots of unnecessary regret) in case of huge/infinite action spaces, and they don’t allow safe exploration
  • UCB is not quite optimal for full MDPs as we have uncertainty about our current q-values in two ways (1. because we haven’t seen enough examples yet in evaluation; 2. because we haven’t improved enough yet), and it is hard to account for both with U_t(a) -> slide 42 (not same as in video)
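
A small UCB1 sketch on an assumed 3-armed Bernoulli bandit (the arm probabilities are invented): each arm’s score is its empirical mean plus an uncertainty bonus sqrt(2 ln t / N(a)) that shrinks the more often the arm has been pulled.

```python
import math
import random

# Assumed 3-armed Bernoulli bandit with unknown success probabilities.
probs = [0.2, 0.5, 0.7]
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]

def ucb1(t):
    # pull every arm once first, then maximize empirical mean + uncertainty bonus
    for a in range(len(probs)):
        if counts[a] == 0:
            return a
    return max(range(len(probs)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

for t in range(1, 5001):
    a = ucb1(t)
    r = 1.0 if random.random() < probs[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean

print(counts)   # the best arm (index 2) should dominate the pull counts
```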

Lecture 10: Classic Games

(For a corrected video with visible slides, see here.)

  • We have done our job if we found a RL policy that is a Nash equilibrium (in the context of RL: a joint policy for all players such that every player’s policy is a best response) -> it is the best overall policy, but not necessarily the best against a very specific opponent’s policy
  • In self-play, the best response is solution to single-agent RL (where all other players are treated as part of the environment)
  • If we can solve playing a game (by adapting to the environment dynamically created by the other players through self-play) and converge to a fixed point (i.e., all other players declare they have found an optimal policy in return), we have found a Nash equilibrium -> slide 7
  • Two-player zero-sum game (perfect information): equal and opposite rewards for each player; a minimax policy (that achieves the maximum value for white and the minimum for black) is a Nash equilibrium -> see the minimax sketch after this list
  • Search is very important for success in games, intuitively because it helps form tactics for the concrete situation the player is currently in
  • In self-play RL, we always play (and improve the policy) for both players (minimax), and all the previous machinery applies (MC, TD variants) -> slide 20
  • Logistello: tree search to come up with good moves in self-play was crucial (it then used generalised policy iteration with MC evaluation of self-play games)
  • TD Gammon: binary state vector had separate feature for each possible number of stones of each color in each position (i.e., one-hot encoded -> then neural network as value function approximator and TD(lambda) with greedy policy improvement without exploration [worked without exploration because of the stochasticity introduced by the dice that helps in anyhow seeing a lot of the state space])
  • TD Root (A. Samuel’s Checkers): backup value of s_t not from v(s_{t+1}), but from the result of a tree search on s_{t+1} (first ever TD algorithm)
  • TD Leaf: update also the decisive leaf node and the rest of its branch (not just the root) in the tree of the search on s_t with the “winning” node’s value in the minimax search of s_{t+1}
  • TreeStrap: different from TD Leaf, this is also effective in self play and from random weights (works by updating/learning to predict any value in the search tree; this doesn’t mix the backup from search with the backup from randomly searching as previous ideas did and which is not effective)
  • Naively applying MCTS/UCT etc. (that are so effective in fully observable games like Go) to games of imperfect information usually “blows up”/diverges
  • Need a search tree per player, built by smooth UCT search (which remembers the average policy of the opponent by counting every action they ever played during self-play)
  • v: often binary linear, in future more NN
  • RL: TD(lambda) with self-play and search (crucial, for tactics)
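
To illustrate the minimax value mentioned above (my own toy example, unrelated to the lecture’s case studies): a tiny recursive minimax over a hand-made game tree, where white maximises the terminal reward and black minimises it; the root value is the game-theoretic value of this zero-sum game.

```python
# Hand-made toy game tree (all values invented): inner nodes are lists of children,
# leaves are terminal rewards from white's point of view.
tree = [[3, [-2, 9]], [[5, 1], 4]]

def minimax(node, white_to_move=True):
    """Return the minimax value: white maximizes, black minimizes (zero-sum)."""
    if not isinstance(node, list):      # leaf: terminal reward
        return node
    values = [minimax(child, not white_to_move) for child in node]
    return max(values) if white_to_move else min(values)

print(minimax(tree))   # 4 for this particular toy tree
```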

Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning

Yucheng Wang , Zhongyu Wei , Yaqian Zhou , Xuanjing Huang

  • Yucheng Wang, Zhongyu Wei, Yaqian Zhou, and Xuanjing Huang. 2018. Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 791–797, Brussels, Belgium. Association for Computational Linguistics. https://aclanthology.org/D18-1090
