Decaying Evidence and Contextual Bandits — Bayesian Reinforcement Learning (Part 2)

2019-06-02

This is the second of a two-part series about Bayesian bandit algorithms. Check out the first post here.

Previously, I introduced the multi-armed bandit problem, and a Bayesian approach to solving/modelling it (Thompson sampling). We saw that conjugate models made it possible to run the bandit algorithm online: the same is even true for non-conjugate models, so long as the rewards are bounded.

In this follow-up blog post, we’ll take a look at two extensions to the multi-armed bandit. The first allows the bandit to model nonstationary rewards distributions, whereas the second allows the bandit to model context. Jump in!

Cartoon of a multi-armed bandit — An example of a multi-armed bandit situation. Source: Inverse.

Nonstationary Bandits

Up until now, we’ve concerned ourselves with stationary bandits: in other words, we assumed that the rewards distribution for each arm did not change over time. In the real world though, rewards distributions need not be stationary: customer preferences change, trading algorithms deteriorate, and news articles rise and fall in relevance.

Nonstationarity could mean one of two things for us:

either we are lucky enough to know that rewards are similarly distributed throughout all time (e.g. the rewards are always normally distributed, or always binomially distributed), and that it is merely the parameters of these distributions that are liable to change,
or we aren’t so unlucky, and the rewards distributions are not only changing, but don’t even have a nice parametric form.

Good news, though: there is a neat trick to deal with both forms of nonstationarity!

Decaying evidence and posteriors

But first, some notation. Suppose we have a model with parameters $\theta$. We place a prior $\color{purple}{\pi_0(\theta)}$ on it¹, and at the $t$‘th time step, we observe data $D_t$, compute the likelihood $\color{blue}{P(D_t | \theta)}$ and update the posterior from $\color{red}{\pi_t(\theta | D_{1:t})}$ to $\color{green}{\pi_{t+1}(\theta | D_{1:t+1})}$.

This is a quintessential application of Bayes’ Theorem. Mathematically:

$$ \color{green}{\pi_{t+1}(\theta | D_{1:t+1})} \propto \color{blue}{P(D_{t+1} | \theta)} \cdot \color{red}{\pi_t (\theta | D_{1:t})} \tag{1} \label{1} $$

However, for problems with nonstationary rewards distributions, we would like data points observed a long time ago to have less weight than data points observed recently. This is only prudent: in the absence of recent data, we would like to adopt a more conservative “no-data” prior, rather than allow our posterior to be informed by outdated data. This can be achieved by modifying the Bayesian update to:

$$ \color{green}{\pi_{t+1}(\theta | D_{1:t+1})} \propto \color{magenta}{[} \color{blue}{P(D_{t+1} | \theta)} \cdot \color{red}{\pi_t (\theta | D_{1:t})} {\color{magenta}{]^{1-\epsilon}}} \cdot \color{purple}{\pi_0(\theta)}^\color{magenta}{\epsilon} \tag{2} \label{2} $$

for some $0 < \color{magenta}{\epsilon} \ll 1$. We can think of $\color{magenta}{\epsilon}$ as controlling the rate of decay of the evidence/posterior (i.e. how quickly we should distrust past data points). Notice that if we stop observing data points at time $T$, then $\color{red}{\pi_t(\theta | D_{1:T})} \rightarrow \color{purple}{\pi_0(\theta)}$ as $t \rightarrow \infty$.

Decaying the evidence (and therefore the posterior) can be used to address both types of nonstationarity identified above. Simply use $(\ref{2})$ as a drop-in replacement for $(\ref{1})$ when updating the hyperparameters. Whether you’re using a conjugate model or the algorithm by Agarwal and Goyal (introduced in the previous blog post), using $(\ref{2})$ will decay the evidence and posterior, as desired.

For more information (and a worked example for the Beta-Binomial model!), check out Austin Rochford’s talk for Boston Bayesians about Bayesian bandit algorithms for e-commerce.

Contextual Bandits

We can think of the multi-armed bandit problem as follows²:

A policy chooses an arm $a$ from $k$ arms.
The world reveals the reward $R_a$ of the chosen arm.

However, this formulation fails to capture an important phenomenon: there is almost always extra information that is available when making each decision. For instance, online ads occur in the context of the web page in which they appear, and online store recommendations are given in the context of the user’s current cart contents (among other things).

To take advantage of this information, we might think of a different formulation where, on each round:

The world announces some context information $x$.
A policy chooses an arm $a$ from $k$ arms.
The world reveals the reward $R_a$ of the chosen arm.

In other words, contextual bandits call for some way of taking context as input and producing arms/actions as output.

Alternatively, if you think of regular multi-armed bandits as taking no input whatsoever (but still producing outputs, the arms to pull), you can think of contextual bandits as algorithms that both take inputs and produce outputs.

Bayesian contextual bandits

Contextual bandits give us a very general framework for thinking about sequential decision making (and reinforcement learning). Clearly, there are many ways to make a bandit algorithm take context into account. Linear regression is a straightforward and classic example: simply assume that the rewards depend linearly on the context.

For a refresher on the details of Bayesian linear regression, refer to Pattern Recognition and Machine Learning by Christopher Bishop: specifically, section 3.3 on Bayesian linear regression and exercises 3.12 and 3.13³. Briefly though, if we place a Gaussian prior on the regression weights and an inverse gamma prior on the noise parameter (i.e., the noise of the observations), then their joint prior will be conjugate to a Gaussian likelihood, and the posterior predictive distribution for the rewards will be a Student’s $t$.

Since we need to maintain posteriors of the rewards for each arm (so that we can do Thompson sampling), we need to run a separate Bayesian linear regression for each arm. At every iteration we then Thompson sample from each Student’s $t$ posterior, and select the arm with the highest sample.

However, Bayesian linear regression is a textbook example of a model that lacks expressiveness: in most circumstances, we want something that can model nonlinear functions as well. One (perfectly valid) way of doing this would be to hand-engineer some nonlinear features and/or basis functions before feeding them into a Bayesian linear regression. However, in the 21st century, the trendier thing to do is to have a neural network learn those features for you. This is exactly what is proposed in a ICLR 2018 paper from Google Brain. They find that this model — which they call NeuralLinear — performs decently well across a variety of tasks, even compared to other bandit algorithms. In the words of the authors:

We believe [NeuralLinear’s] main strength is that it is able to simultaneously learn a data representation that greatly simplifies the task at hand, and to accurately quantify the uncertainty over linear models that explain the observed rewards in terms of the proposed representation.

For more information, be sure to check out the Google Brain paper and the accompanying TensorFlow code.

⁂ George Ho

Decaying Evidence and Contextual Bandits — Bayesian Reinforcement Learning (Part 2)

Nonstationary Bandits

Decaying evidence and posteriors

Contextual Bandits

Bayesian contextual bandits

Further Reading