This is a continuation of the first blog post on Bayesian bandits.
We introduce two extensions of the vanilla Bayesian bandit with Thompson sampling: one to deal with nonstationary rewards, and one to deal with extra contextual information.
Previously we assumed that the reward distribution was stationary: we might not know the average payout of each bandit, or how consistently it delivers those payouts, but at least we knew that neither changes over time. In general, however, this is not the case. If we were advertisers or marketers, for instance, we know that users’ preferences don’t stay fixed: they drift over time, possibly even with some seasonality.
In the stationary setting, Thompson sampling draws from the usual posterior

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta),$$

where $\theta$ are our parameters and $\mathcal{D}$ is the data observed so far.
With a decayed posterior, we instead have

$$p(\theta \mid \mathcal{D}_t) \propto p(r_t \mid \theta)\, p(\theta \mid \mathcal{D}_{t-1})^{1-\epsilon},$$

where typically $0 < \epsilon \ll 1$. Raising the previous posterior to the power $1-\epsilon$ flattens it slightly at every step, so older observations are gradually forgotten and the posterior can track a reward distribution that drifts over time.
If we use a conjugate prior, then the decayed posterior is also conjugate: raising an exponential-family density to the power $1-\epsilon$ keeps it in the same family, only rescaling its sufficient statistics. For a Bernoulli bandit with a $\mathrm{Beta}(\alpha, \beta)$ posterior, decay simply shrinks the accumulated pseudo-counts $\alpha - 1$ and $\beta - 1$ by a factor of $1-\epsilon$ before each update.
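As a minimal sketch of this idea, here is a decayed Beta–Bernoulli Thompson sampler. The class name, `eps` parameter, and method names are illustrative, not from the original post; the key step is shrinking the pseudo-counts toward the uniform prior before each update.

```python
import numpy as np

rng = np.random.default_rng(0)


class DecayedBetaBandit:
    """Thompson sampling with an exponentially decayed Beta posterior.

    Each arm keeps Beta pseudo-counts (alpha, beta). Before every update,
    the evidence beyond the Beta(1, 1) prior is shrunk by (1 - eps), so
    old observations are gradually forgotten. (Hypothetical sketch.)
    """

    def __init__(self, n_arms, eps=0.01):
        self.eps = eps
        self.alpha = np.ones(n_arms)  # 1 + decayed success count per arm
        self.beta = np.ones(n_arms)   # 1 + decayed failure count per arm

    def select_arm(self):
        # Thompson sampling: draw one sample per arm, play the argmax.
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Decay the accumulated evidence on every arm (posterior^(1 - eps)
        # for a Beta stays a Beta with rescaled pseudo-counts)...
        self.alpha = 1 + (1 - self.eps) * (self.alpha - 1)
        self.beta = 1 + (1 - self.eps) * (self.beta - 1)
        # ...then fold in the new Bernoulli observation.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward


# Usage: two arms whose success probabilities swap halfway through,
# a simple nonstationary environment the decayed posterior can track.
probs = np.array([0.8, 0.2])
bandit = DecayedBetaBandit(n_arms=2, eps=0.05)
for t in range(2000):
    if t == 1000:
        probs = probs[::-1]  # the environment drifts
    arm = bandit.select_arm()
    reward = rng.binomial(1, probs[arm])
    bandit.update(arm, reward)
```

With `eps = 0`, this reduces exactly to the stationary Thompson sampler from the first post; larger `eps` forgets faster at the cost of noisier estimates.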
Bayesian Contextual Bandits