Eigenfoo (https://eigenfoo.xyz/feed.xml) by George Ho. Feed generated by Jekyll on 2019-07-23T00:55:54+00:00.

Graduated Cooper Union, Joining Point72 (2019-07-22, https://eigenfoo.xyz/joining-point72)

<p>Some exciting personal news: I’ve <em>(finally)</em> graduated from <a href="http://cooper.edu/welcome">The Cooper
Union</a>, and I’m joining <a href="https://www.point72.com/">Point72 Asset
Management</a> as a data scientist/research analyst!</p>
<p>Point72 is an American hedge fund, headquartered in Connecticut. I’ll be based
in New York, working out of their <a href="https://www.hudsonyardsnewyork.com/work/55-hudson-yards">Hudson
Yards</a> offices.</p>
<p><img src="https://www.point72.com/wp-content/uploads/2017/03/point72-recropped.png" alt="Point72 logo" /></p>
<p>In this next chapter of my life, my professional focuses are:</p>
<ol>
<li><strong>Keep learning.</strong> Bayesian methods and deep learning, mostly.</li>
<li><strong>Open source.</strong> I’ve been involved with developing
<a href="https://github.com/pymc-devs/pymc4">PyMC4</a>. These are exciting times for the
PyMC project: I hope to keep contributing!</li>
</ol>
<p>My four years of college were incredibly rewarding, but I’m excited to enter the
real world. Stay tuned!</p>

Python Port of <em>Common Statistical Tests are Linear Models</em> (2019-06-28, https://eigenfoo.xyz/stat-tests-are-linear-models)

<p>I ported <a href="https://lindeloev.net">Jonas Lindeløv</a>’s post, <a href="https://lindeloev.github.io/tests-as-linear/"><em>Common Statistical
Tests are Linear Models</em></a> from R
to Python. Check it out on <a href="https://eigenfoo.xyz/tests-as-linear/">my blog</a>,
<a href="https://github.com/eigenfoo/tests-as-linear">GitHub</a>, or
<a href="https://gke.mybinder.org/v2/gh/eigenfoo/tests-as-linear/master?filepath=tests-as-linear.ipynb">Binder</a>!</p>

Decaying Evidence and Contextual Bandits — Bayesian Reinforcement Learning (Part 2) (2019-06-02, https://eigenfoo.xyz/bayesian-bandits-2)

<blockquote>
<p>This is the second of a two-part series about Bayesian bandit algorithms.
Check out the first post <a href="https://eigenfoo.xyz/bayesian-bandits/">here</a>.</p>
</blockquote>
<p><a href="https://eigenfoo.xyz/bayesian-bandits/">Previously</a>, I introduced the
multi-armed bandit problem, and a Bayesian approach to solving/modelling it
(Thompson sampling). We saw that conjugate models made it possible to run the
bandit algorithm online: the same is even true for non-conjugate models, so long
as the rewards are bounded.</p>
<p>In this follow-up blog post, we’ll take a look at two extensions to the
multi-armed bandit. The first allows the bandit to model nonstationary rewards
distributions, whereas the second allows the bandit to model context. Jump in!</p>
<figure>
<a href="https://fsmedia.imgix.net/29/fd/a4/56/8363/4fb0/8c62/20e80649451b/the-multi-armed-bandit-determines-what-you-see-on-the-internet.jpeg?rect=0%2C34%2C865%2C432&auto=format%2Ccompress&dpr=2&w=650"><img src="https://fsmedia.imgix.net/29/fd/a4/56/8363/4fb0/8c62/20e80649451b/the-multi-armed-bandit-determines-what-you-see-on-the-internet.jpeg?rect=0%2C34%2C865%2C432&auto=format%2Ccompress&dpr=2&w=650" alt="Cartoon of a multi-armed bandit" /></a>
<figcaption>An example of a multi-armed bandit situation. Source: <a href="https://www.inverse.com/article/13762-how-the-multi-armed-bandit-determines-what-ads-and-stories-you-see-online">Inverse</a>.</figcaption>
</figure>
<h2 id="nonstationary-bandits">Nonstationary Bandits</h2>
<p>Up until now, we’ve concerned ourselves with stationary bandits: in other words,
we assumed that the rewards distribution for each arm did not change over time.
In the real world though, rewards distributions need not be stationary: customer
preferences change, trading algorithms deteriorate, and news articles rise and
fall in relevance.</p>
<p>Nonstationarity could mean one of two things for us:</p>
<ol>
<li>either we are lucky enough to know that rewards are similarly distributed
throughout all time (e.g. the rewards are always normally distributed, or
always binomially distributed), and that it is merely the parameters of these
distributions that are liable to change,</li>
<li>or we aren’t so lucky, and the rewards distributions are not only changing,
but don’t even have a nice parametric form.</li>
</ol>
<p>Good news, though: there is a neat trick to deal with both forms of
nonstationarity!</p>
<h3 id="decaying-evidence-and-posteriors">Decaying evidence and posteriors</h3>
<p>But first, some notation. Suppose we have a model with parameters <script type="math/tex">\theta</script>. We
place a prior <script type="math/tex">\color{purple}{\pi_0(\theta)}</script> on it<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, and at the <script type="math/tex">t</script>‘th
time step, we observe data <script type="math/tex">D_t</script>, compute the likelihood <script type="math/tex">\color{blue}{P(D_t
| \theta)}</script> and update the posterior from <script type="math/tex">\color{red}{\pi_t(\theta |
D_{1:t})}</script> to <script type="math/tex">\color{green}{\pi_{t+1}(\theta | D_{1:t+1})}</script>.</p>
<p>This is a quintessential application of Bayes’ Theorem. Mathematically:</p>
<script type="math/tex; mode=display">\color{green}{\pi_{t+1}(\theta | D_{1:t+1})} \propto \color{blue}{P(D_{t+1} |
\theta)} \cdot \color{red}{\pi_t (\theta | D_{1:t})} \tag{1} \label{1}</script>
<p>However, for problems with nonstationary rewards distributions, we would like
data points observed a long time ago to have less weight than data points
observed recently. This is only prudent: in the absence of recent data, we would
like to adopt a more conservative “no-data” prior, rather than allow our
posterior to be informed by outdated data. This can be achieved by modifying the
Bayesian update to:</p>
<script type="math/tex; mode=display">\color{green}{\pi_{t+1}(\theta | D_{1:t+1})} \propto \color{magenta}{[}
\color{blue}{P(D_{t+1} | \theta)} \cdot \color{red}{\pi_t (\theta | D_{1:t})}
{\color{magenta}{]^{1-\epsilon}}} \cdot
\color{purple}{\pi_0(\theta)}^\color{magenta}{\epsilon} \tag{2} \label{2}</script>
<p>for some <script type="math/tex">% <![CDATA[
0 < \color{magenta}{\epsilon} \ll 1 %]]></script>. We can think of
<script type="math/tex">\color{magenta}{\epsilon}</script> as controlling the rate of decay of the
evidence/posterior (i.e. how quickly we should distrust past data points).
Notice that if we stop observing data points at time <script type="math/tex">T</script>, then <script type="math/tex">\color{red}{\pi_t(\theta | D_{1:T})} \rightarrow \color{purple}{\pi_0(\theta)}</script> as <script type="math/tex">t \rightarrow \infty</script>.</p>
<p>Decaying the evidence (and therefore the posterior) can be used to address both
types of nonstationarity identified above. Simply use <script type="math/tex">(\ref{2})</script> as a drop-in
replacement for <script type="math/tex">(\ref{1})</script> when updating the hyperparameters. Whether you’re
using a conjugate model or the algorithm by <a href="https://arxiv.org/abs/1111.1797">Agarwal and
Goyal</a> (introduced in <a href="https://eigenfoo.xyz/bayesian-bandits">the previous blog
post</a>), using <script type="math/tex">(\ref{2})</script> will decay
the evidence and posterior, as desired.</p>
<p>For more information (and a worked example for the Beta-Binomial model!), check
out <a href="https://austinrochford.com/resources/talks/boston-bayesians-2017-bayes-bandits.slides.html#/3">Austin Rochford’s talk for Boston
Bayesians</a>
about Bayesian bandit algorithms for e-commerce.</p>
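For the conjugate Beta-Bernoulli model, update \((2)\) works out to interpolating the updated hyperparameters back toward the prior’s. A minimal sketch, assuming Bernoulli rewards (the function name and default values here are illustrative, not from any library):

```python
def decayed_update(alpha, beta, successes, failures,
                   alpha0=1.0, beta0=1.0, epsilon=0.01):
    """One decayed Bayesian update for a Beta-Bernoulli bandit arm.

    Implements pi_{t+1} proportional to [likelihood * pi_t]^(1 - eps) * pi_0^eps,
    which for Beta(alpha, beta) hyperparameters reduces to interpolating
    the updated values back toward the prior's.
    """
    alpha_new = (1 - epsilon) * (alpha + successes) + epsilon * alpha0
    beta_new = (1 - epsilon) * (beta + failures) + epsilon * beta0
    return alpha_new, beta_new

# With no new evidence, repeated updates decay back toward the prior,
# as noted above.
a, b = 50.0, 10.0
for _ in range(1000):
    a, b = decayed_update(a, b, successes=0, failures=0)
print(a, b)  # both approach the prior hyperparameters (1.0, 1.0)
```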
<h2 id="contextual-bandits">Contextual Bandits</h2>
<p>We can think of the multi-armed bandit problem as follows<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>:</p>
<ol>
<li>A policy chooses an arm <script type="math/tex">a</script> from <script type="math/tex">k</script> arms.</li>
<li>The world reveals the reward <script type="math/tex">R_a</script> of the chosen arm.</li>
</ol>
<p>However, this formulation fails to capture an important phenomenon: there is
almost always extra information that is available when making each decision.
For instance, online ads occur in the context of the web page in which they
appear, and online store recommendations are given in the context of the user’s
current cart contents (among other things).</p>
<p>To take advantage of this information, we might think of a different formulation
where, on each round:</p>
<ol>
<li>The world announces some context information <script type="math/tex">x</script>.</li>
<li>A policy chooses an arm <script type="math/tex">a</script> from <script type="math/tex">k</script> arms.</li>
<li>The world reveals the reward <script type="math/tex">R_a</script> of the chosen arm.</li>
</ol>
<p>In other words, contextual bandits call for some way of taking context as input
and producing arms/actions as output.</p>
<p>Alternatively, if you think of regular multi-armed bandits as taking no input
whatsoever (but still producing outputs, the arms to pull), you can think of
contextual bandits as algorithms that both take inputs and produce outputs.</p>
<h3 id="bayesian-contextual-bandits">Bayesian contextual bandits</h3>
<p>Contextual bandits give us a very general framework for thinking about
sequential decision making (and reinforcement learning). Clearly, there are many
ways to make a bandit algorithm take context into account. Linear regression is
a straightforward and classic example: simply assume that the rewards depend
linearly on the context.</p>
<p>For a refresher on the details of Bayesian linear regression, refer to <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book"><em>Pattern
Recognition and Machine
Learning</em></a>
by Christopher Bishop: specifically, section 3.3 on Bayesian linear regression
and exercises 3.12 and 3.13<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. Briefly though, if we place a Gaussian prior on
the regression weights and an inverse gamma prior on the noise parameter (i.e.,
the noise of the observations), then their joint prior will be conjugate to a
Gaussian likelihood, and the posterior predictive distribution for the rewards
will be a Student’s <script type="math/tex">t</script>.</p>
<p>Since we need to maintain posteriors of the rewards for each arm (so that we can
do Thompson sampling), we need to run a separate Bayesian linear regression for
each arm. At every iteration we then Thompson sample from each Student’s <script type="math/tex">t</script>
posterior, and select the arm with the highest sample.</p>
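As a sketch of the per-arm scheme just described, here is Thompson sampling with one Bayesian linear regression per arm. For brevity this assumes a <em>known</em> noise variance, so the weight posterior is Gaussian and sampling weights stands in for the full Student’s t predictive described above; the class and function names are mine:

```python
import numpy as np

class BayesLinearArm:
    """Bayesian linear regression for one arm, with known noise variance
    for simplicity (the full model also places an inverse gamma prior on
    the noise, yielding Student's t posterior predictives)."""

    def __init__(self, dim, prior_var=1.0, noise_var=1.0):
        self.precision = np.eye(dim) / prior_var  # posterior precision matrix
        self.xty = np.zeros(dim)                  # running sum of x * reward / noise
        self.noise_var = noise_var

    def update(self, x, reward):
        # Standard conjugate Gaussian update of the weight posterior.
        self.precision += np.outer(x, x) / self.noise_var
        self.xty += x * reward / self.noise_var

    def thompson_sample(self, x, rng):
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.xty
        w = rng.multivariate_normal(mean, cov)  # sample weights from posterior
        return w @ x                            # sampled expected reward

rng = np.random.default_rng(42)
arms = [BayesLinearArm(dim=3) for _ in range(4)]

def choose_arm(context):
    # Thompson sample each arm's reward and pick the highest.
    samples = [arm.thompson_sample(context, rng) for arm in arms]
    return int(np.argmax(samples))

context = rng.standard_normal(3)
a = choose_arm(context)
arms[a].update(context, reward=1.0)  # only the pulled arm gets updated
```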
<p>However, Bayesian linear regression is a textbook example of a model that lacks
expressiveness: in most circumstances, we want something that can model
nonlinear functions as well. One (perfectly valid) way of doing this would be to
hand-engineer some nonlinear features and/or basis functions before feeding them
into a Bayesian linear regression. However, in the 21st century, the trendier
thing to do is to have a neural network learn those features for you. This is
exactly what is proposed in a <a href="https://arxiv.org/abs/1802.09127">ICLR 2018 paper from Google
Brain</a>. They find that this model — which they
call <code class="highlighter-rouge">NeuralLinear</code> — performs decently well across a variety of tasks, even
compared to other bandit algorithms. In the words of the authors:</p>
<blockquote>
<p>We believe [<code class="highlighter-rouge">NeuralLinear</code>’s] main strength is that it is able to
<em>simultaneously</em> learn a data representation that greatly simplifies the task
at hand, and to accurately quantify the uncertainty over linear models that
explain the observed rewards in terms of the proposed representation.</p>
</blockquote>
<p>For more information, be sure to check out the <a href="https://arxiv.org/abs/1802.09127">Google Brain
paper</a> and the accompanying <a href="https://github.com/tensorflow/models/tree/master/research/deep_contextual_bandits">TensorFlow
code</a>.</p>
<h2 id="further-reading">Further Reading</h2>
<p>For non-Bayesian approaches to contextual bandits, <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-algorithms">Vowpal
Wabbit</a>
is a great resource: <a href="http://hunch.net/~jl/">John Langford</a> and the team at
<a href="https://www.microsoft.com/research/">Microsoft Research</a> have <a href="https://arxiv.org/abs/1402.0555v2">extensively
researched</a> contextual bandit algorithms.
They’ve provided blazingly fast implementations of recent algorithms and written
good documentation for them.</p>
<p>For the theory and math behind bandit algorithms, <a href="https://banditalgs.com/">Tor Lattimore and Csaba
Szepesvári’s book</a> covers a breathtaking amount of
ground.</p>
<blockquote>
<p>This is the second of a two-part series about Bayesian bandit algorithms.
Check out the first post <a href="https://eigenfoo.xyz/bayesian-bandits/">here</a>.</p>
</blockquote>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Did you know you can make <a href="http://adereth.github.io/blog/2013/11/29/colorful-equations/">colored equations with MathJax</a>? Technology frightens me sometimes. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This explanation is largely drawn from <a href="http://hunch.net/?p=298">from John Langford’s <code class="highlighter-rouge">hunch.net</code></a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>If you don’t want to do Bishop’s exercises, there’s a partially complete solutions manual <a href="https://github.com/GoldenCheese/PRML-Solution-Manual/">on GitHub</a> :wink: <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

Autoregressive Models in Deep Learning — A Brief Survey (2019-03-09, https://eigenfoo.xyz/deep-autoregressive-models)

<p>My current project involves working with deep autoregressive models: a class of
remarkable neural networks that aren’t usually seen on a first pass through deep
learning. These notes are a quick write-up of my reading and research: I assume
basic familiarity with deep learning, and aim to highlight general trends and
similarities across autoregressive models, instead of commenting on individual
architectures.</p>
<p><strong>tldr:</strong> <em>Deep autoregressive models are sequence models, yet feed-forward
(i.e. not recurrent); generative models, yet supervised. They are a compelling
alternative to RNNs for sequential data, and GANs for generation tasks.</em></p>
<h2 id="deep-autoregressive-models">Deep Autoregressive Models</h2>
<p>To be explicit (at the expense of redundancy), this blog post is about <em>deep
autoregressive generative sequence models</em>. That’s quite a mouthful of jargon
(and two of those words are actually unnecessary), so let’s unpack that.</p>
<ol>
<li>Deep
<ul>
<li>Well, these papers are using TensorFlow or PyTorch… so they must be “deep”
:wink:</li>
<li>You would think this word is unnecessary, but it’s actually not!
Autoregressive linear models like
<a href="https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model">ARMA</a>
or
<a href="https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH</a>
have been used in statistics, econometrics and financial modelling for ages.</li>
</ul>
</li>
<li>Autoregressive
<ul>
<li><a href="https://deepgenerativemodels.github.io/notes/autoregressive/">Stanford has a good
introduction</a>
to autoregressive models, but I think a good way to explain these models is
to compare them to recurrent neural networks (RNNs), which are far more
well-known.</li>
</ul>
<figure>
<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png"><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" alt="Recurrent neural network (RNN) block diagram, both rolled and unrolled" /></a>
<figcaption>Obligatory RNN diagram. Source: <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Chris Olah</a>.</figcaption>
</figure>
<ul>
<li>Like an RNN, an autoregressive model’s output <script type="math/tex">h_t</script> at time <script type="math/tex">t</script>
depends on not just <script type="math/tex">x_t</script>, but also <script type="math/tex">x</script>’s from previous time steps.
However, <em>unlike</em> an RNN, the previous <script type="math/tex">x</script>’s are not provided via some
hidden state: they are given as just another input to the model.</li>
<li>The following animation of Google DeepMind’s WaveNet illustrates this
well: the <script type="math/tex">t</script>th output is generated in a <em>feed-forward</em> fashion from
several input <script type="math/tex">x</script> values.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></li>
</ul>
<figure>
<a href="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif"><img src="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif" alt="WaveNet animation" /></a>
<figcaption>WaveNet animation. Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Google DeepMind</a>.</figcaption>
</figure>
<ul>
<li>Put simply, <strong>an autoregressive model is merely a feed-forward model which
predicts future values from past values.</strong></li>
<li>I’ll explain this more later, but it’s worth saying now: autoregressive
models offer a compelling bargain. You can have stable, parallel and
easy-to-optimize training, faster inference computations, and completely
do away with the fickleness of <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">truncated backpropagation through
time</a>, if you
are willing to accept a model that (by design) <em>cannot have</em> infinite
memory. There is <a href="http://www.offconvex.org/2018/07/27/approximating-recurrent/">recent
research</a> to
suggest that this is a worthwhile tradeoff.</li>
</ul>
</li>
<li>Generative
<ul>
<li>Informally, a generative model is one that can generate new data after
learning from the dataset.</li>
<li>More formally, a generative model models the joint distribution <script type="math/tex">P(X, Y)</script>
of the observation <script type="math/tex">X</script> and the target <script type="math/tex">Y</script>. Contrast this to a
discriminative model that models the conditional distribution <script type="math/tex">P(Y|X)</script>.</li>
<li>GANs and VAEs are two families of popular generative models.</li>
<li>This is unnecessary word #1: any autoregressive model can be run
sequentially to generate a new sequence! Start with your seed <script type="math/tex">x_1, x_2,
..., x_k</script> and predict <script type="math/tex">x_{k+1}</script>. Then use <script type="math/tex">x_2, x_3, ..., x_{k+1}</script> to
predict <script type="math/tex">x_{k+2}</script>, and so on.</li>
</ul>
</li>
<li>Sequence model
<ul>
<li>Fairly self explanatory: a model that deals with sequential data, whether it
is mapping sequences to scalars (e.g. language models), or mapping sequences
to sequences (e.g. machine translation models).</li>
<li>Although sequence models are designed for sequential data (duh), there has
been success at applying them to non-sequential data. For example, PixelCNN
(discussed below) can generate entire images, even though images are not
sequential in nature: the model generates a pixel at a time, in
sequence!<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></li>
<li>Notice that an autoregressive model must be a sequence model, so it’s
redundant to further describe these models as sequential (which makes this
unnecessary word #2).</li>
</ul>
</li>
</ol>
<p>A good distinction is that “generative” and “sequential” describe <em>what</em> these
models do, or what kind of data they deal with. “Autoregressive” describes <em>how</em>
these models do what they do: i.e. they describe properties of the network or
its architecture.</p>
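The sequential generation loop described in point 3 above (seed with \(x_1, \ldots, x_k\), predict \(x_{k+1}\), slide the window, repeat) can be sketched generically. The toy linear rule below is a stand-in for a trained network, chosen only so the example is self-contained:

```python
def generate(model, seed, n_steps):
    """Run a fixed-window autoregressive model sequentially:
    feed the last k values in, append the prediction, repeat."""
    k = len(seed)
    seq = list(seed)
    for _ in range(n_steps):
        seq.append(model(seq[-k:]))  # predict x_{t+1} from the last k values
    return seq

# Toy stand-in for a trained network: a hand-picked linear AR(2) rule.
model = lambda window: 0.5 * window[-1] + 0.5 * window[-2]

out = generate(model, seed=[0.0, 1.0], n_steps=3)
print(out)  # [0.0, 1.0, 0.5, 0.75, 0.625]
```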
<h2 id="some-architectures-and-applications">Some Architectures and Applications</h2>
<p>Deep autoregressive models have seen a good degree of success: below is a list
of some examples. Each architecture merits exposition and discussion, but
unfortunately there isn’t enough space here to do any of them justice.</p>
<ul>
<li><a href="https://arxiv.org/abs/1601.06759">PixelCNN by Google DeepMind</a> was probably
the first deep autoregressive model, and the progenitor of most of the other
models below. Ironically, the authors spend the bulk of the paper discussing a
recurrent model, PixelRNN, and consider PixelCNN as a “workaround” to avoid
excessive computation. However, PixelCNN is probably this paper’s most lasting
contribution.</li>
<li><a href="https://arxiv.org/abs/1701.05517">PixelCNN++ by OpenAI</a> is, unsurprisingly,
PixelCNN but with various improvements.</li>
<li><a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">WaveNet by Google
DeepMind</a> is
heavily inspired by PixelCNN, and models raw audio, not just encoded music.
They had to pull <a href="https://en.wikipedia.org/wiki/%CE%9C-law_algorithm">a neat trick from telecommunications/signals
processing</a> in order to
cope with the sheer size of audio (high-quality audio involves at least 16-bit
precision samples, which means a 65,536-way-softmax per time step!)</li>
<li><a href="https://arxiv.org/abs/1706.03762">Transformer, a.k.a. <em>the “attention is all you need” model</em> by Google
Brain</a> is now a mainstay of NLP, performing
very well at many NLP tasks and being incorporated into subsequent models like
<a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a>.</li>
</ul>
<p>These models have also found applications: for example, <a href="https://arxiv.org/abs/1610.10099">Google DeepMind’s
ByteNet can perform neural machine translation (in linear
time!)</a> and <a href="https://arxiv.org/abs/1610.00527">Google DeepMind’s Video Pixel
Network can model video</a>.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h2 id="some-thoughts-and-observations">Some Thoughts and Observations</h2>
<ol>
<li>Given previous values <script type="math/tex">x_1, x_2, ..., x_t</script>, these models do not output a
<em>value</em> for <script type="math/tex">x_{t+1}</script>, they output the <em>predictive probability
distribution</em> <script type="math/tex">P(x_{t+1} | x_1, x_2, ..., x_t)</script> for <script type="math/tex">x_{t+1}</script>.
<ul>
<li>If the <script type="math/tex">x</script>’s are discrete, then you can do this by outputting an <script type="math/tex">N</script>-way
softmaxed tensor, where <script type="math/tex">N</script> is the number of discrete classes. This is
what the original PixelCNN did, but this gets problematic when <script type="math/tex">N</script> is large
(e.g. in the case of WaveNet, where <script type="math/tex">N = 2^{16}</script>).</li>
<li>If the <script type="math/tex">x</script>’s are continuous, you can model the probability distribution
itself as the sum of basis functions, and have the model output the
parameters of these basis functions. This massively reduces the memory
footprint of the model, and was an important contribution of PixelCNN++.</li>
<li>Theoretically you could have an autoregressive model that <em>doesn’t</em> model
the conditional distribution… but most recent models do.</li>
</ul>
</li>
<li>Autoregressive models are supervised.
<ul>
<li>With the success and hype of GANs and VAEs, it is easy to assume that all
generative models are unsupervised: this is not true!</li>
<li>This means that training is stable and highly parallelizable, that it
is straightforward to tune hyperparameters, and that inference is
computationally inexpensive. We can also break out all the good stuff from
ML-101: train-valid-test splits, cross validation, loss metrics, etc. These
are all things that we lose when we resort to e.g. GANs.</li>
</ul>
</li>
<li>Autoregressive models work on both continuous and discrete data.
<ul>
<li>Autoregressive sequential models have worked for audio (WaveNet), images
(PixelCNN++) and text (Transformer): these models are very flexible in the
kind of data that they can model.</li>
<li>Contrast this to GANs, which (as far as I’m aware) cannot model discrete
data.</li>
</ul>
</li>
<li>Autoregressive models are very amenable to conditioning.
<ul>
<li>There are many options for conditioning! You can condition on both discrete
and continuous variables; you can condition at multiple time scales; you can
even condition on latent embeddings or the outputs of other neural networks.</li>
<li>There is one ostensible problem with using autoregressive models as
generative models: you can only condition on your data’s labels. I.e.
unlike a GAN, you cannot condition on random noise and expect the model to
shape the noise space into a semantically (stylistically) meaningful latent
space.</li>
<li>Google DeepMind followed up their original PixelRNN paper with <a href="https://arxiv.org/abs/1606.05328">another
paper</a> that describes one way to overcome
this problem. Briefly: to condition, they incorporate the latent vector into
the PixelCNN’s activation functions; to produce/learn the latent vectors,
they use a convolutional encoder; and to generate an image given a latent
vector, they replace the traditional deconvolutional decoder with a
conditional PixelCNN.</li>
<li>WaveNet goes even further and employs “global” and “local” conditioning (both
are achieved by incorporating the latent vectors into WaveNet’s activation
functions). The authors devise a battery of conditioning schemes to capture
speaker identity, linguistic features of input text, music genre, musical
instrument, etc.</li>
</ul>
</li>
<li>Generating output sequences of variable length is not straightforward.
<ul>
<li>Neither WaveNet nor PixelCNN needed to worry about a variable output length:
both audio and images are comprised of a fixed number of outputs (i.e. audio
is just <script type="math/tex">N</script> samples, and images are just <script type="math/tex">N^2</script> pixels).</li>
<li>Text, on the other hand, is different: sentences can be of variable length.
One would think that this is a nail in the coffin, but thankfully text is
discrete: the standard trick is to have a “stop token” that indicates that
the sentence is finished (i.e. model a full stop as its own token).</li>
<li>As far as I am aware, there is no prior literature on having both problems:
a variable-length output of continuous values.</li>
</ul>
</li>
<li>Autoregressive models can model multiple time scales
<ul>
<li>In the case of music, there are important patterns to model at multiple time
scales: individual musical notes drive correlations between audio samples at
the millisecond scale, and music exhibits rhythmic patterns over the course
of minutes. This is well illustrated by the following animation:</li>
</ul>
<figure>
<a href="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig1-Anim-160908-r01.gif"><img src="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig1-Anim-160908-r01.gif" alt="Audio at multiple time scales" /></a>
<figcaption>Audio exhibits patterns at multiple time scales. Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Google DeepMind</a>.</figcaption>
</figure>
<ul>
<li>There are two main ways to capture patterns at these many
different time scales: either make the receptive field of your model
<em>extremely</em> wide (e.g. through dilated convolutions), or condition your
model on a subsampled version of your generated output, which is in turn
produced by an unconditioned model.
<ul>
<li>Google DeepMind composes an unconditional PixelRNN with one or more
conditional PixelRNNs to form a so-called “multi-scale” PixelRNN: the
first PixelRNN generates a lower-resolution image that conditions the
subsequent PixelRNNs.</li>
<li>WaveNet employs a similar technique, which it calls “context stacks”.</li>
</ul>
</li>
</ul>
</li>
<li>How the hell can any of this stuff work?
<ul>
<li>RNNs are theoretically more expressive and powerful than autoregressive
models. However, recent work suggests that such infinite-horizon memory is
seldom achieved in practice.</li>
<li>To quote <a href="http://www.offconvex.org/2018/07/27/approximating-recurrent/">John Miller at the Berkeley AI Research
lab</a>:</li>
</ul>
<blockquote>
<p><strong>Recurrent models trained in practice are effectively feed-forward.</strong>
This could happen either because truncated backpropagation through time
cannot learn patterns significantly longer than <script type="math/tex">k</script> steps, or, more
provocatively, because models <em>trainable by gradient descent</em> cannot have
long-term memory.</p>
</blockquote>
</li>
</ol>
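The discrete case in point 1 above, outputting a softmax over <script type="math/tex">N</script> classes and sampling the next value from the resulting categorical distribution, can be sketched as follows (the variable names are mine):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Stand-in for a network's output head: logits over N discrete classes.
# (N = 2**16 for raw 16-bit audio, which is why the mixture-based output
# of PixelCNN++ is preferable at that scale.)
N = 8
logits = rng.standard_normal(N)

p = softmax(logits)         # predictive distribution P(x_{t+1} | x_1, ..., x_t)
x_next = rng.choice(N, p=p) # sample the next value from it
```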
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There’s actually a lot more nuance than meets the eye in this animation, but all I’m trying to illustrate is the feed-forward nature of autoregressive models. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I personally think it’s breathtaking that machines can do this. Imagine your phone keyboard’s word suggestions (those are autoregressive!) spitting out an entire novel. Or imagine weaving a sweater but you had to choose the color of every stitch, in order, in advance. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>In case you haven’t noticed, Google DeepMind seemed to have had an infatuation with autoregressive models back in 2016. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>

Modern Computational Methods for Bayesian Inference — A Reading List (2019-01-02, https://eigenfoo.xyz/bayesian-inference-reading)

<p>Lately I’ve been troubled by how little I actually knew about how Bayesian
inference <em>really worked</em>. I could explain to you <a href="https://maria-antoniak.github.io/2018/11/19/data-science-crash-course.html">many other machine learning
techniques</a>,
but with Bayesian modelling… well, there’s a model (which is basically the
likelihood, I think?), and then there’s a prior, and then, um…</p>
<p>What actually happens when you run a sampler? What makes inference
“variational”? And what is this automatic differentiation doing in my
variational inference? <em>Cue long sleepless nights, contemplating my own
ignorance.</em></p>
<p>So to celebrate the new year<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, I compiled a list of things to read — blog
posts, journal papers, books, anything that would help me understand (or at
least, appreciate) the math and computation that happens when I press the <em>Magic
Inference Button™</em>. Again, this reading list isn’t focused on how to use
Bayesian modelling for a <em>specific</em> use case<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>; it’s focused on how modern
computational methods for Bayesian inference work <em>in general</em>.</p>
<p>So without further ado…</p>
<h2 id="markov-chain-monte-carlo">Markov-Chain Monte Carlo</h2>
<h3 id="for-the-uninitiated">For the uninitiated</h3>
<ol>
<li><a href="https://twiecki.github.io/blog/2015/11/10/mcmc-sampling/">MCMC Sampling for
Dummies</a> by Thomas
Wiecki. A basic introduction to MCMC with accompanying Python snippets. The
Metropolis sampler is used as an introduction to sampling.</li>
<li><a href="http://www.mcmchandbook.net/HandbookChapter1.pdf">Introduction to Markov Chain Monte
Carlo</a> by Charles Geyer.
The first chapter of the aptly-named <a href="http://www.mcmchandbook.net/"><em>Handbook of Markov Chain Monte
Carlo</em></a>.</li>
</ol>
<h3 id="hamiltonian-monte-carlo-and-the-no-u-turn-sampler">Hamiltonian Monte Carlo and the No-U-Turn Sampler</h3>
<ol>
<li><a href="https://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html">Hamiltonian Monte Carlo
explained</a>.
A visual and intuitive explanation of HMC: great for starters.</li>
<li><a href="https://arxiv.org/abs/1701.02434">A Conceptual Introduction to Hamiltonian Monte
Carlo</a> by Michael Betancourt. An excellent
paper for a solid conceptual understanding and principled intuition for HMC.</li>
<li><a href="https://colindcarroll.com/2019/04/06/exercises-in-automatic-differentiation-using-autograd-and-jax/">Exercises in Automatic Differentiation using <code class="highlighter-rouge">autograd</code> and
<code class="highlighter-rouge">jax</code></a>
by Colin Carroll. This is the first in a series of blog posts that explain
HMC from the very beginning. See also <a href="https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/">Hamiltonian Monte Carlo from
Scratch</a>,
<a href="https://colindcarroll.com/2019/04/21/step-size-adaptation-in-hamiltonian-monte-carlo/">Step Size Adaptation in Hamiltonian Monte
Carlo</a>,
and <a href="https://colindcarroll.com/2019/04/28/choice-of-symplectic-integrator-in-hamiltonian-monte-carlo/">Choice of Symplectic Integrator in Hamiltonian Monte
Carlo</a>.</li>
<li><a href="https://arxiv.org/abs/1111.4246">The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte
Carlo</a> by Matthew Hoffman and Andrew Gelman.
The original NUTS paper.</li>
<li><a href="http://www.mcmchandbook.net/HandbookChapter5.pdf">MCMC Using Hamiltonian
Dynamics</a> by Radford Neal.</li>
<li><a href="https://colindcarroll.com/talk/hamiltonian-monte-carlo/">Hamiltonian Monte Carlo in
PyMC3</a> by Colin
Carroll.</li>
</ol>
<h3 id="sequential-monte-carlo-and-particle-filters">Sequential Monte Carlo and particle filters</h3>
<ol>
<li><a href="https://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf">An Introduction to Sequential Monte
Methods</a>
by Arnaud Doucet, Nando de Freitas and Neil Gordon. This chapter from <a href="https://www.springer.com/us/book/9780387951461">the
authors’ textbook on SMC</a>
provides motivation for using SMC methods, and gives a brief introduction to
a basic particle filter.</li>
<li><a href="http://www.stats.ox.ac.uk/~doucet/smc_resources.html">Sequential Monte Carlo Methods & Particle Filters
Resources</a> by Arnaud
Doucet. A list of resources on SMC and particle filters: way more than you
probably ever need to know about them.</li>
</ol>
<h3 id="other-sampling-methods">Other sampling methods</h3>
<ol>
<li>Chapter 11 (Sampling Methods) of <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book">Pattern Recognition and Machine
Learning</a>
by Christopher Bishop. Covers rejection, importance, Metropolis-Hastings,
Gibbs and slice sampling. Perhaps not as rampantly useful as NUTS, but good
to know nevertheless.</li>
<li><a href="https://chi-feng.github.io/mcmc-demo/">The Markov-chain Monte Carlo Interactive
Gallery</a> by Chi Feng. A fantastic
library of visualizations of various MCMC samplers.</li>
</ol>
<h2 id="variational-inference">Variational Inference</h2>
<h3 id="for-the-uninitiated-1">For the uninitiated</h3>
<ol>
<li><a href="http://willwolf.io/2018/11/11/em-for-lda/">Deriving
Expectation-Maximization</a> by Will
Wolf. The first blog post in a series that builds from EM all the way to VI.
Also check out <a href="http://willwolf.io/2018/11/23/mean-field-variational-bayes/">Deriving Mean-Field Variational
Bayes</a>.</li>
<li><a href="https://arxiv.org/abs/1601.00670">Variational Inference: A Review for
Statisticians</a> by David Blei, Alp
Kucukelbir and Jon McAuliffe. A high-level overview of variational
inference: the authors go over one example (performing VI on GMMs) in depth.</li>
<li>Chapter 10 (Approximate Inference) of <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book">Pattern Recognition and Machine
Learning</a>
by Christopher Bishop.</li>
</ol>
<h3 id="automatic-differentiation-variational-inference-advi">Automatic differentiation variational inference (ADVI)</h3>
<ol>
<li><a href="https://arxiv.org/abs/1603.00788">Automatic Differentiation Variational
Inference</a> by Alp Kucukelbir, Dustin Tran
et al. The original ADVI paper.</li>
<li><a href="https://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">Automatic Variational Inference in
Stan</a>
by Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman and David Blei.</li>
</ol>
<h2 id="open-source-software-for-bayesian-inference">Open-Source Software for Bayesian Inference</h2>
<p>There are many open-source software libraries for Bayesian modelling and
inference, and it is instructive to look into the inference methods that they do
(or do not!) implement.</p>
<ol>
<li><a href="http://mc-stan.org/">Stan</a></li>
<li><a href="http://docs.pymc.io/">PyMC3</a></li>
<li><a href="http://pyro.ai/">Pyro</a></li>
<li><a href="https://www.tensorflow.org/probability/">Tensorflow Probability</a></li>
<li><a href="http://edwardlib.org/">Edward</a></li>
<li><a href="https://greta-stats.org/">Greta</a></li>
<li><a href="https://dotnet.github.io/infer/">Infer.NET</a></li>
<li><a href="https://www.mrc-bsu.cam.ac.uk/software/bugs/">BUGS</a></li>
<li><a href="http://mcmc-jags.sourceforge.net/">JAGS</a></li>
</ol>
<h2 id="further-topics">Further Topics</h2>
<p>Bayesian inference doesn’t stop at MCMC and VI: there is bleeding-edge research
being done on other methods of inference. While they aren’t ready for real-world
use, it is interesting to see what they are.</p>
<h3 id="approximate-bayesian-computation-abc-and-likelihood-free-methods">Approximate Bayesian computation (ABC) and likelihood-free methods</h3>
<ol>
<li><a href="https://arxiv.org/abs/1001.2058">Likelihood-free Monte Carlo</a> by Scott
Sisson and Yanan Fan.</li>
</ol>
<h3 id="expectation-propagation">Expectation propagation</h3>
<ol>
<li><a href="https://arxiv.org/abs/1412.4869">Expectation propagation as a way of life: A framework for Bayesian inference
on partitioned data</a> by Aki Vehtari, Andrew
Gelman, et al.</li>
</ol>
<h3 id="operator-variational-inference-opvi">Operator variational inference (OPVI)</h3>
<ol>
<li><a href="https://arxiv.org/abs/1610.09033">Operator Variational Inference</a> by Rajesh
Ranganath, Jaan Altosaar, Dustin Tran and David Blei. The original OPVI
paper.</li>
</ol>
<p>(I’ve tried to include as many relevant and helpful resources as I could find,
but if you feel like I’ve missed something, <a href="https://twitter.com/@_eigenfoo">drop me a
line</a>!)</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://twitter.com/year_progress/status/1079889949871300608">Relevant tweet here.</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>If that’s what you’re looking for, check out my <a href="https://eigenfoo.xyz/bayesian-modelling-cookbook">Bayesian modelling cookbook</a> or <a href="https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html">Michael Betancourt’s excellent essay on a principled Bayesian workflow</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoAn annotated reading list on modern computational methods for Bayesian inference — Markov chain Monte Carlo (MCMC), variational inference (VI) and some other (more experimental) methods.Modelling Hate Speech on Reddit — A Three-Act Play (Slide Deck)2018-11-03T00:00:00+00:002017-11-08T00:00:00+00:00https://eigenfoo.xyz/reddit-slides<p>This is a follow-up post to my first post on a recent project to <a href="https://eigenfoo.xyz/reddit-clusters/">model hate
speech on Reddit</a>. If you haven’t taken a
look at my first post, please do!</p>
<p>I recently gave a talk on the technical, data science side of the project,
describing not just the final result, but also the trajectory of the whole
project: stumbling blocks, dead ends and all. Below is the slide deck, as well
as the speaker notes. Enjoy!</p>
<h2 id="abstract">Abstract</h2>
<p>Reddit is one of the most popular discussion websites today, and is famously
broad-minded in what it allows to be said on its forums: however, where there is
free speech, there are invariably pockets of hate speech.</p>
<p>In this talk, I present a recent project to model hate speech on Reddit. In
three acts, I chronicle the thought processes and stumbling blocks of the
project, with each act applying a different form of machine learning: supervised
learning, topic modelling and text clustering. I conclude with the current state
of the project: a system that allows the modelling and summarization of entire
subreddits, and possible future directions. Rest assured that both the talk and
the slides have been scrubbed to be safe for work!</p>
<h2 id="slides">Slides</h2>
<p>(Don’t forget to take a look at the speaker notes!)</p>
<style>
.responsive-wrap iframe{ max-width: 100%;}
</style>
<div class="responsive-wrap">
<!-- this is the embed code provided by Google -->
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vS9wBAwScepPz3vmvyMrq-osBfIGzL7C3wArXmL3ky_A2dfaqlVSshTz2CyHuMibQBX3Ej6QCsZ0qv_/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<!-- Google embed ends -->
</div>George HoA talk I gave about a recent project to model hate speech on Reddit. In this blog post, I describe the thought processes behind the project, and the stumbling blocks encountered along the way.Probabilistic and Bayesian Matrix Factorizations for Text Clustering2018-10-13T00:00:00+00:002018-10-13T00:00:00+00:00https://eigenfoo.xyz/matrix-factorizations<p>Natural language processing is in a curious place right now. It was always a
late bloomer (as far as machine learning subfields go), and it’s not immediately
obvious how close the field is to viable, large-scale, production-ready
techniques (in the same way that, say, <a href="https://clarifai.com/models/">computer vision
is</a>). For example, <a href="https://ruder.io">Sebastian
Ruder</a> predicted that the field is <a href="https://thegradient.pub/nlp-imagenet/">close to a watershed
moment</a>, and that soon we’ll have
downloadable language models. However, <a href="https://amarasovic.github.io/">Ana
Marasović</a> points out that there is <a href="https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/">a tremendous
amount of work demonstrating
that</a>:</p>
<blockquote>
<p>“despite good performance on benchmark datasets, modern NLP techniques are
nowhere near the skill of humans at language understanding and reasoning when
making sense of novel natural language inputs”.</p>
</blockquote>
<p>I am confident that I am <em>very</em> bad at making lofty predictions about the
future. Instead, I’ll talk about something I know a bit about: simple solutions
to concrete problems, with some Bayesianism thrown in for good measure
:grinning:.</p>
<p>This blog post summarizes some literature on probabilistic and Bayesian
matrix factorization methods, keeping an eye out for applications to one
specific task in NLP: text clustering. It’s exactly what it sounds like, and
there’s been a fair amount of success in applying text clustering to many other
NLP tasks (e.g. check out these examples in <a href="https://www-users.cs.umn.edu/~hanxx023/dmclass/scatter.pdf">document
organization</a>,
<a href="http://jmlr.csail.mit.edu/papers/volume3/bekkerman03a/bekkerman03a.pdf">corpus</a>
<a href="https://www.cs.technion.ac.il/~rani/el-yaniv-papers/BekkermanETW01.pdf">summarization</a>
and <a href="http://www.kamalnigam.com/papers/emcat-aaai98.pdf">document
classification</a>).</p>
<p>What follows is a literature review of three matrix factorization techniques for
machine learning: one classical, one probabilistic and one Bayesian. I also
experimented with applying these methods to text clustering: I gave a guest
lecture on my results to a graduate-level machine learning class at The Cooper
Union (the slide deck is below). Dive in!</p>
<h2 id="non-negative-matrix-factorization-nmf">Non-Negative Matrix Factorization (NMF)</h2>
<p>NMF is a <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization">very
well-known</a>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html">matrix
factorization</a>
<a href="https://arxiv.org/abs/1401.5226">technique</a>, perhaps most famous for its
applications in <a href="http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/">collaborative filtering and the Netflix
Prize</a>.</p>
<p>Factorize your (entrywise non-negative) <script type="math/tex">m \times n</script> matrix <script type="math/tex">V</script> as
<script type="math/tex">V = WH</script>, where <script type="math/tex">W</script> is <script type="math/tex">m \times p</script> and <script type="math/tex">H</script> is <script type="math/tex">p \times n</script>. <script type="math/tex">p</script>
is the dimensionality of your latent space, and each latent dimension usually
comes to quantify something with semantic meaning. There are several algorithms
to compute this factorization, but Lee and Seung’s <a href="https://dl.acm.org/citation.cfm?id=3008829">multiplicative update
rule</a> (originally published in NIPS
2000) is most popular.</p>
<p>Fairly simple: enough said, I think.</p>
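<p>Still, it may help to see the multiplicative updates in action. Below is a minimal NumPy sketch of Lee and Seung’s update rules; the toy matrix, the rank, and the iteration count are all made up for illustration:</p>

```python
import numpy as np

def nmf(V, p, n_iter=500, eps=1e-9, seed=0):
    """Factorize entrywise non-negative V (m x n) as W @ H, with W (m x p)
    and H (p x n), using Lee and Seung's multiplicative update rules."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, p))
    H = rng.random((p, n))
    for _ in range(n_iter):
        # Multiplicative updates preserve non-negativity automatically.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy example: a non-negative matrix of non-negative rank 2 should be
# reconstructed almost exactly by a rank-2 factorization.
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 5.0]])
W, H = nmf(V, p=2)
print(np.abs(V - W @ H).max())  # reconstruction error, small after 500 iterations
```

<p>Note that the updates never subtract anything, which is exactly why <code class="highlighter-rouge">W</code> and <code class="highlighter-rouge">H</code> stay non-negative throughout.</p>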
<h2 id="probabilistic-matrix-factorization-pmf">Probabilistic Matrix Factorization (PMF)</h2>
<p>Originally introduced as a paper at <a href="https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization">NIPS
2007</a>,
<em>probabilistic matrix factorization</em> is essentially the exact same model as NMF,
but with uncorrelated (a.k.a. “spherical”) multivariate Gaussian priors placed
on the rows of <script type="math/tex">U</script> and <script type="math/tex">V</script>. Expressed as a graphical model, PMF
would look like this:</p>
<figure>
<a href="/assets/images/pmf.png"><img style="float: middle" src="/assets/images/pmf.png" alt="Graphical model (using plate notation) for probabilistic matrix factorization (PMF)" /></a>
</figure>
<p>Note that the priors are placed on the <em>rows</em> of the <script type="math/tex">U</script> and <script type="math/tex">V</script> matrices.</p>
<p>The authors then (somewhat disappointingly) proceed to find the MAP estimate of
the <script type="math/tex">U</script> and <script type="math/tex">V</script> matrices. They show that maximizing the posterior is
equivalent to minimizing the sum-of-squared-errors loss function with two
quadratic regularization terms:</p>
<script type="math/tex; mode=display">\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} {I_{ij} (R_{ij} - U_i^T V_j)^2} +
\frac{\lambda_U}{2} \| U \|_{Fro}^2 +
\frac{\lambda_V}{2} \| V \|_{Fro}^2</script>
<p>where <script type="math/tex">\| \cdot \|_{Fro}</script> denotes the Frobenius norm, and <script type="math/tex">I_{ij}</script> is 1 if document
<script type="math/tex">i</script> contains word <script type="math/tex">j</script>, and 0 otherwise.</p>
<p>This loss function can be minimized via gradient descent, and implemented in
your favorite deep learning framework (e.g. Tensorflow or PyTorch).</p>
<p>The problem with this approach is that while the MAP estimate is often a
reasonable point in low dimensions, it becomes very strange in high dimensions,
and is usually not informative or special in any way. Read <a href="https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
for more.</p>
<h2 id="bayesian-probabilistic-matrix-factorization-bpmf">Bayesian Probabilistic Matrix Factorization (BPMF)</h2>
<p>Strictly speaking, PMF is not a Bayesian model. After all, there aren’t any
priors or posteriors, only fixed hyperparameters and a MAP estimate. <em>Bayesian
probabilistic matrix factorization</em>, originally published by <a href="https://dl.acm.org/citation.cfm?id=1390267">researchers from
the University of Toronto</a> is a
fully Bayesian treatment of PMF.</p>
<p>Instead of saying that the rows/columns of U and V are normally distributed with
zero mean and some precision matrix, we place hyperpriors on the mean vector and
precision matrices. The specific priors are Wishart priors on the precision
matrices (with scale matrix <script type="math/tex">W_0</script> and <script type="math/tex">\nu_0</script> degrees of freedom), and
Gaussian priors on the means (with mean <script type="math/tex">\mu_0</script> and covariance proportional to
the inverse of the Wishart-distributed precision). Expressed as a graphical model, BPMF
would look like this:</p>
<figure>
<a href="/assets/images/bpmf.png"><img style="float: middle" src="/assets/images/bpmf.png" alt="Graphical model (using plate notation) for Bayesian probabilistic matrix factorization (BPMF)" /></a>
</figure>
<p>Note that, as above, the priors are placed on the <em>rows</em> of the <script type="math/tex">U</script> and <script type="math/tex">V</script>
matrices, and that <script type="math/tex">n</script> is the dimensionality of latent space (i.e. the number
of latent dimensions in the factorization).</p>
<p>The authors then sample from the posterior distribution of <script type="math/tex">U</script> and <script type="math/tex">V</script> using
a Gibbs sampler. Sampling takes a long time: somewhere between 5 and 180 hours,
depending on how many samples you want. Nevertheless, the authors demonstrate
that BPMF can achieve more accurate and more robust results on the Netflix data
set.</p>
<p>I would propose two changes to the original paper:</p>
<ol>
<li>Use an LKJ prior on the covariance matrices instead of a Wishart prior.
<a href="https://docs.pymc.io/notebooks/LKJ.html">According to Michael Betancourt and the PyMC3 docs, this is more numerically
stable</a>, and will lead to better
inference.</li>
<li>Use a more robust sampler such as NUTS (instead of a Gibbs sampler), or even
resort to variational inference. The paper makes it clear that BPMF is a
computationally painful endeavor, so any speedup to the method would be a
great help. It seems to me that for practical real-world applications to
collaborative filtering, we would want to use variational inference. Netflix
ain’t waiting 5 hours for their recommendations.</li>
</ol>
<h2 id="application-to-text-clustering">Application to Text Clustering</h2>
<p>Most of the work in these matrix factorization techniques focus on
dimensionality reduction: that is, the problem of finding two factor matrices
that faithfully reconstruct the original matrix when multiplied together.
However, I was interested in applying the exact same techniques to a separate
task: text clustering.</p>
<p>A natural question is: why is matrix factorization<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> a good technique to use
for text clustering? Because it is simultaneously a clustering and a feature
engineering technique: not only does it offer us a latent representation of the
original data, but it also gives us a way to easily <em>reconstruct</em> the original
data from the latent variables! This is something that <a href="https://eigenfoo.xyz/lda-sucks">latent Dirichlet
allocation</a>, for instance, cannot do.</p>
<p>Matrix factorization lives an interesting double life: clustering technique by
day, feature transformation technique by night. <a href="http://charuaggarwal.net/text-cluster.pdf">Aggarwal and
Zhai</a> suggest that chaining matrix
factorization with some other clustering technique (e.g. agglomerative
clustering or topic modelling) is common practice and is called <em>concept
decomposition</em>, but I haven’t seen any other source back this up.</p>
<p>I experimented with using these techniques to cluster subreddits (<a href="https://eigenfoo.xyz/reddit-clusters">sound
familiar?</a>). In a nutshell, nothing seemed
to work out very well, and I opine on why I think that’s the case in the slide
deck below. This talk was delivered to a graduate-level course in frequentist
machine learning. Don’t forget to take a look at the speaker notes too!</p>
<style>
.responsive-wrap iframe{ max-width: 100%;}
</style>
<div class="responsive-wrap">
<!-- this is the embed code provided by Google -->
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT_yB6dMJCnnwKRtkGbdx90lhYGGH329QAGrYw8SaR2mCh0VuocMWGEVJ2XhFNp44JQtPV_vOlQkslo/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<!-- Google embed ends -->
</div>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>which is, by the way, a <a href="http://scikit-learn.org/stable/modules/decomposition.html">severely underappreciated technique in machine learning</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoThis blog post summarizes some literature on probabilistic and Bayesian matrix factorization methods, keeping an eye out for applications to one specific task in NLP: text clustering.Multi-Armed Bandits and Conjugate Models — Bayesian Reinforcement Learning (Part 1)2018-08-31T00:00:00+00:002018-08-31T00:00:00+00:00https://eigenfoo.xyz/bayesian-bandits<blockquote>
<p>This is the first of a two-part series about Bayesian bandit algorithms. Check
out the second post <a href="https://eigenfoo.xyz/bayesian-bandits-2/">here</a>.</p>
</blockquote>
<p>Let’s talk about Bayesianism. It’s developed a reputation (not entirely
justified, but not entirely unjustified either) for being too mathematically
sophisticated or too computationally intensive to work at scale. For instance,
inferring from a Gaussian mixture model is fraught with computational problems
(hierarchical funnels, multimodal posteriors, etc.), and may take a seasoned
Bayesian anywhere between a day and a month to do well. On the other hand, other
blunt hammers of estimation are as easy as a maximum likelihood estimate:
something you could easily get a SQL query to do if you wanted to.</p>
<p>In this blog post I hope to show that there is more to Bayesianism than just
MCMC sampling and suffering, by demonstrating a Bayesian approach to a classic
reinforcement learning problem: the <em>multi-armed bandit</em>.</p>
<p>The problem is this: imagine a gambler at a row of slot machines (each machine
being a “one-armed bandit”), who must devise a strategy so as to maximize
rewards. This strategy includes which machines to play, how many times to play
each machine, in which order to play them, and whether to continue with the
current machine or try a different machine.</p>
<p>This problem is a central problem in decision theory and reinforcement learning:
the agent (our gambler) starts out in a state of ignorance, but learns through
interacting with its environment (playing slots). For more details, Cam
Davidson-Pilon has a great introduction to multi-armed bandits in Chapter 6 of
his book <a href="https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb"><em>Bayesian Methods for
Hackers</em></a>,
and Tor Lattimore and Csaba Szepesvári cover a breathtaking amount of the
underlying theory in their book <a href="http://banditalgs.com/"><em>Bandit Algorithms</em></a>.</p>
<p>So let’s get started! I assume that you are familiar with:</p>
<ul>
<li>some basic probability, at least enough to know some distributions: normal,
Bernoulli, binomial…</li>
<li>some basic Bayesian statistics, at least enough to understand what a
<a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> (and
conjugate model) is, and why one might like them.</li>
<li><a href="https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/">Python generators and the <code class="highlighter-rouge">yield</code>
keyword</a>,
to understand some of the code I’ve written<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</li>
</ul>
<p>Dive in!</p>
<h2 id="the-algorithm">The Algorithm</h2>
<p>The algorithm is straightforward. The description below is taken from Cam
Davidson-Pilon over at Data Origami<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>For each round,</p>
<ol>
<li>Sample a random variable <script type="math/tex">X_b</script> from the prior of bandit <script type="math/tex">b</script>, for all
<script type="math/tex">b</script>.</li>
<li>Select the bandit with largest sample, i.e. select bandit <script type="math/tex">B =
\text{argmax}(X_b)</script>.</li>
<li>Observe the result of pulling bandit <script type="math/tex">B</script>, and update your prior on bandit
<script type="math/tex">B</script> using the conjugate model update rule.</li>
<li>Repeat!</li>
</ol>
<p>What I find remarkable about this is how dumbfoundingly simple it is! No MCMC
sampling, no <script type="math/tex">\hat{R}</script>s to diagnose, no pesky divergences… all it requires is
a conjugate model, and the rest is literally just counting.</p>
<p><strong>NB:</strong> This algorithm is technically known as <em>Thompson sampling</em>, and is only
one of many algorithms out there. The main difference is that there are other
ways to go from our current priors to a decision on which bandit to play
next. E.g. instead of simply sampling from our priors, we could use the
upper bound of the 90% credible region, or some dynamic quantile of the
posterior (as in Bayes UCB). See Data Origami<sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup> for more information.</p>
<h3 id="stochastic-aka-stationary-bandits">Stochastic (a.k.a. stationary) bandits</h3>
<p>Let’s take this algorithm for a spin! Assume we have rewards which are Bernoulli
distributed (this would be the situation we face when e.g. modelling
click-through rates). The conjugate prior for the Bernoulli distribution is the
Beta distribution (this is a special case of the Beta-Binomial model).</p>
<script src="https://gist.github.com/eigenfoo/3d8d318f5bd8fdea24f7b12936de77b5.js"></script>
<p>Here, <code class="highlighter-rouge">pull</code> returns the result of pulling on the <code class="highlighter-rouge">arm</code>‘th bandit, and
<code class="highlighter-rouge">make_bandits</code> is just a factory function for <code class="highlighter-rouge">pull</code>.</p>
<p>The <code class="highlighter-rouge">bayesian_strategy</code> function actually implements the algorithm. We only need
to keep track of the number of times we win and the number of times we played
(<code class="highlighter-rouge">num_rewards</code> and <code class="highlighter-rouge">num_trials</code>, respectively). It samples from all current
<code class="highlighter-rouge">np.random.beta</code> priors (where the original prior was a <script type="math/tex">\text{Beta}(2,
2)</script>, which is symmetric about 0.5 and explains the odd-looking <code class="highlighter-rouge">a=2+</code> and
<code class="highlighter-rouge">b=2+</code> there), picks the <code class="highlighter-rouge">np.argmax</code>, <code class="highlighter-rouge">pull</code>s that specific bandit, and updates
<code class="highlighter-rouge">num_rewards</code> and <code class="highlighter-rouge">num_trials</code>.</p>
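<p>The whole strategy is compact enough to sketch in a few lines. Here is a condensed, self-contained version of the same loop, with hypothetical win rates:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
probs = [0.3, 0.5, 0.7]                # hypothetical true win rates
num_rewards = np.zeros(len(probs))
num_trials = np.zeros(len(probs))

for _ in range(2000):
    # Draw one sample from each bandit's Beta(2 + wins, 2 + losses) posterior...
    samples = rng.beta(2 + num_rewards, 2 + num_trials - num_rewards)
    arm = int(np.argmax(samples))      # ...play the arm with the largest sample...
    reward = rng.random() < probs[arm]
    num_rewards[arm] += reward         # ...and update by literally counting.
    num_trials[arm] += 1

print(num_trials)  # the best arm (index 2) receives the bulk of the pulls
```
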
<p>I’ve omitted the data visualization code here, but if you want to see it, check
out the <a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">Jupyter notebook on my
GitHub</a>.</p>
<figure>
<a href="/assets/images/beta-binomial.png"><img style="float: middle" src="/assets/images/beta-binomial.png" alt="Posterior distribution after several pulls for the Beta-Binomial model" /></a>
</figure>
<h3 id="generalizing-to-conjugate-models">Generalizing to conjugate models</h3>
<p>In fact, this algorithm isn’t just limited to Bernoulli-distributed rewards: it
will work for any <a href="https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions">conjugate
model</a>!
Here I implement the Gamma-Poisson model (that is, Poisson distributed rewards,
with a Gamma conjugate prior) to illustrate how extensible this framework is.
(Who cares about Poisson distributed rewards, you ask? Anyone who worries about
returning customers, for one!)</p>
<p>Here’s what we need to change:</p>
<ul>
<li>The rewards distribution on line 5 (in practice, you don’t get to pick this,
so <em>technically</em> there’s nothing to change if you’re doing this in
production!)</li>
<li>The sampling from the prior on lines 17–18</li>
<li>The variables you need to keep track of and update rule on lines 12–13 and
24–25.</li>
</ul>
<p>Without further ado:</p>
<script src="https://gist.github.com/eigenfoo/e9a9933d94524e6dee717276c6b6f732.js"></script>
<figure>
<a href="/assets/images/gamma-poisson.png"><img style="float: middle" src="/assets/images/gamma-poisson.png" alt="Posterior distribution after several pulls for the Gamma-Poisson model" /></a>
</figure>
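<p>To see just how little the conjugate update involves, here is a minimal sketch of the Gamma-Poisson loop (the Poisson rates are hypothetical):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
rates = [1.0, 2.0, 3.0]             # hypothetical true Poisson means
alpha = np.ones(len(rates))         # Gamma prior shape parameters
beta = np.ones(len(rates))          # Gamma prior rate parameters

for _ in range(2000):
    # numpy's gamma sampler is parameterized by shape and *scale* (= 1 / rate).
    samples = rng.gamma(alpha, 1.0 / beta)
    arm = int(np.argmax(samples))
    reward = rng.poisson(rates[arm])
    alpha[arm] += reward            # conjugate update: shape accumulates rewards,
    beta[arm] += 1                  # rate accumulates the number of pulls

print(alpha / beta)  # posterior means; the best arm's approaches its true rate
```
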
<p>This really demonstrates how lean and mean conjugate models can be, especially
considering how much of a pain MCMC or approximate inference methods would be,
compared to literal <em>counting</em>. Conjugate models aren’t just textbook examples:
they’re <em>(gasp)</em> actually useful!</p>
<h3 id="generalizing-to-arbitrary-rewards-distributions">Generalizing to arbitrary rewards distributions</h3>
<p>OK, so if we have a conjugate model, we can use Thompson sampling to solve the
multi-armed bandit problem. But what if our rewards distribution doesn’t have a
conjugate prior, or what if we don’t even <em>know</em> our rewards distribution?</p>
<p>In general this problem is very difficult to solve. Theoretically, we could
place some fairly uninformative prior on our rewards, and after every pull we
could run MCMC to get our posterior, but that doesn’t scale, especially for the
online algorithms that we have in mind. Luckily a recent paper by Agrawal and
Goyal<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> gives us some help, <em>if we assume rewards are bounded on the interval
<script type="math/tex">[0, 1]</script></em> (of course, if we have bounded rewards, then we can just normalize
them by their maximum value to get rewards between 0 and 1).</p>
<p>This solution bootstraps the first Beta-Bernoulli model to this new situation.
Here’s what happens:</p>
<ol>
<li>Sample a random variable <script type="math/tex">X_b</script> from the (Beta) prior of bandit <script type="math/tex">b</script>, for
all <script type="math/tex">b</script>.</li>
<li>Select the bandit with largest sample, i.e. select bandit <script type="math/tex">B =
\text{argmax}(X_b)</script>.</li>
<li>Observe the reward <script type="math/tex">R</script> from bandit <script type="math/tex">B</script>.</li>
<li><strong>Observe the outcome <script type="math/tex">r</script> from a Bernoulli trial with probability of success <script type="math/tex">R</script>.</strong></li>
<li>Update posterior of <script type="math/tex">B</script> with this observation <script type="math/tex">r</script>.</li>
<li>Repeat!</li>
</ol>
<p>Here I do this for the logit-normal distribution (i.e. a random variable whose
logit is normally distributed). Note that <code class="highlighter-rouge">scipy.special.expit</code> is the inverse of the logit
function.</p>
<script src="https://gist.github.com/eigenfoo/7a397fef8aaa028c5119c9f86860d72e.js"></script>
<figure>
<a href="/assets/images/bounded.png"><img style="float: middle" src="/assets/images/bounded.png" alt="Posterior distribution after several pulls with an arbitrary reward distribution (e.g. the logit normal)" /></a>
</figure>
<h2 id="final-remarks">Final Remarks</h2>
<p>None of this theory is new: I’m just advertising it :blush:. See Cam
Davidson-Pilon’s great blog post about Bayesian bandits<sup id="fnref:2:2"><a href="#fn:2" class="footnote">2</a></sup> for a much more
in-depth treatment, and of course, read around papers on arXiv if you want to go
deeper!</p>
<p>Also, if you want to see all the code that went into this blog post, check out
<a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">the notebook
here</a>.</p>
<blockquote>
<p>This is the first of a two-part series about Bayesian bandit algorithms. Check
out the second post <a href="https://eigenfoo.xyz/bayesian-bandits-2/">here</a>.</p>
</blockquote>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’ve hopped on board the functional programming bandwagon, and couldn’t help but think that to demonstrate this idea, I didn’t need a framework, a library or even a class. Just two functions! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Davidson-Pilon, Cameron. “Multi-Armed Bandits.” DataOrigami, 6 Apr. 2013, <a href="https://dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits">dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits</a> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote">↩<sup>3</sup></a></p>
</li>
<li id="fn:3">
<p><a href="https://arxiv.org/abs/1111.1797">arXiv:1111.1797</a> [cs.LG] <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoIn this blog post I hope to show that there is more to Bayesianism than just MCMC sampling and suffering, by demonstrating a Bayesian approach to a classic reinforcement learning problem: the _multi-armed bandit_.Cookbook — Bayesian Modelling with PyMC32018-06-19T00:00:00+00:002018-06-24T00:00:00+00:00https://eigenfoo.xyz/bayesian-modelling-cookbook<p>Recently I’ve started using <a href="https://github.com/pymc-devs/pymc3">PyMC3</a> for
Bayesian modelling, and it’s an amazing piece of software! The API only exposes
as much of the heavy machinery of MCMC as you need — by which I mean, just the
<code class="highlighter-rouge">pm.sample()</code> method (a.k.a., as <a href="http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/">Thomas
Wiecki</a> puts it, the
<em>Magic Inference Button™</em>). This really frees up your mind to think about your
data and model, which is the heart and soul of data science!</p>
<p>That being said however, I quickly realized that the water gets very deep very
fast: I explored my data set, specified a hierarchical model that made sense to
me, hit the <em>Magic Inference Button™</em>, and… uh, what now? I blinked at the
angry red warnings the sampler spat out.</p>
<p>So began my long, rewarding and ongoing exploration of Bayesian modelling. This
is a compilation of notes, tips, tricks and recipes that I’ve collected from
everywhere: papers, documentation, peppering my <a href="https://twitter.com/twiecki">more
experienced</a>
<a href="https://twitter.com/aseyboldt">colleagues</a> with questions. It’s still very much
a work in progress, but hopefully somebody else finds it useful!</p>
<p><img style="float: middle" width="600" src="https://cdn.rawgit.com/pymc-devs/pymc3/master/docs/logos/svg/PyMC3_banner.svg" alt="PyMC3 logo" /></p>
<h2 id="for-the-uninitiated">For the Uninitiated</h2>
<ul>
<li>First of all, <em>welcome!</em> It’s a brave new world out there — where statistics
is cool, Bayesian and (if you’re lucky) even easy. Dive in!</li>
</ul>
<h3 id="bayesian-modelling">Bayesian modelling</h3>
<ul>
<li>
<p>If you don’t know any probability, I’d recommend <a href="https://betanalpha.github.io/assets/case_studies/probability_theory.html">Michael
Betancourt’s</a>
crash-course in practical probability theory.</p>
</li>
<li>
<p>For an introduction to general Bayesian methods and modelling, I really liked
<a href="http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/">Cam Davidson Pilon’s <em>Bayesian Methods for
Hackers</em></a>:
it really made the whole “thinking like a Bayesian” thing click for me.</p>
</li>
<li>
<p>If you’re willing to spend some money, I’ve heard that <a href="https://sites.google.com/site/doingbayesiandataanalysis/"><em>Doing Bayesian Data
Analysis</em> by
Kruschke</a> (a.k.a.
<em>“the puppy book”</em>) is for the bucket list.</p>
</li>
<li>
<p>Here we come to a fork in the road. The central problem in Bayesian modelling
is this: given data and a probabilistic model that we think models this data,
how do we find the posterior distribution of the model’s parameters? There are
currently two good solutions to this problem. One is Markov-chain Monte Carlo
sampling (a.k.a. MCMC sampling), and the other is variational inference
(a.k.a. VI). Both methods are mathematical Death Stars: extremely powerful but
incredibly complicated. Nevertheless, I think it’s important to get at least a
hand-wavy understanding of what these methods are. If you’re new to all this,
my personal recommendation is to invest your time in learning MCMC: it’s been
around longer, we know that there are sufficiently robust tools to help you,
and there’s a lot more support/documentation out there.</p>
</li>
</ul>
<h3 id="markov-chain-monte-carlo">Markov-chain Monte Carlo</h3>
<ul>
<li>
<p>For a good high-level introduction to MCMC, I liked <a href="https://www.youtube.com/watch?v=DJ0c7Bm5Djk&feature=youtu.be&t=4h40m9s">Michael Betancourt’s
StanCon 2017
talk</a>:
especially the first few minutes where he provides a motivation for MCMC, that
really put all this math into context for me.</p>
</li>
<li>
<p>For a more in-depth (and mathematical) treatment of MCMC, I’d check out his
<a href="https://arxiv.org/abs/1701.02434">paper on Hamiltonian Monte Carlo</a>.</p>
</li>
</ul>
<h3 id="variational-inference">Variational inference</h3>
<ul>
<li>
<p>VI has been around for a while, but it was only in 2017 (a year before this
writing) that <em>automatic differentiation variational inference</em> was
invented. As such, variational inference is undergoing a renaissance and is
currently an active area of statistical research. Since it’s such a nascent
field, most resources on it are very theoretical and academic in nature.</p>
</li>
<li>
<p>Chapter 10 (on approximate inference) in Bishop’s <em>Pattern Recognition and
Machine Learning</em> and <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">this
tutorial</a>
by David Blei are excellent, if a bit mathematically-intensive, resources.</p>
</li>
<li>
<p>The most hands-on explanation of variational inference I’ve seen is the docs
for <a href="http://pyro.ai/examples/svi_part_i.html">Pyro</a>, a probabilistic
programming language developed by Uber that specializes in variational
inference.</p>
</li>
</ul>
<h2 id="model-formulation">Model Formulation</h2>
<ul>
<li>
<p>Try thinking about <em>how</em> your data would be generated: what kind of machine
has your data as outputs? This will help you both explore your data, as well
as help you arrive at a reasonable model formulation.</p>
</li>
<li>
<p>Try to avoid correlated variables. Some of the more robust samplers (<strong>cough</strong>
NUTS <strong>cough cough</strong>) can cope with <em>a posteriori</em> correlated random
variables, but sampling is much easier for everyone involved if the variables
are uncorrelated. By the way, the bar is pretty low here: if the
jointplot/scattergram of the two variables looks like an ellipse, that’s
usually okay. It’s when the ellipse starts looking like a line that you should
be alarmed.</p>
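<p>The ellipse-versus-line heuristic is easy to see with simulated draws (the correlation values here are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)
corrs = {}
for rho in (0.6, 0.99):
    # Build two unit normals with target correlation rho.
    x = z
    y = rho * z + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    corrs[rho] = np.corrcoef(x, y)[0, 1]
    print(rho, round(corrs[rho], 2))
```

<p>A scattergram of <code class="highlighter-rouge">(x, y)</code> at 0.6 is a fat ellipse; at 0.99 it collapses toward a line, which is when sampling gets hard.</p>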
</li>
<li>
<p>Try to avoid discrete latent variables, and discrete parameters in general.
There is no good method to sample them in a smart way (since discrete
parameters have no gradients); and with “naïve” samplers (i.e. those that do
not take advantage of the gradient), the number of samples one needs to make
good inferences generally scales exponentially in the number of parameters.
For an instance of this, see <a href="https://docs.pymc.io/notebooks/marginalized_gaussian_mixture_model.html">this example on marginal Gaussian
mixtures</a>.</p>
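<p>The usual workaround is to marginalize the discrete variable out analytically. For a mixture model, that means summing the component indicator out of the likelihood. A sketch of the arithmetic (not the PyMC3 API):</p>

```python
import numpy as np

def mixture_logp(x, weights, mus, sigmas):
    """Mixture log-density with the discrete component label summed out:
    log p(x) = logsumexp_k [log w_k + log Normal(x | mu_k, sigma_k)]."""
    x = np.asarray(x, dtype=float)[:, None]
    logp_k = (np.log(weights)
              - 0.5 * np.log(2 * np.pi * sigmas ** 2)
              - 0.5 * (x - mus) ** 2 / sigmas ** 2)
    m = logp_k.max(axis=1, keepdims=True)  # numerically stable log-sum-exp
    return (m + np.log(np.exp(logp_k - m).sum(axis=1, keepdims=True))).ravel()
```

<p>With the label marginalized out, the log-likelihood is a smooth function of the continuous parameters, so gradient-based samplers like NUTS apply again.</p>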
</li>
<li>
<p>The <a href="https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations">Stan GitHub
wiki</a> has
some excellent recommendations on how to choose good priors. Once you get a
good handle on the basics of using PyMC3, I <em>100% recommend</em> reading this wiki
from start to end: the Stan community has fantastic resources on Bayesian
statistics, and even though their APIs are quite different, the mathematical
theory all translates over.</p>
</li>
</ul>
<h3 id="hierarchical-models">Hierarchical models</h3>
<ul>
<li>
<p>First of all, hierarchical models are amazing! <a href="https://docs.pymc.io/notebooks/GLM-hierarchical.html">The PyMC3
docs</a> opine on this at
length, so let’s not waste any digital ink.</p>
</li>
<li>
<p>The poster child of a Bayesian hierarchical model looks something like this
(equations taken from
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Wikipedia</a>):</p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/765f37f86fa26bef873048952dccc6e8067b78f4" alt="Example Bayesian hierarchical model equation #1" /></p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/ca8c0e1233fd69fa4325c6eacf8462252ed6b00a" alt="Example Bayesian hierarchical model equation #2" /></p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/1e56b3077b1b3ec867d6a0f2539ba9a3e79b45c1" alt="Example Bayesian hierarchical model equation #3" /></p>
<p>This hierarchy has 3 levels (some would say it has 2 levels, since there are
only 2 levels of parameters to infer, but honestly whatever: by my count there
are 3). 3 levels is fine, but add any more levels, and it becomes harder to
sample. Try out a taller hierarchy to see if it works, but err on the side
of 3-level hierarchies.</p>
</li>
<li>
<p>If your hierarchy is too tall, you can truncate it by introducing a
deterministic function of your parameters somewhere (this usually turns out to
just be a sum). For example, instead of modelling your observations as drawn
from a 4-level hierarchy, maybe your observations can be modeled as the sum of
three parameters, where these parameters are drawn from a 3-level hierarchy.</p>
</li>
<li>
<p>More in-depth treatment here in <a href="https://arxiv.org/abs/1312.0906">(Betancourt and Girolami,
2013)</a>. <strong>tl;dr:</strong> hierarchical models all
but <em>require</em> you to use Hamiltonian Monte Carlo; also included are some
practical tips and goodies on how to do that stuff in the real world.</p>
</li>
</ul>
<h2 id="model-implementation">Model Implementation</h2>
<ul>
<li>
<p>At the risk of overgeneralizing, there are only two things that can go wrong
in Bayesian modelling: either your data is wrong, or your model is wrong. And
it is a hell of a lot easier to debug your data than it is to debug your
model. So before you even try implementing your model, plot histograms of your
data, count the number of data points, drop any NaNs, etc. etc.</p>
</li>
<li>
<p>PyMC3 has one quirky piece of syntax, which I tripped up on for a while. It’s
described quite well in <a href="http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/#comment-2213376737">this comment on Thomas Wiecki’s
blog</a>.
Basically, suppose you have several groups, each with a different number of
observations, and you want one random variable per group, broadcast across that
group’s observations. Then you need to use the quirky <code class="highlighter-rouge">variables[index]</code>
notation. I suggest using <code class="highlighter-rouge">scikit-learn</code>’s <code class="highlighter-rouge">LabelEncoder</code> to easily create the
index. For example, to make normally distributed heights for the iris dataset:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Different numbers of examples for each species</span>
<span class="n">species</span> <span class="o">=</span> <span class="p">(</span><span class="mi">48</span> <span class="o">*</span> <span class="p">[</span><span class="s">'setosa'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">52</span> <span class="o">*</span> <span class="p">[</span><span class="s">'virginica'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">63</span> <span class="o">*</span> <span class="p">[</span><span class="s">'versicolor'</span><span class="p">])</span>
<span class="n">num_species</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">species</span><span class="p">)))</span> <span class="c"># 3</span>
<span class="c"># One variable per group</span>
<span class="n">heights_per_species</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">'heights_per_species'</span><span class="p">,</span>
<span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="n">num_species</span><span class="p">)</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">sklearn</span><span class="o">.</span><span class="n">preprocessing</span><span class="o">.</span><span class="n">LabelEncoder</span><span class="p">()</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">species</span><span class="p">)</span>
<span class="n">heights</span> <span class="o">=</span> <span class="n">heights_per_species</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</code></pre></div> </div>
</li>
<li>
<p>You might find yourself in a situation in which you want to use a centered
parameterization for a portion of your data set, but a noncentered
parameterization for the rest of your data set (see below for what these
parameterizations are). There’s a useful idiom for you here:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_xs</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">use_centered</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> <span class="c"># len(use_centered) = num_xs</span>
<span class="n">x_sd</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">HalfCauchy</span><span class="p">(</span><span class="s">'x_sd'</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x_raw</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">'x_raw'</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="n">x_sd</span><span class="o">**</span><span class="n">use_centered</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="n">num_xs</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="n">x_sd</span><span class="o">**</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">use_centered</span><span class="p">)</span> <span class="o">*</span> <span class="n">x_raw</span><span class="p">)</span>
</code></pre></div> </div>
<p>You could even experiment with allowing <code class="highlighter-rouge">use_centered</code> to be <em>between</em> 0 and
1, instead of being <em>either</em> 0 or 1!</p>
</li>
<li>
<p>I prefer to use the <code class="highlighter-rouge">pm.Deterministic</code> function instead of simply using normal
arithmetic operations (e.g. I’d prefer to write <code class="highlighter-rouge">x = pm.Deterministic('x', y +
z)</code> instead of <code class="highlighter-rouge">x = y + z</code>). This means that you can index the <code class="highlighter-rouge">trace</code> object
later on with just <code class="highlighter-rouge">trace['x']</code>, instead of having to compute it yourself with
<code class="highlighter-rouge">trace['y'] + trace['z']</code>.</p>
</li>
</ul>
<h2 id="mcmc-initialization-and-sampling">MCMC Initialization and Sampling</h2>
<ul>
<li>
<p>Have faith in PyMC3’s default initialization and sampling settings: someone
much more experienced than us took the time to choose them! NUTS is the most
efficient MCMC sampler known to man, and <code class="highlighter-rouge">jitter+adapt_diag</code>… well, you get
the point.</p>
</li>
<li>
<p>However, if you’re truly grasping at straws, the more powerful initialization
setting would be <code class="highlighter-rouge">advi</code> or <code class="highlighter-rouge">advi+adapt_diag</code>, which uses variational
inference to initialize the sampler. An even better option would be to use
<code class="highlighter-rouge">advi+adapt_diag_grad</code>, which is (at the time of writing) an experimental
feature in beta.</p>
</li>
<li>
<p>Never initialize the sampler with the MAP estimate! In low dimensional
problems the MAP estimate (a.k.a. the mode of the posterior) is often quite a
reasonable point. But in high dimensions, the MAP becomes very strange. Check
out <a href="http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
on high-dimensional Gaussians to see why. Besides, at the MAP all the derivatives
of the posterior are zero, and that isn’t great for derivative-based samplers.</p>
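<p>To see why the mode is an atypical point in high dimensions, it’s enough to look at distances from the mode of a standard Gaussian: essentially all of the probability mass lives in a thin shell of radius roughly <script type="math/tex">\sqrt{d}</script>, nowhere near the MAP. A quick simulation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mean_radius = {}
for d in (2, 100, 10_000):
    x = rng.normal(size=(1_000, d))   # draws from a d-dimensional standard normal
    r = np.linalg.norm(x, axis=1)     # distance of each draw from the mode
    mean_radius[d] = r.mean()         # concentrates near sqrt(d)
    print(d, round(mean_radius[d], 1))
```

<p>At <code class="highlighter-rouge">d = 10_000</code>, a typical sample sits about 100 standard deviations from the mode, so initializing a sampler at the MAP drops it into a region the posterior essentially never visits.</p>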
</li>
</ul>
<h2 id="mcmc-trace-diagnostics">MCMC Trace Diagnostics</h2>
<ul>
<li>You’ve hit the <em>Magic Inference Button™</em>, and you have a <code class="highlighter-rouge">trace</code> object. Now
what? First of all, make sure that your sampler didn’t barf itself, and that
your chains are safe for consumption (i.e., analysis).</li>
</ul>
<ol>
<li>
<p>Run the chain for as long as you have the patience or resources for. Make
sure that the <code class="highlighter-rouge">tune</code> parameter increases commensurately with the <code class="highlighter-rouge">draws</code>
parameter.</p>
</li>
<li>
<p>Check for divergences. PyMC3’s sampler will spit out a warning if there are
diverging chains, but the following code snippet may make things easier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Display the total number and percentage of divergent chains</span>
<span class="n">diverging</span> <span class="o">=</span> <span class="n">trace</span><span class="p">[</span><span class="s">'diverging'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Number of Divergent Chains: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">diverging</span><span class="o">.</span><span class="n">nonzero</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span><span class="p">))</span>
<span class="n">diverging_perc</span> <span class="o">=</span> <span class="n">diverging</span><span class="o">.</span><span class="n">nonzero</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">trace</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Percentage of Divergent Chains: {:.1f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">diverging_perc</span><span class="p">))</span>
</code></pre></div> </div>
</li>
<li>
<p>Check the traceplot (<code class="highlighter-rouge">pm.traceplot(trace)</code>). You’re looking for traceplots
that look like “fuzzy caterpillars”. If the trace moves into some region and
stays there for a long time (a.k.a. there are some “sticky regions”), that’s
cause for concern! That indicates that once the sampler moves into some
region of parameter space, it gets stuck there (probably due to high
curvature or other bad topological properties).</p>
</li>
<li>
<p>In addition to the traceplot, there are <a href="https://docs.pymc.io/api/plots.html">a ton of other
plots</a> you can make with your trace:</p>
<ul>
<li><code class="highlighter-rouge">pm.plot_posterior(trace)</code>: check if your posteriors look reasonable.</li>
<li><code class="highlighter-rouge">pm.forestplot(trace)</code>: check if your variables have reasonable credible
intervals, and Gelman–Rubin scores close to 1.</li>
<li><code class="highlighter-rouge">pm.autocorrplot(trace)</code>: check if your chains are impaired by high
autocorrelation. Also remember that thinning your chains is a waste of
time at best, and deluding yourself at worst. See Chris Fonnesbeck’s
comment on <a href="https://github.com/pymc-devs/pymc/issues/23">this GitHub
issue</a> and <a href="https://twitter.com/junpenglao/status/1009748562136256512">Junpeng Lao’s
reply to Michael Betancourt’s
tweet</a></li>
<li><code class="highlighter-rouge">pm.energyplot(trace)</code>: ideally the energy and marginal energy
distributions should look very similar. Long tails in the distribution of
energy levels indicates deteriorated sampler efficiency.</li>
<li><code class="highlighter-rouge">pm.densityplot(trace)</code>: a souped-up version of <code class="highlighter-rouge">pm.plot_posterior</code>. It
doesn’t seem to be wildly useful unless you’re plotting posteriors from
multiple models.</li>
</ul>
</li>
<li>PyMC3 has a nice helper function to pretty-print a summary table of the
trace: <code class="highlighter-rouge">pm.summary(trace)</code> (I usually tack on a <code class="highlighter-rouge">.round(2)</code> for my sanity).
Look out for:
<ul>
<li>the <script type="math/tex">\hat{R}</script> values (a.k.a. the Gelman–Rubin statistic, a.k.a. the
potential scale reduction factor, a.k.a. the PSRF): are they all close to
1? If not, something is <em>horribly</em> wrong. Consider respecifying or
reparameterizing your model. You can also inspect these in the forest plot.</li>
<li>the sign and magnitude of the inferred values: do they make sense, or are
they unexpected and unreasonable? This could indicate a poorly specified
model. (E.g. parameters of the unexpected sign that have low uncertainties
might indicate that your model needs interaction terms.)</li>
</ul>
</li>
<li>
<p>As a drastic debugging measure, try to <code class="highlighter-rouge">pm.sample</code> with <code class="highlighter-rouge">draws=1</code>,
<code class="highlighter-rouge">tune=500</code>, and <code class="highlighter-rouge">discard_tuned_samples=False</code>, and inspect the traceplot.
During the tuning phase, we don’t expect to see friendly fuzzy caterpillars,
but we <em>do</em> expect to see good (if noisy) exploration of parameter space. So
if the sampler is getting stuck during the tuning phase, that might explain
why the trace looks horrible.</p>
</li>
<li>
<p>If you get scary errors that describe mathematical problems (e.g. <code class="highlighter-rouge">ValueError:
Mass matrix contains zeros on the diagonal. Some derivatives might always be
zero.</code>), then you’re <del>shit out of luck</del> exceptionally unlucky: those kinds of
errors are notoriously hard to debug. I can only point to the <a href="http://andrewgelman.com/2008/05/13/the_folk_theore/">Folk Theorem of
Statistical Computing</a>:</p>
<blockquote>
<p>If you’re having computational problems, probably your model is wrong.</p>
</blockquote>
</li>
</ol>
<h3 id="fixing-divergences">Fixing divergences</h3>
<blockquote>
<p><code class="highlighter-rouge">There were N divergences after tuning. Increase 'target_accept' or reparameterize.</code></p>
<p>— The <em>Magic Inference Button™</em></p>
</blockquote>
<ul>
<li>
<p>Divergences in HMC occur when the sampler finds itself in regions of extremely
high curvature (such as the opening of a hierarchical funnel). Broadly
speaking, the sampler is prone to malfunction in such regions, causing it to
fly off towards infinity. This ruins the chains by heavily biasing the
samples.</p>
</li>
<li>
<p>Remember: if you have even <em>one</em> diverging chain, you should be worried.</p>
</li>
<li>
<p>Increase <code class="highlighter-rouge">target_accept</code>: usually 0.9 is a good number (currently the default
in PyMC3 is 0.8). This will help get rid of false positives from the test for
divergences. However, divergences that <em>don’t</em> go away are cause for alarm.</p>
</li>
<li>
<p>Increasing <code class="highlighter-rouge">tune</code> can sometimes help as well: this gives the sampler more time
to 1) find the typical set and 2) find good values for step sizes, scaling
factors, etc. If you’re running into divergences, it’s always possible that
the sampler just hasn’t started the mixing phase and is still trying to find
the typical set.</p>
</li>
<li>
<p>Consider a <em>noncentered</em> parameterization. This is an amazing trick: it all boils down
to the familiar equation <script type="math/tex">X = \sigma Z + \mu</script> from STAT 101, but it honestly
works wonders. See <a href="http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/">Thomas Wiecki’s blog
post</a>
on it, and <a href="https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html">this page from the PyMC3
documentation</a>.</p>
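<p>The arithmetic behind the trick, stripped of any sampler (a plain NumPy sketch, not PyMC3 code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# Centered: sample theta ~ Normal(mu, sigma) directly.
theta_centered = rng.normal(mu, sigma, size=100_000)

# Noncentered: sample z ~ Normal(0, 1), then shift and scale.
z = rng.normal(0.0, 1.0, size=100_000)
theta_noncentered = mu + sigma * z  # same distribution as Normal(mu, sigma)
```

<p>In a model, <code class="highlighter-rouge">mu</code> and <code class="highlighter-rouge">sigma</code> are themselves random variables; sampling <code class="highlighter-rouge">z</code> independently of them removes the funnel-shaped correlation between the group-level scale and the individual parameters.</p>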
</li>
<li>
<p>If that doesn’t work, there may be something wrong with the way you’re
thinking about your data: consider reparameterizing your model, or
respecifying it entirely.</p>
</li>
</ul>
<h3 id="other-common-warnings">Other common warnings</h3>
<ul>
<li>
<p>It’s worth noting that far and away the worst warning to get is the one about
divergences. While a divergent chain indicates that your inference may be
flat-out <em>invalid</em>, the rest of these warnings indicate that your inference is
merely (lol, “merely”) <em>inefficient</em>.</p>
</li>
<li><code class="highlighter-rouge">The number of effective samples is smaller than XYZ for some parameters.</code>
<ul>
<li>Quoting <a href="https://discourse.pymc.io/t/the-number-of-effective-samples-is-smaller-than-25-for-some-parameters/1050/3">Junpeng Lao on
discourse.pymc3.io</a>:
“A low number of effective samples is usually an indication of strong
autocorrelation in the chain.”</li>
<li>Make sure you’re using an efficient sampler like NUTS. (And not, for
instance, Metropolis–Hastings. (I mean seriously, it’s the 21st century, why
would you ever want Metropolis–Hastings?))</li>
<li>Tweak the acceptance probability (<code class="highlighter-rouge">target_accept</code>) — it should be large
enough to ensure good exploration, but small enough to not reject all
proposals and get stuck.</li>
</ul>
</li>
<li><code class="highlighter-rouge">The gelman-rubin statistic is larger than XYZ for some parameters. This
indicates slight problems during sampling.</code>
<ul>
<li>When PyMC3 samples, it runs several chains in parallel. Loosely speaking,
the Gelman–Rubin statistic measures how similar these chains are. Ideally it
should be close to 1.</li>
<li>Increasing the <code class="highlighter-rouge">tune</code> parameter may help, for the same reasons as described
in the <em>Fixing Divergences</em> section.</li>
</ul>
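<p>To see what the statistic actually measures, the simple (non-split) version is only a few lines; note that PyMC3 computes a fancier variant for you, so this is purely illustrative:</p>

```python
import numpy as np

def gelman_rubin(chains):
    """Simple R-hat for an (n_chains, n_samples) array."""
    chains = np.asarray(chains)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))                    # four well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [5.0]])  # one chain stuck elsewhere
print(gelman_rubin(good), gelman_rubin(bad))
```

<p>When the chains explore the same distribution, the between-chain variance matches the within-chain variance and the ratio is close to 1; a stuck chain inflates it well past that.</p>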
</li>
<li><code class="highlighter-rouge">The chain reached the maximum tree depth. Increase max_treedepth, increase
target_accept or reparameterize.</code>
<ul>
<li>NUTS puts a cap on the depth of the trees that it evaluates during each
iteration, which is controlled through the <code class="highlighter-rouge">max_treedepth</code> argument. Reaching the maximum
allowable tree depth indicates that NUTS is prematurely pulling the plug to
avoid excessive compute time.</li>
<li>Yeah, what the <em>Magic Inference Button™</em> says: try increasing
<code class="highlighter-rouge">max_treedepth</code> or <code class="highlighter-rouge">target_accept</code>.</li>
</ul>
</li>
</ul>
<h3 id="model-reparameterization">Model reparameterization</h3>
<ul>
<li>
<p>Countless warnings have told you to engage in this strange activity of
“reparameterization”. What even is that? Luckily, the <a href="https://github.com/stan-dev/stan/releases/download/v2.17.1/stan-reference-2.17.1.pdf">Stan User
Manual</a>
(specifically the <em>Reparameterization and Change of Variables</em> section) has
an excellent explanation of reparameterization, and even some practical tips
to help you do it (although your mileage may vary on how useful those tips
will be to you).</p>
</li>
<li>
<p>Aside from meekly pointing to other resources, there’s not much I can do to
help: this stuff really comes from a combination of intuition, statistical
knowledge and good ol’ experience. I can, however, cite some examples to give
you a better idea.</p>
<ul>
<li>The noncentered parameterization is a classic example. If you have a
parameter whose mean and variance you are also modelling, the noncentered
parameterization decouples the sampling of mean and variance from the
sampling of the parameter, so that they are now independent. In this way, we
avoid “funnels”.</li>
<li>The <a href="http://proceedings.mlr.press/v5/carvalho09a.html"><em>horseshoe
distribution</em></a> is known to
be a good shrinkage prior, as it is <em>very</em> spikey near zero, and has <em>very</em>
long tails. However, modelling it using one parameter can give multimodal
posteriors — an exceptionally bad result. The trick is to reparameterize and
model it as the product of two parameters: one to create spikiness at zero,
and one to create long tails (which makes sense: to sample from the
horseshoe, take the product of samples from a normal and a half-Cauchy).</li>
</ul>
</li>
</ul>
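<p>The horseshoe reparameterization from the last example is literally a two-line sampling statement (a sketch of the sampling recipe, not model code):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
lam = np.abs(rng.standard_cauchy(n))  # half-Cauchy "local scale": long tails
z = rng.normal(0.0, 1.0, n)           # standard normal: spike near zero
beta = z * lam                        # horseshoe-distributed draws
```

<p>Compared to a standard normal, <code class="highlighter-rouge">beta</code> has both more mass near zero and far heavier tails, which is exactly the shrinkage behavior described above.</p>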
<h2 id="model-diagnostics">Model Diagnostics</h2>
<ul>
<li>Admittedly the distinction between the previous section and this one is
somewhat artificial (since problems with your chains indicate problems with
your model), but I still think it’s useful to make this distinction because
these checks indicate that you’re thinking about your data in the wrong way,
(i.e. you made a poor modelling decision), and <em>not</em> that the sampler is having
a hard time doing its job.</li>
</ul>
<ol>
<li>
<p>Run the following snippet of code to inspect the pairplot of your variables
one at a time (if you have a plate of variables, it’s fine to pick a couple
at random). It’ll tell you if the two random variables are correlated, and
help identify any troublesome neighborhoods in the parameter space (divergent
samples will be colored differently, and will cluster near such
neighborhoods).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pm</span><span class="o">.</span><span class="n">pairplot</span><span class="p">(</span><span class="n">trace</span><span class="p">,</span>
<span class="n">sub_varnames</span><span class="o">=</span><span class="p">[</span><span class="s">'variable_1'</span><span class="p">,</span> <span class="s">'variable_2'</span><span class="p">],</span>
<span class="n">divergences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s">'C3'</span><span class="p">,</span>
<span class="n">kwargs_divergence</span><span class="o">=</span><span class="p">{</span><span class="s">'color'</span><span class="p">:</span> <span class="s">'C2'</span><span class="p">})</span>
</code></pre></div> </div>
</li>
<li>
<p>Look at your posteriors (either from the traceplot, density plots or
posterior plots). Do they even make sense? E.g. are there outliers or long
tails that you weren’t expecting? Do their uncertainties look reasonable to
you? If you had <a href="https://en.wikipedia.org/wiki/Plate_notation">a plate</a> of
variables, are their posteriors different? Did you expect them to be that
way? If not, what about the data made the posteriors different? You’re the
only one who knows your problem/use case, so the posteriors better look good
to you!</p>
</li>
<li>Broadly speaking, there are four kinds of bad geometries that your posterior
can suffer from:
<ul>
<li>highly correlated posteriors: this will probably cause divergences or
traces that don’t look like “fuzzy caterpillars”. Either look at the
jointplots of each pair of variables, or look at the correlation matrix of
all variables. Try using a centered parameterization, or reparameterize in
some other way, to remove these correlations.</li>
<li>posteriors that form “funnels”: this will probably cause divergences. Try
using a noncentered parameterization.</li>
<li>long tailed posteriors: this will probably raise warnings about
<code class="highlighter-rouge">max_treedepth</code> being exceeded. If your data has long tails, you should
model that with a long-tailed distribution. If your data doesn’t have long
tails, then your model is ill-specified: perhaps a more informative prior
would help.</li>
<li>multimodal posteriors: right now this is pretty much a death blow. At the
time of writing, all samplers have a hard time with multimodality, and
there’s not much you can do about that. Try reparameterizing to get a
unimodal posterior. If that’s not possible (perhaps you’re <em>modelling</em>
multimodality using a mixture model), you’re out of luck: just let NUTS
sample for a day or so, and hopefully you’ll get a good trace.</li>
</ul>
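<p>The first failure mode is easy to check numerically by looking at the correlation matrix of your posterior draws. Here’s a standalone NumPy sketch (the <code class="highlighter-rouge">draws</code> array is simulated; with a real trace, something like <code class="highlighter-rouge">np.column_stack([trace[v] for v in varnames])</code> gives the same shape):</p>

```python
import numpy as np

# Simulated stand-in for posterior draws: (n_samples, n_parameters),
# with the second parameter constructed to be correlated with the first.
rng = np.random.default_rng(0)
z = rng.normal(size=(4000, 2))
draws = np.column_stack([z[:, 0], 0.9 * z[:, 0] + 0.1 * z[:, 1], z[:, 1]])

# Parameter-by-parameter correlations across the posterior draws.
corr = np.corrcoef(draws, rowvar=False)
print(np.round(corr, 2))  # off-diagonal entries near +/-1 are red flags
```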
</li>
<li>
<p>Pick a small subset of your raw data, and see what exactly your model does
with that data (i.e. run the model on a specific subset of your data). I find
that a lot of problems with your model can be found this way.</p>
</li>
<li>Run <a href="https://docs.pymc.io/notebooks/posterior_predictive.html"><em>posterior predictive
checks</em></a> (a.k.a.
PPCs): sample from your posterior, plug it back in to your model, and
“generate new data sets”. PyMC3 even has a nice function to do all this for
you: <code class="highlighter-rouge">pm.sample_ppc</code>. But what do you do with these new data sets? That’s a
question only you can answer! The point of a PPC is to see if the generated
data sets reproduce patterns you care about in the observed real data set,
and only you know what patterns you care about. E.g. how close are the PPC
means to the observed sample mean? What about the variance?
<ul>
<li>For example, suppose you were modelling the levels of radon gas in
different counties in a country (through a hierarchical model). Then you
could sample radon gas levels from the posterior for each county, and take
the maximum within each county. You’d then have a distribution of maximum
radon gas levels across counties. You could then check if the <em>actual</em>
maximum radon gas level (in your observed data set) falls acceptably within
that distribution. If it’s much larger than the simulated maxima, then you
would know that the actual likelihood has longer tails than you assumed (e.g.
perhaps you should use a Student’s T instead of a normal?).</li>
<li>Remember that how well the posterior predictive distribution fits the data
is of little consequence (e.g. the expectation that 90% of the data should
fall within the 90% credible interval of the posterior). The posterior
predictive distribution tells you what values you would expect if you were
to remeasure, given the data you’ve already observed. As such, it’s informed
by your prior as well as your data, and it’s not its job to adequately fit
your data!</li>
</ul>
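<p>That radon check can be sketched end-to-end in NumPy (every array here is simulated for illustration; with a real model, <code class="highlighter-rouge">ppc</code> would come from <code class="highlighter-rouge">pm.sample_ppc</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed data: radon levels for 10 counties x 50 homes.
observed = rng.lognormal(mean=1.0, sigma=0.5, size=(10, 50))

# Stand-in for PPC output: 500 replicated data sets of the same shape,
# here faked by drawing from the same lognormal as the "observed" data.
ppc = rng.lognormal(mean=1.0, sigma=0.5, size=(500, 10, 50))

# Compare a statistic you care about -- the overall maximum -- between
# the observed data and each replicated data set.
observed_max = observed.max()
ppc_max = ppc.max(axis=(1, 2))  # one maximum per replicated data set

# Posterior predictive p-value: values near 0 or 1 mean the model
# systematically misses this statistic (e.g. tails that are too short).
p_value = (ppc_max >= observed_max).mean()
```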
</li>
</ol>George HoThis is a compilation of notes, tips, tricks and recipes for Bayesian modelling that I've collected from everywhere: papers, documentation, peppering my more experienced colleagues with questions.Understanding Hate Speech on Reddit through Text Clustering2018-03-18T00:00:00+00:002018-03-18T00:00:00+00:00https://eigenfoo.xyz/reddit-clusters<blockquote>
<p>Note: the following article contains several examples of hate speech
(including but not limited to racist, misogynistic and homophobic views).</p>
</blockquote>
<p>Have you heard of <code class="highlighter-rouge">/r/TheRedPill</code>? It’s an online forum (a subreddit, but I’ll
explain that later) where people (usually men) espouse an ideology predicated
entirely on gender. “Swallowers of the red pill”, as they call themselves,
maintain that it is <em>men</em>, not women, who are socially marginalized; that feminism
is something between a damaging ideology and a symptom of societal retardation;
that the patriarchy should actively assert its dominance over female
compatriots.</p>
<p>Despite being shunned by the world (or perhaps, because of it), <code class="highlighter-rouge">/r/TheRedPill</code>
has grown into a sizable community and evolved its own slang, language and
culture. Let me give you an example.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cluster #14:
Cluster importance: 0.0489376285127
shit: 2.433590
test: 1.069885
frame: 0.396684
pass: 0.204953
bitch: 0.163619
</code></pre></div></div>
<p>This is a snippet from a text clustering of <code class="highlighter-rouge">/r/TheRedPill</code> — you don’t really
need to understand the details right now: all you need to know is that each
cluster is simply a bunch of words that frequently appear together in Reddit
posts and comments. Following each word is a number indicating its importance in
the cluster, and on line 2 is the importance of this cluster to the subreddit
overall.</p>
<p>As it turns out, this cluster has picked up on a very specific meme on
<code class="highlighter-rouge">/r/TheRedPill</code>: the concept of the <em>shit test</em>, and how your frame can <em>pass</em> the
<em>shit tests</em> that life (but predominantly, <em>bitches</em>) can throw at you.</p>
<p>There’s absolutely no way I could explain this stuff better than the swallowers
of the red pill themselves, so I’ll just quote from a post on <code class="highlighter-rouge">/r/TheRedPill</code> and
a related blog.</p>
<p>The concept of the shit test is very broad:</p>
<blockquote>
<p>… when somebody “gives you shit” and fucks around with your head to see how
you will react, what you are experiencing is typically a (series of) shit
test(s).</p>
</blockquote>
<p>A shit test is designed to test your temperament, or more colloquially,
<em>“determine your frame”</em>.</p>
<blockquote>
<p>Frame is a concept which essentially means “composure and self-control”.</p>
<p>… if you can keep composure/seem unfazed and/or assert your boundaries
despite a shit test, generally speaking you will be considered to have passed
the shit test. If you get upset, offended, doubt yourself or show weakness in
any discernible way when shit tested, it will be generally considered that you
failed the test.</p>
</blockquote>
<p>Finally, not only do shit tests test your frame, but they also serve a specific,
critical social function:</p>
<blockquote>
<p>When it comes right down to it shit tests are typically women’s way of
flirting.</p>
<p>… Those who “pass” show they can handle the woman’s BS and is “on her
level”, so to speak. This is where the evolutionary theory comes into play:
you’re demonstrating her faux negativity doesn’t phase you [sic] and that
you’re an emotionally developed person who isn’t going to melt down at the
first sign of trouble. Ergo you’ll be able to protect her when threats to
her safety emerge.</p>
</blockquote>
<p>If you want to learn more, I took all the above quotes from
<a href="https://www.reddit.com/r/TheRedPill/comments/22qnmk/newbies_read_this_the_definitive_guide_to_shit/">here</a>
and <a href="https://illimitablemen.com/2014/12/14/the-shit-test-encyclopedia/">here</a>:
feel free to toss yourself down that rabbit hole (but you may want to open those
links in Incognito mode).</p>
<p>Clearly though, the cluster did a good job of identifying one topic of
discussion on <code class="highlighter-rouge">/r/TheRedPill</code>. In fact, not only can clustering pick up on a
general topic of conversation, but also on specific memes, motifs and vocabulary
associated with it.</p>
<p>Interested? Read on! I’ll explain what I did, and describe some of my other
results.</p>
<hr />
<p>Reddit is — well, it’s pretty hard to describe what Reddit <em>is</em>, mainly because
Reddit comprises several thousand communities, called <em>subreddits</em>, which center
around topics broad (<code class="highlighter-rouge">/r/Sports</code>) and niche (<code class="highlighter-rouge">/r/thinkpad</code>), delightful
(<code class="highlighter-rouge">/r/aww</code>) and unsavory (<code class="highlighter-rouge">/r/Incels</code>).</p>
<p>Each subreddit is a unique community with its own rules, culture and standards.
Some are welcoming and inclusive, and anyone can post and comment; others, not
so much: you must be invited to even read their front page. Some have pliant
standards about what is acceptable as a post; others have moderators willing to
remove posts and ban users upon any infraction of community guidelines.</p>
<p>Whatever Reddit is though, two things are for certain:</p>
<ol>
<li>
<p>It’s widely used. <em>Very</em> widely used. At the time of writing, it’s the <a href="https://www.alexa.com/topsites/countries/US">fourth
most popular website in the United
States</a> and the <a href="https://www.alexa.com/topsites">sixth most popular
globally</a>.</p>
</li>
<li>
<p>Where there is free speech, there is hate speech. Reddit’s hate speech
problem is <a href="https://www.wired.com/2015/08/reddit-mods-handle-hate-speech/">well
documented</a>,
the <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">center of recent
controversy</a>,
and even <a href="https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/">the subject of statistical
analysis</a>.</p>
</li>
</ol>
<p>Now, there are many well-known hateful subreddits. The three that I decided to
focus on were <code class="highlighter-rouge">/r/TheRedPill</code>, <code class="highlighter-rouge">/r/The_Donald</code>, and <code class="highlighter-rouge">/r/CringeAnarchy</code>.</p>
<p>The goal here is to understand what these subreddits are like, and expose their
culture for people to see. To quote <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">Steve Huffman, Reddit’s
CEO</a>:</p>
<blockquote>
<p>“I believe the best defense against racism and other repugnant views, both
on Reddit and in the world, is instead of trying to control what people
can and cannot say through rules, is to repudiate these views in a free
conversation, and empower our communities to do so on Reddit.”</p>
</blockquote>
<p>And there’s no way we can refute and repudiate these deplorable views without
knowing what those views are. So instead of spending hours on each of these
subreddits ourselves, let’s have a machine learn what gets talked about on these
subreddits.</p>
<hr />
<p>Now, how do we do this? This can be done using <em>clustering</em>, a machine learning
technique in which we’re given data points, and tasked with grouping them in
some way. A picture will explain better than words:</p>
<figure>
<a href="https://eigenfoo.xyz/assets/images/clusters.png"><img src="https://eigenfoo.xyz/assets/images/clusters.png" alt="Illustration of clustering" /></a>
<figcaption>Clustering.</figcaption>
</figure>
<p>The clustering algorithm was hard to decide on. After exploring several dead
ends, I settled on non-negative matrix factorization of the document-term
matrix, featurized using tf-idfs. I don’t really want to go into the technical
details now: suffice to say that this technique is <a href="http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html">known to work well for this
application</a>
(perhaps I’ll write another piece on this in the future).</p>
<p>Finally, we need our data points: <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments">Google
BigQuery</a>
has all posts and comments across all of Reddit, from the beginning of
Reddit right up until the end of 2017. We decided to focus on the last two
months for which there is data: November and December, 2017.</p>
<p>I could talk at length about the technical details, but right now, I want to
focus on the results of the clustering. What follows are two hand-picked
clusters from each of the three subreddits, visualized as word clouds (you can
think of word clouds as visual representations of the code snippet above), as
well as an example comment from each of the clusters.</p>
<h2 id="rtheredpill"><code class="highlighter-rouge">/r/TheRedPill</code></h2>
<p>You already know <code class="highlighter-rouge">/r/TheRedPill</code>, so let me describe the clusters in more detail:
a good number of them are about sex, or about how to approach girls. Comments in
these clusters tend to give advice on how to pick up girls, or describe the
social/sexual exploits of the commenter.</p>
<p>What is interesting is that, as sex-obsessed as <code class="highlighter-rouge">/r/TheRedPill</code> is, many
swallowers (of the red pill) profess that sex is <em>not</em> the purpose of the
subreddit: the point is to become an “alpha male”. Even more interesting,
there is more talk about what an alpha male <em>is</em>, and what kind of people
<em>aren’t</em> alpha, than there is about how people can <em>become</em> alpha. This is the
first cluster shown below, and comprises around 3% of all text on
<code class="highlighter-rouge">/r/TheRedPill</code>.</p>
<p>The second cluster comprises around 6% of all text on <code class="highlighter-rouge">/r/TheRedPill</code>, and
contains comments that expound theories on the role of men, women and feminism
in today’s society (it isn’t pretty). Personally, the most repugnant views that
I’ve read are to be found in this cluster.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I feel like the over dramatization of beta qualities in media/pop culture is due
to the fact that anyone representing these qualities is already Alpha by
default.
The actors who play the white knight lead roles, the rock stars that sing about
pining for some chick… these men/characters are already very Alpha in both looks
and status, so when beta BS comes from their mouths, it’s seen as attractive
because it balances out their already alpha state into that "mostly alpha but
some beta" balance that makes women swoon.
…
</code></pre></div></div>
<figure class="half">
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/13_3.21%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/13_3.21%25.png" alt="/r/TheRedPill cluster #13" /></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/06_6.41%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/06_6.41%25.png" alt="/r/TheRedPill cluster #6" /></a>
<figcaption>Wordclouds from /r/TheRedPill.</figcaption>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>…
Since the dawn of humanity men were always in control, held all the power and
women were happy because of it. But now men are forced to lose their masculinity
and power or else they'll be killed/punished by other pussy men with big guns
and laws who believe feminism is the right path for humanity.
…
Feminism is really a blessing in disguise because it's a wake up call for men
and a hidden cry for help from women for men to regain their masculinity,
integrity and control over women.
…
</code></pre></div></div>
<h2 id="rthe_donald"><code class="highlighter-rouge">/r/The_Donald</code></h2>
<p>You may have already heard of <code class="highlighter-rouge">/r/The_Donald</code> (a.k.a. the “pro-Trump cesspool”),
famed for their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Conflict_with_Reddit_management">takeover of the Reddit front
page</a>,
and their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Controversies">involvement in several recent
controversies</a>. It
may therefore be surprising to learn that there is an iota of lucid discussion
that goes on, although in a jeering, bullying tone.</p>
<p><code class="highlighter-rouge">/r/The_Donald</code> is the subreddit which has developed the most language and inside
jokes: from “nimble navigators” to “swamp creatures”, “spezzes” to the
“Trumpire”… Explaining these memes would take too long: reach out, or Google, if
you really want to know.</p>
<p>The first cluster accounts for 5% of all text on <code class="highlighter-rouge">/r/The_Donald</code>, and contains
(relatively) coherent arguments both for and against net neutrality. The second
cluster accounts for 1% of all text on <code class="highlighter-rouge">/r/The_Donald</code>, and is actually from
the subreddit’s <code class="highlighter-rouge">MAGABrickBot</code>, which is a bot that keeps count of how many times
the word “brick” has been used in comments, by automatically generating this
comment.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>So much misinformation perpetuated by the Swamp... Abolishing Net Neutrality
would benefit swamp creatures with corporate payouts but would be most damaging
to conservatives long term.
Net Neutrality was NOT created by Obama, it was actually in effect from the very
beginning...
</code></pre></div></div>
<figure class="half">
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/00_5.19%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/00_5.19%25.png" alt="/r/The_Donald cluster #0" /></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/02_1.26%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/02_1.26%25.png" alt="/r/The_Donald cluster #2" /></a>
<figcaption>Wordclouds from /r/The_Donald.</figcaption>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**FOR THE LOVE OF GOD GET THIS PATRIOT A BRICK! THAT'S 92278 BRICKS HANDED
OUT!**
We are at **14.3173880911%** of our goal to **BUILD THE WALL** starting from Imperial
Beach, CA to Brownsville, Texas! Lets make sure everyone gets a brick in the
United States! For every Centipede a brick, for every brick a Centipede!
At this rate, the wall will be **1071.35224786 MILES WIDE** and **353.552300867 FEET
HIGH** by tomorrow! **DO YOUR PART!**
</code></pre></div></div>
<h2 id="rcringeanarchy"><code class="highlighter-rouge">/r/CringeAnarchy</code></h2>
<p>On the Internet, <em>cringe</em> is the second-hand embarrassment you feel when someone
acts extremely awkwardly or uncomfortably. And on <code class="highlighter-rouge">/r/CringeAnarchy</code> you can find
memes about the <em>real</em> cringe, which is, um, liberals and anyone else who
advocates for an inclusionary, equitable ideology. Their morally grey jokes run
the gamut of delicate topics: gender, race, sexuality, nationality…</p>
<p>In some respects, the clustering provided very little insight into this
subreddit: each such delicate topic had one or two clusters, and there’s nothing
really remarkable about any of them. This speaks to the inherent difficulty of
training a topic model on memes: I rant at greater length about this topic in
<a href="https://eigenfoo.xyz/lda-sucks/">one of my blog posts</a>.</p>
<p>Both clusters below comprise around 3% of text on <code class="highlighter-rouge">/r/CringeAnarchy</code>: one is to do
with race, and the other is to do with homosexuality.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Has anyone here, non-black or otherwise, ever wished someone felt sorry for
being black? Maybe it's just where I live... the majority is black. It's
whatever.
</code></pre></div></div>
<figure class="half">
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/08_3.10%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/08_3.10%25.png" alt="/r/CringeAnarchy cluster #8" /></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/12_2.92%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/12_2.92%25.png" alt="/r/CringeAnarchy cluster #12" /></a>
<figcaption>Wordclouds from /r/CringeAnarchy.</figcaption>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>…
Also, the distinction between bisexual and gay is academic. If you do a gay
thing, you have done a gay thing. That's what "being gay" means to a LOT of
people. Redefining it is as useful as all the other things SJWs are redefining.
</code></pre></div></div>
<hr />
<p>As much information as that might have been, this was just a glimpse into what
these subreddits are like: I made 20 clusters for each subreddit, and you could
argue that (for somewhat technical reasons) 20 clusters isn’t even enough!
Moreover, there is just no way I could distill everything I learned about these
communities into one Medium story: I’ve curated just the more remarkable or
provocative results to put here.</p>
<p>If you still have the stomach for this stuff, scroll through the complete log
files
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/clustering/nmf/results">here</a>,
or look through images of the word clouds
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/wordclouds/images">here</a>.</p>
<p>Finally, as has been said before, “Talk is cheap. Show me the code.” For
everything I’ve written to make these clusters, check out <a href="https://github.com/eigenfoo/reddit-clusters">this GitHub
repository</a>.</p>
<hr />
<p><strong>EDIT (11-08-2018):</strong> If you’re interested in the technical, data science side
of the project, check out the slide deck and speaker notes from <a href="https://eigenfoo.xyz/reddit-slides/">my recent
talk</a> on exactly that!</p>
<hr />
<p><em>This post was originally published on Medium on May 18, 2018: I have since
<a href="https://medium.com/@nikitonsky/medium-is-a-poor-choice-for-blogging-bb0048d19133">migrated away from
Medium</a>
and <a href="https://bts.nomadgate.com/medium-evergreen-content">deleted my account</a> and
<a href="https://www.joshjahans.com/ditching-medium/">all my stories</a>.</em></p>
<p><em>This post was also reprinted in the inaugural issue of The Cooper Union’s
<a href="https://www.facebook.com/theunionjournal/">UNION Journal</a>.</em></p>George HoA recent project on trying to model hate speech on Reddit through text clustering — from 'nimble navigators' to 'swamp creatures', 'spezzes' to the 'Trumpire'.