Jekyll2019-03-14T06:01:48+00:00https://eigenfoo.xyz/feed.xmlEigenfooGeorge HoAutoregressive Models in Deep Learning — A Brief Survey2019-03-09T00:00:00+00:002019-03-09T00:00:00+00:00https://eigenfoo.xyz/deep-autoregressive-models<p>My current project involves working with deep autoregressive models: a class of
remarkable neural networks that aren’t usually seen on a first pass through deep
learning. These notes are a quick write-up of my reading and research: I assume
basic familiarity with deep learning, and aim to highlight general trends and
similarities across autoregressive models, instead of commenting on individual
architectures.</p>
<p><strong>tldr:</strong> <em>Deep autoregressive models are sequence models, yet feed-forward
(i.e. not recurrent); generative models, yet supervised. They are a compelling
alternative to RNNs for sequential data, and GANs for generation tasks.</em></p>
<h2 id="deep-autoregressive-models">Deep Autoregressive Models</h2>
<p>To be explicit (at the expense of redundancy), this blog post is about <em>deep
autoregressive generative sequence models</em>. That’s quite a mouthful of jargon
(and two of those words are actually unnecessary), so let’s unpack that.</p>
<ol>
<li>Deep
<ul>
<li>Well, these papers are using TensorFlow or PyTorch… so they must be
“deep” :wink:</li>
<li>You would think this word is unnecessary, but it’s actually not!
Autoregressive linear models like
<a href="https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model">ARMA</a>
or
<a href="https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH</a>
have been used in statistics, econometrics and financial modelling for
ages.</li>
</ul>
</li>
<li>Autoregressive
<ul>
<li><a href="https://deepgenerativemodels.github.io/notes/autoregressive/">Stanford has a good introduction</a>
to autoregressive models, but I think a good way to explain these models
is to compare them to recurrent neural networks (RNNs), which are far
more well-known.</li>
</ul>
<figure>
<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png"><img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" /></a>
<figcaption>Obligatory RNN diagram. Source: <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Chris Olah</a>.</figcaption>
</figure>
<ul>
<li>Like an RNN, an autoregressive model’s output <script type="math/tex">h_t</script> at time <script type="math/tex">t</script>
depends on not just <script type="math/tex">x_t</script>, but also <script type="math/tex">x</script>’s from previous time steps.
However, <em>unlike</em> an RNN, the previous <script type="math/tex">x</script>’s are not provided via some
hidden state: they are given as just another input to the model.</li>
<li>The following animation of Google DeepMind’s WaveNet illustrates this
well: the <script type="math/tex">t</script>th output is generated in a <em>feed-forward</em> fashion from
several input <script type="math/tex">x</script> values.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></li>
</ul>
<figure>
<a href="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif"><img src="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif" /></a>
<figcaption>WaveNet animation. Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Google DeepMind</a>.</figcaption>
</figure>
<ul>
<li>Put simply, <strong>an autoregressive model is merely a feed-forward model which
predicts future values from past values.</strong></li>
<li>I’ll explain this more later, but it’s worth saying now: autoregressive
models offer a compelling bargain. You can have stable, parallel and
easy-to-optimize training, faster inference computations, and completely
do away with the fickleness of <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">truncated backpropagation through
time</a>, if you
are willing to accept a model that (by design) <em>cannot have</em> infinite
memory. There is <a href="http://www.offconvex.org/2018/07/27/approximating-recurrent/">recent
research</a>
to suggest that this is a worthwhile tradeoff.</li>
</ul>
</li>
<li>Generative
<ul>
<li>Informally, a generative model is one that can generate new data after
learning from the dataset.</li>
<li>More formally, a generative model models the joint distribution <script type="math/tex">P(X,
Y)</script> of the observation <script type="math/tex">X</script> and the target <script type="math/tex">Y</script>. Contrast this to a
discriminative model that models the conditional distribution <script type="math/tex">P(Y|X)</script>.</li>
<li>GANs and VAEs are two families of popular generative models.</li>
<li>This is unnecessary word #1: any autoregressive model can be run
sequentially to generate a new sequence! Start with your seed <script type="math/tex">x_1, x_2,
..., x_k</script> and predict <script type="math/tex">x_{k+1}</script>. Then use <script type="math/tex">x_2, x_3, ..., x_{k+1}</script>
to predict <script type="math/tex">x_{k+2}</script>, and so on.</li>
</ul>
</li>
<li>Sequence model
<ul>
<li>Fairly self-explanatory: a model that deals with sequential data, whether
it is mapping sequences to scalars (e.g. language models), or mapping
sequences to sequences (e.g. machine translation models).</li>
<li>Although sequence models are designed for sequential data (duh), there
has been success at applying them to non-sequential data. For example,
PixelCNN (discussed below) can generate entire images, even though images
are not sequential in nature: the model generates a pixel at a time, in
sequence!<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></li>
<li>Notice that an autoregressive model must be a sequence model, so it’s
redundant to further describe these models as sequential (which makes
this unnecessary word #2).</li>
</ul>
</li>
</ol>
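<p>The generative loop described in point 3 can be sketched in a few lines. This is a minimal sketch in plain Python: the <code>toy_model</code> below is a hypothetical fixed AR(2) rule standing in for a trained feed-forward network.</p>

```python
def autoregressive_generate(model, seed, n_steps, k):
    """Generate a sequence by repeatedly feeding the last k values
    back through a feed-forward model: no hidden state is carried over."""
    seq = list(seed)
    for _ in range(n_steps):
        window = seq[-k:]          # the model only ever sees the last k values
        seq.append(model(window))  # one feed-forward pass per generated value
    return seq

# Hypothetical "model": a fixed AR(2) averaging rule, standing in for a deep net.
toy_model = lambda w: 0.5 * w[-2] + 0.5 * w[-1]

out = autoregressive_generate(toy_model, seed=[0.0, 2.0], n_steps=3, k=2)
# out == [0.0, 2.0, 1.0, 1.5, 1.25]
```

<p>Note the finite window <code>k</code>: this is exactly the "no infinite memory" tradeoff mentioned above, made concrete.</p>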
<p>A good distinction is that “generative” and “sequential” describe <em>what</em> these
models do, or what kind of data they deal with. “Autoregressive” describes <em>how</em>
these models do what they do: i.e. it describes properties of the network or
its architecture.</p>
<h2 id="some-architectures-and-applications">Some Architectures and Applications</h2>
<p>Deep autoregressive models have seen a good degree of success: below is a list
of some examples. Each architecture merits exposition and discussion, but
unfortunately there isn’t enough space here to do any of them justice.</p>
<ul>
<li><a href="https://arxiv.org/abs/1601.06759">PixelCNN by Google DeepMind</a> was probably
the first deep autoregressive model, and the progenitor of most of the other
models below. Ironically, the authors spend the bulk of the paper discussing a
recurrent model, PixelRNN, and consider PixelCNN as a “workaround” to avoid
excessive computation. However, PixelCNN is probably this paper’s most lasting
contribution.</li>
<li><a href="https://arxiv.org/abs/1701.05517">PixelCNN++ by OpenAI</a> is, unsurprisingly,
PixelCNN but with various improvements.</li>
<li><a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">WaveNet by Google
DeepMind</a> is
heavily inspired by PixelCNN, and models raw audio, not just encoded music.
They had to pull <a href="https://en.wikipedia.org/wiki/%CE%9C-law_algorithm">a neat trick from telecommunications/signals
processing</a> in order to
cope with the sheer size of audio (high-quality audio involves at least
16-bit precision samples, which means a 65,536-way-softmax per time step!)</li>
<li><a href="https://arxiv.org/abs/1706.03762">Transformer, a.k.a. <em>the “attention is all you need” model</em> by Google
Brain</a> is now a mainstay of NLP, performing
very well at many NLP tasks and being incorporated into subsequent models like
<a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a>.</li>
</ul>
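<p>The core trick shared by PixelCNN and WaveNet is the <em>causal convolution</em>: each output depends only on current and past inputs, enforced by left-padding (and, in WaveNet, dilating) the convolution. Here is a minimal sketch in plain Python, not tied to any particular framework:</p>

```python
def causal_conv1d(x, kernel, dilation=1):
    """1D convolution where output[t] depends only on x[0..t]:
    left-padding with zeros ensures no future values leak in."""
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)
    # kernel[-1] multiplies the current sample, kernel[0] the oldest one
    return [
        sum(kernel[i] * padded[t + i * dilation] for i in range(k))
        for t in range(len(x))
    ]

# With dilation, the receptive field grows without extra parameters:
out = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel=[1.0, 1.0], dilation=2)
# out == [1.0, 2.0, 4.0, 6.0]: each output sums the current sample and the one 2 steps back
```

<p>Stacking such layers with doubling dilations (1, 2, 4, …) is how WaveNet achieves its very wide receptive field.</p>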
<p>These models have also found applications: for example, <a href="https://arxiv.org/abs/1610.10099">Google DeepMind’s
ByteNet can perform neural machine translation (in linear
time!)</a> and <a href="https://arxiv.org/abs/1610.00527">Google DeepMind’s Video Pixel
Network can model video</a>.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h2 id="some-thoughts-and-observations">Some Thoughts and Observations</h2>
<ol>
<li>Given previous values <script type="math/tex">x_1, x_2, ..., x_t</script>, these models do not output a
<em>value</em> for <script type="math/tex">x_{t+1}</script>, they output the <em>predictive probability
distribution</em> <script type="math/tex">P(x_{t+1} | x_1, x_2, ..., x_t)</script> for <script type="math/tex">x_{t+1}</script>.
<ul>
<li>If the <script type="math/tex">x</script>’s are discrete, then you can do this by outputting an
<script type="math/tex">N</script>-way softmaxxed tensor, where <script type="math/tex">N</script> is the number of discrete
classes. This is what the original PixelCNN did, but gets problematic when
<script type="math/tex">N</script> is large (e.g. in the case of WaveNet, where <script type="math/tex">N = 2^{16}</script>).</li>
<li>If the <script type="math/tex">x</script>’s are continuous, you can model the probability distribution
itself as the sum of basis functions, and have the model output the
parameters of these basis functions. This massively reduces the memory
footprint of the model, and was an important contribution of PixelCNN++.</li>
<li>Theoretically you could have an autoregressive model that <em>doesn’t</em> model
the conditional distribution… but most recent models do.</li>
</ul>
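<p>For the discrete case, the predictive distribution is just a softmax over the <script type="math/tex">N</script> classes, and generation samples from it rather than taking the argmax. A minimal sketch with made-up logits:</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# made-up logits for a 4-way categorical over the next value x_{t+1}
probs = softmax([2.0, 0.5, 0.1, -1.0])
next_value = random.choices(range(4), weights=probs)[0]  # sample, don't argmax
```

<p>The problem with large <script type="math/tex">N</script> is plain to see here: a 65,536-way softmax means 65,536 output units <em>per time step</em>.</p>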
</li>
<li>Autoregressive models are supervised.
<ul>
<li>With the success and hype of GANs and VAEs, it is easy to assume that
all generative models are unsupervised: this is not true!</li>
<li>This means that training is stable and highly parallelizable, that it
is straightforward to tune hyperparameters, and that inference is
computationally inexpensive. We can also break out all the good stuff
from ML-101: train-valid-test splits, cross validation, loss metrics, etc.
These are all things that we lose when we resort to e.g. GANs.</li>
</ul>
</li>
<li>Autoregressive models work on both continuous and discrete data.
<ul>
<li>Autoregressive sequential models have worked for audio (WaveNet), images
(PixelCNN++) and text (Transformer): these models are very flexible in the
kind of data that they can model.</li>
<li>Contrast this to GANs, which (as far as I’m aware) cannot model discrete
data.</li>
</ul>
</li>
<li>Autoregressive models are very amenable to conditioning.
<ul>
<li>There are many options for conditioning! You can condition on both
discrete and continuous variables; you can condition at multiple time
scales; you can even condition on latent embeddings or the outputs of
other neural networks.</li>
<li>There is one ostensible problem with using autoregressive models as
generative models: you can only condition on your data’s labels. I.e.
unlike a GAN, you cannot condition on random noise and expect the model
to shape the noise space into a semantically (stylistically) meaningful
latent space.</li>
<li>Google DeepMind followed up their original PixelRNN paper with <a href="https://arxiv.org/abs/1606.05328">another
paper</a> that describes one way to
overcome this problem. Briefly: to condition, they incorporate the latent
vector into the PixelCNN’s activation functions; to produce/learn the
latent vectors, they use a convolutional encoder; and to generate an
image given a latent vector, they replace the traditional deconvolutional
decoder with a conditional PixelCNN.</li>
<li>WaveNet goes even further and employs “global” and “local” conditioning
(both are achieved by incorporating the latent vectors into WaveNet’s
activation functions). The authors devise a battery of conditioning
schemes to capture speaker identity, linguistic features of input text,
music genre, musical instrument, etc.</li>
</ul>
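<p>Concretely, both the conditional PixelCNN and WaveNet inject the conditioning vector through a gated activation of the form <script type="math/tex">z = \tanh(W_f x + V_f h) \odot \sigma(W_g x + V_g h)</script>, where <script type="math/tex">h</script> is the conditioning vector (a label embedding, speaker identity, etc.). A minimal sketch with random toy weights:</p>

```python
import numpy as np

def gated_activation(x, h, Wf, Vf, Wg, Vg):
    """z = tanh(Wf x + Vf h) * sigmoid(Wg x + Vg h): the conditioning
    vector h is injected into both the filter and the gate."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    return np.tanh(Wf @ x + Vf @ h) * sigmoid(Wg @ x + Vg @ h)

rng = np.random.default_rng(0)
d, c = 4, 3                                   # feature and conditioning dims (toy)
x, h = rng.standard_normal(d), rng.standard_normal(c)
Wf, Wg = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Vf, Vg = rng.standard_normal((d, c)), rng.standard_normal((d, c))
z = gated_activation(x, h, Wf, Vf, Wg, Vg)    # shape (4,), entries in (-1, 1)
```

<p>In the papers the <script type="math/tex">W</script> terms are convolutions rather than dense matrices, but the way <script type="math/tex">h</script> enters the activation is the same.</p>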
</li>
<li>Generating output sequences of variable length is not straightforward.
<ul>
<li>Neither WaveNet nor PixelCNN needed to worry about a variable output
length: both audio and images are composed of a fixed number of outputs
(i.e. audio is just <script type="math/tex">N</script> samples, and images are just <script type="math/tex">N^2</script> pixels).</li>
<li>Text, on the other hand, is different: sentences can be of variable
length. One would think that this is a nail in a coffin, but thankfully
text is discrete: the standard trick is to have a “stop token” that
indicates that the sentence is finished (i.e. model a full stop as its own
token).</li>
<li>As far as I am aware, there is no prior literature on having both
problems: a variable-length output of continuous values.</li>
</ul>
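<p>The stop-token trick amounts to a small change in the generation loop: keep sampling until the model emits the stop token (with a hard cap as a safety net). A minimal sketch, where both the <code>&lt;eos&gt;</code> token name and the toy model are purely illustrative:</p>

```python
STOP = "<eos>"  # hypothetical stop token appended to every training sentence

def generate_until_stop(model, seed, max_len=100):
    """Run the autoregressive loop until the model emits the stop token
    (or a hard length cap is hit), yielding a variable-length output."""
    seq = list(seed)
    while len(seq) < max_len:
        nxt = model(seq)
        if nxt == STOP:
            break
        seq.append(nxt)
    return seq

# toy model, purely illustrative: repeats the last token until length 4, then stops
toy_model = lambda seq: STOP if len(seq) >= 4 else seq[-1]

out = generate_until_stop(toy_model, ["a", "b"])
# out == ["a", "b", "b", "b"]
```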
</li>
<li>Autoregressive models can model multiple timescales
<ul>
<li>In the case of music, there are important patterns to model at multiple
time scales: individual musical notes drive correlations between audio
samples at the millisecond scale, and music exhibits rhythmic patterns
over the course of minutes. This is well illustrated by the following
animation:</li>
</ul>
<figure>
<a href="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig1-Anim-160908-r01.gif"><img src="https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig1-Anim-160908-r01.gif" /></a>
<figcaption>Audio exhibits patterns at multiple timescales. Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Google DeepMind</a>.</figcaption>
</figure>
<ul>
<li>There are two main ways to capture patterns at these many
different time scales: either make the receptive field of your model
<em>extremely</em> wide (e.g. through dilated convolutions), or condition your
model on a subsampled version of your generated output, which is in turn
produced by an unconditioned model.
<ul>
<li>Google DeepMind composes an unconditional PixelRNN with one or more
conditional PixelRNNs to form a so-called “multi-scale” PixelRNN: the
first PixelRNN generates a lower-resolution image that conditions the
subsequent PixelRNNs.</li>
<li>WaveNet employs a similar technique, which the authors call “context stacks”.</li>
</ul>
</li>
</ul>
</li>
<li>How the hell can any of this stuff work?
<ul>
<li>RNNs are theoretically more expressive and powerful than autoregressive
models. However, recent work suggests that such infinite-horizon memory is
seldom achieved in practice.</li>
<li>To quote <a href="http://www.offconvex.org/2018/07/27/approximating-recurrent/">John Miller at the Berkeley AI Research
lab</a>:</li>
</ul>
<blockquote>
<p><strong>Recurrent models trained in practice are effectively feed-forward.</strong> This
could happen either because truncated backpropagation through time
cannot learn patterns significantly longer than <script type="math/tex">k</script> steps, or, more
provocatively, because models <em>trainable by gradient descent</em> cannot
have long-term memory.</p>
</blockquote>
</li>
</ol>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There’s actually a lot more nuance than meets the eye in this animation, but all I’m trying to illustrate is the feed-forward nature of autoregressive models. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I personally think it’s breathtaking that machines can do this. Imagine your phone keyboard’s word suggestions (those are autoregressive!) spitting out an entire novel. Or imagine knitting a sweater where you had to choose the color of every stitch, in order, in advance. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>In case you haven’t noticed, Google DeepMind seems to have had an infatuation with autoregressive models back in 2016. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoMy current project involves working with a class of fairly niche and interesting neural networks that aren't usually seen on a first pass through deep learning. I thought I'd write up my reading and research and post it.Modern Computational Methods for Bayesian Inference — A Reading List2019-01-02T00:00:00+00:002019-01-02T00:00:00+00:00https://eigenfoo.xyz/bayesian-inference-reading<p>Lately I’ve been troubled by how little I actually knew about how Bayesian
inference <em>really worked</em>. I could explain to you <a href="https://maria-antoniak.github.io/2018/11/19/data-science-crash-course.html">many other machine learning
techniques</a>,
but with Bayesian modelling… well, there’s a model (which is basically the
likelihood, I think?), and then there’s a prior, and then, um…</p>
<p>What actually happens when you run a sampler? What makes inference
“variational”? And what is this automatic differentiation doing in my
variational inference? <em>Cue long sleepless nights, contemplating my own
ignorance.</em></p>
<p>So to celebrate the new year<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, I compiled a list of things to read — blog
posts, journal papers, books, anything that would help me understand (or at
least, appreciate) the math and computation that happens when I press the <em>Magic
Inference Button™</em>. Again, this reading list isn’t focused on how to use
Bayesian modelling for a <em>specific</em> use case<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>; it’s focused on how modern
computational methods for Bayesian inference work <em>in general</em>.</p>
<p>So without further ado…</p>
<h2 id="markov-chain-monte-carlo">Markov-Chain Monte Carlo</h2>
<h3 id="for-the-uninitiated">For the uninitiated</h3>
<ol>
<li><a href="https://twiecki.github.io/blog/2015/11/10/mcmc-sampling/">MCMC Sampling for
Dummies</a> by Thomas
Wiecki. A basic introduction to MCMC with accompanying Python snippets. The
Metropolis sampler is used as an introduction to sampling.</li>
<li><a href="http://www.mcmchandbook.net/HandbookChapter1.pdf">Introduction to Markov Chain Monte
Carlo</a> by Charles Geyer.
The first chapter of the aptly-named <a href="http://www.mcmchandbook.net/"><em>Handbook of Markov Chain Monte
Carlo</em></a>.</li>
</ol>
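<p>To make the first reading concrete: the entire random-walk Metropolis algorithm fits in a dozen lines. This is a minimal sketch in the spirit of Wiecki’s post (not his actual code), targeting a one-dimensional standard normal known only up to a constant:</p>

```python
import math
import random

def metropolis(log_p, x0, n_samples, step=1.0):
    """Random-walk Metropolis: propose x' ~ Normal(x, step) and accept
    with probability min(1, p(x') / p(x))."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)
        # compare log-densities; the normalizing constant cancels out
        if math.log(random.random()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x)
    return samples

random.seed(0)
# target: standard normal, up to an (unknown) normalizing constant
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)   # should be close to 0
```

<p>That the normalizing constant cancels in the accept ratio is the whole reason MCMC works on unnormalized posteriors.</p>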
<h3 id="hamiltonian-monte-carlo-and-the-no-u-turn-sampler">Hamiltonian Monte Carlo and the No-U-Turn Sampler</h3>
<ol>
<li><a href="https://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html">Hamiltonian Monte Carlo
explained</a>.
A visual and intuitive explanation of HMC: great for starters.</li>
<li><a href="https://arxiv.org/abs/1701.02434">A Conceptual Introduction to Hamiltonian Monte
Carlo</a> by Michael Betancourt. An excellent
paper for a solid conceptual understanding and principled intuition for HMC.</li>
<li><a href="https://arxiv.org/abs/1111.4246">The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte
Carlo</a> by Matthew Hoffman and Andrew Gelman.
The original NUTS paper.</li>
<li><a href="http://www.mcmchandbook.net/HandbookChapter5.pdf">MCMC Using Hamiltonian
Dynamics</a> by Radford Neal.</li>
<li><a href="https://colindcarroll.com/talk/hamiltonian-monte-carlo/">Hamiltonian Monte Carlo in
PyMC3</a> by Colin
Carroll.</li>
</ol>
<h3 id="sequential-monte-carlo-and-particle-filters">Sequential Monte Carlo and particle filters</h3>
<ol>
<li><a href="https://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf">An Introduction to Sequential Monte Carlo
Methods</a>
by Arnaud Doucet, Nando de Freitas and Neil Gordon. This chapter from <a href="https://www.springer.com/us/book/9780387951461">the
authors’ textbook on SMC</a>
provides motivation for using SMC methods, and gives a brief introduction to
a basic particle filter.</li>
<li><a href="http://www.stats.ox.ac.uk/~doucet/smc_resources.html">Sequential Monte Carlo Methods & Particle Filters
Resources</a> by Arnaud
Doucet. A list of resources on SMC and particle filters: way more than you
probably ever need to know about them.</li>
</ol>
<h3 id="other-sampling-methods">Other sampling methods</h3>
<ol>
<li>Chapter 11 (Sampling Methods) of <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book">Pattern Recognition and Machine
Learning</a>
by Christopher Bishop. Covers rejection, importance, Metropolis-Hastings,
Gibbs and slice sampling. Perhaps not as rampantly useful as NUTS, but good
to know nevertheless.</li>
<li><a href="https://chi-feng.github.io/mcmc-demo/">The Markov-chain Monte Carlo Interactive
Gallery</a> by Chi Feng. A fantastic
library of visualizations of various MCMC samplers.</li>
</ol>
<h2 id="variational-inference">Variational Inference</h2>
<h3 id="for-the-uninitiated-1">For the uninitiated</h3>
<ol>
<li><a href="http://willwolf.io/2018/11/11/em-for-lda/">Deriving
Expectation-Maximization</a> by Will
Wolf. The first blog post in a series that builds from EM all the way to VI.
Also check out <a href="http://willwolf.io/2018/11/23/mean-field-variational-bayes/">Deriving Mean-Field Variational
Bayes</a>.</li>
<li><a href="https://arxiv.org/abs/1601.00670">Variational Inference: A Review for
Statisticians</a> by David Blei, Alp
Kucukelbir and Jon McAuliffe. A high-level overview of variational
inference: the authors go over one example (performing VI on GMMs) in depth.</li>
<li>Chapter 10 (Approximate Inference) of <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book">Pattern Recognition and Machine
Learning</a>
by Christopher Bishop.</li>
</ol>
<h3 id="automatic-differentiation-variational-inference-advi">Automatic differentiation variational inference (ADVI)</h3>
<ol>
<li><a href="https://arxiv.org/abs/1603.00788">Automatic Differentiation Variational
Inference</a> by Alp Kucukelbir, Dustin Tran
et al. The original ADVI paper.</li>
<li><a href="https://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">Automatic Variational Inference in
Stan</a>
by Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman and David Blei.</li>
</ol>
<h2 id="open-source-software-for-bayesian-inference">Open-Source Software for Bayesian Inference</h2>
<p>There are many open-source software libraries for Bayesian modelling and
inference, and it is instructive to look into the inference methods that they do
(or do not!) implement.</p>
<ol>
<li><a href="http://mc-stan.org/">Stan</a></li>
<li><a href="http://docs.pymc.io/">PyMC3</a></li>
<li><a href="http://pyro.ai/">Pyro</a></li>
<li><a href="https://www.tensorflow.org/probability/">Tensorflow Probability</a></li>
<li><a href="http://edwardlib.org/">Edward</a></li>
<li><a href="https://greta-stats.org/">Greta</a></li>
<li><a href="https://dotnet.github.io/infer/">Infer.NET</a></li>
<li><a href="https://www.mrc-bsu.cam.ac.uk/software/bugs/">BUGS</a></li>
<li><a href="http://mcmc-jags.sourceforge.net/">JAGS</a></li>
</ol>
<h2 id="further-topics">Further Topics</h2>
<p>Bayesian inference doesn’t stop at MCMC and VI: there is bleeding-edge research
being done on other methods of inference. While most of these methods aren’t yet
ready for widespread real-world use, it is interesting to see what they are.</p>
<h3 id="approximate-bayesian-computation-abc-and-likelihood-free-methods">Approximate Bayesian computation (ABC) and likelihood-free methods</h3>
<ol>
<li><a href="https://arxiv.org/abs/1001.2058">Likelihood-free Monte Carlo</a> by Scott
Sisson and Yanan Fan.</li>
</ol>
<h3 id="expectation-propagation">Expectation propagation</h3>
<ol>
<li><a href="https://arxiv.org/abs/1412.4869">Expectation propagation as a way of life: A framework for Bayesian inference
on partitioned data</a> by Aki Vehtari, Andrew
Gelman, et al.</li>
</ol>
<h3 id="operator-variational-inference-opvi">Operator variational inference (OPVI)</h3>
<ol>
<li><a href="https://arxiv.org/abs/1610.09033">Operator Variational Inference</a> by Rajesh
Ranganath, Jaan Altosaar, Dustin Tran and David Blei. The original OPVI
paper.</li>
</ol>
<p>(I’ve tried to include as many relevant and helpful resources as I could find,
but if you feel like I’ve missed something, <a href="https://twitter.com/@_eigenfoo">drop me a
line</a>!)</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://twitter.com/year_progress/status/1079889949871300608">Relevant tweet here.</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>If that’s what you’re looking for, check out my <a href="https://eigenfoo.xyz/bayesian-modelling-cookbook">Bayesian modelling cookbook</a> or <a href="https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html">Michael Betancourt’s excellent essay on a principled Bayesian workflow</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoAn annotated reading list on modern computational methods for Bayesian inference — Markov chain Monte Carlo (MCMC), variational inference (VI) and some other (more experimental) methods.Modelling Hate Speech on Reddit — A Three-Act Play (Slide Deck)2018-11-03T00:00:00+00:002017-11-08T00:00:00+00:00https://eigenfoo.xyz/reddit-slides<p>This is a follow-up post to my first post on a recent project to <a href="https://eigenfoo.xyz/reddit-clusters/">model hate
speech on Reddit</a>. If you haven’t taken a
look at my first post, please do!</p>
<p>I recently gave a talk on the technical, data science side of the project,
describing not just the final result, but also the trajectory of the whole
project: stumbling blocks, dead ends and all. Below is the slide deck, as well
as the speaker notes. Enjoy!</p>
<h2 id="abstract">Abstract</h2>
<p>Reddit is one of the most popular discussion websites today, and is famously
broad-minded in what it allows to be said on its forums: however, where there is
free speech, there are invariably pockets of hate speech.</p>
<p>In this talk, I present a recent project to model hate speech on Reddit. In
three acts, I chronicle the thought processes and stumbling blocks of the
project, with each act applying a different form of machine learning: supervised
learning, topic modelling and text clustering. I conclude with the current state
of the project: a system that allows the modelling and summarization of entire
subreddits, and possible future directions. Rest assured that both the talk and
the slides have been scrubbed to be safe for work!</p>
<h2 id="slides">Slides</h2>
<p>(Don’t forget to take a look at the speaker notes!)</p>
<style>
.responsive-wrap iframe{ max-width: 100%;}
</style>
<div class="responsive-wrap">
<!-- this is the embed code provided by Google -->
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vS9wBAwScepPz3vmvyMrq-osBfIGzL7C3wArXmL3ky_A2dfaqlVSshTz2CyHuMibQBX3Ej6QCsZ0qv_/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<!-- Google embed ends -->
</div>George HoA talk I gave about a recent project to model hate speech on Reddit. In this blog post, I describe the thought processes behind the project, and the stumbling blocks encountered along the way.Probabilistic and Bayesian Matrix Factorizations for Text Clustering2018-10-13T00:00:00+00:002018-10-13T00:00:00+00:00https://eigenfoo.xyz/matrix-factorizations<p>Natural language processing is in a curious place right now. It was always a
late bloomer (as far as machine learning subfields go), and it’s not immediately
obvious how close the field is to viable, large-scale, production-ready
techniques (in the same way that, say, <a href="https://clarifai.com/models/">computer vision
is</a>). For example, <a href="https://ruder.io">Sebastian
Ruder</a> predicted that the field is <a href="https://thegradient.pub/nlp-imagenet/">close to a watershed
moment</a>, and that soon we’ll have
downloadable language models. However, <a href="https://amarasovic.github.io/">Ana
Marasović</a> points out that there is <a href="https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/">a tremendous
amount of work demonstrating
that</a>:</p>
<blockquote>
<p>“despite good performance on benchmark datasets, modern NLP techniques are
nowhere near the skill of humans at language understanding and reasoning when
making sense of novel natural language inputs”.</p>
</blockquote>
<p>I am confident that I am <em>very</em> bad at making lofty predictions about the
future. Instead, I’ll talk about something I know a bit about: simple solutions
to concrete problems, with some Bayesianism thrown in for good measure
:grinning:.</p>
<p>This blog post summarizes some literature on probabilistic and Bayesian
matrix factorization methods, keeping an eye out for applications to one
specific task in NLP: text clustering. It’s exactly what it sounds like, and
there’s been a fair amount of success in applying text clustering to many other
NLP tasks (e.g. check out these examples in <a href="https://www-users.cs.umn.edu/~hanxx023/dmclass/scatter.pdf">document
organization</a>,
<a href="http://jmlr.csail.mit.edu/papers/volume3/bekkerman03a/bekkerman03a.pdf">corpus</a>
<a href="https://www.cs.technion.ac.il/~rani/el-yaniv-papers/BekkermanETW01.pdf">summarization</a>
and <a href="http://www.kamalnigam.com/papers/emcat-aaai98.pdf">document
classification</a>).</p>
<p>What follows is a literature review of three matrix factorization techniques for
machine learning: one classical, one probabilistic and one Bayesian. I also
experimented with applying these methods to text clustering: I gave a guest
lecture on my results to a graduate-level machine learning class at The Cooper
Union (the slide deck is below). Dive in!</p>
<h2 id="non-negative-matrix-factorization-nmf">Non-Negative Matrix Factorization (NMF)</h2>
<p>NMF is a <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization">very
well-known</a>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html">matrix
factorization</a>
<a href="https://arxiv.org/abs/1401.5226">technique</a>, perhaps most famous for its
applications in <a href="http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/">collaborative filtering and the Netflix
Prize</a>.</p>
<p>Factorize your (entrywise non-negative) <script type="math/tex">m \times n</script> matrix <script type="math/tex">V</script> as
<script type="math/tex">V = WH</script>, where <script type="math/tex">W</script> is <script type="math/tex">m \times p</script> and <script type="math/tex">H</script> is <script type="math/tex">p \times n</script>. <script type="math/tex">p</script>
is the dimensionality of your latent space, and each latent dimension usually
comes to quantify something with semantic meaning. There are several algorithms
to compute this factorization, but Lee and Seung’s <a href="https://dl.acm.org/citation.cfm?id=3008829">multiplicative update
rule</a> (originally published in NIPS
2000) is most popular.</p>
<p>Fairly simple: enough said, I think.</p>
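<p>As a concrete reference point, the Lee and Seung multiplicative updates are only a few lines of NumPy. This is a toy sketch on made-up data (for real work you would reach for e.g. scikit-learn’s <code>NMF</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((8, 6))       # entrywise non-negative data matrix (toy)
p = 3                        # dimensionality of the latent space
W = rng.random((8, p))
H = rng.random((p, 6))

eps = 1e-9                   # guard against division by zero
err_before = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Lee & Seung multiplicative updates for the Frobenius-norm objective
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_after = np.linalg.norm(V - W @ H)
```

<p>Because the updates only ever multiply by non-negative ratios, <script type="math/tex">W</script> and <script type="math/tex">H</script> stay non-negative for free — no projection step needed.</p>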
<h2 id="probabilistic-matrix-factorization-pmf">Probabilistic Matrix Factorization (PMF)</h2>
<p>Originally introduced as a paper at <a href="https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization">NIPS
2007</a>,
<em>probabilistic matrix factorization</em> is essentially the exact same model as NMF,
but with uncorrelated (a.k.a. “spherical”) multivariate Gaussian priors placed
on the rows of <script type="math/tex">U</script> and <script type="math/tex">V</script>. Expressed as a graphical model, PMF
would look like this:</p>
<figure>
<a href="/assets/images/pmf.png"><img style="float: middle" src="/assets/images/pmf.png" /></a>
</figure>
<p>Note that the priors are placed on the <em>rows</em> of the <script type="math/tex">U</script> and <script type="math/tex">V</script> matrices.</p>
<p>The authors then (somewhat disappointingly) proceed to find the MAP estimate of
the <script type="math/tex">U</script> and <script type="math/tex">V</script> matrices. They show that maximizing the posterior is
equivalent to minimizing the sum-of-squared-errors loss function with two
quadratic regularization terms:</p>
<script type="math/tex; mode=display">\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} {I_{ij} (R_{ij} - U_i^T V_j)^2} +
\frac{\lambda_U}{2} \sum_{i=1}^{N} \| U_i \|_{Fro}^2 +
\frac{\lambda_V}{2} \sum_{j=1}^{M} \| V_j \|_{Fro}^2</script>
<p>where <script type="math/tex">\| \cdot \|_{Fro}</script> denotes the Frobenius norm, <script type="math/tex">U_i</script> and <script type="math/tex">V_j</script> are the
<script type="math/tex">i</script>th and <script type="math/tex">j</script>th rows of <script type="math/tex">U</script> and <script type="math/tex">V</script>, and <script type="math/tex">I_{ij}</script> is 1 if document
<script type="math/tex">i</script> contains word <script type="math/tex">j</script>, and 0 otherwise.</p>
<p>This loss function can be minimized via gradient descent, and implemented in
your favorite deep learning framework (e.g. TensorFlow or PyTorch).</p>
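<p>As a sketch of what that looks like (not the authors’ code; the names, step size, and iteration count here are made up), gradient descent on this loss in plain NumPy might be:</p>

```python
import numpy as np

def pmf_map(R, I, p=2, lam=0.1, lr=0.01, n_iters=500):
    """MAP estimate for PMF: gradient descent on the regularized
    sum-of-squared-errors loss over the observed entries (I == 1)."""
    N, M = R.shape
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((N, p))  # rows U_i are the latent vectors
    V = 0.1 * rng.standard_normal((M, p))
    for _ in range(n_iters):
        E = I * (R - U @ V.T)              # residuals on observed entries only
        U += lr * (E @ V - lam * U)        # descend the loss (lam = lambda_U)
        V += lr * (E.T @ U - lam * V)      # descend the loss (lam = lambda_V)
    return U, V
```

<p>In a deep learning framework you would write down only the loss and let autodiff produce these gradients for you.</p>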
<p>The problem with this approach is that while the MAP estimate is often a
reasonable point in low dimensions, it becomes very strange in high dimensions,
and is usually not informative or special in any way. Read <a href="https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
for more.</p>
<h2 id="bayesian-probabilistic-matrix-factorization-bpmf">Bayesian Probabilistic Matrix Factorization (BPMF)</h2>
<p>Strictly speaking, PMF is not a Bayesian model. After all, there aren’t any
priors or posteriors, only fixed hyperparameters and a MAP estimate. <em>Bayesian
probabilistic matrix factorization</em>, originally published by <a href="https://dl.acm.org/citation.cfm?id=1390267">researchers from
the University of Toronto</a>, is a
fully Bayesian treatment of PMF.</p>
<p>Instead of saying that the rows/columns of U and V are normally distributed with
zero mean and some precision matrix, we place hyperpriors on the mean vector and
precision matrices. The specific hyperpriors are Wishart priors on the precision
matrices (with scale matrix <script type="math/tex">W_0</script> and <script type="math/tex">\nu_0</script> degrees of freedom), and
Gaussian priors on the means (with mean <script type="math/tex">\mu_0</script> and covariance proportional to
the inverse of the Wishart-distributed precision). Expressed as a graphical model, BPMF
would look like this:</p>
<figure>
<a href="/assets/images/bpmf.png"><img style="float: middle" src="/assets/images/bpmf.png" /></a>
</figure>
<p>Note that, as above, the priors are placed on the <em>rows</em> of the <script type="math/tex">U</script> and <script type="math/tex">V</script>
matrices, and that <script type="math/tex">n</script> is the dimensionality of latent space (i.e. the number
of latent dimensions in the factorization).</p>
<p>The authors then sample from the posterior distribution of <script type="math/tex">U</script> and <script type="math/tex">V</script> using
a Gibbs sampler. Sampling takes somewhere between 5 and 180 hours,
depending on how many samples you want. Nevertheless, the authors demonstrate
that BPMF can achieve more accurate and more robust results on the Netflix data
set.</p>
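<p>For concreteness, one forward draw from this Gaussian-Wishart hyperprior might look like the following sketch (the hyperparameter values here, <script type="math/tex">W_0 = I</script>, <script type="math/tex">\nu_0 = p</script>, <script type="math/tex">\mu_0 = 0</script>, and the unit scaling on the mean’s precision, are illustrative, not the paper’s):</p>

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
p = 2                                     # latent dimensionality
W0, nu0, mu0 = np.eye(p), p, np.zeros(p)  # illustrative hyperparameters
beta0 = 1.0                               # scaling on the mean's precision

# Draw a precision matrix for the rows of U from the Wishart hyperprior...
Lambda_U = wishart(df=nu0, scale=W0).rvs(random_state=1)
# ...then a mean vector whose covariance is the (scaled) inverse precision...
mu_U = rng.multivariate_normal(mu0, np.linalg.inv(beta0 * Lambda_U))
# ...and finally each row of U would be drawn i.i.d. from this Gaussian.
U_row = rng.multivariate_normal(mu_U, np.linalg.inv(Lambda_U))
```

<p>The Gibbs sampler alternates between re-drawing these hyperparameters and the rows of <script type="math/tex">U</script> and <script type="math/tex">V</script>, each conditioned on everything else.</p>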
<p>I would propose two changes to the original paper:</p>
<ol>
<li>Use an LKJ prior on the covariance matrices instead of a Wishart prior.
<a href="https://docs.pymc.io/notebooks/LKJ.html">According to Michael Betancourt and the PyMC3 docs, this is more numerically
stable</a>, and will lead to better
inference.</li>
<li>Use a more robust sampler such as NUTS (instead of a Gibbs sampler), or even
resort to variational inference. The paper makes it clear that BPMF is a
computationally painful endeavor, so any speedup to the method would be a
great help. It seems to me that for practical real-world applications to
collaborative filtering, we would want to use variational inference. Netflix
ain’t waiting 5 hours for their recommendations.</li>
</ol>
<h2 id="application-to-text-clustering">Application to Text Clustering</h2>
<p>Most of the work on these matrix factorization techniques focuses on
dimensionality reduction: that is, the problem of finding two factor matrices
that faithfully reconstruct the original matrix when multiplied together.
However, I was interested in applying the exact same techniques to a separate
task: text clustering.</p>
<p>A natural question is: why is matrix factorization<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> a good technique to use
for text clustering? Because it is simultaneously a clustering and a feature
engineering technique: not only does it offer us a latent representation of the
original data, but it also gives us a way to easily <em>reconstruct</em> the original
data from the latent variables! This is something that <a href="https://eigenfoo.xyz/lda-sucks">latent Dirichlet
allocation</a>, for instance, cannot do.</p>
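<p>A toy illustration of this double duty, on a made-up document-term count matrix (rank-2 factorization via Lee and Seung’s multiplicative updates):</p>

```python
import numpy as np

# Made-up document-term counts: docs 0-1 share vocabulary, as do docs 2-3.
V = np.array([[3., 2., 0., 0., 1.],
              [2., 3., 1., 0., 0.],
              [0., 0., 3., 2., 2.],
              [0., 1., 2., 3., 2.]])

rng = np.random.default_rng(0)
W = rng.random((4, 2))              # document-by-latent loadings
H = rng.random((2, 5))              # latent-by-term loadings
for _ in range(500):                # Lee & Seung multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

clusters = W.argmax(axis=1)         # clustering: dominant latent dimension
reconstruction = W @ H              # feature engineering: rebuild V from factors
```

<p>The last two lines are the whole point: the same pair of factor matrices gives you both a hard cluster assignment and a reconstruction of the original counts.</p>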
<p>Matrix factorization lives an interesting double life: clustering technique by
day, feature transformation technique by night. <a href="http://charuaggarwal.net/text-cluster.pdf">Aggarwal and
Zhai</a> suggest that chaining matrix
factorization with some other clustering technique (e.g. agglomerative
clustering or topic modelling) is common practice and is called <em>concept
decomposition</em>, but I haven’t seen any other source back this up.</p>
<p>I experimented with using these techniques to cluster subreddits (<a href="https://eigenfoo.xyz/reddit-clusters">sound
familiar?</a>). In a nutshell, nothing seemed
to work out very well, and I opine on why I think that’s the case in the slide
deck below. This talk was delivered to a graduate-level course in frequentist
machine learning. Don’t forget to take a look at the speaker notes too!</p>
<style>
.responsive-wrap iframe{ max-width: 100%;}
</style>
<div class="responsive-wrap">
<!-- this is the embed code provided by Google -->
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT_yB6dMJCnnwKRtkGbdx90lhYGGH329QAGrYw8SaR2mCh0VuocMWGEVJ2XhFNp44JQtPV_vOlQkslo/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<!-- Google embed ends -->
</div>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>which is, by the way, a <a href="http://scikit-learn.org/stable/modules/decomposition.html">severely underappreciated technique in machine learning</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoThis blog post summarizes some literature on probabilistic and Bayesian matrix factorization methods, keeping an eye out for applications to one specific task in NLP: text clustering.Multi-Armed Bandits, Conjugate Models and Bayesian Reinforcement Learning2018-08-31T00:00:00+00:002018-08-31T00:00:00+00:00https://eigenfoo.xyz/bayesian-bandits<p>Let’s talk about Bayesianism. It’s developed a reputation (not entirely
justified, but not entirely unjustified either) for being too mathematically
sophisticated or too computationally intensive to work at scale. For instance,
inferring from a Gaussian mixture model is fraught with computational problems
(hierarchical funnels, multimodal posteriors, etc.), and may take a seasoned
Bayesian anywhere between a day and a month to do well. On the other hand,
blunter estimation hammers, such as maximum likelihood, are easy: something
you could get a SQL query to do if you wanted to.</p>
<p>In this blog post I hope to show that there is more to Bayesianism than just
MCMC sampling and suffering, by demonstrating a Bayesian approach to a classic
reinforcement learning problem: the <em>multi-armed bandit</em>.</p>
<p>The problem is this: imagine a gambler at a row of slot machines (each machine
being a “one-armed bandit”), who must devise a strategy so as to maximize
rewards. This strategy includes which machines to play, how many times to play
each machine, in which order to play them, and whether to continue with the
current machine or try a different machine.</p>
<p>This problem is a central problem in decision theory and reinforcement learning:
the agent (our gambler) starts out in a state of ignorance, but learns through
interacting with its environment (playing slots). For more details, Cam
Davidson-Pilon has a great introduction to multi-armed bandits in Chapter 6 of
his book <a href="https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb"><em>Bayesian Methods for
Hackers</em></a>,
and Tor Lattimore and Csaba Szepesvári cover a breathtaking amount of the
underlying theory in their book <a href="http://banditalgs.com/"><em>Bandit Algorithms</em></a>.</p>
<p>So let’s get started! I assume that you are familiar with:</p>
<ul>
<li>some basic probability, at least enough to know some distributions: normal,
Bernoulli, binomial…</li>
<li>some basic Bayesian statistics, at least enough to understand what a
<a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> (and
conjugate model) is, and why one might like them.</li>
<li><a href="https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/">Python generators and the <code class="highlighter-rouge">yield</code>
keyword</a>,
to understand some of the code I’ve written<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</li>
</ul>
<p>Dive in!</p>
<h2 id="the-algorithm">The Algorithm</h2>
<p>The algorithm is straightforward. The description below is taken from Cam
Davidson-Pilon over at Data Origami<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>For each round,</p>
<ol>
<li>Sample a random variable <script type="math/tex">X_b</script> from the prior of bandit <script type="math/tex">b</script>, for all
<script type="math/tex">b</script>.</li>
<li>Select the bandit with largest sample, i.e. select bandit <script type="math/tex">B =
\text{argmax}(X_b)</script>.</li>
<li>Observe the result of pulling bandit <script type="math/tex">B</script>, and update your prior on bandit
<script type="math/tex">B</script> using the conjugate model update rule.</li>
<li>Repeat!</li>
</ol>
<p>What I find remarkable about this is how dumbfoundingly simple it is! No MCMC
sampling, no <script type="math/tex">\hat{R}</script>s to diagnose, no pesky divergences… all it requires is
a conjugate model, and the rest is literally just counting.</p>
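<p>The four steps fit in about a dozen lines of NumPy. This is my own sketch (using a flat <script type="math/tex">\text{Beta}(1, 1)</script> prior for brevity), not the code from the gists that follow:</p>

```python
import numpy as np

def thompson_sampling(probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling over len(probs) Bernoulli bandits."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(probs))
    trials = np.zeros(len(probs))
    for _ in range(n_rounds):
        # 1. Sample once from each bandit's Beta posterior (flat Beta(1, 1) prior).
        samples = rng.beta(1 + wins, 1 + trials - wins)
        # 2. Play the bandit with the largest sample...
        b = int(np.argmax(samples))
        # 3. ...observe the reward, and update that bandit's counts.
        wins[b] += rng.random() < probs[b]
        trials[b] += 1
    return wins, trials
```

<p>Note that exploration falls out for free: bandits with few pulls have wide posteriors, so they occasionally produce the largest sample and get played.</p>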
<p><strong>NB:</strong> This algorithm is technically known as <em>Thompson sampling</em>, and is only
one of many algorithms out there. The main difference is that there are other
ways to go from our current priors to a decision on which bandit to play
next. E.g. instead of simply sampling from our priors, we could use the
upper bound of the 90% credible region, or some dynamic quantile of the
posterior (as in Bayes UCB). See Data Origami<sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup> for more information.</p>
<h3 id="stochastic-aka-stationary-bandits">Stochastic (a.k.a. stationary) bandits</h3>
<p>Let’s take this algorithm for a spin! Assume we have rewards which are Bernoulli
distributed (this would be the situation we face when e.g. modelling
click-through rates). The conjugate prior for the Bernoulli distribution is the
Beta distribution (this is a special case of the Beta-Binomial model).</p>
<script src="https://gist.github.com/eigenfoo/3d8d318f5bd8fdea24f7b12936de77b5.js"></script>
<p>Here, <code class="highlighter-rouge">pull</code> returns the result of pulling on the <code class="highlighter-rouge">arm</code>‘th bandit, and
<code class="highlighter-rouge">make_bandits</code> is just a factory function for <code class="highlighter-rouge">pull</code>.</p>
<p>The <code class="highlighter-rouge">bayesian_strategy</code> function actually implements the algorithm. We only need
to keep track of the number of times we win and the number of times we played
(<code class="highlighter-rouge">num_rewards</code> and <code class="highlighter-rouge">num_trials</code>, respectively). It samples from all current
<code class="highlighter-rouge">np.random.beta</code> priors (where the original prior was a <script type="math/tex">\text{Beta}(2,
2)</script>, which is symmetric about 0.5 and explains the odd-looking <code class="highlighter-rouge">a=2+</code> and
<code class="highlighter-rouge">b=2+</code> there), picks the <code class="highlighter-rouge">np.argmax</code>, <code class="highlighter-rouge">pull</code>s that specific bandit, and updates
<code class="highlighter-rouge">num_rewards</code> and <code class="highlighter-rouge">num_trials</code>.</p>
<p>I’ve omitted the data visualization code here, but if you want to see it, check
out the <a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">Jupyter notebook on my
GitHub</a>.</p>
<figure>
<a href="/assets/images/beta-binomial.png"><img style="float: middle" src="/assets/images/beta-binomial.png" /></a>
</figure>
<h3 id="generalizing-to-conjugate-models">Generalizing to conjugate models</h3>
<p>In fact, this algorithm isn’t just limited to Bernoulli-distributed rewards: it
will work for any <a href="https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions">conjugate
model</a>!
Here I implement the Gamma-Poisson model (that is, Poisson distributed rewards,
with a Gamma conjugate prior) to illustrate how extensible this framework is.
(Who cares about Poisson distributed rewards, you ask? Anyone who worries about
returning customers, for one!)</p>
<p>Here’s what we need to change:</p>
<ul>
<li>The rewards distribution on line 5 (in practice, you don’t get to pick this,
so <em>technically</em> there’s nothing to change if you’re doing this in
production!)</li>
<li>The sampling from the prior on lines 17–18</li>
<li>The variables you need to keep track of and update rule on lines 12–13 and
24–25.</li>
</ul>
<p>Without further ado:</p>
<script src="https://gist.github.com/eigenfoo/e9a9933d94524e6dee717276c6b6f732.js"></script>
<figure>
<a href="/assets/images/gamma-poisson.png"><img style="float: middle" src="/assets/images/gamma-poisson.png" /></a>
</figure>
<p>This really demonstrates how lean and mean conjugate models can be, especially
considering how much of a pain MCMC or approximate inference methods would be,
compared to literal <em>counting</em>. Conjugate models aren’t just textbook examples:
they’re <em>(gasp)</em> actually useful!</p>
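<p>For reference, here is how the same Thompson-sampling loop might look with the Gamma-Poisson model; the only differences are exactly the three bullet points above (again, this is my own sketch, not the gist):</p>

```python
import numpy as np

def gamma_poisson_bandits(rates, n_rounds=2000, seed=0):
    """Thompson sampling with Poisson rewards and Gamma(1, 1) conjugate priors."""
    rng = np.random.default_rng(seed)
    reward_sums = np.zeros(len(rates))   # sufficient statistic: total rewards
    trials = np.zeros(len(rates))        # sufficient statistic: total pulls
    for _ in range(n_rounds):
        # Posterior is Gamma(1 + sum of rewards, rate 1 + pulls). NumPy's
        # gamma() is parameterized by scale, i.e. 1 / rate.
        samples = rng.gamma(1 + reward_sums, 1 / (1 + trials))
        b = int(np.argmax(samples))
        reward_sums[b] += rng.poisson(rates[b])
        trials[b] += 1
    return reward_sums, trials
```
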
<h3 id="generalizing-to-arbitrary-rewards-distributions">Generalizing to arbitrary rewards distributions</h3>
<p>OK, so if we have a conjugate model, we can use Thompson sampling to solve the
multi-armed bandit problem. But what if our rewards distribution doesn’t have a
conjugate prior, or what if we don’t even <em>know</em> our rewards distribution?</p>
<p>In general this problem is very difficult to solve. Theoretically, we could
place some fairly uninformative prior on our rewards, and after every pull we
could run MCMC to get our posterior, but that doesn’t scale, especially for the
online algorithms that we have in mind. Luckily a recent paper by Agrawal and
Goyal<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> gives us some help, <em>if we assume rewards are bounded on the interval
<script type="math/tex">[0, 1]</script></em> (of course, if we have bounded rewards, then we can just normalize
them by their maximum value to get rewards between 0 and 1).</p>
<p>This solution bootstraps the first Beta-Bernoulli model to this new situation.
Here’s what happens:</p>
<ol>
<li>Sample a random variable <script type="math/tex">X_b</script> from the (Beta) prior of bandit <script type="math/tex">b</script>, for
all <script type="math/tex">b</script>.</li>
<li>Select the bandit with largest sample, i.e. select bandit <script type="math/tex">B =
\text{argmax}(X_b)</script>.</li>
<li>Observe the reward <script type="math/tex">R</script> from bandit <script type="math/tex">B</script>.</li>
<li><strong>Observe the outcome <script type="math/tex">r</script> from a Bernoulli trial with probability of success <script type="math/tex">R</script>.</strong></li>
<li>Update posterior of <script type="math/tex">B</script> with this observation <script type="math/tex">r</script>.</li>
<li>Repeat!</li>
</ol>
<p>Here I do this for the logit-normal distribution (i.e. a random variable whose
logit is normally distributed). Note that <code class="highlighter-rouge">scipy.special.expit</code> is the inverse of the logit
function.</p>
<script src="https://gist.github.com/eigenfoo/7a397fef8aaa028c5119c9f86860d72e.js"></script>
<figure>
<a href="/assets/images/bounded.png"><img style="float: middle" src="/assets/images/bounded.png" /></a>
</figure>
<h2 id="final-remarks">Final Remarks</h2>
<p>None of this theory is new: I’m just advertising it :blush:. See Cam
Davidson-Pilon’s great blog post about Bayesian bandits<sup id="fnref:2:2"><a href="#fn:2" class="footnote">2</a></sup> for a much more
in-depth treatment, and of course, read around papers on arXiv if you want to go
deeper!</p>
<p>Also, if you want to see all the code that went into this blog post, check out
<a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">the notebook
here</a>.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’ve hopped on board the functional programming bandwagon, and couldn’t help but think that to demonstrate this idea, I didn’t need a framework, a library or even a class. Just two functions! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Davidson-Pilon, Cameron. “Multi-Armed Bandits.” DataOrigami, 6 Apr. 2013, <a href="https://dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits">dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits</a> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote">↩<sup>3</sup></a></p>
</li>
<li id="fn:3">
<p><a href="https://arxiv.org/abs/1111.1797">arXiv:1111.1797</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George HoIn this blog post I hope to show that there is more to Bayesianism than just MCMC sampling and suffering, by demonstrating a Bayesian approach to a classic reinforcement learning problem: the _multi-armed bandit_.Cookbook — Bayesian Modelling with PyMC32018-06-19T00:00:00+00:002018-06-24T00:00:00+00:00https://eigenfoo.xyz/bayesian-modelling-cookbook<p>Recently I’ve started using <a href="https://github.com/pymc-devs/pymc3">PyMC3</a> for
Bayesian modelling, and it’s an amazing piece of software! The API only exposes
as much of the heavy machinery of MCMC as you need — by which I mean, just the
<code class="highlighter-rouge">pm.sample()</code> method (a.k.a., as <a href="http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/">Thomas
Wiecki</a> puts it, the
<em>Magic Inference Button™</em>). This really frees up your mind to think about your
data and model, which is really the heart and soul of data science!</p>
<p>That being said however, I quickly realized that the water gets very deep very
fast: I explored my data set, specified a hierarchical model that made sense to
me, hit the <em>Magic Inference Button™</em>, and… uh, what now? I blinked at the
angry red warnings the sampler spat out.</p>
<p>So began my long, rewarding and ongoing exploration of Bayesian modelling. This
is a compilation of notes, tips, tricks and recipes that I’ve collected from
everywhere: papers, documentation, peppering my <a href="https://twitter.com/twiecki">more
experienced</a>
<a href="https://twitter.com/aseyboldt">colleagues</a> with questions. It’s still very much
a work in progress, but hopefully somebody else finds it useful!</p>
<p><img style="float: middle" width="600" src="https://cdn.rawgit.com/pymc-devs/pymc3/master/docs/logos/svg/PyMC3_banner.svg" /></p>
<h2 id="for-the-uninitiated">For the Uninitiated</h2>
<ul>
<li>First of all, <em>welcome!</em> It’s a brave new world out there — where statistics
is cool, Bayesian and (if you’re lucky) even easy. Dive in!</li>
</ul>
<h3 id="bayesian-modelling">Bayesian modelling</h3>
<ul>
<li>
<p>If you don’t know any probability, I’d recommend <a href="https://betanalpha.github.io/assets/case_studies/probability_theory.html">Michael
Betancourt’s</a>
crash-course in practical probability theory.</p>
</li>
<li>
<p>For an introduction to general Bayesian methods and modelling, I really liked
<a href="http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/">Cam Davidson Pilon’s <em>Bayesian Methods for
Hackers</em></a>:
it really made the whole “thinking like a Bayesian” thing click for me.</p>
</li>
<li>
<p>If you’re willing to spend some money, I’ve heard that <a href="https://sites.google.com/site/doingbayesiandataanalysis/"><em>Doing Bayesian Data
Analysis</em> by
Kruschke</a> (a.k.a.
<em>“the puppy book”</em>) is for the bucket list.</p>
</li>
<li>
<p>Here we come to a fork in the road. The central problem in Bayesian modelling
is this: given data and a probabilistic model that we think models this data,
how do we find the posterior distribution of the model’s parameters? There are
currently two good solutions to this problem. One is Markov-chain Monte Carlo
sampling (a.k.a. MCMC sampling), and the other is variational inference
(a.k.a. VI). Both methods are mathematical Death Stars: extremely powerful but
incredibly complicated. Nevertheless, I think it’s important to get at least a
hand-wavy understanding of what these methods are. If you’re new to all this,
my personal recommendation is to invest your time in learning MCMC: it’s been
around longer, we know that there are sufficiently robust tools to help you,
and there’s a lot more support/documentation out there.</p>
</li>
</ul>
<h3 id="markov-chain-monte-carlo">Markov-chain Monte Carlo</h3>
<ul>
<li>
<p>For a good high-level introduction to MCMC, I liked <a href="https://www.youtube.com/watch?v=DJ0c7Bm5Djk&feature=youtu.be&t=4h40m9s">Michael Betancourt’s
StanCon 2017
talk</a>:
especially the first few minutes where he provides a motivation for MCMC, that
really put all this math into context for me.</p>
</li>
<li>
<p>For a more in-depth (and mathematical) treatment of MCMC, I’d check out his
<a href="https://arxiv.org/abs/1701.02434">paper on Hamiltonian Monte Carlo</a>.</p>
</li>
</ul>
<h3 id="variational-inference">Variational inference</h3>
<ul>
<li>
<p>VI has been around for a while, but it was only in 2017 (2 years ago, at the
time of writing) that <em>automatic differentiation variational inference</em> was
invented. As such, variational inference is undergoing a renaissance and is
currently an active area of statistical research. Since it’s such a nascent
field, most resources on it are very theoretical and academic in nature.</p>
</li>
<li>
<p>Chapter 10 (on approximate inference) in Bishop’s <em>Pattern Recognition and
Machine Learning</em> and <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">this
tutorial</a>
by David Blei are excellent, if a bit mathematically-intensive, resources.</p>
</li>
<li>
<p>The most hands-on explanation of variational inference I’ve seen is the docs
for <a href="http://pyro.ai/examples/svi_part_i.html">Pyro</a>, a probabilistic
programming language developed by Uber that specializes in variational
inference.</p>
</li>
</ul>
<h2 id="model-formulation">Model Formulation</h2>
<ul>
<li>
<p>Try thinking about <em>how</em> your data would be generated: what kind of machine
has your data as outputs? This will help you both explore your data, as well
as help you arrive at a reasonable model formulation.</p>
</li>
<li>
<p>Try to avoid correlated variables. Some of the more robust samplers (<strong>cough</strong>
NUTS <strong>cough cough</strong>) can cope with <em>a posteriori</em> correlated random
variables, but sampling is much easier for everyone involved if the variables
are uncorrelated. By the way, the bar is pretty low here: if the
jointplot/scattergram of the two variables looks like an ellipse, that’s
usually okay. It’s when the ellipse starts looking like a line that you should
be alarmed.</p>
</li>
<li>
<p>Try to avoid discrete latent variables, and discrete parameters in general.
There is no good method to sample them in a smart way (since discrete
parameters have no gradients); and with “naïve” samplers (i.e. those that do
not take advantage of the gradient), the number of samples one needs to make
good inferences generally scales exponentially in the number of parameters.
For an instance of this, see <a href="https://docs.pymc.io/notebooks/marginalized_gaussian_mixture_model.html">this example on marginal Gaussian
mixtures</a>.</p>
</li>
<li>
<p>The <a href="https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations">Stan GitHub
wiki</a> has
some excellent recommendations on how to choose good priors. Once you get a
good handle on the basics of using PyMC3, I <em>100% recommend</em> reading this wiki
from start to end: the Stan community has fantastic resources on Bayesian
statistics, and even though their APIs are quite different, the mathematical
theory all translates over.</p>
</li>
</ul>
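<p>The “what kind of machine has your data as outputs” advice above can be made concrete with a quick forward simulation (all numbers here are invented): if samples from your imagined machine look nothing like your real data, your model formulation is off.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# An imagined data-generating machine: a population-level distribution
# produces group-level means, which in turn produce the observations.
mu, tau, sigma = 0.0, 1.0, 0.5          # invented hyperparameters
n_groups, n_obs = 4, 100
group_means = rng.normal(mu, tau, size=n_groups)
data = rng.normal(group_means, sigma, size=(n_obs, n_groups))
# Compare histograms of `data` against your real data set.
```

<p>This is exactly a prior predictive check, done by hand.</p>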
<h3 id="hierarchical-models">Hierarchical models</h3>
<ul>
<li>
<p>First of all, hierarchical models are amazing! <a href="https://docs.pymc.io/notebooks/GLM-hierarchical.html">The PyMC3
docs</a> opine on this at
length, so let’s not waste any digital ink.</p>
</li>
<li>
<p>The poster child of a Bayesian hierarchical model looks something like this
(equations taken from
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Wikipedia</a>):</p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/765f37f86fa26bef873048952dccc6e8067b78f4" /></p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/ca8c0e1233fd69fa4325c6eacf8462252ed6b00a" /></p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/1e56b3077b1b3ec867d6a0f2539ba9a3e79b45c1" /></p>
<p>This hierarchy has 3 levels (some would say it has 2 levels, since there are
only 2 levels of parameters to infer, but honestly whatever: by my count there
are 3). 3 levels is fine, but add any more levels, and it becomes harder
to sample. Try out a taller hierarchy to see if it works, but err on the side
of 3-level hierarchies.</p>
</li>
<li>
<p>If your hierarchy is too tall, you can truncate it by introducing a
deterministic function of your parameters somewhere (this usually turns out to
just be a sum). For example, instead of modelling your observations as drawn
from a 4-level hierarchy, maybe your observations can be modeled as the sum of
three parameters, where these parameters are drawn from a 3-level hierarchy.</p>
</li>
<li>
<p>More in-depth treatment here in <a href="https://arxiv.org/abs/1312.0906">(Betancourt and Girolami,
2013)</a>. <strong>tl;dr:</strong> hierarchical models all
but <em>require</em> you use to use Hamiltonian Monte Carlo; also included are some
practical tips and goodies on how to do that stuff in the real world.</p>
</li>
</ul>
<h2 id="model-implementation">Model Implementation</h2>
<ul>
<li>
<p>At the risk of overgeneralizing, there are only two things that can go wrong
in Bayesian modelling: either your data is wrong, or your model is wrong. And
it is a hell of a lot easier to debug your data than it is to debug your
model. So before you even try implementing your model, plot histograms of your
data, count the number of data points, drop any NaNs, etc. etc.</p>
</li>
<li>
<p>PyMC3 has one quirky piece of syntax, which I tripped up on for a while. It’s
described quite well in <a href="http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/#comment-2213376737">this comment on Thomas Wiecki’s
blog</a>.
Basically, suppose you have several groups, and want to initialize several
variables per group, but you want to initialize different numbers of variables
for each group. Then you need to use the quirky <code class="highlighter-rouge">variables[index]</code>
notation. I suggest using <code class="highlighter-rouge">scikit-learn</code>’s <code class="highlighter-rouge">LabelEncoder</code> to easily create the
index. For example, to make normally distributed heights for the iris dataset:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Different numbers of examples for each species</span>
<span class="n">species</span> <span class="o">=</span> <span class="p">(</span><span class="mi">48</span> <span class="o">*</span> <span class="p">[</span><span class="s">'setosa'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">52</span> <span class="o">*</span> <span class="p">[</span><span class="s">'virginica'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">63</span> <span class="o">*</span> <span class="p">[</span><span class="s">'versicolor'</span><span class="p">])</span>
<span class="n">num_species</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">species</span><span class="p">)))</span> <span class="c"># 3</span>
<span class="c"># One variable per group </span>
<span class="n">heights_per_species</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">'heights_per_species'</span><span class="p">,</span>
<span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="n">num_species</span><span class="p">)</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">sklearn</span><span class="o">.</span><span class="n">preprocessing</span><span class="o">.</span><span class="n">LabelEncoder</span><span class="p">()</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">species</span><span class="p">)</span>
<span class="n">heights</span> <span class="o">=</span> <span class="n">heights_per_species</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</code></pre></div> </div>
</li>
<li>
<p>You might find yourself in a situation in which you want to use a centered
parameterization for a portion of your data set, but a noncentered
parameterization for the rest of your data set (see below for what these
parameterizations are). There’s a useful idiom for you here:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_xs</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">use_centered</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> <span class="c"># len(use_centered) = num_xs</span>
<span class="n">x_sd</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">HalfCauchy</span><span class="p">(</span><span class="s">'x_sd'</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x_raw</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">'x_raw'</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="n">x_sd</span><span class="o">**</span><span class="n">use_centered</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="n">num_xs</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="n">x_sd</span><span class="o">**</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">use_centered</span><span class="p">)</span> <span class="o">*</span> <span class="n">x_raw</span><span class="p">)</span>
</code></pre></div> </div>
<p>You could even experiment with allowing <code class="highlighter-rouge">use_centered</code> to be <em>between</em> 0 and
1, instead of being <em>either</em> 0 or 1!</p>
</li>
<li>
<p>I prefer to use the <code class="highlighter-rouge">pm.Deterministic</code> function instead of simply using normal
arithmetic operations (e.g. I’d prefer to write <code class="highlighter-rouge">x = pm.Deterministic('x', y +
z)</code> instead of <code class="highlighter-rouge">x = y + z</code>). This means that you can index the <code class="highlighter-rouge">trace</code> object
later on with just <code class="highlighter-rouge">trace['x']</code>, instead of having to compute it yourself with
<code class="highlighter-rouge">trace['y'] + trace['z']</code>.</p>
</li>
</ul>
<h2 id="mcmc-initialization-and-sampling">MCMC Initialization and Sampling</h2>
<ul>
<li>
<p>Have faith in PyMC3’s default initialization and sampling settings: someone
much more experienced than us took the time to choose them! NUTS is the most
efficient MCMC sampler known to man, and <code class="highlighter-rouge">jitter+adapt_diag</code>… well, you get
the point.</p>
</li>
<li>
<p>However, if you’re truly grasping at straws, the more powerful initialization
setting would be <code class="highlighter-rouge">advi</code> or <code class="highlighter-rouge">advi+adapt_diag</code>, which uses variational
inference to initialize the sampler. An even better option would be to use
<code class="highlighter-rouge">advi+adapt_diag_grad</code>, which is (at the time of writing) an experimental
feature in beta.</p>
</li>
<li>
<p>Never initialize the sampler with the MAP estimate! In low dimensional
problems the MAP estimate (a.k.a. the mode of the posterior) is often quite a
reasonable point. But in high dimensions, the MAP becomes very strange. Check
out <a href="http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
on high-dimensional Gaussians to see why. Besides, at the MAP all the derivatives
of the posterior are zero, and that isn’t great for derivative-based samplers.</p>
</li>
</ul>
<h2 id="mcmc-trace-diagnostics">MCMC Trace Diagnostics</h2>
<ul>
<li>You’ve hit the <em>Magic Inference Button™</em>, and you have a <code class="highlighter-rouge">trace</code> object. Now
what? First of all, make sure that your sampler didn’t barf itself, and that
your chains are safe for consumption (i.e., analysis).</li>
</ul>
<ol>
<li>
<p>Run the chain for as long as you have the patience or resources for. Make
sure that the <code class="highlighter-rouge">tune</code> parameter increases commensurately with the <code class="highlighter-rouge">draws</code>
parameter.</p>
</li>
<li>
<p>Check for divergences. PyMC3’s sampler will spit out a warning if there are
diverging chains, but the following code snippet may make things easier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Display the total number and percentage of divergent chains</span>
<span class="n">diverging</span> <span class="o">=</span> <span class="n">trace</span><span class="p">[</span><span class="s">'diverging'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Number of Divergent Chains: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">diverging</span><span class="o">.</span><span class="n">nonzero</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span><span class="p">))</span>
<span class="n">diverging_perc</span> <span class="o">=</span> <span class="n">diverging</span><span class="o">.</span><span class="n">nonzero</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">trace</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Percentage of Divergent Chains: {:.1f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">diverging_perc</span><span class="p">))</span>
</code></pre></div> </div>
</li>
<li>
<p>Check the traceplot (<code class="highlighter-rouge">pm.traceplot(trace)</code>). You’re looking for traceplots
that look like “fuzzy caterpillars”. If the trace moves into some region and
stays there for a long time (a.k.a. there are some “sticky regions”), that’s
cause for concern! That indicates that once the sampler moves into some
region of parameter space, it gets stuck there (probably due to high
curvature or other bad topological properties).</p>
</li>
<li>
<p>In addition to the traceplot, there are <a href="https://docs.pymc.io/api/plots.html">a ton of other
plots</a> you can make with your trace:</p>
<ul>
<li><code class="highlighter-rouge">pm.plot_posterior(trace)</code>: check if your posteriors look reasonable.</li>
<li><code class="highlighter-rouge">pm.forestplot(trace)</code>: check if your variables have reasonable credible
intervals, and Gelman–Rubin scores close to 1.</li>
<li><code class="highlighter-rouge">pm.autocorrplot(trace)</code>: check if your chains are impaired by high
autocorrelation. Also remember that thinning your chains is a waste of
time at best, and deluding yourself at worst. See Chris Fonnesbeck’s
comment on <a href="https://github.com/pymc-devs/pymc/issues/23">this GitHub
issue</a> and <a href="https://twitter.com/junpenglao/status/1009748562136256512">Junpeng Lao’s
reply to Michael Betancourt’s
tweet</a></li>
<li><code class="highlighter-rouge">pm.energyplot(trace)</code>: ideally the energy and marginal energy
distributions should look very similar. Long tails in the distribution of
energy levels indicate deteriorated sampler efficiency.</li>
<li><code class="highlighter-rouge">pm.densityplot(trace)</code>: a souped-up version of <code class="highlighter-rouge">pm.plot_posterior</code>. It
doesn’t seem to be wildly useful unless you’re plotting posteriors from
multiple models.</li>
</ul>
</li>
<li>PyMC3 has a nice helper function to pretty-print a summary table of the
trace: <code class="highlighter-rouge">pm.summary(trace)</code> (I usually tack on a <code class="highlighter-rouge">.round(2)</code> for my sanity).
Look out for:
<ul>
<li>the <script type="math/tex">\hat{R}</script> values (a.k.a. the Gelman–Rubin statistic, a.k.a. the
potential scale reduction factor, a.k.a. the PSRF): are they all close to
1? If not, something is <em>horribly</em> wrong. Consider respecifying or
reparameterizing your model. You can also inspect these in the forest plot.</li>
<li>the sign and magnitude of the inferred values: do they make sense, or are
they unexpected and unreasonable? This could indicate a poorly specified
model. (E.g. parameters of the unexpected sign that have low uncertainties
might indicate that your model needs interaction terms.)</li>
</ul>
</li>
<li>
<p>As a drastic debugging measure, try to <code class="highlighter-rouge">pm.sample</code> with <code class="highlighter-rouge">draws=1</code>,
<code class="highlighter-rouge">tune=500</code>, and <code class="highlighter-rouge">discard_tuned_samples=False</code>, and inspect the traceplot.
During the tuning phase, we don’t expect to see friendly fuzzy caterpillars,
but we <em>do</em> expect to see good (if noisy) exploration of parameter space. So
if the sampler is getting stuck during the tuning phase, that might explain
why the trace looks horrible.</p>
</li>
<li>
<p>If you get scary errors that describe mathematical problems (e.g. <code class="highlighter-rouge">ValueError:
Mass matrix contains zeros on the diagonal. Some derivatives might always be
zero.</code>), then you’re <del>shit out of luck</del> exceptionally unlucky: those kinds of
errors are notoriously hard to debug. I can only point to the <a href="http://andrewgelman.com/2008/05/13/the_folk_theore/">Folk Theorem of
Statistical Computing</a>:</p>
<blockquote>
<p>If you’re having computational problems, probably your model is wrong.</p>
</blockquote>
</li>
</ol>
<h3 id="fixing-divergences">Fixing divergences</h3>
<blockquote>
<p><code class="highlighter-rouge">There were N divergences after tuning. Increase 'target_accept' or reparameterize.</code></p>
<p>— The <em>Magic Inference Button™</em></p>
</blockquote>
<ul>
<li>
<p>Divergences in HMC occur when the sampler finds itself in regions of extremely
high curvature (such as the opening of a hierarchical funnel). Broadly
speaking, the sampler is prone to malfunction in such regions, causing it to
fly off towards infinity. This ruins the chains by heavily
biasing the samples.</p>
</li>
<li>
<p>Remember: if you have even <em>one</em> diverging chain, you should be worried.</p>
</li>
<li>
<p>Increase <code class="highlighter-rouge">target_accept</code>: usually 0.9 is a good number (currently the default
in PyMC3 is 0.8). This will help get rid of false positives from the test for
divergences. However, divergences that <em>don’t</em> go away are cause for alarm.</p>
</li>
<li>
<p>Increasing <code class="highlighter-rouge">tune</code> can sometimes help as well: this gives the sampler more time
to 1) find the typical set and 2) find good values for step sizes, scaling
factors, etc. If you’re running into divergences, it’s always possible that
the sampler just hasn’t started the mixing phase and is still trying to find
the typical set.</p>
</li>
<li>
<p>Consider a <em>noncentered</em> parameterization. This is an amazing trick: it all boils down
to the familiar equation <script type="math/tex">X = \sigma Z + \mu</script> from STAT 101, but it honestly
works wonders. See <a href="http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/">Thomas Wiecki’s blog
post</a>
on it, and <a href="https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html">this page from the PyMC3
documentation</a>.</p>
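<p>If the equation alone doesn’t convince you, here’s a quick sanity check (plain NumPy, not PyMC3; purely illustrative): sampling <code class="highlighter-rouge">z</code> from a standard normal and computing <code class="highlighter-rouge">mu + sigma * z</code> gives the same distribution as sampling directly from <code class="highlighter-rouge">Normal(mu, sigma)</code>.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 3.0, 2.0, 100_000

# Centered: sample x directly from N(mu, sigma^2)
x_centered = rng.normal(mu, sigma, size=n)

# Noncentered: sample z from N(0, 1), then shift and scale
z = rng.normal(0.0, 1.0, size=n)
x_noncentered = mu + sigma * z

# Both recover the same distribution (up to Monte Carlo error)
print(x_centered.mean(), x_noncentered.mean())  # both ~3.0
print(x_centered.std(), x_noncentered.std())    # both ~2.0
```

<p>The payoff in a hierarchical model is that the sampler only ever explores the friendly standard-normal geometry of <code class="highlighter-rouge">z</code>; the scaling and shifting happen downstream in a deterministic transformation.</p>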
</li>
<li>
<p>If that doesn’t work, there may be something wrong with the way you’re
thinking about your data: consider reparameterizing your model, or
respecifying it entirely.</p>
</li>
</ul>
<h3 id="other-common-warnings">Other common warnings</h3>
<ul>
<li>
<p>It’s worth noting that far and away the worst warning to get is the one about
divergences. While a divergent chain indicates that your inference may be
flat-out <em>invalid</em>, the rest of these warnings indicate that your inference is
merely (lol, “merely”) <em>inefficient</em>.</p>
</li>
<li><code class="highlighter-rouge">The number of effective samples is smaller than XYZ for some parameters.</code>
<ul>
<li>Quoting <a href="https://discourse.pymc.io/t/the-number-of-effective-samples-is-smaller-than-25-for-some-parameters/1050/3">Junpeng Lao on
discourse.pymc.io</a>:
“A low number of effective samples is usually an indication of strong
autocorrelation in the chain.”</li>
<li>Make sure you’re using an efficient sampler like NUTS. (And not, for
instance, Metropolis–Hastings. (I mean seriously, it’s the 21st century, why
would you ever want Metropolis–Hastings?))</li>
<li>Tweak the acceptance probability (<code class="highlighter-rouge">target_accept</code>) — it should be large
enough to ensure good exploration, but small enough to not reject all
proposals and get stuck.</li>
</ul>
</li>
<li><code class="highlighter-rouge">The gelman-rubin statistic is larger than XYZ for some parameters. This
indicates slight problems during sampling.</code>
<ul>
<li>When PyMC3 samples, it runs several chains in parallel. Loosely speaking,
the Gelman–Rubin statistic measures how similar these chains are. Ideally it
should be close to 1.</li>
<li>Increasing the <code class="highlighter-rouge">tune</code> parameter may help, for the same reasons as described
in the <em>Fixing Divergences</em> section.</li>
</ul>
</li>
<li><code class="highlighter-rouge">The chain reached the maximum tree depth. Increase max_treedepth, increase
target_accept or reparameterize.</code>
<ul>
<li>NUTS puts a cap on the depth of the trees that it evaluates during each
iteration, which is controlled through the <code class="highlighter-rouge">max_treedepth</code> parameter. Reaching the maximum
allowable tree depth indicates that NUTS is prematurely pulling the plug to
avoid excessive compute time.</li>
<li>Yeah, what the <em>Magic Inference Button™</em> says: try increasing
<code class="highlighter-rouge">max_treedepth</code> or <code class="highlighter-rouge">target_accept</code>.</li>
</ul>
</li>
</ul>
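<p>For intuition on what the Gelman–Rubin statistic mentioned above actually measures, here’s a bare-bones (non-split) version in plain NumPy. PyMC3’s built-in version is more carefully implemented, so treat this as an illustration rather than a reimplementation:</p>

```python
import numpy as np

def gelman_rubin(chains):
    """Basic (non-split) Gelman-Rubin R-hat for an (n_chains, n_draws) array."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled posterior variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(4, 1000))          # four well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [5.0]])  # one chain stuck elsewhere
print(gelman_rubin(good))  # close to 1
print(gelman_rubin(bad))   # far above 1
```

<p>If the chains disagree about where the posterior mass is (as in <code class="highlighter-rouge">bad</code>), the between-chain variance dwarfs the within-chain variance and R-hat blows up.</p>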
<h3 id="model-reparameterization">Model reparameterization</h3>
<ul>
<li>
<p>Countless warnings have told you to engage in this strange activity of
“reparameterization”. What even is that? Luckily, the <a href="https://github.com/stan-dev/stan/releases/download/v2.17.1/stan-reference-2.17.1.pdf">Stan User
Manual</a>
(specifically the <em>Reparameterization and Change of Variables</em> section) has
an excellent explanation of reparameterization, and even some practical tips
to help you do it (although your mileage may vary on how useful those tips
will be to you).</p>
</li>
<li>
<p>Aside from meekly pointing to other resources, there’s not much I can do to
help: this stuff really comes from a combination of intuition, statistical
knowledge and good ol’ experience. I can, however, cite some examples to give
you a better idea.</p>
<ul>
<li>The noncentered parameterization is a classic example. If you have a
parameter whose mean and variance you are also modelling, the noncentered
parameterization decouples the sampling of mean and variance from the
sampling of the parameter, so that they are now independent. In this way, we
avoid “funnels”.</li>
<li>The <a href="http://proceedings.mlr.press/v5/carvalho09a.html"><em>horseshoe
distribution</em></a> is known to
be a good shrinkage prior, as it is <em>very</em> spikey near zero, and has <em>very</em>
long tails. However, modelling it using one parameter can give multimodal
posteriors — an exceptionally bad result. The trick is to reparameterize and
model it as the product of two parameters: one to create spikiness at zero,
and one to create long tails (which makes sense: to sample from the
horseshoe, take the product of samples from a normal and a half-Cauchy).</li>
</ul>
</li>
</ul>
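<p>To make the horseshoe trick concrete, here’s an illustrative NumPy sketch (not PyMC3 code): drawing the product of a half-Cauchy scale and a standard normal gives samples that are spiky at zero yet have very long tails.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Reparameterized horseshoe sample: the half-Cauchy factor supplies the
# long tails, the standard normal factor supplies the spike at zero.
lam = np.abs(rng.standard_cauchy(size=n))  # half-Cauchy local scales
z = rng.normal(0.0, 1.0, size=n)           # standard normal
horseshoe = lam * z

# Most of the mass hugs zero, but occasional draws are enormous
print(np.median(np.abs(horseshoe)))
print(np.abs(horseshoe).max())
```

<p>In a model, you’d declare <code class="highlighter-rouge">lam</code> and <code class="highlighter-rouge">z</code> as separate random variables and take their product in a deterministic node, so the sampler never has to navigate the multimodal single-parameter version.</p>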
<h2 id="model-diagnostics">Model Diagnostics</h2>
<ul>
<li>Admittedly the distinction between the previous section and this one is
somewhat artificial (since problems with your chains indicate problems with
your model), but I still think it’s useful to make this distinction because
these checks indicate that you’re thinking about your data in the wrong way,
(i.e. you made a poor modelling decision), and <em>not</em> that the sampler is having
a hard time doing its job.</li>
</ul>
<ol>
<li>
<p>Run the following snippet of code to inspect the pairplot of your variables,
one pair at a time (if you have a plate of variables, it’s fine to pick a couple
at random). It’ll tell you if the two random variables are correlated, and
help identify any troublesome neighborhoods in the parameter space (divergent
samples will be colored differently, and will cluster near such
neighborhoods).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pm</span><span class="o">.</span><span class="n">pairplot</span><span class="p">(</span><span class="n">trace</span><span class="p">,</span>
<span class="n">sub_varnames</span><span class="o">=</span><span class="p">[</span><span class="s">'variable_1'</span><span class="p">,</span> <span class="s">'variable_2'</span><span class="p">],</span>
<span class="n">divergences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s">'C3'</span><span class="p">,</span>
<span class="n">kwargs_divergence</span><span class="o">=</span><span class="p">{</span><span class="s">'color'</span><span class="p">:</span> <span class="s">'C2'</span><span class="p">})</span>
</code></pre></div> </div>
</li>
<li>
<p>Look at your posteriors (either from the traceplot, density plots or
posterior plots). Do they even make sense? E.g. are there outliers or long
tails that you weren’t expecting? Do their uncertainties look reasonable to
you? If you had <a href="https://en.wikipedia.org/wiki/Plate_notation">a plate</a> of
variables, are their posteriors different? Did you expect them to be that
way? If not, what about the data made the posteriors different? You’re the
only one who knows your problem/use case, so the posteriors better look good
to you!</p>
</li>
<li>Broadly speaking, there are four kinds of bad geometries that your posterior
can suffer from:
<ul>
<li>highly correlated posteriors: this will probably cause divergences or
traces that don’t look like “fuzzy caterpillars”. Either look at the
jointplots of each pair of variables, or look at the correlation matrix of
all variables. Try using a centered parameterization, or reparameterize in
some other way, to remove these correlations.</li>
<li>posteriors that form “funnels”: this will probably cause divergences. Try
using a noncentered parameterization.</li>
<li>long tailed posteriors: this will probably raise warnings about
<code class="highlighter-rouge">max_treedepth</code> being exceeded. If your data has long tails, you should
model that with a long-tailed distribution. If your data doesn’t have long
tails, then your model is ill-specified: perhaps a more informative prior
would help.</li>
<li>multimodal posteriors: right now this is pretty much a death blow. At the
time of writing, all samplers have a hard time with multimodality, and
there’s not much you can do about that. Try reparameterizing to get a
unimodal posterior. If that’s not possible (perhaps you’re <em>modelling</em>
multimodality using a mixture model), you’re out of luck: just let NUTS
sample for a day or so, and hopefully you’ll get a good trace.</li>
</ul>
</li>
<li>
<p>Pick a small subset of your raw data, and see what exactly your model does
with that data (i.e. run the model on a specific subset of your data). I find
that a lot of problems with your model can be found this way.</p>
</li>
<li>Run <a href="https://docs.pymc.io/notebooks/posterior_predictive.html"><em>posterior predictive
checks</em></a> (a.k.a.
PPCs): sample from your posterior, plug it back in to your model, and
“generate new data sets”. PyMC3 even has a nice function to do all this for
you: <code class="highlighter-rouge">pm.sample_ppc</code>. But what do you do with these new data sets? That’s a
question only you can answer! The point of a PPC is to see if the generated
data sets reproduce patterns you care about in the observed real data set,
and only you know what patterns you care about. E.g. how close are the PPC
means to the observed sample mean? What about the variance?
<ul>
<li>For example, suppose you were modelling the levels of radon gas in
different counties in a country (through a hierarchical model). Then you
could sample radon gas levels from the posterior for each county, and take
the maximum within each county. You’d then have a distribution of maximum
radon gas levels across counties. You could then check if the <em>actual</em>
maximum radon gas level (in your observed data set) is acceptably within
that distribution. If it’s much larger than the maxima, then you would know
that the actual likelihood has longer tails than you assumed (e.g. perhaps
you should use a Student’s T instead of a normal?)</li>
<li>Remember that how well the posterior predictive distribution fits the data
is of little consequence (e.g. the expectation that 90% of the data should
fall within the 90% credible interval of the posterior). The posterior
predictive distribution tells you what values for data you would expect if
we were to remeasure, given that you’ve already observed the data you did.
As such, it’s informed by your prior as well as your data, and it’s not
its job to adequately fit your data!</li>
</ul>
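<p>Here’s a self-contained NumPy mock-up of that radon-style check. The data and the “posterior draws” are faked so that the example runs on its own; in practice you’d get real draws from <code class="highlighter-rouge">pm.sample_ppc</code>:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed data: heavier-tailed than a normal (like long-tailed radon levels)
observed = rng.standard_t(df=3, size=200)

# Pretend these are posterior draws of (mu, sigma) from a normal model
mu_draws = rng.normal(observed.mean(), 0.1, size=1000)
sd_draws = np.abs(rng.normal(observed.std(), 0.1, size=1000))

# One replicated data set per posterior draw; record each replication's maximum
ppc_maxima = np.array([
    rng.normal(mu, sd, size=observed.size).max()
    for mu, sd in zip(mu_draws, sd_draws)
])

# Where does the observed maximum fall in the PPC distribution of maxima?
p_value = (ppc_maxima >= observed.max()).mean()
print(observed.max(), ppc_maxima.mean(), p_value)
```

<p>If <code class="highlighter-rouge">p_value</code> is tiny, the observed maximum sits far out in the tail of what the model can generate, which is exactly the hint that a longer-tailed likelihood (e.g. a Student’s T) is called for.</p>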
</li>
</ol>
<p><em>George Ho. This is a compilation of notes, tips, tricks and recipes for Bayesian modelling that I've collected from everywhere: papers, documentation, peppering my more experienced colleagues with questions.</em></p>
<h1>Understanding Hate Speech on Reddit through Text Clustering</h1>
<p><em>2018-03-18 | <a href="https://eigenfoo.xyz/reddit-clusters">https://eigenfoo.xyz/reddit-clusters</a></em></p>
<blockquote>
<p>Note: the following article contains several examples of hate speech
(including but not limited to racist, misogynistic and homophobic views).</p>
</blockquote>
<p>Have you heard of <code class="highlighter-rouge">/r/TheRedPill</code>? It’s an online forum (a subreddit, but I’ll
explain that later) where people (usually men) espouse an ideology predicated
entirely on gender. “Swallowers of the red pill”, as they call themselves,
maintain that it is <em>men</em>, not women, who are socially marginalized; that feminism
is something between a damaging ideology and a symptom of societal retardation;
that the patriarchy should actively assert its dominance over female
compatriots.</p>
<p>Despite being shunned by the world (or perhaps, because of it), <code class="highlighter-rouge">/r/TheRedPill</code>
has grown into a sizable community and evolved its own slang, language and
culture. Let me give you an example.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cluster #14:
Cluster importance: 0.0489376285127
shit: 2.433590
test: 1.069885
frame: 0.396684
pass: 0.204953
bitch: 0.163619
</code></pre></div></div>
<p>This is a snippet from a text clustering of <code class="highlighter-rouge">/r/TheRedPill</code> — you don’t really
need to understand the details right now: all you need to know is that each
cluster is simply a bunch of words that frequently appear together in Reddit
posts and comments. Following each word is a number indicating its importance in
the cluster, and on line 2 is the importance of this cluster to the subreddit
overall.</p>
<p>As it turns out, this cluster has picked up on a very specific meme on
<code class="highlighter-rouge">/r/TheRedPill</code>: the concept of the <em>shit test</em>, and how your frame can <em>pass</em> the
<em>shit tests</em> that life (but predominantly, <em>bitches</em>) can throw at you.</p>
<p>There’s absolutely no way I could explain this stuff better than the swallowers
of the red pill themselves, so I’ll just quote from a post on <code class="highlighter-rouge">/r/TheRedPill</code> and
a related blog.</p>
<p>The concept of the shit test is very broad:</p>
<blockquote>
<p>… when somebody “gives you shit” and fucks around with your head to see how
you will react, what you are experiencing is typically a (series of) shit
test(s).</p>
</blockquote>
<p>A shit test is designed to test your temperament, or more colloquially,
<em>“determine your frame”</em>.</p>
<blockquote>
<p>Frame is a concept which essentially means “composure and self-control”.</p>
<p>… if you can keep composure/seem unfazed and/or assert your boundaries
despite a shit test, generally speaking you will be considered to have passed
the shit test. If you get upset, offended, doubt yourself or show weakness in
any discernible way when shit tested, it will be generally considered that you
failed the test.</p>
</blockquote>
<p>Finally, not only do shit tests test your frame, but they also serve a specific, critical social function:</p>
<blockquote>
<p>When it comes right down to it shit tests are typically women’s way of
flirting.</p>
<p>… Those who “pass” show they can handle the woman’s BS and is “on her
level”, so to speak. This is where the evolutionary theory comes into play:
you’re demonstrating her faux negativity doesn’t phase you [sic] and that
you’re an emotionally developed person who isn’t going to melt down at the
first sign of trouble. Ergo you’ll be able to protect her when threats to
her safety emerge.</p>
</blockquote>
<p>If you want to learn more, I took all the above quotes from
<a href="https://www.reddit.com/r/TheRedPill/comments/22qnmk/newbies_read_this_the_definitive_guide_to_shit/">here</a>
and <a href="https://illimitablemen.com/2014/12/14/the-shit-test-encyclopedia/">here</a>:
feel free to toss yourself down that rabbit hole (but you may want to open those
links in Incognito mode).</p>
<p>Clearly though, the cluster did a good job of identifying one topic of
discussion on <code class="highlighter-rouge">/r/TheRedPill</code>. In fact, not only can clustering pick up on a
general topic of conversation, but also on specific memes, motifs and vocabulary
associated with it.</p>
<p>Interested? Read on! I’ll explain what I did, and describe some of my other
results.</p>
<hr />
<p>Reddit is — well, it’s pretty hard to describe what Reddit <em>is</em>, mainly because
Reddit comprises several thousand communities, called <em>subreddits</em>, which center
around topics broad (<code class="highlighter-rouge">/r/Sports</code>) and niche (<code class="highlighter-rouge">/r/thinkpad</code>), delightful
(<code class="highlighter-rouge">/r/aww</code>) and unsavory (<code class="highlighter-rouge">/r/Incels</code>).</p>
<p>Each subreddit is a unique community with its own rules, culture and standards.
Some are welcoming and inclusive, and anyone can post and comment; others, not
so much: you must be invited to even read their front page. Some have pliant
standards about what is acceptable as a post; others have moderators willing to
remove posts and ban users upon any infraction of community guidelines.</p>
<p>Whatever Reddit is though, two things are for certain:</p>
<ol>
<li>
<p>It’s widely used. <em>Very</em> widely used. At the time of writing, it’s the <a href="https://www.alexa.com/topsites/countries/US">fourth
most popular website in the United
States</a> and the <a href="https://www.alexa.com/topsites">sixth most popular
globally</a>.</p>
</li>
<li>
<p>Where there is free speech, there is hate speech. Reddit’s hate speech
problem is <a href="https://www.wired.com/2015/08/reddit-mods-handle-hate-speech/">well
documented</a>,
the <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">center of recent
controversy</a>,
and even <a href="https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/">the subject of statistical
analysis</a>.</p>
</li>
</ol>
<p>Now, there are many well-known hateful subreddits. The three that I decided to
focus on were <code class="highlighter-rouge">/r/TheRedPill</code>, <code class="highlighter-rouge">/r/The_Donald</code>, and <code class="highlighter-rouge">/r/CringeAnarchy</code>.</p>
<p>The goal here is to understand what these subreddits are like, and expose their
culture for people to see. To quote <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">Steve Huffman, Reddit’s
CEO</a>:</p>
<blockquote>
<p>“I believe the best defense against racism and other repugnant views, both
on Reddit and in the world, is instead of trying to control what people
can and cannot say through rules, is to repudiate these views in a free
conversation, and empower our communities to do so on Reddit.”</p>
</blockquote>
<p>And there’s no way we can refute and repudiate these deplorable views without
knowing what those views are. And instead of spending hours on each of these
subreddits ourselves, let’s have a machine learn what gets talked about on these
subreddits.</p>
<hr />
<p>Now, how do we do this? This can be done using <em>clustering</em>, a machine learning
technique in which we’re given data points, and tasked with grouping them in
some way. A picture will explain better than words:</p>
<figure>
<a href="https://eigenfoo.xyz/assets/images/clusters.png"><img src="https://eigenfoo.xyz/assets/images/clusters.png" /></a>
<figcaption>Clustering.</figcaption>
</figure>
<p>The clustering algorithm was hard to decide on. After several dead ends were
explored, I settled on non-negative matrix factorization of the document-term
matrix, featurized using tf-idfs. I don’t really want to go into the technical
details now: suffice to say that this technique is <a href="http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html">known to work well for this
application</a>
(perhaps I’ll write another piece on this in the future).</p>
<p>Finally, we need our data points: <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments">Google
BigQuery</a>
has all posts and comments across all of Reddit, from the beginning of
Reddit right up until the end of 2017. We decided to focus on the last two
months for which there is data: November and December, 2017.</p>
<p>I could talk at length about the technical details, but right now, I want to
focus on the results of the clustering. What follows are two hand-picked
clusters from each of the three subreddits, visualized as word clouds (you can
think of word clouds as visual representations of the code snippet above), as
well as an example comment from each of the clusters.</p>
<h2 id="rtheredpill"><code class="highlighter-rouge">/r/TheRedPill</code></h2>
<p>You already know <code class="highlighter-rouge">/r/TheRedPill</code>, so let me describe the clusters in more detail:
a good number of them are about sex, or about how to approach girls. Comments in
these clusters tend to give advice on how to pick up girls, or describe the
social/sexual exploits of the commenter.</p>
<p>What is interesting is that, as sex-obsessed as <code class="highlighter-rouge">/r/TheRedPill</code> is, many
swallowers (of the red pill) profess that sex is <em>not</em> the purpose of the
subreddit: the point is to become an “alpha male”. Even more interesting,
there is more talk about what an alpha male <em>is</em>, and what kind of people
<em>aren’t</em> alpha, than there is about how people can <em>become</em> alpha. This is the
first cluster shown below, and comprises around 3% of all text on
<code class="highlighter-rouge">/r/TheRedPill</code>.</p>
<p>The second cluster comprises around 6% of all text on <code class="highlighter-rouge">/r/TheRedPill</code>, and
contains comments that expound theories on the role of men, women and feminism
in today’s society (it isn’t pretty). Personally, the most repugnant views that
I’ve read are to be found in this cluster.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I feel like the over dramatization of beta qualities in media/pop culture is due
to the fact that anyone representing these qualities is already Alpha by
default.
The actors who play the white knight lead roles, the rock stars that sing about
pining for some chick… these men/characters are already very Alpha in both looks
and status, so when beta BS comes from their mouths, it’s seen as attractive
because it balances out their already alpha state into that "mostly alpha but
some beta" balance that makes women swoon.
…
</code></pre></div></div>
<figure class="half">
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/13_3.21%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/13_3.21%25.png" /></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/06_6.41%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/06_6.41%25.png" /></a>
<figcaption>Wordclouds from /r/TheRedPill.</figcaption>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>…
Since the dawn of humanity men were always in control, held all the power and
women were happy because of it. But now men are forced to lose their masculinity
and power or else they'll be killed/punished by other pussy men with big guns
and laws who believe feminism is the right path for humanity.
…
Feminism is really a blessing in disguise because it's a wake up call for men
and a hidden cry for help from women for men to regain their masculinity,
integrity and control over women.
…
</code></pre></div></div>
<h2 id="rthe_donald"><code class="highlighter-rouge">/r/The_Donald</code></h2>
<p>You may have already heard of <code class="highlighter-rouge">/r/The_Donald</code> (a.k.a. the “pro-Trump cesspool”),
famed for their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Conflict_with_Reddit_management">takeover of the Reddit front
page</a>,
and their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Controversies">involvement in several recent
controversies</a>. It
may therefore be surprising to learn that there is an iota of lucid discussion
that goes on, although in a jeering, bullying tone.</p>
<p><code class="highlighter-rouge">/r/The_Donald</code> is the subreddit which has developed the most language and inside
jokes: from “nimble navigators” to “swamp creatures”, “spezzes” to the
“Trumpire”… Explaining these memes would take too long: reach out, or Google, if
you really want to know.</p>
<p>The first cluster accounts for 5% of all text on <code class="highlighter-rouge">/r/The_Donald</code>, and contains
(relatively) coherent arguments both for and against net neutrality. The second
cluster accounts for 1% of all text on <code class="highlighter-rouge">/r/The_Donald</code>, and is actually from
the subreddit’s <code class="highlighter-rouge">MAGABrickBot</code>, which is a bot that keeps count of how many times
the word “brick” has been used in comments, by automatically generating this
comment.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>So much misinformation perpetuated by the Swamp... Abolishing Net Neutrality
would benefit swamp creatures with corporate payouts but would be most damaging
to conservatives long term.
Net Neutrality was NOT created by Obama, it was actually in effect from the very
beginning...
</code></pre></div></div>
<figure class="half">
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/00_5.19%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/00_5.19%25.png" /></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/02_1.26%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/02_1.26%25.png" /></a>
<figcaption>Wordclouds from /r/The_Donald.</figcaption>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**FOR THE LOVE OF GOD GET THIS PATRIOT A BRICK! THAT'S 92278 BRICKS HANDED
OUT!**
We are at **14.3173880911%** of our goal to **BUILD THE WALL** starting from Imperial
Beach, CA to Brownsville, Texas! Lets make sure everyone gets a brick in the
United States! For every Centipede a brick, for every brick a Centipede!
At this rate, the wall will be **1071.35224786 MILES WIDE** and **353.552300867 FEET
HIGH** by tomorrow! **DO YOUR PART!**
</code></pre></div></div>
<h2 id="rcringeanarchy"><code class="highlighter-rouge">/r/CringeAnarchy</code></h2>
<p>On the Internet, <em>cringe</em> is the second-hand embarrassment you feel when someone
acts extremely awkwardly or uncomfortably. And on <code class="highlighter-rouge">/r/CringeAnarchy</code> you can find
memes about the <em>real</em> cringe, which is, um, liberals and anyone else who
advocates for an inclusionary, equitable ideology. Their morally grey jokes run
the gamut of delicate topics: gender, race, sexuality, nationality…</p>
<p>In some respects, the clustering provided very little insight into this
subreddit: each such delicate topic had one or two clusters, and there’s nothing
really remarkable about any of them. This speaks to the inherent difficulty of
training a topic model on memes: I rant at greater length about this topic on
<a href="https://eigenfoo.xyz/lda-sucks/">one of my blog posts</a>.</p>
<p>Both clusters below comprise around 3% of text on <code class="highlighter-rouge">/r/CringeAnarchy</code>: one is to do
with race, and the other is to do with homosexuality.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Has anyone here, non-black or otherwise, ever wished someone felt sorry for
being black? Maybe it's just where I live... the majority is black. It's
whatever.
</code></pre></div></div>
<figure class="half">
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/08_3.10%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/08_3.10%25.png" /></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/12_2.92%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/12_2.92%25.png" /></a>
<figcaption>Wordclouds from /r/CringeAnarchy.</figcaption>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>…
Also, the distinction between bisexual and gay is academic. If you do a gay
thing, you have done a gay thing. That's what "being gay" means to a LOT of
people. Redefining it is as useful as all the other things SJWs are redefining.
</code></pre></div></div>
<hr />
<p>As much information as that might have been, this was just a glimpse into what
these subreddits are like: I made 20 clusters for each subreddit, and you could
argue that (for somewhat technical reasons) 20 clusters isn’t even enough!
Moreover, there is just no way I could distill everything I learned about these
communities into one Medium story: I’ve curated just the more remarkable or
provocative results to put here.</p>
<p>If you still have the stomach for this stuff, scroll through the complete log
files
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/clustering/results">here</a>,
or look through images of the word clouds
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/wordclouds/images">here</a>.</p>
<p>Finally, as has been said before, “Talk is cheap. Show me the code.” For
everything I’ve written to make these clusters, check out <a href="https://github.com/eigenfoo/reddit-clusters">this GitHub
repository</a>.</p>
<hr />
<p><strong>EDIT (11-08-2018):</strong> If you’re interested in the technical, data science side
of the project, check out the slide deck and speaker notes from <a href="https://eigenfoo.xyz/reddit-slides/">my recent
talk</a> on exactly that!</p>
<hr />
<p><em>This post was originally published on Medium on May 18, 2018: I have since
<a href="https://medium.com/@nikitonsky/medium-is-a-poor-choice-for-blogging-bb0048d19133">migrated away from
Medium</a>
and <a href="https://bts.nomadgate.com/medium-evergreen-content">deleted my account</a> and
<a href="https://www.joshjahans.com/ditching-medium/">all my stories</a>.</em></p>
<p><em>This post was also reprinted in the inaugural issue of The Cooper Union’s
<a href="https://www.facebook.com/theunionjournal/">UNION Journal</a>.</em></p>George HoA recent project on trying to model hate speech on Reddit through text clustering — from 'nimble navigators' to 'swamp creatures', 'spezzes' to the 'Trumpire'.Why Latent Dirichlet Allocation Sucks2018-03-06T00:00:00+00:002018-03-06T00:00:00+00:00https://eigenfoo.xyz/lda-sucks<p>As I learn more and more about data science and machine learning, I’ve noticed
that a lot of resources out there go something like this:</p>
<blockquote>
<p>Check out this thing! It’s great at this task! The important task! The one
that was impossible/hard to do before! Look how well it does! So good! So
fast!</p>
<p>Take this! It’s our algorithm/code/paper! We used it to do the thing! And now
you can do the thing too!</p>
</blockquote>
<p>Jokes aside, I do think it’s true that a lot of research and resources focus on
what things <em>can</em> do, or what things are <em>good</em> at doing. Whenever I actually
implement the hyped-up “thing”, I’m invariably frustrated when it doesn’t
perform as well as originally described.</p>
<p>Maybe I’m not smart enough to see this, but after I learn about a new technique
or tool or model, it’s not immediately obvious to me when <em>not</em> to use it. I
think it would be very helpful to learn what things <em>aren’t</em> good at doing, or
why things just plain <em>suck</em> at times. Doing so not only helps you understand
the technique/tool/model better, but also sharpens your understanding of your
use case and the task at hand: what is it about your application that makes it
unsuitable for such a technique?</p>
<p>Which is why I’m writing the first of what will (hopefully) be a series of posts
on <em>“Why [Thing] Sucks”</em>. The title is provocative but reductive: a better name
might be <em>When and Why [Thing] Might Suck</em>… but that doesn’t have quite the
same ring to it! In these articles I’ll be outlining what I tried and why it
didn’t work: documenting my failures and doing a quick post-mortem, if you will.
My hope is that this will be useful to anyone else trying to do the same thing
I’m doing.</p>
<hr />
<p>So first up: topic modelling. Specifically, <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">latent Dirichlet
allocation</a>, or LDA
for short (not to be confused with <a href="https://eigenfoo.xyz/lda/">the other
LDA</a>, which I wrote a blog post about before).</p>
<p>If you’ve already encountered LDA and have seen <a href="https://en.wikipedia.org/wiki/Plate_notation">plate
notation</a> before, this picture
will probably refresh your memory:</p>
<p><a title="By Bkkbrad [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Latent_Dirichlet_allocation.svg"><img width="512" alt="Latent Dirichlet allocation" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Latent_Dirichlet_allocation.svg/512px-Latent_Dirichlet_allocation.svg.png" /></a></p>
<p>If you don’t know what LDA is, fret not, for there is
<a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">no</a>
<a href="http://obphio.us/pdfs/lda_tutorial.pdf">shortage</a>
<a href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/">of</a>
<a href="https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html">resources</a>
<a href="http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">about</a>
<a href="https://radimrehurek.com/gensim/models/ldamodel.html">this</a>
<a href="https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation">stuff</a>.
I’m going to move on to when and why LDA isn’t the best idea.</p>
<p><strong>tl;dr:</strong> <em>LDA and topic modelling doesn’t work well with a) short documents,
in which there isn’t much text to model, or b) documents that don’t coherently
discuss a single topic.</em></p>
<p>Wait, what? Did George just say that topic modelling sucks when there’s not much
topic, and not much text to model? Isn’t that obvious?</p>
<p><em>Yes! Exactly!</em> Of course it’s <a href="https://en.wikipedia.org/wiki/Egg_of_Columbus">obvious in
retrospect</a>! Which is why I was
so upset when I realized I spent two whole weeks faffing around with LDA when
topic models were the opposite of what I needed, and so frustrated that more
people aren’t talking about when <em>not</em> to use/do certain things.</p>
<p>But anyways, <code class="highlighter-rouge"><\rant></code> and let’s move on to why I say what I’m saying.</p>
<p>Recently, I’ve taken up a project in modelling the textual data on Reddit using
NLP techniques. There are, of course, many ways one could take this, but
something I was interested in was finding similarities between subreddits,
clustering comments, and visualizing these clusters somehow: what does Reddit
talk about on average? Of course, I turned to topic modelling and dimensionality
reduction.</p>
<p>The techniques that I came across first were LDA (<a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">latent Dirichlet
allocation</a>) and
t-SNE (<a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed stochastic neighbor
embedding</a>).
Both techniques are well known and well documented, but I can’t say that using
them together is a popular combination. However, there have been
some successes. For instance, Shuai had some success with this method <a href="https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html">when
using it on the 20 newsgroups
dataset</a>;
some work done by Kagglers has <a href="https://www.kaggle.com/ykhorramz/lda-and-t-sne-interactive-visualization">yielded reasonable
results</a>,
and <a href="https://stats.stackexchange.com/questions/305356/plot-latent-dirichlet-allocation-output-using-t-sne">the StackExchange community doesn’t think it’s a ridiculous
idea</a>.</p>
<p>The dataset that I applied this technique to was the <a href="bigquery.cloud.google.com/dataset/fh-bigquery:reddit">Reddit dataset on Google
BigQuery</a>, which contains
data on all subreddits, posts and comments for as long as Reddit has been around.
I limited myself to the top 10 most active subreddits in December 2017 (the most
recent month for which we have data, at the time of writing), and chose 20 to be
the number of topics to model (any choice is as arbitrary as any other).</p>
<p>I ran LDA and t-SNE exactly as Shuai described on <a href="https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html">this blog
post</a>,
except using the great <a href="https://radimrehurek.com/gensim/"><code class="highlighter-rouge">gensim</code></a> library to
perform LDA, which was built with large corpora and efficient online algorithms
in mind. (Specifically, <code class="highlighter-rouge">gensim</code> implements online variational inference with
the EM algorithm, instead of using MCMC-based algorithms, which <code class="highlighter-rouge">lda</code> does. It
seems that variational Bayes scales better to very large corpora than collapsed
Gibbs sampling.)</p>
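<p>For concreteness, here is a minimal sketch of that pipeline. The actual experiments used <code class="highlighter-rouge">gensim</code>; this toy version uses scikit-learn’s <code class="highlighter-rouge">LatentDirichletAllocation</code> (which also implements online variational Bayes) so that it is self-contained. The corpus, topic count and perplexity below are placeholders, not the values used on the Reddit data.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

docs = [
    "the team won the game last season",
    "the election results surprised the party",
    "the film had a great story and cast",
] * 10  # toy stand-in for the Reddit comments

# Bag-of-words counts, with English stopwords stripped.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA and get one topic-distribution vector per document.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Project the topic vectors down to 2-D with t-SNE for plotting.
xy = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(doc_topics)
print(xy.shape)  # (30, 2)
```

<p>Each row of <code class="highlighter-rouge">xy</code> is one comment’s position in the scatterplot.</p>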
<p>Here are the results:</p>
<figure>
<a href="/assets/images/lda-sucks.png"><img style="float: middle" width="600" height="600" src="/assets/images/lda-sucks.png" /></a>
</figure>
<p>Horrible, right? Nowhere near the well-separated clusters that Shuai got with
the 20 newsgroups. In fact, the tiny little huddles of around 5 to 10 comments
are probably artifacts of the dimensionality reduction done by t-SNE, so those
might even just be noise! You might say that there are at least 3 very large
clusters, but even that’s bad news! If they’re clustered together, you would
hope that they have the same topics, and that’s definitely not the case here!
These large clusters tell us that a lot of comments have roughly the same topic
distribution (i.e. they’re close to each other in the high-dimensional
topic-space), but their dominant topics (i.e. the topic with greatest
probability) don’t end up being the same.</p>
<p>By the way, t-SNE turns out to be <a href="https://distill.pub/2016/misread-tsne/">a really devious dimensionality reduction
technique</a>, and you really need to
experiment with the perplexity values in order to use it properly. I used the
default <code class="highlighter-rouge">perplexity=30</code> from sklearn for the previous plot, but I repeated the
visualizations for multiple other values and the results aren’t so hot either.
You can check out the results <a href="https://www.flickr.com/photos/155778261@N04/albums/72157694226050095">on my
Flickr</a>.
Note that I did these on a random subsample of 1000 comments, so as to reduce
compute time.</p>
<figure class="half">
<a href="/assets/images/perplexity50.png"><img src="/assets/images/perplexity50.png" /></a>
<a href="/assets/images/perplexity100.png"><img src="/assets/images/perplexity100.png" /></a>
<figcaption>t-SNE with perplexity values of 50 and 100, respectively.</figcaption>
</figure>
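<p>Repeating the embedding across perplexity values is only a few lines. The sketch below runs on random stand-in data, since the point here is the sweep rather than the corpus:</p>

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the document-topic matrix: 200 comments, 20 topics.
doc_topics = rng.dirichlet(np.ones(20), size=200)

# One plot at the default perplexity is never enough: re-run the
# embedding across a range of values and compare the pictures.
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(doc_topics)
    for p in (5, 30, 50, 100)
}
```

<p>(Note that scikit-learn requires the perplexity to be smaller than the number of samples.)</p>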
<p>So, what went wrong? There’s a <a href="https://stackoverflow.com/questions/29786985/whats-the-disadvantage-of-lda-for-short-texts">nice StackOverflow
post</a>
that describes the problem well.</p>
<p>Firstly, latent Dirichlet allocation and other probabilistic topic models are
very complex and flexible. While this means that they have very high variance
and low bias, it also means that they need a lot of data (or data with a decent
signal-to-noise ratio) for them to learn anything meaningful. Particularly for
LDA, which infers topics on a document-by-document basis, if there aren’t enough
words in a document, there simply isn’t enough data to infer a reliable topic
distribution for that document.</p>
<p>Secondly, Reddit comments are by their nature very short and very
context-dependent, since they respond to a post or another comment. So not only are
Reddit comments just short: it’s actually worse than that! They don’t even
discuss a certain topic coherently (by which I mean, they don’t necessarily use
words that pertain to what they’re talking about). I’ll give an example:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"I'm basing my knowledge on the fact that I watched the fucking rock fall."
</code></pre></div></div>
<p>Now, stopwords compose a little less than half of this comment, and they would
be stripped before LDA even looks at it. But that aside, what is this comment
about? What does the rock falling mean? What knowledge is this user claiming?
It’s a very confusing comment, but probably made complete sense in the context
of the post it responded to and the comments that came before it. As it is,
however, it’s impossible for <em>me</em> to figure out what topic this comment is about,
let alone an algorithm!</p>
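<p>To make the point concrete, here is that comment run through a standard stopword filter (scikit-learn’s built-in English list; <code class="highlighter-rouge">gensim</code> and NLTK ship similar lists):</p>

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

comment = "I'm basing my knowledge on the fact that I watched the fucking rock fall."

tokens = re.findall(r"[a-z]+", comment.lower())
kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]

# Roughly half the tokens are stopwords; only the bare content words
# reach the topic model, with all of their context already gone.
print(f"{len(kept)}/{len(tokens)} tokens survive: {kept}")
```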
<p>Also, just to drive the point home, here are the top 10 words in each of the 20
topics that LDA came up with, on the same dataset as before:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Topic #0:
got just time day like went friend told didn kids
Topic #1:
just gt people say right doesn know law like government
Topic #2:
removed com https www https www tax money http watch news
Topic #3:
people don just like think really good know want things
Topic #4:
years time did great ago ve just work life damn
Topic #5:
movie like love just really school star movies film story
Topic #6:
like just fucking shit head car looks new makes going
Topic #7:
game team season year good win play teams playing best
Topic #8:
right thing yeah don think use internet ok water case
Topic #9:
going like work just need way want money free fuck
Topic #10:
better just play games make ve ll seen lol fun
Topic #11:
like don know did feel shit big man didn guys
Topic #12:
deleted fuck guy year old man amp year old state lmao
Topic #13:
sure believe trump wrong saying comment post mueller evidence gt
Topic #14:
gt yes https com good oh wikipedia org en wiki
Topic #15:
think like good 10 look point lebron just pretty net
Topic #16:
gt said fucking american agree trump thanks obama states did
Topic #17:
trump vote party republicans election moore president republican democrats won
Topic #18:
war world country israel countries china military like happy does
Topic #19:
reddit message askreddit post questions com reddit com subreddit compose message compose
</code></pre></div></div>
<p>Now, it’s not entirely bad: topic 2 seems like it’s collecting the tokens from links
(I didn’t stopword those out, oops), topic 7 looks like it’s about football or
some other sport, 13 is probably about American politics, and 18 looks like
it’s about world news, etc.</p>
<p>But almost all other topics are just collections of words: it’s not immediately
obvious to me what each topic represents.</p>
<p>So yeah, there you have it, LDA really sucks sometimes.</p>
<hr />
<p><strong>EDIT (8/12/2018):</strong> In retrospect, I think that this whole blog post is
summarized well in the following tweet thread. Clustering algorithms will give
you clusters because that’s what they do, not because there actually <em>are</em>
clusters. In this case, extremely short and context-dependent documents make it
hard to justify that there are topic clusters in the first place.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Algorithms that have to report something will always report something, even if it's a bad idea. Please do not use these algorithms unless you have principled reasons why there should be something. <a href="https://t.co/kzxZiuBfmm">https://t.co/kzxZiuBfmm</a></p>— \mathfrak{Michael Betancourt} (@betanalpha) <a href="https://twitter.com/betanalpha/status/1026619046626828288?ref_src=twsrc%5Etfw">August 7, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>George HoLatent Dirichlet allocation is a well-known and popular model in machine learning and natural language processing, but it really sucks sometimes. Here's why.~~Fruit~~ Loops and Learning - The LUPI Paradigm and SVM+2018-01-30T00:00:00+00:002018-01-30T00:00:00+00:00https://eigenfoo.xyz/lupi<p>Here’s a short story you might know: you have a black box, whose name is
<em>Machine Learning Algorithm</em>. It’s got two modes: training mode and testing
mode. You set it to training mode, and throw in a lot (sometimes <em>a lot</em> a lot)
of ordered pairs <script type="math/tex">(x_i, y_i), 1 \leq i \leq l</script>. Here, the <script type="math/tex">x_i</script> are called
the <em>examples</em> and the <script type="math/tex">y_i</script> are called the <em>targets</em>. Then, you set it to
testing mode and throw in some more examples, for which you don’t have the
corresponding targets. You hope the <script type="math/tex">y_i</script>s that come out are in some sense
the “right” ones.</p>
<p>Generally speaking, this is a parable of <em>supervised learning</em>. However, Vapnik
(the inventor of the
<a href="https://en.wikipedia.org/wiki/Support_vector_machine">SVM</a>) recently described
a new way to think about machine learning
(<a href="http://www.engr.uconn.edu/~jinbo/doc/vladimir_newparadiam.pdf">here</a> and
<a href="http://jmlr.csail.mit.edu/papers/volume16/vapnik15b/vapnik15b.pdf">here</a>):
<em>learning using privileged information</em>, or <em>LUPI</em> for short.</p>
<p>This post is meant to introduce the LUPI paradigm of machine learning to
people who are generally familiar with supervised learning and SVMs, and are
interested in seeing the math and intuition behind both things extended to the
LUPI paradigm.</p>
<h2 id="what-is-lupi">What is LUPI?</h2>
<p>The main idea is that instead of two-tuples <script type="math/tex">(x_i, y_i)</script>, the black box is fed
three-tuples <script type="math/tex">(x_i, x_i^{*}, y_i)</script>, where the <script type="math/tex">x^{*}</script>s are the so-called
<em>privileged information</em> that is only available during training, and not during
testing. The hope is that this information will train the model to better
generalize during the testing phase.</p>
<p>Vapnik offers many examples in which LUPI can be applied in real life: in
bioinformatics and proteomics (where advanced biological models, which the
machine might not necessarily “understand”, serve as the privileged
information), in financial time series analysis (where future movements of the
time series are the unknown at prediction time, but are available
retrospectively), and in the classic MNIST dataset, where the images were
converted to a lower resolution, but each annotated with a “poetic description”
(which was available for the training data but not for the testing data).</p>
<p>Vapnik’s team ran tests on well-known datasets in all three application areas
and found that his newly-developed LUPI methods performed noticeably better than
classical SVMs in both convergence time (i.e. the number of examples necessary
to achieve a certain degree of accuracy) and estimation of a good predictor
function. In fact, Vapnik’s proof-of-concept experiments are so whacky that
they actually <a href="https://nautil.us/issue/6/secret-codes/teaching-me-softly">make for an entertaining read
</a>!</p>
<h2 id="classical-svms-separable-and-non-separable-case">Classical SVMs (separable and non-separable case)</h2>
<p>There are many ways of thinking about SVMs, but I think that the one that is
most instructive here is to think of them as solving the following optimization
problem:</p>
<blockquote>
<p>Minimize <script type="math/tex">\frac{1}{2} \|w\|^2</script></p>
<p>subject to <script type="math/tex">y_i [ w \cdot x_i + b ] \geq 1, \; 1 \leq i \leq l</script>.</p>
</blockquote>
<p>Basically all this is saying is that we want to find the hyperplane that
separates our data by the maximum margin. More technically speaking, this finds
the parameters (<script type="math/tex">w</script> and <script type="math/tex">b</script>) of the maximum margin hyperplane, with <script type="math/tex">l_2</script>
regularization.</p>
<p>In the non-separable case, we concede that our hyperplane may not classify all
examples perfectly (or that it may not be desirable to do so: think of
overfitting), and so we introduce a so-called <em>slack variable</em> <script type="math/tex">\xi_i \geq 0</script> for each example <script type="math/tex">i</script>, which measures the severity of misclassification of
that example. With that, the optimization becomes:</p>
<blockquote>
<p>Minimize <script type="math/tex">\frac{1}{2} \|w\|^2 + C\sum_{i=1}^{l}{\xi_i}</script></p>
<p>subject to <script type="math/tex">y_i [ w \cdot x_i + b ] \geq 1 - \xi_i, \; \xi_i \geq 0, 1
\leq i \leq l</script>.</p>
</blockquote>
<p>where <script type="math/tex">C</script> is some regularization parameter.</p>
<p>This says the same thing as the previous optimization problem, but now allows
points to be (a) classified properly (<script type="math/tex">\xi_i = 0</script>), (b) within the margin but
still classified properly (<script type="math/tex">% <![CDATA[
0 < \xi_i < 1 %]]></script>), or (c) misclassified
(<script type="math/tex">1 \leq \xi_i</script>).</p>
<p>In both the separable and non-separable cases, the decision rule is simply <script type="math/tex">\hat{y} = \text{sign}(w \cdot x + b)</script>.</p>
<p>An important thing to note is that, in the separable case, the SVM uses <script type="math/tex">l</script>
examples to estimate the <script type="math/tex">n</script> components of <script type="math/tex">w</script>, whereas in the nonseparable
case, the SVM uses <script type="math/tex">l</script> examples to estimate <script type="math/tex">n+l</script> parameters: the <script type="math/tex">n</script>
components of <script type="math/tex">w</script> and <script type="math/tex">l</script> values of slacks <script type="math/tex">\xi_i</script>. Thus, in the
non-separable case, the number of parameters to be estimated is always larger
than the number of examples: it does not matter here that most of the slacks may be
equal to zero: the SVM still has to estimate all of them.</p>
<p>The way both optimization problems are actually <em>solved</em> is fairly involved (they
require <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange
multipliers</a>), but in terms
of getting an intuitive feel for how SVMs work, I think that examining the
optimization problems suffices!</p>
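<p>As a sanity check on that intuition, here is a minimal soft-margin SVM fit on toy data, confirming that the decision rule really is just the sign of <script type="math/tex">w \cdot x + b</script> (the data and the choice of <script type="math/tex">C</script> here are arbitrary):</p>

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: a non-separable toy problem.
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# A linear-kernel SVC solves the soft-margin problem above.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

# The decision rule is nothing more than sign(w . x + b).
manual = np.sign(X @ w + b)
print((manual == clf.predict(X)).all())  # True
```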
<h2 id="what-is-svm">What is SVM+?</h2>
<p>In his paper introducing the LUPI paradigm, Vapnik outlines <em>SVM+</em>, a
modified form of the SVM that fits well into the LUPI paradigm, using privileged
information to improve performance. It should be emphasized that LUPI is a
paradigm - a way of thinking about machine learning - and not just a collection
of algorithms. SVM+ is just one technique that interoperates with the LUPI
paradigm.</p>
<p>The innovation of the SVM+ algorithm is that it uses the privileged information
to estimate the slack variables. Given the training three-tuple <script type="math/tex">(x, x^{*}, y)</script>, we map <script type="math/tex">x</script> to the feature space <script type="math/tex">Z</script>, and <script type="math/tex">x^{*}</script> to a separate feature
space <script type="math/tex">Z^{*}</script>. Then, the decision rule is <script type="math/tex">\hat{y} = \text{sign}(w \cdot x +
b)</script> and the slack variables are estimated by <script type="math/tex">\xi = w^{*} \cdot x^{*} + b^{*}</script>.</p>
<p>In order to find <script type="math/tex">w</script>, <script type="math/tex">b</script>, <script type="math/tex">w^{*}</script> and <script type="math/tex">b^{*}</script>, we solve the following
optimization problem:</p>
<blockquote>
<p>Minimize <script type="math/tex">\frac{1}{2} (\|w\|^2 + \gamma \|w^{*}\|^2) +
C \sum_{i=1}^{l}{(w^{*} \cdot x_i^{*} + b^{*})}</script></p>
<p>subject to <script type="math/tex">y_i [ w \cdot x_i + b ] \geq 1 - (w^{*} \cdot x^{*} + b^{*}),
\; (w^{*} \cdot x^{*} + b^{*}) \geq 0, 1 \leq i \leq l</script>.</p>
</blockquote>
<p>where <script type="math/tex">\gamma</script> indicates the extent to which the slack estimation should be
regularized in comparison to the SVM. Notice how this optimization problem is
essentially identical to the non-separable classical SVM, except the slacks
<script type="math/tex">\xi_i</script> are now estimated with <script type="math/tex">w^{*} \cdot x^{*} + b^{*}</script>.</p>
<p>Again, the method of actually solving this optimization problem involves
Lagrange multipliers and quadratic programming, but I think the intuition is
captured in the optimization problem statement.</p>
<h2 id="interpretation-of-svm">Interpretation of SVM+</h2>
<p>The SVM+ has a very ready interpretation. Instead of a single feature space, it
has two: one in which the non-privileged information lives (where decisions are
made), and one in which the privileged information lives (where slack variables
are estimated).</p>
<p>But what’s the point of this second feature space? How does it help us? Vapnik
terms this problem <em>knowledge transfer</em>: it’s all well and good for us to learn
from the privileged information, but it’s all for naught if we can’t use this
newfound knowledge in the test phase.</p>
<p>The way knowledge transfer is resolved here is by assuming that <em>examples in the
training set that are hard to separate in the privileged space, are also hard to
separate in the regular space</em>. Therefore, we can use the privileged information
to obtain an estimate for the slack variables.</p>
<p>Of course, SVMs are a technique with many possible interpretations, of which my
presentation (in terms of the optimization of <script type="math/tex">w</script> and <script type="math/tex">b</script>) is just one. For
example, it’s possible to think of SVMs in terms of kernels functions, or as
linear classifiers minimizing hinge loss. In all cases, it’s possible and
worthwhile to understand that interpretation of SVMs, and how the LUPI paradigm
contributes to or extends that interpretation. I’m hoping to write a piece later
to explain these exact topics.</p>
<p>Vapnik also puts a great emphasis on analyzing SVM+ based on its statistical
learning theoretic properties (in particular, analyzing its rate of convergence
via the <a href="https://en.wikipedia.org/wiki/VC_dimension">VC dimension</a>). Vapnik was
one of the main pioneers behind statistical learning theory, and has written an
<a href="https://www.amazon.com/Statistical-Learning-Theory-Vladimir-Vapnik/dp/0471030031">entire
book</a>
on this stuff <del>which I have not read</del>, so I’ll leave that part aside for now. I
hope to understand this stuff one day.</p>
<h2 id="implementation-of-svm">Implementation of SVM+</h2>
<p>There’s just one catch: SVM+ is actually an fairly inefficient algorithm, and
definitely will not scale to large data sets. What’s so bad about it? <em>It has
<script type="math/tex">n</script> training examples but <script type="math/tex">2n</script> variables to estimate.</em> This is twice as many
variables to estimate as the standard formulation of the <a href="https://en.wikipedia.org/wiki/Support_vector_machine#Computing_the_SVM_classifier">vanilla
SVM</a>.
This isn’t something that we can patch: the problem is inherent to the
Lagrangian dual formulation that Vapnik and Vashist proposed.</p>
<p>Even worse, the optimization problem has constraints that are very different
from those of the standard SVM. In essence, this means that efficient,
out-of-the-box solvers for the standard SVM (e.g.
<a href="https://www.csie.ntu.edu.tw/~cjlin/libsvm/">LIBSVM</a> and
<a href="https://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a>) can’t be used to
train an SVM+ model.</p>
<p>Luckily, <a href="https://www.researchgate.net/publication/301880839_Simple_and_Efficient_Learning_using_Privileged_Information">a recent paper by Xu et
al.</a>
describes a neat mathematical trick to implement SVM+ in a simple and efficient
way. With this amendment, the authors rechristen the algorithm as SVM2+.
Essentially, instead of using the hinge loss when training SVM+, we will instead
use the <em>squared</em> hinge loss. It turns out that changing the loss function in
this way leads to a tiny miracle.</p>
<p>This (re)formulation of SVM+ becomes <em>identical</em> to that of the standard SVM,
except we replace the Gram matrix (a.k.a. kernel matrix) <script type="math/tex">\bf K</script> by <script type="math/tex">\bf K +
\bf{Q_\lambda} \odot (\bf{y} \bf{y}^T)</script>, where</p>
<ul>
<li><script type="math/tex">\bf y</script> is the target vector</li>
<li><script type="math/tex">\odot</script> denotes the Hadamard product</li>
<li><script type="math/tex">\bf{Q_\lambda}</script> is given by <script type="math/tex">Q_\lambda = \frac{1}{\lambda} (\tilde{K}
(\frac{\lambda}{C} I_n + \tilde{K})^{-1} \tilde{K})</script>, and</li>
<li><script type="math/tex">\bf \tilde{K}</script> is the Gram matrix formed by the privileged information</li>
</ul>
<p>So by replacing the hinge loss with the squared hinge loss, the SVM+ formulation
can now be solved with existing libraries!</p>
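<p>To make the trick concrete, here’s a minimal NumPy sketch of the SVM2+ kernel
modification, assuming linear kernels in both the regular and privileged spaces,
and feeding the modified Gram matrix to an off-the-shelf SVM solver. The
function name and toy data are mine, not from the paper:</p>

```python
import numpy as np
from sklearn.svm import SVC

def svm2plus_kernel(K, K_tilde, y, C=1.0, lam=1.0):
    """Modified Gram matrix for SVM2+ (the squared hinge loss trick).

    K       : (n, n) Gram matrix of the regular features
    K_tilde : (n, n) Gram matrix of the privileged features
    y       : (n,) vector of +/-1 labels
    """
    n = K.shape[0]
    # Q_lambda = (1/lam) * K~ (lam/C I + K~)^{-1} K~
    Q = (K_tilde @ np.linalg.solve((lam / C) * np.eye(n) + K_tilde, K_tilde)) / lam
    # Replace K by K + Q_lambda (Hadamard product) y y^T
    return K + Q * np.outer(y, y)

# Toy data: regular features X, plus a "richer" privileged view X_star
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X_star = np.hstack([X, rng.normal(size=(40, 2))])
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=40))

K = X @ X.T                  # linear kernel on the regular space
K_tilde = X_star @ X_star.T  # linear kernel on the privileged space
clf = SVC(kernel="precomputed").fit(svm2plus_kernel(K, K_tilde, y), y)
```

<p>Note that the privileged kernel <code class="highlighter-rouge">K_tilde</code> is only
needed at training time; at test time only the regular kernel appears, which is
exactly the knowledge-transfer property we wanted.</p>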
<h2 id="extensions-to-svm">Extensions to SVM+</h2>
<p>In his paper, Vapnik makes it clear that LUPI is a very general and abstract
paradigm, and as such there is plenty of room for creativity and innovation:
not just in researching and developing new LUPI methods and algorithms, but also
in implementing and applying them. It is unknown how to best go about supplying
privileged information so as to get good performance. How should the data be
feature engineered? How much signal should be in the privileged information?
These are all open questions.</p>
<p>Vapnik himself opens up three avenues to extend the SVM+ algorithm:</p>
<ol>
<li><em>a mixture model of slacks:</em> when slacks are estimated by a mixture of a
smooth function and some prior</li>
<li><em>a model where privileged information is available only for a part of the
training data:</em> where we can only supply privileged information on a small
subset of the training examples</li>
<li><em>multiple-space privileged information:</em> where the privileged information we
can supply does not all share the same features</li>
</ol>
<p>Clearly, there’s a lot of potential in the LUPI paradigm, as well as a lot of
reasons to be skeptical. It’s very much a nascent perspective of machine
learning, so I’m interested in keeping an eye on it going forward. I’m hoping
to write more posts on LUPI in the future!</p>George HoWhat is learning using privileged information (LUPI), how do I do it, and why should I care? A brief introduction to LUPI and SVM+.Linear Discriminant Analysis for Starters2017-12-16T00:00:00+00:002017-12-30T00:00:00+00:00https://eigenfoo.xyz/lda<p><em>Linear discriminant analysis</em> (commonly abbreviated to LDA, and not to be
confused with <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">the other
LDA</a>) is a very
common dimensionality reduction technique for classification problems. However,
that’s something of an understatement: it does so much more than “just”
dimensionality reduction.</p>
<p>In plain English, if you have high-dimensional data (i.e. a large number of
features) from which you wish to classify observations, LDA will help you
transform your data so as to make the classes as distinct as possible. More
rigorously, LDA will find the linear projection of your data into a
lower-dimensional subspace that optimizes some measure of class separation. The
dimension of this subspace is necessarily strictly less than the number of
classes.</p>
<p>This separation-maximizing property of LDA makes it so good at its job that it’s
sometimes considered a classification algorithm in and of itself, which leads to
some confusion. <em>Linear discriminant analysis</em> is a form of dimensionality
reduction, but with a few extra assumptions, it can be turned into a classifier.
(Avoiding these assumptions gives its relative, <em>quadratic discriminant
analysis</em>, but more on that later). Somewhat confusingly, some authors call the
dimensionality reduction technique “discriminant analysis”, and only prepend the
“linear” once we begin classifying. I actually like this naming convention more
(it tracks the mathematical assumptions a bit better, I think), but most people
nowadays call the entire technique “LDA”, so that’s what I’ll call it.</p>
<p>The goal of this post is to give a comprehensive introduction to, and
explanation of, LDA. I’ll look at LDA in three ways:</p>
<ol>
<li>LDA as an algorithm: what does it do, and how does it do it?</li>
<li>LDA as a theorem: a mathematical derivation of LDA</li>
<li>LDA as a machine learning technique: practical considerations when using LDA</li>
</ol>
<p>This is a lot for one post, but my hope is that there’s something in here for
everyone.</p>
<h2 id="lda-as-an-algorithm">LDA as an Algorithm</h2>
<h3 id="problem-statement">Problem statement</h3>
<p>Before we dive into LDA, it’s good to get an intuitive grasp of what LDA
tries to accomplish.</p>
<p>Suppose that:</p>
<ol>
<li>You have very high-dimensional data, and that</li>
<li>You are dealing with a classification problem</li>
</ol>
<p>This could mean that the number of features is greater than the number of
observations, or it could mean that you suspect there are noisy features that
contain little information, or anything in between.</p>
<p>Given that this is the problem at hand, you wish to accomplish two things:</p>
<ol>
<li>Reduce the number of features (i.e. reduce the dimensionality of your feature
space), and</li>
<li>Preserve (or even increase!) the “distinguishability” of your classes or the
“separatedness” of the classes in your feature space.</li>
</ol>
<p>This is the problem that LDA attempts to solve. It should be fairly obvious why
this problem might be worth solving.</p>
<p>To judiciously appropriate a term from signal processing, we are interested in
increasing the signal-to-noise ratio of our data, by both extracting or
synthesizing features that are useful in classifying our data (amplifying our
signal), and throwing out the features that are not as useful (attenuating our
noise).</p>
<p>Below is a simple illustration I made, inspired by <a href="https://www.quora.com/Can-you-explain-the-comparison-between-principal-component-analysis-and-linear-discriminant-analysis-in-dimensionality-reduction-with-MATLAB-code-Which-one-is-more-efficient">Sebastian
Raschka</a>,
that may help our intuition about the problem:</p>
<p><a href="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png"><img style="float: middle" width="500" height="500" src="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png" /></a></p>
<p>A couple of points to make:</p>
<ul>
<li>LD1 and LD2 are among the projections that LDA would consider. In reality, LDA
would consider <em>all possible</em> projections, not just those along the x and y
axes.</li>
<li>LD1 is the one that LDA would actually come up with: this projection gives the
best “separation” of the two classes.</li>
<li>LD2 is a horrible projection by this metric: both classes get horribly
overlapped… (this actually relates to PCA, but more on that later)</li>
</ul>
<p><strong>UPDATE:</strong> For another illustration, Rahul Sangole made a simple but great
interactive visualization of LDA
<a href="https://rsangole.shinyapps.io/LDA_Visual/">here</a> using
<a href="https://shiny.rstudio.com/">Shiny</a>.</p>
<h3 id="solution">Solution</h3>
<p>First, some definitions:</p>
<p>Let:</p>
<ul>
<li><script type="math/tex">n</script> be the number of classes</li>
<li><script type="math/tex">\mu</script> be the mean of all observations</li>
<li><script type="math/tex">N_i</script> be the number of observations in the <script type="math/tex">i</script>th class</li>
<li><script type="math/tex">\mu_i</script> be the mean of the <script type="math/tex">i</script>th class</li>
<li><script type="math/tex">\Sigma_i</script> be the <a href="https://en.wikipedia.org/wiki/Scatter_matrix">scatter
matrix</a> of the <script type="math/tex">i</script>th class</li>
</ul>
<p>Now, define <script type="math/tex">S_W</script> to be the <em>within-class scatter matrix</em>, given by</p>
<script type="math/tex; mode=display">\begin{align*}
S_W = \sum_{i=1}^{n}{\Sigma_i}
\end{align*}</script>
<p>and define <script type="math/tex">S_B</script> to be the <em>between-class scatter matrix</em>, given by</p>
<script type="math/tex; mode=display">\begin{align*}
S_B = \sum_{i=1}^{n}{N_i (\mu_i - \mu) (\mu_i - \mu)^T}
\end{align*}</script>
<p><a href="https://en.wikipedia.org/wiki/Diagonalizable_matrix">Diagonalize</a> <script type="math/tex">S_W^{-1}
S_B</script> to get its eigenvalues and eigenvectors.</p>
<p>Pick the <script type="math/tex">k</script> largest eigenvalues, and their associated eigenvectors. We will
project our observations onto the subspace spanned by these vectors.</p>
<p>Concretely, what this means is that we form the matrix <script type="math/tex">A</script>, whose columns are the
<script type="math/tex">k</script> eigenvectors chosen above. <script type="math/tex">A</script> will allow us to transform our
observations into the new subspace via the equation <script type="math/tex">y = A^T x</script>, where <script type="math/tex">y</script> is
our transformed observation, and <script type="math/tex">x</script> is our original observation.</p>
<p>And that’s it!</p>
<p>For a more detailed and intuitive explanation of the LDA “recipe”, see
<a href="http://sebastianraschka.com/Articles/2014_python_lda.html">Sebastian Raschka’s blog post on
LDA</a>.</p>
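<p>The whole recipe fits in a few lines of NumPy. This is a bare-bones sketch on
synthetic data (no regularization, and the helper name is mine), but it follows
the steps above exactly:</p>

```python
import numpy as np

def lda_projection(X, y, k):
    """Return the (d, k) projection matrix A from the recipe above."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    S_W = np.zeros((X.shape[1], X.shape[1]))  # within-class scatter
    S_B = np.zeros_like(S_W)                  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # Eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:k]]

# Project 4-D observations from two classes onto the top LDA axis
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(30, 4)) for m in (0.0, 3.0)])
y = np.repeat([0, 1], 30)
A = lda_projection(X, y, k=1)
Y = X @ A  # y = A^T x, applied row-wise
```

<p>After projecting, the class means of <code class="highlighter-rouge">Y</code>
should sit far apart relative to the within-class spread.</p>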
<h2 id="lda-as-a-theorem">LDA as a Theorem</h2>
<p><strong>Sketch of Derivation:</strong></p>
<p>In order to maximize class separability, we need some way of measuring it as a
number. This number should be bigger when the between-class scatter is bigger,
and smaller when the within-class scatter is larger. There are many such
formulas/numbers that have this property: <a href="https://www.elsevier.com/books/introduction-to-statistical-pattern-recognition/fukunaga/978-0-08-047865-4">Fukunaga’s <em>Introduction to
Statistical Pattern
Recognition</em></a>
considers no less than four! Here, we’ll concern ourselves with just one:</p>
<script type="math/tex; mode=display">J_1 = tr(S_{WY}^{-1} S_{BY})</script>
<p>where I denote the within and between-class scatter matrices of the projected
vector <script type="math/tex">Y</script> by <script type="math/tex">S_{WY}</script> and <script type="math/tex">S_{BY}</script>, to avoid confusion with the
corresponding matrices for the original vector <script type="math/tex">X</script>.</p>
<p>Now, a standard result from probability is that for any random variable <script type="math/tex">X</script>
and matrix <script type="math/tex">A</script>, we have <script type="math/tex">cov(A^T X) = A^T cov(X) A</script>. We’ll apply this
result to our projection <script type="math/tex">y = A^T x</script>. It follows that</p>
<script type="math/tex; mode=display">S_{WY} = A^T S_{WX} A</script>
<p>and</p>
<script type="math/tex; mode=display">S_{BY} = A^T S_{BX} A</script>
<p>where <script type="math/tex">S_{BX}</script> and <script type="math/tex">S_{BY}</script> are the between-class scatter matrices, and
<script type="math/tex">S_{WX}</script> and <script type="math/tex">S_{WY}</script> are the within-class scatter matrices, for <script type="math/tex">X</script>
and its projection <script type="math/tex">Y</script>, respectively.</p>
<p>It’s now a simple matter to write <script type="math/tex">J_1</script> in terms of <script type="math/tex">A</script>, and maximize
<script type="math/tex">J_1</script>. Without going into the details, we set <script type="math/tex">\frac{\partial J_1}{\partial
A} = 0</script> (whatever that means), and use the fact that <a href="https://math.stackexchange.com/questions/546155/proof-that-the-trace-of-a-matrix-is-the-sum-of-its-eigenvalues">the trace of a matrix is
the sum of its
eigenvalues</a>.</p>
<p>I don’t want to go into the weeds with this here, but if you really want to see
the algebra, Fukunaga is a great resource. The end result, however, is the same
condition on the eigenvalues and eigenvectors as stated above: in other words,
the optimization gives us LDA as presented.</p>
<p>There’s one more quirk of LDA that’s very much worth knowing. Suppose you have
10 classes, and you run LDA. It turns out that the <em>maximum</em> number of features
LDA can give you is one less than the number of classes, so in this case, 9!</p>
<p><strong>Proposition:</strong> <script type="math/tex">S_W^{-1} S_B</script> has at most <script type="math/tex">n-1</script> non-zero eigenvalues, which
implies that LDA can reduce the dimension to <em>at most</em> <script type="math/tex">n-1</script>.</p>
<p>To prove this, we first need a lemma.</p>
<p><strong>Lemma:</strong> Suppose <script type="math/tex">{v_i}_{i=1}^{n}</script> is a set of linearly dependent vectors, and
let <script type="math/tex">\alpha_i</script> be <script type="math/tex">n</script> coefficients. Then, <script type="math/tex">M = \sum_{i=1}^{n}{\alpha_i v_i
v_i^{T}}</script>, a linear combination of outer products of the vectors with
themselves, is rank deficient.</p>
<p><strong>Proof:</strong> The row space of <script type="math/tex">M</script> is generated by the set of vectors <script type="math/tex">{v_1, v_2,
..., v_n}</script>. However, because this set of vectors is linearly dependent, it must
span a vector space of dimension strictly less than <script type="math/tex">n</script>, or in other words
less than or equal to <script type="math/tex">n-1</script>. But the dimension of the row space is precisely
the rank of the matrix <script type="math/tex">M</script>. Thus, <script type="math/tex">rank(M) \leq n-1</script>, as desired.</p>
<p>With the lemma, we’re now ready to prove our proposition.</p>
<p><strong>Proof:</strong> We have that</p>
<script type="math/tex; mode=display">\begin{align*}
\mu = \frac{1}{N} \sum_{i=1}^{n}{N_i \mu_i} \implies \sum_{i=1}^{n}{N_i (\mu_i-\mu)} = 0
\end{align*}</script>
<p>where <script type="math/tex">N = \sum_i N_i</script> is the total number of observations. So
<script type="math/tex">\{\mu_i-\mu\}_{i=1}^{n}</script> is a linearly dependent set. Applying our lemma, we
see that</p>
<script type="math/tex; mode=display">S_B = \sum_{i=1}^{n}{N_i (\mu_i-\mu)(\mu_i-\mu)^{T}}</script>
<p>must be rank deficient. Thus, <script type="math/tex">rank(S_W) \leq n-1</script>. Now, <script type="math/tex">rank(AB) \leq
rank(A)rank(B)</script>, so</p>
<script type="math/tex; mode=display">\begin{align*}
rank(S_W^{-1}S_B) \leq \min{(rank(S_W^{-1}), rank(S_B))} = n-1
\end{align*}</script>
<p>as desired.</p>
<h2 id="lda-as-a-machine-learning-technique">LDA as a Machine Learning Technique</h2>
<p>OK so we’re done with the math, but how is LDA actually used in practice? One of
the easiest ways is to look at how LDA is actually implemented in the real
world. <code class="highlighter-rouge">scikit-learn</code> has <a href="http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis">a very well-documented implementation of
LDA</a>:
I find that reading the docs is a great way to learn stuff.</p>
<p>Below are a few miscellaneous comments on practical considerations when using
LDA.</p>
<h3 id="regularization-aka-shrinkage">Regularization (a.k.a. shrinkage)</h3>
<p><code class="highlighter-rouge">scikit-learn</code>’s implementation of LDA has an interesting optional parameter:
<code class="highlighter-rouge">shrinkage</code>. What’s that about?</p>
<p><a href="https://stats.stackexchange.com/questions/106121/does-it-make-sense-to-combine-pca-and-lda/109810#109810">Here’s a wonderful Cross Validated
post</a>
on how LDA can introduce overfitting. In essence, matrix inversion is an
extremely sensitive operation (in that small changes in the matrix may lead to
large changes in its inverse, so that even a tiny bit of noise will be amplified
upon inverting the matrix), and so unless the estimate of the within-class
scatter matrix <script type="math/tex">S_W</script> is very good, its inversion is likely to introduce
overfitting.</p>
<p>One way to combat that is through regularizing LDA. It basically replaces
<script type="math/tex">S_W</script> with <script type="math/tex">(1-t)S_W + tI</script>, where <script type="math/tex">I</script> is the identity matrix, and <script type="math/tex">t</script> is
the <em>regularization parameter</em>, or the <em>shrinkage constant</em>. That’s what
<code class="highlighter-rouge">scikit</code>’s <code class="highlighter-rouge">shrinkage</code> parameter is: it’s <script type="math/tex">t</script>.</p>
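<p>In code, the knob looks like this. Note that
<code class="highlighter-rouge">shrinkage</code> only works with the
<code class="highlighter-rouge">lsqr</code> and
<code class="highlighter-rouge">eigen</code> solvers, not the default
<code class="highlighter-rouge">svd</code> one. The synthetic data here is mine,
chosen to have more features than observations — exactly when the estimate of
<script type="math/tex">S_W</script> is shaky:</p>

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Few observations, many features: the regime where shrinkage helps
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 50)),
               rng.normal(0.5, 1.0, size=(20, 50))])
y = np.repeat([0, 1], 20)

# A fixed shrinkage constant t = 0.5
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.5).fit(X, y)

# 'auto' picks t via the Ledoit-Wolf formula instead of a fixed constant
lda_auto = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
```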
<p>If you’re interested in <em>why</em> this linear combination of the within-class
scatter and the identity give such a well-conditioned estimate of <script type="math/tex">S_W</script>, check
out <a href="https://www.sciencedirect.com/science/article/pii/S0047259X03000964">the original paper by Ledoit and
Wolf</a>.
Their original motivation was in financial portfolio optimization, so they’ve
also authored several other papers
(<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=433840&rec=1&srcabs=290916&alg=7&pos=6">here</a>
and <a href="https://www.sciencedirect.com/science/article/pii/S0927539803000070">here</a>)
that go into the more financial details. That needn’t concern us though:
covariance matrices are literally everywhere.</p>
<p>For an illustration of this, <code class="highlighter-rouge">amoeba</code>’s post on Cross Validated gives a good
example of LDA overfitting, and how regularization can help combat that.</p>
<h3 id="lda-as-a-classifier">LDA as a classifier</h3>
<p>We’ve talked a lot about how LDA is a dimensionality reduction technique. But
with two extra assumptions, LDA becomes a very capable classifier as well! Here
they are:</p>
<ol>
<li>Assume that the class conditional distributions are Gaussian, and</li>
<li>Assume that these Gaussians have the same covariance matrix (a.k.a.
assume <a href="https://en.wikipedia.org/wiki/Homoscedasticity">homoskedasticity</a>)</li>
</ol>
<p>Now, <em>how</em> LDA acts as a classifier is a bit complicated: the problem is solved
fairly easily if there are only two classes. In this case, the optimal Bayesian
solution is to classify the observation depending on whether the log of the
likelihood ratio is less than or greater than some threshold. This turns out to
be a simple dot product: <script type="math/tex">\vec{w} \cdot \vec{x} > c</script>, where <script type="math/tex">\vec{w} =
\Sigma^{-1} (\vec{\mu_1} - \vec{\mu_2})</script>. <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis#LDA_for_two_classes">Wikipedia has a good derivation of
this</a>.</p>
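<p>Here’s what that two-class rule looks like in NumPy, on hypothetical Gaussian
data. I take the threshold <script type="math/tex">c</script> to be the projected
midpoint of the two class means, which corresponds to assuming equal priors:</p>

```python
import numpy as np

def lda_two_class_rule(X1, X2):
    """Return (w, c) for the rule: predict class 1 iff w @ x > c.

    Assumes equal priors, so c is the projected midpoint of the means.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance estimate (the shared-Sigma assumption)
    Sigma = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2
    w = np.linalg.solve(Sigma, mu1 - mu2)   # w = Sigma^{-1} (mu1 - mu2)
    c = w @ (mu1 + mu2) / 2
    return w, c

rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([2, 0], np.eye(2), size=100)
X2 = rng.multivariate_normal([0, 2], np.eye(2), size=100)
w, c = lda_two_class_rule(X1, X2)
preds = X1 @ w > c  # fraction of class-1 points classified correctly
```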
<p>There isn’t really a nice dot-product solution for the multiclass case. So,
what’s commonly done is to take a “one-against-the-rest” approach, in which
there are <script type="math/tex">k</script> binary classifiers, one for each of the <script type="math/tex">k</script> classes. Another
common technique is to take a pairwise approach, in which there are <script type="math/tex">k(k-1)/2</script>
classifiers, one for each pair of classes. In either case, the outputs of all
the classifiers are combined in some way to give the final classification.</p>
<h3 id="close-relatives-pca-qda-anova">Close relatives: PCA, QDA, ANOVA</h3>
<p>LDA is similar to a lot of other techniques, and the fact that they all go by
acronyms doesn’t do anyone a favor. My goal here isn’t to introduce or explain
these various techniques, but rather point out their differences.</p>
<p><em>1) Principal components analysis (PCA):</em></p>
<p>LDA is very similar to <a href="http://setosa.io/ev/principal-component-analysis">PCA</a>:
in fact, the question posted in the Cross Validated post above was actually
about whether or not it would make sense to perform PCA followed by LDA.</p>
<p>There is a crucial difference between the two techniques, though. PCA tries to
find the axes with <em>maximum variance</em> for the whole data set, whereas LDA tries
to find the axes for best <em>class separability</em>.</p>
<p><a href="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png"><img style="float: middle" width="500" height="500" src="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png" /></a></p>
<p>Going back to the illustration from before (reproduced above), it’s not
hard to see that PCA would give us LD2, whereas LDA would give us LD1. This
makes the main difference between PCA and LDA painfully obvious: just because a
feature has a high variance, doesn’t mean that it’s predictive of the classes!</p>
<p><em>2) Quadratic discriminant analysis (QDA):</em></p>
<p>QDA is a generalization of LDA as a classifier. As mentioned above, LDA must
assume that the class conditional distributions are Gaussian with the same
covariance matrix, if we want it to do any classification for us.</p>
<p>QDA doesn’t make this homoskedasticity assumption (assumption number 2 above),
and attempts to estimate a separate covariance matrix for each class. While this
might seem like a more robust algorithm (fewer assumptions! Occam’s razor!),
this means there are many more parameters to estimate: one covariance matrix per
class, each with a number of parameters that grows quadratically with the number
of features! So unless you can guarantee that your covariance estimates are
reliable, you might not want to use QDA.</p>
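<p>A quick way to see the trade-off is on synthetic heteroskedastic data, where
the two classes share a mean but not a covariance: LDA is essentially helpless
there, while QDA fits one covariance per class. The data below is illustrative,
not from any real problem:</p>

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Two classes with the same mean but very different covariances
rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=200)
X2 = rng.multivariate_normal([0, 0], [[4, 0], [0, 0.25]], size=200)
X = np.vstack([X1, X2])
y = np.repeat([0, 1], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)

print(len(qda.covariance_))          # one covariance matrix per class
print(lda.score(X, y), qda.score(X, y))
```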
<p>After all of this, there might be some confusion about the relationship between
LDA, QDA, what’s for dimensionality reduction, what’s for classification, etc.
<a href="https://stats.stackexchange.com/questions/71489/three-versions-of-discriminant-analysis-differences-and-how-to-use-them/71571#71571">This Cross Validated
post</a>,
and everything that it links to, might help clear things up.</p>
<p><em>3) Analysis of variance (ANOVA):</em></p>
<p>LDA and <a href="https://en.wikipedia.org/wiki/Analysis_of_variance">ANOVA</a> seem to have
similar aims: both try to “decompose” an observed variable into several
explanatory/discriminatory variables. However, there is an important difference
that <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis">the Wikipedia article on
LDA</a> puts very
succinctly (my emphases):</p>
<blockquote>
<p>LDA is closely related to analysis of variance (ANOVA) and regression
analysis, which also attempt to express one dependent variable as a linear
combination of other features or measurements. However, ANOVA uses
<strong>categorical</strong> independent variables and a <strong>continuous</strong> dependent variable,
whereas discriminant analysis has <strong>continuous</strong> independent variables and a
<strong>categorical</strong> dependent variable (i.e. the class label).</p>
</blockquote>George HoEverything that you wanted to know (and more!) about linear discriminant analysis (LDA) — how it works, why it works, and how to use it.