Jekyll2018-12-02T21:37:39+00:00https://eigenfoo.xyz/feed.xmlEigenfooGeorge Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngModelling Hate Speech on Reddit — A Three-Act Play (Slide Deck)2018-11-03T00:00:00+00:002017-11-08T00:00:00+00:00https://eigenfoo.xyz/reddit-slides<p>This is a follow-up post to my first post on a recent project to <a href="https://eigenfoo.xyz/reddit-clusters/">model hate
speech on Reddit</a>. If you haven’t taken a
look at my first post, please do!</p>
<p>I recently gave a talk on the technical, data science side of the project,
describing not just the final result, but also the trajectory of the whole
project: stumbling blocks, dead ends and all. Below is the slide deck, as well
as the speaker notes. Enjoy!</p>
<h2 id="abstract">Abstract</h2>
<p>Reddit is one of the most popular discussion websites today, and is famously
broad-minded in what it allows to be said on its forums. However, where there is
free speech, there are invariably pockets of hate speech.</p>
<p>In this talk, I present a recent project to model hate speech on Reddit. In
three acts, I chronicle the thought processes and stumbling blocks of the
project, with each act applying a different form of machine learning: supervised
learning, topic modelling and text clustering. I conclude with the current state
of the project: a system that allows the modelling and summarization of entire
subreddits, and possible future directions. Rest assured that both the talk and
the slides have been scrubbed to be safe for work!</p>
<h2 id="slides">Slides</h2>
<p>(Don’t forget to take a look at the speaker notes!)</p>
<style>
.responsive-wrap iframe{ max-width: 100%;}
</style>
<div class="responsive-wrap">
<!-- this is the embed code provided by Google -->
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vS9wBAwScepPz3vmvyMrq-osBfIGzL7C3wArXmL3ky_A2dfaqlVSshTz2CyHuMibQBX3Ej6QCsZ0qv_/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<!-- Google embed ends -->
</div>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngIn this talk, I present a recent project to model hate speech on Reddit. In three acts, I chronicle the thought processes and stumbling blocks of the project, with each act applying a different form of machine learning: supervised learning, topic modelling and text clustering.Probabilistic and Bayesian Matrix Factorizations for Text Clustering2018-10-13T00:00:00+00:002018-10-13T00:00:00+00:00https://eigenfoo.xyz/matrix-factorizations<p>Natural language processing is in a curious place right now. It was always a
late bloomer (as far as machine learning subfields go), and it’s not immediately
obvious how close the field is to viable, large-scale, production-ready
techniques (in the same way that, say, <a href="https://clarifai.com/models/">computer vision
is</a>). For example, <a href="https://ruder.io">Sebastian
Ruder</a> predicted that the field is <a href="https://thegradient.pub/nlp-imagenet/">close to a watershed
moment</a>, and that soon we’ll have
downloadable language models. However, <a href="https://amarasovic.github.io/">Ana
Marasović</a> points out that there is <a href="https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/">a tremendous
amount of work demonstrating
that</a>:</p>
<blockquote>
<p>“despite good performance on benchmark datasets, modern NLP techniques are
nowhere near the skill of humans at language understanding and reasoning when
making sense of novel natural language inputs”.</p>
</blockquote>
<p>I am confident that I am <em>very</em> bad at making lofty predictions about the
future. Instead, I’ll talk about something I know a bit about: simple solutions
to concrete problems, with some Bayesianism thrown in for good measure
:grinning:.</p>
<p>This blog post will summarize some literature on probabilistic and Bayesian
matrix factorization methods, keeping an eye out for applications to one
specific task in NLP: text clustering. It’s exactly what it sounds like, and
there’s been a fair amount of success in applying text clustering to many other
NLP tasks (e.g. check out these examples in <a href="https://www-users.cs.umn.edu/~hanxx023/dmclass/scatter.pdf">document
organization</a>,
<a href="http://jmlr.csail.mit.edu/papers/volume3/bekkerman03a/bekkerman03a.pdf">corpus</a>
<a href="https://www.cs.technion.ac.il/~rani/el-yaniv-papers/BekkermanETW01.pdf">summarization</a>
and <a href="http://www.kamalnigam.com/papers/emcat-aaai98.pdf">document
classification</a>).</p>
<p>What follows is a literature review of three matrix factorization techniques for
machine learning: one classical, one probabilistic and one Bayesian. I also
experimented with applying these methods to text clustering: I gave a guest
lecture on my results to a graduate-level machine learning class at The Cooper
Union (the slide deck is below). Dive in!</p>
<h2 id="non-negative-matrix-factorization-nmf">Non-Negative Matrix Factorization (NMF)</h2>
<p>NMF is a <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization">very
well-known</a>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html">matrix
factorization</a>
<a href="https://arxiv.org/abs/1401.5226">technique</a>, perhaps most famous for its
applications in <a href="http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/">collaborative filtering and the Netflix
Prize</a>.</p>
<p>Factorize your (entrywise non-negative) <script type="math/tex">m \times n</script> matrix <script type="math/tex">V</script> as
<script type="math/tex">V = WH</script>, where <script type="math/tex">W</script> is <script type="math/tex">m \times p</script> and <script type="math/tex">H</script> is <script type="math/tex">p \times n</script>. <script type="math/tex">p</script>
is the dimensionality of your latent space, and each latent dimension usually
comes to quantify something with semantic meaning. There are several algorithms
to compute this factorization, but Lee and Seung’s <a href="https://dl.acm.org/citation.cfm?id=3008829">multiplicative update
rule</a> (originally published in NIPS
2000) is most popular.</p>
<p>Fairly simple: enough said, I think.</p>
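<p>For concreteness, here is a minimal NumPy sketch of the multiplicative updates
for the squared-error objective. The function name <code class="highlighter-rouge">nmf_multiplicative</code>, the
random initialization, the iteration count and the small <code class="highlighter-rouge">eps</code> guard are my own
assumptions for illustration, not part of any particular library.</p>

```python
import numpy as np


def nmf_multiplicative(V, p, num_iters=500, eps=1e-10, seed=0):
    """Approximate a non-negative (m x n) matrix V as W @ H.

    Uses Lee and Seung's multiplicative updates for the squared Frobenius
    error. `eps` (an assumption of this sketch) guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, p)) + eps  # (m x p), non-negative
    H = rng.random((p, n)) + eps  # (p x n), non-negative
    for _ in range(num_iters):
        # Elementwise updates: factors stay non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy data: an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(42)
V = rng.random((6, 2)) @ rng.random((2, 5))
W, H = nmf_multiplicative(V, p=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

<p>Because the updates only ever multiply by non-negative ratios, no projection or
clipping step is needed to keep <code class="highlighter-rouge">W</code> and <code class="highlighter-rouge">H</code> non-negative.</p>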
<h2 id="probabilistic-matrix-factorization-pmf">Probabilistic Matrix Factorization (PMF)</h2>
<p>Originally introduced as a paper at <a href="https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization">NIPS
2007</a>,
<em>probabilistic matrix factorization</em> is essentially the exact same model as NMF,
but with uncorrelated (a.k.a. “spherical”) multivariate Gaussian priors placed
on the rows and columns of <script type="math/tex">U</script> and <script type="math/tex">V</script>. Expressed as a graphical model, PMF
would look like this:</p>
<figure>
<a href="/assets/images/pmf.png"><img style="float: middle" src="/assets/images/pmf.png" /></a>
</figure>
<p>Note that the priors are placed on the <em>rows</em> of the <script type="math/tex">U</script> and <script type="math/tex">V</script> matrices.</p>
<p>The authors then (somewhat disappointingly) proceed to find the MAP estimate of
the <script type="math/tex">U</script> and <script type="math/tex">V</script> matrices. They show that maximizing the posterior is
equivalent to minimizing the sum-of-squared-errors loss function with two
quadratic regularization terms:</p>
<script type="math/tex; mode=display">\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} {I_{ij} (R_{ij} - U_i^T V_j)^2} +
\frac{\lambda_U}{2} \sum_{i=1}^{N} |U_i|_{Fro}^2 +
\frac{\lambda_V}{2} \sum_{j=1}^{M} |V_j|_{Fro}^2</script>
<p>where <script type="math/tex">|\cdot|_{Fro}</script> denotes the Frobenius norm, and <script type="math/tex">I_{ij}</script> is 1 if document
<script type="math/tex">i</script> contains word <script type="math/tex">j</script>, and 0 otherwise.</p>
<p>This loss function can be minimized via gradient descent, and implemented in
your favorite deep learning framework (e.g. TensorFlow or PyTorch).</p>
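<p>No deep learning framework is strictly required, though: the gradients are
simple enough to write by hand. Below is a rough NumPy sketch of plain gradient
descent on this loss. The names <code class="highlighter-rouge">pmf_loss</code> and <code class="highlighter-rouge">pmf_map</code>, the learning rate,
the regularization strengths and the toy data are all assumptions of mine, and
the paper's own optimization details may differ.</p>

```python
import numpy as np


def pmf_loss(R, I, U, V, lam_u, lam_v):
    """Squared error on observed entries plus two quadratic penalties."""
    residual = I * (R - U @ V.T)
    return (0.5 * np.sum(residual ** 2)
            + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum(V ** 2))


def pmf_map(R, I, dim, lam_u=0.1, lam_v=0.1, lr=0.01, num_steps=2000, seed=0):
    """MAP estimate of U (N x dim) and V (M x dim) by gradient descent.

    Row i of U is U_i and row j of V is V_j, so predictions are U @ V.T.
    Hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = 0.1 * rng.standard_normal((N, dim))
    V = 0.1 * rng.standard_normal((M, dim))
    losses = []
    for _ in range(num_steps):
        residual = I * (R - U @ V.T)             # zero where unobserved
        grad_U = -residual @ V + lam_u * U       # d(loss)/dU
        grad_V = -residual.T @ U + lam_v * V     # d(loss)/dV
        U -= lr * grad_U
        V -= lr * grad_V
        losses.append(pmf_loss(R, I, U, V, lam_u, lam_v))
    return U, V, losses


# Toy data: a rank-2 matrix with roughly 70% of entries observed.
rng = np.random.default_rng(1)
R = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
I = (rng.random(R.shape) < 0.7).astype(float)
U, V, losses = pmf_map(R, I, dim=2)
```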
<p>The problem with this approach is that while the MAP estimate is often a
reasonable point in low dimensions, it becomes very strange in high dimensions,
and is usually not informative or special in any way. Read <a href="https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
for more.</p>
<h2 id="bayesian-probabilistic-matrix-factorization-bpmf">Bayesian Probabilistic Matrix Factorization (BPMF)</h2>
<p>Strictly speaking, PMF is not a Bayesian model. After all, there aren’t any
priors or posteriors, only fixed hyperparameters and a MAP estimate. <em>Bayesian
probabilistic matrix factorization</em>, originally published by <a href="https://dl.acm.org/citation.cfm?id=1390267">researchers from
the University of Toronto</a>, is a
fully Bayesian treatment of PMF.</p>
<p>Instead of saying that the rows/columns of U and V are normally distributed with
zero mean and some precision matrix, we place hyperpriors on the mean vectors and
precision matrices. The specific priors are Wishart priors on the precision
matrices (with scale matrix <script type="math/tex">W_0</script> and <script type="math/tex">\nu_0</script> degrees of freedom), and
Gaussian priors on the means (with mean <script type="math/tex">\mu_0</script> and precision proportional to
the precision matrix drawn from the Wishart prior). Expressed as a graphical model, BPMF
would look like this:</p>
<figure>
<a href="/assets/images/bpmf.png"><img style="float: middle" src="/assets/images/bpmf.png" /></a>
</figure>
<p>Note that, as above, the priors are placed on the <em>rows</em> of the <script type="math/tex">U</script> and <script type="math/tex">V</script>
matrices, and that <script type="math/tex">n</script> is the dimensionality of latent space (i.e. the number
of latent dimensions in the factorization).</p>
<p>The authors then sample from the posterior distribution of <script type="math/tex">U</script> and <script type="math/tex">V</script> using
a Gibbs sampler. Sampling takes several hours: somewhere between 5 and 180,
depending on how many samples you want. Nevertheless, the authors demonstrate
that BPMF can achieve more accurate and more robust results on the Netflix data
set.</p>
<p>I would propose two changes to the original paper:</p>
<ol>
<li>Use an LKJ prior on the covariance matrices instead of a Wishart prior.
<a href="https://docs.pymc.io/notebooks/LKJ.html">According to Michael Betancourt and the PyMC3 docs, this is more numerically
stable</a>, and will lead to better
inference.</li>
<li>Use a more robust sampler such as NUTS (instead of a Gibbs sampler), or even
resort to variational inference. The paper makes it clear that BPMF is a
computationally painful endeavor, so any speedup to the method would be a
great help. It seems to me that for practical real-world applications to
collaborative filtering, we would want to use variational inference. Netflix
ain’t waiting 5 hours for their recommendations.</li>
</ol>
<h2 id="application-to-text-clustering">Application to Text Clustering</h2>
<p>Most of the work on these matrix factorization techniques focuses on
dimensionality reduction: that is, the problem of finding two factor matrices
that faithfully reconstruct the original matrix when multiplied together.
However, I was interested in applying the exact same techniques to a separate
task: text clustering.</p>
<p>A natural question is: why is matrix factorization<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> a good technique to use
for text clustering? Because it is simultaneously a clustering and a feature
engineering technique: not only does it offer us a latent representation of the
original data, but it also gives us a way to easily <em>reconstruct</em> the original
data from the latent variables! This is something that <a href="https://eigenfoo.xyz/lda-sucks">latent Dirichlet
allocation</a>, for instance, cannot do.</p>
<p>Matrix factorization lives an interesting double life: clustering technique by
day, feature transformation technique by night. <a href="http://charuaggarwal.net/text-cluster.pdf">Aggarwal and
Zhai</a> suggest that chaining matrix
factorization with some other clustering technique (e.g. agglomerative
clustering or topic modelling) is common practice and is called <em>concept
decomposition</em>, but I haven’t seen any other source back this up.</p>
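<p>To make the "double life" concrete, here is a tiny sketch of how one might read
clusters off a factorization of a document-by-word matrix. The matrices <code class="highlighter-rouge">W</code> and
<code class="highlighter-rouge">H</code> below are made-up numbers for illustration, and taking the <code class="highlighter-rouge">argmax</code> over
latent dimensions is just one simple assumed way to get hard cluster labels.</p>

```python
import numpy as np

# Hypothetical document-by-latent-dimension factor (4 documents, 2 dimensions).
W = np.array([[0.9, 0.1],
              [0.8, 0.0],
              [0.1, 0.7],
              [0.2, 0.6]])
# Hypothetical latent-dimension-by-word factor (2 dimensions, 3 words).
H = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.2, 1.0]])

# Clustering view: assign each document to its largest latent dimension.
clusters = W.argmax(axis=1)

# Interpretation view: word indices per dimension, most to least heavily weighted.
top_words = H.argsort(axis=1)[:, ::-1]

# Reconstruction view: multiply the factors back together.
V_hat = W @ H
```

<p>The rows of <code class="highlighter-rouge">W</code> could equally be handed to another clustering algorithm as
features, which is the chaining described above.</p>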
<p>I experimented with using these techniques to cluster subreddits (<a href="https://eigenfoo.xyz/reddit-clusters">sound
familiar?</a>). In a nutshell, nothing seemed
to work out very well, and I opine on why I think that’s the case in the slide
deck below. This talk was delivered to a graduate-level course in frequentist
machine learning. Don’t forget to take a look at the speaker notes too!</p>
<style>
.responsive-wrap iframe{ max-width: 100%;}
</style>
<div class="responsive-wrap">
<!-- this is the embed code provided by Google -->
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT_yB6dMJCnnwKRtkGbdx90lhYGGH329QAGrYw8SaR2mCh0VuocMWGEVJ2XhFNp44JQtPV_vOlQkslo/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<!-- Google embed ends -->
</div>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>which is, by the way, a <a href="http://scikit-learn.org/stable/modules/decomposition.html">severely underappreciated technique in machine learning</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngNatural language processing is in a curious place right now. It was always a late bloomer (as far as machine learning subfields go), and it's not immediately obvious how close the field is to viable, large-scale, production-ready techniques.Multi-Armed Bandits, Conjugate Models and Bayesian Reinforcement Learning2018-08-31T00:00:00+00:002018-08-31T00:00:00+00:00https://eigenfoo.xyz/bayesian-bandits<p>Let’s talk about Bayesianism. It’s developed a reputation (not entirely
justified, but not entirely unjustified either) for being too mathematically
sophisticated or too computationally intensive to work at scale. For instance,
inferring from a Gaussian mixture model is fraught with computational problems
(hierarchical funnels, multimodal posteriors, etc.), and may take a seasoned
Bayesian anywhere between a day and a month to do well. On the other hand, other
blunt hammers of estimation are as easy as a maximum likelihood estimate:
something you could easily get a SQL query to do if you wanted to.</p>
<p>In this blog post I hope to show that there is more to Bayesianism than just
MCMC sampling and suffering, by demonstrating a Bayesian approach to a classic
reinforcement learning problem: the <em>multi-armed bandit</em>.</p>
<p>The problem is this: imagine a gambler at a row of slot machines (each machine
being a “one-armed bandit”), who must devise a strategy so as to maximize
rewards. This strategy includes which machines to play, how many times to play
each machine, in which order to play them, and whether to continue with the
current machine or try a different machine.</p>
<p>This problem is a central problem in decision theory and reinforcement learning:
the agent (our gambler) starts out in a state of ignorance, but learns through
interacting with its environment (playing slots). For more details, Cam
Davidson-Pilon has a great introduction to multi-armed bandits in Chapter 6 of
his book <a href="https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb"><em>Bayesian Methods for
Hackers</em></a>,
and Tor Lattimore and Csaba Szepesvári cover a breathtaking amount of the
underlying theory in their book <a href="http://banditalgs.com/"><em>Bandit Algorithms</em></a>.</p>
<p>So let’s get started! I assume that you are familiar with:</p>
<ul>
<li>some basic probability, at least enough to know some distributions: normal,
Bernoulli, binomial…</li>
<li>some basic Bayesian statistics, at least enough to understand what a
<a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> (and
conjugate model) is, and why one might like them.</li>
<li><a href="https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/">Python generators and the <code class="highlighter-rouge">yield</code>
keyword</a>,
to understand some of the code I’ve written<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</li>
</ul>
<p>Dive in!</p>
<h2 id="the-algorithm">The Algorithm</h2>
<p>The algorithm is straightforward. The description below is taken from Cam
Davidson-Pilon over at Data Origami<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>For each round,</p>
<ol>
<li>Sample a random variable <script type="math/tex">X_b</script> from the prior of bandit <script type="math/tex">b</script>, for all
<script type="math/tex">b</script>.</li>
<li>Select the bandit with largest sample, i.e. select bandit <script type="math/tex">B =
\text{argmax}(X_b)</script>.</li>
<li>Observe the result of pulling bandit <script type="math/tex">B</script>, and update your prior on bandit
<script type="math/tex">B</script> using the conjugate model update rule.</li>
<li>Repeat!</li>
</ol>
<p>What I find remarkable about this is how dumbfoundingly simple it is! No MCMC
sampling, no <script type="math/tex">\hat{R}</script>s to diagnose, no pesky divergences… all it requires is
a conjugate model, and the rest is literally just counting.</p>
<p><strong>NB:</strong> This algorithm is technically known as <em>Thompson sampling</em>, and is only
one of many algorithms out there. The main difference is that there are other
ways to go from our current priors to a decision on which bandit to play
next. E.g. instead of simply sampling from our priors, we could use the
upper bound of the 90% credible region, or some dynamic quantile of the
posterior (as in Bayes UCB). See Data Origami<sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup> for more information.</p>
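<p>As a rough illustration of the "upper bound of the credible region" alternative,
the sketch below picks the bandit whose Beta distribution has the largest 95th
percentile (the upper end of a central 90% interval). The function name
<code class="highlighter-rouge">upper_bound_choice</code>, the <code class="highlighter-rouge">Beta(2, 2)</code> starting prior, and the Monte Carlo
estimate of the quantile are assumptions of this sketch, chosen to keep it
NumPy-only.</p>

```python
import numpy as np


def upper_bound_choice(num_rewards, num_trials, quantile=0.95,
                       num_draws=5000, seed=0):
    """Pick the arm with the largest upper credible bound.

    Assumes Bernoulli rewards and a Beta(2, 2) starting prior per arm.
    The quantile is estimated from random draws rather than computed exactly.
    """
    rng = np.random.default_rng(seed)
    num_rewards = np.asarray(num_rewards, dtype=float)
    num_trials = np.asarray(num_trials, dtype=float)
    a = 2 + num_rewards                   # prior pseudo-wins plus observed wins
    b = 2 + num_trials - num_rewards      # prior pseudo-losses plus observed losses
    draws = rng.beta(a, b, size=(num_draws, len(a)))
    upper = np.quantile(draws, quantile, axis=0)
    return int(np.argmax(upper)), upper


# Arm 0 has barely been played, so its upper bound is still high.
choice, upper = upper_bound_choice(num_rewards=[1, 40, 5],
                                   num_trials=[2, 100, 50])
```

<p>Unlike sampling, this rule is deterministic given the current distributions: it
keeps favoring arms we are still uncertain about until their bounds tighten.</p>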
<h3 id="stochastic-aka-stationary-bandits">Stochastic (a.k.a. stationary) bandits</h3>
<p>Let’s take this algorithm for a spin! Assume we have rewards which are Bernoulli
distributed (this would be the situation we face when e.g. modelling
click-through rates). The conjugate prior for the Bernoulli distribution is the
Beta distribution (this is a special case of the Beta-Binomial model).</p>
<script src="https://gist.github.com/eigenfoo/3d8d318f5bd8fdea24f7b12936de77b5.js"></script>
<p>Here, <code class="highlighter-rouge">pull</code> returns the result of pulling on the <code class="highlighter-rouge">arm</code>‘th bandit, and
<code class="highlighter-rouge">make_bandits</code> is just a factory function for <code class="highlighter-rouge">pull</code>.</p>
<p>The <code class="highlighter-rouge">bayesian_strategy</code> function actually implements the algorithm. We only need
to keep track of the number of times we win and the number of times we played
(<code class="highlighter-rouge">num_rewards</code> and <code class="highlighter-rouge">num_trials</code>, respectively). It samples from all current
<code class="highlighter-rouge">np.random.beta</code> priors (where the original prior was a <script type="math/tex">\text{Beta}(2,
2)</script>, which is symmetric about 0.5 and explains the odd-looking <code class="highlighter-rouge">a=2+</code> and
<code class="highlighter-rouge">b=2+</code> there), picks the <code class="highlighter-rouge">np.argmax</code>, <code class="highlighter-rouge">pull</code>s that specific bandit, and updates
<code class="highlighter-rouge">num_rewards</code> and <code class="highlighter-rouge">num_trials</code>.</p>
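<p>For readers who cannot load the embedded gist, here is an independent minimal
sketch of the same idea. It reuses the names <code class="highlighter-rouge">make_bandits</code>, <code class="highlighter-rouge">pull</code> and
<code class="highlighter-rouge">bayesian_strategy</code> from the description above, but the bodies, the explicit
<code class="highlighter-rouge">rng</code> argument and the example win probabilities are my own assumptions and may
differ from the gist.</p>

```python
import numpy as np


def make_bandits(probs, rng):
    """Factory for `pull`; `probs` are hidden win probabilities (made up here)."""
    def pull(arm):
        # Bernoulli reward: 1 with probability probs[arm], else 0.
        return int(rng.random() < probs[arm])
    return pull


def bayesian_strategy(pull, num_bandits, rng):
    """Thompson sampling with a Beta(2, 2) starting prior on every bandit."""
    num_rewards = np.zeros(num_bandits)
    num_trials = np.zeros(num_bandits)
    while True:
        # One draw from each bandit's current Beta distribution.
        samples = rng.beta(a=2 + num_rewards, b=2 + num_trials - num_rewards)
        choice = int(np.argmax(samples))
        reward = pull(choice)
        # Conjugate update: just counting wins and plays.
        num_rewards[choice] += reward
        num_trials[choice] += 1
        yield choice, reward, num_rewards.copy(), num_trials.copy()


rng = np.random.default_rng(0)
pull = make_bandits([0.2, 0.5, 0.8], rng)
strategy = bayesian_strategy(pull, 3, rng)
for _ in range(1000):
    choice, reward, num_rewards, num_trials = next(strategy)
```

<p>Writing the strategy as a generator means the caller decides how many rounds to
play, simply by calling <code class="highlighter-rouge">next</code> as often as needed.</p>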
<p>I’ve omitted the data visualization code here, but if you want to see it, check
out the <a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">Jupyter notebook on my
GitHub</a>.</p>
<figure>
<a href="/assets/images/beta-binomial.png"><img style="float: middle" src="/assets/images/beta-binomial.png" /></a>
</figure>
<h3 id="generalizing-to-conjugate-models">Generalizing to conjugate models</h3>
<p>In fact, this algorithm isn’t just limited to Bernoulli-distributed rewards: it
will work for any <a href="https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions">conjugate
model</a>!
Here I implement the Gamma-Poisson model (that is, Poisson distributed rewards,
with a Gamma conjugate prior) to illustrate how extensible this framework is.
(Who cares about Poisson distributed rewards, you ask? Anyone who worries about
returning customers, for one!)</p>
<p>Here’s what we need to change:</p>
<ul>
<li>The rewards distribution on line 5 (in practice, you don’t get to pick this,
so <em>technically</em> there’s nothing to change if you’re doing this in
production!)</li>
<li>The sampling from the prior on lines 17–18</li>
<li>The variables you need to keep track of and update rule on lines 12–13 and
24–25.</li>
</ul>
<p>Without further ado:</p>
<script src="https://gist.github.com/eigenfoo/e9a9933d94524e6dee717276c6b6f732.js"></script>
<figure>
<a href="/assets/images/gamma-poisson.png"><img style="float: middle" src="/assets/images/gamma-poisson.png" /></a>
</figure>
<p>This really demonstrates how lean and mean conjugate models can be, especially
considering how much of a pain MCMC or approximate inference methods would be,
compared to literal <em>counting</em>. Conjugate models aren’t just textbook examples:
they’re <em>(gasp)</em> actually useful!</p>
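<p>To spell out the "literal counting", here is a minimal sketch of the
Gamma-Poisson version (not the gist itself). With a Gamma prior with shape
<code class="highlighter-rouge">prior_shape</code> and rate <code class="highlighter-rouge">prior_rate</code>, observing a Poisson count adds the count to
the shape and one to the rate. The function names, the <code class="highlighter-rouge">Gamma(1, 1)</code> prior and
the example rates are assumptions for illustration.</p>

```python
import numpy as np


def make_poisson_bandits(rates, rng):
    """Factory for `pull`; `rates` are hidden Poisson rates (made up here)."""
    def pull(arm):
        return rng.poisson(rates[arm])
    return pull


def gamma_poisson_strategy(pull, num_bandits, rng,
                           prior_shape=1.0, prior_rate=1.0):
    """Thompson sampling for Poisson rewards with a Gamma conjugate prior."""
    total_rewards = np.zeros(num_bandits)  # sum of observed counts per bandit
    num_trials = np.zeros(num_bandits)     # number of pulls per bandit
    while True:
        # NumPy parameterizes the Gamma by scale, i.e. 1 / rate.
        samples = rng.gamma(shape=prior_shape + total_rewards,
                            scale=1.0 / (prior_rate + num_trials))
        choice = int(np.argmax(samples))
        reward = pull(choice)
        total_rewards[choice] += reward
        num_trials[choice] += 1
        yield choice, reward


rng = np.random.default_rng(0)
pull = make_poisson_bandits([1.0, 3.0, 6.0], rng)
strategy = gamma_poisson_strategy(pull, 3, rng)
choices = [next(strategy)[0] for _ in range(1000)]
counts = np.bincount(choices, minlength=3)
```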
<h3 id="generalizing-to-arbitrary-rewards-distributions">Generalizing to arbitrary rewards distributions</h3>
<p>OK, so if we have a conjugate model, we can use Thompson sampling to solve the
multi-armed bandit problem. But what if our rewards distribution doesn’t have a
conjugate prior, or what if we don’t even <em>know</em> our rewards distribution?</p>
<p>In general this problem is very difficult to solve. Theoretically, we could
place some fairly uninformative prior on our rewards, and after every pull we
could run MCMC to get our posterior, but that doesn’t scale, especially for the
online algorithms that we have in mind. Luckily a recent paper by Agrawal and
Goyal<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> gives us some help, <em>if we assume rewards are bounded on the interval
<script type="math/tex">[0, 1]</script></em> (of course, if we have bounded rewards, then we can just normalize
them by their maximum value to get rewards between 0 and 1).</p>
<p>This solution bootstraps the first Beta-Bernoulli model to this new situation.
Here’s what happens:</p>
<ol>
<li>Sample a random variable <script type="math/tex">X_b</script> from the (Beta) prior of bandit <script type="math/tex">b</script>, for
all <script type="math/tex">b</script>.</li>
<li>Select the bandit with largest sample, i.e. select bandit <script type="math/tex">B =
\text{argmax}(X_b)</script>.</li>
<li>Observe the reward <script type="math/tex">R</script> from bandit <script type="math/tex">B</script>.</li>
<li><strong>Observe the outcome <script type="math/tex">r</script> from a Bernoulli trial with probability of success <script type="math/tex">R</script>.</strong></li>
<li>Update posterior of <script type="math/tex">B</script> with this observation <script type="math/tex">r</script>.</li>
<li>Repeat!</li>
</ol>
<p>Here I do this for the logit-normal distribution (i.e. a random variable whose
logit is normally distributed). Note that <code class="highlighter-rouge">expit</code> (from <code class="highlighter-rouge">scipy.special</code>, not NumPy) is the inverse of the logit
function.</p>
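<p>The extra Bernoulli trial is the only new ingredient, so it is worth seeing in
code. This is a minimal sketch rather than the gist: <code class="highlighter-rouge">expit</code> is written by hand
to keep the example NumPy-only, and the function names, the logit-normal means
and the <code class="highlighter-rouge">Beta(2, 2)</code> starting prior are assumptions.</p>

```python
import numpy as np


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-x))


def make_logit_normal_bandits(means, rng, scale=1.0):
    """Factory for `pull`; rewards are logit-normal, so they lie in (0, 1)."""
    def pull(arm):
        return expit(rng.normal(means[arm], scale))
    return pull


def bounded_strategy(pull, num_bandits, rng):
    """Thompson sampling for rewards in [0, 1] via an auxiliary Bernoulli trial."""
    num_rewards = np.zeros(num_bandits)
    num_trials = np.zeros(num_bandits)
    while True:
        samples = rng.beta(2 + num_rewards, 2 + num_trials - num_rewards)
        choice = int(np.argmax(samples))
        reward = pull(choice)                   # real-valued reward R
        outcome = int(rng.random() < reward)    # Bernoulli trial with success probability R
        # Update the Beta distribution with the binary outcome, not R itself.
        num_rewards[choice] += outcome
        num_trials[choice] += 1
        yield choice, reward


rng = np.random.default_rng(0)
pull = make_logit_normal_bandits([-2.0, 0.0, 2.0], rng)
strategy = bounded_strategy(pull, 3, rng)
choices = [next(strategy)[0] for _ in range(2000)]
counts = np.bincount(choices, minlength=3)
```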
<script src="https://gist.github.com/eigenfoo/7a397fef8aaa028c5119c9f86860d72e.js"></script>
<figure>
<a href="/assets/images/bounded.png"><img style="float: middle" src="/assets/images/bounded.png" /></a>
</figure>
<h2 id="final-remarks">Final remarks</h2>
<p>None of this theory is new: I’m just advertising it :blush:. See Cam
Davidson-Pilon’s great blog post about Bayesian bandits<sup id="fnref:2:2"><a href="#fn:2" class="footnote">2</a></sup> for a much more
in-depth treatment, and of course, read around papers on arXiv if you want to go
deeper!</p>
<p>Also, if you want to see all the code that went into this blog post, check out
<a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">the notebook
here</a>.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I’ve hopped on board the functional programming bandwagon, and couldn’t help but think that to demonstrate this idea, I didn’t need a framework, a library or even a class. Just two functions! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Davidson-Pilon, Cameron. “Multi-Armed Bandits.” DataOrigami, 6 Apr. 2013, <a href="https://dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits">dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits</a> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote">↩<sup>3</sup></a></p>
</li>
<li id="fn:3">
<p><a href="https://arxiv.org/abs/1111.1797">arXiv:1111.1797</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngLet's talk about Bayesianism. It's developed a reputation (not entirely justified, but not entirely unjustified either) for being too mathematically sophisticated or too computationally intensive to work at scale.Cookbook — Bayesian Modelling with PyMC32018-06-19T00:00:00+00:002018-06-24T00:00:00+00:00https://eigenfoo.xyz/bayesian-modelling-cookbook<p>Recently I’ve started using <a href="https://github.com/pymc-devs/pymc3">PyMC3</a> for
Bayesian modelling, and it’s an amazing piece of software! The API only exposes
as much of the heavy machinery of MCMC as you need, by which I mean just the
<code class="highlighter-rouge">pm.sample()</code> method (a.k.a., as <a href="http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/">Thomas
Wiecki</a> puts it, the
<em>Magic Inference Button™</em>). This really frees up your mind to think about your
data and model, which is really the heart and soul of data science!</p>
<p>That being said however, I quickly realized that the water gets very deep very
fast: I explored my data set, specified a hierarchical model that made sense to
me, hit the <em>Magic Inference Button™</em>, and… uh, what now? I blinked at the
angry red warnings the sampler spat out.</p>
<p>So began my long, rewarding and ongoing exploration of Bayesian modelling. This
is a compilation of notes, tips, tricks and recipes that I’ve collected from
everywhere: papers, documentation, peppering my <a href="https://twitter.com/twiecki">more
experienced</a>
<a href="https://twitter.com/aseyboldt">colleagues</a> with questions. It’s still very much
a work in progress, but hopefully somebody else finds it useful!</p>
<p><img style="float: middle" width="600" src="https://cdn.rawgit.com/pymc-devs/pymc3/master/docs/logos/svg/PyMC3_banner.svg" /></p>
<h2 id="for-the-uninitiated">For the Uninitiated</h2>
<ul>
<li>First of all, <em>welcome!</em> It’s a brave new world out there — where statistics
is cool, Bayesian and (if you’re lucky) even easy. Dive in!</li>
</ul>
<h3 id="bayesian-modelling">Bayesian modelling</h3>
<ul>
<li>
<p>If you don’t know any probability, I’d recommend <a href="https://betanalpha.github.io/assets/case_studies/probability_theory.html">Michael
Betancourt’s</a>
crash-course in practical probability theory.</p>
</li>
<li>
<p>For an introduction to general Bayesian methods and modelling, I really liked
<a href="http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/">Cam Davidson Pilon’s <em>Bayesian Methods for
Hackers</em></a>:
it really made the whole “thinking like a Bayesian” thing click for me.</p>
</li>
<li>
<p>If you’re willing to spend some money, I’ve heard that <a href="https://sites.google.com/site/doingbayesiandataanalysis/"><em>Doing Bayesian Data
Analysis</em> by
Kruschke</a> (a.k.a.
<em>“the puppy book”</em>) is for the bucket list.</p>
</li>
<li>
<p>Here we come to a fork in the road. The central problem in Bayesian modelling
is this: given data and a probabilistic model that we think models this data,
how do we find the posterior distribution of the model’s parameters? There are
currently two good solutions to this problem. One is Markov-chain Monte Carlo
sampling (a.k.a. MCMC sampling), and the other is variational inference
(a.k.a. VI). Both methods are mathematical Death Stars: extremely powerful but
incredibly complicated. Nevertheless, I think it’s important to get at least a
hand-wavy understanding of what these methods are. If you’re new to all this,
my personal recommendation is to invest your time in learning MCMC: it’s been
around longer, we know that there are sufficiently robust tools to help you,
and there’s a lot more support/documentation out there.</p>
</li>
</ul>
<h3 id="markov-chain-monte-carlo">Markov-chain Monte Carlo</h3>
<ul>
<li>
<p>For a good high-level introduction to MCMC, I liked <a href="https://www.youtube.com/watch?v=DJ0c7Bm5Djk&feature=youtu.be&t=4h40m9s">Michael Betancourt’s
StanCon 2017
talk</a>:
especially the first few minutes where he provides a motivation for MCMC, that
really put all this math into context for me.</p>
</li>
<li>
<p>For a more in-depth (and mathematical) treatment of MCMC, I’d check out his
<a href="https://arxiv.org/abs/1701.02434">paper on Hamiltonian Monte Carlo</a>.</p>
</li>
</ul>
<h3 id="variational-inference">Variational inference</h3>
<ul>
<li>
<p>VI has been around for a while, but it was only in 2016 (2 years ago, at the
time of writing) that <em>automatic differentiation variational inference</em> was
invented. As such, variational inference is undergoing a renaissance and is
currently an active area of statistical research. Since it’s such a nascent
field, most resources on it are very theoretical and academic in nature.</p>
</li>
<li>
<p>Chapter 10 (on approximate inference) in Bishop’s <em>Pattern Recognition and
Machine Learning</em> and <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">this
tutorial</a>
by David Blei are excellent, if a bit mathematically-intensive, resources.</p>
</li>
<li>
<p>The most hands-on explanation of variational inference I’ve seen is the docs
for <a href="http://pyro.ai/examples/svi_part_i.html">Pyro</a>, a probabilistic
programming language developed by Uber that specializes in variational
inference.</p>
</li>
</ul>
<h2 id="model-formulation">Model Formulation</h2>
<ul>
<li>
<p>Try thinking about <em>how</em> your data would be generated: what kind of machine
has your data as outputs? This will help you both explore your data, as well
as help you arrive at a reasonable model formulation.</p>
</li>
<li>
<p>Try to avoid correlated variables. Some of the more robust samplers (<strong>cough</strong>
NUTS <strong>cough cough</strong>) can cope with <em>a posteriori</em> correlated random
variables, but sampling is much easier for everyone involved if the variables
are uncorrelated. By the way, the bar is pretty low here: if the
jointplot/scattergram of the two variables looks like an ellipse, that's
usually okay. It’s when the ellipse starts looking like a line that you should
be alarmed.</p>
</li>
<li>
<p>Try to avoid discrete latent variables, and discrete parameters in general.
There is no good method to sample them in a smart way (since discrete
parameters have no gradients); and with “naïve” samplers (i.e. those that do
not take advantage of the gradient), the number of samples one needs to make
good inferences generally scales exponentially in the number of parameters.
For an instance of this, see <a href="https://docs.pymc.io/notebooks/marginalized_gaussian_mixture_model.html">this example on marginal Gaussian
mixtures</a>.</p>
</li>
<li>
<p>The <a href="https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations">Stan GitHub
wiki</a> has
some excellent recommendations on how to choose good priors. Once you get a
good handle on the basics of using PyMC3, I <em>100% recommend</em> reading this wiki
from start to end: the Stan community has fantastic resources on Bayesian
statistics, and even though their APIs are quite different, the mathematical
theory all translates over.</p>
</li>
</ul>
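<p>To make the point about discrete latent variables concrete: the usual
workaround is to sum the discrete variable out of the likelihood, so that the
sampler only ever sees continuous parameters. Here is a minimal sketch in plain
Python (no PyMC3), assuming a one-dimensional Gaussian mixture. The function
names and the numbers are made up for illustration, and this is not how the
linked notebook implements it.</p>

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def marginal_mixture_logp(x, weights, mus, sigmas):
    """Log-likelihood of x with the discrete component label summed out.

    log p(x) = log sum_k w_k * Normal(x | mu_k, sigma_k), computed with the
    log-sum-exp trick for numerical stability.
    """
    terms = [math.log(w) + log_normal_pdf(x, m, s)
             for w, m, s in zip(weights, mus, sigmas)]
    biggest = max(terms)
    return biggest + math.log(sum(math.exp(t - biggest) for t in terms))


# Hypothetical two-component mixture: no discrete "which component?" variable
# appears anywhere, only continuous weights, means and standard deviations.
logp = marginal_mixture_logp(0.3, weights=[0.4, 0.6],
                             mus=[-1.0, 2.0], sigmas=[1.0, 0.5])
```

<p>Every argument of <code class="highlighter-rouge">marginal_mixture_logp</code> is continuous, so a
gradient-based sampler has something to work with.</p>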
<h3 id="hierarchical-models">Hierarchical models</h3>
<ul>
<li>
<p>First of all, hierarchical models are amazing! <a href="https://docs.pymc.io/notebooks/GLM-hierarchical.html">The PyMC3
docs</a> opine on this at
length, so let’s not waste any digital ink.</p>
</li>
<li>
<p>The poster child of a Bayesian hierarchical model looks something like this
(equations taken from
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Wikipedia</a>):</p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/765f37f86fa26bef873048952dccc6e8067b78f4" /></p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/ca8c0e1233fd69fa4325c6eacf8462252ed6b00a" /></p>
<p><img style="float: center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/1e56b3077b1b3ec867d6a0f2539ba9a3e79b45c1" /></p>
<p>This hierarchy has 3 levels (some would say it has 2 levels, since there are
only 2 levels of parameters to infer, but honestly whatever: by my count there
are 3). 3 levels is fine, but add any more levels, and it becomes harder to
sample. Try out a taller hierarchy to see if it works, but err on the side
of 3-level hierarchies.</p>
</li>
<li>
<p>If your hierarchy is too tall, you can truncate it by introducing a
deterministic function of your parameters somewhere (this usually turns out to
just be a sum). For example, instead of modelling your observations as drawn
from a 4-level hierarchy, maybe your observations can be modeled as the sum of
three parameters, where these parameters are drawn from a 3-level hierarchy.</p>
</li>
<li>
<p>More in-depth treatment here in <a href="https://arxiv.org/abs/1312.0906">(Betancourt and Girolami,
2013)</a>. <strong>tl;dr:</strong> hierarchical models all
but <em>require</em> you to use Hamiltonian Monte Carlo; also included are some
practical tips and goodies on how to do that stuff in the real world.</p>
</li>
</ul>
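<p>If the three levels feel abstract, it can help to write the hierarchy down as
a forward simulation: hyperparameters first, then one parameter per group, then
the observations. The sketch below uses only the standard library. The
distributions, scales and the name <code class="highlighter-rouge">simulate_hierarchy</code> are arbitrary
choices of mine, not anything prescribed by PyMC3.</p>

```python
import random


def simulate_hierarchy(num_groups, obs_per_group, seed=0):
    """Forward-simulate a toy 3-level hierarchy (all scales are arbitrary)."""
    rng = random.Random(seed)
    # Top level: hyperparameters shared by every group.
    hyper_mu = rng.gauss(0.0, 1.0)
    hyper_sd = abs(rng.gauss(0.0, 1.0)) + 0.1  # half-normal, nudged off zero
    data = []
    for _ in range(num_groups):
        # Middle level: one parameter per group, drawn around the hypermean.
        theta = rng.gauss(hyper_mu, hyper_sd)
        # Bottom level: the observations for this group.
        data.append([rng.gauss(theta, 1.0) for _ in range(obs_per_group)])
    return hyper_mu, hyper_sd, data


hyper_mu, hyper_sd, groups = simulate_hierarchy(num_groups=4, obs_per_group=10)
```

<p>Adding a fourth level would mean drawing <code class="highlighter-rouge">hyper_mu</code> and <code class="highlighter-rouge">hyper_sd</code>
from yet another pair of parameters, which is exactly the kind of tall
hierarchy that tends to sample badly.</p>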
<h2 id="model-implementation">Model Implementation</h2>
<ul>
<li>
<p>At the risk of overgeneralizing, there are only two things that can go wrong
in Bayesian modelling: either your data is wrong, or your model is wrong. And
it is a hell of a lot easier to debug your data than it is to debug your
model. So before you even try implementing your model, plot histograms of your
data, count the number of data points, drop any NaNs, etc. etc.</p>
</li>
<li>
<p>PyMC3 has one quirky piece of syntax, which I tripped up on for a while. It’s
described quite well in <a href="http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/#comment-2213376737">this comment on Thomas Wiecki’s
blog</a>.
Basically, suppose you have several groups, and want to initialize several
variables per group, but you want to initialize different numbers of variables
for each group. Then you need to use the quirky <code class="highlighter-rouge">variables[index]</code>
notation. I suggest using <code class="highlighter-rouge">scikit-learn</code>’s <code class="highlighter-rouge">LabelEncoder</code> to easily create the
index. For example, to make normally distributed heights for the iris dataset:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">LabelEncoder</span>

<span class="c"># Different numbers of examples for each species</span>
<span class="n">species</span> <span class="o">=</span> <span class="p">(</span><span class="mi">48</span> <span class="o">*</span> <span class="p">[</span><span class="s">'setosa'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">52</span> <span class="o">*</span> <span class="p">[</span><span class="s">'virginica'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">63</span> <span class="o">*</span> <span class="p">[</span><span class="s">'versicolor'</span><span class="p">])</span>
<span class="n">num_species</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">species</span><span class="p">))</span> <span class="c"># 3</span>
<span class="c"># Integer code (0, 1 or 2) for the species of each example</span>
<span class="n">idx</span> <span class="o">=</span> <span class="n">LabelEncoder</span><span class="p">()</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">species</span><span class="p">)</span>
<span class="k">with</span> <span class="n">pm</span><span class="o">.</span><span class="n">Model</span><span class="p">():</span>
    <span class="c"># One variable per group</span>
    <span class="n">heights_per_species</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">'heights_per_species'</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="n">num_species</span><span class="p">)</span>
    <span class="c"># One entry per example, looked up from its group's variable</span>
    <span class="n">heights</span> <span class="o">=</span> <span class="n">heights_per_species</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
</code></pre></div> </div>
</li>
<li>
<p>You might find yourself in a situation in which you want to use a centered
parameterization for a portion of your data set, but a noncentered
parameterization for the rest of your data set (see below for what these
parameterizations are). There’s a useful idiom for you here:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_xs</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">use_centered</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> <span class="c"># len(use_centered) = num_xs</span>
<span class="n">x_sd</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">HalfCauchy</span><span class="p">(</span><span class="s">'x_sd'</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x_raw</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Normal</span><span class="p">(</span><span class="s">'x_raw'</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sd</span><span class="o">=</span><span class="n">x_sd</span><span class="o">**</span><span class="n">use_centered</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="n">num_xs</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">Deterministic</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="n">x_sd</span><span class="o">**</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">use_centered</span><span class="p">)</span> <span class="o">*</span> <span class="n">x_raw</span><span class="p">)</span>
</code></pre></div> </div>
<p>You could even experiment with allowing <code class="highlighter-rouge">use_centered</code> to be <em>between</em> 0 and
1, instead of being <em>either</em> 0 or 1!</p>
</li>
<li>
<p>I prefer to use the <code class="highlighter-rouge">pm.Deterministic</code> function instead of simply using normal
arithmetic operations (e.g. I’d prefer to write <code class="highlighter-rouge">x = pm.Deterministic('x', y +
z)</code> instead of <code class="highlighter-rouge">x = y + z</code>). This means that you can index the <code class="highlighter-rouge">trace</code> object
later on with just <code class="highlighter-rouge">trace['x']</code>, instead of having to compute it yourself with
<code class="highlighter-rouge">trace['y'] + trace['z']</code>.</p>
</li>
</ul>
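<p>The exponent trick in the centered/noncentered idiom above is easy to
misread, so here is the arithmetic on its own, in plain Python with a made-up
value for <code class="highlighter-rouge">x_sd</code>. The helper name <code class="highlighter-rouge">mixed_parameterization</code> is
hypothetical. Where <code class="highlighter-rouge">use_centered</code> is 1, the raw variable gets standard
deviation <code class="highlighter-rouge">x_sd</code> and is passed through unchanged; where it is 0, the raw
variable is standard normal and gets multiplied by <code class="highlighter-rouge">x_sd</code> afterwards.</p>

```python
def mixed_parameterization(x_sd, use_centered):
    """Return the sd given to x_raw and the factor later applied to it."""
    prior_sds = [x_sd ** c for c in use_centered]          # sd of x_raw
    multipliers = [x_sd ** (1 - c) for c in use_centered]  # scales x_raw into x
    return prior_sds, multipliers


# x_sd = 2.5 is an arbitrary stand-in for one sampled value of the scale.
prior_sds, multipliers = mixed_parameterization(
    x_sd=2.5, use_centered=[0, 1, 1, 0, 1])
# prior_sds   -> [1.0, 2.5, 2.5, 1.0, 2.5]
# multipliers -> [2.5, 1.0, 1.0, 2.5, 1.0]
```

<p>Either way the product of the two is <code class="highlighter-rouge">x_sd</code>, so every entry of
<code class="highlighter-rouge">x</code> ends up with the same scale. Only the route the sampler takes to get
there differs.</p>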
<h2 id="mcmc-initialization-and-sampling">MCMC Initialization and Sampling</h2>
<ul>
<li>
<p>Have faith in PyMC3’s default initialization and sampling settings: someone
much more experienced than us took the time to choose them! NUTS is the most
efficient MCMC sampler known to man, and <code class="highlighter-rouge">jitter+adapt_diag</code>… well, you get
the point.</p>
</li>
<li>
<p>However, if you’re truly grasping at straws, the more powerful initialization
setting would be <code class="highlighter-rouge">advi</code> or <code class="highlighter-rouge">advi+adapt_diag</code>, which uses variational
inference to initialize the sampler. An even better option would be to use
<code class="highlighter-rouge">advi+adapt_diag_grad</code>, which is (at the time of writing) an experimental
feature in beta.</p>
</li>
<li>
<p>Never initialize the sampler with the MAP estimate! In low dimensional
problems the MAP estimate (a.k.a. the mode of the posterior) is often quite a
reasonable point. But in high dimensions, the MAP becomes very strange. Check
out <a href="http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
on high-dimensional Gaussians to see why. Besides, at the MAP all the derivatives
of the posterior are zero, and that isn’t great for derivative-based samplers.</p>
</li>
</ul>
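<p>You can see the “strange MAP” effect for yourself without any MCMC at all.
The sketch below (standard library only, with an arbitrary dimension and
sample count) draws from a 1000-dimensional standard Gaussian, whose mode is
the origin, and measures how far each draw lands from that mode.</p>

```python
import math
import random


def gaussian_norms(dim, num_samples, seed=0):
    """Distances from the origin of draws from a dim-dimensional N(0, I)."""
    rng = random.Random(seed)
    return [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
            for _ in range(num_samples)]


norms = gaussian_norms(dim=1000, num_samples=200)
mean_norm = sum(norms) / len(norms)  # concentrates near sqrt(1000), about 31.6
closest = min(norms)                 # even the closest draw is far from 0
```

<p>None of the draws come anywhere near the mode: they all sit on a thin shell
of radius roughly <code class="highlighter-rouge">sqrt(dim)</code>. Starting the sampler at the MAP therefore
starts it somewhere the posterior mass never actually visits.</p>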
<h2 id="mcmc-trace-diagnostics">MCMC Trace Diagnostics</h2>
<ul>
<li>You’ve hit the <em>Magic Inference Button™</em>, and you have a <code class="highlighter-rouge">trace</code> object. Now
what? First of all, make sure that your sampler didn’t barf itself, and that
your chains are safe for consumption (i.e., analysis).</li>
</ul>
<ol>
<li>
<p>Run the chain for as long as you have the patience or resources for. Make
sure that the <code class="highlighter-rouge">tune</code> parameter increases commensurately with the <code class="highlighter-rouge">draws</code>
parameter.</p>
</li>
<li>
<p>Check for divergences. PyMC3’s sampler will spit out a warning if there are
diverging chains, but the following code snippet may make things easier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Display the total number and percentage of divergent chains</span>
<span class="n">diverging</span> <span class="o">=</span> <span class="n">trace</span><span class="p">[</span><span class="s">'diverging'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Number of Divergent Chains: {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">diverging</span><span class="o">.</span><span class="n">nonzero</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span><span class="p">))</span>
<span class="n">diverging_perc</span> <span class="o">=</span> <span class="n">diverging</span><span class="o">.</span><span class="n">nonzero</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">size</span> <span class="o">/</span> <span class="n">diverging</span><span class="o">.</span><span class="n">size</span> <span class="o">*</span> <span class="mi">100</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Percentage of Divergent Chains: {:.1f}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">diverging_perc</span><span class="p">))</span>
</code></pre></div> </div>
</li>
<li>
<p>Check the traceplot (<code class="highlighter-rouge">pm.traceplot(trace)</code>). You’re looking for traceplots
that look like “fuzzy caterpillars”. If the trace moves into some region and
stays there for a long time (a.k.a. there are some “sticky regions”), that’s
cause for concern! That indicates that once the sampler moves into some
region of parameter space, it gets stuck there (probably due to high
curvature or other bad topological properties).</p>
</li>
<li>
<p>In addition to the traceplot, there are <a href="https://docs.pymc.io/api/plots.html">a ton of other
plots</a> you can make with your trace:</p>
<ul>
<li><code class="highlighter-rouge">pm.plot_posterior(trace)</code>: check if your posteriors look reasonable.</li>
<li><code class="highlighter-rouge">pm.forestplot(trace)</code>: check if your variables have reasonable credible
intervals, and Gelman–Rubin scores close to 1.</li>
<li><code class="highlighter-rouge">pm.autocorrplot(trace)</code>: check if your chains are impaired by high
autocorrelation. Also remember that thinning your chains is a waste of
time at best, and deluding yourself at worst. See Chris Fonnesbeck’s
comment on <a href="https://github.com/pymc-devs/pymc/issues/23">this GitHub
issue</a> and <a href="https://twitter.com/junpenglao/status/1009748562136256512">Junpeng Lao’s
reply to Michael Betancourt’s
tweet</a>.</li>
<li><code class="highlighter-rouge">pm.energyplot(trace)</code>: ideally the energy and marginal energy
distributions should look very similar. Long tails in the distribution of
energy levels indicate deteriorated sampler efficiency.</li>
<li><code class="highlighter-rouge">pm.densityplot(trace)</code>: a souped-up version of <code class="highlighter-rouge">pm.plot_posterior</code>. It
doesn’t seem to be wildly useful unless you’re plotting posteriors from
multiple models.</li>
</ul>
</li>
<li>PyMC3 has a nice helper function to pretty-print a summary table of the
trace: <code class="highlighter-rouge">pm.summary(trace)</code> (I usually tack on a <code class="highlighter-rouge">.round(2)</code> for my sanity).
Look out for:
<ul>
<li>the <script type="math/tex">\hat{R}</script> values (a.k.a. the Gelman–Rubin statistic, a.k.a. the
potential scale reduction factor, a.k.a. the PSRF): are they all close to
1? If not, something is <em>horribly</em> wrong. Consider respecifying or
reparameterizing your model. You can also inspect these in the forest plot.</li>
<li>the sign and magnitude of the inferred values: do they make sense, or are
they unexpected and unreasonable? This could indicate a poorly specified
model. (E.g. parameters of the unexpected sign that have low uncertainties
might indicate that your model needs interaction terms.)</li>
</ul>
</li>
<li>
<p>As a drastic debugging measure, try to <code class="highlighter-rouge">pm.sample</code> with <code class="highlighter-rouge">draws=1</code>,
<code class="highlighter-rouge">tune=500</code>, and <code class="highlighter-rouge">discard_tuned_samples=False</code>, and inspect the traceplot.
During the tuning phase, we don’t expect to see friendly fuzzy caterpillars,
but we <em>do</em> expect to see good (if noisy) exploration of parameter space. So
if the sampler is getting stuck during the tuning phase, that might explain
why the trace looks horrible.</p>
</li>
<li>
<p>If you get scary errors that describe mathematical problems (e.g. <code class="highlighter-rouge">ValueError:
Mass matrix contains zeros on the diagonal. Some derivatives might always be
zero.</code>), then you’re <del>shit out of luck</del> exceptionally unlucky: those kinds of
errors are notoriously hard to debug. I can only point to the <a href="http://andrewgelman.com/2008/05/13/the_folk_theore/">Folk Theorem of
Statistical Computing</a>:</p>
<blockquote>
<p>If you’re having computational problems, probably your model is wrong.</p>
</blockquote>
</li>
</ol>
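<p>Since the Gelman–Rubin statistic comes up so often in this checklist, it may
help to see what it measures. The following is a minimal sketch of the classic
formula (between-chain versus within-chain variance) using only the standard
library. It assumes equal-length chains of a single scalar parameter, and
PyMC3’s own implementation may differ in its details.</p>

```python
import math
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])                            # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)    # W: average in-chain variance
    between = n * variance(chain_means)           # B: variance between chains
    pooled = (n - 1) / n * within + between / n   # pooled variance estimate
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Four well-behaved chains that all target the same distribution...
good_chains = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# ...and the same chains with one of them stuck somewhere else entirely.
bad_chains = good_chains[:3] + [[x + 3.0 for x in good_chains[3]]]

rhat_good = gelman_rubin(good_chains)  # close to 1
rhat_bad = gelman_rubin(bad_chains)    # well above 1
```

<p>When the chains agree, the between-chain term is negligible and the ratio
sits near 1. One chain wandering off is enough to push it far above that.</p>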
<h3 id="fixing-divergences">Fixing divergences</h3>
<blockquote>
<p><code class="highlighter-rouge">There were N divergences after tuning. Increase 'target_accept' or reparameterize.</code></p>
<p>— The <em>Magic Inference Button™</em></p>
</blockquote>
<ul>
<li>
<p>Divergences in HMC occur when the sampler finds itself in regions of extremely
high curvature (such as the opening of a hierarchical funnel). Broadly
speaking, the sampler is prone to malfunction in such regions, causing the
sampler to fly off towards infinity. This ruins the chains by heavily
biasing the samples.</p>
</li>
<li>
<p>Remember: if you have even <em>one</em> diverging chain, you should be worried.</p>
</li>
<li>
<p>Increase <code class="highlighter-rouge">target_accept</code>: usually 0.9 is a good number (currently the default
in PyMC3 is 0.8). This will help get rid of false positives from the test for
divergences. However, divergences that <em>don’t</em> go away are cause for alarm.</p>
</li>
<li>
<p>Increasing <code class="highlighter-rouge">tune</code> can sometimes help as well: this gives the sampler more time
to 1) find the typical set and 2) find good values for step sizes, scaling
factors, etc. If you’re running into divergences, it’s always possible that
the sampler just hasn’t started the mixing phase and is still trying to find
the typical set.</p>
</li>
<li>
<p>Consider a <em>noncentered</em> parameterization. This is an amazing trick: it all boils down
to the familiar equation <script type="math/tex">X = \sigma Z + \mu</script> from STAT 101, but it honestly
works wonders. See <a href="http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/">Thomas Wiecki’s blog
post</a>
on it, and <a href="https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html">this page from the PyMC3
documentation</a>.</p>
</li>
<li>
<p>If that doesn’t work, there may be something wrong with the way you’re
thinking about your data: consider reparameterizing your model, or
respecifying it entirely.</p>
</li>
<li>
<p>Now, here’s a little secret: a small number of divergences is acceptable, and
even to be expected, but <em>only on particularly long and treacherous traces</em>.
That’s not to say that you can ignore all your divergences if you’re taking
<code class="highlighter-rouge">draws=10000</code>! Put it this way: if you have a single-digit number of
divergences, you should be worried, and checking your posteriors, traces, correlation
matrix, <script type="math/tex">\hat{R}</script>s etc., and if you have a double-digit number of
divergences you should be alarmed… and doing exactly the same thing.</p>
</li>
</ul>
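<p>For the skeptics, here is the STAT 101 identity behind the noncentered
parameterization, checked numerically with the standard library. The values of
<code class="highlighter-rouge">mu</code> and <code class="highlighter-rouge">sigma</code> and the function names are arbitrary choices for this
sketch. In a real model, <code class="highlighter-rouge">mu</code> and <code class="highlighter-rouge">sigma</code> would themselves be random
variables.</p>

```python
import random
from statistics import mean, stdev


def sample_centered(mu, sigma, n, rng):
    """Centered: draw X ~ Normal(mu, sigma) directly."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def sample_noncentered(mu, sigma, n, rng):
    """Noncentered: draw Z ~ Normal(0, 1), then set X = sigma * Z + mu."""
    return [sigma * rng.gauss(0.0, 1.0) + mu for _ in range(n)]


rng = random.Random(42)
centered = sample_centered(1.5, 2.0, 20000, rng)
noncentered = sample_noncentered(1.5, 2.0, 20000, rng)
# Both sets of draws have mean close to 1.5 and standard deviation close to 2.0.
```

<p>The two versions describe the same distribution for <code class="highlighter-rouge">X</code>. What changes is
what the sampler has to explore: in the noncentered version it samples
<code class="highlighter-rouge">Z</code>, whose prior scale no longer depends on <code class="highlighter-rouge">sigma</code>.</p>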
<h3 id="other-common-warnings">Other common warnings</h3>
<ul>
<li>
<p>It’s worth noting that far and away the worst warning to get is the one about
divergences. While a divergent chain indicates that your inference may be
flat-out <em>invalid</em>, the rest of these warnings indicate that your inference is
merely (lol, “merely”) <em>inefficient</em>.</p>
</li>
<li><code class="highlighter-rouge">The number of effective samples is smaller than XYZ for some parameters.</code>
<ul>
<li>Quoting <a href="https://discourse.pymc.io/t/the-number-of-effective-samples-is-smaller-than-25-for-some-parameters/1050/3">Junpeng Lao on
discourse.pymc.io</a>:
“A low number of effective samples is usually an indication of strong
autocorrelation in the chain.”</li>
<li>Make sure you’re using an efficient sampler like NUTS. (And not, for
instance, Metropolis–Hastings. (I mean seriously, it’s the 21st century, why
would you ever want Metropolis–Hastings?))</li>
<li>Tweak the acceptance probability (<code class="highlighter-rouge">target_accept</code>) — it should be large
enough to ensure good exploration, but small enough to not reject all
proposals and get stuck.</li>
</ul>
</li>
<li><code class="highlighter-rouge">The gelman-rubin statistic is larger than XYZ for some parameters. This
indicates slight problems during sampling.</code>
<ul>
<li>When PyMC3 samples, it runs several chains in parallel. Loosely speaking,
the Gelman–Rubin statistic measures how similar these chains are. Ideally it
should be close to 1.</li>
<li>Increasing the <code class="highlighter-rouge">tune</code> parameter may help, for the same reasons as described
in the <em>Fixing Divergences</em> section.</li>
</ul>
</li>
<li><code class="highlighter-rouge">The chain reached the maximum tree depth. Increase max_treedepth, increase
target_accept or reparameterize.</code>
<ul>
<li>NUTS puts a cap on the depth of the trees that it evaluates during each
iteration, which is controlled through the <code class="highlighter-rouge">max_treedepth</code> parameter. Reaching the maximum
allowable tree depth indicates that NUTS is prematurely pulling the plug to
avoid excessive compute time.</li>
<li>Yeah, what the <em>Magic Inference Button™</em> says: try increasing
<code class="highlighter-rouge">max_treedepth</code> or <code class="highlighter-rouge">target_accept</code>.</li>
</ul>
</li>
</ul>
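<p>To get a feel for why autocorrelation eats into the number of effective
samples, here is a rough sketch of a single-chain estimator: divide the number
of draws by one plus twice the sum of the positive autocorrelations. This is a
simplified textbook version written with the standard library, not the
estimator PyMC3 uses, and the truncation rule and <code class="highlighter-rouge">max_lag</code> are my own
assumptions.</p>

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of a chain at the given lag."""
    n = len(chain)
    mu = sum(chain) / n
    var = sum((x - mu) ** 2 for x in chain) / n
    cov = sum((chain[i] - mu) * (chain[i + lag] - mu)
              for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(chain, max_lag=200):
    """n / (1 + 2 * sum of autocorrelations), cut at the first non-positive lag."""
    rho_sum = 0.0
    for lag in range(1, min(max_lag, len(chain) - 1) + 1):
        rho = autocorrelation(chain, lag)
        if rho <= 0.0:
            break
        rho_sum += rho
    return len(chain) / (1.0 + 2.0 * rho_sum)


rng = random.Random(1)
# Independent draws: what an ideal sampler would give you.
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# A "sticky" AR(1) chain where each draw is 90% the previous one.
sticky_chain = [0.0]
for _ in range(1999):
    sticky_chain.append(0.9 * sticky_chain[-1] + (1 - 0.9 ** 2) ** 0.5 * rng.gauss(0.0, 1.0))

ess_iid = effective_sample_size(iid_chain)
ess_sticky = effective_sample_size(sticky_chain)
```

<p>Both chains have 2000 draws, but the sticky one is worth only a small
fraction of that, which is what the warning is trying to tell you.</p>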
<h3 id="model-reparameterization">Model reparameterization</h3>
<ul>
<li>
<p>Countless warnings have told you to engage in this strange activity of
“reparameterization”. What even is that? Luckily, the <a href="https://github.com/stan-dev/stan/releases/download/v2.17.1/stan-reference-2.17.1.pdf">Stan User
Manual</a>
(specifically the <em>Reparameterization and Change of Variables</em> section) has
an excellent explanation of reparameterization, and even some practical tips
to help you do it (although your mileage may vary on how useful those tips
will be to you).</p>
</li>
<li>
<p>Aside from meekly pointing to other resources, there’s not much I can do to
help: this stuff really comes from a combination of intuition, statistical
knowledge and good ol’ experience. I can, however, cite some examples to give
you a better idea.</p>
<ul>
<li>The noncentered parameterization is a classic example. If you have a
parameter whose mean and variance you are also modelling, the noncentered
parameterization decouples the sampling of mean and variance from the
sampling of the parameter, so that they are now independent. In this way, we
avoid “funnels”.</li>
<li>The <a href="http://proceedings.mlr.press/v5/carvalho09a.html"><em>horseshoe
distribution</em></a> is known to
be a good shrinkage prior, as it is <em>very</em> spiky near zero, and has <em>very</em>
long tails. However, modelling it using one parameter can give multimodal
posteriors — an exceptionally bad result. The trick is to reparameterize and
model it as the product of two parameters: one to create spikiness at zero,
and one to create long tails (which makes sense: to sample from the
horseshoe, take the product of samples from a normal and a half-Cauchy).</li>
</ul>
</li>
</ul>
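<p>The “product of two parameters” view of the horseshoe is simple enough to
simulate directly. The sketch below uses only the standard library and assumes
a unit global scale. The function names are mine, and the half-Cauchy draw
uses the inverse-CDF formula.</p>

```python
import math
import random


def sample_half_cauchy(rng, scale=1.0):
    """Half-Cauchy draw via the inverse CDF of the Cauchy, folded at zero."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))


def sample_horseshoe(n, seed=0, global_scale=1.0):
    """Horseshoe draws as (standard normal) * (half-Cauchy local scale)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * sample_half_cauchy(rng, global_scale)
            for _ in range(n)]


draws = sample_horseshoe(10000)
# Fraction of draws hugging zero (the spike)...
near_zero = sum(abs(d) < 0.1 for d in draws) / len(draws)
# ...and fraction way out in the tails.
far_out = sum(abs(d) > 10 for d in draws) / len(draws)
```

<p>Compared with a standard normal, far more of the mass sits right next to
zero <em>and</em> far more of it sits beyond 10, which is exactly the
spike-and-long-tails behaviour you want from a shrinkage prior.</p>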
<h2 id="model-diagnostics">Model Diagnostics</h2>
<ul>
<li>Admittedly the distinction between the previous section and this one is
somewhat artificial (since problems with your chains indicate problems with
your model), but I still think it’s useful to make this distinction because
these checks indicate that you’re thinking about your data in the wrong way,
(i.e. you made a poor modelling decision), and <em>not</em> that the sampler is having
a hard time doing its job.</li>
</ul>
<ol>
<li>
<p>Run the following snippet of code to inspect the pairplot of your variables
two at a time (if you have a plate of variables, it’s fine to pick a couple
at random). It’ll tell you if the two random variables are correlated, and
help identify any troublesome neighborhoods in the parameter space (divergent
samples will be colored differently, and will cluster near such
neighborhoods).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pm</span><span class="o">.</span><span class="n">pairplot</span><span class="p">(</span><span class="n">trace</span><span class="p">,</span>
<span class="n">sub_varnames</span><span class="o">=</span><span class="p">[</span><span class="n">variable_1</span><span class="p">,</span> <span class="n">variable_2</span><span class="p">],</span>
<span class="n">divergences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s">'C3'</span><span class="p">,</span>
<span class="n">kwargs_divergence</span><span class="o">=</span><span class="p">{</span><span class="s">'color'</span><span class="p">:</span> <span class="s">'C2'</span><span class="p">})</span>
</code></pre></div> </div>
</li>
<li>
<p>Look at your posteriors (either from the traceplot, density plots or
posterior plots). Do they even make sense? E.g. are there outliers or long
tails that you weren’t expecting? Do their uncertainties look reasonable to
you? If you had <a href="https://en.wikipedia.org/wiki/Plate_notation">a plate</a> of
variables, are their posteriors different? Did you expect them to be that
way? If not, what about the data made the posteriors different? You’re the
only one who knows your problem/use case, so the posteriors better look good
to you!</p>
</li>
<li>Broadly speaking, there are four kinds of bad geometries that your posterior
can suffer from:
<ul>
<li>highly correlated posteriors: this will probably cause divergences or
traces that don’t look like “fuzzy caterpillars”. Either look at the
jointplots of each pair of variables, or look at the correlation matrix of
all variables. Try using a centered parameterization, or reparameterize in
some other way, to remove these correlations.</li>
<li>posteriors that form “funnels”: this will probably cause divergences. Try
using a noncentered parameterization.</li>
<li>long tailed posteriors: this will probably raise warnings about
<code class="highlighter-rouge">max_treedepth</code> being exceeded. If your data has long tails, you should
model that with a long-tailed distribution. If your data doesn’t have long
tails, then your model is ill-specified: perhaps a more informative prior
would help.</li>
<li>multimodal posteriors: right now this is pretty much a death blow. At the
time of writing, all samplers have a hard time with multimodality, and
there’s not much you can do about that. Try reparameterizing to get a
unimodal posterior. If that’s not possible (perhaps you’re <em>modelling</em>
multimodality using a mixture model), you’re out of luck: just let NUTS
sample for a day or so, and hopefully you’ll get a good trace.</li>
</ul>
</li>
<li>
<p>Pick a small subset of your raw data, and see what exactly your model does
with that data (i.e. run the model on a specific subset of your data). I find
that a lot of problems with your model can be found this way.</p>
</li>
<li>Run <a href="https://docs.pymc.io/notebooks/posterior_predictive.html"><em>posterior predictive
checks</em></a> (a.k.a.
PPCs): sample from your posterior, plug it back in to your model, and
“generate new data sets”. PyMC3 even has a nice function to do all this for
you: <code class="highlighter-rouge">pm.sample_ppc</code>. But what do you do with these new data sets? That’s a
question only you can answer! The point of a PPC is to see if the generated
data sets reproduce patterns you care about in the observed real data set,
and only you know what patterns you care about. E.g. how close are the PPC
means to the observed sample mean? What about the variance?
<ul>
<li>For example, suppose you were modelling the levels of radon gas in
different counties in a country (through a hierarchical model). Then you
could sample radon gas levels from the posterior for each county, and take
the maximum within each county. You’d then have a distribution of maximum
radon gas levels across counties. You could then check if the <em>actual</em>
maximum radon gas level (in your observed data set) is acceptably within
that distribution. If it’s much larger than the maxima, then you would know
that the actual likelihood has longer tails than you assumed (e.g. perhaps
you should use a Student’s T instead of a normal?)</li>
<li>Remember that how well the posterior predictive distribution fits the data
is of little consequence (e.g. the expectation that 90% of the data should
fall within the 90% credible interval of the posterior). The posterior
predictive distribution tells you what values for data you would expect if
we were to remeasure, given that you’ve already observed the data you did.
As such, it’s informed by your prior as well as your data, and it’s not
its job to adequately fit your data!</li>
</ul>
</li>
</ol>
George Ho
https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.png
Recently I've started using [PyMC3](https://github.com/pymc-devs/pymc3) for Bayesian modelling, and it's an amazing piece of software! The API only exposes as much of the heavy machinery of MCMC as you need — by which I mean, just the `pm.sample()` method.
Linear Regression for Starters
2018-06-02T00:00:00+00:00
https://eigenfoo.xyz/linear-regression
<p>I was recently inspired by the following PyData London talk by <a href="http://koaning.io/">Vincent
Warmerdam</a>. It’s a great talk: he has a lot of great tricks
to make simple, small-brain models really work wonders, and he emphasizes
thinking about your problem in a logical way over trying to use cutting-edge
<em>(Tensorflow)</em> or hyped-up <em>(deep learning)</em> methods just for the sake of using
them — something I’m amazed that people seem to need to be reminded of.</p>
<iframe width="640" height="360" src="https://www.youtube-nocookie.com/embed/68ABAU_V8qI?controls=0&showinfo=0" frameborder="0" allowfullscreen=""></iframe>
<p>One of my favorite tricks was the first one he discussed: extracting and
forecasting the seasonality of sales of some product, just by using linear
regression (and some other neat but ultimately simple tricks).</p>
<p>That’s when I started feeling guilty about not really
<a href="https://www.merriam-webster.com/dictionary/grok"><em>grokking</em></a> linear regression.
It sounds stupid for me to say, but I’ve never really managed to <em>really</em>
understand it in any of my studies. The presentation always seemed very canned,
each topic coming out like a sardine: packed so close together, but always
slipping from your hands whenever you pick them up.</p>
<p>So what I’ve done is take the time to really dig into the math and explain how
all of this linear regression stuff hangs together, trying (and only partially
succeeding) not to mention any domain-specific names. This post will hopefully
be helpful for people who have had some exposure to linear regression before,
and have some fuzzy recollection of what it might be, but really want to see how
everything fits together.</p>
<p>There’s going to be a fair amount of math (enough to properly explain the gist
of linear regression), but I’m really not emphasizing proofs here, and I’ll even
downplay explanations of the more advanced concepts, in favor of explaining the
various flavors of linear regression and how everything hangs together.</p>
<h2 id="so-uh-what-is-linear-regression">So Uh, What is Linear Regression?</h2>
<p>The basic idea is this: we have some number that we’re interested in. This
number could be the price of a stock, the number of stars a restaurant has on
Yelp… Let’s denote this <em>number-that-we-are-interested-in</em> by the letter
<script type="math/tex">y</script>. Occasionally, we may have multiple observations for <script type="math/tex">y</script> (e.g. we
monitored the price of the stock over many days, or we surveyed many restaurants
in a neighborhood). In this case, we stack these values of <script type="math/tex">y</script> and consider
them as a single vector: <script type="math/tex">{\bf y}</script>. To be explicit, if we have <script type="math/tex">n</script>
observations of <script type="math/tex">y</script>, then <script type="math/tex">{\bf y}</script> will be an <script type="math/tex">n</script>-dimensional vector.</p>
<p>We also have some other numbers that we think are related to <script type="math/tex">y</script>. More
explicitly, we have some other numbers that we suspect <em>tell us something</em> about
<script type="math/tex">y</script>. For example (in each of the above scenarios), they could be how the stock
market is doing, or the average price of the food at this restaurant. Let us
denote these <em>numbers-that-tell-us-something-about-y</em> by the letter <script type="math/tex">x</script>.
So if we have <script type="math/tex">p</script> such numbers, we’d call them <script type="math/tex">x_1, x_2, ..., x_p</script>. Again,
we occasionally have multiple observations: in which case, we arrange the <script type="math/tex">x</script>
values into an <script type="math/tex">n \times p</script> matrix which we call <script type="math/tex">X</script>.</p>
<p>If we have this setup, linear regression simply tells us that <script type="math/tex">y</script> is a
weighted sum of the <script type="math/tex">x</script>s, plus some constant term. Easier to show you.</p>
<script type="math/tex; mode=display">y = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon</script>
<p>where the <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script>s are all scalars to be determined, and the
<script type="math/tex">\epsilon</script> is an error term (a.k.a. the <strong>residual</strong>).</p>
<p>Note that we can pull the same stacking trick here: the <script type="math/tex">\beta</script>s will become a
<script type="math/tex">p</script>-dimensional vector, <script type="math/tex">{\bf \beta}</script>, and the <script type="math/tex">\epsilon</script>s (one per observation) will become an
<script type="math/tex">n</script>-dimensional vector, <script type="math/tex">{\bf \epsilon}</script>. Note that <script type="math/tex">\alpha</script> remains common throughout all observations.</p>
<p>If we consider <script type="math/tex">n</script> different observations, we can write the equation much more
succinctly if we simply prepend a column of <script type="math/tex">1</script>s to the <script type="math/tex">{\bf X}</script> matrix and
prepend an extra element (what used to be the <script type="math/tex">\alpha</script>) to the
<script type="math/tex">{\bf \beta}</script> vector.</p>
<p>Then the equation can be written as:</p>
<script type="math/tex; mode=display">{\bf y} = {\bf X} {\bf \beta} + {\bf \epsilon}</script>
<p>That’s it. The hard part (and the whole zoo of different kinds of linear
regressions) now comes from two questions:</p>
<ol>
<li>What can we assume, and more importantly, what <em>can’t</em> we assume about <script type="math/tex">X</script> and <script type="math/tex">y</script>?</li>
<li>Given <script type="math/tex">X</script> and <script type="math/tex">y</script>, how exactly do we find <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script>?</li>
</ol>
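<p>If the stacking trick feels abstract, here is a minimal NumPy sketch of it. The
numbers (<code class="highlighter-rouge">X_raw</code>, <code class="highlighter-rouge">alpha</code>, <code class="highlighter-rouge">beta</code>) are made up purely for illustration, and the
error term is left out so that the two forms can be compared exactly.</p>

```python
import numpy as np

# Hypothetical toy data: n = 4 observations of p = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
alpha = 0.5                     # intercept (made up for illustration)
beta = np.array([2.0, -1.0])    # weights (made up for illustration)

# Scalar form, one observation at a time: y_i = alpha + sum_j beta_j * x_ij
y_scalar = np.array([alpha + beta @ row for row in X_raw])

# Matrix form: prepend a column of 1s to X and prepend alpha to beta.
X = np.column_stack([np.ones(len(X_raw)), X_raw])   # shape (4, 3)
beta_full = np.concatenate([[alpha], beta])         # shape (3,)
y_matrix = X @ beta_full

print(np.allclose(y_scalar, y_matrix))
```

<p>Both routes give the same vector of <script type="math/tex">y</script> values: the column of 1s is what lets the
intercept ride along inside the matrix product.</p>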
<h2 id="the-small-brain-solution-ordinary-least-squares">The Small-Brain Solution: Ordinary Least Squares</h2>
<p><img style="float: middle" src="http://i1.kym-cdn.com/photos/images/facebook/001/232/375/3fb.jpg" /></p>
<p>This section is mostly just a re-packaging of what you could find in any
introductory statistics book, just in fewer words.</p>
<p>Instead of futzing around with whether or not we have multiple observations,
let’s just assume we have <script type="math/tex">n</script> observations: we can always set <script type="math/tex">n = 1</script> if
that’s the case. So,</p>
<ul>
<li>Let <script type="math/tex">{\bf y}</script> be an <script type="math/tex">n</script>-dimensional vector and <script type="math/tex">{\bf \beta}</script> be a <script type="math/tex">p</script>-dimensional vector</li>
<li>Let <script type="math/tex">{\bf X}</script> be an <script type="math/tex">n \times p</script> matrix</li>
</ul>
<p>The simplest, small-brain way of getting our parameter <script type="math/tex">{\bf \beta}</script> is by
minimizing the sum of squares of the residuals:</p>
<script type="math/tex; mode=display">{\bf \hat{\beta}} = \text{argmin} \|{\bf y} - {\bf X}{\bf \beta}\|^2</script>
<p>Our estimate for <script type="math/tex">{\bf \beta}</script> then has a “miraculous” closed-form
solution<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> given by:</p>
<script type="math/tex; mode=display">{\bf \hat{\beta}} = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y}</script>
<p>This solution is so (in)famous that it has been blessed with a fairly universal
name, but cursed with the unimpressive name <em>ordinary least squares</em> (a.k.a.
OLS).</p>
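<p>As a rough sketch of what the closed form looks like in practice, the snippet
below fits synthetic data two ways: by solving the normal equations
<script type="math/tex">{\bf X}^T {\bf X} {\bf \beta} = {\bf X}^T {\bf y}</script> directly, and with NumPy’s built-in least squares
routine. The data, the seed and the “true” coefficients are all assumptions for
the sake of the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# Synthetic design matrix with a leading column of 1s for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -3.0])          # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)    # small Gaussian noise

# Normal equations: solve (X^T X) beta = X^T y.
# Solving the linear system avoids forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least squares routine minimises ||y - X beta||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

<p>The two estimates agree to numerical precision here. Solving the system rather
than inverting <script type="math/tex">{\bf X}^T {\bf X}</script> is a common design choice because it is cheaper and
numerically better behaved.</p>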
<p>If you have a bit of mathematical statistics under your belt, it’s worth noting
that the least squares estimate for <script type="math/tex">{\bf \beta}</script> has a load of nice
statistical properties. It has a simple closed form solution, where the
trickiest thing is a matrix inversion: hardly asking for a computational
miracle. If we can assume that <script type="math/tex">\epsilon</script> is zero-mean Gaussian, the least
squares estimate is the maximum likelihood estimate. Even better, if the errors
are uncorrelated and homoskedastic, then the least squares estimate is the best
linear unbiased estimator. <em>Basically, this is very nice.</em> If most of that flew
over your head, don’t worry — in fact, forget I said anything at all.</p>
<h2 id="why-the-small-brain-solution-sucks">Why the Small-Brain Solution Sucks</h2>
<p><a href="http://www.clockbackward.com/2009/06/18/ordinary-least-squares-linear-regression-flaws-problems-and-pitfalls/">There are a ton of
reasons.</a>
Here, I’ll just highlight a few.</p>
<ol>
<li>Susceptibility to outliers</li>
<li>Assumption of homoskedasticity</li>
<li>Collinearity in features</li>
<li>Too many features</li>
</ol>
<p>Points 1 and 2 are specific to the method of ordinary least squares, while 3 and
4 are just suckish things about linear regression in general.</p>
<h3 id="outliers">Outliers</h3>
<p>The OLS estimate for <script type="math/tex">{\bf \beta}</script> is famously susceptible to outliers. As an
example, consider the third data set in <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s
quartet</a>. That is, the data
is almost a perfect line, but the <script type="math/tex">n</script>th data point is a clear outlier. That
single data point pulls the entire regression line closer to it, which means it
fits the rest of the data worse, in order to accommodate that single outlier.</p>
<p><img style="float: middle" src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/990px-Anscombe%27s_quartet_3.svg.png" /></p>
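<p>The same effect is easy to reproduce on made-up data. The sketch below is not
Anscombe’s actual data set: it assumes ten points lying exactly on a line, then
corrupts only the last one and refits.</p>

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0        # points lying exactly on y = 2x + 1
y_outlier = y_clean.copy()
y_outlier[-1] += 30.0          # corrupt only the last (nth) point

# Degree-1 polynomial fit is just OLS with an intercept.
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

# Largest error on the nine untouched points under each fitted line.
resid_clean = np.abs(
    y_clean[:-1] - (slope_clean * x[:-1] + intercept_clean)).max()
resid_outlier = np.abs(
    y_clean[:-1] - (slope_outlier * x[:-1] + intercept_outlier)).max()

print(slope_clean, slope_outlier)
print(resid_clean, resid_outlier)
```

<p>A single bad point changes the slope noticeably, and the nine good points end
up being fit worse as a result.</p>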
<h3 id="heteroskedasticity-and-correlated-residuals">Heteroskedasticity and correlated residuals</h3>
<p>Baked into the OLS estimate is an implicit assumption that the <script type="math/tex">\epsilon</script>s all
have the same variance. That is, the amount of noise in our data is independent
of what region of our feature space we’re in. However, this is usually not a
great assumption. For example, harking back to our stock price and Yelp rating
examples, this assumption states that the price of a stock fluctuates just as
much in the hour before lunch as it does in the last 5 minutes before market
close, or that Michelin-starred restaurants have as much variation in their Yelp
ratings as do local coffee shops.</p>
<p>Even worse: not only can the residuals have different variances, but they may
also even be correlated! There’s no reason why this can’t be the case. Going
back to the stock price example, we know that high-volatility regimes introduce
much higher noise in the price of a stock, and volatility regimes tend to stay
fairly constant over time (notwithstanding structural breaks), which means that
the level of volatility (i.e. noise, or residual) suffers very high
autocorrelation.</p>
<p>The long and short of this is that some points in our training data are more
likely to be impaired by noise and/or correlation than others, which means that
some points in our training set are more reliable/valuable than others. We don’t
want to ignore the less reliable points completely, but they should count less
in our computation of <script type="math/tex">{\bf \beta}</script> than points that come from regions of
space with less noise, or not impaired as much by correlation.</p>
<h3 id="collinearity">Collinearity</h3>
<p>Collinearity (or multi-collinearity) is just a fancy way of saying that our
features are correlated. In the worst case, suppose that two of our columns in
the <script type="math/tex">{\bf X}</script> matrix are identical: that is, we have repeated data. Then, bad
things happen: the matrix <script type="math/tex">{\bf X}^T {\bf X}</script> no longer has full rank (or at
least, becomes
<a href="https://en.wikipedia.org/wiki/Condition_number">ill-conditioned</a>), which means
the actual inversion becomes an extremely sensitive operation and is liable to
give you nonsensically large or small regression coefficients, which will impact
model performance.</p>
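<p>One way to see the problem numerically is to look at the condition number of
<script type="math/tex">{\bf X}^T {\bf X}</script>. The sketch below uses synthetic features (an assumption, not data
from this post): one pair is independent, the other pair is a column plus a
nearly identical copy of itself.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # unrelated to x1
x1_copy = x1 + 1e-6 * rng.normal(size=n)     # almost an exact duplicate of x1

X_ok = np.column_stack([x1, x2])
X_bad = np.column_stack([x1, x1_copy])

# A large condition number means inverting the matrix amplifies small errors.
cond_ok = np.linalg.cond(X_ok.T @ X_ok)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)

print(cond_ok, cond_bad)
```

<p>The near-duplicate column blows the condition number up by many orders of
magnitude, which is exactly the “extremely sensitive inversion” described above.</p>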
<h3 id="too-many-features">Too many features</h3>
<p>Having more data may be a good thing, but more specifically, having more
<em>observations</em> is a good thing. Having more <em>features</em> might not be a great
thing. In the extreme case, if you have more features than observations, (i.e.
<script type="math/tex">% <![CDATA[
n < p %]]></script>), then the OLS estimate of <script type="math/tex">{\bf \beta}</script> generally fails to be
unique. In fact, as you add more and more features to your model, you will find
that model performance will begin to degrade long before you reach this point
where <script type="math/tex">% <![CDATA[
n < p %]]></script>.</p>
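<p>To make the non-uniqueness concrete, here is a small sketch with more features
than observations. The data are random and purely illustrative. It finds one
exact fit using the pseudoinverse, then builds a second, very different
coefficient vector that fits the data just as well.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 5                      # fewer observations than features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# One solution: the minimum-norm least squares fit via the pseudoinverse.
beta_min_norm = np.linalg.pinv(X) @ y

# The last rows of Vt (full SVD) span the null space of X when rank(X) < p,
# so adding any multiple of them leaves X @ beta unchanged.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]
beta_other = beta_min_norm + 10.0 * null_vec

print(np.allclose(X @ beta_min_norm, y), np.allclose(X @ beta_other, y))
```

<p>Both coefficient vectors reproduce <script type="math/tex">{\bf y}</script> exactly, so the data alone cannot tell
them apart.</p>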
<h2 id="expanding-brain-solutions">Expanding-Brain Solutions</h2>
<p><img style="float: middle" src="http://i1.kym-cdn.com/entries/icons/facebook/000/022/266/brain.jpg" /></p>
<p>Here I’ll discuss some add-ons and plugins you can use to upgrade your Ordinary
Least Squares Linear Regression™ to cope with the four problems I described
above.</p>
<h3 id="heteroskedasticity-and-correlated-residuals-1">Heteroskedasticity and correlated residuals</h3>
<p>To cope with different levels of noise, we can turn to <em>generalized least
squares</em> (a.k.a. GLS), which is basically a better version of ordinary least
squares. A little bit of math jargon lets us explain GLS very concisely. Instead
of minimizing the <em>Euclidean norm</em> of the residuals, we minimize its
<em>Mahalanobis norm</em>: in this way, we take into account the second-moment
structure of the residuals, which allows us to put more weight on the more
valuable data points (i.e. those not impaired by noise or correlation).</p>
<p>Mathematically, the OLS estimate is given by</p>
<script type="math/tex; mode=display">{\bf \hat{\beta}} = \text{argmin} \|{\bf y} - {\bf X}{\bf \beta}\|^2</script>
<p>whereas the GLS estimate is given by</p>
<script type="math/tex; mode=display">{\bf \hat{\beta}} = \text{argmin} ({\bf y} - {\bf X}{\bf \beta})^T {\bf \Sigma}^{-1} ({\bf y} - {\bf X}{\bf \beta})</script>
<p>where <script type="math/tex">{\bf \Sigma}</script> is the <em>known</em> covariance matrix of the residuals.</p>
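<p>Here is a minimal sketch of GLS on synthetic data, assuming the noise level of
every observation is known exactly (which, as discussed further down, is rarely
true in practice). Each residual is weighted by the inverse of the covariance,
so noisy points count for less. The sketch also shows a handy equivalence: GLS is
just OLS after rescaling (“whitening”) each row by its noise level.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])             # made-up coefficients

# Assumed-known heteroskedastic noise: first half quiet, second half noisy.
sigma = np.where(np.arange(n) < n // 2, 0.1, 5.0)
y = X @ beta_true + sigma * rng.normal(size=n)

# Inverse of the (diagonal) residual covariance matrix.
Sigma_inv = np.diag(1.0 / sigma**2)

# OLS ignores the noise structure; GLS weights by the inverse covariance.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

# Equivalent view: divide each row by its noise level, then run plain OLS.
Xw = X / sigma[:, None]
yw = y / sigma
beta_whitened, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

print(beta_ols, beta_gls, beta_whitened)
```

<p>With a diagonal covariance this is the same thing as weighted least squares; a
full covariance matrix would also account for correlation between residuals.</p>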
<p>Now, the GLS estimator enjoys a lot of statistical properties: it is unbiased,
consistent, efficient, and asymptotically normal. <em>Basically, this is very
<strong>very</strong> nice.</em></p>
<p>In practice though, since <script type="math/tex">\Sigma</script> is usually not known, approximate methods
(such as <a href="https://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares">weighted least
squares</a>, or
<a href="https://en.wikipedia.org/wiki/Generalized_least_squares#Feasible_generalized_least_squares">feasible generalized least
squares</a>)
which attempt to estimate the optimal weight for each training point, are used.
One thing that I found interesting while researching this was that these
methods, while they attempt to approximate something better than OLS, may end up
performing <em>worse</em> than OLS! In other words (and more precisely), while it’s true
that these approximate estimators are <em>asymptotically</em> more efficient, for small or
medium data sets, they can end up being <em>less</em> efficient than OLS. This is why
some authors prefer to just use OLS and find <em>some other way</em> to estimate the
variance of the estimator (where this <em>some other way</em> is, of course, robust to
heteroskedasticity or correlation).</p>
<h3 id="outliers-1">Outliers</h3>
<p>Recall that OLS minimizes the sum of squares (of residuals):</p>
<script type="math/tex; mode=display">{\bf \hat{\beta}} = \text{argmin} \|{\bf y} - {\bf X}{\bf \beta}\|^2</script>
<p>A <em>regularized estimation</em> scheme adds a penalty term on the size of the coefficients:</p>
<script type="math/tex; mode=display">{\bf \hat{\beta}} = \text{argmin} \|{\bf y} - {\bf X}{\bf \beta}\|^2 + P({\bf \beta})</script>
<p>where <script type="math/tex">P</script> is some function of <script type="math/tex">{\bf \beta}</script>. Common choices for <script type="math/tex">P</script> are:</p>
<ul>
<li>
<p>The <script type="math/tex">l_1</script> norm: <script type="math/tex">P({\bf \beta}) = \|{\bf \beta}\|_1</script></p>
</li>
<li>
<p>The <script type="math/tex">l_2</script> norm: <script type="math/tex">P({\bf \beta}) = \|{\bf \beta}\|_2</script></p>
</li>
<li>
<p>Interpolating between the first two options:
<script type="math/tex">P({\bf \beta}) = a \|{\bf \beta}\|_1 + (1-a) \|{\bf \beta}\|_2</script>, where <script type="math/tex">% <![CDATA[
0 < a < 1 %]]></script></p>
</li>
</ul>
<p>While regularized regression has empirically been found to be more resilient to
outliers, it comes at a cost: the regression coefficients lose their nice
interpretation of “the effect on the regressand of increasing this regressor by
one unit”. Indeed, regularization can be thought of as telling the universe: “I
don’t care about interpreting the regression coefficients, so long as I get a
reasonable fit that is robust to overfitting”. For this reason, regularization
is usually used for prediction problems, and not for inference.</p>
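<p>To see the “penalty on the size of the coefficients” in action, here is a
sketch of one common variant. It assumes a <em>squared</em> <script type="math/tex">l_2</script> penalty with a
strength parameter <code class="highlighter-rouge">lam</code> (both are assumptions of this example, usually called
ridge regression), because that version has a closed-form solution. The data
are synthetic.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

def ridge(X, y, lam):
    """Minimise ||y - X b||^2 + lam * ||b||_2^2 using its closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# A larger penalty strength pulls the coefficient vector towards zero.
lams = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lams]

for lam, norm in zip(lams, norms):
    print(lam, norm)
```

<p>With <code class="highlighter-rouge">lam = 0</code> this is plain OLS; as <code class="highlighter-rouge">lam</code> grows the coefficients shrink, which
is also why they stop meaning “the effect of increasing this regressor by one
unit”.</p>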
<p>An alternative solution would be to apply some pre-processing to our data: for
example, some anomaly detection on our data points could remove outliers from
the consideration of our linear regression. However, this method also comes with
its own problems — what if it removes the wrong points? It has the potential to
really mess up our model if it did.</p>
<p>The main takeaway, then, is that <em>outliers kinda just suck</em>.</p>
<h3 id="collinearity-1">Collinearity</h3>
<p>Collinearity is a problem that comes and goes: sometimes it’s there, other times
not, and it’s better to always pretend it’s there than it is to risk forgetting
about it.</p>
<p>There are many ways to <a href="https://en.wikipedia.org/wiki/Multicollinearity#Detection_of_multicollinearity">detect
multicollinearity</a>,
many ways to <a href="https://en.wikipedia.org/wiki/Multicollinearity#Remedies_for_multicollinearity">remedy
it</a>
and <a href="https://en.wikipedia.org/wiki/Multicollinearity#Consequences_of_multicollinearity">many consequences if you
don’t</a>.
The Wikipedia page is pretty good at outlining all of those, so I’ll just point
to that.</p>
<p>An alternative that Wikipedia doesn’t mention is principal components regression
(PCR), which is literally just principal components analysis followed by
ordinary least squares. As you can imagine, by throwing away some of the
lower-variance components, you can usually remove some of the collinearity.
However, this comes at the cost of interpretability: there is no easy way to
intuit the meaning of a principal component.</p>
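<p>Since PCR really is just PCA followed by OLS, it fits in a few lines. The sketch
below uses synthetic data with two nearly collinear features, and keeping
<code class="highlighter-rouge">k = 2</code> components is an arbitrary choice for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# Three features; the first two are nearly collinear.
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0, 0.5]) + 0.1 * rng.normal(size=n)

# Step 1: PCA via the SVD of the centred feature matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                           # keep the two highest-variance components
Z = Xc @ Vt[:k].T               # component scores, shape (n, k)

# Step 2: ordinary least squares of y on the scores (plus an intercept).
Zd = np.column_stack([np.ones(n), Z])
gamma, *_ = np.linalg.lstsq(Zd, y, rcond=None)

# Map the component coefficients back onto the original features.
beta_pcr = Vt[:k].T @ gamma[1:]

print(np.linalg.cond(Xc), np.linalg.cond(Z))
print(beta_pcr)
```

<p>Dropping the lowest-variance component removes the direction along which the
two collinear features differ, so the regression on the scores is far better
conditioned than the regression on the raw features.</p>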
<p>A more sophisticated approach would be a close cousin of PCR: <a href="https://en.wikipedia.org/wiki/Partial_least_squares_regression">partial least
squares
regression</a>.
It’s a bit more mathematically involved, and I definitely don’t have the time to
do it full justice here. Google!</p>
<h3 id="too-many-features-1">Too many features</h3>
<p>Having too many features to choose from sounds like the first-world problem of
data science, but it opens up the whole world of high-dimensional statistics and
feature selection. There are a lot of techniques that are at your disposal to
winnow down the number of features here, but the one that is most related to
linear regression is <a href="https://en.wikipedia.org/wiki/Least-angle_regression">least angle
regression</a> (a.k.a. LAR or
LARS). It’s an iterative process that determines the regression coefficients
according to which features are most correlated with the target, and increases
(or decreases) these regression coefficients until some other feature looks like
it has just as much explanatory power (i.e. is just as correlated with what is left unexplained). Like so
many other concepts in this post, I can’t properly do LAR justice in such a
short space, but hopefully the idea was made apparent.</p>
<p>Of course, there are other methods for feature selection too: you can run a
regularized regression to force most of the features to have zero or near-zero
coefficients, or you could use any of the tools in
<a href="http://scikit-learn.org/stable/modules/feature_selection.html"><code class="highlighter-rouge">sklearn.feature_selection</code></a>.</p>
<h2 id="now-what">Now What?</h2>
<p>So that was pretty rushed and a bit hand-wavy, but hopefully it gave you a
high-level view of what linear regression is, and how all these other flavors of
linear regression differ from ordinary least squares, and how they were made
to remedy specific shortcomings of OLS.</p>
<p>And it should come as no surprise that there are even more directions to take
the concept of linear regression: <a href="https://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear models (a.k.a.
GLMs)</a> allow you to
model different kinds of <script type="math/tex">y</script> variables (e.g. what if <script type="math/tex">y</script> is a binary
response, instead of a continuous variable?), and <a href="https://www.quantstart.com/articles/Bayesian-Linear-Regression-Models-with-PyMC3">Bayesian linear
regression</a>
offers an amazing way to quantify the uncertainty in your coefficients. Big
world; happy hunting!</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Insert obligatory footnote here about <a href="https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse">the Moore–Penrose inverse a.k.a. the pseudoinverse</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p><strong>Understanding Hate Speech on Reddit through Text Clustering</strong> (George Ho, 2018-03-18, <a href="https://eigenfoo.xyz/reddit-clusters">https://eigenfoo.xyz/reddit-clusters</a>)</p>
<blockquote>
<p>Note: the following article contains several examples of hate speech
(including but not limited to racist, misogynistic and homophobic views).</p>
</blockquote>
<p>Have you heard of <code class="highlighter-rouge">/r/TheRedPill</code>? It’s an online forum (a subreddit, but I’ll
explain that later) where people (usually men) espouse an ideology predicated
entirely on gender. “Swallowers of the red pill”, as they call themselves,
maintain that it is <em>men</em>, not women, who are socially marginalized; that feminism
is something between a damaging ideology and a symptom of societal retardation;
that the patriarchy should actively assert its dominance over female
compatriots.</p>
<p>Despite being shunned by the world (or perhaps, because of it), <code class="highlighter-rouge">/r/TheRedPill</code>
has grown into a sizable community and evolved its own slang, language and
culture. Let me give you an example.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Cluster #14:
Cluster importance: 0.0489376285127
shit: 2.433590
test: 1.069885
frame: 0.396684
pass: 0.204953
bitch: 0.163619
</code></pre></div></div>
<p>This is a snippet from a text clustering of <code class="highlighter-rouge">/r/TheRedPill</code> — you don’t really
need to understand the details right now: all you need to know is that each
cluster is simply a bunch of words that frequently appear together in Reddit
posts and comments. Following each word is a number indicating its importance in
the cluster, and on line 2 is the importance of this cluster to the subreddit
overall.</p>
<p>As it turns out, this cluster has picked up on a very specific meme on
<code class="highlighter-rouge">/r/TheRedPill</code>: the concept of the <em>shit test</em>, and how your frame can <em>pass</em> the
<em>shit tests</em> that life (but predominantly, <em>bitches</em>) can throw at you.</p>
<p>There’s absolutely no way I could explain this stuff better than the swallowers
of the red pill themselves, so I’ll just quote from a post on <code class="highlighter-rouge">/r/TheRedPill</code> and
a related blog.</p>
<p>The concept of the shit test is very broad:</p>
<blockquote>
<p>… when somebody “gives you shit” and fucks around with your head to see how
you will react, what you are experiencing is typically a (series of) shit
test(s).</p>
</blockquote>
<p>A shit test is designed to test your temperament, or more colloquially,
<em>“determine your frame”</em>.</p>
<blockquote>
<p>Frame is a concept which essentially means “composure and self-control”.</p>
<p>… if you can keep composure/seem unfazed and/or assert your boundaries
despite a shit test, generally speaking you will be considered to have passed
the shit test. If you get upset, offended, doubt yourself or show weakness in
any discernible way when shit tested, it will be generally considered that you
failed the test.</p>
</blockquote>
<p>Finally, not only do shit tests test your frame, but they also serve a specific, critical social function:</p>
<blockquote>
<p>When it comes right down to it shit tests are typically women’s way of
flirting.</p>
<p>… Those who “pass” show they can handle the woman’s BS and is “on her
level”, so to speak. This is where the evolutionary theory comes into play:
you’re demonstrating her faux negativity doesn’t phase you [sic] and that
you’re an emotionally developed person who isn’t going to melt down at the
first sign of trouble. Ergo you’ll be able to protect her when threats to
her safety emerge.</p>
</blockquote>
<p>If you want to learn more, I took all the above quotes from
<a href="https://www.reddit.com/r/TheRedPill/comments/22qnmk/newbies_read_this_the_definitive_guide_to_shit/">here</a>
and <a href="https://illimitablemen.com/2014/12/14/the-shit-test-encyclopedia/">here</a>:
feel free to toss yourself down that rabbit hole (but you may want to open those
links in Incognito mode).</p>
<p>Clearly though, the cluster did a good job of identifying one topic of
discussion on <code class="highlighter-rouge">/r/TheRedPill</code>. In fact, not only can clustering pick up on a
general topic of conversation, but also on specific memes, motifs and vocabulary
associated with it.</p>
<p>Interested? Read on! I’ll explain what I did, and describe some of my other
results.</p>
<hr />
<p>Reddit is — well, it’s pretty hard to describe what Reddit <em>is</em>, mainly because
Reddit comprises several thousand communities, called <em>subreddits</em>, which center
around topics broad (<code class="highlighter-rouge">/r/Sports</code>) and niche (<code class="highlighter-rouge">/r/thinkpad</code>), delightful
(<code class="highlighter-rouge">/r/aww</code>) and unsavory (<code class="highlighter-rouge">/r/Incels</code>).</p>
<p>Each subreddit is a unique community with its own rules, culture and standards.
Some are welcoming and inclusive, and anyone can post and comment; others, not
so much: you must be invited to even read their front page. Some have pliant
standards about what is acceptable as a post; others have moderators willing to
remove posts and ban users upon any infraction of community guidelines.</p>
<p>Whatever Reddit is though, two things are for certain:</p>
<ol>
<li>
<p>It’s widely used. <em>Very</em> widely used. At the time of writing, it’s the <a href="https://www.alexa.com/topsites/countries/US">fourth
most popular website in the United
States</a> and the <a href="https://www.alexa.com/topsites">sixth most popular
globally</a>.</p>
</li>
<li>
<p>Where there is free speech, there is hate speech. Reddit’s hate speech
problem is <a href="https://www.wired.com/2015/08/reddit-mods-handle-hate-speech/">well
documented</a>,
the <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">center of recent
controversy</a>,
and even <a href="https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/">the subject of statistical
analysis</a>.</p>
</li>
</ol>
<p>Now, there are many well-known hateful subreddits. The three that I decided to
focus on were <code class="highlighter-rouge">/r/TheRedPill</code>, <code class="highlighter-rouge">/r/The_Donald</code>, and <code class="highlighter-rouge">/r/CringeAnarchy</code>.</p>
<p>The goal here is to understand what these subreddits are like, and expose their
culture for people to see. To quote <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">Steve Huffman, Reddit’s
CEO</a>:</p>
<blockquote>
<p>“I believe the best defense against racism and other repugnant views, both
on Reddit and in the world, is instead of trying to control what people
can and cannot say through rules, is to repudiate these views in a free
conversation, and empower our communities to do so on Reddit.”</p>
</blockquote>
<p>And there’s no way we can refute and repudiate these deplorable views without
knowing what those views are. And instead of spending hours on each of these
subreddits ourselves, let’s have a machine learn what gets talked about on these
subreddits.</p>
<hr />
<p>Now, how do we do this? This can be done using <em>clustering</em>, a machine learning
technique in which we’re given data points, and tasked with grouping them in
some way. A picture will explain better than words:</p>
<p><a href="https://cdn-images-1.medium.com/max/600/1*yeDcQuDzOa4yPwnP-FyRnA.png"><img align="middle" src="https://cdn-images-1.medium.com/max/800/1*_M5Nx0AjQTGsYzrCHWP4Fw.png" /></a></p>
<p>The clustering algorithm was hard to decide on. After several dead ends were
explored, I settled on non-negative matrix factorization of the document-term
matrix, featurized using tf-idfs. I don’t really want to go into the technical
details now: suffice to say that this technique is <a href="http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html">known to work well for this
application</a>
(perhaps I’ll write another piece on this in the future).</p>
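<p>For the curious, here is a toy sketch of the general idea, and emphatically not
the pipeline used for this project. It assumes a tiny hand-written
document-term matrix <code class="highlighter-rouge">V</code> (in place of real tf-idf features), a made-up vocabulary,
and the classic multiplicative update rules for non-negative matrix
factorization. Each row of <code class="highlighter-rouge">H</code> then plays the role of a cluster: a set of
non-negative weights over words.</p>

```python
import numpy as np

# Hypothetical toy document-term matrix (rows: documents, columns: terms).
# In a real pipeline this would be a tf-idf matrix built from comments.
terms = ["test", "frame", "pass", "wall", "brick", "build"]
V = np.array([[3.0, 2.0, 1.0, 0.0, 0.0, 0.0],
              [2.0, 3.0, 2.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 2.0, 3.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 2.0, 2.0]])

rng = np.random.default_rng(0)
k = 2                                        # number of clusters (arbitrary)
W = rng.random((V.shape[0], k)) + 0.1        # document-to-cluster weights
H = rng.random((k, V.shape[1])) + 0.1        # cluster-to-term weights

eps = 1e-9                                   # guards against division by zero
errors = []
for _ in range(200):
    # Multiplicative updates keep W and H non-negative at every step.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(V - W @ H))

# Report the three highest-weighted terms in each cluster.
for c in range(k):
    top = np.argsort(H[c])[::-1][:3]
    print(f"Cluster #{c}:", [(terms[j], round(float(H[c, j]), 3)) for j in top])
```

<p>On real data you would more likely reach for an existing library
implementation than hand-rolled updates; the point here is only to show where
the “words with importances” in each cluster come from.</p>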
<p>Finally, we need our data points: <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments">Google
BigQuery</a>
has all posts and comments across all of Reddit, from the beginning of
Reddit right up until the end of 2017. We decided to focus on the last two
months for which there is data: November and December, 2017.</p>
<p>I could talk at length about the technical details, but right now, I want to
focus on the results of the clustering. What follows are two hand-picked
clusters from each of the three subreddits, visualized as word clouds (you can
think of word clouds as visual representations of the code snippet above), as
well as an example comment from each of the clusters.</p>
<h2 id="rtheredpill"><code class="highlighter-rouge">/r/TheRedPill</code></h2>
<p>You already know <code class="highlighter-rouge">/r/TheRedPill</code>, so let me describe the clusters in more detail:
a good number of them are about sex, or about how to approach girls. Comments in
these clusters tend to give advice on how to pick up girls, or describe the
social/sexual exploits of the commenter.</p>
<p>What is interesting is that, as sex-obsessed as <code class="highlighter-rouge">/r/TheRedPill</code> is, many
swallowers (of the red pill) profess that sex is <em>not</em> the purpose of the
subreddit: the point is to become an “alpha male”. Even more interesting,
there is more talk about what an alpha male <em>is</em>, and what kind of people
<em>aren’t</em> alpha, than there is about how people can <em>become</em> alpha. This is the
first cluster shown below, and comprises around 3% of all text on
<code class="highlighter-rouge">/r/TheRedPill</code>.</p>
<p>The second cluster comprises around 6% of all text on <code class="highlighter-rouge">/r/TheRedPill</code>, and
contains comments that expound theories on the role of men, women and feminism
in today’s society (it isn’t pretty). Personally, the most repugnant views that
I’ve read are to be found in this cluster.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I feel like the over dramatization of beta qualities in media/pop culture is due
to the fact that anyone representing these qualities is already Alpha by
default.
The actors who play the white knight lead roles, the rock stars that sing about
pining for some chick… these men/characters are already very Alpha in both looks
and status, so when beta BS comes from their mouths, it’s seen as attractive
because it balances out their already alpha state into that "mostly alpha but
some beta" balance that makes women swoon.
…
</code></pre></div></div>
<figure class="half">
<a href="https://cdn-images-1.medium.com/max/600/1*tD_vrXqkvWjKiDBvXQcV9g.png"><img src="https://cdn-images-1.medium.com/max/600/1*tD_vrXqkvWjKiDBvXQcV9g.png" /></a>
<a href="https://cdn-images-1.medium.com/max/600/1*yeDcQuDzOa4yPwnP-FyRnA.png"><img src="https://cdn-images-1.medium.com/max/600/1*yeDcQuDzOa4yPwnP-FyRnA.png" /></a>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>…
Since the dawn of humanity men were always in control, held all the power and
women were happy because of it. But now men are forced to lose their masculinity
and power or else they'll be killed/punished by other pussy men with big guns
and laws who believe feminism is the right path for humanity.
…
Feminism is really a blessing in disguise because it's a wake up call for men
and a hidden cry for help from women for men to regain their masculinity,
integrity and control over women.
…
</code></pre></div></div>
<h2 id="rthe_donald"><code class="highlighter-rouge">/r/The_Donald</code></h2>
<p>You may have already heard of <code class="highlighter-rouge">/r/The_Donald</code> (a.k.a. the “pro-Trump cesspool”),
famed for their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Conflict_with_Reddit_management">takeover of the Reddit front
page</a>,
and their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Controversies">involvement in several recent
controversies</a>. It
may therefore be surprising to learn that there is an iota of lucid discussion
that goes on, although in a jeering, bullying tone.</p>
<p><code class="highlighter-rouge">/r/The_Donald</code> is the subreddit which has developed the most language and inside
jokes: from “nimble navigators” to “swamp creatures”, “spezzes” to the
“Trumpire”… Explaining these memes would take too long: reach out, or Google, if
you really want to know.</p>
<p>The first cluster accounts for 5% of all text on <code class="highlighter-rouge">/r/The_Donald</code>, and contains
(relatively) coherent arguments both for and against net neutrality. The second
cluster accounts for 1% of all text on <code class="highlighter-rouge">/r/The_Donald</code>, and is actually from
the subreddit’s <code class="highlighter-rouge">MAGABrickBot</code>, which is a bot that keeps count of how many times
the word “brick” has been used in comments, by automatically generating this
comment.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>So much misinformation perpetuated by the Swamp... Abolishing Net Neutrality
would benefit swamp creatures with corporate payouts but would be most damaging
to conservatives long term.
Net Neutrality was NOT created by Obama, it was actually in effect from the very
beginning...
</code></pre></div></div>
<figure class="half">
<a href="https://cdn-images-1.medium.com/max/600/1*_8INm_IScvEnVWSZrvbXmw.png"><img src="https://cdn-images-1.medium.com/max/600/1*_8INm_IScvEnVWSZrvbXmw.png" /></a>
<a href="https://cdn-images-1.medium.com/max/600/1*FBnzykJ4RzEOhkIIE3hP0w.png"><img src="https://cdn-images-1.medium.com/max/600/1*FBnzykJ4RzEOhkIIE3hP0w.png" /></a>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**FOR THE LOVE OF GOD GET THIS PATRIOT A BRICK! THAT'S 92278 BRICKS HANDED
OUT!**
We are at **14.3173880911%** of our goal to **BUILD THE WALL** starting from Imperial
Beach, CA to Brownsville, Texas! Lets make sure everyone gets a brick in the
United States! For every Centipede a brick, for every brick a Centipede!
At this rate, the wall will be **1071.35224786 MILES WIDE** and **353.552300867 FEET
HIGH** by tomorrow! **DO YOUR PART!**
</code></pre></div></div>
<h2 id="rcringeanarchy"><code class="highlighter-rouge">/r/CringeAnarchy</code></h2>
<p>On the Internet, <em>cringe</em> is the second-hand embarrassment you feel when someone
acts extremely awkwardly or uncomfortably. And on <code class="highlighter-rouge">/r/CringeAnarchy</code> you can find
memes about the <em>real</em> cringe, which is, um, liberals and anyone else who
advocates for an inclusionary, equitable ideology. Their morally grey jokes run
the gamut of delicate topics: gender, race, sexuality, nationality…</p>
<p>In some respects, the clustering provided very little insight into this
subreddit: each such delicate topic had one or two clusters, and there’s nothing
really remarkable about any of them. This speaks to the inherent difficulty of
training a topic model on memes: I rant at greater length about this topic in
<a href="https://eigenfoo.xyz/lda-sucks/">one of my blog posts</a>.</p>
<p>Both clusters below comprise around 3% of text on <code class="highlighter-rouge">/r/CringeAnarchy</code>: one is to do
with race, and the other is to do with homosexuality.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Has anyone here, non-black or otherwise, ever wished someone felt sorry for
being black? Maybe it's just where I live... the majority is black. It's
whatever.
</code></pre></div></div>
<figure class="half">
<a href="https://cdn-images-1.medium.com/max/600/1*HBzJNutPwxOZdPIKcaxf5Q.png"><img src="https://cdn-images-1.medium.com/max/600/1*HBzJNutPwxOZdPIKcaxf5Q.png" /></a>
<a href="https://cdn-images-1.medium.com/max/600/1*m91cvXrui_72R70BcQh2jg.png"><img src="https://cdn-images-1.medium.com/max/600/1*m91cvXrui_72R70BcQh2jg.png" /></a>
</figure>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>…
Also, the distinction between bisexual and gay is academic. If you do a gay
thing, you have done a gay thing. That's what "being gay" means to a LOT of
people. Redefining it is as useful as all the other things SJWs are redefining.
</code></pre></div></div>
<hr />
<p>As much information as that might have been, this was just a glimpse into what
these subreddits are like: I made 20 clusters for each subreddit, and you could
argue that (for somewhat technical reasons) 20 clusters isn’t even enough!
Moreover, there is just no way I could distill everything I learned about these
communities into one Medium story: I’ve curated just the more remarkable or
provocative results to put here.</p>
<p>If you still have the stomach for this stuff, scroll through the complete log
files
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/clustering/results">here</a>,
or look through images of the word clouds
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/wordclouds/images">here</a>.</p>
<p>Finally, as has been said before, “Talk is cheap. Show me the code.” For
everything I’ve written to make these clusters, check out <a href="https://github.com/eigenfoo/reddit-clusters">this GitHub
repository</a>.</p>
<hr />
<p><strong>EDIT (11-08-2018):</strong> If you’re interested in the technical, data science side
of the project, check out the slide deck and speaker notes from my recent talk
on exactly that!</p>
<hr />
<p><em>This post was originally published <a href="https://medium.com/@_eigenfoo/understanding-hate-speech-on-reddit-through-text-clustering-7dc7675bccae">on
Medium</a>
on May 18, 2018. This post was also reprinted in the inaugural issue of Cooper
Union’s <a href="https://www.facebook.com/theunionjournal/">UNION Journal</a>.</em></p>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngHave you heard of `/r/TheRedPill`? It’s an online forum (a subreddit, but I’ll explain that later) where people (usually men) espouse an ideology predicated entirely on gender. 'Swallowers of the red pill', as they call themselves, maintain that it is _men_, not women, who are socially marginalized…Why Latent Dirichlet Allocation Sucks2018-03-06T00:00:00+00:002018-03-06T00:00:00+00:00https://eigenfoo.xyz/lda-sucks<p>As I learn more and more about data science and machine learning, I’ve noticed
that a lot of resources out there go something like this:</p>
<blockquote>
<p>Check out this thing! It’s great at this task! The important task! The one
that was impossible/hard to do before! Look how well it does! So good! So
fast!</p>
<p>Take this! It’s our algorithm/code/paper! We used it to do the thing! And now
you can do the thing too!</p>
</blockquote>
<p>Jokes aside, I do think it’s true that a lot of research and resources focus on
what things <em>can</em> do, or what things are <em>good</em> at doing. Whenever I actually
implement the hyped-up “thing”, I’m invariably frustrated when it doesn’t
perform as well as originally described.</p>
<p>Maybe I’m not smart enough to see this, but after I learn about a new technique
or tool or model, it’s not immediately obvious to me when <em>not</em> to use it. I
think it would be very helpful to learn what things <em>aren’t</em> good at doing, or
why things just plain <em>suck</em> at times. Doing so not only helps you understand
the technique/tool/model better, but also sharpens your understanding of your
use case and the task at hand: what is it about your application that makes it
unsuitable for such a technique?</p>
<p>Which is why I’m writing the first of what will (hopefully) be a series of posts
on <em>“Why [Thing] Sucks”</em>. The title is provocative but reductive: a better name
might be <em>When and Why [Thing] Might Suck</em>… but that doesn’t have quite the
same ring to it! In these articles I’ll be outlining what I tried and why it
didn’t work: documenting my failures and doing a quick post-mortem, if you will.
My hope is that this will be useful to anyone else trying to do the same thing
I’m doing.</p>
<hr />
<p>So first up: topic modelling. Specifically, <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">latent Dirichlet
allocation</a>, or LDA
for short (not to be confused with <a href="https://eigenfoo.xyz/lda/">the other
LDA</a>, which I wrote a blog post about before).</p>
<p>If you’ve already encountered LDA and have seen <a href="https://en.wikipedia.org/wiki/Plate_notation">plate
notation</a> before, this picture
will probably refresh your memory:</p>
<p><a title="By Bkkbrad [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Latent_Dirichlet_allocation.svg"><img width="512" alt="Latent Dirichlet allocation" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Latent_Dirichlet_allocation.svg/512px-Latent_Dirichlet_allocation.svg.png" /></a></p>
<p>If you don’t know what LDA is, fret not, for there is
<a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">no</a>
<a href="http://obphio.us/pdfs/lda_tutorial.pdf">shortage</a>
<a href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/">of</a>
<a href="https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html">resources</a>
<a href="http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">about</a>
<a href="https://radimrehurek.com/gensim/models/ldamodel.html">this</a>
<a href="https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation">stuff</a>.
I’m going to move on to when and why LDA isn’t the best idea.</p>
<p><strong>tl;dr:</strong> <em>LDA and topic modelling don’t work well with a) short documents,
in which there isn’t much text to model, or b) documents that don’t coherently
discuss a single topic.</em></p>
<p>Wait, what? Did George just say that topic modelling sucks when there’s not much
topic, and not much text to model? Isn’t that obvious?</p>
<p><em>Yes! Exactly!</em> Of course it’s <a href="https://en.wikipedia.org/wiki/Egg_of_Columbus">obvious in
retrospect</a>! Which is why I was
so upset when I realized I spent two whole weeks faffing around with LDA when
topic models were the opposite of what I needed, and so frustrated that more
people aren’t talking about when <em>not</em> to use/do certain things.</p>
<p>But anyways, <code class="highlighter-rouge"><\rant></code> and let’s move on to why I say what I’m saying.</p>
<p>Recently, I’ve taken up a project in modelling the textual data on Reddit using
NLP techniques. There are, of course, many ways one could take this, but
something I was interested in was finding similarities between subreddits,
clustering comments, and visualizing these clusters somehow: what does Reddit
talk about on average? Of course, I turned to topic modelling and dimensionality
reduction.</p>
<p>The techniques that I came across first were LDA (<a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">latent Dirichlet
allocation</a>) and
t-SNE (<a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed stochastic neighbor
embedding</a>).
Both techniques are well known and well documented, but I can’t say that using
them together is a popular combination. However, there have been
some successes. For instance, Shuai had some success with this method <a href="https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html">when
using it on the 20 newsgroups
dataset</a>;
some work done by Kagglers has <a href="https://www.kaggle.com/ykhorramz/lda-and-t-sne-interactive-visualization">yielded reasonable
results</a>,
and <a href="https://stats.stackexchange.com/questions/305356/plot-latent-dirichlet-allocation-output-using-t-sne">the StackExchange community doesn’t think it’s a ridiculous
idea</a>.</p>
<p>The dataset that I applied this technique to was the <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit">Reddit dataset on Google
BigQuery</a>, which contains
data on all subreddits, posts and comments for as long as Reddit has been around.
I limited myself to the top 10 most active subreddits in December 2017 (the most
recent month for which we have data, at the time of writing), and chose 20 to be
the number of topics to model (any choice is as arbitrary as any other).</p>
<p>I ran LDA and t-SNE exactly as Shuai described on <a href="https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html">this blog
post</a>,
except using the great <a href="https://radimrehurek.com/gensim/"><code class="highlighter-rouge">gensim</code></a> library to
perform LDA, which was built with large corpora and efficient online algorithms
in mind. (Specifically, <code class="highlighter-rouge">gensim</code> implements online variational inference with
the EM algorithm, instead of using MCMC-based algorithms, which <code class="highlighter-rouge">lda</code> does. It
seems that variational Bayes scales better to very large corpora than collapsed
Gibbs sampling.)</p>
<p>Here are the results:</p>
<figure>
<a href="/assets/images/lda-sucks.png"><img style="float: middle" width="600" height="600" src="/assets/images/lda-sucks.png" /></a>
</figure>
<p>Horrible, right? Nowhere near the well-separated clusters that Shuai got with
the 20 newsgroups. In fact, the tiny little huddles of around 5 to 10 comments
are probably artifacts of the dimensionality reduction done by t-SNE, so those
might even just be noise! You might say that there are at least 3 very large
clusters, but even that’s bad news! If they’re clustered together, you would
hope that they have the same topics, and that’s definitely not the case here!
These large clusters tell us that a lot of comments have roughly the same topic
distribution (i.e. they’re close to each other in the high-dimensional
topic-space), but their dominant topics (i.e. the topic with greatest
probability) don’t end up being the same.</p>
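<p>To make that last point concrete, here is a tiny sketch with made-up numbers
(the two vectors are hypothetical, not taken from the model above): two comments
whose topic distributions are almost identical can still end up with different
dominant topics.</p>

```python
import numpy as np

# Two hypothetical topic distributions over 4 topics (each sums to 1).
comment_a = np.array([0.26, 0.25, 0.25, 0.24])
comment_b = np.array([0.25, 0.26, 0.25, 0.24])

# Euclidean distance in topic-space: the two comments are nearly identical.
distance = np.linalg.norm(comment_a - comment_b)

# Dominant topic = index of the largest probability.
dominant_a = int(np.argmax(comment_a))
dominant_b = int(np.argmax(comment_b))

print(f"distance = {distance:.4f}")
print(f"dominant topics: {dominant_a} vs {dominant_b}")
```

<p>When every topic probability is roughly equal, the "dominant" topic is decided
by noise, which is one way a tight cluster can end up painted in many colours.</p>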
<p>By the way, t-SNE turns out to be <a href="https://distill.pub/2016/misread-tsne/">a really devious dimensionality reduction
technique</a>, and you really need to
experiment with the perplexity values in order to use it properly. I used the
default <code class="highlighter-rouge">perplexity=30</code> from sklearn for the previous plot, but I repeated the
visualizations for multiple other values and the results aren’t so hot either.
You can check out the results <a href="https://www.flickr.com/photos/155778261@N04/albums/72157694226050095">on my
Flickr</a>.
Note that I did these on a random subsample of 1000 comments, so as to reduce
compute time.</p>
<figure class="half">
<a href="/assets/images/perplexity50.png"><img src="/assets/images/perplexity50.png" /></a>
<a href="/assets/images/perplexity100.png"><img src="/assets/images/perplexity100.png" /></a>
<figcaption>t-SNE with perplexity values of 50 and 100, respectively.</figcaption>
</figure>
<p>So, what went wrong? There’s a <a href="https://stackoverflow.com/questions/29786985/whats-the-disadvantage-of-lda-for-short-texts">nice StackOverflow
post</a>
that describes the problem well.</p>
<p>Firstly, latent Dirichlet allocation and other probabilistic topic models are
very complex and flexible. While this means that they have very high variance
and low bias, it also means that they need a lot of data (or data with a decent
signal-to-noise ratio) for them to learn anything meaningful. Particularly for
LDA, which infers topics on a document-by-document basis, if there aren’t enough
words in a document, there simply isn’t enough data to infer a reliable topic
distribution for that document.</p>
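<p>Here is a rough simulation of that intuition, under a deliberately generous
assumption of my own: the topic of every word is observed directly, so the only
thing left to do is estimate a document’s topic proportions from counts. (Real
LDA has to infer the word-topic assignments too, which only makes things
harder.) The function name and the choice of 20 topics are just for
illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 20

def mean_estimation_error(doc_length, n_trials=200):
    """Average total-variation error when estimating topic proportions."""
    errors = []
    for _ in range(n_trials):
        theta = rng.dirichlet(np.ones(n_topics))     # true topic proportions
        counts = rng.multinomial(doc_length, theta)  # topic of each word
        theta_hat = counts / doc_length              # best-case estimate
        errors.append(0.5 * np.abs(theta - theta_hat).sum())
    return float(np.mean(errors))

short_error = mean_estimation_error(doc_length=10)
long_error = mean_estimation_error(doc_length=1000)
print(f"10 words:   {short_error:.3f}")
print(f"1000 words: {long_error:.3f}")
```

<p>Even in this best case, a ten-word document gives a far noisier estimate than
a thousand-word one.</p>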
<p>Secondly, Reddit comments are by their nature very short and very
context-dependent, since they respond to a post, or another comment. So not only are
Reddit comments just short: it’s actually worse than that! They don’t even
discuss a certain topic coherently (by which I mean, they don’t necessarily use
words that pertain to what they’re talking about). I’ll give an example:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"I'm basing my knowledge on the fact that I watched the fucking rock fall."
</code></pre></div></div>
<p>Now, stopwords compose a little less than half of this comment, and they would
be stripped before LDA even looks at it. But that aside, what is this comment
about? What does the rock falling mean? What knowledge is this user claiming?
It’s a very confusing comment, but probably made complete sense in the context
of the post it responded to and the comments that came before it. As it is,
however, it’s impossible for <em>me</em> to figure out what topic this comment is about,
let alone an algorithm!</p>
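<p>As a sketch of what a bag-of-words model is left with, here is the same
comment after stopword removal. The stopword list is a hypothetical, hand-picked
one for illustration (real lists are much longer and may differ), so treat the
exact output as an example rather than what my pipeline produced.</p>

```python
import re

# Hypothetical, hand-picked stopword list for illustration only.
STOPWORDS = {"i", "i'm", "my", "on", "the", "that"}

comment = "I'm basing my knowledge on the fact that I watched the fucking rock fall."

# Lowercase and split into word tokens, keeping apostrophes inside words.
tokens = re.findall(r"[a-z']+", comment.lower())
kept = [token for token in tokens if token not in STOPWORDS]

print(kept)
```

<p>What survives is a handful of words with no ordering and no thread context,
and that is all the topic model gets to see.</p>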
<p>Also, just to drive the point home, here are the top 10 words in each of the 20
topics that LDA came up with, on the same dataset as before:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Topic #0:
got just time day like went friend told didn kids
Topic #1:
just gt people say right doesn know law like government
Topic #2:
removed com https www https www tax money http watch news
Topic #3:
people don just like think really good know want things
Topic #4:
years time did great ago ve just work life damn
Topic #5:
movie like love just really school star movies film story
Topic #6:
like just fucking shit head car looks new makes going
Topic #7:
game team season year good win play teams playing best
Topic #8:
right thing yeah don think use internet ok water case
Topic #9:
going like work just need way want money free fuck
Topic #10:
better just play games make ve ll seen lol fun
Topic #11:
like don know did feel shit big man didn guys
Topic #12:
deleted fuck guy year old man amp year old state lmao
Topic #13:
sure believe trump wrong saying comment post mueller evidence gt
Topic #14:
gt yes https com good oh wikipedia org en wiki
Topic #15:
think like good 10 look point lebron just pretty net
Topic #16:
gt said fucking american agree trump thanks obama states did
Topic #17:
trump vote party republicans election moore president republican democrats won
Topic #18:
war world country israel countries china military like happy does
Topic #19:
reddit message askreddit post questions com reddit com subreddit compose message compose
</code></pre></div></div>
<p>Now, it’s not entirely bad: topic 2 seems like it’s collecting the tokens from links
(I didn’t stopword those out, oops), topic 7 looks like it’s about football or
some other sport, 13 is probably about American politics, and 18 looks like
it’s about world news, etc.</p>
<p>But almost all other topics are just collections of words: it’s not immediately
obvious to me what each topic represents.</p>
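<p>For reference, a listing like the one above can be produced from any
topic-word weight matrix by sorting each row. This is a minimal sketch: the
vocabulary, the weights and the <code class="highlighter-rouge">top_words</code> helper are all made up for
illustration, and are not the code I actually ran.</p>

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (rows = topics).
vocabulary = np.array(["game", "team", "trump", "vote", "movie", "film"])
topic_word = np.array([
    [0.40, 0.35, 0.05, 0.05, 0.10, 0.05],   # sports-like topic
    [0.05, 0.05, 0.45, 0.35, 0.05, 0.05],   # politics-like topic
])

def top_words(topic_word, vocabulary, n_words):
    """Return the n_words highest-weighted words for each topic."""
    # argsort is ascending, so reverse each row and keep the first n_words.
    order = np.argsort(topic_word, axis=1)[:, ::-1][:, :n_words]
    return [list(vocabulary[row]) for row in order]

for topic_id, words in enumerate(top_words(topic_word, vocabulary, n_words=2)):
    print(f"Topic #{topic_id}: {' '.join(words)}")
```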
<p>So yeah, there you have it, LDA really sucks sometimes.</p>
<hr />
<p><strong>EDIT (8/12/2018):</strong> In retrospect, I think that this whole blog post is
summarized well in the following tweet thread. Clustering algorithms will give
you clusters because that’s what they do, not because there actually <em>are</em>
clusters. In this case, extremely short and context-dependent documents make it
hard to justify that there are topic clusters in the first place.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Algorithms that have to report something will always report something, even if it's a bad idea. Please do not use these algorithms unless you have principled reasons why there should be something. <a href="https://t.co/kzxZiuBfmm">https://t.co/kzxZiuBfmm</a></p>— \mathfrak{Michael Betancourt} (@betanalpha) <a href="https://twitter.com/betanalpha/status/1026619046626828288?ref_src=twsrc%5Etfw">August 7, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngAs I learn more and more about data science and machine learning, I've noticed that a lot of resources out there go something like this…~~Fruit~~ Loops and Learning - The LUPI Paradigm and SVM+2018-01-30T00:00:00+00:002018-01-30T00:00:00+00:00https://eigenfoo.xyz/lupi<p>Here’s a short story you might know: you have a black box, whose name is
<em>Machine Learning Algorithm</em>. It’s got two modes: training mode and testing
mode. You set it to training mode, and throw in a lot (sometimes <em>a lot</em> a lot)
of ordered pairs <script type="math/tex">(x_i, y_i), 1 \leq i \leq l</script>. Here, the <script type="math/tex">x_i</script> are called
the <em>examples</em> and the <script type="math/tex">y_i</script> are called the <em>targets</em>. Then, you set it to
testing mode and throw in some more examples, for which you don’t have the
corresponding targets. You hope the <script type="math/tex">y_i</script>s that come out are in some sense
the “right” ones.</p>
<p>Generally speaking, this is a parable of <em>supervised learning</em>. However, Vapnik
(the inventor of the
<a href="https://en.wikipedia.org/wiki/Support_vector_machine">SVM</a>) recently described
a new way to think about machine learning
(<a href="http://www.engr.uconn.edu/~jinbo/doc/vladimir_newparadiam.pdf">here</a> and
<a href="http://jmlr.csail.mit.edu/papers/volume16/vapnik15b/vapnik15b.pdf">here</a>):
<em>learning using privileged information</em>, or <em>LUPI</em> for short.</p>
<p>This post is meant to introduce the LUPI paradigm of machine learning to
people who are generally familiar with supervised learning and SVMs, and are
interested in seeing the math and intuition behind both things extended to the
LUPI paradigm.</p>
<h2 id="what-is-lupi">What is LUPI?</h2>
<p>The main idea is that instead of two-tuples <script type="math/tex">(x_i, y_i)</script>, the black box is fed
three-tuples <script type="math/tex">(x_i, x_i^{*}, y_i)</script>, where the <script type="math/tex">x^{*}</script>s are the so-called
<em>privileged information</em> that is only available during training, and not during
testing. The hope is that this information will train the model to better
generalize during the testing phase.</p>
<p>Vapnik offers many examples in which LUPI can be applied in real life: in
bioinformatics and proteomics (where advanced biological models, which the
machine might not necessarily “understand”, serve as the privileged
information), in financial time series analysis (where future movements of the
time series are the unknown at prediction time, but are available
retrospectively), and in the classic MNIST dataset, where the images were
converted to a lower resolution, but each annotated with a “poetic description”
(which was available for the training data but not for the testing data).</p>
<p>Vapnik’s team ran tests on well-known datasets in all three application areas
and found that his newly-developed LUPI methods performed noticeably better than
classical SVMs in both convergence time (i.e. the number of examples necessary
to achieve a certain degree of accuracy) and estimation of a good predictor
function. In fact, Vapnik’s proof-of-concept experiments are so wacky that
they actually <a href="https://nautil.us/issue/6/secret-codes/teaching-me-softly">make for an entertaining read
</a>!</p>
<h2 id="classical-svms-separable-and-non-separable-case">Classical SVMs (separable and non-separable case)</h2>
<p>There are many ways of thinking about SVMs, but I think that the one that is
most instructive here is to think of them as solving the following optimization
problem:</p>
<blockquote>
<p>Minimize <script type="math/tex">\frac{1}{2} \|w\|^2</script></p>
<p>subject to <script type="math/tex">y_i [ w \cdot x_i + b ] \geq 1, \; 1 \leq i \leq l</script>.</p>
</blockquote>
<p>Basically all this is saying is that we want to find the hyperplane that
separates our data by the maximum margin. More technically speaking, this finds
the parameters (<script type="math/tex">w</script> and <script type="math/tex">b</script>) of the maximum margin hyperplane, with <script type="math/tex">l_2</script>
regularization.</p>
<p>In the non-separable case, we concede that our hyperplane may not classify all
examples perfectly (or that it may not be desirable to do so: think of
overfitting), and so we introduce a so-called <em>slack variable</em> <script type="math/tex">\xi_i \geq 0</script> for each example <script type="math/tex">i</script>, which measures the severity of misclassification of
that example. With that, the optimization becomes:</p>
<blockquote>
<p>Minimize <script type="math/tex">\frac{1}{2} \|w\|^2 + C\sum_{i=1}^{l}{\xi_i}</script></p>
<p>subject to <script type="math/tex">y_i [ w \cdot x_i + b ] \geq 1 - \xi_i, \; \xi_i \geq 0, 1
\leq i \leq l</script>.</p>
</blockquote>
<p>where <script type="math/tex">C</script> is some regularization parameter.</p>
<p>This says the same thing as the previous optimization problem, but now allows
points to be (a) classified properly (<script type="math/tex">\xi_i = 0</script>), (b) within the margin but
still classified properly (<script type="math/tex">% <![CDATA[
0 < \xi_i < 1 %]]></script>), or (c) misclassified
(<script type="math/tex">1 \leq \xi_i</script>).</p>
<p>In both the separable and non-separable cases, the decision rule is simply <script type="math/tex">\hat{y} = \text{sign}(w \cdot x + b)</script>.</p>
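<p>To make the slack variables concrete, here is a small sketch that evaluates
(rather than solves) the non-separable problem. For a fixed <script type="math/tex">w</script> and <script type="math/tex">b</script>, the
smallest slack satisfying both constraints is <script type="math/tex">\xi_i = \max(0, 1 - y_i [w \cdot x_i + b])</script>. The toy data and the hand-picked hyperplane are
hypothetical, and the function names are my own.</p>

```python
import numpy as np

def slacks(w, b, X, y):
    """Smallest slack satisfying y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

def soft_margin_objective(w, b, X, y, C):
    """Value of (1/2) ||w||^2 + C * sum(xi_i)."""
    return 0.5 * (w @ w) + C * slacks(w, b, X, y).sum()

def predict(w, b, X):
    """Decision rule: sign(w . x + b)."""
    return np.sign(X @ w + b)

# Hypothetical 1-D toy data and a hand-picked hyperplane (not a fitted one).
X = np.array([[2.0], [0.5], [-0.5], [-2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
w, b = np.array([1.0]), 0.0

xi = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)
print(xi)
print(objective)
```

<p>The four points cover the three cases: two have zero slack, one sits inside
the margin with a slack between 0 and 1, and one is misclassified with a slack
greater than 1.</p>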
<p>An important thing to note is that, in the separable case, the SVM uses <script type="math/tex">l</script>
examples to estimate the <script type="math/tex">n</script> components of <script type="math/tex">w</script>, whereas in the nonseparable
case, the SVM uses <script type="math/tex">l</script> examples to estimate <script type="math/tex">n+l</script> parameters: the <script type="math/tex">n</script>
components of <script type="math/tex">w</script> and <script type="math/tex">l</script> values of slacks <script type="math/tex">\xi_i</script>. Thus, in the
non-separable case, the number of parameters to be estimated is always larger
than the number of examples. It does not matter here that most of the slacks may be
equal to zero: the SVM still has to estimate all of them.</p>
<p>The way both optimization problems are actually <em>solved</em> is fairly involved (they
require <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange
multipliers</a>), but in terms
of getting an intuitive feel for how SVMs work, I think that examining the
optimization problems suffices!</p>
<h2 id="what-is-svm">What is SVM+?</h2>
<p>In his paper introducing the LUPI paradigm, Vapnik outlines <em>SVM+</em>, a
modified form of the SVM that fits well into the LUPI paradigm, using privileged
information to improve performance. It should be emphasized that LUPI is a
paradigm - a way of thinking about machine learning - and not just a collection
of algorithms. SVM+ is just one technique that interoperates with the LUPI
paradigm.</p>
<p>The innovation of the SVM+ algorithm is that it uses the privileged information
to estimate the slack variables. Given the training three-tuple <script type="math/tex">(x, x^{*}, y)</script>, we map <script type="math/tex">x</script> to the feature space <script type="math/tex">Z</script>, and <script type="math/tex">x^{*}</script> to a separate feature
space <script type="math/tex">Z^{*}</script>. Then, the decision rule is <script type="math/tex">\hat{y} = \text{sign}(w \cdot x +
b)</script> and the slack variables are estimated by <script type="math/tex">\xi = w^{*} \cdot x^{*} + b^{*}</script>.</p>
<p>In order to find <script type="math/tex">w</script>, <script type="math/tex">b</script>, <script type="math/tex">w^{*}</script> and <script type="math/tex">b^{*}</script>, we solve the following
optimization problem:</p>
<blockquote>
<p>Minimize <script type="math/tex">\frac{1}{2} (\|w\|^2 + \gamma \|w^{*}\|^2) +
C \sum_{i=1}^{l}{(w^{*} \cdot x_i^{*} + b^{*})}</script></p>
<p>subject to <script type="math/tex">y_i [ w \cdot x_i + b ] \geq 1 - (w^{*} \cdot x_i^{*} + b^{*}),
\; (w^{*} \cdot x_i^{*} + b^{*}) \geq 0, 1 \leq i \leq l</script>.</p>
</blockquote>
<p>where <script type="math/tex">\gamma</script> indicates the extent to which the slack estimation should be
regularized in comparison to the SVM. Notice how this optimization problem is
essentially identical to the non-separable classical SVM, except the slacks
<script type="math/tex">\xi_i</script> are now estimated with <script type="math/tex">w^{*} \cdot x_i^{*} + b^{*}</script>.</p>
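<p>As with the classical SVM, it can help to simply evaluate the problem for a
candidate solution. The sketch below computes the SVM+ objective and checks the
constraints for given <script type="math/tex">w</script>, <script type="math/tex">b</script>, <script type="math/tex">w^{*}</script> and <script type="math/tex">b^{*}</script>. It does not
solve anything, it uses linear feature maps for simplicity, and the toy data and
function name are hypothetical.</p>

```python
import numpy as np

def svm_plus_objective(w, b, w_star, b_star, X, X_star, y, C, gamma):
    """Return the SVM+ objective and whether (w, b, w*, b*) is feasible."""
    slack = X_star @ w_star + b_star        # slacks modelled from x*
    margins = y * (X @ w + b)
    feasible = bool(np.all(margins >= 1.0 - slack) and np.all(slack >= 0.0))
    objective = 0.5 * (w @ w + gamma * (w_star @ w_star)) + C * slack.sum()
    return objective, feasible

# Hypothetical toy data: the privileged feature flags the one hard example.
X = np.array([[2.0], [0.5], [-2.0]])
X_star = np.array([[0.0], [1.0], [0.0]])
y = np.array([1.0, 1.0, -1.0])

objective, feasible = svm_plus_objective(
    w=np.array([1.0]), b=0.0,
    w_star=np.array([0.5]), b_star=0.0,
    X=X, X_star=X_star, y=y, C=1.0, gamma=1.0,
)
print(objective, feasible)
```

<p>Here the privileged feature is nonzero only for the example that sits inside
the margin, so the modelled slack is spent exactly where it is needed.</p>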
<p>Again, the method of actually solving this optimization problem involves
Lagrange multipliers and quadratic programming, but I think the intuition is
captured in the optimization problem statement.</p>
<h2 id="interpretation-of-svm">Interpretation of SVM+</h2>
<p>The SVM+ has a very ready interpretation. Instead of a single feature space, it
has two: one in which the non-privileged information lives (where decisions are
made), and one in which the privileged information lives (where slack variables
are estimated).</p>
<p>But what’s the point of this second feature space? How does it help us? Vapnik
terms this problem <em>knowledge transfer</em>: it’s all well and good for us to learn
from the privileged information, but it’s all for naught if we can’t use this
newfound knowledge in the test phase.</p>
<p>The way knowledge transfer is resolved here is by assuming that <em>examples in the
training set that are hard to separate in the privileged space, are also hard to
separate in the regular space</em>. Therefore, we can use the privileged information
to obtain an estimate for the slack variables.</p>
<p>Of course, SVMs are a technique with many possible interpretations, of which my
presentation (in terms of the optimization of <script type="math/tex">w</script> and <script type="math/tex">b</script>) is just one. For
example, it’s possible to think of SVMs in terms of kernel functions, or as
linear classifiers minimizing hinge loss. In all cases, it’s possible and
worthwhile to understand that interpretation of SVMs, and how the LUPI paradigm
contributes to or extends that interpretation. I’m hoping to write a piece later
to explain these exact topics.</p>
<p>Vapnik also puts a great emphasis on analyzing SVM+ based on its statistical
learning theoretic properties (in particular, analyzing its rate of convergence
via the <a href="https://en.wikipedia.org/wiki/VC_dimension">VC dimension</a>). Vapnik was
one of the main pioneers behind statistical learning theory, and has written an
<a href="https://www.amazon.com/Statistical-Learning-Theory-Vladimir-Vapnik/dp/0471030031">entire
book</a>
on this stuff <del>which I have not read</del>, so I’ll leave that part aside for now. I
hope to understand this stuff one day.</p>
<h2 id="implementation-of-svm">Implementation of SVM+</h2>
<p>There’s just one catch: SVM+ is actually a fairly inefficient algorithm, and
definitely will not scale to large data sets. What’s so bad about it? <em>It has
<script type="math/tex">n</script> training examples but <script type="math/tex">2n</script> variables to estimate.</em> This is twice as many
variables to estimate as the standard formulation of the <a href="https://en.wikipedia.org/wiki/Support_vector_machine#Computing_the_SVM_classifier">vanilla
SVM</a>.
This isn’t something that we can patch: the problem is inherent to the
Lagrangian dual formulation that Vapnik and Vashist proposed in 2009.</p>
<p>Even worse, the optimization problem has constraints that are very different
from those of the standard SVM. In essence, this means that efficient,
out-of-the-box solvers for the standard SVM (e.g.
<a href="https://www.csie.ntu.edu.tw/~cjlin/libsvm/">LIBSVM</a> and
<a href="https://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a>) can’t be used to
train an SVM+ model.</p>
<p>Luckily, <a href="https://www.researchgate.net/publication/301880839_Simple_and_Efficient_Learning_using_Privileged_Information">a recent paper by Xu et
al.</a>
describes a neat mathematical trick to implement SVM+ in a simple and efficient
way. With this amendment, the authors rechristen the algorithm as SVM2+.
Essentially, instead of using the hinge loss when training SVM+, we will instead
use the <em>squared</em> hinge loss. It turns out that changing the loss function in
this way leads to a tiny miracle.</p>
<p>This (re)formulation of SVM+ becomes <em>identical</em> to that of the standard SVM,
except we replace the Gram matrix (a.k.a. kernel matrix) <script type="math/tex">\bf K</script> by <script type="math/tex">\bf K +
\bf Q_\lambda \odot (\bf y y^t)</script>, where</p>
<ul>
<li><script type="math/tex">\bf y</script> is the target vector</li>
<li><script type="math/tex">\odot</script> denotes the Hadamard product</li>
<li><script type="math/tex">\bf{Q_\lambda}</script> is given by <script type="math/tex">Q_\lambda = \frac{1}{\lambda} (\tilde{K}
(\frac{\lambda}{C} I_n + \tilde{K})^{-1} \tilde{K})</script>, and</li>
<li><script type="math/tex">\bf \tilde{K}</script> is the Gram matrix formed by the privileged information</li>
</ul>
<p>So by replacing the hinge loss with the squared hinge loss, the SVM+ formulation
can now be solved with existing libraries!</p>
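<p>To make the kernel replacement concrete, here is a minimal NumPy sketch that builds the modified Gram matrix exactly as the formula above is written. The function name <code class="highlighter-rouge">svm2plus_gram</code>, the linear kernels and the toy data are my own assumptions for illustration; this is not code from Xu et al.</p>

```python
import numpy as np

def svm2plus_gram(K, K_priv, y, lam, C):
    """Return K + Q_lambda (Hadamard product) y y^T, following the formula quoted above."""
    n = K.shape[0]
    # Q_lambda = (1 / lam) * K~ (lam / C * I_n + K~)^{-1} K~
    # np.linalg.solve avoids forming the inverse explicitly.
    Q = K_priv @ np.linalg.solve((lam / C) * np.eye(n) + K_priv, K_priv) / lam
    return K + Q * np.outer(y, y)  # "*" is the elementwise (Hadamard) product

# Toy data (hypothetical): 6 examples, 3 ordinary and 2 privileged features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X_priv = rng.normal(size=(6, 2))
y = np.array([1, -1, 1, -1, 1, -1])

K = X @ X.T                 # linear kernel on the ordinary features
K_priv = X_priv @ X_priv.T  # linear kernel on the privileged information
K_new = svm2plus_gram(K, K_priv, y, lam=1.0, C=1.0)
```

<p>The resulting matrix is symmetric and has the same shape as the original Gram matrix, which is what lets it stand in for the usual kernel matrix.</p>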
<h2 id="extensions-to-svm">Extensions to SVM+</h2>
<p>In his paper, Vapnik makes it clear that LUPI is a very general and abstract
paradigm, and as such there is plenty of room for creativity and innovation -
not just in researching and developing new LUPI methods and algorithms, but also
in implementing and applying them. It is unknown how to best go about supplying
privileged information so as to get good performance. How should the data be
feature engineered? How much signal should be in the privileged information?
These are all open questions.</p>
<p>Vapnik himself opens up three avenues to extend the SVM+ algorithm:</p>
<ol>
<li><em>a mixture model of slacks:</em> when slacks are estimated by a mixture of a
smooth function and some prior</li>
<li><em>a model where privileged information is available only for a part of the
training data:</em> where we can only supply privileged information on a small
subset of the training examples</li>
<li><em>multiple-space privileged information:</em> where the privileged information we
can supply does not all share the same features</li>
</ol>
<p>Clearly, there’s a lot of potential in the LUPI paradigm, as well as a lot of
reasons to be skeptical. It’s very much a nascent perspective of machine
learning, so I’m interested in keeping an eye on it going forward. I’m hoping
to write more posts on LUPI in the future!</p>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngHere's a short story you might know: you have a black box, whose name is Machine Learning Algorithm. It's got two modes: training mode and testing mode.Linear Discriminant Analysis for Starters2017-12-16T00:00:00+00:002017-12-30T00:00:00+00:00https://eigenfoo.xyz/lda<p><em>Linear discriminant analysis</em> (commonly abbreviated to LDA, and not to be
confused with <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">the other
LDA</a>) is a very
common dimensionality reduction technique for classification problems. However,
that’s something of an understatement: it does so much more than “just”
dimensionality reduction.</p>
<p>In plain English, if you have high-dimensional data (i.e. a large number of
features) from which you wish to classify observations, LDA will help you
transform your data so as to make the classes as distinct as possible. More
rigorously, LDA will find the linear projection of your data into a
lower-dimensional subspace that optimizes some measure of class separation. The
dimension of this subspace is necessarily strictly less than the number of
classes.</p>
<p>This separation-maximizing property of LDA makes it so good at its job that it’s
sometimes considered a classification algorithm in and of itself, which leads to
some confusion. <em>Linear discriminant analysis</em> is a form of dimensionality
reduction, but with a few extra assumptions, it can be turned into a classifier.
(Avoiding these assumptions gives its relative, <em>quadratic discriminant
analysis</em>, but more on that later). Somewhat confusingly, some authors call the
dimensionality reduction technique “discriminant analysis”, and only prepend the
“linear” once we begin classifying. I actually like this naming convention more
(it tracks the mathematical assumptions a bit better, I think), but most people
nowadays call the entire technique “LDA”, so that’s what I’ll call it.</p>
<p>The goal of this post is to give a comprehensive introduction to, and
explanation of, LDA. I’ll look at LDA in three ways:</p>
<ol>
<li>LDA as an algorithm: what does it do, and how does it do it?</li>
<li>LDA as a theorem: a mathematical derivation of LDA</li>
<li>LDA as a machine learning technique: practical considerations when using LDA</li>
</ol>
<p>This is a lot for one post, but my hope is that there’s something in here for
everyone.</p>
<h2 id="lda-as-an-algorithm">LDA as an Algorithm</h2>
<h3 id="problem-statement">Problem statement</h3>
<p>Before we dive into LDA, it’s good to get an intuitive grasp of what LDA
tries to accomplish.</p>
<p>Suppose that:</p>
<ol>
<li>You have very high-dimensional data, and that</li>
<li>You are dealing with a classification problem</li>
</ol>
<p>This could mean that the number of features is greater than the number of
observations, or it could mean that you suspect there are noisy features that
contain little information, or anything in between.</p>
<p>Given that this is the problem at hand, you wish to accomplish two things:</p>
<ol>
<li>Reduce the number of features (i.e. reduce the dimensionality of your feature
space), and</li>
<li>Preserve (or even increase!) the “distinguishability” of your classes or the
“separatedness” of the classes in your feature space.</li>
</ol>
<p>This is the problem that LDA attempts to solve. It should be fairly obvious why
this problem might be worth solving.</p>
<p>To judiciously appropriate a term from signal processing, we are interested in
increasing the signal-to-noise ratio of our data, by both extracting or
synthesizing features that are useful in classifying our data (amplifying our
signal), and throwing out the features that are not as useful (attenuating our
noise).</p>
<p>Below is a simple illustration I made, inspired by <a href="https://www.quora.com/Can-you-explain-the-comparison-between-principal-component-analysis-and-linear-discriminant-analysis-in-dimensionality-reduction-with-MATLAB-code-Which-one-is-more-efficient">Sebastian
Raschka</a>,
that may help our intuition about the problem:</p>
<p><a href="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png"><img style="float: middle" width="500" height="500" src="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png" /></a></p>
<p>A couple of points to make:</p>
<ul>
<li>LD1 and LD2 are among the projections that LDA would consider. In reality, LDA
would consider <em>all possible</em> projections, not just those along the x and y
axes.</li>
<li>LD1 is the one that LDA would actually come up with: this projection gives the
best “separation” of the two classes.</li>
<li>LD2 is a horrible projection by this metric: both classes get horribly
overlapped… (this actually relates to PCA, but more on that later)</li>
</ul>
<p><strong>UPDATE:</strong> For another illustration, Rahul Sangole made a simple but great
interactive visualization of LDA
<a href="https://rsangole.shinyapps.io/LDA_Visual/">here</a> using
<a href="https://shiny.rstudio.com/">Shiny</a>.</p>
<h3 id="solution">Solution</h3>
<p>First, some definitions:</p>
<p>Let:</p>
<ul>
<li><script type="math/tex">n</script> be the number of classes</li>
<li><script type="math/tex">\mu</script> be the mean of all observations</li>
<li><script type="math/tex">N_i</script> be the number of observations in the <script type="math/tex">i</script>th class</li>
<li><script type="math/tex">\mu_i</script> be the mean of the <script type="math/tex">i</script>th class</li>
<li><script type="math/tex">\Sigma_i</script> be the covariance matrix of the <script type="math/tex">i</script>th class</li>
</ul>
<p>Now, define <script type="math/tex">S_W</script> to be the <em>within-class scatter matrix</em>, given by</p>
<script type="math/tex; mode=display">\begin{align*}
S_W = \sum_{i=1}^{n}{\Sigma_i}
\end{align*}</script>
<p>and define <script type="math/tex">S_B</script> to be the <em>between-class scatter matrix</em>, given by</p>
<script type="math/tex; mode=display">\begin{align*}
S_B = \sum_{i=1}^{n}{N_i (\mu_i - \mu) (\mu_i - \mu)^T}
\end{align*}</script>
<p><a href="https://en.wikipedia.org/wiki/Diagonalizable_matrix">Diagonalize</a> <script type="math/tex">S_W^{-1}
S_B</script> to get its eigenvalues and eigenvectors.</p>
<p>Pick the <script type="math/tex">k</script> largest eigenvalues, and their associated eigenvectors. We will
project our observations onto the subspace spanned by these vectors.</p>
<p>Concretely, what this means is that we form the matrix <script type="math/tex">A</script>, whose columns are the
<script type="math/tex">k</script> eigenvectors chosen above. <script type="math/tex">A</script> will allow us to transform our
observations into the new subspace via the equation <script type="math/tex">y = A^T x</script>, where <script type="math/tex">y</script> is
our transformed observation, and <script type="math/tex">x</script> is our original observation.</p>
<p>And that’s it!</p>
<p>For a more detailed and intuitive explanation of the LDA “recipe”, see
<a href="http://sebastianraschka.com/Articles/2014_python_lda.html">Sebastian Raschka’s blog post on
LDA</a>.</p>
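<p>The recipe above can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions (the function name <code class="highlighter-rouge">lda_projection</code> and the synthetic two-class data are made up), and it follows the definitions in this post rather than any particular library’s implementation.</p>

```python
import numpy as np

def lda_projection(X, labels, k):
    """Return the d-by-k matrix A whose columns are the top-k eigenvectors."""
    d = X.shape[1]
    mu = X.mean(axis=0)                      # mean of all observations
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        X_c = X[labels == c]
        mu_c = X_c.mean(axis=0)
        S_W += np.cov(X_c, rowvar=False)     # Sigma_i for this class
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)
    # Eigendecomposition of S_W^{-1} S_B (solve avoids forming the inverse)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:k]   # indices of the k largest eigenvalues
    return eigvecs.real[:, order]

# Synthetic data: classes differ along x, but y carries most of the variance
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[0.5, 3.0], size=(100, 2)),
    rng.normal(loc=[3.0, 0.0], scale=[0.5, 3.0], size=(100, 2)),
])
labels = np.repeat([0, 1], 100)

A = lda_projection(X, labels, k=1)
Y = X @ A    # each row is y = A^T x
```

<p>On this toy data the single column of <code class="highlighter-rouge">A</code> points almost entirely along the x-axis, which is the direction that separates the two classes.</p>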
<h2 id="lda-as-a-theorem">LDA as a Theorem</h2>
<p><strong>Sketch of Derivation:</strong></p>
<p>In order to maximize class separability, we need some way of measuring it as a
number. This number should be bigger when the between-class scatter is bigger,
and smaller when the within-class scatter is larger. There are many such
formulas/numbers that have this property: <a href="https://www.elsevier.com/books/introduction-to-statistical-pattern-recognition/fukunaga/978-0-08-047865-4">Fukunaga’s <em>Introduction to
Statistical Pattern
Recognition</em></a>
considers no less than four! Here, we’ll concern ourselves with just one:</p>
<script type="math/tex; mode=display">J_1 = tr(S_{WY}^{-1} S_{BY})</script>
<p>where I denote the within and between-class scatter matrices of the projected
vector <script type="math/tex">Y</script> by <script type="math/tex">S_{WY}</script> and <script type="math/tex">S_{BY}</script>, to avoid confusion with the
corresponding matrices for the original vector <script type="math/tex">X</script>.</p>
<p>Now, a standard result from probability is that for any random variable <script type="math/tex">X</script>
and matrix <script type="math/tex">A</script>, we have <script type="math/tex">cov(A^T X) = A^T cov(X) A</script>. We’ll apply this
result to our projection <script type="math/tex">y = A^T x</script>. It follows that</p>
<script type="math/tex; mode=display">S_{WY} = A^T S_{WX} A</script>
<p>and</p>
<script type="math/tex; mode=display">S_{BY} = A^T S_{BX} A</script>
<p>where <script type="math/tex">S_{BX}</script> and <script type="math/tex">S_{BY}</script> are the between-class scatter matrices, and
<script type="math/tex">S_{WX}</script> and <script type="math/tex">S_{WY}</script> are the within-class scatter matrices, for <script type="math/tex">X</script>
and its projection <script type="math/tex">Y</script>, respectively.</p>
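<p>If the covariance identity feels abstract, it is easy to check numerically. The snippet below is a small sanity check on made-up data (the matrix <code class="highlighter-rouge">A</code> here is an arbitrary random projection, not one produced by LDA).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # rows are observations x
A = rng.normal(size=(4, 2))                               # arbitrary projection matrix
Y = X @ A                                                 # each row is y = A^T x

cov_X = np.cov(X, rowvar=False)
cov_Y = np.cov(Y, rowvar=False)

# cov(A^T X) should equal A^T cov(X) A, up to floating-point error
identity_holds = np.allclose(cov_Y, A.T @ cov_X @ A)
```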
<p>It’s now a simple matter to write <script type="math/tex">J_1</script> in terms of <script type="math/tex">A</script>, and maximize
<script type="math/tex">J_1</script>. Without going into the details, we set <script type="math/tex">\frac{\partial J_1}{\partial
A} = 0</script> (whatever that means), and use the fact that <a href="https://math.stackexchange.com/questions/546155/proof-that-the-trace-of-a-matrix-is-the-sum-of-its-eigenvalues">the trace of a matrix is
the sum of its
eigenvalues</a>.</p>
<p>I don’t want to go into the weeds with this here, but if you really want to see
the algebra, Fukunaga is a great resource. The end result, however, is the same
condition on the eigenvalues and eigenvectors as stated above: in other words,
the optimization gives us LDA as presented.</p>
<p>There’s one more quirk of LDA that’s very much worth knowing. Suppose you have
10 classes, and you run LDA. It turns out that the <em>maximum</em> number of features
LDA can give you is one less than the number of classes, so in this case, 9!</p>
<p><strong>Proposition:</strong> <script type="math/tex">S_W^{-1} S_B</script> has at most <script type="math/tex">n-1</script> non-zero eigenvalues, which
implies that LDA must reduce the dimension to <em>at most</em> <script type="math/tex">n-1</script>.</p>
<p>To prove this, we first need a lemma.</p>
<p><strong>Lemma:</strong> Suppose <script type="math/tex">\{v_i\}_{i=1}^{n}</script> is a set of linearly dependent vectors, and
let <script type="math/tex">\alpha_i</script> be <script type="math/tex">n</script> coefficients. Then, <script type="math/tex">M = \sum_{i=1}^{n}{\alpha_i v_i
v_i^{T}}</script>, a linear combination of outer products of the vectors with
themselves, has rank at most <script type="math/tex">n-1</script>.</p>
<p><strong>Proof:</strong> The row space of <script type="math/tex">M</script> is contained in the span of the vectors <script type="math/tex">\{v_1, v_2,
\ldots, v_n\}</script>. However, because this set of vectors is linearly dependent, it must
span a vector space of dimension strictly less than <script type="math/tex">n</script>, or in other words
less than or equal to <script type="math/tex">n-1</script>. But the dimension of the row space is precisely
the rank of the matrix <script type="math/tex">M</script>. Thus, <script type="math/tex">rank(M) \leq n-1</script>, as desired.</p>
<p>With the lemma, we’re now ready to prove our proposition.</p>
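<p><strong>Proof:</strong> Writing <script type="math/tex">N = \sum_{i=1}^{n}{N_i}</script> for the total number of observations, the overall mean is the weighted average of the class means, so we have that</p>
<script type="math/tex; mode=display">\begin{align*}
\frac{1}{N} \sum_{i=1}^{n}{N_i \mu_i} = \mu \implies \sum_{i=1}^{n}{N_i (\mu_i-\mu)} = 0
\end{align*}</script>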
<p>So <script type="math/tex">\{\mu_i-\mu\}_{i=1}^{n}</script> is a linearly dependent set. Applying our lemma, we
see that</p>
<script type="math/tex; mode=display">S_B = \sum_{i=1}^{n}{N_i (\mu_i-\mu)(\mu_i-\mu)^{T}}</script>
<p>must be rank deficient. Thus, <script type="math/tex">rank(S_B) \leq n-1</script>. Now, <script type="math/tex">rank(AB) \leq
\min{(rank(A), rank(B))}</script>, so</p>
<script type="math/tex; mode=display">\begin{align*}
rank(S_W^{-1}S_B) \leq \min{(rank(S_W^{-1}), rank(S_B))} \leq n-1
\end{align*}</script>
<p>as desired.</p>
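<p>The proposition is also easy to check empirically. Below is a small sketch on synthetic data (3 classes in 5 dimensions, all numbers made up) that counts the eigenvalues of <code class="highlighter-rouge">S_W^{-1} S_B</code> that are non-zero up to floating-point noise; the relative tolerance of 1e-8 is an arbitrary choice of mine.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, d, per_class = 3, 5, 50
centers = rng.normal(scale=3.0, size=(n_classes, d))
X = np.vstack([rng.normal(loc=c, size=(per_class, d)) for c in centers])
labels = np.repeat(np.arange(n_classes), per_class)

mu = X.mean(axis=0)
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in range(n_classes):
    X_c = X[labels == c]
    diff = (X_c.mean(axis=0) - mu).reshape(-1, 1)
    S_W += np.cov(X_c, rowvar=False)
    S_B += len(X_c) * (diff @ diff.T)

eigvals = np.abs(np.linalg.eigvals(np.linalg.solve(S_W, S_B)))
# Count eigenvalues that are non-zero up to floating-point noise
n_nonzero = int(np.sum(eigvals > 1e-8 * eigvals.max()))
```

<p>Even though the data live in 5 dimensions, <code class="highlighter-rouge">n_nonzero</code> never exceeds <code class="highlighter-rouge">n_classes - 1</code>.</p>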
<h2 id="lda-as-a-machine-learning-technique">LDA as a Machine Learning Technique</h2>
<p>OK so we’re done with the math, but how is LDA actually used in practice? One of
the easiest ways is to look at how LDA is actually implemented in the real
world. <code class="highlighter-rouge">scikit-learn</code> has <a href="http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis">a very well-documented implementation of
LDA</a>:
I find that reading the docs is a great way to learn stuff.</p>
<p>Below are a few miscellaneous comments on practical considerations when using
LDA.</p>
<h3 id="regularization-aka-shrinkage">Regularization (a.k.a. shrinkage)</h3>
<p><code class="highlighter-rouge">scikit-learn</code>’s implementation of LDA has an interesting optional parameter:
<code class="highlighter-rouge">shrinkage</code>. What’s that about?</p>
<p><a href="https://stats.stackexchange.com/questions/106121/does-it-make-sense-to-combine-pca-and-lda/109810#109810">Here’s a wonderful Cross Validated
post</a>
on how LDA can introduce overfitting. In essence, matrix inversion is an
extremely sensitive operation (in that small changes in the matrix may lead to
large changes in its inverse, so that even a tiny bit of noise will be amplified
upon inverting the matrix), and so unless the estimate of the within-class
scatter matrix <script type="math/tex">S_W</script> is very good, its inversion is likely to introduce
overfitting.</p>
<p>One way to combat that is through regularizing LDA. It basically replaces
<script type="math/tex">S_W</script> with <script type="math/tex">(1-t)S_W + tI</script>, where <script type="math/tex">I</script> is the identity matrix, and <script type="math/tex">t</script> is
the <em>regularization parameter</em>, or the <em>shrinkage constant</em>. That’s what
<code class="highlighter-rouge">scikit-learn</code>’s <code class="highlighter-rouge">shrinkage</code> parameter is: it’s <script type="math/tex">t</script>.</p>
<p>If you’re interested in <em>why</em> this linear combination of the within-class
scatter and the identity gives such a well-conditioned estimate of <script type="math/tex">S_W</script>, check
out <a href="https://www.sciencedirect.com/science/article/pii/S0047259X03000964">the original paper by Ledoit and
Wolf</a>.
Their original motivation was in financial portfolio optimization, so they’ve
also authored several other papers
(<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=433840&rec=1&srcabs=290916&alg=7&pos=6">here</a>
and <a href="https://www.sciencedirect.com/science/article/pii/S0927539803000070">here</a>)
that go into the more financial details. That needn’t concern us though:
covariance matrices are literally everywhere.</p>
<p>For an illustration of this, <code class="highlighter-rouge">amoeba</code>’s post on Cross Validated gives a good
example of LDA overfitting, and how regularization can help combat that.</p>
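<p>To see why shrinkage helps with the inversion, here is a rough NumPy sketch that applies the formula <code class="highlighter-rouge">(1-t) S_W + t I</code> to a badly estimated scatter matrix and compares condition numbers. The data, the single-class stand-in for <code class="highlighter-rouge">S_W</code> and the value <code class="highlighter-rouge">t = 0.1</code> are all assumptions for illustration. (In <code class="highlighter-rouge">scikit-learn</code> itself, the <code class="highlighter-rouge">shrinkage</code> parameter is only supported by the <code class="highlighter-rouge">lsqr</code> and <code class="highlighter-rouge">eigen</code> solvers.)</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 20))      # 8 observations of 20 features: far too few samples
S_W = np.cov(X, rowvar=False)     # rank is at most 7, so this estimate is singular

t = 0.1                           # shrinkage constant (arbitrary choice)
S_shrunk = (1 - t) * S_W + t * np.eye(S_W.shape[0])

cond_raw = np.linalg.cond(S_W)          # astronomically large (or infinite)
cond_shrunk = np.linalg.cond(S_shrunk)  # modest, so inversion is stable
```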
<h3 id="lda-as-a-classifier">LDA as a classifier</h3>
<p>We’ve talked a lot about how LDA is a dimensionality reduction technique. But in
addition to it, you can make two extra assumptions, and LDA becomes a very
robust classifier as well! Here they are:</p>
<ol>
<li>Assume that the class conditional distributions are Gaussian, and</li>
<li>Assume that these Gaussians have the same covariance matrix (a.k.a.
assume <a href="https://en.wikipedia.org/wiki/Homoscedasticity">homoskedasticity</a>)</li>
</ol>
<p>Now, <em>how</em> LDA acts as a classifier is a bit complicated: the problem is solved
fairly easily if there are only two classes. In this case, the optimal Bayesian
solution is to classify the observation depending on whether the log of the
likelihood ratio is less than or greater than some threshold. This turns out to
be a simple dot product: <script type="math/tex">\vec{w} \cdot \vec{x} > c</script>, where <script type="math/tex">\vec{w} =
\Sigma^{-1} (\vec{\mu_1} - \vec{\mu_2})</script>. <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis#LDA_for_two_classes">Wikipedia has a good derivation of
this</a>.</p>
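<p>Here is a minimal sketch of the two-class rule on synthetic Gaussian data. The threshold <code class="highlighter-rouge">c</code> used below is the midpoint rule, which assumes equal class priors; that choice, the data and the helper <code class="highlighter-rouge">predict</code> are my own assumptions for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 2.0]])        # shared covariance (homoskedasticity)
mu_1, mu_2 = np.array([2.0, 1.0]), np.array([-2.0, -1.0])
X_1 = rng.multivariate_normal(mu_1, cov, size=200)
X_2 = rng.multivariate_normal(mu_2, cov, size=200)

# Estimate the class means and the pooled covariance (equal class sizes)
m_1, m_2 = X_1.mean(axis=0), X_2.mean(axis=0)
Sigma = 0.5 * (np.cov(X_1, rowvar=False) + np.cov(X_2, rowvar=False))

w = np.linalg.solve(Sigma, m_1 - m_2)   # w = Sigma^{-1} (mu_1 - mu_2)
c = w @ (m_1 + m_2) / 2                 # threshold, assuming equal priors

def predict(X):
    """Assign class 1 when w . x > c, and class 2 otherwise."""
    return np.where(X @ w > c, 1, 2)

accuracy = np.mean(np.concatenate([predict(X_1) == 1, predict(X_2) == 2]))
```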
<p>There isn’t really a nice dot-product solution for the multiclass case. So,
what’s commonly done is to take a “one-against-the-rest” approach, in which
there are <script type="math/tex">k</script> binary classifiers, one for each of the <script type="math/tex">k</script> classes. Another
common technique is to take a pairwise approach, in which there are <script type="math/tex">k(k-1)/2</script>
classifiers, one for each pair of classes. In either case, the outputs of all
the classifiers are combined in some way to give the final classification.</p>
<h3 id="close-relatives-pca-qda-anova">Close relatives: PCA, QDA, ANOVA</h3>
<p>LDA is similar to a lot of other techniques, and the fact that they all go by
acronyms doesn’t do anyone a favor. My goal here isn’t to introduce or explain
these various techniques, but rather point out their differences.</p>
<p><em>1) Principal components analysis (PCA):</em></p>
<p>LDA is very similar to <a href="http://setosa.io/ev/principal-component-analysis">PCA</a>:
in fact, the question posted in the Cross Validated post above was actually
about whether or not it would make sense to perform PCA followed by LDA.</p>
<p>There is a crucial difference between the two techniques, though. PCA tries to
find the axes with <em>maximum variance</em> for the whole data set, whereas LDA tries
to find the axes for best <em>class separability</em>.</p>
<p><a href="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png"><img style="float: middle" width="500" height="500" src="https://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/lda-pic.png" /></a></p>
<p>Going back to the illustration from before (reproduced above), it’s not
hard to see that PCA would give us LD2, whereas LDA would give us LD1. This
makes the main difference between PCA and LDA painfully obvious: just because a
feature has a high variance, doesn’t mean that it’s predictive of the classes!</p>
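<p>The same contrast can be reproduced numerically. In the sketch below (synthetic data, my own construction), the two classes differ only along the x-axis while the y-axis carries most of the variance. For two classes, the LDA direction is proportional to <code class="highlighter-rouge">S_W^{-1} (mu_a - mu_b)</code>, so no eigendecomposition is needed for it.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X_a = rng.normal(loc=[0.0, 0.0], scale=[0.5, 3.0], size=(200, 2))
X_b = rng.normal(loc=[3.0, 0.0], scale=[0.5, 3.0], size=(200, 2))
X = np.vstack([X_a, X_b])

# PCA: leading eigenvector of the covariance of the whole data set
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = eigvecs[:, -1]      # eigh returns eigenvalues in ascending order

# LDA (two classes): direction proportional to S_W^{-1} (mu_a - mu_b)
S_W = np.cov(X_a, rowvar=False) + np.cov(X_b, rowvar=False)
lda_dir = np.linalg.solve(S_W, X_a.mean(axis=0) - X_b.mean(axis=0))
lda_dir /= np.linalg.norm(lda_dir)
```

<p>Here <code class="highlighter-rouge">pca_dir</code> lines up with the noisy y-axis, while <code class="highlighter-rouge">lda_dir</code> lines up with the x-axis that actually separates the classes.</p>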
<p><em>2) Quadratic discriminant analysis (QDA):</em></p>
<p>QDA is a generalization of LDA as a classifier. As mentioned above, LDA must
assume that the class conditional distributions are Gaussian with the same
covariance matrix, if we want it to do any classification for us.</p>
<p>QDA doesn’t make this homoskedasticity assumption (assumption number 2 above),
and attempts to estimate the covariance of all classes. While this might seem
like a more robust algorithm (fewer assumptions! Occam’s razor!), this means
there is a much larger number of parameters to estimate. In fact, every class needs its own
covariance matrix, so the number of parameters grows linearly with the number of classes
and quadratically with the number of features! So unless you can
guarantee that your covariance estimates are reliable, you might not want to use
QDA.</p>
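<p>A quick back-of-the-envelope count makes the difference tangible. A symmetric covariance matrix over <code class="highlighter-rouge">d</code> features has <code class="highlighter-rouge">d(d+1)/2</code> free entries; LDA estimates one such matrix, whereas QDA estimates one per class. The helper below is hypothetical, and the 50 features and 10 classes are made-up numbers.</p>

```python
def covariance_parameters(n_features, n_classes, shared):
    """Count the free covariance entries that must be estimated."""
    per_matrix = n_features * (n_features + 1) // 2  # symmetric matrix
    return per_matrix if shared else n_classes * per_matrix

lda_params = covariance_parameters(50, 10, shared=True)    # one shared matrix
qda_params = covariance_parameters(50, 10, shared=False)   # one matrix per class
```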
<p>After all of this, there might be some confusion about the relationship between
LDA, QDA, what’s for dimensionality reduction, what’s for classification, etc.
<a href="https://stats.stackexchange.com/questions/71489/three-versions-of-discriminant-analysis-differences-and-how-to-use-them/71571#71571">This Cross Validated
post</a>,
and everything that it links to, might help clear things up.</p>
<p><em>3) Analysis of variance (ANOVA):</em></p>
<p>LDA and <a href="https://en.wikipedia.org/wiki/Analysis_of_variance">ANOVA</a> seem to have
similar aims: both try to “decompose” an observed variable into several
explanatory/discriminatory variables. However, there is an important difference
that <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis">the Wikipedia article on
LDA</a> puts very
succinctly (my emphases):</p>
<blockquote>
<p>LDA is closely related to analysis of variance (ANOVA) and regression
analysis, which also attempt to express one dependent variable as a linear
combination of other features or measurements. However, ANOVA uses
<strong>categorical</strong> independent variables and a <strong>continuous</strong> dependent variable,
whereas discriminant analysis has <strong>continuous</strong> independent variables and a
<strong>categorical</strong> dependent variable (i.e. the class label).</p>
</blockquote>George Hohttps://raw.githubusercontent.com/eigenfoo/eigenfoo.xyz/master/assets/images/email.pngLinear discriminant analysis (commonly abbreviated to LDA, and not to be confused with latent Dirichlet allocation) is a very common dimensionality reduction technique for classification problems.Portfolio Risk Analytics and Performance Attribution with Pyfolio2017-08-23T00:00:00+00:002017-12-16T00:00:00+00:00https://eigenfoo.xyz/pyfolio<p><img style="float: right; margin: 20px 20px" src="https://camo.githubusercontent.com/3b820de5af1d3e62ecdd614349abd46f4d46d7d6/68747470733a2f2f6d656469612e7175616e746f7069616e2e636f6d2f6c6f676f732f6f70656e5f736f757263652f7079666f6c696f2d6c6f676f2d30332e706e67" /></p>
<p>I was lucky enough to have the chance to intern at
<a href="https://www.quantopian.com/">Quantopian</a> this summer. During that time I
contributed some exciting stuff to their open-source portfolio analytics engine,
<a href="https://github.com/quantopian/pyfolio"><code class="highlighter-rouge">pyfolio</code></a>, and learnt a truckload of
stuff while doing it! In this blog post, I’ll describe and walk through two of
the new features that I authored: the risk and performance attribution tear
sheets.</p>
<h2 id="risk-analytics">Risk Analytics</h2>
<p>A well-known truth of algorithmic trading is that it’s insufficient to merely
maximize the returns of your algorithm: you must also do so while minimizing the
risk it takes on board. This idea is probably most famously codified in the
<a href="https://en.wikipedia.org/wiki/Sharpe_ratio#Definition">Sharpe ratio</a>, which
divides the excess returns by the volatility of the returns stream in order to give a measure of
the “risk-adjusted returns”.</p>
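<p>As a quick illustration, here is a minimal sketch of an annualized Sharpe ratio. The zero risk-free rate, the 252 trading days per year and the toy returns are assumptions on my part, and this is not <code class="highlighter-rouge">pyfolio</code>’s implementation.</p>

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return divided by the volatility of returns."""
    excess = np.asarray(returns, dtype=float) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

r = np.array([0.010, -0.005, 0.020, 0.000, 0.015])   # made-up daily returns
r_wide = r.mean() + 2 * (r - r.mean())                # same mean, twice the volatility

sharpe = sharpe_ratio(r)
sharpe_wide = sharpe_ratio(r_wide)                    # exactly half of `sharpe`
```

<p>Two return streams with the same average return can therefore have very different Sharpe ratios, depending purely on how volatile they are.</p>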
<p>However, the volatility of returns is a rather poor proxy for the amount of
“risk” that an algorithm takes on. What if our algo loaded all of its money in
the real estate sector? What if the algo shorted extremely large-cap stocks?
What if half of our portfolio is in illiquid, impossible-to-exit positions?</p>
<p>These are all “risky” behaviors for an algorithm to have, and we’d like to know
about and understand this kind of behavior before we seriously consider investing
money in the algo. However, these formulations of risk are neither captured nor
quantified by the volatility of returns (as in the Sharpe ratio). Finally,
there is no easy, free, open-source way to get this sort of analysis.</p>
<p>Enter <code class="highlighter-rouge">pyfolio</code>’s new risk tear sheet! It addresses all the problems outlined
above, and more. Let’s jump right in with an example.</p>
<p><a href="https://user-images.githubusercontent.com/19851673/27990609-375258e2-642a-11e7-9f51-76aa8c309ad1.png"><img align="middle" src="https://user-images.githubusercontent.com/19851673/27990609-375258e2-642a-11e7-9f51-76aa8c309ad1.png" /></a></p>
<p><em>(Note: this example risk tear sheet came from the <a href="https://github.com/quantopian/pyfolio/pull/391">original pull
request</a>, and may therefore be
out of date)</em></p>
<p>The first 4 plots show the exposure to common style factors: specifically, the
size of the company (natural log of the market cap), mean reversion (measured
by the <a href="http://www.investopedia.com/terms/m/macd.asp">MACD Signal</a>), long-term
momentum, and volatility.
A style factor is best explained with examples: mean reversion, momentum,
volatility and the Fama-French canonical factors (SMB, HML, UMD) are all
examples of style factors. They are factors that indicate broad market trends
(instead of being characteristic to individual stocks, like sectors or market
caps) and characterize a particular <em>style</em> of investing (e.g. mean reversion,
trend-following strategies, etc.).
The analysis is not limited to 4 style factors, though: <code class="highlighter-rouge">pyfolio</code> will handle
as many as you pass in (but see below for a possible complication). As we can
see, the algorithm has a significant exposure to the MACD signal, which may or
may not worry us. For instance, it wouldn’t worry us if we knew that it was a
mean-reversion algo, but we would raise some eyebrows if it was something
else… perhaps the author <em>wanted</em> to write a wonderful, event-driven
sentiment algo, but inadvertently <em>ended up</em> writing a mean reversion algo!
One important caveat here is that <code class="highlighter-rouge">pyfolio</code> requires you to supply your own
style factors, for every stock in your universe. This is an unfortunately large
complication for the average user, as it would require you to formulate and
implement your own risk model — I explain this in greater detail below.</p>
<p>The next 3 plots show the exposures to sectors. The first plot shows us how much
the algorithm longed or shorted a specific sector: above the x-axis if it
longed, and below if it shorted. The second plot simply shows the gross exposure
to each sector: taking the absolute value of the positions before normalizing.
The last plot shows the net exposure to each sector: taking the long position
<em>less the short position</em> before normalizing. This particular algo looks
beautiful: it is equally exposed to all sectors, and not overly exposed to any
one of them. Evidently, this algo must be taking into account its sector exposures
in its trading logic: given what we know from above, perhaps it is longing the
top 10 most “mean reverting” stocks in each sector at the start of every
week… This analysis requires no additional data other than your algorithm’s
positions: you can supply your own sectors if you like, but if not, the analysis
will default to the <a href="https://www.quantopian.com/help/fundamentals#asset-classification">Morningstar sector
mappings</a>
(specifically, the <code class="highlighter-rouge">morningstar_sector_code</code> field), available for free on the
Quantopian platform.</p>
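<p>To make the long, short, gross and net exposures concrete, here is a rough sketch for a single day. The positions and sector labels are made up, and normalizing by the portfolio’s total gross value is my own assumption; this is not <code class="highlighter-rouge">pyfolio</code>’s actual code.</p>

```python
import numpy as np

positions = np.array([500.0, -200.0, 300.0, -100.0, 400.0])  # hypothetical dollar positions
sectors = np.array(["tech", "tech", "energy", "energy", "health"])

gross_value = np.abs(positions).sum()   # normalizing constant (an assumption)
exposures = {}
for s in np.unique(sectors):
    p = positions[sectors == s]
    long_, short_ = p[p > 0].sum(), p[p < 0].sum()
    exposures[s] = {
        "long": long_ / gross_value,             # plotted above the x-axis
        "short": short_ / gross_value,           # negative, plotted below the x-axis
        "gross": np.abs(p).sum() / gross_value,  # absolute values before normalizing
        "net": (long_ + short_) / gross_value,   # long less the (absolute) short
    }
```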
<p>The next 3 plots show the exposures to market caps. In every other respect, they
are identical to the previous 3 plots. These plots look fairly reasonable: most
algos spend most of their positions in large and mega cap names, and have almost
no positions in micro cap stocks. (Quantopian actually discourages investing in
micro cap stocks by pushing users towards using the <a href="https://www.quantopian.com/posts/the-q500us-and-q1500us">Q500 or
Q1500</a> as a tradeable
universe). This analysis uses <a href="https://www.quantopian.com/help/fundamentals#valuation">Morningstar’s <code class="highlighter-rouge">market cap</code>
field</a>.</p>
<p>The last 2 plots show the portfolio’s exposure to illiquidity (or low trading
volume). This one is a bit trickier to understand: at the end of every day,
we take the number of shares held in each position and divide that by the
total volume. That gives us a number per position per day. We find the 10th
percentile of this number (i.e. the most illiquid) and plot that as a time
series. So it is a measure of how exposed our portfolio is to illiquid stocks.
The first plot shows the illiquid exposure in our long and short positions,
respectively: that is, it takes the number of shares held in each long/short
position, and divides it by the daily total volume. The second plot shows the
gross illiquid exposure, taking the absolute value of positions before
dividing. So it looks like for this particular algo, for the 10% most illiquid
stocks in our portfolio, our position accounts for around 0.2–0.6% (<em>not</em>
0.002–0.006%!) of market volume, on any given day. That’s an acceptably low
number! This analysis obviously requires daily volume data per stock, but that’s
freely available on Quantopian’s platform.</p>
<p>That’s it for the risk tear sheet! There are some more cool ideas in the
works (there always are), such as including plots to show a portfolio’s
concentration risk exposure, or a portfolio’s exposure to penny stocks. If you
have any suggestions, please file a <a href="https://github.com/quantopian/pyfolio/issues">new GitHub
issue</a> to let the dev team know!
Pyfolio is open-source and under active development, and outside contributions
are always loved and appreciated. Alternatively, if you just want to find out
more about the nuts and bolts (i.e. the math and the data) that go into the risk
tear sheet, you can dig around <a href="https://github.com/quantopian/pyfolio/tree/master/pyfolio">the source code
itself</a>!</p>
<h2 id="risk-models-and-performance-attribution">Risk Models and Performance Attribution</h2>
<p>There are two things in the discussion of the risk tear sheet that are worth
talking about in further detail:</p>
<ol>
<li>
<p>I mentioned how the computation of style factor exposures (i.e. the
first 4 plots) required your own “risk model” (whatever that is), and</p>
</li>
<li>
<p>It was nice that we can guess at the inner workings of the algo, just by
seeing its exposure to common factors. E.g., I guessed that the example algo
was a sector-neutral mean reversion algo, because it was equally exposed to
all 11 sectors, and had a high (in magnitude) exposure to the MACD signal.</p>
</li>
</ol>
<p>I’ll talk about both points in order.</p>
<p>In order to find out your exposure to a style factor, you obviously must first
know how much each stock is affected by the style factor. But how do you get
that? That is what a risk model is for!</p>
<p>At the end of every period (usually every trading day), the risk model wakes
up and looks at all the pricing data and style factor data for that day.
It then tries to explain as best it can how much each stock was affected by
each style factor. The end result is that each stock will have a couple of
numbers associated with it, one for every style factor. These numbers indicate
how sensitive the stock’s returns were to movements in the style factors. These
numbers are called <em>factor loadings</em> or <em>betas</em> (although I prefer “factor
loadings” because a lot of things in quant finance are called “beta”).</p>
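<p>To make “factor loadings” a bit more concrete, here is a minimal sketch of one
simple way to estimate them: an ordinary least-squares regression of a stock’s
daily returns on the factors’ daily returns over a window of days. This is an
illustrative assumption, not a description of how any particular risk model
(Quantopian’s included) does it, and all of the data is simulated:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily returns of 3 style factors over 250 trading days.
n_days, n_factors = 250, 3
factor_returns = rng.normal(0.0, 0.01, size=(n_days, n_factors))

# Simulate one stock whose "true" loadings we know, plus stock-specific noise.
true_loadings = np.array([0.8, -0.3, 0.1])
stock_returns = factor_returns @ true_loadings + rng.normal(0.0, 0.002, size=n_days)

# Regress the stock's returns on the factor returns (with an intercept column).
X = np.column_stack([np.ones(n_days), factor_returns])
coefs, *_ = np.linalg.lstsq(X, stock_returns, rcond=None)

# One number per factor: how sensitive the stock's returns are to that factor.
intercept, factor_loadings = coefs[0], coefs[1:]
```

<p>The estimated <code class="highlighter-rouge">factor_loadings</code> should land close to <code class="highlighter-rouge">true_loadings</code>; the
part of the stock’s returns that the factors cannot explain is the
stock-specific remainder.</p>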
<p>Even better, there’s no reason why the risk model should limit itself to style
factors! I previously made the distinction between style factors and other
factors such as sectors: theoretically, a risk model should also be able to find
out how sensitive a stock’s returns are to movements in its sector: compute a
“sector factor loading”, if you will. Collectively, all the factors that we want
the risk model to consider — be they sector, style or otherwise — are called
<em>common factors</em>.</p>
<p>Clearly, having a risk model allows us to do a whole lot of stuff! This is
because, if we want to know how style factors and other prevailing market trends
are affecting our <em>portfolio</em>, we must first know how they affect the <em>stocks</em>
in our portfolio. Or, to be a bit more ambitious, if we knew how style factors
and prevailing market trends are impacting our <em>universe</em> of stocks, then we’re
well on the way to knowing how they’re impacting our portfolio! The value of
this kind of portfolio analysis should, of course, be self-evident.</p>
<p>So, suppose we have a risk model. How do we get from a <em>stock-level</em> understanding
of how market trends are affecting us, to a <em>portfolio-level</em> understanding of the
same? The answer to this question is called <em>performance attribution</em>, and is
one of the main reasons a risk model is worth having.</p>
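<p>The first step from stock level to portfolio level can be sketched as a
weighted sum: if we assume the portfolio’s exposure to a factor is the sum of
each stock’s loading times its portfolio weight, the whole thing is one
matrix-vector product. The loadings and weights below are hypothetical:</p>

```python
import numpy as np

# Hypothetical factor loadings: one row per stock, one column per common factor
# (columns: momentum, mean reversion, volatility).
loadings = np.array([
    [ 1.2, -0.4,  0.3],  # stock A
    [ 0.5,  0.9, -0.2],  # stock B
    [-0.7,  0.1,  0.8],  # stock C
])

# Hypothetical portfolio weights (fraction of capital; negative = short).
weights = np.array([0.5, 0.3, -0.2])

# Portfolio exposure to each factor is the weighted sum of the stock loadings.
portfolio_exposure = weights @ loadings
```

<p>With these numbers the portfolio is heavily exposed to momentum (0.89) and
almost neutral to mean reversion (0.05) and volatility (-0.07).</p>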
<p>Instead of prattling on about performance attribution, it’d just be easier to
show you the miracles it can do. Below is a (fake, made-up) example of the kind
of analysis performance attribution can give us:</p>
<p>Date: 08–23–2017</p>
<table>
<thead>
<tr>
<th style="text-align: left">Factor</th>
<th style="text-align: right">PnL ($)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Total PnL</td>
<td style="text-align: right">-1,000</td>
</tr>
<tr>
<td style="text-align: left">Technology</td>
<td style="text-align: right">70</td>
</tr>
<tr>
<td style="text-align: left">Real Estate</td>
<td style="text-align: right">-40</td>
</tr>
<tr>
<td style="text-align: left">Momentum</td>
<td style="text-align: right">-780</td>
</tr>
<tr>
<td style="text-align: left">Mean Reversion</td>
<td style="text-align: right">100</td>
</tr>
<tr>
<td style="text-align: left">Volatility</td>
<td style="text-align: right">-110</td>
</tr>
<tr>
<td style="text-align: left">Stock-Specific</td>
<td style="text-align: right">480</td>
</tr>
</tbody>
</table>
<p>The table shows that today, our algo suffered a $1,000 loss, and the breakdown of
that loss indicates that the main culprit is momentum. In other words, our poor
performance today is mostly attributable to the poor performance of the momentum
factor (hence the name, “performance attribution”). The sector factors account
for very little PnL, while the other style factors (mean reversion and
volatility) drive fairly significant profits and losses, but the real smoking
gun here is the fact that momentum completely tanked today.</p>
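<p>A breakdown like this can be sketched in a few lines, under the simplifying
assumption that the PnL attributed to a factor is capital × portfolio exposure ×
that day’s factor return, with whatever is left over labelled stock-specific.
All numbers are hypothetical and are not meant to reproduce the table above:</p>

```python
# Hypothetical inputs for a single day.
capital = 1_000_000
total_pnl = -1_000.0
portfolio_exposure = {"momentum": 0.89, "mean_reversion": 0.05, "volatility": -0.07}
factor_returns = {"momentum": -0.001, "mean_reversion": 0.002, "volatility": 0.0015}

# PnL attributed to each common factor: capital x exposure x factor return.
factor_pnl = {
    factor: capital * portfolio_exposure[factor] * factor_returns[factor]
    for factor in portfolio_exposure
}

# Whatever the common factors cannot explain is the stock-specific PnL.
specific_pnl = total_pnl - sum(factor_pnl.values())
```

<p>Here momentum accounts for about -$890 of the -$1,000 total, mean reversion for
about +$100, volatility for about -$105, and the stock-specific remainder for
about -$105.</p>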
<p>There are a few more useful summary statistics that performance attribution can
give us! Traditional computations for the alpha and the Sharpe ratio of a
strategy usually take into account the performance of the market: i.e., the
traditional alpha is a measure of how much our strategy <em>outperformed</em> the
market, and the traditional Sharpe ratio is a measure of the same, but
accounting for the volatility of returns. These may be dubbed <em>single-factor
alphas</em>, because they only measure performance once one factor has been
accounted for, namely the market. In reality, we would like to account not only
for the market, but also for any other common factors, such as style or
sector. This leads to the concept of the <em>multi-factor alpha and Sharpe ratio</em>,
which are exactly the same as the alpha and Sharpe ratio we’re familiar with, but
take into account a lot more factors. In other words, whereas the returns in
excess of the market are quantified by the single-factor alpha, the returns in
excess of the market, momentum, mean reversion, volatility etc. are
quantified by the multi-factor alpha. The same goes for the single-factor and
multi-factor Sharpe ratios, in the case of risk-adjusted returns.</p>
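<p>One way to see the difference is a small simulation. The sketch below assumes
alpha is estimated as the intercept of a least-squares regression of the
strategy’s returns on the factor returns; the <code class="highlighter-rouge">alpha</code> helper and all of the
data are hypothetical. The simulated strategy has no true alpha: it simply
loads on the market and on a momentum factor.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 500

# Hypothetical daily factor returns (the momentum drift is exaggerated
# so that the effect is easy to see).
market = rng.normal(0.0004, 0.01, n_days)
momentum = rng.normal(0.002, 0.005, n_days)

# A strategy with zero true alpha: all of its returns come from factor
# exposure plus a little noise.
strategy = 1.0 * market + 1.5 * momentum + rng.normal(0.0, 0.001, n_days)

def alpha(strategy_returns, *factor_returns):
    """Regression intercept: the average daily return the factors don't explain."""
    X = np.column_stack([np.ones(len(strategy_returns)), *factor_returns])
    coefs, *_ = np.linalg.lstsq(X, strategy_returns, rcond=None)
    return coefs[0]

single_factor_alpha = alpha(strategy, market)           # accounts for the market only
multi_factor_alpha = alpha(strategy, market, momentum)  # also accounts for momentum
```

<p>The single-factor alpha looks healthy because it mistakes momentum exposure for
skill; once momentum is accounted for, the multi-factor alpha is close to zero.</p>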
<p>Adding performance attribution capabilities to <code class="highlighter-rouge">pyfolio</code> is an active project! A
couple of pull requests have already been merged to this effect, so definitely
stay tuned! A new version of <code class="highlighter-rouge">pyfolio</code> will probably be released once performance
attribution is up and running. As always, feel free to
<a href="https://github.com/quantopian/pyfolio">contribute to <code class="highlighter-rouge">pyfolio</code></a>, be it by
making feature requests, reporting bugs, or submitting a pull request!</p>
<hr />
<p><strong>EDIT (12–16–2017):</strong> Quantopian recently launched their risk model for anyone to
use! This is a great resource that usually only large and deep-pocketed
financial institutions have access to. Check it out
<a href="https://www.quantopian.com/risk-model">here</a>!</p>
<p><strong>EDIT (05–11–2018):</strong> Quantopian’s now integrated pyfolio analytics into their
backtest engine! This makes it much easier to see how your algorithm stacks up
against expectations. Check out the announcement
<a href="https://www.quantopian.com/posts/improved-backtest-analysis">here</a>!</p>
<p><strong>EDIT (05–29–2018):</strong> Quantopian recently released a white paper on how the risk
model works! Read all about it <a href="https://www.quantopian.com/papers/risk">here</a>.</p>
<p>I was lucky enough to have the chance to intern at <a href="https://www.quantopian.com/">Quantopian</a> this summer. During that time I contributed some exciting stuff to their open-source portfolio analytics engine, <a href="https://github.com/quantopian/pyfolio"><code class="highlighter-rouge">pyfolio</code></a>, and learnt a truckload of stuff while doing it!</p>