<p><a href="https://www.georgeho.org/">George Ho</a> · Generated with <a href="https://gohugo.io/">Hugo</a> · Copyright © 2022, George Ho · Last built 2024-03-28</p>
<h1><a href="https://www.georgeho.org/data-is-plural-podcast/">Data is Plural Podcast — Crossword Data</a> (2023-12-20)</h1>
<p>I was on a podcast!</p>
<p>Check out <a href="https://podcast.data-is-plural.com/2159594/14179132-s2e5-crosswords">the latest episode of the Data Is Plural
podcast</a>
to hear me and fellow crossword archivist <a href="http://saul.pw/">Saul Pwanson</a> talk
about our respective crossword datasets. I also opine a bit on milk carton data
collection, and I’m <a href="https://www.economist.com/united-states/2023/11/30/a-national-milk-carton-shortage-sours-americas-dairy-industry">only now
realizing</a>
how prescient those comments were! <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>For more, check out <a href="https://www.georgeho.org/cryptic-clues/">my previous blog post</a> introducing the
dataset, <a href="https://cryptics.georgeho.org/">the dataset itself</a> or <a href="https://github.com/eigenfoo/cryptics">the source
code</a>.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I mean, just listen to this:</p>
<blockquote>
<p>[Pactiv Evergreen, a milk carton supplier,] has … resurrected a defunct
generic brand and will use its design for all cartons, rather than
interrupting the line to change logos. This should speed [milk]
production and increase capacity by 10%.</p>
</blockquote>
<p>Wouldn’t that be cool to see in a dataset? <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>
<h1><a href="https://www.georgeho.org/ml-for-ehr-rwd-extraction/">“So What Is Your Job, Exactly?”</a> (2023-04-04)</h1>
<p>My team at <a href="https://flatiron.com/">Flatiron Health</a> (but especially the
inimitable <a href="https://www.blytheadamson.com/">Blythe Adamson</a>) recently wrote an
excellent overview of how we use machine learning to build research-ready
oncology datasets from patient medical records — in a paper called <em>Approach
to Machine Learning for Extraction of Real-World Data Variables from Electronic
Health Records.</em></p>
<p>This is a refreshingly simple and concise articulation of what I do at my day
job! You can read the paper on <a href="https://www.frontiersin.org/articles/10.3389/fphar.2023.1180962/full">Frontiers in
Pharmacology</a>,
<a href="https://www.medrxiv.org/content/10.1101/2023.03.02.23286522v1">medRxiv</a>, or
read <a href="https://flatiron.com/resources/approach-to-machine-learning-for-extraction-of-real-world-data-variables-from-electronic-health-records">a brief overview on the Flatiron
website</a>.</p>
<p>To spare you a few clicks, here’s what I think is the most enlightening figure
in the paper:</p>
<figure>
<a href="https://www.georgeho.org/assets/images/ehr-variable-snippets.jpg"><img src="https://www.georgeho.org/assets/images/ehr-variable-snippets.jpg" alt="EHR"></a>
<figcaption><i>Snippets of text from electronic health records are inputs to deep learning models that produce a variable value for each patient as an output. Source: <a href="https://www.medrxiv.org/content/10.1101/2023.03.02.23286522v1">Figure 5 in the paper.</a></i></figcaption>
</figure>
<h1><a href="https://www.georgeho.org/hanukkah-of-data-2022/">Thoughts on Hanukkah of Data 2022</a> (2022-12-26)</h1>
<blockquote>
<p>This blog post contains spoilers for Hanukkah of Data 2022.</p>
</blockquote>
<p>This holiday season I’ve been doing the <a href="https://hanukkah.bluebird.sh/">Hanukkah of
Data</a>, which is a puzzle suite by a group of
hackers called <a href="https://bluebird.sh/">the Devottys</a>. It’s a sequence of
programming puzzles, with one puzzle dropping for every day of Hanukkah. If
you’re familiar with <a href="https://adventofcode.com/">Advent of Code</a>, it’s very
similar to that, except (a) it only lasts 8 days instead of 25, and (b) it’s
more data-oriented, instead of coding or algorithms-oriented.</p>
<p>I did it in <a href="https://www.visidata.org/">VisiData</a>, which is a tool I’ve been
using a lot recently (both at work and for my side projects) that I really
wanted to develop expert proficiency with.</p>
<p>Here were my solve statistics:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span> Puzzle | Solve Time | # Attempts ║
</span></span><span style="display:flex;"><span> 0 | 3 minutes | 2 ║
</span></span><span style="display:flex;"><span> 1 | 82 minutes | 2 ║
</span></span><span style="display:flex;"><span> 2 | 20 minutes | 1 ║
</span></span><span style="display:flex;"><span> 3 | 20 minutes | 5 ║
</span></span><span style="display:flex;"><span> 4 | 37 minutes | 1 ║
</span></span><span style="display:flex;"><span> 5 | 7 minutes | 1 ║
</span></span><span style="display:flex;"><span> 6 | 6 minutes | 1 ║
</span></span><span style="display:flex;"><span> 7 | 24 minutes | 1 ║
</span></span><span style="display:flex;"><span> 8 | 5 minutes | 1 ║
</span></span></code></pre></div><h2 id="overall-impressions">Overall Impressions</h2>
<p><strong>Hanukkah of Data is much shorter than Advent of Code</strong>, which I think is a
hugely underrated benefit — in previous years, Advent of Code sometimes felt
more like homework than a puzzle suite.</p>
<p>It’s also <strong>more of a puzzle than Advent of Code</strong> — for example, Puzzle 2
required very non-trivial reading comprehension and logical inference to
realize that you were looking for (a) a customer with the initials JD (b) who
had, in the same order, bought coffee and bagels at Noah’s market (c) in 2017.
I found this much more enjoyable than Advent of Code, where the solution is
usually straightforward, and the implementation is the meat of the challenge.</p>
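<p>To make that concrete, here’s a minimal sketch of the Puzzle 2 logic in plain Python. The record shape, field names, and sample data are all invented for illustration; the real dataset looks quite different.</p>

```python
# Hypothetical mini-version of the Puzzle 2 reasoning: find a customer with
# the initials JD who bought coffee and bagels in the same order, in 2017.
orders = [
    {"name": "Jane Doe", "year": 2017, "items": ["coffee", "bagel"]},
    {"name": "John Smith", "year": 2017, "items": ["coffee", "bagel"]},
    {"name": "Jake Dean", "year": 2018, "items": ["coffee", "bagel"]},
]

def initials(name):
    # "Jane Doe" -> "JD"
    return "".join(part[0] for part in name.split())

hits = [
    order for order in orders
    if initials(order["name"]) == "JD"
    and order["year"] == 2017
    and "coffee" in order["items"]
    and "bagel" in order["items"]
]
print(hits)  # only the Jane Doe order satisfies all three conditions
```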
<p>On VisiData: for those comfortable with command line interfaces, Vim-style key
bindings, or simply willing to put in the time to learn a mini-language of
keyboard shortcuts, I think <strong>VisiData is the best tool for doing well-scoped,
one-off data explorations or analyses.</strong></p>
<p>Towards the end of Hanukkah, the puzzles became less conceptually ambiguous and
more technically difficult (in terms of the sophistication of the data
wrangling required). As someone already experienced with querying data, I was
pleased to finish these puzzles in single-digit minutes — an achievement that
I credit almost entirely to VisiData, which makes visualizing, filtering and
aggregating data seamlessly interactive.</p>
<p>I only see two downsides of VisiData: the sparse documentation of advanced
features (more on that below) and performance. Performance is most obviously an
issue when you’re doing joins — joining two tables with a few thousand rows
each takes a noticeably long time. I’m looking forward to
<a href="https://github.com/visidata/vdsql"><code>vdsql</code></a>, which is VisiData’s sibling
project that skins various databases with a VisiData interface (via
<a href="https://ibis-project.org/">Ibis</a>), and should therefore be as performant as
the underlying database.</p>
<h2 id="some-miscellaneous-thoughts">Some Miscellaneous Thoughts</h2>
<ul>
<li>
<p>Puzzle 1 required a non-trivial function (basically a “phonespell” to convert
words to numbers, as if you were dialling on a phone). I struggled a lot with
making this custom function available to me in VisiData — I spent around an
hour figuring out how to make <a href="https://www.visidata.org/docs/plugins/">a custom
plugin</a> (this is what really blew up
my solve time on the first day). I later learnt that adding a Python function
to your <code>.visidatarc</code> is a much simpler way to achieve the same thing.</p>
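<p>For the curious, such a phonespell function can be quite short. This is my own sketch (not the plugin I wrote, and the name <code>phonespell</code> is made up); pasting something like it into <code>~/.visidatarc</code> makes it callable from VisiData column expressions:</p>

```python
# Map each keypad letter group to its digit, as on a phone dial pad.
KEYPAD = {
    "abc": "2", "def": "3", "ghi": "4", "jkl": "5",
    "mno": "6", "pqrs": "7", "tuv": "8", "wxyz": "9",
}
# Invert to a letter -> digit lookup table.
DIGIT = {letter: digit for letters, digit in KEYPAD.items() for letter in letters}

def phonespell(word):
    """Convert a word to keypad digits, e.g. 'cat' -> '228'."""
    return "".join(DIGIT.get(ch, "") for ch in word.lower())
```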
<p>While I think the basics of VisiData are <a href="https://jsvine.github.io/intro-to-visidata/">exceptionally well
documented</a>, the advanced
features are not — I still don’t really understand how to extend VisiData
with its API. Nevertheless, this won’t be an issue for most users, since 90%
of VisiData’s value is in its interactivity and interoperability, not in its
extensibility.</p>
</li>
<li>
<p>For me, the most challenging puzzle was Puzzle 4, which asked you to find someone
who buys pastries. When you have a dataset with over a thousand products, how
do you find all the pastries?</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>sku | desc | wholesale_cost ║
</span></span><span style="display:flex;"><span>DLI0002 | Smoked Whitefish Sandwich | 9.33 ║
</span></span><span style="display:flex;"><span>PET0005 | Vegan Cat Food, Turkey & Chicken | 4.35 ║
</span></span><span style="display:flex;"><span>HOM0018 | Power Radio (red) | 21.81 ║
</span></span><span style="display:flex;"><span>KIT0034 | Azure Ladle | 2.81 ║
</span></span><span style="display:flex;"><span>PET0041 | Gluten-free Cat Food, Pumpkin & Pumpkin | 4.60 ║
</span></span></code></pre></div><p>What I ended up doing was to split out the “suffix” of each product
<code>desc</code>ription (with some special handling for parenthetical modifiers), like
so:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>sku | desc | descsuffix | wholesale_cost ║
</span></span><span style="display:flex;"><span>DLI0002 | Smoked Whitefish Sandwich | Sandwich | 9.33 ║
</span></span><span style="display:flex;"><span>PET0005 | Vegan Cat Food, Turkey & Chicken | Chicken | 4.35 ║
</span></span><span style="display:flex;"><span>HOM0018 | Power Radio (red) | Radio | 21.81 ║
</span></span><span style="display:flex;"><span>KIT0034 | Azure Ladle | Ladle | 2.81 ║
</span></span><span style="display:flex;"><span>PET0041 | Gluten-free Cat Food, Pumpkin & Pumpkin | Pumpkin | 4.60 ║
</span></span></code></pre></div><p>The number of <em>kinds</em> of products is drastically smaller than the number of
products, to the point where it’s feasible to look through them all manually
and pick out the pastries.</p>
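<p>The suffix-splitting itself is just a regex plus a one-liner. This is an assumed reconstruction in plain Python, rather than the VisiData expression I actually used:</p>

```python
import re

def desc_suffix(desc):
    # Strip a trailing parenthetical modifier like "(red)", then keep the
    # last word of the description.
    desc = re.sub(r"\s*\([^)]*\)\s*$", "", desc)
    return desc.split()[-1]

desc_suffix("Smoked Whitefish Sandwich")         # 'Sandwich'
desc_suffix("Power Radio (red)")                 # 'Radio'
desc_suffix("Vegan Cat Food, Turkey & Chicken")  # 'Chicken'
```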
<p>This obviously won’t work in general: for example, <code>Vegan Cat Food, Turkey & Chicken</code> isn’t a kind of <code>Chicken</code>, and you could imagine that this would
really let you down for a product called <code>Rugelach, Raspberry</code> instead of
<code>Raspberry Rugelach</code>. Still, I thought this was a neat trick, and I managed
to eke out the correct solution.</p>
<p>Later in the week I realized that all pastries had an <code>sku</code> that started with
<code>BKY</code>, which would’ve helped considerably — similarly, cat foods start with
<code>PET</code> and collectibles start with <code>COL</code>. Sometimes it pays to actually read
random-looking alphanumeric codes!</p>
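<p>Tallying the three-letter prefixes makes those category codes jump out immediately. The SKUs below are from the table above, plus a made-up <code>BKY</code> one:</p>

```python
from collections import Counter

skus = ["DLI0002", "PET0005", "HOM0018", "KIT0034", "PET0041", "BKY0001"]
# Count products per three-letter SKU prefix.
prefix_counts = Counter(sku[:3] for sku in skus)
print(prefix_counts.most_common())  # PET appears twice; every other prefix once
```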
</li>
<li>
<p>I was surprised that the puzzles didn’t seem to be monotonically increasing
in difficulty — as you’ll see from my times, and as you might expect from
Advent of Code. <a href="https://www.saul.pw/">Saul Pwanson</a> (the creator of Hanukkah
of Data) had this to say:</p>
<blockquote>
<p>There is a ramp in difficulty, but it is not very steep, and for people who
are already familiar with data queries, it might feel like not much has
been added between puzzles. But if you look at each puzzle compared with
the previous one, there is always something new. Sometimes it’s structural
(now you need to do a join), sometimes it’s worldly (what is a pastry?),
and sometimes it’s technical (it’s surprisingly difficult in most tools to
filter based on a date range that doesn’t include the year).</p>
</blockquote>
<p>It’s a really good observation — I suppose I shouldn’t be surprised that
Saul’s thought about the puzzle design a lot more than I have! 😅</p>
</li>
<li>
<p>The text art is just stunning! Each solved puzzle reveals a new animal, until
the whole tapestry is illuminated:</p>
<p><a href="https://www.georgeho.org/assets/images/hanukkah-of-data.png"><img src="https://www.georgeho.org/assets/images/hanukkah-of-data.png" alt="The whole tapestry for Hanukkah of Data
2022"></a></p>
</li>
</ul>
<h1><a href="https://www.georgeho.org/computer-faster-reading-less/">Use Your Computer Faster By Reading Less</a> (2022-11-06)</h1>
<p>Suppose you want your computer to take some action (whether it’s showing you information about something, navigating to a particular file, etc.). You’re not reading through a menu and thinking through what to do: you already know what you want to do, and you just have to execute.</p>
<p>In these instances, <strong>the slowest thing you could possibly do is read.</strong><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> The more you can make your computer do what you want <em>with your eyes literally closed</em>, the faster and more efficient you will be at the computer.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<h2 id="coding-environment">Coding Environment</h2>
<p>I spend a lot of my time writing code, and I experienced a real step change in my programming quality of life once I configured my coding environment to be able to do all of the below without looking at the screen. (For those <del>un</del>lucky enough to code in Vim, I’ve briefly outlined my setup.)</p>
<ul>
<li>Open a file, even if you only partially know its name and path (mappings to <code>fzf.vim</code> <code>:Buffers</code> and <code>:Files</code>)</li>
<li>Search text in files, both currently open and not (mappings to <code>fzf.vim</code> <code>:Lines</code> and <code>:Rg</code>)</li>
<li>Run the current file (a mapping to <code>dispatch.vim</code> <code>:Dispatch</code>)</li>
<li>Run the current file’s tests (again <code>dispatch.vim</code>)</li>
</ul>
<h2 id="search-dont-skim">Search, Don’t Skim</h2>
<p>I generally try to avoid skimming webpages and documents (which is basically light reading), and instead search for what I’m looking for.</p>
<p>For example, I recently had the <a href="https://xgboost.readthedocs.io/en/stable/python/python_api.html">XGBoost Python API documentation</a> open. It is long and tortuous, and I only want to know the name of the argument that controls the sample weight (which corrects for label class imbalance).</p>
<p>I could scroll and skim for <code>XGBClassifier</code>, and then skim each of the several dozen arguments to find it… or I could simply search for it directly. I searched for <code>weight</code>, saw several hundred search hits and realized that was too vague a term (it could also refer to the weights of the XGBoost model itself). I then tried <code>reweight</code> and got no hits, so I tried <code>balanc</code> (I don’t search the final <code>e</code> because that would exclude conjugations like <code>balancing</code>), and found what I was looking for: the first search hit was next to the argument <code>scale_pos_weight</code>. A bit of scrolling around to double check, and I was done.</p>
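<p>For the record, the XGBoost documentation suggests setting <code>scale_pos_weight</code> to the ratio of negative to positive examples; the counts below are made up for illustration:</p>

```python
# A common heuristic from the XGBoost docs:
# scale_pos_weight = sum(negative instances) / sum(positive instances).
n_negative, n_positive = 900, 100
scale_pos_weight = n_negative / n_positive
# then, e.g.: xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)
print(scale_pos_weight)  # 9.0
```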
<h2 id="keyboard-shortcuts">Keyboard Shortcuts</h2>
<p>Outside of coding, I interact with a lot of other applications every day, almost all of which have keyboard shortcuts.</p>
<p>Anecdotally, I know many people get turned off from learning keyboard shortcuts, because there are just so many of them. I would start small: learning two or three keyboard shortcuts per application will probably cover 90% of your use cases. I’ll just go over the ones I use most frequently.</p>
<p><strong>Google Chrome:</strong></p>
<ul>
<li><code>Ctrl-Shift-A</code> to search tabs, both currently open and recently closed</li>
<li><code>Ctrl-F</code> to search in the current tab</li>
<li><code>Ctrl-T</code> to open a new tab</li>
<li><code>Ctrl-W</code> to close a tab</li>
</ul>
<p>Here are a ton more <a href="https://support.google.com/chrome/answer/157179">Chrome</a> and <a href="https://support.mozilla.org/en-US/kb/keyboard-shortcuts-perform-firefox-tasks-quickly">Firefox</a> shortcuts.</p>
<p><strong>Slack:</strong></p>
<ul>
<li><code>Cmd-G</code> to search Slack</li>
<li><code>Cmd-N</code> to send a new Slack</li>
<li><code>Cmd-Shift-A</code> to go to your unread messages, <code>Esc</code> to mark unread messages as read</li>
</ul>
<p>Here are a ton more <a href="https://slack.com/help/articles/201374536-Slack-keyboard-shortcuts">Slack</a> keyboard shortcuts.</p>
<p><strong>Gmail:</strong></p>
<ul>
<li><code>/</code> to search your emails</li>
<li><code>gi</code> to go to your inbox</li>
<li><code>c</code> to compose a new email</li>
<li><code>?</code> to see all shortcuts</li>
</ul>
<p>Here are a ton more <a href="https://support.microsoft.com/en-us/office/keyboard-shortcuts-for-outlook-3cdeb221-7ae5-4c1d-8c1d-9e63216c1efd">Outlook</a> and <a href="https://proton.me/support/keyboard-shortcuts">ProtonMail</a> keyboard shortcuts.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The particular phrasing of this idea is from Gary Bernhardt’s <a href="https://www.destroyallsoftware.com/screencasts/catalog/some-vim-tips">Destroy All Software</a> screencasts, which I can’t recommend highly enough. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>As an aside, this is also true with crosswords! Crossword speed solvers know that the slowest thing you can do is <em>read clues</em>, so the trick is to solve only the down clues while reading the across entries as they solve to make sure that they still form valid words or phrases. In this way, they can cut down on around half of the time they would have normally spent reading the across clues. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>
<h1><a href="https://www.georgeho.org/webster-unstructured-data/">Merriam-Webster and Unstructured Data Processing</a> (2022-09-18)</h1>
<p>I recently finished reading <a href="https://bookshop.org/books/word-by-word-the-secret-life-of-dictionaries/9781101970263"><em>Word by Word: The Secret Life of Dictionaries</em> by
Kory
Stamper</a>,
which was an unexpected page-turner. What intrigued me most was (perhaps
unsurprisingly) Stamper’s description of how Merriam-Webster gets written, and
what a striking resemblance that process has to many successful unstructured
data projects in the wild. I want to use this blog post to ruminate on this.</p>
<hr>
<p><strong>First</strong> it begins with collection and curation of raw, unstructured data.
Stamper describes a fascinating process called <em>“reading and marking”</em>, whereby
editors are assigned reading of current magazines, periodicals, blogs —
almost anything written in English, it seems — and read and underline any
words that catch their eye: new words, or words that get used in new ways.
(This is, contrary to first impressions, a non-trivial task that requires
training: good readers-and-markers will pick up on the recent trend of <em>“bored
of”</em>, instead of the more historically common <em>“bored with”</em> — this doesn’t
imply that <em>bored</em> is picking up a new meaning, but rather that <em>of</em> is…
which as you can imagine, can get lexicographers very excited.)</p>
<p>Stamper also describes the use of corpora, which are basically large structured
datasets of English being used in the wild — a dataset of tweets, say, or
transcripts of popular TV shows. As data gets increasingly commoditized, data
projects will increasingly have the luxury of starting with structured data (or
at least, supplementing their raw unstructured data with structured data).</p>
<p><strong>Second</strong> is the actual structuring of the data. This entails a small army of
editors dividing the entire dictionary amongst themselves, and defining (or
revising definitions of) each word by hand. In practice, that means opening up
the database of read-and-marked words (and maybe also the structured corpora),
seeing if the current definition needs to be revised to accommodate new senses
or usage of the word, and potentially writing or rewriting a definition for new
words… all in the span of maybe 15 minutes per word, on average.</p>
<p>This seems to be the most labor-intensive step in the “Merriam-Webster data
pipeline”, but of course is also the one that adds the most value. There’s no
reason to think that this phase (or any of these three phases, really!) needs
to be technologically sophisticated — the dictionary-maker still makes use of
index cards and filing cabinets today. Lucrative products <a href="https://vicki.substack.com/p/neural-nets-are-just-people-all-the">being underpinned by
vast amounts of manual human labor is unfortunately nothing
new</a>, but
it’s good to be reminded of it. The fact that product value and technological
sophistication are unrelated is underappreciated: you don’t unlock more value
from your data by writing better code or training better machine learning
models.</p>
<p><strong>Finally</strong> come any ancillary features or datasets that Merriam-Webster
offers on top of their existing data (a.k.a. the dictionary), simply because
they are best positioned to deliver them. Think of things like etymology,
pronunciations and dates<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>It can seem funny that a dataset’s true value to users (or, if you like, the
dataset’s “product-market fit”) might come from one of these subsidiary
datasets or features, instead of “the real thing”. This makes sense though:
just as companies pivot products and business models to stay relevant, so too
can unstructured datasets — after all, it’s not a huge stretch to think of
unstructured datasets as products in their own right.</p>
<hr>
<p>So here we have a recipe for a successful data project:</p>
<ol>
<li>Collect and curate raw, unstructured data,</li>
<li>Structure it (ideally also adding some value to the data in the process, but
structuring the data is value enough), and</li>
<li>Offer subsidiary datasets that you are best positioned to offer</li>
</ol>
<p>What other data projects have followed this recipe?</p>
<ol>
<li>
<p><strong>Google Search</strong>: Google <a href="https://developers.google.com/search/docs/advanced/crawling/googlebot">crawled the
internet</a>,
and continues to do so on an ongoing basis; they invented
<a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a> and other
algorithms to make searching (a weak form of “structuring”, I suppose) the
internet possible; and their question-answering and
<a href="https://developers.google.com/search/docs/advanced/structured-data/carousel">carousels</a>
are good examples of ancillary features on top of their core offering.</p>
</li>
<li>
<p><strong><a href="https://cryptics.georgeho.org/"><code>cryptics.georgeho.org</code></a></strong>: my <a href="https://www.georgeho.org/cryptic-clues/">dataset
of cryptic crossword clues</a> started by indexing several
blogs for cryptic crosswords; I then wrote a ton of <code>BeautifulSoup</code> to parse
structured clue information out of the blog post HTML; finally, I ran some
simple searches and regular expressions to produce more valuable resources
for constructors of cryptic crosswords.</p>
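<p>As a flavor of what that parsing looks like, here’s a toy stand-in. The real blogs’ HTML is much messier, and this markup pattern is invented purely for illustration:</p>

```python
import re

# Assume each clue is rendered as "<p>Clue text (6) <b>ANSWER</b></p>".
html = "<p>Quiet meal for a bird (6) <b>PIGEON</b></p>"
pattern = re.compile(
    r"<p>(?P<clue>.*?)\s*\((?P<length>\d+)\)\s*<b>(?P<answer>[A-Z]+)</b></p>"
)
clues = [m.groupdict() for m in pattern.finditer(html)]
print(clues)  # [{'clue': 'Quiet meal for a bird', 'length': '6', 'answer': 'PIGEON'}]
```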
</li>
</ol>
<p>I’m not convinced that this is the <em>only</em> way for data projects to succeed,
but it does seem like a helpful pattern to keep in mind!</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I was surprised to learn that words with multiple definitions are defined
in chronological order of first usage, and not, as I imagined, some kind of
“importance” of definitions. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>
<h1><a href="https://www.georgeho.org/link-bulletin-2022-08/">Link Bulletin, August 2022</a> (2022-08-04)</h1>
<blockquote>
<p>This is an at-most-monthly link bulletin, where I compile and post a handful
of links that I’ve read and thought about with minimal explanation and
commentary, on the theory that the links I find interesting might also be
interesting to others.</p>
</blockquote>
<ul>
<li><a href="https://blog.ceejbot.com/posts/reduce-friction/">Reduce Friction</a></li>
<li><a href="https://twitter.com/sh_reya/status/1521903041003225088"><em>“Some MLOps principles I think every ML platform should
have”</em></a></li>
<li><a href="https://twitter.com/seanjtaylor/status/1523096896532664320"><em>“It’s not uncommon for tech companies to have more employees than people
expect. What’s your best explanation for this
phenomenon?”</em></a></li>
<li><a href="https://www.jsvine.com/consulting/pdf-data-extraction/">PDF Data Extraction
Consulting</a></li>
<li><a href="https://joeldueck.com/wiki?name=A+lightweight+Pollen+replacement">A Lightweight Pollen
Replacement</a></li>
<li><a href="https://begriffs.com/posts/2019-07-19-history-use-vim.html">History and Effective Use of
Vim</a></li>
</ul>
<h1><a href="https://www.georgeho.org/crosswords-datasets-dictionaries/">Datasets and Dictionaries for Crosswords</a> (2022-07-30)</h1>
<p>Lately, I’ve become worryingly knowledgeable in datasets for crosswords… so
I’ve written up basically everything I know that might be helpful to crossword
constructors (and makers of other word puzzles, too). However, in writing this,
I realized that this may be helpful to just about anybody who works with words
— lyricists, poets, marketers, scholars, etc. Hopefully there’s something for
everybody! So without further ado,</p>
<h2 id="dictionaries">Dictionaries</h2>
<p>I’ll assume you know what a dictionary is — if you’re reading this you may
even have a <em>favorite</em> dictionary (or a favorite dictionary <em>edition!</em>),
whether it’s <a href="https://chambers.co.uk">Chambers</a>,
<a href="https://www.merriam-webster.com">Merriam-Webster</a> or <a href="https://en.wikipedia.org/wiki/Google_Dictionary">Google
Dictionary</a> (which, <a href="https://support.google.com/websearch/answer/10106608">fun
fact</a>, is mostly sourced
from <a href="https://languages.oup.com/google-dictionary-en/">Oxford Languages</a>).</p>
<p>More interesting are dictionaries that allow you to search or query them in
more sophisticated ways: the most popular are <a href="https://onelook.com">OneLook</a>
and <a href="https://www.onelook.com/thesaurus">OneLook Thesaurus</a>, where a user can,
for example, search <code>bl????rd</code> to find words that start with <em>bl</em>, end with
<em>rd</em>, and have four letters in between — so <code>bluebird</code> would be a result.</p>
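<p>Incidentally, this style of wildcard maps directly onto Python’s <code>fnmatch</code> patterns, where each <code>?</code> matches exactly one character. The word list below is a toy sample:</p>

```python
import fnmatch

words = ["bluebird", "blizzard", "blowhard", "billiard", "backward"]
# Each '?' matches exactly one character, mirroring the OneLook query.
print(fnmatch.filter(words, "bl????rd"))  # ['bluebird', 'blizzard', 'blowhard']
```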
<p>The main strength of these dictionaries is the <em>expressiveness</em> of the query
language, and in that regard <a href="https://www.quinapalus.com/qat.html">Qat</a> (which
is also available in French) handily beats OneLook: it can match vowels and
consonants (<code>bl@@#@rd</code>) and ranges of letters and lengths (<code>8-10:bl*rd</code>). Qat
is also able to solve “word equations” (e.g.
<a href="https://www.quinapalus.com/cgi-bin/qat?pat=ABCDE%3D.....%3B!%3DA%3CB%3CC%3CD%3CE"><code>ABCDE=.....;!=A<B<C<D<E</code></a>
finds five-letter words whose letters are in strictly alphabetical order, such
as <code>abhor</code> and <code>first</code>), and even <em>simultaneous</em> word equations (e.g.
<a href="https://www.quinapalus.com/cgi-bin/qat?pat=ACB%3BADB%3BAEB%3B%7CACB%7C%3D5%3B%7CE%7C%3D1%3B!%3DC%3CD%3CE"><code>ACB;ADB;AEB;|ACB|=5;|E|=1;!=C<D<E</code></a>
finds sets of three five-letter words that are all one letter apart, such as
<code>beats, boats, brats</code> — useful for finding crossing words!).</p>
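<p>The first of those word equations is easy to sanity-check in plain Python, using a toy word list:</p>

```python
def strictly_increasing(word):
    # True when the letters appear in strictly alphabetical order.
    return all(a < b for a, b in zip(word, word[1:]))

candidates = ["abhor", "first", "hello", "ghost"]
print([w for w in candidates if strictly_increasing(w)])  # ['abhor', 'first', 'ghost']
```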
<h2 id="augmented-dictionaries">Augmented Dictionaries</h2>
<p>Many tools supplement dictionaries with other data, such as etymology,
pronunciation or sets of related words. You might think that your favorite
dictionary would already give you all of those things, but the strength here is
in the ability to easily write very sophisticated queries, such as <a href="https://api.datamuse.com/words?rel_com=car&sp=t*"><em>“what
comprises a car that starts with the letter
T?”</em></a>, to give you phrases
like <code>trunk, throttle, tailfin, third gear</code>.</p>
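<p>That particular query is just a URL with two documented Datamuse parameters (<code>rel_com</code> for “comprises” and <code>sp</code> for “spelled like”), so it’s easy to build programmatically:</p>

```python
from urllib.parse import urlencode

# Reconstruct the "what comprises a car that starts with T?" query.
params = {"rel_com": "car", "sp": "t*"}
url = "https://api.datamuse.com/words?" + urlencode(params)
print(url)
```

<p>Fetching that URL with any HTTP client returns a JSON list of objects with a <code>word</code> field.</p>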
<ul>
<li>The <a href="https://www.etymonline.com/">Online Etymology Dictionary</a> looks up word
etymologies, which is helpful for avoiding <em>“shared roots”</em> in cryptic
crosswords.</li>
<li>The <a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict">Carnegie Mellon University Pronouncing
Dictionary</a> looks up word
pronunciations, splitting words up into phonemes. This may seem silly
<em>(“can’t you just Google to learn the pronunciation of words?”),</em> but with a
bit of work, this dataset lets you look up homophones and Spoonerisms, as
some crossword construction software — such as <a href="https://exet.app">Exet</a> — does!</li>
<li><a href="https://rhymezone.com/">RhymeZone</a> and its Spanish cousin
<a href="https://rimar.io/">Rimar.io</a> let you look up homophones, rhymes or near
rhymes (RhymeZone actually uses the CMU Pronouncing Dictionary, among other
datasets!)</li>
<li><a href="https://onelook.com/spruce/">Spruce</a> looks up “inspiring sentences” —
quotes, lyrics, proverbs and jokes, which are indexed from
<a href="https://en.wikiquote.org/wiki/Main_Page">WikiQuote</a> and <a href="https://commoncrawl.org/">Common
Crawl</a>.</li>
<li><a href="https://nutrimatic.org/">Nutrimatic</a> looks up words or phrases mined from
Wikipedia. This allows you to, for example, find anagrams that form
natural-sounding phrases (e.g. <code><dictionaries></code> finds anagrams like <code>is a direction</code> or <code>i consider it a</code>, instead of anagrams that technically work
but are not natural-sounding, such as <code>ratio incised</code> or <code>tonic dairies</code>).</li>
<li>The <a href="https://www.datamuse.com/api/">Datamuse API</a> is a very expressive search
engine that sits on top of OneLook and RhymeZone. Unfortunately, there isn’t
a user-friendly frontend, so it’s effectively restricted to people who are
able to make use of programmatic access.</li>
</ul>
<p>Here, another shoutout goes to <a href="https://www.onelook.com/thesaurus/">OneLook
Thesaurus</a> and
<a href="https://www.quinapalus.com/qat.html">Qat</a>, which use several datasets (such as
the <a href="https://wordnet.princeton.edu/">Princeton WordNet</a> and Wikipedia category
lists) to search words based on their meaning. For example, in OneLook,
<code>process by which plants eat</code> gives you <code>photosynthesis</code> as the top result; in
Qat, <code>{hypo:color}</code> gives you words that mean “color”, such as <code>acrylic apricot blacken blueing</code>; also in Qat, <code>{hyper:agate}</code> gives you words that “agate”
means, such as <code>entity matter quartz</code>. These searches make it easy to find
synonyms, hypernyms, hyponyms and other related words.</p>
<h2 id="curated-dictionaries">Curated Dictionaries</h2>
<p>In the other direction are datasets that don’t <em>augment</em> dictionaries, but
rather <em>curate</em> them: their usefulness comes not just in what you <em>can</em> find in
them, but equally in what you <em>can’t</em>.</p>
<p>The most prevalent examples are wordlists and their cousins, seedlists. As far
as I can tell, these are more useful for American-style crosswords, where there
is a hard requirement for fully interlocking grids (and grid-filling
consequently is a more difficult and computer-assisted task).</p>
<p>Wordlists tend to be personalized by puzzle constructors, and you can find some
wordlists for sale, most notably <a href="https://www.xwordinfo.com/WordList">Jeff Chen’s Personal
List</a>. There are also several
freely-accessible ones such as <a href="https://www.spreadthewordlist.com/">spread the
word(list)</a>, <a href="https://github.com/Crossword-Nexus/collaborative-word-list">The Collaborative Word
List</a>, and <a href="https://peterbroda.me/crosswords/wordlist/">Peter
Broda’s wordlist</a>.</p>
<p>Other examples of curated dictionaries are simply lists of specific things.
One amazing example is the <a href="https://sites.google.com/view/expandedcrosswordnamedatabase/home">Expanded Crossword Name
Database</a>,
which contains the names of notable women and non-binary people, with an eye to
increasing their representation in crosswords. Aside from that, I’ve found
Wikipedia’s “listicles” to be very helpful (e.g. here’s a list of <a href="https://en.wikipedia.org/wiki/List_of_Native_Americans_of_the_United_States">notable
Native Americans of the United
States</a>).</p>
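<p>Those Wikipedia listicles are also available programmatically via the MediaWiki API, which makes it easy to turn a category into the seed of a themed wordlist. Here’s a minimal Python sketch; the <code>list=categorymembers</code> query and its <code>cmtitle</code>/<code>cmlimit</code> parameters are part of the documented API, while the category name is just an example.</p>

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def category_members_url(category, limit=50):
    """Build a MediaWiki API query for the members of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def category_members(category):
    """Fetch member page titles, e.g. for seeding a themed wordlist."""
    with urllib.request.urlopen(category_members_url(category)) as resp:
        data = json.load(resp)
    return [m["title"] for m in data["query"]["categorymembers"]]

print(category_members_url("Crosswords", limit=10))
```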
<h2 id="datasets-of-crosswords">Datasets of Crosswords</h2>
<p>Finally, let’s not neglect the most obvious thing: literal datasets of
crosswords! These datasets are significant works of crossword archivism,
since acquiring crosswords in bulk and structuring their contents requires
effort and cleaning that few are willing to do for such trivial data. (Fun
fact: according to <a href="https://cryptics.georgeho.org/static/documents/Selection_AppendixE_v2.pdf">this 2004 selection
guide</a>,
the Library of Congress explicitly does not collect crossword puzzles,
suggesting that they’re too trivial for the national library!)</p>
<ul>
<li><a href="https://www.xwordinfo.com/">XWord Info</a> is probably the dataset with the largest
following, as it covers <em>The New York Times’</em> crossword and is actively
maintained.</li>
<li>Among constructors of American-style crosswords, <a href="https://tiwwdty.com/clue/">Matt Ginsberg’s clue
dataset</a> is the go-to choice (since it’s free and
accessible to download), but it’s unfortunately no longer actively
maintained.</li>
<li><a href="https://xd.saul.pw/"><code>xd.saul.pw</code></a> is an excellent dataset of American-style
crosswords and clues from various publications that is also free and
accessible to download.</li>
<li>The <a href="https://www.cruciverb.com/data.php">Cruciverb database</a> is also a
dataset of American-style crosswords and clues, but unfortunately requires a
membership to access.</li>
<li>Finally, to plug my own dataset,
<a href="https://cryptics.georgeho.org/"><code>cryptics.georgeho.org</code></a> is a dataset of
cryptic clues, with auxiliary datasets of cryptic indicators and charades.</li>
</ul>Link Bulletin, April 2022https://www.georgeho.org/link-bulletin-2022-04/2022-04-23T00:00:00Z2022-04-23T00:00:00Z<p>This is the first of a new kind of blog post that I’m trying, on the theory
that the links I find interesting might also be interesting to others: at most
once a month, I’ll compile and post a handful of links that I’ve read and
thought about, with minimal explanation or commentary. Here we go!</p>
<ul>
<li><a href="https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf">Ninth Circuit Court Opinion, <em>hiQ Labs, Inc. v. Linkedin
Corporation</em></a></li>
<li><a href="https://www.courtlistener.com/docket/6071320/hiq-labs-inc-v-linkedin-corporation/">California District Court Case, <em>hiQ Labs, Inc. v. Linkedin
Corporation</em></a></li>
<li><a href="https://github.com/simonw/shot-scraper"><code>shot-scraper</code>: a tool for taking automated screenshots of
websites</a></li>
<li><a href="https://palewi.re/docs/news-homepages/"><code>news-homepages</code>: a bot that gathers, archives and shares screenshots of
news homepages</a></li>
<li><a href="https://crosshare.org/">Crosshare: a free, ad-free, and open-source place to create, share and solve
crossword puzzles</a></li>
</ul>How to Improve Your Static Site's Typographyhttps://www.georgeho.org/static-site-typography/2022-03-21T00:00:00Z2022-03-21T00:00:00Z<p>You’ve read that <a href="https://ia.net/topics/the-web-is-all-about-typography-period">web design is 95%
typography</a>. You
have a static website. You’ve wanted to improve its typography but have never
had the time or patience. You might’ve even heard of Butterick’s <a href="https://practicaltypography.com/"><em>Practical
Typography</em></a>. If this sounds like you, you’re
in luck!</p>
<p>A foreword: you can achieve almost everything I describe here by adding CSS in
a <code>&lt;style&gt;</code> tag at the end of your webpages’ <code>&lt;head&gt;</code>s, but the code snippets
I include here aren’t meant to be copypasta solutions, but illustrative
examples.</p>
<div>
<h2>Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#easy-wins">Easy Wins</a>
<ul>
<li><a href="#choose-a-font">Choose a font</a></li>
<li><a href="#adjust-the-line-width-and-point-size">Adjust the line width and point size</a></li>
<li><a href="#adjust-the-line-height">Adjust the line height</a></li>
</ul>
</li>
<li><a href="#low-hanging-fruit">Low-Hanging Fruit</a>
<ul>
<li><a href="#adjust-paragraph-and-header-spacing">Adjust paragraph and header spacing</a></li>
<li><a href="#choose-a-monospaced-font-and-display-font">Choose a monospaced font and display font</a></li>
<li><a href="#set-a-background-color">Set a background color</a></li>
</ul>
</li>
<li><a href="#braver-undertakings">Braver Undertakings</a>
<ul>
<li><a href="#format-code-blocks">Format code blocks</a></li>
<li><a href="#support-sidenotes">Support sidenotes</a></li>
</ul>
</li>
</ul>
</nav>
</div>
<h2 id="easy-wins">Easy Wins</h2>
<p>Body text — the text that forms the main content of your website — is the
most important part of your website. These three things largely determine how
your body text looks, and nailing them can immediately improve your website’s
typography.</p>
<h3 id="choose-a-font">Choose a font</h3>
<p>Many static sites default to system fonts<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>: that is, fonts that are likely
already installed on readers’ devices. This putatively boosts performance
(because readers need not download font files), and can give a more comfortable
look, since it can blend in with the fonts of the reader’s operating system.</p>
<p>However, many system fonts aren’t good, and many others have become hackneyed
<em>precisely because they are default fonts</em>. It’s also straightforward to use
custom webfonts or font hosting services like <a href="https://fonts.google.com/">Google
Fonts</a>.</p>
<p>Obviously you should do what you think is best for your website, but I’d point
out that <strong>changing your body font is an easy and effective way to upgrade your
typography and distinguish your writing from the sea of sans-serif on the
Internet.</strong> Live a little!</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#75715e">/* Use your own static font file(s).
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> You should have a font face for regular, bold and italics. */</span>
</span></span><span style="display:flex;"><span>@<span style="color:#66d9ef">font-face</span>{
</span></span><span style="display:flex;"><span> <span style="color:#f92672">font-family</span><span style="color:#f92672">:</span> <span style="color:#e6db74">"Fira Sans"</span><span style="color:#f92672">;</span>
</span></span><span style="display:flex;"><span> <span style="color:#f92672">src</span><span style="color:#f92672">:</span> <span style="color:#f92672">url</span><span style="color:#f92672">(</span><span style="color:#e6db74">"/assets/fonts/FiraSansRegular.woff2"</span><span style="color:#f92672">)</span> <span style="color:#f92672">format</span><span style="color:#f92672">(</span><span style="color:#e6db74">"woff2"</span><span style="color:#f92672">);</span>
</span></span><span style="display:flex;"><span> <span style="color:#f92672">font-style</span><span style="color:#f92672">:</span> <span style="color:#f92672">normal</span><span style="color:#f92672">;</span>
</span></span><span style="display:flex;"><span> <span style="color:#f92672">font-weight</span><span style="color:#f92672">:</span> <span style="color:#f92672">400</span><span style="color:#f92672">;</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">/* Fall back on system fonts. */</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">body</span> { <span style="color:#66d9ef">font-family</span>: <span style="color:#e6db74">"Fira Sans"</span>, Verdana, <span style="color:#66d9ef">sans-serif</span>; }
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#75715e">/* Alternatively, use a font hosting service like Google Fonts.
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"> Again, have a font face for regular, bold and italics. */</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;</span><span style="color:#f92672">link</span> <span style="color:#f92672">rel</span><span style="color:#f92672">=</span><span style="color:#e6db74">"preconnect"</span> <span style="color:#f92672">href</span><span style="color:#f92672">=</span><span style="color:#e6db74">"https://fonts.googleapis.com"</span><span style="color:#f92672">&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;</span><span style="color:#f92672">link</span> <span style="color:#f92672">rel</span><span style="color:#f92672">=</span><span style="color:#e6db74">"preconnect"</span> <span style="color:#f92672">href</span><span style="color:#f92672">=</span><span style="color:#e6db74">"https://fonts.gstatic.com"</span> <span style="color:#f92672">crossorigin</span><span style="color:#f92672">&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;</span><span style="color:#f92672">link</span> <span style="color:#f92672">href</span><span style="color:#f92672">=</span><span style="color:#e6db74">"https://fonts.googleapis.com/css2?family=Fira+Sans&amp;display=swap"</span> <span style="color:#f92672">rel</span><span style="color:#f92672">=</span><span style="color:#e6db74">"stylesheet"</span><span style="color:#f92672">&gt;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">/* Fall back on system fonts. */</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">body</span> { <span style="color:#66d9ef">font-family</span>: <span style="color:#e6db74">"Fira Sans"</span>, Verdana, <span style="color:#66d9ef">sans-serif</span>; }
</span></span></code></pre></div><h3 id="adjust-the-line-width-and-point-size">Adjust the line width and point size</h3>
<p>The ultimate goal is to control the <em>average number of characters per line:</em>
too many, and lines run on interminably; too few, and you force readers’ eyes
to dart uncomfortably back and forth. <strong>Aim to fit between two and three full
English alphabets per line.</strong></p>
<p>The twist is that this has to be done regardless of the screen size — most
obviously, it has to work on both desktop and mobile screens. This leads to the
concept of <em>fluid type</em>, which just means that the font size changes in response
to the screen width.</p>
<p>Try adjusting your window size (or rotating your phone) to see how the line
width and point size adjust to always fit between two and three alphabets in
the following paragraph:</p>
<p>abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz</p>
<p>CSS Tricks has an <a href="https://css-tricks.com/simplified-fluid-typography/">excellent
tutorial</a> for fluid type
with CSS, which boils down to clever use of <code>min</code>, <code>max</code> and <code>vw</code>: the font
size goes from 16px on a 320px screen to 22px on a 1000px screen.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#f92672">body</span> { <span style="color:#66d9ef">max-width</span>: <span style="color:#ae81ff">720</span><span style="color:#66d9ef">px</span>; }
</span></span><span style="display:flex;"><span><span style="color:#f92672">html</span> { <span style="color:#66d9ef">font-size</span>: <span style="color:#a6e22e">min</span>(<span style="color:#a6e22e">max</span>(<span style="color:#ae81ff">16</span><span style="color:#66d9ef">px</span>, <span style="color:#ae81ff">4</span><span style="color:#66d9ef">vw</span>), <span style="color:#ae81ff">22</span><span style="color:#66d9ef">px</span>); }
</span></span></code></pre></div><h3 id="adjust-the-line-height">Adjust the line height</h3>
<p>The goal is to control <em>how closely consecutive lines sit next to each other:</em>
too tightly and you get intimidating walls of text; too loosely and your text
becomes a vaporous jumble of lines. <strong>Aim to space lines between 120% and 145%
of the point size.</strong> (The text in this paragraph has a spacing of 145%. Just
right!)</p>
<p style="line-height:1.1">
The goal is to control <i>how closely consecutive lines sit next to each
other:</i> too tightly and you get intimidating walls of text; too loosely
and your text becomes a vaporous jumble of lines. <b>Aim to space lines
between 120% and 145% of the point size.</b> (The text in this paragraph has a
spacing of 110%. Too dense.)
</p>
<p style="line-height:1.6">
The goal is to control <i>how closely consecutive lines sit next to each
other:</i> too tightly and you get intimidating walls of text; too loosely
and your text becomes a vaporous jumble of lines. <b>Aim to space lines
between 120% and 145% of the point size.</b> (The text in this paragraph has a
spacing of 160%. Too sparse.)
</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#f92672">body</span> { <span style="color:#66d9ef">line-height</span>: <span style="color:#ae81ff">1.45</span>; }
</span></span></code></pre></div><h2 id="low-hanging-fruit">Low-Hanging Fruit</h2>
<h3 id="adjust-paragraph-and-header-spacing">Adjust paragraph and header spacing</h3>
<p>The goal is to <em>enclose related pieces of text (i.e. sections and paragraphs)
with whitespace.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></em> Done right, this presents readers with a structured and
scannable hierarchy of sections and paragraphs, instead of a soup of
equally-spaced lines.</p>
<p><strong>Aim for paragraph spacing that is just large enough to be easily noticed:</strong> a
space equal to 50–100% of the body text size usually suffices. <strong>Header spacing
is more of a judgement call.</strong> However, to quote <a href="https://practicaltypography.com/space-above-and-below.html">Matthew
Butterick</a>:</p>
<blockquote>
<p>Semantically, headings relate to the text that follows, not the text before.
Thus you’ll probably want the space below to be smaller than the space above
so the heading is visually closer to the text it introduces.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#f92672">p</span> { <span style="color:#66d9ef">margin-top</span>: <span style="color:#ae81ff">20</span><span style="color:#66d9ef">px</span>; <span style="color:#66d9ef">margin-bottom</span>: <span style="color:#ae81ff">20</span><span style="color:#66d9ef">px</span>; }
</span></span><span style="display:flex;"><span><span style="color:#f92672">h1</span><span style="color:#f92672">,</span> <span style="color:#f92672">h2</span><span style="color:#f92672">,</span> <span style="color:#f92672">h3</span><span style="color:#f92672">,</span> <span style="color:#f92672">h4</span><span style="color:#f92672">,</span> <span style="color:#f92672">h5</span><span style="color:#f92672">,</span> <span style="color:#f92672">h6</span> { <span style="color:#66d9ef">margin-top</span>: <span style="color:#ae81ff">8</span><span style="color:#66d9ef">%</span>; <span style="color:#66d9ef">margin-bottom</span>: <span style="color:#ae81ff">-1</span><span style="color:#66d9ef">%</span>; }
</span></span></code></pre></div><h3 id="choose-a-monospaced-font-and-display-font">Choose a monospaced font and display font</h3>
<p>Body text is the most important part of a website, so spend time making it look
good (you’ll notice that all three <a href="#easy-wins">Easy Wins</a> were for the body
text). Once you’ve done that though, consider more fonts.</p>
<p>Monospaced fonts (for code) let readers easily distinguish between prose and
code, and display fonts (for titles and headers) can have much more color and
character. <strong>Using a monospaced font can make technical, code-heavy text more
readable, and using a display font can lend your website personality.</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#f92672">h1</span><span style="color:#f92672">,</span> <span style="color:#f92672">h2</span><span style="color:#f92672">,</span> <span style="color:#f92672">h3</span><span style="color:#f92672">,</span> <span style="color:#f92672">h4</span><span style="color:#f92672">,</span> <span style="color:#f92672">h5</span><span style="color:#f92672">,</span> <span style="color:#f92672">h6</span> { <span style="color:#66d9ef">font-family</span>: Verdana, <span style="color:#66d9ef">sans-serif</span>; }
</span></span><span style="display:flex;"><span><span style="color:#f92672">code</span> { <span style="color:#66d9ef">font-family</span>: Consolas, <span style="color:#66d9ef">monospace</span>; }
</span></span></code></pre></div><h3 id="set-a-background-color">Set a background color</h3>
<p>(This will involve some aesthetic redesign for your website, which is why it
isn’t higher on the list.)</p>
<p>High contrast between text and background is good for legibility, but the
contrast between pure white (<code>#ffffff</code>) and pure black (<code>#000000</code>) can look
harsh and unsettling. <strong>Web pages are better served by off-white and off-black
backgrounds</strong>, which are easier on the eyes while still retaining high
contrast. <a href="https://edwardtufte.github.io/tufte-css/">Tufte CSS</a> suggests
<code>#fffff8</code> and <code>#111111</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#75715e">/* If the reader prefers dark mode, use off-black instead of off-white. */</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">body</span> { <span style="color:#66d9ef">background-color</span>: <span style="color:#ae81ff">#fffff8</span>; }
</span></span><span style="display:flex;"><span>@<span style="color:#66d9ef">media</span> <span style="color:#f92672">(</span><span style="color:#f92672">prefers-color-scheme</span><span style="color:#f92672">:</span> <span style="color:#f92672">dark</span><span style="color:#f92672">)</span> { <span style="color:#f92672">body</span> { <span style="color:#66d9ef">background-color</span>: <span style="color:#ae81ff">#111111</span>; } }
</span></span></code></pre></div><h2 id="braver-undertakings">Braver Undertakings</h2>
<h3 id="format-code-blocks">Format code blocks</h3>
<p>If you’re unlucky enough to know something about programming and noisy enough
to want to blog about it (both of which are unfortunately quite likely, if
you’re reading this), then <strong>you probably want your code blocks to look good.</strong></p>
<p>CSS Tricks has <a href="https://css-tricks.com/considerations-styling-pre-tag/">a fantastic tutorial on how to style <code>&lt;pre&gt;&lt;code&gt;</code>
blocks</a>, which walks
through code wrapping, code block auto-expansion, syntax highlighting and space
control.</p>
<p>Frustratingly, there was <a href="https://stackoverflow.com/a/22417120/13372802">one bug that drove me up the
wall</a>, in which some lines of
code had their font size increased for seemingly no reason:</p>
<blockquote>
<p>WebKit has the annoying behavior (for a properly designed responsive site) of
trying to enlarge the font size for the “primary” text on the screen, where
primary is simply its best guess.</p>
</blockquote>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-css" data-lang="css"><span style="display:flex;"><span><span style="color:#f92672">pre</span> <span style="color:#f92672">code</span> {
</span></span><span style="display:flex;"><span> <span style="color:#75715e">/* Don't wrap long lines, force horizontal scrolling. */</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">white-space</span>: <span style="color:#66d9ef">pre</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">overflow-x</span>: <span style="color:#66d9ef">auto</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e">/* https://stackoverflow.com/a/22417120/13372802 */</span>
</span></span><span style="display:flex;"><span> text-size-adjust: <span style="color:#ae81ff">100</span><span style="color:#66d9ef">%</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">-ms-</span>text-size-adjust: <span style="color:#ae81ff">100</span><span style="color:#66d9ef">%</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">-moz-</span>text-size-adjust: <span style="color:#ae81ff">100</span><span style="color:#66d9ef">%</span>;
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">-webkit-</span>text-size-adjust: <span style="color:#ae81ff">100</span><span style="color:#66d9ef">%</span>;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h3 id="support-sidenotes">Support sidenotes</h3>
<p><em>Sidenotes</em> are when footnotes are placed in the margins beside the text they
reference, instead of at the end of the page. They allow readers to instantly
read annotations instead of having to constantly click or scroll to and fro.
<strong>Sidenotes greatly improve footnotes for the web, but are fairly difficult to
implement despite recent efforts.</strong></p>
<p>Gwern has compiled <a href="https://www.gwern.net/Sidenotes">an exhaustive bibliography of sidenote
implementations</a>, which I recommend skimming
over before turning to <a href="https://edwardtufte.github.io/tufte-css/">Tufte CSS</a>
for a simpler implementation.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Yeah I know, I’m interchanging <em>font</em> and <em>typeface</em>, but at least I have
a life. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>Graphic designers may call this <em>active whitespace:</em> whitespace
deliberately added for the sake of emphasis or structure. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Migrating to Hugohttps://www.georgeho.org/migrating-to-hugo/2022-03-05T00:00:00Z2022-03-05T00:00:00Z<center><img src="https://www.georgeho.org/assets/images/blog-rewrite-meme.png"></center>
<p>This weekend I migrated my blog to Hugo.</p>
<p>My website is now based on the <a href="https://github.com/janraasch/hugo-bearblog/">Hugo Bear Blog
theme</a>, generated with
<a href="https://gohugo.io/">Hugo</a>, hosted by <a href="https://pages.github.com/">GitHub Pages</a>
and served with <a href="https://www.cloudflare.com/">Cloudflare</a>. I’ve also migrated
from the <code>eigenfoo.xyz</code> domain to the more creditable-sounding <code>georgeho.org</code>
(sadly, <code>georgeho.com</code> and <code>georgeho.net</code> were already taken). In terms of
typography, the header typeface is <a href="https://www.1001freefonts.com/nicholson-gothic.font">Nicholson
Gothic</a>, the body typeface
is <a href="https://mbtype.com/fonts/equity/">Equity</a> and the monospaced typeface for
occasional code snippets is <a href="https://mbtype.com/fonts/triplicate/">Triplicate</a>.
In all, I probably spend the equivalent of two fancy lattes a year for this
setup.</p>
<h2 id="why-hugo-why-not-jekyll">Why Hugo? Why Not Jekyll?</h2>
<p>Honestly, no good reason! <a href="https://vickiboykis.com/2022/01/08/migrating-to-hugo/">Some people point
out</a> that Jekyll is not
actively maintained or used anymore, and that GitHub Pages doesn’t support
Jekyll 4.0. However, those aren’t really good enough reasons for migrating a
blogging stack.</p>
<p>Here’s a short list of things I like about Hugo over Jekyll — but again, none
of these things really should have enticed me to make the jump.</p>
<ul>
<li>Ease of installation and use (Hugo is a binary executable instead of a Ruby
library), and it was very easy to make changes to the theme (e.g. changing
the font or <a href="https://practicaltypography.com/line-length.html">increasing the font
size</a>) — although that
could just be because <a href="https://github.com/janraasch/hugo-bearblog/">the theme that I’m
using</a> is dead simple.</li>
<li>Automatic generation of a <a href="https://www.georgeho.org/sitemap.xml">sitemap</a> and <a href="https://www.georgeho.org/feed.xml">RSS feed</a>
— with Jekyll, these needed to be done manually (or by your theme).</li>
<li>Typographical conveniences like automatic <a href="https://practicaltypography.com/straight-and-curly-quotes.html">smart
quotes</a>,
rendering <code>-</code>, <code>--</code> and <code>---</code> into <a href="https://practicaltypography.com/hyphens-and-dashes.html">the appropriate hyphen or
dash</a>, and <code>...</code>
into <a href="https://practicaltypography.com/ellipses.html">an ellipsis</a>.</li>
<li>Faster builds of my website… although this isn’t really that helpful for
me, since my blog barely has a few dozen pages.</li>
</ul>
<h2 id="the-migration">The Migration</h2>
<p>…was surprisingly painless! All I <em>really</em> needed to do was to <a href="https://themes.gohugo.io/">pick out a
theme</a>, follow the <a href="https://gohugo.io/getting-started/quick-start/">Hugo Quick
Start</a>, dump my Markdown blog
posts into the <code>content/</code> directory and change some of the YAML front matter in
all of my blog posts.</p>
<p>In reality, I spent a few extra hours fiddling with the typography and making
sure that all my links were backward-compatible with my previous website.</p>
<h2 id="pollen">Pollen</h2>
<p>This is actually not the first time I tried to rewrite my website: earlier this
year I experimented with writing a
<a href="https://edwardtufte.github.io/tufte-css/">Tufte-inspired</a> blog using
<a href="https://pollenpub.com">Pollen</a>. For those unfamiliar, it’s like R Markdown (in
that it’s a markup language that allows arbitrary R code to be embedded in it),
but instead of R, it’s <a href="https://racket-lang.org/">Racket</a>, and instead of
Markdown, it’s your own domain-specific markup language that you build with
Racket.</p>
<p>This means that I wrote a custom language specifically for formatting
Tufte-style two-column blog posts. It actually worked out pretty well (and the
resulting blog posts looked <em>damn good</em>), but I couldn’t justify maintaining my
own language specifically for writing blog posts. I’d probably recommend using
Pollen for large, one-off pieces of writing (like a book), instead of small,
recurring pieces of writing (like a blog).</p>Data Collection is Hard. You Should Try It.https://www.georgeho.org/data-collection-is-hard/2022-03-03T00:00:00Z2022-03-03T00:00:00Z<p>For people who make careers out of data, data scientists don’t have <em>nearly</em>
enough experience in data collection — and many data scientists don’t seem to
feel much cognitive dissonance from this fact, despite (very persuasive!)
<a href="https://counting.substack.com/p/go-collect-some-and-data">overtures by a few valiant data
professionals</a><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>With this blog post I want to give a defense of data collection — not as an
activity that’s inherently worth pursuing (I assume data professionals
don’t need to be convinced of that!), but as something that is worth doing even
for <em>selfish</em> reasons. Why should you spend time learning about that data
collection system that’s being maintained by that other team at work? Why
should you consider collecting some data for your next side project? <em>What’s
in it for you?</em></p>
<p>Throughout this blog post, I’ll be making comparisons to a recent project of
mine, <a href="https://cryptics.georgeho.org/"><code>cryptics.georgeho.org</code></a>, a dataset of
cryptic crossword clues which I created and published last year.</p>
<h2 id="learn-data-adjacent-technologies">Learn Data-Adjacent Technologies</h2>
<p>The most obvious reason is that <strong>collecting data is a unique opportunity to
learn many staple technologies in data</strong> — and there aren’t many projects
that run the entire data tech stack.</p>
<p>To enumerate these technologies:</p>
<ol>
<li>Compute services
<ul>
<li>Your data collection pipelines will need to run somewhere. Will that be in
the cloud, or on your local computer? How do you think about trading off
cost, compute and convenience?</li>
<li>I ran most of my web scraping on DigitalOcean Droplets, but I could just
as easily have taken the opportunity to learn more about cloud compute
solutions like AWS EC2 or serverless functions like AWS Lambda. These days, the
project runs incremental scrapes entirely on my laptop.</li>
</ul>
</li>
<li>Data storage
<ul>
<li>You’ll need to store your data somewhere, whether it be a relational or
NoSQL database, or just flat files. Since your data will outlive any code
you write, careful design of the data storage solution and schema will pay
dividends in the long run.</li>
<li>I used SQLite for its simplicity and performance. However, as the scope of
the project expanded, I had to redesign the schema multiple times, which
was painful.</li>
</ul>
</li>
<li>Labeling, annotation or other data transformations
<ul>
<li>After collecting your data, you may want to label, annotate, structure or
otherwise transform your data. For example, perhaps you’ll want to pull
structured tabular data out of unstructured PDFs or HTML tag soups;
another example might be to have humans label the data.</li>
<li>This is the main “value-add” of your dataset — while the time and effort
required to collect and store the data constitute a moat, ultimately what
will distinguish your dataset to <em>users</em> will be the transformations done
here.</li>
<li>For me, this involved a lot of <code>BeautifulSoup</code> to parse structured data
out of HTML pages. This required a <a href="https://cryptics.georgeho.org/datasheet#collection-process">significant amount of development and
engineering
effort</a>.</li>
</ul>
</li>
<li>Data licensing and copyright
<ul>
<li>Once you have your dataset, can you license, share or even sell your data?
The legality of data is a huge grey area (especially if there’s web
scraping involved), and while navigating these waters will be tricky, it’s
instructive to learn about it.</li>
<li>I felt that the collection and structuring of cryptic crossword clues for
academic/archival purposes was fair use, and so I didn’t worry too much
about the legality of my project — but it was an educational rabbit hole
to fall down!</li>
</ul>
</li>
<li>Sharing and publishing data
<ul>
<li>The legal nuances of data aside, the technical problem of sharing data is
pretty tricky!</li>
<li>This problem sits at the intersection of MLOps and information design: you
want to share the data in a standardized way, while having an interface
that makes it easy for users to explore your data. Serving a tarball on a
web server technically works, but leaves so much on the table.</li>
<li><code>cryptics.georgeho.org</code> uses <a href="https://datasette.io/">Datasette</a>, which I
can’t recommend highly enough.</li>
</ul>
</li>
<li>Writing documentation
<ul>
<li>If you think it’s hard to write and maintain good documentation for
software, imagine how difficult it must be to do the same for data, which
outlives software and is much harder to both create and version control.</li>
<li>I’ve found <a href="https://arxiv.org/abs/1803.09010">Gebru et al.’s <em>Datasheets for
Datasets</em></a> to be an excellent template
for documenting data.</li>
</ul>
</li>
</ol>
<h2 id="design-a-data-collection-system">Design a Data Collection System</h2>
<p>Hopefully by now you can appreciate that every part of the data collection
pipeline involves not just technical proficiency with some system or framework,
but also an element of sound architecture.</p>
<p><strong>Collecting data is a great way to get experience designing an entire data
pipeline from end to end, from creation to delivery.</strong> This kind of opportunity
doesn’t come easily (even in industry!), and while your data pipeline won’t be
as sophisticated as the kinds you’ll find at data companies, you’ll still be
able to take away some valuable lessons from it.</p>
<p>For <code>cryptics.georgeho.org</code>, I found that the most valuable pattern for storing
data was to dump raw and unstructured data into a database (a “data lake”), and
then extract useful and structured data into a separate database (a “data
warehouse”). I also learnt that the historical backfilling ETL job required a
lot of time and compute, but subsequent incremental ETL jobs could just run off
of my laptop. These best practice patterns around data collection and
management are all applicable far beyond my simple side project, and were
valuable lessons to learn first-hand.</p>
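<p>A minimal sketch of this lake/warehouse pattern, with in-memory databases and invented names standing in for the real thing:</p>

```python
import json
import sqlite3

# Sketch of the raw "data lake" vs. structured "data warehouse" split, with
# in-memory databases and invented names standing in for the real thing.
lake = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# The lake stores whatever the scraper saw, untouched.
lake.execute("CREATE TABLE raw (url TEXT, payload TEXT)")
lake.execute(
    "INSERT INTO raw VALUES (?, ?)",
    (
        "https://example.com/p1",
        json.dumps({"clue": "Repeat bird (6)", "answer": "PARROT"}),
    ),
)

# The warehouse stores only the structured rows the ETL step extracts; this
# loop is the (incremental) ETL job.
warehouse.execute("CREATE TABLE clues (url TEXT, clue TEXT, answer TEXT)")
for url, payload in lake.execute("SELECT url, payload FROM raw").fetchall():
    record = json.loads(payload)
    warehouse.execute(
        "INSERT INTO clues VALUES (?, ?, ?)", (url, record["clue"], record["answer"])
    )
warehouse.commit()

print(warehouse.execute("SELECT clue, answer FROM clues").fetchall())
```

<p>The split keeps the two concerns separate: the lake only ever grows, while the warehouse can be rebuilt at any time by rerunning the extraction.</p>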
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Puzzlingly, this trend doesn’t seem to be true of other forms of
unglamorous data work like data cleaning, where people generally accept that
<a href="https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt">data cleaning is not grunt
work</a>. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Streaming Data with Tornado and WebSocketshttps://www.georgeho.org/tornado-websockets/2021-10-05T00:00:00Z2021-10-05T00:00:00Z<p>A lot of data science and machine learning practice assumes a static dataset,
maybe with some MLOps tooling for rerunning a model pipeline with the freshest
version of the dataset.</p>
<p>Working with streaming data is an entirely different ball game, and it wasn’t
clear to me what tools a data scientist might reach for when dealing with
streaming data<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>I recently came across a pretty straightforward and robust solution:
<a href="https://datatracker.ietf.org/doc/html/rfc6455">WebSockets</a> and
<a href="https://www.tornadoweb.org/en/stable/">Tornado</a>. Tornado is a Python web
framework with strong support for asynchronous networking. WebSockets are a
way for two processes (or apps) to communicate with each other (similar to HTTP
requests with REST endpoints). Of course, Tornado has pretty good support for
WebSockets as well.</p>
<p>In this blog post I’ll give a minimal example of using Tornado and WebSockets
to handle streaming data. The toy example I have is one app (<code>server.py</code>)
writing samples from a Bernoulli to a WebSocket, and another app (<code>client.py</code>)
listening to the WebSocket and keeping track of the posterior distribution for
a <a href="https://www.georgeho.org/bayesian-bandits/">Beta-Binomial conjugate model</a>.
After walking through the code, I’ll discuss these tools, and why they’re good
choices for working with streaming data.</p>
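<p>The Beta-Binomial bookkeeping involved is small enough to show up front: under a Beta(α, β) prior, observing <em>s</em> successes in <em>n</em> Bernoulli trials gives a Beta(α + s, β + n - s) posterior. A sketch (the Beta(2, 2) prior matches the client code; the sample list is made up):</p>

```python
# Beta-Binomial conjugate update, in isolation. The Beta(2, 2) prior matches
# the client code; the sample list is made up.
prior_alpha, prior_beta = 2, 2
samples = [1, 1, 0, 1]  # pretend these arrived one at a time over a WebSocket

num_successes, num_trials = sum(samples), len(samples)
alpha = prior_alpha + num_successes
beta = prior_beta + num_trials - num_successes
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)  # 5 3 0.625
```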
<p>For another tutorial on this same topic, you can check out <a href="https://en.proft.me/2014/05/16/realtime-web-application-tornado-and-websocket/"><code>proft</code>’s blog
post</a>.</p>
<h2 id="server">Server</h2>
<ul>
<li>When <code>WebSocketServer</code> is registered to a REST endpoint (in <code>main</code>), it keeps
track of any clients that are listening to that endpoint, and pushes
messages to them when <code>send_message</code> is called.
<ul>
<li>Note that <code>clients</code> is a class variable, so <code>send_message</code> is a class
method.</li>
<li>This class could be extended to also listen to the endpoint, instead of
just blindly pushing messages out — after all, WebSockets allow for
bidirectional data flow.</li>
</ul>
</li>
<li>The <code>RandomBernoulli</code> and <code>PeriodicCallback</code> make a pretty crude example, but
you could write a class that transmits data in real-time to suit your use
case. For example, you could watch a file for any modifications using
<a href="https://pythonhosted.org/watchdog/"><code>watchdog</code></a>, and dump the changes into
the WebSocket.</li>
<li>The <a href="https://www.tornadoweb.org/en/stable/web.html?highlight=websocket_ping#tornado.web.Application.settings"><code>websocket_ping_interval</code> and <code>websocket_ping_timeout</code> arguments to
<code>tornado.Application</code></a>
configure periodic pings of WebSocket connections, keeping connections alive
and allowing dropped connections to be detected and closed.</li>
<li>It’s also worth noting that there’s a
<a href="https://www.tornadoweb.org/en/stable/websocket.html?highlight=websocket_max_message_size#tornado.websocket.WebSocketHandler"><code>tornado.websocket.WebSocketHandler.websocket_max_message_size</code></a>
attribute. While this is set to a generous 10 MiB, it’s important that the
WebSocket messages don’t exceed this limit!</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#e6db74">""" Every 100ms, sample from a Bernoulli and write the value to a WebSocket. """</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> random
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> tornado.ioloop
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> tornado.web
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> tornado.websocket
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">WebSocketServer</span>(tornado<span style="color:#f92672">.</span>websocket<span style="color:#f92672">.</span>WebSocketHandler):
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"""Simple WebSocket handler to serve clients."""</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Note that `clients` is a class variable and `send_message` is a</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># classmethod.</span>
</span></span><span style="display:flex;"><span> clients <span style="color:#f92672">=</span> set()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">open</span>(self):
</span></span><span style="display:flex;"><span> WebSocketServer<span style="color:#f92672">.</span>clients<span style="color:#f92672">.</span>add(self)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">on_close</span>(self):
</span></span><span style="display:flex;"><span> WebSocketServer<span style="color:#f92672">.</span>clients<span style="color:#f92672">.</span>remove(self)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#a6e22e">@classmethod</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">send_message</span>(cls, message: str):
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">f</span><span style="color:#e6db74">"Sending message </span><span style="color:#e6db74">{</span>message<span style="color:#e6db74">}</span><span style="color:#e6db74"> to </span><span style="color:#e6db74">{</span>len(cls<span style="color:#f92672">.</span>clients)<span style="color:#e6db74">}</span><span style="color:#e6db74"> client(s)."</span>)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> client <span style="color:#f92672">in</span> cls<span style="color:#f92672">.</span>clients:
</span></span><span style="display:flex;"><span> client<span style="color:#f92672">.</span>write_message(message)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">RandomBernoulli</span>:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> __init__(self):
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>p <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.72</span>
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">f</span><span style="color:#e6db74">"True p = </span><span style="color:#e6db74">{</span>self<span style="color:#f92672">.</span>p<span style="color:#e6db74">}</span><span style="color:#e6db74">"</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">sample</span>(self):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> int(random<span style="color:#f92672">.</span>uniform(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>) <span style="color:#f92672"><=</span> self<span style="color:#f92672">.</span>p)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">main</span>():
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Create a web app whose only endpoint is a WebSocket, and start the web</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># app on port 8888.</span>
</span></span><span style="display:flex;"><span> app <span style="color:#f92672">=</span> tornado<span style="color:#f92672">.</span>web<span style="color:#f92672">.</span>Application(
</span></span><span style="display:flex;"><span> [(<span style="color:#e6db74">r</span><span style="color:#e6db74">"/websocket/"</span>, WebSocketServer)],
</span></span><span style="display:flex;"><span> websocket_ping_interval<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span> websocket_ping_timeout<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>,
</span></span><span style="display:flex;"><span> )
</span></span><span style="display:flex;"><span> app<span style="color:#f92672">.</span>listen(<span style="color:#ae81ff">8888</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Create an event loop (what Tornado calls an IOLoop).</span>
</span></span><span style="display:flex;"><span> io_loop <span style="color:#f92672">=</span> tornado<span style="color:#f92672">.</span>ioloop<span style="color:#f92672">.</span>IOLoop<span style="color:#f92672">.</span>current()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Before starting the event loop, instantiate a RandomBernoulli and</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># register a periodic callback to write a sampled value to the WebSocket</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># every 100ms.</span>
</span></span><span style="display:flex;"><span> random_bernoulli <span style="color:#f92672">=</span> RandomBernoulli()
</span></span><span style="display:flex;"><span> periodic_callback <span style="color:#f92672">=</span> tornado<span style="color:#f92672">.</span>ioloop<span style="color:#f92672">.</span>PeriodicCallback(
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">lambda</span>: WebSocketServer<span style="color:#f92672">.</span>send_message(str(random_bernoulli<span style="color:#f92672">.</span>sample())), <span style="color:#ae81ff">100</span>
</span></span><span style="display:flex;"><span> )
</span></span><span style="display:flex;"><span> periodic_callback<span style="color:#f92672">.</span>start()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Start the event loop.</span>
</span></span><span style="display:flex;"><span> io_loop<span style="color:#f92672">.</span>start()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">"__main__"</span>:
</span></span><span style="display:flex;"><span> main()
</span></span></code></pre></div><h2 id="client">Client</h2>
<ul>
<li><code>WebSocketClient</code> is a class that:
<ol>
<li>Can be <code>start</code>ed and <code>stop</code>ped to connect/disconnect to the WebSocket and
start/stop listening to it in a separate thread</li>
<li>Can process every message (<code>on_message</code>) it hears from the WebSocket: in
this case it simply maintains <a href="https://www.georgeho.org/bayesian-bandits/#stochastic-aka-stationary-bandits">a count of the number of trials and
successes</a>,
but this processing could theoretically be anything. For example, you
could do some further processing of the message and then dump that into a
separate WebSocket for other apps (or even users!) to subscribe to.</li>
</ol>
</li>
<li>To connect to the WebSocket, we need to use a WebSocket library: thankfully
Tornado has a built-in WebSocket functionality (<code>tornado.websocket</code>), but
we’re also free to use other libraries such as the creatively named
<a href="https://github.com/aaugustin/websockets"><code>websockets</code></a> or
<a href="https://github.com/websocket-client/websocket-client"><code>websocket-client</code></a>.</li>
<li>Note that we run <code>on_message</code> on the same thread as we run
<code>connect_and_read</code>. This isn’t a problem so long as <code>on_message</code> is fast
enough, but a potentially wiser choice would be to offload <code>connect_and_read</code>
to a separate thread by instantiating a
<a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor"><code>concurrent.futures.ThreadPoolExecutor</code></a>
and calling
<a href="https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.run_in_executor"><code>tornado.ioloop.IOLoop.run_in_executor</code></a>,
so as not to block the thread where the <code>on_message</code> processing happens.</li>
<li>The <code>io_loop</code> instantiated in <code>main</code> (as well as in <code>server.py</code>) is
important: it’s how Tornado schedules tasks (a.k.a. <em>callbacks</em>) for delayed
(a.k.a. <em>asynchronous</em>) execution. To add a callback, we simply call
<code>io_loop.add_callback()</code>.</li>
<li>The <a href="https://www.tornadoweb.org/en/stable/websocket.html?highlight=ping_#tornado.websocket.websocket_connect"><code>ping_interval</code> and <code>ping_timeout</code> arguments to
<code>websocket_connect</code></a>
configure periodic pings of the WebSocket connection, keeping connections
alive and allowing dropped connections to be detected and closed.</li>
<li>The <code>callback=self.maybe_retry_connection</code> is <a href="https://github.com/tornadoweb/tornado/blob/1db5b45918da8303d2c6958ee03dbbd5dc2709e9/tornado/websocket.py#L1654-L1655">run on a future
<code>WebSocketClientConnection</code></a>.
<code>websocket_connect</code> doesn’t actually establish the connection directly, but
rather returns a future. Hence, we try to get the <code>future.result()</code> itself
(i.e. the WebSocket client connection) — I don’t actually do anything with
<code>self.connection</code>, but you could if you wanted. In the event of an
exception while doing that, we assume there’s a problem with the WebSocket
connection and retry <code>connect_and_read</code> after 3 seconds. This all has the
effect of recovering gracefully if the WebSocket is dropped or <code>server.py</code>
experiences a brief outage for whatever reason (both of which are probably
inevitable for long-running apps using WebSockets).</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#e6db74">""" Stream data from the WebSocket and update the Beta posterior parameters online. """</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> tornado.ioloop
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> tornado.websocket
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">WebSocketClient</span>:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> __init__(self, io_loop):
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>connection <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>io_loop <span style="color:#f92672">=</span> io_loop
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>num_successes <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>num_trials <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">start</span>(self):
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>connect_and_read()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">stop</span>(self):
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>io_loop<span style="color:#f92672">.</span>stop()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">connect_and_read</span>(self):
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">"Reading..."</span>)
</span></span><span style="display:flex;"><span> tornado<span style="color:#f92672">.</span>websocket<span style="color:#f92672">.</span>websocket_connect(
</span></span><span style="display:flex;"><span>            url<span style="color:#f92672">=</span><span style="color:#e6db74">"ws://localhost:8888/websocket/"</span>,
</span></span><span style="display:flex;"><span> callback<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>maybe_retry_connection,
</span></span><span style="display:flex;"><span> on_message_callback<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>on_message,
</span></span><span style="display:flex;"><span> ping_interval<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span> ping_timeout<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>,
</span></span><span style="display:flex;"><span> )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">maybe_retry_connection</span>(self, future) <span style="color:#f92672">-></span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">try</span>:
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>connection <span style="color:#f92672">=</span> future<span style="color:#f92672">.</span>result()
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">except</span> Exception:
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">"Could not reconnect, retrying in 3 seconds..."</span>)
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>io_loop<span style="color:#f92672">.</span>call_later(<span style="color:#ae81ff">3</span>, self<span style="color:#f92672">.</span>connect_and_read)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">on_message</span>(self, message):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">if</span> message <span style="color:#f92672">is</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">"Disconnected, reconnecting..."</span>)
</span></span><span style="display:flex;"><span>            self<span style="color:#f92672">.</span>connect_and_read()
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> message <span style="color:#f92672">=</span> int(message)
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>num_successes <span style="color:#f92672">+=</span> message
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>num_trials <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> alpha <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> self<span style="color:#f92672">.</span>num_successes
</span></span><span style="display:flex;"><span> beta <span style="color:#f92672">=</span> <span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> self<span style="color:#f92672">.</span>num_trials <span style="color:#f92672">-</span> self<span style="color:#f92672">.</span>num_successes
</span></span><span style="display:flex;"><span> mean <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>num_successes <span style="color:#f92672">/</span> self<span style="color:#f92672">.</span>num_trials
</span></span><span style="display:flex;"><span> print(<span style="color:#e6db74">f</span><span style="color:#e6db74">"α = </span><span style="color:#e6db74">{</span>alpha<span style="color:#e6db74">}</span><span style="color:#e6db74">; β = </span><span style="color:#e6db74">{</span>beta<span style="color:#e6db74">}</span><span style="color:#e6db74">; mean = </span><span style="color:#e6db74">{</span>mean<span style="color:#e6db74">}</span><span style="color:#e6db74">"</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">main</span>():
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Create an event loop (what Tornado calls an IOLoop).</span>
</span></span><span style="display:flex;"><span> io_loop <span style="color:#f92672">=</span> tornado<span style="color:#f92672">.</span>ioloop<span style="color:#f92672">.</span>IOLoop<span style="color:#f92672">.</span>current()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Before starting the event loop, instantiate a WebSocketClient and add a</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># callback to the event loop to start it. This way the first thing the</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># event loop does is to start the client.</span>
</span></span><span style="display:flex;"><span> client <span style="color:#f92672">=</span> WebSocketClient(io_loop)
</span></span><span style="display:flex;"><span> io_loop<span style="color:#f92672">.</span>add_callback(client<span style="color:#f92672">.</span>start)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Start the event loop.</span>
</span></span><span style="display:flex;"><span> io_loop<span style="color:#f92672">.</span>start()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">"__main__"</span>:
</span></span><span style="display:flex;"><span> main()
</span></span></code></pre></div><h2 id="why-tornado">Why Tornado?</h2>
<p>Tornado is a Python web framework, but unlike the more popular Python web
frameworks like <a href="https://flask.palletsprojects.com/">Flask</a> or
<a href="https://www.djangoproject.com/">Django</a>, it has strong support for
<a href="https://www.tornadoweb.org/en/stable/guide/async.html#blocking">asynchronous networking and non-blocking
calls</a> —
essentially, Tornado apps have one (single-threaded) event loop
(<code>tornado.ioloop.IOLoop</code>), which handles all requests asynchronously,
dispatching incoming requests to the relevant non-blocking function as the
request comes in. As far as I know, Tornado is the only Python web framework
that does this.</p>
<p>As an aside, Tornado seems to be <a href="https://thehftguy.com/2020/10/27/my-experience-in-production-with-flask-bottle-tornado-and-twisted/">more popular in
finance</a>,
where streaming real-time data (e.g. market data) is very common.</p>
<h2 id="why-websockets">Why WebSockets?</h2>
<p>A sharper question might be, why WebSockets over HTTP requests to a REST
endpoint? After all, both theoretically allow a client to stream data in
real-time from a server.</p>
<p><a href="https://stackoverflow.com/a/45464306">A lot can be said</a> when comparing
WebSockets and RESTful services, but I think the main points are accurately
summarized by <a href="https://www.baeldung.com/rest-vs-websockets#usage">Kumar Chandrakant on
Baeldung</a>:</p>
<blockquote>
<p>[A] WebSocket is more suitable for cases where a push-based and real-time
communication defines the requirement more appropriately. Additionally,
WebSocket works well for scenarios where a message needs to be pushed to
multiple clients simultaneously. These are the cases where client and server
communication over RESTful services will find it difficult if not prohibitive.</p>
</blockquote>
<p>Tangentially, there’s one alternative that seems better than WebSockets
from a protocol standpoint, but unfortunately doesn’t seem to have support
from many Python web frameworks: <a href="https://www.smashingmagazine.com/2018/02/sse-websockets-data-flow-http2/">Server-Sent Events (a.k.a.
SSE)</a>,
a cleaner protocol for unidirectional data flow, which is really
all that we need.</p>
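<p>Part of SSE’s appeal is that its wire format is trivial: an event is a few <code>field: value</code> lines terminated by a blank line. A framework-agnostic sketch of a formatter (the function name is mine, not any library’s):</p>

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events message: "field: value" lines ended by a blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads become repeated "data:" fields, per the SSE spec.
    lines.extend(f"data: {line}" for line in data.splitlines())
    return "\n".join(lines) + "\n\n"


print(repr(format_sse("1", event="sample")))
```

<p>A handler would write successive such messages to a response served with <code>Content-Type: text/event-stream</code>, flushing after each one.</p>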
<p>Additionally, <a href="https://lucumr.pocoo.org/2012/9/24/websockets-101/">Armin
Ronacher</a> has a much
starker view of WebSockets, seeing no value in using WebSockets over TCP/IP
sockets for this application:</p>
<blockquote>
<p>Websockets make you sad. […] Websockets are complex, way more complex than I
anticipated. I can understand that they work that way but I definitely don’t
see a value in using websockets instead of regular TCP connections if all you
want is to exchange data between different endpoints and neither is a browser.</p>
</blockquote>
<p>My takeaway from these criticisms is that WebSockets are perhaps not the
ideal technology for handling streaming data (from a maintainability or
architectural point of view), but that doesn’t mean they aren’t a good,
scalable choice when they do work.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>There is <a href="https://sqlstream.com/real-time-vs-streaming-a-short-explanation/">technically a difference</a> between “real-time” and “streaming”: “real-time” refers to data that comes in as it is created, whereas “streaming” refers to a system that processes data continuously. You stream your TV show from Netflix, but since the show was created long before you watched it, you aren’t viewing it in real-time. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Joining Flatiron Healthhttps://www.georgeho.org/joining-flatiron/2021-09-24T00:00:00Z2021-09-24T00:00:00Z<p>An exciting professional update: today is my last day at
<a href="https://www.point72.com/">Point72</a> and next month I’ll be joining <a href="https://flatiron.com/">Flatiron
Health</a> as a data scientist on their machine learning
team! I’ll be working out of their SoHo offices, and will continue to be based
in New York.</p>
<p><img src="https://www.georgeho.org/assets/images/flatiron-logo.png" alt="Flatiron Health logo"></p>
<p>Flatiron Health is a technology company in the healthcare space, trying to
accelerate oncology research and improve quality of cancer care through data
analytics.</p>
<p>The past two years have been terrific, but needless to say, I’m looking forward
to adventures ahead! ⛵</p>`cryptics.georgeho.org` — A Dataset of Cryptic Crossword Clueshttps://www.georgeho.org/cryptic-clues/2021-09-11T00:00:00Z2021-09-11T00:00:00Z<p><code>cryptics.georgeho.org</code> is a dataset of cryptic crossword clues<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, collected
from various blogs and publicly available digital archives. I originally
started this project to practice my web scraping and data engineering skills,
but as it’s evolved I hope it can be a resource to solvers and constructors of
cryptic crosswords.</p>
<p>The project scrapes several blogs and digital archives for cryptic crosswords.
Out of these collected web pages, the clues, answers, clue numbers, blogger’s
explanation and commentary, puzzle title and publication date are all parsed
and extracted into a tabular dataset. The result (as of September 2021) is <strong>a
little over half a million clues from cryptic crosswords over the past twelve
years</strong>, which makes for a rich and peculiar dataset.</p>
<p>Without further ado, please check out
<a href="https://cryptics.georgeho.org/"><code>cryptics.georgeho.org</code></a>!</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>If you’re new to cryptic crosswords, rejoice! A whole new world awaits you! The New Yorker has <a href="https://www.newyorker.com/puzzles-and-games-dept/cryptic-crossword/reintroducing-the-new-yorkers-cryptic-crossword">an excellent introduction to cryptic crosswords</a>, and Matt Gritzmacher has <a href="https://crosswordlinks.substack.com/">a daily newsletter with links to crosswords</a>. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>How Many Cryptic Crossword Grids Are There?https://www.georgeho.org/counting-cryptics/2021-05-17T00:00:00Z2021-05-17T00:00:00Z<p>Counting the number of valid American-style crossword grids is more or less a
solved problem. For example, see this <a href="https://fivethirtyeight.com/features/how-many-crossword-puzzles-can-you-make/">FiveThirtyEight
Riddler</a>
and <a href="https://twitter.com/Log3overLog2/status/1092472516571000839">Michael Kleber’s answer in a Twitter
thread</a>.</p>
<p>However, the same doesn’t seem to be true for British-style cryptic crosswords.
Hence this blog post!</p>
<p>Now, <em>counting</em> the number of valid grids is a different task from
<em>enumerating</em> them, and it’s a bad idea to do the former by doing the latter,
because the sheer number of grids can be prohibitively expensive to compute.
However, I’m mostly interested in grids smaller than 11×11, and I actually
<em>wanted</em> to see all possible grids, so I went ahead and did the inadvisable.</p>
<p>So let’s jump right in! If you’re just interested in the numbers and a list of
all valid grids, feel free to scroll to the very end.</p>
<h2 id="what-makes-a-valid-cryptic-grid">What Makes A Valid Cryptic Grid?</h2>
<ol>
<li>The grid must be <strong>rotationally symmetric</strong>.</li>
<li>The grid length (i.e. the length of one side of the grid) must be an <strong>odd
number</strong>.</li>
<li>All white squares must be <strong>connected</strong>: that is, there can be only one
contiguous island of white squares.</li>
<li>All words must have <strong>half their letters checked</strong>.
<ul>
<li>For words of odd length, there’s some ambiguity: depending on who you talk
to, either “half rounded up” or “half rounded up or down” must be checked.</li>
<li>For the purposes of this blog post, I required “half rounded up”.</li>
</ul>
</li>
<li>There <strong>cannot be more than two consecutive unchecked squares</strong>.</li>
<li>Two consecutive unchecked squares <strong>cannot occur at the start or end of a
word</strong>.
<ul>
<li>I haven’t found much explicit mention of this rule other than <a href="https://www.crosswordunclued.com/2009/09/crossword-grid-checking.html">this blog
post</a>
saying that it’s a “house rule” at <em>The Times</em> of London, but all cryptics
I’ve seen have hewed to this requirement, so I enforced it.</li>
</ul>
</li>
</ol>
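<p>Rules 1, 2 and 4 are mechanical enough to sketch in code. Here is a minimal, illustrative checker (my own sketch of how one might write it, not the code behind this post), taking a grid as a list of strings with <code>#</code> for black squares and <code>.</code> for white; rules 3, 5 and 6 are omitted for brevity:</p>

```python
from math import ceil

def runs(cells):
    """Yield (start, length) of maximal runs of white squares in a row/column string."""
    start = None
    for i, c in enumerate(cells + "#"):
        if c == "." and start is None:
            start = i
        elif c == "#" and start is not None:
            yield start, i - start
            start = None

def is_valid(grid):
    n = len(grid)
    # Rule 2: the grid length must be odd.
    if n % 2 == 0:
        return False
    # Rule 1: 180-degree rotational symmetry.
    if any(grid[i][j] != grid[n-1-i][n-1-j] for i in range(n) for j in range(n)):
        return False
    # Collect across and down words (runs of at least 2 white squares).
    across, down, words = set(), set(), []
    for r in range(n):
        for s, ln in runs(grid[r]):
            if ln >= 2:
                sq = [(r, c) for c in range(s, s + ln)]
                words.append(sq); across.update(sq)
    for c in range(n):
        col = "".join(grid[r][c] for r in range(n))
        for s, ln in runs(col):
            if ln >= 2:
                sq = [(r, c) for r in range(s, s + ln)]
                words.append(sq); down.update(sq)
    # Rule 4: every word needs at least half (rounded up) of its squares checked,
    # where a checked square lies in both an across and a down word.
    checked = across & down
    return all(sum(q in checked for q in w) >= ceil(len(w) / 2) for w in words)
```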
<p>It’s also worth noting that different constructors and publications have
different “house rules”. For example:</p>
<ul>
<li>Some publications have upper and/or lower limits on the number of clues. For
example, <a href="https://www.ft.com/crossword">the <em>Financial Times</em></a> seems to
always have exactly 32 clues.</li>
<li>Some constructors also restrict the size of black islands: for example, there
cannot be a contiguous black island of more than five squares.</li>
</ul>
<p>I didn’t enforce these rules, more out of laziness and lack of time than
technical infeasibility.</p>
<h2 id="generating-cryptic-grids">Generating Cryptic Grids</h2>
<p>Akshay Ravikumar has an <a href="https://akshayr.xyz/blog/articles/counting-crosswords">excellent blog
post</a> explaining how he
generated American crosswords, and if you’re interested in diving deeper I
highly recommend reading his exposition: my algorithm is more or less directly
lifted from his work, just adapted to cryptic crosswords.</p>
<p>Here’s the final algorithm that I used:</p>
<ol>
<li>Precompute all <code>valid_rows</code> and <code>symmetric_rows</code>: for a cryptic, these are
rows that don’t have words below the minimum word length.</li>
<li>Using <code>valid_rows</code> and <code>symmetric_rows</code>, find all sets of valid middle three
rows — for example, for an 11×11 grid, find all possible fifth, sixth and
seventh rows.
<ul>
<li>Note that the middle row must be symmetric, and the two adjacent rows must
be mirror images of each other.</li>
</ul>
</li>
<li>From the middle rows outward, build up a grid in a depth-first search.
<ul>
<li>Before adding a new row, make sure that it satisfies the checking
requirement: that it has the correct number of unchecked squares and has
at most two consecutive unchecked squares not at the start or end of
words.</li>
<li>There is also a trick we can use to limit the search space: if the
previous three rows have a column that is black-white-white, then in the
same column, the next row must be white.</li>
<li>This is best explained pictorially:
<img src="https://www.georgeho.org/assets/images/counting-cryptics-illustration.png" alt="An illustration of the black-white-white
trick"></li>
</ul>
</li>
<li>Check that the columns are valid. Specifically:
<ul>
<li>Check that the columns are <code>valid_rows</code> (this ensures that there are no
words below the minimum word length).</li>
<li>And also check that the columns don’t have two consecutive unchecked
squares at the start or end of the word.</li>
<li>Note that all other requirements (e.g. the number of checked squares) are
already taken care of while building up the grid.</li>
</ul>
</li>
<li>Check connectedness of the grid using a <a href="https://www.hackerearth.com/practice/algorithms/graphs/depth-first-search/tutorial/">depth-first
search</a>.</li>
</ol>
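<p>Step 5 is a standard flood fill over the white squares. As an illustrative sketch (not the notebook’s actual code), with grids again written as lists of strings using <code>#</code> for black and <code>.</code> for white:</p>

```python
def is_connected(grid):
    """Depth-first search over white squares (step 5 of the algorithm):
    True iff all white squares form one contiguous island."""
    n = len(grid)
    whites = {(r, c) for r in range(n) for c in range(n) if grid[r][c] == "."}
    if not whites:
        return False
    stack = [next(iter(whites))]  # start anywhere; DFS via an explicit stack
    seen = set()
    while stack:
        r, c = stack.pop()
        if (r, c) in seen:
            continue
        seen.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) in whites:
                stack.append((r + dr, c + dc))
    return seen == whites
```

For instance, the sparse 7×7 grid shown below this paragraph is connected (all its white squares reach the middle row), even though it fails the “interesting” criterion.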
<p>This algorithm works well and runs reasonably quickly (i.e. in less than a
minute) for 5×5 and 7×7 grids, but at 9×9 the search time becomes significant
(around half an hour on a MacBook Pro). Additionally, some valid grids aren’t
very interesting as “real” crosswords, such as the one below.</p>
<center><pre><code>
⬛⬛⬛⬛⬜⬜⬜
⬛⬛⬛⬛⬜⬛⬜
⬛⬛⬛⬛⬜⬛⬜
⬜⬜⬜⬜⬜⬜⬜
⬜⬛⬜⬛⬛⬛⬛
⬜⬛⬜⬛⬛⬛⬛
⬜⬜⬜⬛⬛⬛⬛
</code></pre></center>
<p>It’s not very interesting because of the sheer number of black squares (and
correspondingly low number of clues). So to winnow down the grids further, I
filtered <code>valid_rows</code> before starting the search: each row must have a minimum
number of white squares, namely 2 squares for 5×5 and 3 squares for 7×7 through 13×13.
Anecdotally, this reduces the computation time by a factor of three or four. I
call the grids produced in this reduced search <em>“interesting grids”,</em> as
opposed to <em>“valid grids”</em>.</p>
<p>I should note that there are definitely more ways to speed up the search: I
could’ve parallelized the search (i.e. assigning each worker a subset of the valid
middle rows), I could’ve written the program in a language faster than Python
(like Julia), and further algorithmic speedups are possible (e.g. checking
columns after adding each row would prune more grids earlier, instead of
deferring the column checks to after the grid is constructed).</p>
<p>At any rate, I just ran the program on my laptop, and stopped at 9×9 grids.
Results below!</p>
<h2 id="results">Results</h2>
<p>If you’ve just scrolled down here, the only thing you need to note is that an
<em>“interesting grid”</em> is one in which every row has at least a certain number of
white squares: 2 for 5×5 grids and 3 for 7×7 grids onwards.</p>
<p>For comparison, I’ve added the number of valid American grids, taken from
<a href="https://twitter.com/Log3overLog2/status/1092795679947264000">Michael Kleber’s corrected
Tweet</a>.</p>
<table>
<thead>
<tr>
<th style="text-align:center">Grid Size</th>
<th style="text-align:right">Valid Grids</th>
<th style="text-align:right">Interesting Grids</th>
<th style="text-align:right">American Grids</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">5×5</td>
<td style="text-align:right">17</td>
<td style="text-align:right">9</td>
<td style="text-align:right">12</td>
</tr>
<tr>
<td style="text-align:center">7×7</td>
<td style="text-align:right">346</td>
<td style="text-align:right">43</td>
<td style="text-align:right">312</td>
</tr>
<tr>
<td style="text-align:center">9×9</td>
<td style="text-align:right">9,381</td>
<td style="text-align:right">334</td>
<td style="text-align:right">31,187</td>
</tr>
<tr>
<td style="text-align:center">11×11</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">17,438,702</td>
</tr>
<tr>
<td style="text-align:center">13×13</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">40,575,832,476</td>
</tr>
<tr>
<td style="text-align:center">15×15</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">404,139,015,237,875</td>
</tr>
</tbody>
</table>
<p>There are 346 valid 7×7 cryptic grids — interestingly, slightly more than the
pleasing 6 × 52 = 312 valid American-style grids which inspired <a href="https://www.7xwords.com/why.html">Malaika
Handa’s 7xwords</a>; disappointingly, 346
factorizes into a not-at-all-auspicious 2 × 173.</p>
<p>For larger grid lengths, there appear to be far fewer valid cryptic grids than
American grids, probably owing to the more stringent conditions for cryptics.</p>
<p>It was infeasible for me to run my program for 11×11 grids onwards — either I
need to put a lot more effort into optimizing my program, or (more likely) it’s
simply computationally intractable to enumerate all possible grids, and we can
only count them. If I’m inspired to pick up this line of work again, I’ll be
sure to post a part two!</p>
<p>And finally, the code:</p>
<ul>
<li><a href="https://github.com/eigenfoo/counting-cryptics">Source code (Python Jupyter Notebook)</a></li>
<li>Valid grids
<ul>
<li><a href="https://raw.githubusercontent.com/eigenfoo/counting-cryptics/main/valid_5x5_grids.txt">5×5</a></li>
<li><a href="https://raw.githubusercontent.com/eigenfoo/counting-cryptics/main/valid_7x7_grids.txt">7×7</a></li>
<li><a href="https://raw.githubusercontent.com/eigenfoo/counting-cryptics/main/valid_9x9_grids.txt">9×9</a></li>
</ul>
</li>
<li>Interesting grids
<ul>
<li><a href="https://raw.githubusercontent.com/eigenfoo/counting-cryptics/main/interesting_5x5_grids.txt">5×5</a></li>
<li><a href="https://raw.githubusercontent.com/eigenfoo/counting-cryptics/main/interesting_7x7_grids.txt">7×7</a></li>
<li><a href="https://raw.githubusercontent.com/eigenfoo/counting-cryptics/main/interesting_9x9_grids.txt">9×9</a></li>
</ul>
</li>
</ul>Understanding NUTS and HMChttps://www.georgeho.org/understanding-nuts-hmc/2021-01-07T00:00:00Z2021-01-07T00:00:00Z<p><em>“Bayesian modeling is harder than deep learning”</em> is a sentiment I’ve been
hearing a lot lately. While I’m skeptical of sweeping statements like that, I
agree when it comes to the central inference algorithm — how MCMC samplers
work (especially the <em>de facto</em> standard samplers, NUTS and HMC) is one of the
most difficult concepts I’ve tried to learn, and is certainly harder than
autodifferentiation or backpropagation.</p>
<p>So I thought I’d share what worked for me when I tried to teach myself NUTS and
HMC. In chronological order of publication, these are the three resources that
I’d recommend reading to grok NUTS/HMC:</p>
<ol>
<li><a href="http://www.mcmchandbook.net/HandbookChapter5.pdf">Radford Neal’s chapter in the MCMC
handbook</a></li>
<li><a href="https://arxiv.org/abs/1111.4246">Matthew Hoffman’s <em>The No-U-Turn Sampler</em> (a.k.a. the original NUTS
paper)</a></li>
<li><a href="https://arxiv.org/abs/1701.02434">Michael Betancourt’s <em>Conceptual Introduction to Hamiltonian Monte
Carlo</em></a></li>
</ol>
<p>Not only did I find it useful to read these papers several times (as one would
read any sequence of “important” papers), but also to read them in both
chronological and reverse-chronological order. Reading both forwards and
backwards gave me multiple expositions of important ideas and also let me
mentally “diff” the papers to see the progression of ideas over time. For
example, Neal’s chapter was written before NUTS was discovered, which gives you
a sense of what the MCMC world looked like prior to Hoffman’s work: making
progress in fits and starts, but in need of a real leap forward.</p>
<p>In terms of reading code, I’d recommend looking through <a href="https://github.com/ColCarroll/minimc">Colin Carroll’s
<code>minimc</code></a> for a minimal working example
of NUTS in Python, written for pedagogy rather than actual sampling. For a
“real world” implementation of NUTS/HMC, I’d recommend looking through <a href="https://github.com/eigenfoo/littlemcmc">my
<code>littlemcmc</code></a> for a standalone version
of PyMC3’s NUTS/HMC samplers.</p>
<p>Finally, for anyone who wants to read around computational methods for Bayesian
inference more generally (i.e. not restricted to HMC, for example), I’d
(unashamedly) point to <a href="https://www.georgeho.org/bayesian-inference-reading/">my blog post on
this</a>.</p>What I Wish Someone Had Told Me About Tensor Computation Librarieshttps://www.georgeho.org/tensor-computation-libraries/2020-12-15T00:00:00Z2020-12-15T00:00:00Z<p>I get confused with tensor computation libraries (or computational graph libraries, or symbolic
algebra libraries, or whatever they’re marketing themselves as these days).</p>
<p>I was first introduced to PyTorch and TensorFlow and, having no other reference, thought they were
prototypical examples of tensor computation libraries. Then I learnt about Theano — an older and
less popular project, but different from PyTorch and TensorFlow and better in some meaningful ways.
This was followed by JAX, which seemed to be basically NumPy with more bells and whistles (although
I couldn’t articulate what exactly they were). Then came <a href="https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-is-dead-long-live-theano-d8005f8a0e9b">the announcement by the PyMC developers
that Theano would have a new JAX
backend</a>.</p>
<p>Anyways, this confusion prompted a lot of research and eventually, this blog post.</p>
<p>Similar to <a href="https://www.georgeho.org/prob-prog-frameworks/">my previous post on the anatomy of probabilistic programming
frameworks</a>, I’ll first discuss tensor computation
libraries in general — what they are and how they can differ from one another. Then I’ll discuss
some libraries in detail, and finally offer an observation on the future of Theano in the context of
contemporary tensor computation libraries.</p>
<div>
<h2>Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#dissecting-tensor-computation-libraries">Dissecting Tensor Computation Libraries</a>
<ul>
<li><a href="#tensor-computation-library-----maybe-not-the-best-name">“Tensor Computation Library” — Maybe Not The Best Name</a></li>
<li><a href="#some-differences-between-tensor-computation-libraries">(Some) Differences Between Tensor Computation Libraries</a></li>
</ul>
</li>
<li><a href="#a-zoo-of-tensor-computation-libraries">A Zoo of Tensor Computation Libraries</a>
<ul>
<li><a href="#pytorchhttpspytorchorg"><a href="https://pytorch.org/">PyTorch</a></a></li>
<li><a href="#jaxhttpsjaxreadthedocsioenlatest"><a href="https://jax.readthedocs.io/en/latest/">JAX</a></a></li>
<li><a href="#theanohttpstheano-pymcreadthedocsioenlatest"><a href="https://theano-pymc.readthedocs.io/en/latest/">Theano</a></a></li>
</ul>
</li>
<li><a href="#an-observation-on-static-graphs-and-theano">An Observation on Static Graphs and Theano</a></li>
<li><a href="#some-follow-ups-a-week-later">Some Follow-Ups, A Week Later</a></li>
</ul>
</nav>
</div>
<h2 id="dissecting-tensor-computation-libraries">Dissecting Tensor Computation Libraries</h2>
<p>First, a characterization: what do tensor computation libraries even do?</p>
<ol>
<li>They provide ways of specifying and building computational graphs,</li>
<li>They run the computation itself (duh), but also run “related” computations that either (a) <em>use
the computational graph</em>, or (b) operate <em>directly on the computational graph itself</em>,
<ul>
<li>The most salient example of the former is computing gradients via
<a href="https://arxiv.org/abs/1502.05767">autodifferentiation</a>,</li>
<li>A good example of the latter is optimizing the computation itself: think symbolic
simplifications (e.g. <code>xy/x = y</code>) or modifications for numerical stability (e.g. <a href="https://cs.stackexchange.com/q/68411"><code>log(1 + x)</code>
for small values of <code>x</code></a>).</li>
</ul>
</li>
<li>And they provide “best execution” for the computation: whether it’s changing the execution by JIT
(just-in-time) compiling it, by utilizing special hardware (GPUs/TPUs), by vectorizing the
computation, or in any other way.</li>
</ol>
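<p>The <code>log(1 + x)</code> example is worth seeing concretely. In double precision, <code>1 + x</code> rounds to exactly <code>1.0</code> for very small <code>x</code>, so a library that rewrites the expression to a fused <code>log1p</code> preserves precision:</p>

```python
import math

x = 1e-16
naive = math.log(1 + x)  # 1 + 1e-16 rounds to exactly 1.0, so this is 0.0
stable = math.log1p(x)   # log1p(x) is approximately x for small x: ~1e-16
print(naive, stable)
```

NumPy exposes the same pair as <code>np.log</code> and <code>np.log1p</code>; a graph-optimizing library like Theano can apply this rewrite to a user’s graph automatically.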
<h3 id="tensor-computation-library-----maybe-not-the-best-name">“Tensor Computation Library” — Maybe Not The Best Name</h3>
<p>As an aside: I realize that the name “tensor computation library” is too broad, and that the
characterization above precludes some libraries that might also justifiably be called “tensor
computation libraries”. Better names might be “graph computation library” (although that might get
mixed up with libraries like <a href="https://networkx.org/"><code>networkx</code></a>) or “computational graph management
library” or even “symbolic tensor algebra libraries”.</p>
<p>So for the avoidance of doubt, here is a list of libraries that this blog post is <em>not</em> about:</p>
<ul>
<li>NumPy and SciPy
<ul>
<li>These libraries don’t have a concept of a computational graph — they’re more like a toolbox of
functions, called from Python and executed in C or Fortran.</li>
<li>However, this might be a controversial distinction — as we’ll see later, JAX also doesn’t build
an explicit computational graph either, and I definitely want to include JAX as a “tensor
computation library”… ¯\_(ツ)_/¯</li>
</ul>
</li>
<li>Numba and Cython
<ul>
<li>These libraries provide best execution for code (and in fact some tensor computation libraries,
such as Theano, make good use of them), but like NumPy and SciPy, they do not actually manage the
computational graph itself.</li>
</ul>
</li>
<li>Keras, Trax, Flax and PyTorch-Lightning
<ul>
<li>These libraries are high-level wrappers around tensor computation libraries — they basically
provide abstractions and a user-facing API to utilize tensor computation libraries in a
friendlier way.</li>
</ul>
</li>
</ul>
<h3 id="some-differences-between-tensor-computation-libraries">(Some) Differences Between Tensor Computation Libraries</h3>
<p>Anyways, back to tensor computation libraries.</p>
<p>All three aforementioned goals are ambitious undertakings with sophisticated solutions, so it
shouldn’t be surprising to learn that decisions in pursuit of one goal can have implications for (or
even incur a trade-off with!) other goals. Here’s a list of common differences along all three axes:</p>
<ol>
<li>
<p>Tensor computation libraries can differ in how they represent the computational graph, and how it
is built.</p>
<ul>
<li>Static or dynamic graphs: do we first define the graph completely and then inject data to run
(a.k.a. define-and-run), or is the graph defined on-the-fly via the actual forward computation
(a.k.a. define-by-run)?
<ul>
<li>TensorFlow 1.x was (in)famous for its static graphs, which made users feel like they were
“working with their computational graph through a keyhole”, especially when <a href="https://news.ycombinator.com/item?id=13429355">compared to
PyTorch’s dynamic graphs</a>.</li>
</ul>
</li>
<li>Lazy or eager execution: do we evaluate variables as soon as they are defined, or only when a
dependent variable is evaluated? Usually, tensor computation libraries either choose to support
dynamic graphs with eager execution, or static graphs with lazy execution — for example,
<a href="https://www.tensorflow.org/guide/eager">TensorFlow 2.0 supports both modes</a>.</li>
<li>Interestingly, some tensor computation libraries (e.g. <a href="https://thinc.ai/">Thinc</a>) don’t even
construct an explicit computational graph: they represent it as <a href="https://thinc.ai/docs/concept">chained higher-order
functions</a>.</li>
</ul>
</li>
<li>
<p>Tensor computation libraries can also differ in what they want to use the computational graph
<em>for</em> — for example, are we aiming to do things that basically amount to running the
computational graph in a “different mode”, or are we aiming to modify the computational graph
itself?</p>
<ul>
<li>Almost all tensor computation libraries support autodifferentiation in some capacity (either
forward-mode, backward-mode, or both).</li>
<li>Obviously, how you represent the computational graph and what you want to use it for are very
related questions! For example, if you want to be able to represent arbitrary computation as a
graph, you’ll have to handle control flow like if-else statements or for-loops — this leads
to common gotchas with <a href="https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#%F0%9F%94%AA-Control-Flow">using Python for-loops in
JAX</a>
or needing to use <a href="https://discuss.pytorch.org/t/can-you-have-for-loops-in-the-forward-prop/68295"><code>torch.nn.ModuleList</code> in for-loops with
PyTorch</a>.</li>
<li>Some tensor computation libraries (e.g. <a href="https://github.com/Theano/Theano">Theano</a> and its
fork, <a href="https://theano-pymc.readthedocs.io/en/latest/index.html">Theano-PyMC</a>) aim to <a href="https://theano-pymc.readthedocs.io/en/latest/extending/optimization.html">optimize
the computational graph
itself</a>, for which an
<a href="#an-observation-on-static-graphs-and-theano">explicit graph is necessary</a>.</li>
</ul>
</li>
<li>
<p>Finally, tensor computation libraries can also differ in how they execute code.</p>
<ul>
<li>All tensor computation libraries run on CPU, but the strength of GPU and TPU support is a major
differentiator among tensor computation libraries.</li>
<li>Another differentiator is how tensor computation libraries compile code to be executed on
hardware. For example, do they use JIT compilation or not? Do they use “vanilla” C or CUDA
compilers, or <a href="https://tensorflow.google.cn/xla">the XLA compiler for machine-learning specific
code</a>?</li>
</ul>
</li>
</ol>
<h2 id="a-zoo-of-tensor-computation-libraries">A Zoo of Tensor Computation Libraries</h2>
<p>Having outlined the basic similarities and differences of tensor computation libraries, I think
it’ll be helpful to go through several of the popular libraries as examples. I’ve tried to link to
the relevant documentation where possible.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<h3 id="pytorchhttpspytorchorg"><a href="https://pytorch.org/">PyTorch</a></h3>
<ol>
<li>How is the computational graph represented and built?
<ul>
<li>PyTorch dynamically builds (and eagerly evaluates) an explicit computational graph. For more
detail on how this is done, check out <a href="https://pytorch.org/docs/stable/notes/autograd.html">the PyTorch docs on autograd
mechanics</a>.</li>
<li>For more on how PyTorch builds computational graphs, see <a href="https://jdhao.github.io/2017/11/12/pytorch-computation-graph/"><code>jdhao</code>’s introductory blog post on
computational graphs in
PyTorch</a>.</li>
</ul>
</li>
<li>What is the computational graph used for?
<ul>
<li>To quote the <a href="https://pytorch.org/docs/stable/index.html">PyTorch docs</a>, “PyTorch is an
optimized tensor library for deep learning using GPUs and CPUs” — as such, the main focus is
on <a href="https://pytorch.org/docs/stable/notes/autograd.html">autodifferentiation</a>.</li>
</ul>
</li>
<li>How does the library ensure “best execution” for computation?
<ul>
<li>PyTorch has <a href="https://pytorch.org/docs/stable/notes/cuda.html">native GPU support</a> via CUDA.</li>
<li>PyTorch also has support for TPU through projects like
<a href="https://github.com/pytorch/xla">PyTorch/XLA</a> and
<a href="https://www.pytorchlightning.ai/">PyTorch-Lightning</a>.</li>
</ul>
</li>
</ol>
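<p>To make “dynamically builds an explicit computational graph” concrete, here is a toy define-by-run autodiff variable in plain Python. This is a sketch of the idea, not PyTorch’s actual internals: the graph is recorded as the arithmetic executes, and <code>backward()</code> then walks it in reverse:</p>

```python
class Var:
    """A toy define-by-run autodiff variable, PyTorch-style."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (parent_var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, upstream=1.0):
        # Accumulate gradients back through the recorded graph.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x, y = Var(2.0), Var(3.0)
z = x * y + x          # the graph is built as this line runs
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```

In real PyTorch this is <code>z = x * y + x; z.backward(); x.grad</code>, with the recording done by the C++ autograd engine. (Note that this naive recursion revisits shared subgraphs; real engines traverse the graph in reverse topological order instead.)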
<h3 id="jaxhttpsjaxreadthedocsioenlatest"><a href="https://jax.readthedocs.io/en/latest/">JAX</a></h3>
<ol>
<li>
<p>How is the computational graph represented and built?</p>
<ul>
<li>
<p>Instead of building an explicit computational graph to compute gradients, JAX simply supplies a
<code>grad()</code> that returns the gradient function of any supplied function. As such, there is
technically no concept of a computational graph — only pure (i.e. stateless and
side-effect-free) functions and their gradients.</p>
</li>
<li>
<p><a href="https://sjmielke.com/jax-purify.htm">Sabrina Mielke summarizes the situation very well</a>:</p>
<blockquote>
<p>PyTorch builds up a graph as you compute the forward pass, and one call to <code>backward()</code> on
some “result” node then augments each intermediate node in the graph with the gradient of the
result node with respect to that intermediate node. JAX on the other hand makes you express
your computation as a Python function, and by transforming it with <code>grad()</code> gives you a
gradient function that you can evaluate like your computation function — but instead of the
output it gives you the gradient of the output with respect to (by default) the first
parameter that your function took as input.</p>
</blockquote>
</li>
</ul>
</li>
<li>
<p>What is the computational graph used for?</p>
<ul>
<li>According to the <a href="https://jax.readthedocs.io/en/latest/notebooks/quickstart.html">JAX quickstart</a>,
JAX bills itself as “NumPy on the CPU, GPU, and TPU, with great automatic differentiation for
high-performance machine learning research”. Hence, its focus is heavily on
autodifferentiation.</li>
</ul>
</li>
<li>
<p>How does the library ensure “best execution” for computation?</p>
<ul>
<li>
<p>This is best explained by quoting the <a href="https://jax.readthedocs.io/en/latest/notebooks/quickstart.html">JAX quickstart</a>:</p>
<blockquote>
<p>JAX uses XLA to compile and run your NumPy code on […] GPUs and TPUs. Compilation happens
under the hood by default, with library calls getting just-in-time compiled and executed. But
JAX even lets you just-in-time compile your own Python functions into XLA-optimized kernels
[…] Compilation and automatic differentiation can be composed arbitrarily […]</p>
</blockquote>
</li>
<li>
<p>For more detail on JAX’s four-function API (<code>grad</code>, <code>jit</code>, <code>vmap</code> and <code>pmap</code>), see
<a href="http://alexminnaar.com/2020/08/15/jax-overview.html">Alex Minaar’s overview of how JAX works</a>.</p>
</li>
</ul>
</li>
</ol>
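<p>The shape of the JAX API (functions in, functions out) is easy to mimic. The toy <code>grad()</code> below uses central differences rather than the tracing machinery JAX actually uses (in real JAX you would write <code>jax.grad(f)</code>), but the calling convention is the same:</p>

```python
def grad(f, eps=1e-6):
    """Toy stand-in for jax.grad(): takes a pure function and returns
    its derivative function (via central differences here, not JAX's
    actual tracing/XLA machinery)."""
    def df(x):
        return (f(x + eps) - f(x - eps)) / (2 * eps)
    return df

square = lambda x: x * x
dsquare = grad(square)  # a new function, evaluated just like square
print(dsquare(3.0))     # ~6.0
```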
<h3 id="theanohttpstheano-pymcreadthedocsioenlatest"><a href="https://theano-pymc.readthedocs.io/en/latest/">Theano</a></h3>
<blockquote>
<p><strong>Note:</strong> the <a href="https://github.com/Theano/Theano">original Theano</a> (maintained by
<a href="https://mila.quebec/en/">MILA</a>) has been discontinued, and the PyMC developers have forked the
project: <a href="https://github.com/pymc-devs/Theano-PyMC">Theano-PyMC</a> (soon to be renamed Aesara). I’ll
discuss both the original and forked projects below.</p>
</blockquote>
<ol>
<li>How is the computational graph represented and built?
<ul>
<li>Theano statically builds (and lazily evaluates) an explicit computational graph.</li>
</ul>
</li>
<li>What is the computational graph used for?
<ul>
<li>Theano is unique among tensor computation libraries in that it places more emphasis on
reasoning about the computational graph itself. In other words, while Theano has <a href="https://theano-pymc.readthedocs.io/en/latest/library/gradient.html">strong
support for
autodifferentiation</a>,
running the computation and computing gradients isn’t the be-all and end-all: Theano has an
entire module for <a href="https://theano-pymc.readthedocs.io/en/latest/optimizations.html">optimizing the computational graph
itself</a>, and makes it fairly
straightforward to compile the Theano graph to different computational backends (by default,
Theano compiles to C or CUDA, but it’s straightforward to compile to JAX).</li>
<li>Theano is often remembered as a library for deep learning research, but it’s so much more than
that!</li>
</ul>
</li>
<li>How does the library ensure “best execution” for computation?
<ul>
<li>The original Theano used the GCC C compiler for CPU computation, and the NVCC CUDA compiler for
GPU computation.</li>
<li>The Theano-PyMC fork project <a href="https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-is-dead-long-live-theano-d8005f8a0e9b">will use JAX as a
backend</a>,
which can utilize CPUs, GPUs and TPUs as available.</li>
</ul>
</li>
</ol>
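<p>The define-and-run idea, and why an explicit graph enables optimization, can be sketched in a few lines of plain Python. This mimics Theano’s approach, not its API: the graph is built with no data in it, rewritten as an object, and only then evaluated:</p>

```python
class Node:
    """A toy define-and-run expression graph (a sketch of Theano's
    approach, not its API): nodes hold no data until evaluation."""
    def __init__(self, op, inputs=(), payload=None):
        self.op, self.inputs, self.payload = op, tuple(inputs), payload

    def __add__(self, other):
        return Node("add", [self, other])

    def __mul__(self, other):
        return Node("mul", [self, other])

def var(name):
    return Node("input", payload=name)

def const(value):
    return Node("const", payload=value)

def simplify(node):
    """Graph rewrite x * 1 -> x: possible only because the whole graph
    exists as an object before any data flows through it."""
    inputs = [simplify(i) for i in node.inputs]
    if node.op == "mul":
        a, b = inputs
        for u, v in ((a, b), (b, a)):
            if v.op == "const" and v.payload == 1:
                return u
    return Node(node.op, inputs, node.payload)

def evaluate(node, env):
    """Lazy evaluation: data is injected only at this point."""
    if node.op == "input":
        return env[node.payload]
    if node.op == "const":
        return node.payload
    left, right = (evaluate(i, env) for i in node.inputs)
    return left + right if node.op == "add" else left * right

x, y = var("x"), var("y")
z = x * const(1) + y          # nothing is computed here
z = simplify(z)               # rewrite the graph itself: x*1 -> x
print(evaluate(z, {"x": 2.0, "y": 3.0}))  # 5.0
```

Theano’s graph rewrites are of course far more sophisticated (symbolic simplification, numerical stabilization, backend-specific substitutions), but they all rely on this same property: the graph is a first-class object that can be inspected and transformed before execution.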
<h2 id="an-observation-on-static-graphs-and-theano">An Observation on Static Graphs and Theano</h2>
<p>Finally, a quick observation on static graphs and the niche that Theano fills that other tensor
computation libraries do not. I had huge help from <a href="https://twiecki.io/">Thomas Wiecki</a> and
<a href="https://brandonwillard.github.io/">Brandon Willard</a> with this section.</p>
<p>There’s been a consistent movement in most tensor computation libraries away from static graphs (or
more precisely, statically <em>built</em> graphs): PyTorch and TensorFlow 2 both support dynamically
generated graphs by default, and JAX forgoes an explicit computational graph entirely.</p>
<p>This movement is understandable — building the computational graph dynamically matches people’s
programming intuition much better. When I write <code>z = x + y</code>, I don’t mean <em>“I want to register a sum
operation with two inputs, which is waiting for data to be injected”</em> — I mean <em>“I want to compute
the sum of <code>x</code> and <code>y</code>”.</em> The extra layer of indirection is not helpful to most users, who just want
to run their tensor computation at some reasonable speed.</p>
<p>So let me speak in defence of statically built graphs.</p>
<p>Having an explicit representation of the computational graph is immensely useful for certain things,
even if it makes the graph harder to work with. You can modify the graph (e.g. graph optimizations,
simplifications and rewriting), and you can reason about and analyze the graph. Having the
computation as an actual <em>object</em> helps immeasurably for tasks where you need to think about the
computation itself, instead of just blindly running it.</p>
<p>On the other hand, with dynamically generated graphs, the computational graph is never actually
defined anywhere: the computation is traced out on the fly and behind the scene. You can no longer
do anything interesting with the computational graph: for example, if the computation is slow, you
can’t reason about <em>what</em> parts of the graph are slow. The end result is that you basically have to
hope that the framework internals are doing the right things, which they might not!</p>
<p>This is the niche that Theano (or rather, Theano-PyMC/Aesara) fills that other contemporary tensor
computation libraries do not: the promise is that if you take the time to specify your computation
up front and all at once, Theano can optimize the living daylights out of your computation — whether
by graph manipulation, efficient compilation or something else entirely — and that this is something
you would only need to do once.</p>
<hr>
<h2 id="some-follow-ups-a-week-later">Some Follow-Ups, A Week Later</h2>
<p><em>2020-12-22</em></p>
<p>The blog post trended <a href="https://news.ycombinator.com/item?id=25435028">on Hacker
News</a> and got some discussion.
It’s stupefying how the most upvoted comments are either unrelated or
self-promotional, but I suppose that’s to be expected with the Internet.</p>
<p>However, one nugget of gold in the junk pit is <a href="https://news.ycombinator.com/item?id=25436656">this comment by Albert
Zeyer</a> and the <a href="https://news.ycombinator.com/item?id=25439483">response by the
PyMC developer spearheading the Aesara project, Brandon
Willard</a>. I had two takeaways
from this exchange:</p>
<ol>
<li>Theano is messy, either in a code hygiene sense, or in an API design sense.
<ul>
<li>For example, the graph optimization/rewriting process can require entire
graphs to be copied at multiple points along the way. This obliterates
performance and was almost entirely due to some design oddities.</li>
</ul>
</li>
<li>The JAX backend arose as a proof-of-concept of how extensible Theano is,
both in terms of “hackability” and how much mileage we can get out of the
design choices behind Theano (e.g. static graphs). The JAX backend isn’t the
focus of the fork, but it’s easily the difference that will stand out most
at the user level. The focus of Aesara is <em>resolving the design
shortcomings of Theano</em>.</li>
</ol>
<p>On the one hand, I’m glad that I finally understand the <em>real</em> focus of the
Aesara fork — I feel like I have a <em>much</em> greater appreciation of what Aesara
really is, and its place in the ecosystem of tensor computation libraries.</p>
<p>On the other hand, I’m discomfited by the implication that meaningful
contributions to Aesara must involve deep expertise in computational graphs and
graph optimizations - neither of which I have (and both of which I suspect are
rare even in the open source community). Moreover, meaningful contributions
to Aesara will probably require deep familiarity with Theano’s design and its
shortcomings. This shouldn’t discourage me (or anyone else!) from contributing
to Aesara, but it’s good to acknowledge the bottomless pit of technical
expertise that goes on behind the user-facing Bayesian modelling.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Some readers will notice the conspicuous lack of TensorFlow from this list - its exclusion isn’t out of malice, merely a lack of time and effort to do the necessary research to do it justice. Sorry. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Serif Fonts for Codinghttps://www.georgeho.org/fonts-for-coding/2020-11-26T00:00:00Z2020-11-26T00:00:00Z<p>Sometimes I get sniped by <a href="https://news.ycombinator.com/item?id=25159038">Hacker News
posts</a>, and this one plunged me
down a rabbit hole for coding fonts.</p>
<p>Many coding fonts are lightly stressed, monospaced sans serifs: in other words, each glyph takes
the same width, and each glyph looks like a stick figure, with a nearly constant stroke
width (variation in stroke width is known as <a href="https://designshack.net/articles/typography/is-my-type-stressed-a-primer-on-stressed-typography/"><em>stress</em></a>) throughout the glyph.</p>
<p>But as <a href="https://news.ycombinator.com/item?id=25167704">the Internet stranger
<code>uncanneyvalley</code></a> pointed out,
there’s decent overlap between “fonts good for coding” and “fonts good for
dyslexia”: being able to easily distinguish between visually-similar and
repeated characters.</p>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Source+Code+Pro&display=swap" rel="stylesheet">
<style>
code.slab-serif {
font-family: "Source Code Pro", monospace;
}
</style>
<p><strong>Proportional (<a href="https://mbtype.com/fonts/equity/">Equity</a>):</strong></p>
<center>([{}]) l1i!|I OQo08 rumn ecoa pqdb -- __ == !! "' :; ,.</center>
<p><strong>Monospaced Sans Serif (<a href="https://fonts.google.com/specimen/Source+Code+Pro">Source Code Pro</a>):</strong></p>
<center><code class="slab-serif">([{}]) l1i!|I OQo08 rumn ecoa pqdb -- __ == !! "' :; ,.</code></center>
<p><strong>Monospaced Serif (<a href="https://mbtype.com/fonts/triplicate/">Triplicate</a>):</strong></p>
<center><code>([{}]) l1i!|I OQo08 rumn ecoa pqdb -- __ == !! "' :; ,.</code></center>
<p>I think that stressed monospaced serif fonts (i.e. monospaced fonts with serifs that are curved
instead of slab-like, and that are visually thinner than the rest of the glyph)
are generally much better for coding than most default coding typefaces. It turns out
there are very few such fonts: I’ve had to scour the Internet for them, but you
can have the fruits of my labor for free!</p>
<ul>
<li>Libertinus Mono (<a href="https://fontlibrary.org/en/font/libertinus-mono">Font
Library</a>,
<a href="https://github.com/alerque/libertinus">GitHub</a>)</li>
<li>Linux Libertine Mono
(<a href="https://en.wikipedia.org/wiki/Linux_Libertine">Wikipedia</a>, <a href="https://www.fontsquirrel.com/fonts/linux-libertine">Font
Squirrel</a>, <a href="https://fontlibrary.org/en/font/linux-libertine">Font
Library</a>)</li>
<li>SimSun
(<a href="https://docs.microsoft.com/en-us/typography/font-list/simsun">Microsoft</a>,
<a href="https://www.dafontfree.io/simsun-font/">Dafont Free</a>)</li>
<li>Sun Gallant Demi
<ul>
<li>I can’t find any sources for it beyond <a href="https://unix.stackexchange.com/q/307356">this Unix StackExchange
post</a>. Maybe if you have a Sun
computer? <code>¯\_(ツ)_/¯</code></li>
</ul>
</li>
<li>Triplicate (<a href="https://mbtype.com/fonts/triplicate/">MB Type</a>)</li>
<li>Xanh Mono (<a href="https://fonts.google.com/specimen/Xanh+Mono">Google Fonts</a>,
<a href="https://github.com/yellow-type-foundry/xanhmono">GitHub</a>)</li>
</ul>`littlemcmc` — A Standalone HMC and NUTS Sampler in Pythonhttps://www.georgeho.org/littlemcmc/2020-10-06T00:00:00Z2020-10-06T00:00:00Z<center>
<img
src="https://raw.githubusercontent.com/eigenfoo/littlemcmc/master/docs/_static/logo/default-cropped.png"
alt="LittleMCMC logo">
</center>
<p>Recently there has been a modularization (or, if you’re hip with tech-lingo, an
<a href="https://techcrunch.com/2015/04/18/the-unbundling-of-everything/"><em>unbundling</em></a>)
of Bayesian modelling libraries. Whereas before, probability distributions,
model specification, inference and diagnostics were more or less rolled into one
library, it’s becoming more and more realistic to specify a model in one
library, accelerate it using another, perform inference with a third and use a
fourth to visualize the results. (For example, Junpeng Lao has recently had
<a href="https://twitter.com/junpenglao/status/1309470970223226882">good success</a> doing
exactly this!)</p>
<p>It’s in this spirit of unbundling that the PyMC developers wanted to <a href="https://discourse.pymc.io/t/isolate-nuts-into-a-new-library/3974">spin out
the core HMC and NUTS samplers from PyMC3 into a separate
library</a>.
PyMC3 has a very well-tested and performant Python implementation of HMC and
NUTS, which would be very useful to any users who have their own functions for
computing log-probability and its gradients, and who want to use a lightweight
and reliable sampler.</p>
<p>So for example, if you’re a physical scientist with a Bayesian model who’s
written your own functions to compute the log probability and its gradients
(perhaps for performance or interoperability reasons), and need a good MCMC
sampler, then <code>littlemcmc</code> is for you! As long as you can call your functions
from Python, you can use the same HMC or NUTS sampler that’s used by the rest of
the PyMC3 community.</p>
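<p>To sketch what such a function pair might look like, here is a toy example with a standard normal as a stand-in for your own model (<code>littlemcmc</code>’s exact calling convention is documented in the links below):</p>

```python
import math

def logp_dlogp_func(x):
    """Log-probability of a standard normal, and its gradient.

    This is the kind of user-supplied callable described above: in a
    real application it might wrap hand-tuned C or Fortran code, so
    long as it is callable from Python.
    """
    logp = -0.5 * (x ** 2 + math.log(2.0 * math.pi))
    dlogp = -x
    return logp, dlogp
```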
<p>So without further ado: please check out <code>littlemcmc</code>!</p>
<ul>
<li><a href="https://github.com/eigenfoo/littlemcmc">GitHub</a></li>
<li><a href="https://littlemcmc.readthedocs.io/en/latest/">Read the Docs</a></li>
</ul>Pollen and Digital Publishing (a.k.a. _The Book is a Program_)https://www.georgeho.org/pollen-digital-publishing/2020-09-10T00:00:00Z2020-09-10T00:00:00Z<p>I’ve picked up a new hobby (or perhaps just another fleeting fascination) —
digital publishing. The catalyst was the book <a href="https://practicaltypography.com/"><em>Practical Typography</em> by
Matthew Butterick</a>. There were so many
interesting things about it: it is gorgeous, it expounds well-argued (if
slightly controversial) views on how the average writer should think about
typography, it has a little widget that would change the book’s typeface to
showcase Butterick’s fonts for sale, it is published online but — and
Butterick makes a big point of this — is not free.</p>
<p>Most interesting to me, however, was how the book was written and published
with a tool written specifically for the book —
<a href="https://docs.racket-lang.org/pollen/">Pollen</a>. A good explanatory analogy (at
least for those in the data science and engineering world) is that it’s like R
Markdown (in that it’s a markup language that allows arbitrary R code to be
embedded in it), but instead of R, it’s Racket, and instead of Markdown, it’s
your own domain-specific markup language that you build with Racket.</p>
<p>After playing around with Pollen for a bit, I think I’m sold. Two big reasons:</p>
<ol>
<li>Write your own markup
<ul>
<li>You can write your own “HTML tags” — so for example, if you’re writing a
technical document and want to emphasize certain jargon upon first
mention, you can write a <code>firstmention</code> tag, and have it italicize the
tagged text and append it to a glossary with a link to its first mention
in your document. The cool thing is that tags are just functions in
Racket, which allow you to transform the input text arbitrarily.</li>
<li>As you can imagine, the ability to write your own markup really lets you
tailor it to the content at hand.</li>
</ul>
</li>
<li>Multi-format publishing
<ul>
<li>This lets you write in one input format, and output to multiple formats -
so once I make changes to the source files, I can immediately have an
HTML, LaTeX, PDF, and plain text format of my writing.</li>
</ul>
</li>
</ol>
<p><em>But what about Markdown or LaTeX or ReStructured Text or —</em> none of them
give you the flexibility or extensibility that Pollen does. In the case of Markdown
or ReStructured Text, you just get a subset of HTML features in a way that
looks more palatable to the average developer. If this suffices for your
publishing needs, that’s great - but if it doesn’t, you’re left in a tough
place. LaTeX - as Butterick readily admits - did a lot of things right, but at
the end of the day it’s just another format that Pollen can target. (I think
Pollen was named in the spirit of LaTeX by the way - in the sense that people
are commonly allergic to both of them.)</p>
<p>Now here’s the “downside” - Pollen is written in
<a href="https://racket-lang.org">Racket</a> (which is a dialect of Lisp), and any
non-trivial applications will probably involve you learning a bit of Racket.
I’d say that’s a good thing, if only for the self-education.</p>
<p>Here’s a very simple example to convince you (if you want a longer-form answer,
I’d recommend Butterick’s <a href="https://beautifulracket.com/appendix/why-racket-why-lisp.html"><em>Why Racket? Why
Lisp?</em></a>).</p>
<p>Most languages represent HTML as a string (which conceals the semantics of HTML
tags), or as a tree (which conceals the sequential nature of the HTML). Neither
option is great. Lisps, however, could represent a snippet of HTML as follows:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-racket" data-lang="racket"><span style="display:flex;"><span>'(span ((class "author")(id "primary")(living "true")) "Prof. Leonard")
</span></span></code></pre></div><p>Keeping in mind that <code>(f x y)</code> is Lisp’s way of saying <code>f(x, y)</code>, we see
that Lisps cleanly model HTML as <em>nested function application</em>, which really
blows open the door to opportunities in marking up your text.</p>
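<p>The same idea ports to other languages, just less ergonomically. Here is a rough Python sketch (a hypothetical helper, nothing to do with Pollen itself) of tags as ordinary functions over nested data:</p>

```python
def tag(name, attrs, *children):
    """Render a tag: attributes are (key, value) pairs, children are
    already-rendered strings. A tag is just a function, so arbitrary
    logic can run while marking up text."""
    attr_str = "".join(f' {key}="{value}"' for key, value in attrs)
    return f"<{name}{attr_str}>{''.join(children)}</{name}>"

html = tag("span", [("class", "author"), ("id", "primary"), ("living", "true")],
           "Prof. Leonard")
# '<span class="author" id="primary" living="true">Prof. Leonard</span>'
```

<p>In a Lisp, the quoted data and the function call share one notation, which is exactly the ergonomic advantage Butterick is pointing at.</p>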
<p>At any rate, that’s probably enough said about Pollen. Let me show you what I
managed to put together with it in one or two spare weekends —
<a href="https://cooper.georgeho.org/"><code>cooper.georgeho.org</code></a>. I was hunting around for
dummy text that I could use to play around with — Lorem Ipsum seemed trite,
and the U.S. Constitution seemed overdone, so I reached for some historical
documents of my alma mater. Hope you like it!</p>Floating-Point Formats and Deep Learninghttps://www.georgeho.org/floating-point-deep-learning/2020-07-26T00:00:00Z2020-07-26T00:00:00Z<p>Floating-point formats are not the most glamorous or (frankly) the most important
consideration when working with deep learning models: if your model isn’t working well,
then your floating-point format certainly isn’t going to save you! However, past a
certain point of model complexity/model size/training time, your choice of
floating-point format can have a significant impact on your model training times and
even performance.</p>
<p>Here’s how the rest of this post is structured:</p>
<ol>
<li><a href="#floating-point-in-_my_-deep-learning">Why should you, a deep learning practitioner,
care</a> about what floating-point format your
model uses?</li>
<li><a href="#floating-point-formats">What even <em>is</em> floating-point</a>, especially these new
floating-point formats made specifically for deep learning?</li>
<li><a href="#advice-for-practitioners">What practical advice is there</a> on using floating-point
formats for deep learning?</li>
</ol>
<h2 id="floating-point-in-_my_-deep-learning">Floating-Point? In <em>My</em> Deep Learning?</h2>
<p><a href="https://knowyourmeme.com/photos/6052-its-more-likely-than-you-think">It’s more likely than you
think!</a></p>
<p>It’s been known for quite some time that <a href="https://arxiv.org/abs/1502.02551">deep neural networks can
tolerate</a> <a href="https://arxiv.org/abs/1412.7024">lower numerical
precision</a>. High-precision calculations turn out not
to be that useful in training or inferencing neural networks: the additional precision
confers no benefit while being slower and less memory-efficient.</p>
<p>Surprisingly, some models can even reach a higher accuracy with lower precision, which
recent research attributes to the <a href="https://arxiv.org/abs/1809.00095">regularization effects from the lower
precision</a>.</p>
<p>Finally (and this is speculation on my part — I haven’t seen any experiments or papers
corroborating this), it’s possible that certain complicated models <em>cannot converge</em>
unless you use an appropriately precise format. There’s a drift between the analytical
gradient update and what the actual backward propagation looks like: the lower the
precision, the bigger the drift. I’d expect deep learning to be particularly
susceptible to this issue because there are a lot of multiplications, divisions and
reduction operations.</p>
<h2 id="floating-point-formats">Floating-Point Formats</h2>
<p>Let’s take a quick look at three floating-point formats for deep learning. There are a
lot more floating-point formats, but only a few have gained traction: floating-point
formats require the appropriate hardware and firmware support, which restricts the
introduction and adoption of new formats.</p>
<p>For a quick overview, Grigory Sapunov wrote a great <a href="https://medium.com/@moocaholic/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407">run-down of various floating-point
formats for deep
learning</a>.</p>
<h3 id="ieee-floating-point-formats">IEEE floating-point formats</h3>
<p>These floating-point formats are probably what most people think of when someone says
“floating-point”. The IEEE standard 754 sets out several formats, but for the purposes
of deep learning we are only interested in three:
<a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">FP16</a>,
<a href="https://en.wikipedia.org/wiki/Single-precision_floating-point_format">FP32</a> and
<a href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format">FP64</a> (a.k.a.
half-, single- and double-precision floating-point formats)<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>Let’s take FP32 as an example. Each FP32 number is a sequence of 32 bits,
$b_{31} b_{30} … b_{0}$. Altogether, this sequence represents the real number</p>
<p>$$ (-1)^{b_{31}} \cdot 2^{(b_{30} b_{29} … b_{23})_2 - 127} \cdot (1.b_{22} b_{21} … b_{0})_2 $$</p>
<p>Here, $b_{31}$ (the <em>sign bit</em>) determines the sign of the represented value.</p>
<p>$b_{30}$ through $b_{23}$ determine the magnitude or scale of the represented value
(notice that a change in any of these bits drastically changes the size of the
represented value). These bits are called the <em>exponent</em> or <em>scale bits</em>.</p>
<p>Finally, $b_{22}$ through $b_{0}$ determine the precise value of the represented
value. These bits are called the <em>mantissa</em> or <em>precision bits</em>.</p>
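<p>This bit-level anatomy is easy to inspect from Python’s standard library (a quick sketch using <code>struct</code>; subnormals, infinities and NaNs are ignored):</p>

```python
import struct

def fp32_fields(x):
    """Split a float's IEEE 754 single-precision representation into
    its sign, exponent and mantissa fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent (scale) bits
    mantissa = bits & 0x7FFFFF        # 23 mantissa (precision) bits
    return sign, exponent, mantissa

def fp32_value(sign, exponent, mantissa):
    # Reassemble the represented value from the formula above.
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

print(fp32_fields(6.25))   # (0, 129, 4718592), i.e. 2**2 * 1.5625
```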
<p>Obviously, the more bits you have, the more you can do. Here’s how the three formats
break down:</p>
<table>
<thead>
<tr>
<th style="text-align:left"></th>
<th style="text-align:right">Sign Bits</th>
<th style="text-align:right">Exponent (Scale) Bits</th>
<th style="text-align:right">Mantissa (Precision) Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">FP16</td>
<td style="text-align:right">1</td>
<td style="text-align:right">5</td>
<td style="text-align:right">10</td>
</tr>
<tr>
<td style="text-align:left">FP32</td>
<td style="text-align:right">1</td>
<td style="text-align:right">8</td>
<td style="text-align:right">23</td>
</tr>
<tr>
<td style="text-align:left">FP64</td>
<td style="text-align:right">1</td>
<td style="text-align:right">11</td>
<td style="text-align:right">53</td>
</tr>
</tbody>
</table>
<p>There are some details that I’m leaving out here (e.g. how to represent NaNs, positive
and negative infinities), but this is largely how floating point numbers work. A lot
more detail can be found on the <a href="https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers">Wikipedia
page</a>
and of course the <a href="https://ieeexplore.ieee.org/document/8766229">latest revision of the IEEE standard
754</a> itself.</p>
<p>FP32 and FP64 are widely supported by both software (C/C++, PyTorch, TensorFlow) and
hardware (x86 CPUs and most NVIDIA/AMD GPUs).</p>
<p>FP16, on the other hand, is not as widely supported in software (you need to use <a href="http://half.sourceforge.net/">a
special library</a> to use them in C/C++). However, since
deep learning is trending towards favoring FP16 over FP32, it has found support in the
main deep learning frameworks (e.g. <code>tf.float16</code> and <code>torch.float16</code>). In terms of
hardware, FP16 is not supported in x86 CPUs as a distinct type, but is well-supported on
modern GPUs.</p>
<h3 id="google-bfloat16">Google BFloat16</h3>
<p>BFloat16 (a.k.a. the Brain Floating-Point Format, after Google Brain) is basically the
same as FP16, but 3 mantissa bits become exponent bits (i.e. bfloat16 trades 3 bits'
worth of precision for scale).</p>
<figure class="align-center">
<img style="float: middle" src="https://www.georgeho.org/assets/images/bfloat16.png" alt="Diagram illustrating the number and type of bits in bfloat16.">
<figcaption>The number and type of bits in bfloat16. Source: <a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus">Google Cloud blog</a>.</figcaption>
</figure>
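<p>Since bfloat16’s bit layout is simply the top half of FP32’s, the conversion can be sketched by masking off the 16 low bits (note that this naive mask rounds toward zero, whereas real hardware typically rounds to nearest):</p>

```python
import struct

def truncate_to_bfloat16(x):
    """Keep only the top 16 bits of the FP32 representation: the sign
    bit, all 8 exponent bits and the 7 highest mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    (truncated,) = struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))
    return truncated

print(truncate_to_bfloat16(3.14159265))   # 3.140625: only ~3 decimal digits survive
```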
<p>When it comes to deep learning, there are generally three “flavors” of values: weights,
activations and gradients. Google suggests storing weights and gradients in FP32, and
storing activations in bfloat16. However, in particularly favorable circumstances,
weights can be stored in bfloat16 without a significant performance degradation.</p>
<p>You can read a lot more on the <a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus">Google Cloud
blog</a>,
and <a href="https://arxiv.org/abs/1905.12322">this paper by Intel and Facebook studying the bfloat16
format</a>.</p>
<p>In terms of software support, bfloat16 is not supported in C/C++, but is supported in
TensorFlow (<a href="https://www.tensorflow.org/api_docs/python/tf#bfloat16"><code>tf.bfloat16</code></a>) and
PyTorch (<a href="https://pytorch.org/docs/stable/tensors.html"><code>torch.bfloat16</code></a>).</p>
<p>In terms of hardware support, it is supported by <a href="https://en.wikipedia.org/wiki/Cooper_Lake_(microarchitecture)">some modern
CPUs</a>, but the real
support comes out in GPUs and ASICs. At the time of writing, bfloat16 is supported by
the NVIDIA A100 (the first GPU to support it!), and <a href="https://www.techpowerup.com/260344/future-amd-gpu-architecture-to-implement-bfloat16-hardware">will be supported in future AMD
GPUs</a>.
And of course, it is supported by Google TPU v2/v3.</p>
<h3 id="nvidia-tensorfloat">NVIDIA TensorFloat</h3>
<p>Strictly speaking, this isn’t really its own floating-point format, just an overzealous
branding of the technique that NVIDIA developed to train in mixed precision on their
Tensor Core hardware<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.</p>
<p>An NVIDIA TensorFloat (a.k.a. TF32) is just a 32-bit float that drops 13 precision bits
in order to execute on Tensor Cores. Thus, it has the precision of FP16 (10 bits), with
the range of FP32 (8 bits). However, if you’re not using Tensor Cores, it’s just a
32-bit float; if you’re only thinking about storage, it’s just a 32-bit float.</p>
<figure class="align-center">
<img style="float: middle" src="https://www.georgeho.org/assets/images/tensorfloat32.png" alt="Diagram illustrating the number and type of bits in an NVIDIA TensorFloat">
<figcaption>The number and type of bits in an NVIDIA TensorFloat. Source: <a href="https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/">NVIDIA blog</a>.</figcaption>
</figure>
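<p>The “drop 13 precision bits” step can be sketched the same way, by zeroing the 13 low mantissa bits of an FP32 value (a rough illustration only; the actual rounding behavior on Tensor Cores may differ):</p>

```python
import struct

def truncate_to_tf32(x):
    """Zero the 13 lowest mantissa bits of the FP32 representation,
    leaving 1 sign, 8 exponent and 10 mantissa bits in use."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    (truncated,) = struct.unpack(">f", struct.pack(">I", bits & 0xFFFFE000))
    return truncated
```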
<p>One distinct advantage of TF32 is that it’s kind of like FP32. To quote from the
NVIDIA developer blog,</p>
<blockquote>
<p>Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no
code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in
FP32. Non-matrix operations continue to use FP32.</p>
</blockquote>
<p>You can read more about TF32 <a href="https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/">on the NVIDIA
blog</a>, and
about its hardware support in the Ampere architecture on <a href="https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/">the NVIDIA developer
blog</a>.</p>
<p>TF32 is not in the C/C++ standard at all, but is supported in <a href="https://developer.nvidia.com/blog/cuda-11-features-revealed/">CUDA
11</a>.</p>
<p>Hardware-wise, the NVIDIA A100 is the first GPU (and, at the time of writing, the only
device) supporting TF32.</p>
<h2 id="advice-for-practitioners">Advice for Practitioners</h2>
<p>The first thing to say is that floating-point formats are <em>by no means</em> the most
important consideration for your deep learning model — not even close. Floating-point
formats will most likely only make a difference for very large or complex models, for
which fitting the model on GPU memory is a challenge, or for which training times are
excruciatingly long.</p>
<p>The second thing to say is that any practical advice has to be heavily dependent on what
hardware you have available to you.</p>
<h3 id="automatic-mixed-precision-amp-training--a-good-default">Automatic mixed precision (AMP) training — a good default</h3>
<p>Most deep learning stacks support mixed-precision training, which is a pretty good
default option to reap some of the benefits of low-precision training, while still
reasonably avoiding underflow and overflow problems.</p>
<p>TensorFlow supports <a href="https://www.tensorflow.org/guide/mixed_precision">mixed-precision training
natively</a>, whereas the <a href="https://github.com/NVIDIA/apex">NVIDIA Apex
library</a> makes automatic mixed precision training
available in PyTorch. To get started, take a look at NVIDIA’s <a href="https://developer.nvidia.com/automatic-mixed-precision">developer guide for
AMP</a>, and <a href="https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html">documentation for
training in mixed
precision</a>.</p>
<p>It’s worth going over the gist of mixed precision training. There are basically two main
tricks:</p>
<ol>
<li><em>Loss scaling:</em> multiply the loss by some large number, and divide the gradient
updates by this same large number. This avoids the loss underflowing (i.e. clamping
to zero because of the finite precision) in FP16, while still maintaining faithful
backward propagation.</li>
<li><em>FP32 master copy of weights</em>: store the weights themselves in FP32, but cast them to
FP16 before doing the forward and backward propagation (to reap the performance
benefits). During the weight update, the FP16 gradients are cast to FP32 to update
the master copy.</li>
</ol>
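<p>Loss scaling can be demonstrated with the standard library alone, since <code>struct</code>’s <code>"e"</code> format round-trips values through IEEE half precision (the gradient value and scale below are made-up illustrations):</p>

```python
import struct

def to_fp16(x):
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                      # a tiny gradient, e.g. late in training
naive = to_fp16(grad)            # naive FP16 storage: underflows to 0.0

scale = 1024.0                   # hypothetical loss scale
scaled = to_fp16(grad * scale)   # the scaled gradient survives in FP16
recovered = scaled / scale       # divide back out in full precision
print(naive, recovered)          # recovered is approximately 1e-8, not 0
```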
<p>You can read more about these techniques in <a href="https://arxiv.org/abs/1710.03740">this paper by NVIDIA and Baidu
Research</a>, or on the accompanying <a href="https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/">blog post by
NVIDIA</a>.</p>
<h3 id="alternative-floating-point-formats--make-sure-itll-be-worth-it">Alternative floating-point formats — make sure it’ll be worth it</h3>
<p>If you’ve already trained your model in mixed precision, it might not be worth the time
or effort to port your code to take advantage of an alternative floating-point format
and bleeding edge hardware.</p>
<p>However, if you choose to go that route, make sure your use case really demands it.
Perhaps you can’t scale up your model without using bfloat16, or you really need to cut
down on training times.</p>
<p>Unfortunately, I don’t have a well-informed opinion on how bfloat16 stacks up against
TF32, so “do your homework” is all I can advise. However, since the NVIDIA A100s only
just (at the time of writing) dropped into the market, it’ll be interesting to see what
the machine learning community thinks of the various low precision options available.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Technically speaking, there are <a href="https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format">quadruple-</a> and <a href="https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format">octuple-precision</a> floating-point formats, but those are pretty rarely used, and certainly unheard of in deep learning. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>A Tensor Core is essentially a mixed-precision FP16/FP32 core, which NVIDIA has optimized for deep learning applications. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Transformers in Natural Language Processing — A Brief Surveyhttps://www.georgeho.org/transformers-in-nlp/2020-05-23T00:00:00Z2020-05-23T00:00:00Z<p>I’ve recently had to learn a lot about natural language processing (NLP), specifically
Transformer-based NLP models.</p>
<p>Similar to my previous blog post on <a href="https://www.georgeho.org/deep-autoregressive-models/">deep autoregressive
models</a>, this blog post is a write-up
of my reading and research: I assume basic familiarity with deep learning, and aim to
highlight general trends in deep NLP, instead of commenting on individual architectures
or systems.</p>
<p>As a disclaimer, this post is by no means exhaustive and is biased towards
Transformer-based models, which seem to be the dominant breed of NLP systems (at least,
at the time of writing).</p>
<h2 id="some-architectures-and-developments">Some Architectures and Developments</h2>
<p>Here’s an (obviously) abbreviated history of Transformer-based models in NLP<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> in
(roughly) chronological order. I also cover some other non-Transformer-based models,
because I think they illuminate the history of NLP.</p>
<ol>
<li>
<p>word2vec and GloVe</p>
<ul>
<li>
<p>These were the first instances of word embeddings pre-trained on large amounts of
unlabeled text. These word embeddings generalized well to most other tasks (even
with limited amounts of labeled data), and usually led to appreciable improvements
in performance.</p>
</li>
<li>
<p>These ideas were immensely influential and have served NLP extraordinarily well.
However, they suffer from a major limitation. They are <em>shallow</em> representations
that can only be used in the first layer of any network: the remainder of the
network must still be trained from scratch.</p>
</li>
<li>
<p>The main appeal is well illustrated below: each word has its own vector
representation, and there are linear vector relationships that can encode common-sense
semantic meanings of words.</p>
<figure class="align-center">
<img style="float: middle" src="https://www.georgeho.org/assets/images/linear-relationships.png" alt="Linear vector relationships in word embeddings">
<figcaption>Linear vector relationships in word embeddings. Source: <a href="https://www.tensorflow.org/images/linear-relationships.png">TensorFlow documentation</a>.</figcaption>
</figure>
</li>
<li>
<p>Further reading</p>
<ul>
<li><a href="http://arxiv.org/abs/1301.3781">word2vec: Mikolov et al., Google. January 2013</a>
and <a href="http://arxiv.org/abs/1310.4546">October 2013</a>.</li>
<li><a href="https://nlp.stanford.edu/projects/glove/">GloVe: Pennington et al., Stanford CS. EMNLP
2014.</a></li>
</ul>
</li>
</ul>
</li>
<li>
<p>Broadly speaking, after word2vec/GloVe and before Transformers, a lot of ink was
spilled on many different approaches to NLP, including (but certainly not limited
to)</p>
<ol>
<li>Convolutional neural networks</li>
<li>Recurrent neural networks</li>
<li>Reinforcement learning approaches</li>
<li>Memory-augmented deep learning</li>
</ol>
<ul>
<li>Perhaps the most famous of such models is <a href="https://allennlp.org/elmo">ELMo (Embeddings from Language
Models)</a> by AI2, which learned bidirectional word
embeddings using LSTMs, and began NLP’s fondness for Sesame Street.</li>
<li>I won’t go into much more detail here: partly because not all of these approaches
have held up as well as current Transformer-based models, and partly because I have
plans for my computer that don’t involve blogging about recent advances in NLP.</li>
<li>Here is <a href="https://arxiv.org/abs/1708.02709">a survey paper</a> (and an <a href="https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d">associated blog
post</a>)
published shortly after the Transformer was invented, which summarizes a lot of the
work that was being done during this period.</li>
</ul>
</li>
<li>
<p>Transformer</p>
<ul>
<li>
<p>The authors introduce a feed-forward network architecture, using only attention
mechanisms and dispensing with convolutions and recurrence entirely (which were not
uncommon techniques in NLP at the time).</p>
</li>
<li>
<p>It achieved state-of-the-art performance on several tasks, and (perhaps more
importantly) was found to generalize very well to other NLP tasks, even with
limited data.</p>
</li>
<li>
<p>Since this architecture was the progenitor of so many other NLP models, it’s
worthwhile to dig into the details a bit. The architecture is illustrated below:
note that its feed-forward nature and multi-head self attention are critical
aspects of this architecture!</p>
<figure class="align-center">
<img style="float: middle" src="https://www.georgeho.org/assets/images/transformer-block.png" alt="Graphical representation of the Transformer block">
<figcaption>Graphical representation of the Transformer block. Source: <a href="https://i.pinimg.com/originals/02/95/a3/0295a3be438ae68f604e53fc88c7edb4.png">Pinterest</a>.</figcaption>
</figure>
</li>
<li>
<p>Further reading</p>
<ul>
<li><a href="https://arxiv.org/pdf/1706.03762.pdf">Vaswani et al., Google Brain. December 2017.</a></li>
<li><a href="https://jalammar.github.io/illustrated-transformer/"><em>The Illustrated Transformer</em> blog post</a></li>
<li><a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html"><em>The Annotated Transformer</em> blog post</a></li>
</ul>
</li>
</ul>
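<p>For concreteness, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of each Transformer block (single-head, with identity projections for Q, K and V to keep things short; a real Transformer first applies learned linear projections):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    as defined in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, d_model = 8
# Self-attention: Q, K and V all derive from the same input sequence.
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # each row of `attn` sums to 1
```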
</li>
<li>
<p>ULMFiT (Universal Language Model Fine-tuning for Text Classification)</p>
<ul>
<li>The authors introduce an effective transfer learning method that can be applied to
any task in NLP: this paper introduced the idea of general-domain, unsupervised
pre-training, followed by task-specific fine-tuning. They also introduce other
techniques that are fairly common in NLP now, such as slanted triangular learning
rate schedules (what some researchers now call warm-up).</li>
<li>Further reading
<ul>
<li><a href="https://arxiv.org/pdf/1801.06146.pdf">Howard and Ruder. January 2018.</a></li>
</ul>
</li>
</ul>
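<p>A quick sketch of the slanted triangular schedule, using the formula and default hyperparameters from the ULMFiT paper (linear warm-up for a short fraction of training, then a long linear decay):</p>

```python
def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate (Howard & Ruder, 2018):
    a short linear warm-up to lr_max, then a long linear decay."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                     # warming up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decaying
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 1000
lrs = [slanted_triangular_lr(t, T) for t in range(T)]
print(max(lrs), lrs.index(max(lrs)))  # peaks at lr_max, at step T * cut_frac
```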
</li>
<li>
<p>GPT-1 and GPT-2 (Generative Pre-trained Transformers)</p>
<ul>
<li>At the risk of peeking ahead, GPT is largely BERT but with Transformer decoder
blocks instead of encoder blocks. Note that in doing this, the model becomes
autoregressive/unidirectional, giving up BERT’s bidirectionality.</li>
<li>Arguably the main contribution of GPT-2 is that it demonstrated the value of
training larger Transformer models (a trend that I personally refer to as the
<em>Embiggening</em>).</li>
<li>GPT-2 generated some controversy, as OpenAI <a href="https://www.theverge.com/2019/2/14/18224704/ai-machine-learning-language-models-read-write-openai-gpt2">initially refused to open-source the
model</a>,
citing potential malicious uses, but <a href="https://www.theverge.com/2019/11/7/20953040/openai-text-generation-ai-gpt-2-full-model-release-1-5b-parameters">ended up releasing the model
later</a>.</li>
<li>Further reading
<ul>
<li><a href="https://openai.com/blog/language-unsupervised/">Radford et al., OpenAI. June
2018</a> and <a href="https://openai.com/blog/better-language-models/">February
2019</a>.</li>
<li><a href="http://jalammar.github.io/illustrated-gpt2/"><em>The Illustrated GPT-2</em> blog post</a></li>
</ul>
</li>
</ul>
</li>
<li>
<p>BERT (Bidirectional Encoder Representations from Transformers)</p>
<ul>
<li>
<p>The authors use the Transformer encoder (and only the encoder) to pre-train deep
bidirectional representations from unlabeled text. This pre-trained BERT model can
then be fine-tuned with just one additional output layer to achieve
state-of-the-art performance for many NLP tasks, without substantial task-specific
architecture changes, as illustrated below.</p>
<figure class="align-center">
<img style="float: middle" src="https://www.georgeho.org/assets/images/bert.png" alt="Graphical representation of BERT">
<figcaption>Graphical representation of BERT. Source: <a href="https://i.pinimg.com/originals/02/95/a3/0295a3be438ae68f604e53fc88c7edb4.png">Pinterest</a>.</figcaption>
</figure>
</li>
<li>
<p>BERT was a drastic development in the NLP landscape: it became almost a cliche to
conclude that BERT performs “surprisingly well” on whatever task or dataset you
throw at it.</p>
</li>
<li>
<p>Further reading</p>
<ul>
<li><a href="https://arxiv.org/pdf/1810.04805.pdf">Devlin et al., Google AI Language, May 2019.</a></li>
<li><a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">Accompanying blog post</a></li>
<li><a href="https://jalammar.github.io/illustrated-bert/"><em>The Illustrated BERT</em> blog post</a></li>
</ul>
</li>
</ul>
</li>
<li>
<p>RoBERTa (Robustly Optimized BERT Approach)</p>
<ul>
<li>
<p>The scientific contributions of this paper are best quoted from its abstract:</p>
<blockquote>
<p>We find that BERT was significantly under-trained, and can match or exceed the
performance of every model published after it. […] These results highlight the
importance of previously overlooked design choices, and raise questions about the
source of recently reported improvements.</p>
</blockquote>
</li>
<li>
<p>The authors use an identical architecture to BERT, but propose several improvements
to the training routine, such as changing the dataset and removing the
next-sentence-prediction (NSP) pre-training task. Funnily enough, far and away the
best thing the authors did to improve BERT was just the most obvious thing: train
BERT for longer!</p>
</li>
<li>
<p>Further reading:</p>
<ul>
<li><a href="https://arxiv.org/abs/1907.11692">Liu et al., Facebook AI. June 2019.</a></li>
<li><a href="https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/">Accompanying blog post</a></li>
</ul>
</li>
</ul>
</li>
<li>
<p>T5 (Text-to-Text Transfer Transformer)</p>
<ul>
<li>There are two main contributions of this paper:
<ol>
<li>The authors recast all NLP tasks into a text-to-text format: for example,
instead of performing a two-way softmax for binary classification, one could
simply teach an NLP model to output the tokens “spam” or “ham”. This provides a
unified text-to-text format for all NLP tasks.</li>
<li>The authors systematically study and compare the effects of pre-training
objectives, architectures, unlabeled datasets, transfer approaches, and other
factors on dozens of canonical NLP tasks.</li>
</ol>
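<p>As a sketch of the first contribution (the task prefix and label strings here are illustrative, not T5’s actual ones), recasting binary spam classification into the text-to-text format looks like:</p>

```python
def to_text_to_text(text, is_spam):
    """Recast a binary classification example as an (input, target) pair
    of strings: the model simply learns to emit the label tokens."""
    return (f"classify spam: {text}", "spam" if is_spam else "ham")

inp, tgt = to_text_to_text("WIN A FREE CRUISE!!!", is_spam=True)
print(inp)  # classify spam: WIN A FREE CRUISE!!!
print(tgt)  # spam
```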
</li>
<li>This paper (and especially the tables in the appendices!) probably cost the Google
team an incredible amount of money, and the authors were very thorough in ablating
what does and doesn’t help for a good NLP system.</li>
<li>Further reading
<ul>
<li><a href="https://arxiv.org/pdf/1910.10683.pdf">Raffel et al., Google. October 2019.</a></li>
<li><a href="https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html">Accompanying blog post</a></li>
</ul>
</li>
</ul>
</li>
</ol>
<h2 id="some-thoughts-and-observations">Some Thoughts and Observations</h2>
<p>Here I comment on some general trends that I see in Transformer-based models in NLP.</p>
<ol>
<li>
<p>Ever since Google developed the Transformer in 2017, most NLP contributions have not
been architectural: most recent advances have used the Transformer model as-is, or
some subset of it (e.g. BERT and GPT use exclusively Transformer encoder and decoder
blocks, respectively). Instead, recent research has focused on the way NLP models are
pre-trained or fine-tuned, on creating new datasets, or on formulating new NLP tasks
to measure “language understanding”, etc.</p>
<ul>
<li>I’m personally not sure what to make of this development: why did we collectively
agree that architectural research wasn’t worth pursuing anymore?</li>
<li>But spinning this the other way, we see that Transformers are a <em>fascinating</em>
architecture: the model has proven so surprisingly versatile and easy to teach that
we are still making meaningful advances with the same architecture. In fact, it is
still an open question how and why Transformers perform as well as they do: there
is an open field of research focusing on answering this question for BERT (since
BERT has been a uniquely successful model) called
<a href="https://huggingface.co/transformers/bertology.html">BERTology</a>.</li>
</ul>
</li>
<li>
<p>It was never a question of <em>whether</em> NLP systems would follow computer vision’s model
of fine-tuning pre-trained models (i.e. training a model on ImageNet and then doing
task-specific fine-tuning for downstream applications), but rather <em>how</em>.</p>
<ol>
<li>What specific task and/or dataset should NLP models be pre-trained on?
<ul>
<li>Language modelling has really won out here: BERT was originally published with a
<em>next-sentence prediction</em> (NSP) pre-training task, which RoBERTa completely did
away with.</li>
</ul>
</li>
<li>Exactly <em>what</em> is being learnt during pre-training?
<ul>
<li>Initially it was a separate vector for each token (i.e. pre-training a shallow
representation of text); these days, an entire network is pre-trained.</li>
<li>Sebastian Ruder <a href="https://thegradient.pub/nlp-imagenet/">wrote a great article</a>
that delves more into this topic.</li>
</ul>
</li>
</ol>
</li>
<li>
<p>There are (generally speaking) three flavors of Transformer models.</p>
<ol>
<li>Autoregressive models</li>
<li>Autoencoding models</li>
<li>Sequence-to-sequence models</li>
</ol>
<ul>
<li>Hugging Face does an excellent job of summarizing the differences between these
three flavors of models in <a href="https://huggingface.co/transformers/summary.html">their <em>Summary of the
Models</em></a>, which I’ve reproduced
here:</li>
</ul>
<blockquote>
<p>Autoregressive models are pretrained on the classic language modeling task: guess
the next token having read all the previous ones. They correspond to the decoder of
the original transformer model, and a mask is used on top of the full sentence so
that the attention heads can only see what was before in the sentence, and not what’s
after. Although those models can be fine-tuned and achieve great results on many
tasks, the most natural application is text generation. A typical example of such
models is GPT.</p>
<p>Autoencoding models are pretrained by corrupting the input tokens in some way and
trying to reconstruct the original sentence. They correspond to the encoder of the
original transformer model in the sense that they get access to the full inputs
without any mask. Those models usually build a bidirectional representation of the
whole sentence. They can be fine-tuned and achieve great results on many tasks such
as text generation, but their most natural application is sentence classification
or token classification. A typical example of such models is BERT.</p>
<p>[…]</p>
<p>Sequence-to-sequence models use both the encoder and the decoder of the original
transformer, either for translation tasks or by transforming other tasks to
sequence-to-sequence problems. They can be fine-tuned to many tasks but their most
natural applications are translation, summarization and question answering. The
original transformer model is an example of such a model (only for translation), T5
is an example that can be fine-tuned on other tasks.</p>
</blockquote>
</li>
<li>
<p>Different NLP models learn different kinds of embeddings, and it’s worth
understanding the differences between these various learnt representations.</p>
<ol>
<li>Contextual vs non-contextual embeddings
<ul>
<li>The first word embeddings (that is, word2vec and GloVe) were <em>non-contextual</em>:
each word had its own embedding, independent of the words that came before or
after it.</li>
<li>Almost all other embeddings are <em>contextual</em> now: when embedding a token, they
also consider the tokens before and/or after it.</li>
</ul>
</li>
<li>Unidirectional vs bidirectional embeddings
<ul>
<li>When considering the context of a token, the question is whether you should
consider the tokens both before and after it (i.e. bidirectional embeddings), or
just the tokens that came before (i.e. unidirectional embeddings).</li>
<li>Unidirectional embeddings make sense when generating text (i.e. text
generation must be done in the way humans write text: in one direction). On the
other hand, bidirectional embeddings make sense when performing sentence-level
tasks such as summarization or rewriting.</li>
<li>The Transformer was notable in that it had bidirectional encoder blocks and
unidirectional decoder blocks. That’s why BERT [GPT-2] produces bidirectional
[unidirectional] embeddings, since it’s a stack of Transformer encoders
[decoders].</li>
<li>Note that the unidirectional/bidirectional distinction is related to whether or
not the model is autoregressive: autoregressive models learn unidirectional
embeddings.</li>
</ul>
</li>
</ol>
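<p>The unidirectional/bidirectional distinction boils down to a mask on the attention scores: encoders see everything, decoders see only the past. A minimal sketch:</p>

```python
import numpy as np

seq_len = 4
# Bidirectional (BERT-style encoder): every token attends to every token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)
# Unidirectional (GPT-style decoder): a lower-triangular "causal" mask
# hides future positions, which is what makes the model autoregressive.
causal_mask = np.tril(bidirectional_mask)
print(causal_mask.astype(int))
```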
</li>
<li>
<p>Transformer-based models have had an interesting history with scaling.</p>
<ul>
<li>This trend probably started when GPT-2 was published: “it sounds very dumb and too
easy, but magical things happen if you make your Transformer model bigger”.</li>
<li>An open question is, how do Transformer models scale (along any dimension of
interest)? For example, how much does dataset size or the number of layers or the
number of training iterations matter in the ultimate performance of a Transformer
model? At what point does making your Transformer model “bigger” (along any
dimension of interest) provide diminishing returns?</li>
<li>There is some <a href="https://github.com/huggingface/awesome-papers#march-24-2020">solid
work</a> being done to
answer this question, and there seems to be good evidence for some fairly
surprising conclusions!</li>
</ul>
</li>
</ol>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Since writing this blog post, there have been several more Transformer-based NLP models published, such as the <a href="https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html">Reformer</a> from Google and <a href="https://arxiv.org/abs/2005.14165">GPT-3</a> from OpenAI. Because I can’t possibly keep up with <em>all</em> new Transformer-based models, I won’t be writing about them. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>
<h1>Adventures in Manipulating Python ASTs</h1>
<p><a href="https://www.georgeho.org/manipulating-python-asts/">https://www.georgeho.org/manipulating-python-asts/</a> (2020-03-27)</p>
<p>A while back, I explored the possibility of simplifying <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> PyMC4’s model specification
API by manipulating the <a href="https://docs.python.org/3/library/ast.html">Python abstract syntax
tree</a> (AST) of the model code. The PyMC
developers didn’t end up pursuing those API changes any further, but not until I had the
chance to learn a lot about Python ASTs.</p>
<p>Enough curious people have asked me about my experience tinkering with ASTs that I
figured I’d write a short post about the details of my project, in the hope that someone
else will find it useful.</p>
<p>You should read this blog post as a quick overview of my experience with Python ASTs, or
an annotated list of links, and not a comprehensive tutorial on model specification APIs
or Python ASTs. For a full paper trail of my adventures with Python ASTs, check out <a href="https://github.com/eigenfoo/random/tree/master/python/ast-hiding-yield">my
notebooks on
GitHub</a>.</p>
<h2 id="the-problem">The Problem</h2>
<p>Originally, PyMC4’s proposed model specification API looked something like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">linear_regression</span>(x):
</span></span><span style="display:flex;"><span> scale <span style="color:#f92672">=</span> <span style="color:#66d9ef">yield</span> tfd<span style="color:#f92672">.</span>HalfCauchy(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span> coefs <span style="color:#f92672">=</span> <span style="color:#66d9ef">yield</span> tfd<span style="color:#f92672">.</span>Normal(tf<span style="color:#f92672">.</span>zeros(x<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">1</span>]), <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span> predictions <span style="color:#f92672">=</span> <span style="color:#66d9ef">yield</span> tfd<span style="color:#f92672">.</span>Normal(tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>matvec(x, coefs), scale)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> predictions
</span></span></code></pre></div><p>The main drawback to this API was that the <code>yield</code> keyword was confusing. Many users
don’t really understand Python generators, and those who do might only understand
<code>yield</code> as a drop-in replacement for <code>return</code> (that is, they might understand what it
means for a function to end in <code>yield foo</code>, but would be uncomfortable with <code>bar = yield foo</code>).</p>
<p>Furthermore, the <code>yield</code> keyword introduces a leaky abstraction<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>: users don’t care
about whether the model is a function or a generator, and they shouldn’t need to. More
generally, users shouldn’t have to know anything about how PyMC works in order to use
it: ideally, the only thing users would need to think about would be their data and
their model. Having to graft several <code>yield</code> keywords into their code is a fairly big
intrusion in that respect.</p>
<p>Finally, this model specification API is essentially moving the problem off of our
plates and onto our users. The entire point of the PyMC project is to provide a friendly
and easy-to-use interface for Bayesian modelling.</p>
<p>To state the problem more concretely, we wanted to:</p>
<ol>
<li>Hide the <code>yield</code> keyword from the user-facing model specification API.</li>
<li>Obtain the user-defined model as a generator.</li>
</ol>
<p>The main difficulty with the first goal is that as soon as we remove <code>yield</code> from the
model function, it is no longer a generator. However, the PyMC inference engine needs the
model as a generator, since this allows us to interrupt the control flow of the model at
various points to do certain things:</p>
<ul>
<li>Manage random variable names.</li>
<li>Perform sampling.</li>
<li>Other arbitrary PyMC magic that I’m truthfully not familiar with.</li>
</ul>
<p>In short, the user writes their model as a function, but we require the model as a
generator.</p>
<p>I opine on why this problem is challenging a lot more
<a href="https://github.com/eigenfoo/random/tree/master/python/ast-hiding-yield/00-prototype#why-is-this-problem-hard">here</a>.</p>
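<p>To see why the generator matters, here is a toy (and entirely hypothetical) “inference engine” that drives a model generator, intercepting each yielded distribution-like object and sending a pretend sampled value back in; this is the kind of control-flow interruption described above, not PyMC’s actual machinery:</p>

```python
import types

def model():
    # Each `yield` hands a distribution-like object to the engine,
    # which sends a sampled value back into the generator.
    scale = yield ("HalfCauchy", 0, 1)
    x = yield ("Normal", 0, scale)
    return x

gen = model()                      # calling it just builds a generator
assert isinstance(gen, types.GeneratorType)

dist = gen.send(None)              # prime: run up to the first yield
while True:
    print("intercepted:", dist)
    try:
        dist = gen.send(1.0)       # pretend we sampled 1.0 from `dist`
    except StopIteration as stop:  # the model's `return` value
        result = stop.value
        break
print("model returned:", result)
```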
<h2 id="the-solution">The Solution</h2>
<p>First, I wrote a <code>FunctionToGenerator</code> class:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">FunctionToGenerator</span>(ast<span style="color:#f92672">.</span>NodeTransformer):
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">"""
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> This subclass traverses the AST of the user-written, decorated,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> model specification and transforms it into a generator for the
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> model. Subclassing in this way is the idiomatic way to transform
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> an AST.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> Specifically:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> 1. Add `yield` keywords to all assignments
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> E.g. `x = tfd.Normal(0, 1)` -> `x = yield tfd.Normal(0, 1)`
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> 2. Rename the model specification function to
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">       `_pm_compiled_model_generator`. This is done out of an abundance
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> of caution more than anything.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> 3. Remove the @Model decorator. Otherwise, we risk running into
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> an infinite recursion.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"> """</span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">visit_Assign</span>(self, node):
</span></span><span style="display:flex;"><span> new_node <span style="color:#f92672">=</span> node
</span></span><span style="display:flex;"><span> new_node<span style="color:#f92672">.</span>value <span style="color:#f92672">=</span> ast<span style="color:#f92672">.</span>Yield(value<span style="color:#f92672">=</span>new_node<span style="color:#f92672">.</span>value)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Tie up loose ends in the AST.</span>
</span></span><span style="display:flex;"><span> ast<span style="color:#f92672">.</span>copy_location(new_node, node)
</span></span><span style="display:flex;"><span> ast<span style="color:#f92672">.</span>fix_missing_locations(new_node)
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>generic_visit(node)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> new_node
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">visit_FunctionDef</span>(self, node):
</span></span><span style="display:flex;"><span> new_node <span style="color:#f92672">=</span> node
</span></span><span style="display:flex;"><span> new_node<span style="color:#f92672">.</span>name <span style="color:#f92672">=</span> <span style="color:#e6db74">"_pm_compiled_model_generator"</span>
</span></span><span style="display:flex;"><span> new_node<span style="color:#f92672">.</span>decorator_list <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Tie up loose ends in the AST.</span>
</span></span><span style="display:flex;"><span> ast<span style="color:#f92672">.</span>copy_location(new_node, node)
</span></span><span style="display:flex;"><span> ast<span style="color:#f92672">.</span>fix_missing_locations(new_node)
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>generic_visit(node)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> new_node
</span></span></code></pre></div><p>Subclassing <code>ast.NodeTransformer</code> (as <code>FunctionToGenerator</code> does) is the <a href="https://greentreesnakes.readthedocs.io/en/latest/manipulating.html#modifying-the-tree">recommended
way of modifying
ASTs</a>.
The functionality of <code>FunctionToGenerator</code> is pretty well described by the docstring:
the <code>visit_Assign</code> method adds the <code>yield</code> keyword to all assignments by wrapping the
visited <code>Assign</code> node within a <code>Yield</code> node. The <code>visit_FunctionDef</code> method removes the
decorator and renames the function to <code>_pm_compiled_model_generator</code>. All told, after
the <code>NodeTransformer</code> is done with the AST, we have one function,
<code>_pm_compiled_model_generator</code>, which is a modified version of the user-defined
function.</p>
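<p>The same pattern is easy to experiment with on a bare string of source code. Here is a pared-down, hypothetical cousin of <code>FunctionToGenerator</code> that only does the <code>yield</code>-insertion step:</p>

```python
import ast
import textwrap

class AddYield(ast.NodeTransformer):
    """Wrap the right-hand side of every assignment in a Yield node,
    i.e. turn `x = f()` into `x = yield f()`."""
    def visit_Assign(self, node):
        node.value = ast.Yield(value=node.value)
        # Give the new Yield node line/column info copied from its parent.
        return ast.fix_missing_locations(node)

src = textwrap.dedent("""
    def model():
        x = make_dist()
        return x
""")
tree = AddYield().visit(ast.parse(src))
transformed = ast.unparse(tree)  # Python 3.9+
print(transformed)  # the assignment now contains `yield make_dist()`
```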
<p>Second, the <code>Model</code> class:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Model</span>:
</span></span><span style="display:flex;"><span> <span style="color:#e6db74">""" pm.Model decorator. """</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> __init__(self, func):
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>func <span style="color:#f92672">=</span> func
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Introspect wrapped function, instead of the decorator class.</span>
</span></span><span style="display:flex;"><span> functools<span style="color:#f92672">.</span>update_wrapper(self, func)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Uncompile wrapped function.</span>
</span></span><span style="display:flex;"><span> uncompiled <span style="color:#f92672">=</span> uncompile(func<span style="color:#f92672">.</span>__code__)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Parse AST and modify it.</span>
</span></span><span style="display:flex;"><span> tree <span style="color:#f92672">=</span> parse_snippet(<span style="color:#f92672">*</span>uncompiled)
</span></span><span style="display:flex;"><span> tree <span style="color:#f92672">=</span> FunctionToGenerator()<span style="color:#f92672">.</span>visit(tree)
</span></span><span style="display:flex;"><span> uncompiled[<span style="color:#ae81ff">0</span>] <span style="color:#f92672">=</span> tree
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Recompile wrapped function.</span>
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>recompiled <span style="color:#f92672">=</span> recompile(<span style="color:#f92672">*</span>uncompiled)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Execute recompiled code (defines `_pm_compiled_model_generator`)</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># in the locals() namespace and assign it to an attribute.</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Refer to http://lucumr.pocoo.org/2011/2/1/exec-in-python/</span>
</span></span><span style="display:flex;"><span> exec(self<span style="color:#f92672">.</span>recompiled, <span style="color:#66d9ef">None</span>, locals())
</span></span><span style="display:flex;"><span> self<span style="color:#f92672">.</span>model_generator <span style="color:#f92672">=</span> locals()[<span style="color:#e6db74">"_pm_compiled_model_generator"</span>]
</span></span></code></pre></div><p>This class isn’t meant to be instantiated: rather, it’s <a href="https://realpython.com/primer-on-python-decorators/#classes-as-decorators">meant to be used as a Python
decorator</a>.
Essentially, it “uncompiles” the function to recover its Python source code.
This source code is then passed to the <code>parse_snippet</code><sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> function, which returns the
AST for the function. We then modify this AST with the <code>FunctionToGenerator</code> class that
we defined above. Finally, we recompile this AST and execute it. Recall that executing
this recompiled AST defines a new function called <code>_pm_compiled_model_generator</code>. This
new function, accessed via the <code>locals</code> variable<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>, is then bound to the class’s
<code>self.model_generator</code>, which explains the confusing-looking
<code>self.model_generator = locals()["_pm_compiled_model_generator"]</code>.</p>
<p>Finally, the user-facing API looks like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#a6e22e">@Model</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">linear_regression</span>(x):
</span></span><span style="display:flex;"><span> scale <span style="color:#f92672">=</span> tfd<span style="color:#f92672">.</span>HalfCauchy(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span> coefs <span style="color:#f92672">=</span> tfd<span style="color:#f92672">.</span>Normal(tf<span style="color:#f92672">.</span>zeros(x<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">1</span>]), <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span> predictions <span style="color:#f92672">=</span> tfd<span style="color:#f92672">.</span>Normal(tf<span style="color:#f92672">.</span>linalg<span style="color:#f92672">.</span>matvec(x, coefs), scale)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> predictions
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>linear_regression<span style="color:#f92672">.</span>model_generator(tf<span style="color:#f92672">.</span>zeros([<span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">10</span>])) <span style="color:#75715e"># Shape is irrelevant here</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Out[8]:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># <generator object _pm_compiled_model_generator at 0x107a5c5c8></span>
</span></span></code></pre></div><p>As you can see, the users need not write <code>yield</code> while specifying their models, and the
PyMC inference engine can now simply call the <code>model_generator</code> method of
<code>linear_regression</code> to produce a generator called <code>_pm_compiled_model_generator</code>, as
desired. Success!</p>
<h2 id="lessons-learnt">Lessons Learnt</h2>
<p>Again, PyMC4’s model specification API will <em>not</em> be incorporating these changes: the
PyMC developers have since decided that the <code>yield</code> keyword is the most elegant (but not
necessarily the easiest) way for users to specify statistical models. This post is just
meant to summarize the lessons learnt while pursuing this line of inquiry.</p>
<p>Reading and parsing the AST is perfectly safe: that’s basically just a form of code
introspection, which is totally a valid thing to do! It’s when you want to modify or
even rewrite the AST that things start getting <del>janky</del> dangerous (especially if you
want to execute the modified AST <em>instead</em> of the written code, as I was trying to do!).</p>
<p>If you want to programmatically modify the AST (e.g. “insert a <code>yield</code> keyword in front
of every assignment of a TensorFlow Distribution”, as in our case), stop and consider whether
you’re attempting to modify the <em>semantics</em> of the written code, and whether
that’s really a good idea (e.g. the <code>yield</code> keywords in the code <em>mean something</em>, and removing
those keywords changes the apparent semantics of the code).</p>
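<p>As a concrete illustration of this kind of rewrite, here is a minimal sketch using only the standard library’s <code>ast</code> module, mechanically inserting a <code>yield</code> in front of every assignment whose right-hand side is a function call. Treating <em>every</em> call as a distribution is a toy heuristic for illustration only; a real implementation would need to decide which calls to wrap:</p>

```python
import ast

class YieldInserter(ast.NodeTransformer):
    """Rewrite `x = SomeCall(...)` into `x = yield SomeCall(...)`.

    Wrapping *every* call is a toy heuristic: a real implementation
    would have to identify which calls are distribution constructors.
    """
    def visit_Assign(self, node):
        if isinstance(node.value, ast.Call):
            # Wrap the right-hand side in a yield expression.
            node.value = ast.Yield(value=node.value)
        return node

source = "loc = Normal(0.0, 1.0)"
tree = YieldInserter().visit(ast.parse(source))
ast.fix_missing_locations(tree)  # new nodes need line/column info
rewritten = ast.unparse(tree)    # the assignment now wraps the call in a yield
```

Executing the rewritten tree (rather than the code as written) is exactly the dangerous part discussed above: the source on disk no longer matches the semantics of what runs.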
<h2 id="further-reading">Further Reading</h2>
<p>I’ve only given a high-level overview of this project here, and a lot of the technical
details were glossed over. If you’re hungry for more, check out the following resources:</p>
<ul>
<li>Notebooks and more extensive documentation on this project <a href="https://github.com/eigenfoo/random/tree/master/python/ast-hiding-yield">are on
GitHub</a>. In
particular, it might be helpful to peruse the <a href="https://github.com/eigenfoo/random/tree/master/python/ast-hiding-yield/00-prototype#links-and-references">links and references at the end of the
READMEs</a>.</li>
<li>For those looking to programmatically inspect/modify Python ASTs the same way I did
here, you might find <a href="https://twitter.com/remilouf/status/1213079103156424704">this Twitter
thread</a> helpful.</li>
<li>And for those wondering how PyMC4’s model specification API ended up, some very smart
people gave their feedback on this work <a href="https://twitter.com/avibryant/status/1150827954319982592">on
Twitter</a>.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Or should I say, complicating? At any rate, changing! <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>I was <a href="https://twitter.com/avibryant/status/1150827954319982592">subsequently
convinced</a> that this
isn’t a leaky abstraction after all. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3">
<p>I omitted the implementation of <code>parse_snippet</code> for brevity. If you want
to see it, check out the “AST Helper Functions” section of <a href="https://github.com/eigenfoo/random/blob/master/python/ast-hiding-yield/00-prototype/hiding-yield.ipynb">this
notebook</a>. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4">
<p>For way more information on <code>exec</code>, <code>eval</code>, <code>locals</code> and <code>globals</code>, check
out <a href="https://lucumr.pocoo.org/2011/2/1/exec-in-python/">Armin Ronacher’s blog
post</a> and <a href="https://stackoverflow.com/questions/2220699/whats-the-difference-between-eval-exec-and-compile">this
StackOverflow
answer</a>. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Benchmarks for Mass Matrix Adaptationhttps://www.georgeho.org/mass-matrix-benchmarks/2019-12-14T00:00:00Z2019-12-14T00:00:00Z<p>I was lucky enough to be invited to attend the <a href="https://gradientretreat.com/">Gradient
Retreat</a> earlier this month. It was an entire week
on a beautiful island with some amazingly intelligent Bayesians, and no demands
on my time other than the self-set (and admittedly vague) goal of contributing
to probabilistic programming in some way.</p>
<p>I initially tried to implement mass matrix adaptation in Tensorflow Probability,
but I quickly readjusted my goals to something more achievable: running some
benchmarks of tuning methods for Hamiltonian Monte Carlo (HMC).</p>
<figure>
<a href="https://www.georgeho.org/assets/images/galiano.jpg"><img src="https://www.georgeho.org/assets/images/galiano.jpg" alt="A view of a forest on Galiano Island"></a>
<a href="https://www.georgeho.org/assets/images/galiano2.jpg"><img src="https://www.georgeho.org/assets/images/galiano2.jpg" alt="The view from a bluff on Galiano Island"></a>
<figcaption>Pictures from Galiano Island.</figcaption>
</figure>
<p>A quick rundown for those unfamiliar: <em>tuning</em> is what happens before sampling,
during which the goal is not to actually draw samples, but to <em>prepare</em> to draw
samples<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>. For HMC and its variants, this means estimating HMC parameters such
as the step size, integration time and mass matrix<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, the last of which is
basically the covariance matrix of the model parameters. Because my life is
finite (and I assume everybody else’s is too), I limited myself to mass matrix
adaptation.</p>
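<p>To make “estimating the mass matrix” concrete: for a diagonal mass matrix, adaptation amounts to keeping a running estimate of the per-parameter variance of the warmup draws. Below is a minimal sketch using Welford’s online algorithm, which is roughly what PyMC3 and Stan do within each adaptation window (the real implementations add regularization and windowing details omitted here):</p>

```python
class DiagMassMatrixEstimator:
    """Sketch of diagonal mass matrix adaptation via Welford's online
    variance algorithm. Real samplers regularize this estimate and
    reset it between adaptation windows."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim

    def update(self, draw):
        # One Welford update per warmup draw.
        self.n += 1
        for i, x in enumerate(draw):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def inverse_mass(self):
        # Sample variance of each parameter = diagonal of the inverse mass matrix.
        return [m2 / (self.n - 1) for m2 in self.m2]

est = DiagMassMatrixEstimator(dim=2)
for draw in ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0]):
    est.update(draw)
# per-parameter sample variances: [1.0, 100.0]
```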
<p>(If you’re still uncertain about the details of tuning or mass matrix
adaptation, check out <a href="https://colcarroll.github.io/hmc_tuning_talk/">Colin Carroll’s essay on HMC
tuning</a> or the <a href="https://mc-stan.org/docs/2_20/reference-manual/hmc-algorithm-parameters.html">Stan reference
manual on HMC
parameters</a>:
I don’t explain many more concepts in the rest of this post.)</p>
<p>The interesting thing about tuning is that there are no rules: there are no
asymptotic guarantees we can rely on and no mathematical results to which we can
turn for enlightened inspiration. The only thing we care about is obtaining a
decent estimate of the mass matrix, and preferably quickly.</p>
<p>Accompanying this lack of understanding of mass matrix adaptation is a
commensurate lack of (apparent) scientific inquiry — there is scant literature
to look to, and for open source developers, there is little prior art to draw
from when writing new implementations of HMC!</p>
<p>So I decided to do some empirical legwork and benchmark various methods of mass
matrix adaptation. Here are the questions I was interested in answering:</p>
<ol>
<li>Is the assumption that the mass matrix is diagonal (in other words, assume
that all parameters are uncorrelated) a good assumption to make? What are
the implications of this assumption for the tuning time, and the number of
effective samples per second?</li>
<li>Does the tuning schedule (i.e. the sizes of the adaptation windows) make a
big difference? Specifically, should we have a schedule of constant
adaptation windows, or an “expanding schedule” of exponentially growing
adaptation windows?</li>
<li>Besides assuming the mass matrix is diagonal, are there any other ways of
simplifying mass matrix adaptation? For example, could we approximate the
mass matrix as low rank?</li>
</ol>
<p>I benchmarked five different mass matrix adaptation methods:</p>
<ol>
<li>A diagonal mass matrix (<code>diag</code>)</li>
<li>A full (a.k.a. dense) mass matrix (<code>full</code>)</li>
<li>A diagonal mass matrix adapted on an expanding schedule (<code>diag_exp</code>)</li>
<li>A full mass matrix adapted on an expanding schedule (<code>full_exp</code>)</li>
<li>A low-rank approximation to the mass matrix using <a href="https://github.com/aseyboldt/covadapt">Adrian Seyboldt’s <code>covadapt</code> library</a>.</li>
</ol>
<p>I benchmarked these adaptation methods against six models:</p>
<ol>
<li>A 100-dimensional multivariate normal with a non-diagonal covariance matrix (<code>mvnormal</code>)</li>
<li>A 100-dimensional multivariate normal with a low-rank covariance matrix (<code>lrnormal</code>)</li>
<li>A <a href="https://docs.pymc.io/notebooks/stochastic_volatility.html">stochastic volatility model</a> (<code>stoch_vol</code>)</li>
<li>The <a href="https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html#The-Eight-Schools-Model">eight schools model</a> (<code>eight</code>)</li>
<li>The <a href="https://docs.pymc.io/notebooks/hierarchical_partial_pooling.html">PyMC3 baseball model</a> (<code>baseball</code>)</li>
<li>A <a href="https://docs.pymc.io/notebooks/GP-SparseApprox.html#Examples">sparse Gaussian process approximation</a> (<code>gp</code>)</li>
</ol>
<p>Without further ado, the main results are shown below. Afterwards, I make some
general observations on the benchmarks, and finally I describe various
shortcomings of my experimental setup (which, if I were more optimistic, I would
call “directions for further work”).</p>
<h3 id="tuning-times">Tuning Times</h3>
<p>This tabulates the tuning time, in seconds, of each adaptation method for each
model. Lower is better. The lowest tuning time for each model is shown in bold
italics.</p>
<table>
<thead>
<tr>
<th style="text-align:left"></th>
<th style="text-align:right"><strong><code>mvnormal</code></strong></th>
<th style="text-align:right"><strong><code>lrnormal</code></strong></th>
<th style="text-align:right"><strong><code>stoch_vol</code></strong></th>
<th style="text-align:right"><strong><code>gp</code></strong></th>
<th style="text-align:right"><strong><code>eight</code></strong></th>
<th style="text-align:right"><strong><code>baseball</code></strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left"><strong><code>diag</code></strong></td>
<td style="text-align:right">365.34</td>
<td style="text-align:right">340.10</td>
<td style="text-align:right">239.59</td>
<td style="text-align:right">18.47</td>
<td style="text-align:right">2.92</td>
<td style="text-align:right">5.32</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>full</code></strong></td>
<td style="text-align:right"><em><strong>8.29</strong></em></td>
<td style="text-align:right">364.07</td>
<td style="text-align:right">904.95</td>
<td style="text-align:right"><em><strong>14.24</strong></em></td>
<td style="text-align:right"><em><strong>2.91</strong></em></td>
<td style="text-align:right"><em><strong>4.93</strong></em></td>
</tr>
<tr>
<td style="text-align:left"><strong><code>diag_exp</code></strong></td>
<td style="text-align:right">358.50</td>
<td style="text-align:right">360.91</td>
<td style="text-align:right"><em><strong>219.65</strong></em></td>
<td style="text-align:right">16.25</td>
<td style="text-align:right">3.05</td>
<td style="text-align:right">5.08</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>full_exp</code></strong></td>
<td style="text-align:right">8.46</td>
<td style="text-align:right">142.20</td>
<td style="text-align:right">686.58</td>
<td style="text-align:right">14.87</td>
<td style="text-align:right">3.21</td>
<td style="text-align:right">6.04</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>covadapt</code></strong></td>
<td style="text-align:right">386.13</td>
<td style="text-align:right"><em><strong>89.92</strong></em></td>
<td style="text-align:right">398.08</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
</tr>
</tbody>
</table>
<h3 id="effective-samples-per-second">Effective Samples per Second</h3>
<p>This tabulates the number of effective samples per second drawn by each adaptation
method for each model. Higher is better. The highest number of effective samples per
second for each model is shown in bold italics.</p>
<table>
<thead>
<tr>
<th style="text-align:left"></th>
<th style="text-align:right"><strong><code>mvnormal</code></strong></th>
<th style="text-align:right"><strong><code>lrnormal</code></strong></th>
<th style="text-align:right"><strong><code>stoch_vol</code></strong></th>
<th style="text-align:right"><strong><code>gp</code></strong></th>
<th style="text-align:right"><strong><code>eight</code></strong></th>
<th style="text-align:right"><strong><code>baseball</code></strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left"><strong><code>diag</code></strong></td>
<td style="text-align:right">0.02</td>
<td style="text-align:right">1.55</td>
<td style="text-align:right"><em><strong>11.22</strong></em></td>
<td style="text-align:right">65.36</td>
<td style="text-align:right">761.82</td>
<td style="text-align:right">455.23</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>full</code></strong></td>
<td style="text-align:right">1.73</td>
<td style="text-align:right">0.01</td>
<td style="text-align:right">6.71</td>
<td style="text-align:right"><em><strong>106.30</strong></em></td>
<td style="text-align:right"><em><strong>840.77</strong></em></td>
<td style="text-align:right"><em><strong>495.93</strong></em></td>
</tr>
<tr>
<td style="text-align:left"><strong><code>diag_exp</code></strong></td>
<td style="text-align:right">0.02</td>
<td style="text-align:right">1.51</td>
<td style="text-align:right">9.79</td>
<td style="text-align:right">59.89</td>
<td style="text-align:right">640.90</td>
<td style="text-align:right">336.71</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>full_exp</code></strong></td>
<td style="text-align:right"><em><strong>1,799.11</strong></em></td>
<td style="text-align:right"><em><strong>1,753.65</strong></em></td>
<td style="text-align:right">0.16</td>
<td style="text-align:right">101.99</td>
<td style="text-align:right">618.28</td>
<td style="text-align:right">360.14</td>
</tr>
<tr>
<td style="text-align:left"><strong><code>covadapt</code></strong></td>
<td style="text-align:right">0.02</td>
<td style="text-align:right">693.87</td>
<td style="text-align:right">5.71</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
<td style="text-align:right">N/A</td>
</tr>
</tbody>
</table>
<h2 id="observations">Observations</h2>
<blockquote>
<p><strong>tldr:</strong> As is typical with these sorts of things, no one adaptation method
uniformly outperforms the others.</p>
</blockquote>
<ul>
<li>A full mass matrix can provide significant improvements over a diagonal mass
matrix for both the tuning time and the number of effective samples per
second. This improvement can sometimes go up to two orders of magnitude!
<ul>
<li>This is most noticeable in the <code>mvnormal</code> model, with heavily correlated
parameters.</li>
<li>Happily, my benchmarks are not the only instance of full mass matrices
outperforming diagonal ones: <a href="https://dfm.io/posts/pymc3-mass-matrix/">Dan Foreman-Mackey demonstrated something
similar in one of his blog posts</a>.</li>
<li>However, in models with less extreme correlations among parameters, this
advantage shrinks significantly (although it doesn’t go away entirely).
Full matrices can also take longer to tune. You can see this in the baseball
or eight schools model.</li>
<li>Nevertheless, full mass matrices never seem to perform egregiously <em>worse</em>
than diagonal mass matrices. This makes sense theoretically: a full mass
matrix can be estimated to be diagonal (at the cost of a quadratic memory
requirement as opposed to linear), but not vice versa.</li>
</ul>
</li>
<li>Having an expanding schedule for tuning can sometimes give better performance,
but nowhere near as significant as the difference between diagonal and full
matrices. This difference is most noticeable for the <code>mvnormal</code> and <code>lrnormal</code>
models (probably because these models have a constant covariance matrix and so
more careful estimates using expanding windows can provide much better
sampling).</li>
<li>I suspect the number of effective samples per second for a full mass matrix on
the <code>lrnormal</code> model (0.01 effective samples per second) is a mistake (or
some other computational fluke): it looks way too low to be reasonable.</li>
<li>I’m also surprised that <code>full_exp</code> does really badly (in terms of effective
samples per second) on the <code>stoch_vol</code> model, despite <code>full</code> doing decently
well! This is either a fluke, or a really interesting phenomenon to dig into.</li>
<li><code>covadapt</code> seems to run into some numerical difficulties? While running these
benchmarks I ran into an inscrutable and non-reproducible
<a href="https://stackoverflow.com/q/18436667"><code>ArpackError</code></a> from SciPy.</li>
</ul>
<h2 id="experimental-setup">Experimental Setup</h2>
<ul>
<li>All samplers were run for 2000 tuning steps and 1000 sampling steps. This is
unusually high, but is necessary for <code>covadapt</code> to work well, and I wanted to
use the same number of iterations across all the benchmarks.</li>
<li>My expanding schedule is as follows: the first adaptation window is 100
iterations, and each subsequent window is 1.005 times the previous window.
These numbers give 20 updates within 2000 iterations, while maintaining an
exponentially increasing adaptation window size.</li>
<li>I didn’t run <code>covadapt</code> for models with fewer than 100 model parameters.
With so few parameters, there’s no need to approximate a mass matrix as
low-rank: you can just estimate the full mass matrix!</li>
<li>I set <code>target_accept</code> (a.k.a. <code>adapt_delta</code> to Stan users) to 0.9 to make all
divergences go away.</li>
<li>All of these numbers were collected by sampling once per model per adaptation
method (yes only once, sorry) in PyMC3, running on my MacBook Pro.</li>
</ul>
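<p>For concreteness, here is a small sketch of how an expanding schedule like the one described above can be generated. The parameter values match the description (first window of 100, growth factor of 1.005, 2000 tuning iterations), but the exact rounding and stopping behavior is my own choice for illustration:</p>

```python
def expanding_windows(first=100, growth=1.005, total=2000):
    """Generate (start, end) adaptation windows that grow geometrically,
    stopping once the next window would overrun the tuning budget."""
    windows, start, size = [], 0, float(first)
    while start + int(size) <= total:
        windows.append((start, start + int(size)))
        start += int(size)
        size *= growth
    return windows

windows = expanding_windows()
# yields roughly 20 windows covering the 2000 tuning iterations
```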
<h2 id="shortcomings">Shortcomings</h2>
<ul>
<li>In some sense comparing tuning times is not a fair comparison: it’s possible
that some mass matrix estimates converge quicker than others, and so comparing
their tuning times is essentially penalizing these methods for converging
faster than others.</li>
<li>It’s also possible that my expanding schedule for the adaptation windows just
sucks! There’s no reason why the first window needs to be 100 iterations, or
why 1.005 should be a good multiplier. It looks like Stan <a href="https://github.com/stan-dev/stan/blob/736311d88e99b997f5b902409752fb29d6ec0def/src/stan/mcmc/windowed_adaptation.hpp#L95">doubles their
adaptation window
sizes</a>
during warmup.</li>
<li>These benchmarks are done only for very basic toy models: I should test more
extensively on more models that people in The Real World™ use.</li>
<li>If you are interested in taking these benchmarks further (or perhaps just want
to fact-check me on my results), the code is <a href="https://github.com/eigenfoo/mass-matrix-benchmarks">sitting in this GitHub
repository</a><sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>.</li>
</ul>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>It’s worth pointing out that mass matrix adaptation is meant to make sampling
more efficient, not more valid. Theoretically, any mass matrix would work,
but a good one (i.e. a good estimate of the covariance matrix of the model
parameters) could sample orders of magnitudes more efficiently. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>…uh, <em><em>sweats and looks around nervously for differential geometers</em></em>
more formally called the <em>metric</em>… <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3">
<p>There are some violin plots lying around in the notebook, a relic from a
time when I thought that I would have the patience to run each model and
adaptation method multiple times. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Introducing `stan-vim`https://www.georgeho.org/stan-vim/2019-11-11T00:00:00Z2019-11-11T00:00:00Z<center>
<img
src="https://www.georgeho.org/assets/images/stan-logo.png"
alt="Stan logo">
</center>
<p>I made a Vim plugin for Stan!</p>
<p>I’ve been reading and writing a lot of Stan lately, but mainly in barebones text
editors (or even just by <code>cat</code>ing out the file), so I had to make do with none
of the creature comforts of my favorite text editor, Vim.</p>
<p>But I also wasn’t happy with the syntax highlighting provided by
<a href="https://github.com/maverickg/stan.vim">existing</a>
<a href="https://github.com/mdlerch/mc-stan.vim">Vim</a>
<a href="https://github.com/ssp3nc3r/stan-syntax-vim">plugins</a> (and they also looked out
of date and thinly maintained…), so I just went ahead and learnt a truckload
of Vimscript<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</p>
<p>Check out the plugin! You can find installation instructions
<a href="https://github.com/eigenfoo/stan-vim#installation">here</a> and documentation
<a href="https://github.com/eigenfoo/stan-vim#documentation">here</a>. Screenshots of
syntax highlighting and projects links are below.</p>
<ul>
<li><a href="https://github.com/eigenfoo/stan-vim">GitHub</a></li>
<li><a href="https://vimawesome.com/plugin/stan-vim-is-written-on">VimAwesome</a></li>
<li><a href="https://www.vim.org/scripts/script.php?script_id=5835">Vim Online</a></li>
</ul>
<figure>
<a href="https://raw.githubusercontent.com/eigenfoo/stan-vim/master/screenshots/screenshot0.png"><img src="https://raw.githubusercontent.com/eigenfoo/stan-vim/master/screenshots/screenshot0.png" alt="Screenshot of a Stan model in stan-vim"></a>
<a href="https://raw.githubusercontent.com/eigenfoo/stan-vim/master/screenshots/screenshot1.png"><img src="https://raw.githubusercontent.com/eigenfoo/stan-vim/master/screenshots/screenshot1.png" alt="Screenshot of the stan-vim documentation"></a>
<a href="https://raw.githubusercontent.com/eigenfoo/stan-vim/master/screenshots/screenshot2.png"><img src="https://raw.githubusercontent.com/eigenfoo/stan-vim/master/screenshots/screenshot2.png" alt="Screenshot of another Stan model in stan-vim"></a>
<figcaption>Screenshots of <code>stan-vim</code> syntax highlighting.</figcaption>
</figure>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>As it turns out, <a href="https://www.reddit.com/r/vim/comments/54224o/why_is_there_so_much_hate_for_vimscript/">Vimscript is a very not-good
language</a>.
This is probably the last Vim plugin I write. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Anatomy of a Probabilistic Programming Frameworkhttps://www.georgeho.org/prob-prog-frameworks/2019-09-30T00:00:00Z2019-09-30T00:00:00Z<p>Recently, the PyMC4 developers <a href="https://openreview.net/forum?id=rkgzj5Za8H">submitted an
abstract</a> to the <a href="https://program-transformations.github.io/"><em>Program Transformations
for Machine Learning</em> NeurIPS workshop</a>. I
realized that despite knowing a thing or two about Bayesian modelling, I don’t
understand how probabilistic programming frameworks are structured, and therefore
couldn’t appreciate the sophisticated design work going into PyMC4. So I trawled through
papers, documentation and source code<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> of various open-source probabilistic
programming frameworks, and this is what I’ve managed to take away from it.</p>
<p>I assume you know a fair bit about probabilistic programming and Bayesian modelling, and
are familiar with the big players in the probabilistic programming world. If you’re
unsure, you can <a href="https://www.georgeho.org/bayesian-inference-reading/">read up here</a>.</p>
<div>
<h2>Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#dissecting-probabilistic-programming-frameworks">Dissecting Probabilistic Programming Frameworks</a>
<ul>
<li><a href="#specifying-the-model-languageapi">Specifying the model: language/API</a></li>
<li><a href="#building-the-model-density-distributions-and-transformations">Building the model density: distributions and transformations</a></li>
<li><a href="#computing-the-posterior-inference-algorithm">Computing the posterior: inference algorithm</a></li>
<li><a href="#computing-the-mode-optimizer">Computing the mode: optimizer</a></li>
<li><a href="#computing-gradients-autodifferentiation">Computing gradients: autodifferentiation</a></li>
<li><a href="#monitoring-inference-diagnostics">Monitoring inference: diagnostics</a></li>
</ul>
</li>
<li><a href="#a-zoo-of-probabilistic-programming-frameworks">A Zoo of Probabilistic Programming Frameworks</a>
<ul>
<li><a href="#stan">Stan</a></li>
<li><a href="#tensorflow-probability-aka-tfp">TensorFlow Probability (a.k.a. TFP)</a></li>
<li><a href="#pymc3">PyMC3</a></li>
<li><a href="#pymc4">PyMC4</a></li>
<li><a href="#pyro">Pyro</a></li>
</ul>
</li>
</ul>
</nav>
</div>
<h2 id="dissecting-probabilistic-programming-frameworks">Dissecting Probabilistic Programming Frameworks</h2>
<p>A probabilistic programming framework needs to provide six things:</p>
<ol>
<li>A language or API for users to specify a model</li>
<li>A library of probability distributions and transformations to build the posterior
density</li>
<li>At least one inference algorithm, which either draws samples from the posterior (in
the case of Markov Chain Monte Carlo, MCMC) or computes some approximation of it (in
the case of variational inference, VI)</li>
<li>At least one optimizer, which can compute the mode of the posterior density</li>
<li>An autodifferentiation library to compute gradients required by the inference
algorithm and optimizer</li>
<li>A suite of diagnostics to monitor and analyze the quality of inference</li>
</ol>
<p>These six pieces come together like so:</p>
<p><img src="https://www.georgeho.org/assets/images/prob-prog-flowchart.png" alt="Flowchart illustrating the structure of a probabilistic programming
frameworks"></p>
<p>Let’s break this down one by one.</p>
<h3 id="specifying-the-model-languageapi">Specifying the model: language/API</h3>
<p>This is what users will use to specify their models. Most frameworks will let users
write in some existing programming language and call the framework’s functions and
classes, but <del>some others</del> — why don’t I just say it — Stan rolls their own
domain-specific language.</p>
<p>The main question here is what language you think is best for users to specify models
in: any sufficiently popular host language (such as Python) will reduce the learning
curve for users and make the framework easier to develop and maintain, but creating
your own language allows you to introduce helpful abstractions for your framework’s
particular use case (as <a href="https://mc-stan.org/docs/2_20/reference-manual/blocks-chapter.html">Stan
does</a>, for example).</p>
<p>At this point I should point out the non-universal, Python bias in this post: there are
plenty of interesting non-Python probabilistic programming frameworks out there (e.g.
<a href="https://greta-stats.org/">Greta</a> in R, <a href="https://turing.ml/dev/">Turing</a> and
<a href="https://www.gen.dev/">Gen</a> in Julia, <a href="https://github.com/p2t2/figaro">Figaro</a> and
<a href="https://github.com/stripe/rainier">Rainier</a> in Scala), as well as universal
probabilistic programming systems<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> (e.g.
<a href="http://probcomp.csail.mit.edu/software/venture/">Venture</a> from MIT,
<a href="https://probprog.github.io/anglican/index.html">Angelican</a> from Oxford)<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>. I just
don’t know anything about any of them.</p>
<h3 id="building-the-model-density-distributions-and-transformations">Building the model density: distributions and transformations</h3>
<p>These are what the user’s model calls, in order to compile/build the model itself
(whether that means a posterior log probability, in the case of MCMC, or some loss
function to minimize, in the case of VI). By <em>distributions</em>, I mean the probability
distributions that the random variables in your model can assume (e.g. Normal or
Poisson), and by <em>transformations</em> I mean deterministic mathematical operations you can
perform on these random variables, while still keeping track of the derivative of these
transformations<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> (e.g. exponentials, logarithms, sines or cosines).</p>
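<p>The phrase “keeping track of the derivative of these transformations” deserves a concrete example. When a constrained random variable is mapped to an unconstrained space (as most frameworks do before sampling), the log density must be corrected by the log absolute Jacobian of the transformation. A hand-rolled sketch for the log transform of a positive-valued variable (the distribution and parameter values here are arbitrary choices for illustration):</p>

```python
import math

def normal_logpdf(x, mu, sigma):
    # log N(x | mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def logdensity_unconstrained(y, mu=1.0, sigma=0.5):
    """Log density of y = log(x), where x has a Normal(mu, sigma) density
    restricted to x > 0 (normalizing constant ignored in this sketch).

    Change of variables: log p(y) = log p(x = exp(y)) + log|d exp(y)/dy|
                                  = log p(exp(y)) + y
    """
    return normal_logpdf(math.exp(y), mu, sigma) + y

# At y = 0 we have exp(y) = 1 and a Jacobian correction of log(exp(0)) = 0,
# so the corrected density equals the original density evaluated at x = 1.
```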
<p>This is a good time to point out that the interactions between the language/API and the
distributions and transformations libraries is a major design problem. Here’s a (by no
means exhaustive) list of necessary considerations:</p>
<ol>
<li>In order to build the model density, the framework must keep track of every
distribution and transformation, while also computing the derivatives of any such
transformations. This results in a Jekyll-and-Hyde problem where every transformation
requires a forward and backwards definition. Should this tracking happen eagerly, or
should it be deferred until the user specifies what the model will be used for?</li>
<li>Theoretically, a model’s specification should be the same whether it is to be used
for evaluation, inference or debugging. However, in practice, the program execution
(and computational graph) are different for these three purposes. How should the
framework manage this?</li>
<li>The framework must also keep track of the shapes of random variables, which is
frighteningly non-trivial! Check out <a href="https://ericmjl.github.io/blog/2019/5/29/reasoning-about-shapes-and-probability-distributions/">this blog
post</a>
or <a href="https://arxiv.org/abs/1711.10604">the original Tensorflow Distributions paper</a>
(specifically section 3.3 on shape semantics) for more details.</li>
</ol>
<p>For a more comprehensive treatment, I can’t recommend <a href="https://docs.google.com/presentation/d/1xgNRJDwkWjTHOYMj5aGefwWiV8x-Tz55GfkBksZsN3g/edit?usp=sharing">Junpeng Lao’s PyData Córdoba 2019
talk</a>
highly enough — he explains in depth the main challenges in implementing a probabilistic
programming API and highlights how various frameworks manage these difficulties.</p>
<h3 id="computing-the-posterior-inference-algorithm">Computing the posterior: inference algorithm</h3>
<p>Having specified and built the model, the framework must now actually perform inference:
given a model and some data, obtain the posterior (either by sampling from it, in the
case of MCMC, or by approximating it, in the case of VI).</p>
<p>Most probabilistic programming frameworks out there implement both MCMC and VI
algorithms, although strength of support and quality of documentation can lean heavily
one way or another. For example, Stan invests heavily into its MCMC, whereas Pyro has
the most extensive support for its stochastic VI.</p>
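<p>To ground what “performing inference” means at its simplest, here is a toy random-walk Metropolis sampler. This is nothing like the gradient-based samplers (HMC, NUTS) that these frameworks actually ship, but it shows the shape of the problem: a log density function goes in, draws from the posterior come out:</p>

```python
import math
import random

def metropolis(logp, x0, n_draws=5000, scale=0.5):
    """Toy 1-D random-walk Metropolis: accept a Gaussian proposal with
    probability min(1, p(proposal) / p(current))."""
    x, lp = x0, logp(x0)
    draws = []
    for _ in range(n_draws):
        proposal = x + random.gauss(0.0, scale)
        lp_prop = logp(proposal)
        if math.log(random.random()) < lp_prop - lp:
            x, lp = proposal, lp_prop
        draws.append(x)
    return draws

random.seed(0)
# Target: a standard normal, up to an additive constant in log space.
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0)
```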
<h3 id="computing-the-mode-optimizer">Computing the mode: optimizer</h3>
<p>Sometimes, instead of performing full-blown inference, it’s useful to find the mode of
the model density. These modes can be used as point estimates of parameters, or as the
basis of approximations to a Bayesian posterior. Or perhaps you’re doing VI, and you
need some way to perform SGD on a loss function. In either case, a probabilistic
programming framework calls for an optimizer.</p>
<p>If you don’t need to do VI, then a simple and sensible thing to do is to use some
<a href="https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm">BFGS-based optimization
algorithm</a>
(e.g. some quasi-Newton method like
<a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a>) and call it a day.
However, frameworks that focus on VI need to implement <a href="http://docs.pyro.ai/en/stable/optimization.html#module-pyro.optim.optim">optimizers commonly seen in deep
learning</a>, such
as Adam or RMSProp.</p>
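<p>As a bare-bones illustration of mode-finding (real frameworks would use L-BFGS or Adam as noted above, not this naive loop), here is gradient ascent on a log density:</p>

```python
def find_mode(grad_logp, x0, lr=0.1, steps=200):
    """Naive gradient ascent on a 1-D log density: step uphill along
    the gradient until (hopefully) reaching the mode."""
    x = x0
    for _ in range(steps):
        x += lr * grad_logp(x)
    return x

# For a N(3, 1) density, grad log p(x) = -(x - 3), so the mode is at 3.
mode = find_mode(lambda x: -(x - 3.0), x0=0.0)  # converges to approximately 3.0
```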
<h3 id="computing-gradients-autodifferentiation">Computing gradients: autodifferentiation</h3>
<p>Both the inference algorithm and the optimizer require gradients (at least, if you’re
not using ancient inference algorithms and optimizers!), and so you’ll need some way to
compute these gradients.</p>
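<p>To demystify what an autodifferentiation library actually does, here is a forward-mode sketch using dual numbers. (The deep learning frameworks mentioned below use reverse-mode autodiff on a computational graph, which scales far better to models with many parameters, but forward mode is the easiest to write down.)</p>

```python
class Dual:
    """Minimal forward-mode autodiff via dual numbers: carry a value
    and its derivative (`eps`) through + and * simultaneously."""

    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

    __rmul__ = __mul__

def grad(f, x):
    """Derivative of f at x: seed the dual part with 1 and read it back out."""
    return f(Dual(x, 1.0)).eps

# d/dx (x*x + 3*x) at x = 2 is 2*2 + 3 = 7
```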
<p>The easiest thing to do would be to rely on a deep learning framework like TensorFlow or
PyTorch. I’ve learned not to get too excited about this though: while deep learning
frameworks’ heavy optimization of parallelized routines lets you e.g. obtain <a href="https://colindcarroll.com/2019/08/18/very-parallel-mcmc-sampling/">thousands
of MCMC chains in a reasonable amount of
time</a>, it’s not
obvious that this is useful at all (although there’s definitely some work going on in
this area).</p>
<h3 id="monitoring-inference-diagnostics">Monitoring inference: diagnostics</h3>
<p>Finally, once the inference algorithm has worked its magic, you’ll want a way to verify
the validity and efficiency of that inference. This involves some <a href="https://arviz-devs.github.io/arviz/api.html#stats">off-the-shelf
statistical diagnostics</a> (e.g. BFMI,
information criteria, effective sample size, etc.), but mainly <a href="https://arviz-devs.github.io/arviz/api.html#plots">lots and lots of
visualization</a>.</p>
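<p>To make this concrete, here is a simplified (non-split) version of the potential scale reduction statistic $\hat{R}$; ArviZ and Stan actually use the more robust rank-normalized split-$\hat{R}$:</p>

```python
import random
from statistics import mean, variance

def r_hat(chains):
    # Between-chain variance of the chain means, scaled by chain length...
    n = len(chains[0])
    b = n * variance([mean(chain) for chain in chains])
    # ...versus the average within-chain variance
    w = mean(variance(chain) for chain in chains)
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

random.seed(0)
# Four well-mixed chains targeting the same distribution: r_hat is close to 1
good = [[random.gauss(0.0, 1.0) for _ in range(200)] for _ in range(4)]
# Two chains stuck in different modes: r_hat is far above the usual threshold
bad = [[random.gauss(m, 1.0) for _ in range(200)] for m in (0.0, 5.0)]
```

<p>Values of $\hat{R}$ far above 1 indicate that the chains disagree about where the posterior mass is, i.e. that inference has not converged.</p>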
<h2 id="a-zoo-of-probabilistic-programming-frameworks">A Zoo of Probabilistic Programming Frameworks</h2>
<p>Having outlined the basic internals of probabilistic programming frameworks, I think
it’s helpful to go through several of the popular frameworks as examples. I’ve tried to
link to the relevant source code in the frameworks where possible.</p>
<h3 id="stan">Stan</h3>
<p>It’s very easy to describe how Stan is structured: literally everything is
implemented from scratch in C++.</p>
<ol>
<li>Stan has a compiler for <a href="https://github.com/stan-dev/stan/tree/develop/src/stan/lang">a small domain-specific language for specifying Bayesian
models</a></li>
<li>Stan has libraries of <a href="https://github.com/stan-dev/math/tree/develop/stan/math/prim">probability
distributions</a> and
<a href="https://github.com/stan-dev/math/tree/develop/stan/math/prim/fun">transforms</a></li>
<li>Stan implements <a href="https://github.com/stan-dev/stan/tree/develop/src/stan/mcmc/hmc">dynamic
HMC</a> and
<a href="https://github.com/stan-dev/stan/tree/develop/src/stan/variational">variational
inference</a></li>
<li>Stan also rolls their own <a href="https://github.com/stan-dev/math/tree/develop/stan/math">autodifferentiation
library</a><sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></li>
<li>Stan implements an <a href="https://github.com/stan-dev/stan/tree/develop/src/stan/optimization">L-BFGS based
optimizer</a> (but
also implements <a href="https://mc-stan.org/docs/2_20/reference-manual/optimization-algorithms-chapter.html">a less efficient Newton
optimizer</a>)</li>
<li>Finally, Stan has a <a href="https://github.com/stan-dev/stan/tree/develop/src/stan/analyze/mcmc">suite of
diagnostics</a></li>
</ol>
<p>Note that contrary to popular belief, Stan <em>does not</em> implement NUTS:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Stan implements a dynamic Hamiltonian Monte Carlo method with multinomial sampling of dynamic length trajectories, generalized termination criterion, and improved adaptation of the Euclidean metric.</p>— Dan Simpson (<a href="https://twitter.com/dan_p_simpson">@dan_p_simpson</a>) <a href="https://twitter.com/dan_p_simpson/status/1037332473175265280">September 5, 2018</a></blockquote>
<p>And in case you’re looking for a snazzy buzzword to drop:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Adaptive HMC. <a href="https://twitter.com/betanalpha">@betanalpha</a> is reluctant to give it a more specific name because, to paraphrase, that’s just marketing bullshit that leads to us celebrating tiny implementation details rather than actual meaningful contributions to comp stats. This is a wide-ranging subtweet.</p>— Dan Simpson (<a href="https://twitter.com/dan_p_simpson">@dan_p_simpson</a>) <a href="https://twitter.com/dan_p_simpson/status/1034098649406554113">August 27, 2018</a></blockquote>
<h3 id="tensorflow-probability-aka-tfp">TensorFlow Probability (a.k.a. TFP)</h3>
<ol>
<li>TFP users write Python (albeit through an <a href="https://colcarroll.github.io/ppl-api/">extremely verbose
API</a>)</li>
<li>TFP implements their own
<a href="https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/distributions">distributions</a>
and
<a href="https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/bijectors">transforms</a>
(which TensorFlow, for some reason, calls “bijectors”). You can find more details in
<a href="https://arxiv.org/abs/1711.10604">their arXiv paper</a></li>
<li>TFP implements <a href="https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/mcmc">a ton of
MCMC</a>
algorithms and a handful of <a href="https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/vi">VI
algorithms</a>
in TensorFlow</li>
<li>TFP implements <a href="https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/optimizer">several
optimizers</a>,
including Nelder-Mead, BFGS and L-BFGS (again, in TensorFlow)</li>
<li>TFP relies on TensorFlow to compute gradients (er, duh)</li>
<li>TFP implements <a href="https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/mcmc/diagnostic.py">a handful of
metrics</a>
(e.g. effective sample size and potential scale reduction), but seems to lack a
comprehensive suite of diagnostics and visualizations: even
<a href="https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/experimental/edward2">Edward2</a>
(an experimental interface to TFP for flexible modelling, inference and criticism)
suggests that you <a href="https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/experimental/edward2/Upgrading_From_Edward_To_Edward2.md#model--inference-criticism">build your metrics manually or use boilerplate in
<code>tf.metrics</code></a></li>
</ol>
<h3 id="pymc3">PyMC3</h3>
<ol>
<li>PyMC3 users write Python code, using a context manager pattern (i.e. <code>with pm.Model() as model:</code>)</li>
<li>PyMC3 implements its own
<a href="https://github.com/pymc-devs/pymc3/tree/master/pymc3/distributions">distributions</a>
and
<a href="https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/transforms.py">transforms</a></li>
<li>PyMC3 implements
<a href="https://github.com/pymc-devs/pymc3/blob/master/pymc3/step_methods/hmc/nuts.py">NUTS</a>
(as well as <a href="https://github.com/pymc-devs/pymc3/tree/master/pymc3/step_methods">a range of other MCMC step
methods</a>) and
<a href="https://github.com/pymc-devs/pymc3/tree/master/pymc3/variational">several variational inference
algorithms</a>,
although NUTS is the default and recommended inference algorithm</li>
<li>PyMC3 (specifically, the <code>find_MAP</code> function) <a href="https://github.com/pymc-devs/pymc3/blob/master/pymc3/tuning/starting.py">relies on
<code>scipy.optimize</code></a>,
which in turn implements a BFGS-based optimizer</li>
<li>PyMC3 <a href="https://github.com/pymc-devs/pymc3/blob/master/pymc3/theanof.py">relies on
Theano</a> to compute
gradients</li>
<li>PyMC3 <a href="https://github.com/pymc-devs/pymc3/blob/master/pymc3/plots/__init__.py">delegates posterior visualization and
diagnostics</a>
to its cousin project <a href="https://arviz-devs.github.io/arviz/">ArviZ</a></li>
</ol>
<p>Some remarks:</p>
<ul>
<li>PyMC3’s context manager pattern is an interceptor for sampling statements: essentially
<a href="https://arxiv.org/abs/1811.06150">an accidental implementation of effect handlers</a>.</li>
<li>PyMC3’s distributions are simpler than those of TFP or PyTorch: they simply need to
have a <code>random</code> and a <code>logp</code> method, whereas TFP/PyTorch implement a whole bunch of
other methods to handle shapes, parameterizations, etc. In retrospect, we realize
that this is <a href="https://docs.pymc.io/developer_guide.html#what-we-got-wrong">one of PyMC3’s design
flaws</a>.</li>
</ul>
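<p>A stripped-down sketch of this interceptor pattern (hypothetical class internals, not PyMC3’s actual implementation): each distribution looks up the innermost model context and registers itself there, so model specification builds up a graph without sampling anything.</p>

```python
class Model:
    # Stack of active model contexts, so that distributions declared inside a
    # `with` block can find the model they belong to
    _context_stack = []

    def __init__(self):
        self.named_vars = {}

    def __enter__(self):
        Model._context_stack.append(self)
        return self

    def __exit__(self, *exc):
        Model._context_stack.pop()

class Normal:
    def __init__(self, name, mu, sigma):
        self.name, self.mu, self.sigma = name, mu, sigma
        # The interception: register this variable with the enclosing model
        if Model._context_stack:
            Model._context_stack[-1].named_vars[name] = self

with Model() as model:
    x = Normal("x", 0.0, 1.0)
    y = Normal("y", x, 1.0)  # y's mean refers to x, building a dependency graph

# model.named_vars now maps {"x": x, "y": y}, and nothing has been sampled
```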
<h3 id="pymc4">PyMC4</h3>
<p>PyMC4 is still under active development (at least, at the time of writing), but it’s
safe to call out the overall architecture.</p>
<ol>
<li>PyMC4 users will write Python, although now with a generator pattern (e.g. <code>x = yield Normal(0, 1, "x")</code>), instead of a context manager</li>
<li>PyMC4 will <a href="https://github.com/pymc-devs/pymc4/tree/master/pymc4/distributions/">rely on TensorFlow distributions (a.k.a.
<code>tfd</code>)</a> for both
distributions and transforms</li>
<li>PyMC4 will also <a href="https://github.com/pymc-devs/pymc4/tree/master/pymc4/inference/">rely on TensorFlow for
MCMC</a> (although the
specifics of the exact MCMC algorithm are still fairly fluid at the time of writing)</li>
<li>As far as I can tell, the optimizer is still TBD</li>
<li>Because PyMC4 relies on TFP, which relies on TensorFlow, TensorFlow manages all
gradient computations automatically</li>
<li>Like its predecessor, PyMC4 will delegate diagnostics and visualization to ArviZ</li>
</ol>
<p>Some remarks:</p>
<ul>
<li>With the generator pattern for model specification, PyMC4 embraces the notion of a
probabilistic program as one that defers its computation. For more color on this, see
<a href="https://twitter.com/avibryant/status/1150827954319982592">this Twitter thread</a> I had
with <a href="https://about.me/avibryant">Avi Bryant</a>.</li>
</ul>
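<p>A toy interpreter for the generator pattern (illustrative only, not PyMC4’s actual machinery) makes this deferral concrete: the model body does not execute until something drives the generator, and the driver decides what value each <code>yield</code> produces:</p>

```python
import random

class Normal:
    def __init__(self, loc, scale, name):
        self.loc, self.scale, self.name = loc, scale, name

    def sample(self):
        return random.gauss(self.loc, self.scale)

def model():
    # Nothing here executes until an interpreter drives the generator
    x = yield Normal(0.0, 1.0, "x")
    y = yield Normal(x, 1.0, "y")

def sample_prior(model_fn):
    # One possible interpreter: forward-sample every yielded distribution
    gen = model_fn()
    trace = {}
    try:
        dist = next(gen)            # run the model to its first yield
        while True:
            value = dist.sample()
            trace[dist.name] = value
            dist = gen.send(value)  # resume the model with the sampled value
    except StopIteration:
        return trace

random.seed(0)
trace = sample_prior(model)         # e.g. {"x": ..., "y": ...}
```

<p>A different interpreter could instead substitute observed values or accumulate log-probabilities, which is exactly why the deferred, generator-based representation is attractive.</p>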
<h3 id="pyro">Pyro</h3>
<ol>
<li>Pyro users write Python</li>
<li>Pyro <a href="https://github.com/pyro-ppl/pyro/blob/dev/pyro/distributions/__init__.py">relies on PyTorch
distributions</a>
(<a href="https://github.com/pyro-ppl/pyro/tree/dev/pyro/distributions">implementing its own where
necessary</a>), and also
relies on PyTorch distributions <a href="https://github.com/pyro-ppl/pyro/tree/dev/pyro/distributions/transforms">for its
transforms</a></li>
<li>Pyro implements <a href="http://docs.pyro.ai/en/stable/inference.html">many inference
algorithms</a> in PyTorch (including <a href="https://github.com/pyro-ppl/pyro/tree/dev/pyro/infer/mcmc">HMC
and NUTS</a>), but support
for <a href="https://github.com/pyro-ppl/pyro/blob/dev/pyro/infer/svi.py">stochastic VI</a> is
the most extensive</li>
<li>Pyro implements <a href="https://github.com/pyro-ppl/pyro/blob/master/pyro/optim/optim.py">its own
optimizer</a> in
PyTorch</li>
<li>Pyro relies on PyTorch to compute gradients (again, duh)</li>
<li>As far as I can tell, Pyro doesn’t provide any diagnostic or visualization
functionality</li>
</ol>
<p>Some remarks:</p>
<ul>
<li>Pyro includes the Poutine submodule, which is a library of composable <a href="https://arxiv.org/abs/1811.06150">effect
handlers</a>. While these might sound like recondite
abstractions, they allow you to implement your own custom inference algorithms and
otherwise manipulate Pyro probabilistic programs. In fact, all of Pyro’s inference
algorithms use these effect handlers.</li>
</ul>
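<p>To make this less recondite, here is a mini-Pyro-style sketch of effect handlers (in the spirit of Poutine, but with hypothetical names and none of its real API): handlers sit on a stack and get to inspect or override what happens at each sample site.</p>

```python
import random

HANDLER_STACK = []

class Handler:
    def __enter__(self):
        HANDLER_STACK.append(self)
        return self

    def __exit__(self, *exc):
        HANDLER_STACK.pop()

    def process(self, msg):      # runs before a value is produced
        pass

    def postprocess(self, msg):  # runs after a value is produced
        pass

class condition(Handler):
    """Clamp named sample sites to fixed values."""
    def __init__(self, data):
        self.data = data

    def process(self, msg):
        if msg["name"] in self.data:
            msg["value"] = self.data[msg["name"]]

class trace(Handler):
    """Record the value produced at every sample site."""
    def __init__(self):
        self.values = {}

    def postprocess(self, msg):
        self.values[msg["name"]] = msg["value"]

def sample(name, sampler):
    msg = {"name": name, "value": None}
    for handler in reversed(HANDLER_STACK):
        handler.process(msg)
    if msg["value"] is None:     # no handler intervened: actually sample
        msg["value"] = sampler()
    for handler in HANDLER_STACK:
        handler.postprocess(msg)
    return msg["value"]

def model():
    x = sample("x", lambda: random.gauss(0.0, 1.0))
    return sample("y", lambda: random.gauss(x, 1.0))

with trace() as tr, condition({"x": 2.0}):
    model()
# tr.values["x"] == 2.0 (clamped); tr.values["y"] was sampled around 2.0
```

<p>Conditioning on data and tracing executions are exactly the ingredients an inference algorithm needs, which is why Pyro builds all of its inference on top of such handlers.</p>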
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>In case you’re testifying under oath and need more reliable sources than
a blog post, I’ve kept a <a href="https://www.zotero.org/eigenfoo/items/collectionKey/AE8882GQ">Zotero
collection</a> for
this project. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>Universal probabilistic programming is an interesting field of inquiry,
but has mainly remained in the realm of academic research. For a (much) more
comprehensive treatment, check out <a href="http://www.robots.ox.ac.uk/~twgr/assets/pdf/rainforth2017thesis.pdf">Tom Rainforth’s PhD
thesis</a>. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3">
<p>Since publishing this blog post, I have been informed that I am more
ignorant than I know: I have forgotten
<a href="https://github.com/cscherrer/Soss.jl">Soss.jl</a> in Julia and
<a href="https://github.com/thu-ml/zhusuan">ZhuSuan</a> in Python. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4">
<p>It turns out that such transformations must be <a href="https://en.wikipedia.org/wiki/Local_diffeomorphism">local
diffeomorphisms</a>, and the
derivative information requires computing the log determinant of the Jacobian
of the transformation, commonly abbreviated to <code>log_det_jac</code> or something
similar. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5">
<p>As an aside, I’ll say that it’s mind-boggling how Stan does this. To
quote a (nameless) PyMC core developer:</p>
<blockquote>
<p>I think that maintaining your own autodifferentiation library is the
path of a crazy person.</p>
</blockquote>
 <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></li>
</ol>
</div>Graduated Cooper Union, Joining Point72https://www.georgeho.org/joining-point72/2019-07-22T00:00:00Z2019-07-22T00:00:00Z<p>Some exciting personal news: I’ve <em>(finally)</em> graduated from <a href="http://cooper.edu/welcome">The Cooper
Union</a>, and I’m joining <a href="https://www.point72.com/">Point72 Asset
Management</a> as a data scientist/research analyst!</p>
<p>Point72 is an American hedge fund, headquartered in Connecticut. I’ll be based
in New York, working out of their <a href="https://www.hudsonyardsnewyork.com/work/55-hudson-yards">Hudson
Yards</a> offices.</p>
<center>
<img
src="https://www.georgeho.org/assets/images/point72-logo.png"
alt="Point72 logo">
</center>
<p>In this next chapter of my life, my professional focuses are:</p>
<ol>
<li><strong>Keep learning.</strong> Bayesian methods and deep learning, mostly.</li>
<li><strong>Open source.</strong> I’ve been involved with developing
<a href="https://github.com/pymc-devs/pymc4">PyMC4</a>. These are exciting times for the
PyMC project: I hope to keep contributing!</li>
</ol>
<p>My four years of college were incredibly rewarding, but I’m excited to enter the
real world. Stay tuned!</p>Python Port of _Common Statistical Tests are Linear Models_https://www.georgeho.org/stat-tests-are-linear-model/2019-06-28T00:00:00Z2019-06-28T00:00:00Z<p>I ported <a href="https://lindeloev.net">Jonas Lindeløv</a>’s essay, <a href="https://lindeloev.github.io/tests-as-linear/"><em>Common Statistical
Tests are Linear Models</em></a> from R
to Python. Check it out on <a href="https://www.georgeho.org/tests-as-linear/">my
blog</a>,
<a href="https://github.com/eigenfoo/tests-as-linear">GitHub</a>, or
<a href="https://gke.mybinder.org/v2/gh/eigenfoo/tests-as-linear/master?filepath=tests-as-linear.ipynb">Binder</a>!</p>Decaying Evidence and Contextual Bandits — Bayesian Reinforcement Learning (Part 2)https://www.georgeho.org/bayesian-bandits-2/2019-06-02T00:00:00Z2019-06-02T00:00:00Z<blockquote>
<p>This is the second of a two-part series about Bayesian bandit algorithms.
Check out the first post <a href="https://www.georgeho.org/bayesian-bandits/">here</a>.</p>
</blockquote>
<p><a href="https://www.georgeho.org/bayesian-bandits/">Previously</a>, I introduced the
multi-armed bandit problem, and a Bayesian approach to solving/modelling it
(Thompson sampling). We saw that conjugate models made it possible to run the
bandit algorithm online: the same is even true for non-conjugate models, so long
as the rewards are bounded.</p>
<p>In this follow-up blog post, we’ll take a look at two extensions to the
multi-armed bandit. The first allows the bandit to model nonstationary rewards
distributions, whereas the second allows the bandit to model context. Jump in!</p>
<figure>
<a href="https://www.georgeho.org/assets/images/multi-armed-bandit.jpg"><img src="https://www.georgeho.org/assets/images/multi-armed-bandit.jpg" alt="Cartoon of a multi-armed bandit"></a>
<figcaption>An example of a multi-armed bandit situation. Source: <a href="https://www.inverse.com/article/13762-how-the-multi-armed-bandit-determines-what-ads-and-stories-you-see-online">Inverse</a>.</figcaption>
</figure>
<h2 id="nonstationary-bandits">Nonstationary Bandits</h2>
<p>Up until now, we’ve concerned ourselves with stationary bandits: in other words,
we assumed that the rewards distribution for each arm did not change over time.
In the real world though, rewards distributions need not be stationary: customer
preferences change, trading algorithms deteriorate, and news articles rise and
fall in relevance.</p>
<p>Nonstationarity could mean one of two things for us:</p>
<ol>
<li>either we are lucky enough to know that rewards are similarly distributed
throughout all time (e.g. the rewards are always normally distributed, or
always binomially distributed), and that it is merely the parameters of these
distributions that are liable to change,</li>
<li>or we aren’t so lucky, and the rewards distributions are not only changing,
but don’t even have a nice parametric form.</li>
</ol>
<p>Good news, though: there is a neat trick to deal with both forms of
nonstationarity!</p>
<h3 id="decaying-evidence-and-posteriors">Decaying evidence and posteriors</h3>
<p>But first, some notation. Suppose we have a model with parameters $\theta$. We
place a prior $\color{purple}{\pi_0(\theta)}$ on it<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, and at the $t$‘th
time step, we observe data $D_t$, compute the likelihood $\color{blue}{P(D_t
| \theta)}$ and update the posterior from $\color{red}{\pi_t(\theta |
D_{1:t})}$ to $\color{green}{\pi_{t+1}(\theta | D_{1:t+1})}$.</p>
<p>This is a quintessential application of Bayes’ Theorem. Mathematically:</p>
<p>$$ \color{green}{\pi_{t+1}(\theta | D_{1:t+1})} \propto \color{blue}{P(D_{t+1} |
\theta)} \cdot \color{red}{\pi_t (\theta | D_{1:t})} \tag{1} \label{1} $$</p>
<p>However, for problems with nonstationary rewards distributions, we would like
data points observed a long time ago to have less weight than data points
observed recently. This is only prudent: in the absence of recent data, we would
like to adopt a more conservative “no-data” prior, rather than allow our
posterior to be informed by outdated data. This can be achieved by modifying the
Bayesian update to:</p>
<p>$$ \color{green}{\pi_{t+1}(\theta | D_{1:t+1})} \propto \color{magenta}{[}
\color{blue}{P(D_{t+1} | \theta)} \cdot \color{red}{\pi_t (\theta | D_{1:t})}
{\color{magenta}{]^{1-\epsilon}}} \cdot
\color{purple}{\pi_0(\theta)}^\color{magenta}{\epsilon} \tag{2} \label{2} $$</p>
<p>for some $0 < \color{magenta}{\epsilon} \ll 1$. We can think of
$\color{magenta}{\epsilon}$ as controlling the rate of decay of the
evidence/posterior (i.e. how quickly we should distrust past data points).
Notice that if we stop observing data points at time $T$, then
$\color{red}{\pi_t(\theta | D_{1:T})} \rightarrow
\color{purple}{\pi_0(\theta)}$ as $t \rightarrow \infty$.</p>
<p>Decaying the evidence (and therefore the posterior) can be used to address both
types of nonstationarity identified above. Simply use $(\ref{2})$ as a drop-in
replacement for $(\ref{1})$ when updating the hyperparameters. Whether you’re
using a conjugate model or the algorithm by <a href="https://arxiv.org/abs/1111.1797">Agarwal and
Goyal</a> (introduced in <a href="https://www.georgeho.org/bayesian-bandits">the previous blog
post</a>), using $(\ref{2})$ will decay
the evidence and posterior, as desired.</p>
<p>For more information (and a worked example for the Beta-Binomial model!), check
out <a href="https://austinrochford.com/resources/talks/boston-bayesians-2017-bayes-bandits.slides.html#/3">Austin Rochford’s talk for Boston
Bayesians</a>
about Bayesian bandit algorithms for e-commerce.</p>
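<p>For the Beta-Bernoulli model, update $(\ref{2})$ has a simple closed form, since raising Beta densities to powers just mixes their exponents linearly, so the decayed posterior is again a Beta distribution. A minimal sketch:</p>

```python
import random

def decayed_update(a, b, reward, a0=1.0, b0=1.0, eps=0.01):
    # Equation (2) for a Beta(a, b) posterior, a Bernoulli reward, and a
    # Beta(a0, b0) prior: convex combination of the Beta exponents
    a = (1 - eps) * (a + reward - 1) + eps * (a0 - 1) + 1
    b = (1 - eps) * (b + (1 - reward) - 1) + eps * (b0 - 1) + 1
    return a, b

a, b = 1.0, 1.0
for _ in range(1000):                # the arm pays out for a while...
    a, b = decayed_update(a, b, reward=1)
for _ in range(1000):                # ...then stops paying out
    a, b = decayed_update(a, b, reward=0)

posterior_mean = a / (a + b)         # back near 0: old successes decayed away
theta = random.betavariate(a, b)     # Thompson sampling draws from the posterior
```

<p>Note also that the decay caps the effective sample size: with $\epsilon = 0.01$, the pseudo-counts can never exceed $1/\epsilon = 100$, so the posterior never becomes overconfident.</p>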
<h2 id="contextual-bandits">Contextual Bandits</h2>
<p>We can think of the multi-armed bandit problem as follows<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>:</p>
<ol>
<li>A policy chooses an arm $a$ from $k$ arms.</li>
<li>The world reveals the reward $R_a$ of the chosen arm.</li>
</ol>
<p>However, this formulation fails to capture an important phenomenon: there is
almost always extra information that is available when making each decision.
For instance, online ads occur in the context of the web page in which they
appear, and online store recommendations are given in the context of the user’s
current cart contents (among other things).</p>
<p>To take advantage of this information, we might think of a different formulation
where, on each round:</p>
<ol>
<li>The world announces some context information $x$.</li>
<li>A policy chooses an arm $a$ from $k$ arms.</li>
<li>The world reveals the reward $R_a$ of the chosen arm.</li>
</ol>
<p>In other words, contextual bandits call for some way of taking context as input
and producing arms/actions as output.</p>
<p>Alternatively, if you think of regular multi-armed bandits as taking no input
whatsoever (but still producing outputs, the arms to pull), you can think of
contextual bandits as algorithms that both take inputs and produce outputs.</p>
<h3 id="bayesian-contextual-bandits">Bayesian contextual bandits</h3>
<p>Contextual bandits give us a very general framework for thinking about
sequential decision making (and reinforcement learning). Clearly, there are many
ways to make a bandit algorithm take context into account. Linear regression is
a straightforward and classic example: simply assume that the rewards depend
linearly on the context.</p>
<p>For a refresher on the details of Bayesian linear regression, refer to <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book"><em>Pattern
Recognition and Machine
Learning</em></a>
by Christopher Bishop: specifically, section 3.3 on Bayesian linear regression
and exercises 3.12 and 3.13<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>. Briefly though, if we place a Gaussian prior on
the regression weights and an inverse gamma prior on the noise parameter (i.e.,
the noise of the observations), then their joint prior will be conjugate to a
Gaussian likelihood, and the posterior predictive distribution for the rewards
will be a Student’s $t$.</p>
<p>Since we need to maintain posteriors of the rewards for each arm (so that we can
do Thompson sampling), we need to run a separate Bayesian linear regression for
each arm. At every iteration we then Thompson sample from each Student’s $t$
posterior, and select the arm with the highest sample.</p>
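<p>Here is a heavily simplified sketch of this per-arm scheme: a single scalar context feature and a known noise variance, so each arm’s posterior stays Gaussian instead of Student’s $t$ (the class and its internals are hypothetical):</p>

```python
import random

class BayesLinearArm:
    """One arm's Bayesian linear regression: prior w ~ N(0, s0^2) on a scalar
    weight, known observation noise sigma^2, so the posterior stays Gaussian."""

    def __init__(self, s0=10.0, sigma=1.0):
        self.precision = 1.0 / s0 ** 2  # posterior precision of w
        self.xy = 0.0                   # running sum of x * y / sigma^2
        self.sigma2 = sigma ** 2

    def update(self, x, y):
        # Conjugate update after observing reward y in context x
        self.precision += x * x / self.sigma2
        self.xy += x * y / self.sigma2

    def thompson_sample(self, x):
        mean = self.xy / self.precision
        std = self.precision ** -0.5
        w = random.gauss(mean, std)     # draw one plausible weight...
        return w * x                    # ...and score this arm in context x

random.seed(1)
arms = [BayesLinearArm() for _ in range(2)]
true_w = [0.2, 1.0]                     # arm 1 is better in every context here
for _ in range(500):
    x = random.uniform(0.0, 1.0)        # the context for this round
    k = max(range(2), key=lambda i: arms[i].thompson_sample(x))
    arms[k].update(x, true_w[k] * x + random.gauss(0.0, 1.0))
```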
<p>However, Bayesian linear regression is a textbook example of a model that lacks
expressiveness: in most circumstances, we want something that can model
nonlinear functions as well. One (perfectly valid) way of doing this would be to
hand-engineer some nonlinear features and/or basis functions before feeding them
into a Bayesian linear regression. However, in the 21st century, the trendier
thing to do is to have a neural network learn those features for you. This is
exactly what is proposed in a <a href="https://arxiv.org/abs/1802.09127">ICLR 2018 paper from Google
Brain</a>. They find that this model — which they
call <code>NeuralLinear</code> — performs decently well across a variety of tasks, even
compared to other bandit algorithms. In the words of the authors:</p>
<blockquote>
<p>We believe [<code>NeuralLinear</code>’s] main strength is that it is able to
<em>simultaneously</em> learn a data representation that greatly simplifies the task
at hand, and to accurately quantify the uncertainty over linear models that
explain the observed rewards in terms of the proposed representation.</p>
</blockquote>
<p>For more information, be sure to check out the <a href="https://arxiv.org/abs/1802.09127">Google Brain
paper</a> and the accompanying <a href="https://github.com/tensorflow/models/tree/master/research/deep_contextual_bandits">TensorFlow
code</a>.</p>
<h2 id="further-reading">Further Reading</h2>
<p>For non-Bayesian approaches to contextual bandits, <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-algorithms">Vowpal
Wabbit</a>
is a great resource: <a href="http://hunch.net/~jl/">John Langford</a> and the team at
<a href="https://www.microsoft.com/research/">Microsoft Research</a> have <a href="https://arxiv.org/abs/1402.0555v2">extensively
researched</a> contextual bandit algorithms.
They’ve provided blazingly fast implementations of recent algorithms and written
good documentation for them.</p>
<p>For the theory and math behind bandit algorithms, <a href="https://banditalgs.com/">Tor Lattimore and Csaba
Szepesvári’s book</a> covers a breathtaking amount of
ground.</p>
<blockquote>
<p>This is the second of a two-part series about Bayesian bandit algorithms.
Check out the first post <a href="https://www.georgeho.org/bayesian-bandits/">here</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>Did you know you can make <a href="http://adereth.github.io/blog/2013/11/29/colorful-equations/">colored equations with
MathJax</a>?
Technology frightens me sometimes. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>This explanation is largely drawn from <a href="http://hunch.net/?p=298">from John Langford’s
<code>hunch.net</code></a>. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3">
<p>If you don’t want to do Bishop’s exercises, there’s a partially complete
solutions manual <a href="https://github.com/GoldenCheese/PRML-Solution-Manual/">on
GitHub</a> 😉 <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Autoregressive Models in Deep Learning — A Brief Surveyhttps://www.georgeho.org/deep-autoregressive-models/2019-03-09T00:00:00Z2019-03-09T00:00:00Z<p>My current project involves working with deep autoregressive models: a class of
remarkable neural networks that aren’t usually seen on a first pass through deep
learning. These notes are a quick write-up of my reading and research: I assume
basic familiarity with deep learning, and aim to highlight general trends and
similarities across autoregressive models, instead of commenting on individual
architectures.</p>
<p><strong>tldr:</strong> <em>Deep autoregressive models are sequence models, yet feed-forward
(i.e. not recurrent); generative models, yet supervised. They are a compelling
alternative to RNNs for sequential data, and GANs for generation tasks.</em></p>
<h2 id="deep-autoregressive-models">Deep Autoregressive Models</h2>
<p>To be explicit (at the expense of redundancy), this blog post is about <em>deep
autoregressive generative sequence models</em>. That’s quite a mouthful of jargon
(and two of those words are actually unnecessary), so let’s unpack that.</p>
<ol>
<li>
<p>Deep</p>
<ul>
<li>Well, these papers are using TensorFlow or PyTorch… so they must be
“deep” 😉</li>
<li>You would think this word is unnecessary, but it’s actually not!
Autoregressive linear models like
<a href="https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model">ARMA</a>
or
<a href="https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH</a>
have been used in statistics, econometrics and financial modelling for
ages.</li>
</ul>
</li>
<li>
<p>Autoregressive</p>
<ul>
<li>
<p><a href="https://deepgenerativemodels.github.io/notes/autoregressive/">Stanford has a good
introduction</a>
to autoregressive models, but I think a good way to explain these models is
to compare them to recurrent neural networks (RNNs), which are far more
well-known.</p>
<figure>
<a href="https://www.georgeho.org/assets/images/rnn-unrolled.png"><img src="https://www.georgeho.org/assets/images/rnn-unrolled.png" alt="Recurrent neural network (RNN) block diagram, both rolled and unrolled"></a>
<figcaption>Obligatory RNN diagram. Source: <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">Chris Olah</a>.</figcaption>
</figure>
<ul>
<li>
<p>Like an RNN, an autoregressive model’s output $h_t$ at time $t$
depends on not just $x_t$, but also $x$’s from previous time steps.
However, <em>unlike</em> an RNN, the previous $x$’s are not provided via some
hidden state: they are given as just another input to the model.</p>
</li>
<li>
<p>The following animation of Google DeepMind’s WaveNet illustrates this
well: the $t$th output is generated in a <em>feed-forward</em> fashion from
several input $x$ values.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<figure>
<a href="https://www.georgeho.org/assets/images/wavenet-animation.gif"><img src="https://www.georgeho.org/assets/images/wavenet-animation.gif" alt="WaveNet animation"></a>
<figcaption>WaveNet animation. Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Google DeepMind</a>.</figcaption>
</figure>
</li>
<li>
<p>Put simply, <strong>an autoregressive model is merely a feed-forward model which
predicts future values from past values.</strong></p>
</li>
<li>
<p>I’ll explain this more later, but it’s worth saying now: autoregressive
models offer a compelling bargain. You can have stable, parallel and
easy-to-optimize training, faster inference computations, and completely
do away with the fickleness of <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">truncated backpropagation through
time</a>, if you
are willing to accept a model that (by design) <em>cannot have</em> infinite
memory. There is <a href="http://www.offconvex.org/2018/07/27/approximating-recurrent/">recent
research</a> to
suggest that this is a worthwhile tradeoff.</p>
</li>
</ul>
</li>
</ul>
</li>
<li>
<p>Generative</p>
<ul>
<li>Informally, a generative model is one that can generate new data after
learning from the dataset.</li>
<li>More formally, a generative model models the joint distribution $P(X, Y)$
of the observation $X$ and the target $Y$. Contrast this to a
discriminative model that models the conditional distribution $P(Y|X)$.</li>
<li>GANs and VAEs are two families of popular generative models.</li>
<li>This is unnecessary word #1: any autoregressive model can be run
sequentially to generate a new sequence! Start with your seed $x_1, x_2,
…, x_k$ and predict $x_{k+1}$. Then use $x_2, x_3, …, x_{k+1}$ to
predict $x_{k+2}$, and so on.</li>
</ul>
</li>
<li>
<p>Sequence model</p>
<ul>
<li>Fairly self explanatory: a model that deals with sequential data, whether
it is mapping sequences to scalars (e.g. language models), or mapping
sequences to sequences (e.g. machine translation models).</li>
<li>Although sequence models are designed for sequential data (duh), there has
been success at applying them to non-sequential data. For example,
PixelCNN (discussed below) can generate entire images, even though images
are not sequential in nature: the model generates a pixel at a time, in
sequence!<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></li>
<li>Notice that an autoregressive model must be a sequence model, so it’s
redundant to further describe these models as sequential (which makes this
unnecessary word #2).</li>
</ul>
</li>
</ol>
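<p>The generate-from-a-seed procedure described under “Generative” above can be written down directly; here a fixed linear rule stands in for the deep model:</p>

```python
def generate(predict, seed, steps):
    """Run any autoregressive model sequentially: predict x_{k+1} from the
    last k values, slide the window forward, and repeat."""
    window = list(seed)
    k = len(window)
    for _ in range(steps):
        window.append(predict(window[-k:]))
    return window

# A toy "model": the next value is a fixed linear combination of the last two
# (a deep autoregressive model would replace this with a neural network)
predict = lambda xs: 0.5 * xs[-1] + 0.5 * xs[-2]

sequence = generate(predict, seed=[0.0, 1.0], steps=4)
# sequence == [0.0, 1.0, 0.5, 0.75, 0.625, 0.6875]
```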
<p>A good distinction is that “generative” and “sequential” describe <em>what</em> these
models do, or what kind of data they deal with. “Autoregressive” describes <em>how</em>
these models do what they do: i.e. they describe properties of the network or
its architecture.</p>
<h2 id="some-architectures-and-applications">Some Architectures and Applications</h2>
<p>Deep autoregressive models have seen a good degree of success: below is a list
of some examples. Each architecture merits exposition and discussion, but
unfortunately there isn’t enough space here to do any of them justice.</p>
<ul>
<li><a href="https://arxiv.org/abs/1601.06759">PixelCNN by Google DeepMind</a> was probably
the first deep autoregressive model, and the progenitor of most of the other
models below. Ironically, the authors spend the bulk of the paper discussing a
recurrent model, PixelRNN, and consider PixelCNN as a “workaround” to avoid
excessive computation. However, PixelCNN is probably this paper’s most lasting
contribution.</li>
<li><a href="https://arxiv.org/abs/1701.05517">PixelCNN++ by OpenAI</a> is, unsurprisingly,
PixelCNN but with various improvements.</li>
<li><a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">WaveNet by Google
DeepMind</a> is
heavily inspired by PixelCNN, and models raw audio, not just encoded music.
They had to pull <a href="https://en.wikipedia.org/wiki/%CE%9C-law_algorithm">a neat trick from telecommunications/signals
processing</a> in order to
cope with the sheer size of audio (high-quality audio involves at least 16-bit
precision samples, which means a 65,536-way softmax per time step!).</li>
<li><a href="https://arxiv.org/abs/1706.03762">Transformer, a.k.a. <em>the “attention is all you need” model</em> by Google
Brain</a> is now a mainstay of NLP, performing
very well at many NLP tasks and being incorporated into subsequent models like
<a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a>.</li>
</ul>
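<p>To make that trick concrete, here is a minimal sketch of μ-law companding in NumPy (an illustration, not WaveNet’s actual code): it compresses a waveform in $[-1, 1]$ into 256 integer levels, shrinking the per-step softmax from 65,536 ways to 256.</p>

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compress a waveform in [-1, 1] into mu + 1 integer levels."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] onto the integer bins {0, ..., mu}
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(bins, mu=255):
    """Invert the companding (up to quantization error)."""
    compressed = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

audio = np.linspace(-1, 1, 5)
bins = mu_law_encode(audio)    # 256-way classification targets
restored = mu_law_decode(bins)
```

<p>Because the μ-law spacing is logarithmic, quiet samples (where human hearing is most sensitive) keep most of their precision.</p>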
<p>These models have also found applications: for example, <a href="https://arxiv.org/abs/1610.10099">Google DeepMind’s
ByteNet can perform neural machine translation (in linear
time!)</a> and <a href="https://arxiv.org/abs/1610.00527">Google DeepMind’s Video Pixel
Network can model video</a>.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<h2 id="some-thoughts-and-observations">Some Thoughts and Observations</h2>
<ol>
<li>
<p>Given previous values $x_1, x_2, …, x_t$, these models do not output a
<em>value</em> for $x_{t+1}$, they output the <em>predictive probability
distribution</em> $P(x_{t+1} | x_1, x_2, …, x_t)$ for $x_{t+1}$.</p>
<ul>
<li>If the $x$’s are discrete, then you can do this by outputting an $N$-way
softmaxxed tensor, where $N$ is the number of discrete classes. This is
what the original PixelCNN did, but gets problematic when $N$ is large
(e.g. in the case of WaveNet, where $N = 2^{16}$).</li>
<li>If the $x$’s are continuous, you can model the probability distribution
itself as a sum of basis functions, and have the model output the
parameters of these basis functions. This massively reduces the memory
footprint of the model, and was an important contribution of PixelCNN++.</li>
<li>Theoretically you could have an autoregressive model that <em>doesn’t</em> model
the conditional distribution… but most recent models do.</li>
</ul>
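<p>To illustrate point 1 with a toy stand-in for a trained network (the <code>model</code> below ignores its context entirely — it is a placeholder): the output is a full predictive distribution that we <em>sample</em> from, not a single value.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256  # number of discrete classes (e.g. pixel intensities in PixelCNN)

def model(context):
    """Hypothetical stand-in for a trained network mapping context to logits."""
    return rng.normal(size=N)  # a real model would actually use `context`

logits = model(context=[12, 47, 203])
probs = np.exp(logits - logits.max())
probs /= probs.sum()             # softmax: the predictive distribution
x_next = rng.choice(N, p=probs)  # sample x_{t+1} rather than output a value
```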
</li>
<li>
<p>Autoregressive models are supervised.</p>
<ul>
<li>With the success and hype of GANs and VAEs, it is easy to assume that all
generative models are unsupervised: this is not true!</li>
<li>This means that training is stable and highly parallelizable, that it
is straightforward to tune hyperparameters, and that inference is
computationally inexpensive. We can also break out all the good stuff from
ML-101: train-valid-test splits, cross validation, loss metrics, etc. These
are all things that we lose when we resort to e.g. GANs.</li>
</ul>
</li>
<li>
<p>Autoregressive models work on both continuous and discrete data.</p>
<ul>
<li>Autoregressive sequential models have worked for audio (WaveNet), images
(PixelCNN++) and text (Transformer): these models are very flexible in the
kind of data that they can model.</li>
<li>Contrast this to GANs, which (as far as I’m aware) cannot model discrete
data.</li>
</ul>
</li>
<li>
<p>Autoregressive models are very amenable to conditioning.</p>
<ul>
<li>There are many options for conditioning! You can condition on both discrete
and continuous variables; you can condition at multiple time scales; you can
even condition on latent embeddings or the outputs of other neural networks.</li>
<li>There is one ostensible problem with using autoregressive models as
generative models: you can only condition on your data’s labels. I.e.
unlike a GAN, you cannot condition on random noise and expect the model to
shape the noise space into a semantically (stylistically) meaningful latent
space.</li>
<li>Google DeepMind followed up their original PixelRNN paper with <a href="https://arxiv.org/abs/1606.05328">another
paper</a> that describes one way to overcome
this problem. Briefly: to condition, they incorporate the latent vector into
the PixelCNN’s activation functions; to produce/learn the latent vectors,
they use a convolutional encoder; and to generate an image given a latent
vector, they replace the traditional deconvolutional decoder with a
conditional PixelCNN.</li>
<li>WaveNet goes even further and employs “global” and “local” conditioning (both
are achieved by incorporating the latent vectors into WaveNet’s activation
functions). The authors devise a battery of conditioning schemes to capture
speaker identity, linguistic features of input text, music genre, musical
instrument, etc.</li>
</ul>
</li>
<li>
<p>Generating output sequences of variable length is not straightforward.</p>
<ul>
<li>Neither WaveNet nor PixelCNN needed to worry about a variable output length:
both audio and images are comprised of a fixed number of outputs (i.e. audio
is just $N$ samples, and images are just $N^2$ pixels).</li>
<li>Text, on the other hand, is different: sentences can be of variable length.
One would think that this is a nail in a coffin, but thankfully text is
discrete: the standard trick is to have a “stop token” that indicates that
the sentence is finished (i.e. model a full stop as its own token).</li>
<li>As far as I am aware, there is no prior literature on having both problems:
a variable-length output of continuous values.</li>
</ul>
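<p>The stop-token trick looks like this in miniature (the vocabulary and the random placeholder model are invented for illustration, but the generation loop has the same shape as the real thing):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
VOCAB = ["the", "cat", "sat", "<stop>"]  # the stop token is its own class
STOP = VOCAB.index("<stop>")

def next_token_probs(prefix):
    """Placeholder for an autoregressive model's predictive distribution."""
    return rng.dirichlet(np.ones(len(VOCAB)))  # a real model would use `prefix`

tokens, max_len = [], 20
while len(tokens) < max_len:
    t = rng.choice(len(VOCAB), p=next_token_probs(tokens))
    if t == STOP:  # the model itself decides when the sequence ends
        break
    tokens.append(VOCAB[t])
```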
</li>
<li>
<p>Autoregressive models can model multiple time scales</p>
<ul>
<li>
<p>In the case of music, there are important patterns to model at multiple
time scales: individual musical notes drive correlations between audio
samples at the millisecond scale, and music exhibits rhythmic patterns
over the course of minutes. This is well illustrated by the following
animation:</p>
<figure>
<a href="https://www.georgeho.org/assets/images/audio-animation.gif"><img src="https://www.georgeho.org/assets/images/audio-animation.gif" alt="Audio at multiple time scales"></a>
<figcaption>Audio exhibits patterns at multiple time scales. Source: <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Google DeepMind</a>.</figcaption>
</figure>
</li>
<li>
<p>There are two main ways to model patterns at multiple time scales:
either make the receptive field of your model <em>extremely</em> wide (e.g.
through dilated convolutions), or condition your model on a subsampled
version of your generated output, which is in turn produced by an
unconditioned model.</p>
<ul>
<li>Google DeepMind composes an unconditional PixelRNN with one or more
conditional PixelRNNs to form a so-called “multi-scale” PixelRNN: the
first PixelRNN generates a lower-resolution image that conditions the
subsequent PixelRNNs.</li>
<li>WaveNet employs a different technique and calls them “context stacks”.</li>
</ul>
</li>
</ul>
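<p>The first option is cheap to reason about: with a kernel size of 2 and dilations doubling each layer (the WaveNet recipe), the receptive field grows exponentially with depth. A back-of-the-envelope calculation:</p>

```python
# Receptive field of a stack of dilated causal convolutions
# with dilations doubling each layer, as in WaveNet.
kernel_size = 2
dilations = [2**i for i in range(10)]  # 1, 2, 4, ..., 512

# Each layer extends the receptive field by (kernel_size - 1) * dilation
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
# 10 layers cover 1 + (1 + 2 + ... + 512) = 1024 samples
```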
</li>
<li>
<p>How the hell can any of this stuff work?</p>
<ul>
<li>
<p>RNNs are theoretically more expressive and powerful than autoregressive
models. However, recent work suggests that such infinite-horizon memory is
seldom achieved in practice.</p>
</li>
<li>
<p>To quote <a href="http://www.offconvex.org/2018/07/27/approximating-recurrent/">John Miller at the Berkeley AI Research
lab</a>:</p>
<blockquote>
<p><strong>Recurrent models trained in practice are effectively feed-forward.</strong>
This could happen either because truncated backpropagation through time
cannot learn patterns significantly longer than $k$ steps, or, more
provocatively, because models <em>trainable by gradient descent</em> cannot have
long-term memory.</p>
</blockquote>
</li>
</ul>
</li>
</ol>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>There’s actually a lot more nuance than meets the eye in this animation,
but all I’m trying to illustrate is the feed-forward nature of autoregressive
models. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>I personally think it’s breathtaking that machines can do this. Imagine
your phone keyboard’s word suggestions (those are autoregressive!) spitting
out an entire novel. Or imagine knitting a sweater where you had to choose the
color of every stitch, in order, in advance. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3">
<p>In case you haven’t noticed, Google DeepMind seemed to have an
infatuation with autoregressive models back in 2016. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Modern Computational Methods for Bayesian Inference — A Reading Listhttps://www.georgeho.org/bayesian-inference-reading/2019-01-02T00:00:00Z2019-01-02T00:00:00Z<p>Lately I’ve been troubled by how little I actually knew about how Bayesian
inference <em>really worked</em>. I could explain to you <a href="https://maria-antoniak.github.io/2018/11/19/data-science-crash-course.html">many other machine learning
techniques</a>,
but with Bayesian modelling… well, there’s a model (which is basically the
likelihood, I think?), and then there’s a prior, and then, um…</p>
<p>What actually happens when you run a sampler? What makes inference
“variational”? And what is this automatic differentiation doing in my
variational inference? <em>Cue long sleepless nights, contemplating my own
ignorance.</em></p>
<p>So to celebrate the new year<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>, I compiled a list of things to read — blog
posts, journal papers, books, anything that would help me understand (or at
least, appreciate) the math and computation that happens when I press the <em>Magic
Inference Button™</em>. Again, this reading list isn’t focused on how to use
Bayesian modelling for a <em>specific</em> use case<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>; it’s focused on how modern
computational methods for Bayesian inference work <em>in general</em>.</p>
<p>So without further ado…</p>
<div>
<h2>Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#markov-chain-monte-carlo">Markov-Chain Monte Carlo</a>
<ul>
<li><a href="#for-the-uninitiated">For the uninitiated</a></li>
<li><a href="#hamiltonian-monte-carlo-and-the-no-u-turn-sampler">Hamiltonian Monte Carlo and the No-U-Turn Sampler</a></li>
<li><a href="#sequential-monte-carlo-and-other-sampling-methods">Sequential Monte Carlo and other sampling methods</a></li>
</ul>
</li>
<li><a href="#variational-inference">Variational Inference</a>
<ul>
<li><a href="#for-the-uninitiated-1">For the uninitiated</a></li>
<li><a href="#automatic-differentiation-variational-inference-advi">Automatic differentiation variational inference (ADVI)</a></li>
</ul>
</li>
<li><a href="#open-source-software-for-bayesian-inference">Open-Source Software for Bayesian Inference</a></li>
<li><a href="#further-topics">Further Topics</a>
<ul>
<li><a href="#approximate-bayesian-computation-abc-and-likelihood-free-methods">Approximate Bayesian computation (ABC) and likelihood-free methods</a></li>
<li><a href="#expectation-propagation">Expectation propagation</a></li>
<li><a href="#operator-variational-inference-opvi">Operator variational inference (OPVI)</a></li>
</ul>
</li>
</ul>
</nav>
</div>
<h2 id="markov-chain-monte-carlo">Markov-Chain Monte Carlo</h2>
<h3 id="for-the-uninitiated">For the uninitiated</h3>
<ol>
<li><a href="https://twiecki.github.io/blog/2015/11/10/mcmc-sampling/">MCMC Sampling for
Dummies</a> by Thomas
Wiecki. A basic introduction to MCMC with accompanying Python snippets. The
Metropolis sampler is used as an introduction to sampling.</li>
<li><a href="http://www.mcmchandbook.net/HandbookChapter1.pdf">Introduction to Markov Chain Monte
Carlo</a> by Charles Geyer.
The first chapter of the aptly-named <a href="http://www.mcmchandbook.net/"><em>Handbook of Markov Chain Monte
Carlo</em></a>.</li>
<li><a href="https://arxiv.org/abs/2001.06249">Markov Chain Monte Carlo Methods, a survey with some frequent
misunderstandings</a> is an instructive
collection of Cross-Validated questions that clear up common
misunderstandings of MCMC.</li>
</ol>
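<p>For a taste of what these readings cover, here is a minimal random-walk Metropolis sampler in NumPy, targeting a standard normal (any unnormalized log-density would work in its place):</p>

```python
import numpy as np

def metropolis(log_prob, n_samples=5000, step=1.0, seed=0):
    """Random-walk Metropolis: the simplest MCMC sampler."""
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()  # symmetric proposal
        # Accept with probability min(1, p(proposal) / p(x))
        if np.log(rng.uniform()) < log_prob(proposal) - log_prob(x):
            x = proposal
        samples[i] = x  # on rejection, the current point is repeated
    return samples

draws = metropolis(lambda x: -0.5 * x**2)  # unnormalized N(0, 1) log-density
```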
<h3 id="hamiltonian-monte-carlo-and-the-no-u-turn-sampler">Hamiltonian Monte Carlo and the No-U-Turn Sampler</h3>
<ol>
<li><a href="https://arogozhnikov.github.io/2016/12/19/markov_chain_monte_carlo.html">Hamiltonian Monte Carlo
explained</a>.
A visual and intuitive explanation of HMC: great for starters.</li>
<li><a href="https://arxiv.org/abs/1701.02434">A Conceptual Introduction to Hamiltonian Monte
Carlo</a> by Michael Betancourt. An excellent
paper for a solid conceptual understanding and principled intuition for HMC.</li>
<li><a href="https://colindcarroll.com/2019/04/06/exercises-in-automatic-differentiation-using-autograd-and-jax/">Exercises in Automatic Differentiation using <code>autograd</code> and
<code>jax</code></a>
by Colin Carroll. This is the first in a series of blog posts that explain
HMC from the very beginning. See also <a href="https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/">Hamiltonian Monte Carlo from
Scratch</a>,
<a href="https://colindcarroll.com/2019/04/21/step-size-adaptation-in-hamiltonian-monte-carlo/">Step Size Adaptation in Hamiltonian Monte
Carlo</a>,
and <a href="https://colindcarroll.com/2019/04/28/choice-of-symplectic-integrator-in-hamiltonian-monte-carlo/">Choice of Symplectic Integrator in Hamiltonian Monte
Carlo</a>.</li>
<li><a href="https://arxiv.org/abs/1111.4246">The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte
Carlo</a> by Matthew Hoffman and Andrew Gelman.
The original NUTS paper.</li>
<li><a href="http://www.mcmchandbook.net/HandbookChapter5.pdf">MCMC Using Hamiltonian
Dynamics</a> by Radford Neal.</li>
<li><a href="https://colindcarroll.com/talk/hamiltonian-monte-carlo/">Hamiltonian Monte Carlo in
PyMC3</a> by Colin
Carroll.</li>
</ol>
<h3 id="sequential-monte-carlo-and-other-sampling-methods">Sequential Monte Carlo and other sampling methods</h3>
<ol>
<li>Chapter 11 (Sampling Methods) of <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book">Pattern Recognition and Machine
Learning</a>
by Christopher Bishop. Covers rejection, importance, Metropolis-Hastings,
Gibbs and slice sampling. Perhaps not as rampantly useful as NUTS, but good
to know nevertheless.</li>
<li><a href="https://chi-feng.github.io/mcmc-demo/">The Markov-chain Monte Carlo Interactive
Gallery</a> by Chi Feng. A fantastic
library of visualizations of various MCMC samplers.</li>
<li>For non-Markov chain based Monte Carlo methods, there is <a href="https://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf">An Introduction to
Sequential Monte Carlo
Methods</a>
by Arnaud Doucet, Nando de Freitas and Neil Gordon. This chapter from <a href="https://www.springer.com/us/book/9780387951461">the
authors’ textbook on SMC</a>
provides motivation for using SMC methods, and gives a brief introduction to
a basic particle filter.</li>
<li><a href="http://www.stats.ox.ac.uk/~doucet/smc_resources.html">Sequential Monte Carlo Methods & Particle Filters
Resources</a> by Arnaud
Doucet. A list of resources on SMC and particle filters: way more than you
probably ever need to know about them.</li>
</ol>
<h2 id="variational-inference">Variational Inference</h2>
<h3 id="for-the-uninitiated-1">For the uninitiated</h3>
<ol>
<li><a href="http://willwolf.io/2018/11/11/em-for-lda/">Deriving
Expectation-Maximization</a> by Will
Wolf. The first blog post in a series that builds from EM all the way to VI.
Also check out <a href="http://willwolf.io/2018/11/23/mean-field-variational-bayes/">Deriving Mean-Field Variational
Bayes</a>.</li>
<li><a href="https://arxiv.org/abs/1601.00670">Variational Inference: A Review for
Statisticians</a> by David Blei, Alp
Kucukelbir and Jon McAuliffe. A high-level overview of variational
inference: the authors go over one example (performing VI on GMMs) in depth.</li>
<li>Chapter 10 (Approximate Inference) of <a href="https://www.microsoft.com/en-us/research/people/cmbishop/#!prml-book">Pattern Recognition and Machine
Learning</a>
by Christopher Bishop.</li>
</ol>
<h3 id="automatic-differentiation-variational-inference-advi">Automatic differentiation variational inference (ADVI)</h3>
<ol>
<li><a href="https://arxiv.org/abs/1603.00788">Automatic Differentiation Variational
Inference</a> by Alp Kucukelbir, Dustin Tran
et al. The original ADVI paper.</li>
<li><a href="https://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">Automatic Variational Inference in
Stan</a>
by Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman and David Blei.</li>
</ol>
<h2 id="open-source-software-for-bayesian-inference">Open-Source Software for Bayesian Inference</h2>
<p>There are many open-source software libraries for Bayesian modelling and
inference, and it is instructive to look into the inference methods that they do
(or do not!) implement.</p>
<ol>
<li><a href="http://mc-stan.org/">Stan</a></li>
<li><a href="http://docs.pymc.io/">PyMC3</a></li>
<li><a href="http://pyro.ai/">Pyro</a></li>
<li><a href="https://www.tensorflow.org/probability/">Tensorflow Probability</a></li>
<li><a href="http://edwardlib.org/">Edward</a></li>
<li><a href="https://greta-stats.org/">Greta</a></li>
<li><a href="https://dotnet.github.io/infer/">Infer.NET</a></li>
<li><a href="https://www.mrc-bsu.cam.ac.uk/software/bugs/">BUGS</a></li>
<li><a href="http://mcmc-jags.sourceforge.net/">JAGS</a></li>
</ol>
<h2 id="further-topics">Further Topics</h2>
<p>Bayesian inference doesn’t stop at MCMC and VI: there is bleeding-edge research
being done on other methods of inference. While they aren’t ready for real-world
use, it is interesting to see what they are.</p>
<h3 id="approximate-bayesian-computation-abc-and-likelihood-free-methods">Approximate Bayesian computation (ABC) and likelihood-free methods</h3>
<ol>
<li><a href="https://arxiv.org/abs/1001.2058">Likelihood-free Monte Carlo</a> by Scott
Sisson and Yanan Fan.</li>
</ol>
<h3 id="expectation-propagation">Expectation propagation</h3>
<ol>
<li><a href="https://arxiv.org/abs/1412.4869">Expectation propagation as a way of life: A framework for Bayesian inference
on partitioned data</a> by Aki Vehtari, Andrew
Gelman, et al.</li>
</ol>
<h3 id="operator-variational-inference-opvi">Operator variational inference (OPVI)</h3>
<ol>
<li><a href="https://arxiv.org/abs/1610.09033">Operator Variational Inference</a> by Rajesh
Ranganath, Jaan Altosaar, Dustin Tran and David Blei. The original OPVI
paper.</li>
</ol>
<p>(I’ve tried to include as many relevant and helpful resources as I could find,
but if you feel like I’ve missed something, <a href="https://twitter.com/@_eigenfoo">drop me a
line</a>!)</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://twitter.com/year_progress/status/1079889949871300608">Relevant tweet
here.</a> <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>If that’s what you’re looking for, check out my <a href="https://www.georgeho.org/bayesian-modelling-cookbook">Bayesian modelling
cookbook</a> or <a href="https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html">Michael
Betancourt’s excellent essay on a principled Bayesian
workflow</a>. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Probabilistic and Bayesian Matrix Factorizations for Text Clusteringhttps://www.georgeho.org/matrix-factorizations/2018-10-13T00:00:00Z2018-10-13T00:00:00Z<p>Natural language processing is in a curious place right now. It was always a
late bloomer (as far as machine learning subfields go), and it’s not immediately
obvious how close the field is to viable, large-scale, production-ready
techniques (in the same way that, say, <a href="https://clarifai.com/models/">computer vision
is</a>). For example, <a href="https://ruder.io">Sebastian
Ruder</a> predicted that the field is <a href="https://thegradient.pub/nlp-imagenet/">close to a watershed
moment</a>, and that soon we’ll have
downloadable language models. However, <a href="https://thegradient.pub/author/ana/">Ana
Marasović</a> points out that there is <a href="https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/">a
tremendous amount of work demonstrating
that</a>:</p>
<blockquote>
<p>“despite good performance on benchmark datasets, modern NLP techniques are
nowhere near the skill of humans at language understanding and reasoning when
making sense of novel natural language inputs”.</p>
</blockquote>
<p>I am confident that I am <em>very</em> bad at making lofty predictions about the
future. Instead, I’ll talk about something I know a bit about: simple solutions
to concrete problems, with some Bayesianism thrown in for good measure!</p>
<p>This blog post summarizes some literature on probabilistic and Bayesian
matrix factorization methods, keeping an eye out for applications to one
specific task in NLP: text clustering. It’s exactly what it sounds like, and
there’s been a fair amount of success in applying text clustering to many other
NLP tasks (e.g. check out these examples in <a href="https://www-users.cs.umn.edu/~hanxx023/dmclass/scatter.pdf">document
organization</a>,
<a href="http://jmlr.csail.mit.edu/papers/volume3/bekkerman03a/bekkerman03a.pdf">corpus</a>
<a href="https://www.cs.technion.ac.il/~rani/el-yaniv-papers/BekkermanETW01.pdf">summarization</a>
and <a href="http://www.kamalnigam.com/papers/emcat-aaai98.pdf">document
classification</a>).</p>
<p>What follows is a literature review of three matrix factorization techniques for
machine learning: one classical, one probabilistic and one Bayesian. I also
experimented with applying these methods to text clustering: I gave a guest
lecture on my results to a graduate-level machine learning class at The Cooper
Union (the slide deck is below). Dive in!</p>
<h2 id="non-negative-matrix-factorization-nmf">Non-Negative Matrix Factorization (NMF)</h2>
<p>NMF is a <a href="https://en.wikipedia.org/wiki/Non-negative_matrix_factorization">very
well-known</a>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html">matrix
factorization</a>
<a href="https://arxiv.org/abs/1401.5226">technique</a>, perhaps most famous for its
applications in <a href="http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/">collaborative filtering and the Netflix
Prize</a>.</p>
<p>Factorize your (entrywise non-negative) $m \times n$ matrix $V$ as
$V = WH$, where $W$ is $m \times p$ and $H$ is $p \times n$. $p$
is the dimensionality of your latent space, and each latent dimension usually
comes to quantify something with semantic meaning. There are several algorithms
to compute this factorization, but Lee and Seung’s <a href="https://dl.acm.org/citation.cfm?id=3008829">multiplicative update
rule</a> (originally published in NIPS
2000) is most popular.</p>
<p>Fairly simple: enough said, I think.</p>
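<p>For a sense of just how simple, here is a sketch of the multiplicative updates in NumPy (in practice you would reach for a library implementation, e.g. scikit-learn’s <code>NMF</code>):</p>

```python
import numpy as np

def nmf(V, p, n_iter=200, seed=0):
    """Lee-Seung multiplicative updates: factor V ~ W @ H, all entries >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.uniform(0.1, 1.0, (m, p))
    H = rng.uniform(0.1, 1.0, (p, n))
    eps = 1e-10  # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # these updates never go negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 30)))  # toy non-negative data
W, H = nmf(V, p=5)
```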
<h2 id="probabilistic-matrix-factorization-pmf">Probabilistic Matrix Factorization (PMF)</h2>
<p>Originally introduced as a paper at <a href="https://papers.nips.cc/paper/3208-probabilistic-matrix-factorization">NIPS
2007</a>,
<em>probabilistic matrix factorization</em> is essentially the exact same model as NMF,
but with uncorrelated (a.k.a. “spherical”) multivariate Gaussian priors placed
on the rows of $U$ and $V$. Expressed as a graphical model, PMF
would look like this:</p>
<figure>
<a href="https://www.georgeho.org/assets/images/pmf.png"><img style="float: middle" src="https://www.georgeho.org/assets/images/pmf.png" alt="Graphical model (using plate notation) for probabilistic matrix factorization (PMF)"></a>
</figure>
<p>Note that the priors are placed on the <em>rows</em> of the $U$ and $V$ matrices.</p>
<p>The authors then (somewhat disappointingly) proceed to find the MAP estimate of
the $U$ and $V$ matrices. They show that maximizing the posterior is
equivalent to minimizing the sum-of-squared-errors loss function with two
quadratic regularization terms:</p>
<p>$$
\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} {I_{ij} (R_{ij} - U_i^T V_j)^2} +
\frac{\lambda_U}{2} \sum_{i=1}^{N} \|U_i\|_{Fro}^2 +
\frac{\lambda_V}{2} \sum_{j=1}^{M} \|V_j\|_{Fro}^2
$$</p>
<p>where $\|\cdot\|_{Fro}$ denotes the Frobenius norm, and $I_{ij}$ is 1 if document
$i$ contains word $j$, and 0 otherwise.</p>
<p>This loss function can be minimized via gradient descent, and implemented in
your favorite deep learning framework (e.g. Tensorflow or PyTorch).</p>
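<p>A sketch of that gradient descent in plain NumPy (toy data and made-up hyperparameters — a real implementation would lean on a framework’s autodiff):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 30, 40, 5  # documents, words, latent dimensions
R = rng.poisson(2.0, (N, M)).astype(float)  # toy observed matrix
I = rng.uniform(size=(N, M)) < 0.5          # observation mask, plays the role of I_ij

U = 0.1 * rng.normal(size=(N, p))
V = 0.1 * rng.normal(size=(M, p))
lam_U = lam_V = 0.1
lr = 0.005

def loss(U, V):
    err = I * (R - U @ V.T)
    return 0.5 * (err**2).sum() + 0.5 * lam_U * (U**2).sum() + 0.5 * lam_V * (V**2).sum()

before = loss(U, V)
for _ in range(500):
    err = I * (R - U @ V.T)            # residuals on observed entries only
    U += lr * (err @ V - lam_U * U)    # gradient step on U
    V += lr * (err.T @ U - lam_V * V)  # gradient step on V
after = loss(U, V)
```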
<p>The problem with this approach is that while the MAP estimate is often a
reasonable point in low dimensions, it becomes very strange in high dimensions,
and is usually not informative or special in any way. Read <a href="https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
for more.</p>
<h2 id="bayesian-probabilistic-matrix-factorization-bpmf">Bayesian Probabilistic Matrix Factorization (BPMF)</h2>
<p>Strictly speaking, PMF is not a Bayesian model. After all, there aren’t any
priors or posteriors, only fixed hyperparameters and a MAP estimate. <em>Bayesian
probabilistic matrix factorization</em>, originally published by <a href="https://dl.acm.org/citation.cfm?id=1390267">researchers from
the University of Toronto</a> is a
fully Bayesian treatment of PMF.</p>
<p>Instead of saying that the rows/columns of U and V are normally distributed with
zero mean and some precision matrix, we place hyperpriors on the mean vector and
precision matrices. The specific priors are Wishart priors on the precision
matrices (with scale matrix $W_0$ and $\nu_0$ degrees of freedom), and
Gaussian priors on the means (with mean $\mu_0$ and covariance equal to the
covariance given by the Wishart prior). Expressed as a graphical model, BPMF
would look like this:</p>
<figure>
<a href="https://www.georgeho.org/assets/images/bpmf.png"><img style="float: middle" src="https://www.georgeho.org/assets/images/bpmf.png" alt="Graphical model (using plate notation) for Bayesian probabilistic matrix factorization (BPMF)"></a>
</figure>
<p>Note that, as above, the priors are placed on the <em>rows</em> of the $U$ and $V$
matrices, and that $n$ is the dimensionality of latent space (i.e. the number
of latent dimensions in the factorization).</p>
<p>The authors then sample from the posterior distribution of $U$ and $V$ using
a Gibbs sampler. Sampling takes anywhere between 5 and 180 hours, depending
on how many samples you want. Nevertheless, the authors demonstrate
that BPMF can achieve more accurate and more robust results on the Netflix data
set.</p>
<p>I would propose two changes to the original paper:</p>
<ol>
<li>Use an LKJ prior on the covariance matrices instead of a Wishart prior.
<a href="https://docs.pymc.io/notebooks/LKJ.html">According to Michael Betancourt and the PyMC3 docs, this is more numerically
stable</a>, and will lead to better
inference.</li>
<li>Use a more robust sampler such as NUTS (instead of a Gibbs sampler), or even
resort to variational inference. The paper makes it clear that BPMF is a
computationally painful endeavor, so any speedup to the method would be a
great help. It seems to me that for practical real-world applications to
collaborative filtering, we would want to use variational inference. Netflix
ain’t waiting 5 hours for their recommendations.</li>
</ol>
<h2 id="application-to-text-clustering">Application to Text Clustering</h2>
<p>Most of the work in these matrix factorization techniques focus on
dimensionality reduction: that is, the problem of finding two factor matrices
that faithfully reconstruct the original matrix when multiplied together.
However, I was interested in applying the exact same techniques to a separate
task: text clustering.</p>
<p>A natural question is: why is matrix factorization<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> a good technique to use
for text clustering? Because it is simultaneously a clustering and a feature
engineering technique: not only does it offer us a latent representation of the
original data, but it also gives us a way to easily <em>reconstruct</em> the original
data from the latent variables! This is something that <a href="https://www.georgeho.org/lda-sucks">latent Dirichlet
allocation</a>, for instance, cannot do.</p>
<p>Matrix factorization lives an interesting double life: clustering technique by
day, feature transformation technique by night. <a href="http://charuaggarwal.net/text-cluster.pdf">Aggarwal and
Zhai</a> suggest that chaining matrix
factorization with some other clustering technique (e.g. agglomerative
clustering or topic modelling) is common practice and is called <em>concept
decomposition</em>, but I haven’t seen any other source back this up.</p>
<p>I experimented with using these techniques to cluster subreddits (<a href="https://www.georgeho.org/reddit-clusters">sound
familiar?</a>). In a nutshell, nothing seemed
to work out very well, and I opine on why I think that’s the case in the slide
deck below. This talk was delivered to a graduate-level course in frequentist
machine learning.</p>
<blockquote class="embedly-card"><h4><a href="https://speakerdeck.com/_eigenfoo/probabilistic-and-bayesian-matrix-factorizations-for-text-clustering">Probabilistic and Bayesian Matrix Factorizations for Text Clustering</a></h4><p> I experimented with using these techniques to cluster subreddits. In a nutshell, nothing seemed to work out very well, and I opine on why I think that’s the case in this slide deck. This talk was delivered to a graduate-level course in frequentist machine learning. </p></blockquote>
<script async src="//cdn.embedly.com/widgets/platform.js" charset="UTF-8"></script>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>which is, by the way, a <a href="http://scikit-learn.org/stable/modules/decomposition.html">severely underappreciated technique in machine
learning</a> <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Multi-Armed Bandits and Conjugate Models — Bayesian Reinforcement Learning (Part 1)https://www.georgeho.org/bayesian-bandits/2018-08-31T00:00:00Z2018-08-31T00:00:00Z<blockquote>
<p>This is the first of a two-part series about Bayesian bandit algorithms. Check
out the second post <a href="https://www.georgeho.org/bayesian-bandits-2/">here</a>.</p>
</blockquote>
<p>Let’s talk about Bayesianism. It’s developed a reputation (not entirely
justified, but not entirely unjustified either) for being too mathematically
sophisticated or too computationally intensive to work at scale. For instance,
inferring from a Gaussian mixture model is fraught with computational problems
(hierarchical funnels, multimodal posteriors, etc.), and may take a seasoned
Bayesian anywhere between a day and a month to do well. On the other hand, other
blunt hammers of estimation are as easy as a maximum likelihood estimate:
something you could easily get a SQL query to do if you wanted to.</p>
<p>In this blog post I hope to show that there is more to Bayesianism than just
MCMC sampling and suffering, by demonstrating a Bayesian approach to a classic
reinforcement learning problem: the <em>multi-armed bandit</em>.</p>
<p>The problem is this: imagine a gambler at a row of slot machines (each machine
being a “one-armed bandit”), who must devise a strategy so as to maximize
rewards. This strategy includes which machines to play, how many times to play
each machine, in which order to play them, and whether to continue with the
current machine or try a different machine.</p>
<p>This problem is a central problem in decision theory and reinforcement learning:
the agent (our gambler) starts out in a state of ignorance, but learns through
interacting with its environment (playing slots). For more details, Cam
Davidson-Pilon has a great introduction to multi-armed bandits in Chapter 6 of
his book <a href="https://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb"><em>Bayesian Methods for
Hackers</em></a>,
and Tor Lattimore and Csaba Szepesvári cover a breathtaking amount of the
underlying theory in their book <a href="http://banditalgs.com/"><em>Bandit Algorithms</em></a>.</p>
<p>So let’s get started! I assume that you are familiar with:</p>
<ul>
<li>some basic probability, at least enough to know some distributions: normal,
Bernoulli, binomial…</li>
<li>some basic Bayesian statistics, at least enough to understand what a
<a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> (and
conjugate model) is, and why one might like them.</li>
<li><a href="https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/">Python generators and the <code>yield</code>
keyword</a>,
to understand some of the code I’ve written<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>.</li>
</ul>
<p>Dive in!</p>
<h2 id="the-algorithm">The Algorithm</h2>
<p>The algorithm is straightforward. The description below is taken from Cam
Davidson-Pilon over at Data Origami<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.</p>
<p>For each round,</p>
<ol>
<li>Sample a random variable $X_b$ from the prior of bandit $b$, for all
$b$.</li>
<li>Select the bandit with largest sample, i.e. select bandit $B =
\text{argmax}(X_b)$.</li>
<li>Observe the result of pulling bandit $B$, and update your prior on bandit
$B$ using the conjugate model update rule.</li>
<li>Repeat!</li>
</ol>
<p>What I find remarkable about this is how dumbfoundingly simple it is! No MCMC
sampling, no $\hat{R}$s to diagnose, no pesky divergences… all it requires is
a conjugate model, and the rest is literally just counting.</p>
<p><strong>NB:</strong> This algorithm is technically known as <em>Thompson sampling</em>, and is only
one of many bandit algorithms out there. These algorithms differ mainly in how
they go from our current priors to a decision on which bandit to play next:
e.g. instead of simply sampling from our priors, we could use the upper bound
of the 90% credible region, or some dynamic quantile of the posterior (as in
Bayes UCB). See Data Origami<sup id="fnref1:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> for more information.</p>
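<p>To make the contrast concrete, here is a minimal sketch of such a quantile-based rule (my own illustration, not from the original notebook): instead of sampling, deterministically take the upper 90% quantile of each Beta posterior and play the argmax.</p>

```python
import numpy as np
from scipy.stats import beta


def ucb_choice(num_rewards, num_trials, q=0.9):
    # Upper q-quantile of each bandit's Beta posterior,
    # assuming the same Beta(2, 2) prior used in the code below.
    upper = beta.ppf(q, 2 + num_rewards, 2 + num_trials - num_rewards)
    return int(np.argmax(upper))
```

<p>Swapping this in for the <code>np.random.beta</code> draw below would turn Thompson sampling into a (simplified) upper-credible-bound strategy.</p>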
<h3 id="stochastic-aka-stationary-bandits">Stochastic (a.k.a. stationary) bandits</h3>
<p>Let’s take this algorithm for a spin! Assume we have rewards which are Bernoulli
distributed (this would be the situation we face when e.g. modelling
click-through rates). The conjugate prior for the Bernoulli distribution is the
Beta distribution (this is a special case of the Beta-Binomial model).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">make_bandits</span>(params):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">pull</span>(arm, size<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Bernoulli distributed rewards</span>
</span></span><span style="display:flex;"><span> reward <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>binomial(n<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, p<span style="color:#f92672">=</span>params[arm], size<span style="color:#f92672">=</span>size)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">yield</span> reward
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> pull, len(params)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">bayesian_strategy</span>(pull, num_bandits):
</span></span><span style="display:flex;"><span> num_rewards <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>zeros(num_bandits)
</span></span><span style="display:flex;"><span> num_trials <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>zeros(num_bandits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample from the bandits' priors, and choose largest</span>
</span></span><span style="display:flex;"><span> choice <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>argmax(
</span></span><span style="display:flex;"><span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>beta(a<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> num_rewards, b<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> num_trials <span style="color:#f92672">-</span> num_rewards)
</span></span><span style="display:flex;"><span> )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample the chosen bandit</span>
</span></span><span style="display:flex;"><span> reward <span style="color:#f92672">=</span> next(pull(choice))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Update</span>
</span></span><span style="display:flex;"><span> num_rewards[choice] <span style="color:#f92672">+=</span> reward
</span></span><span style="display:flex;"><span> num_trials[choice] <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">yield</span> choice, reward, num_rewards, num_trials
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">"__main__"</span>:
</span></span><span style="display:flex;"><span> pull, num_bandits <span style="color:#f92672">=</span> make_bandits([<span style="color:#ae81ff">0.2</span>, <span style="color:#ae81ff">0.5</span>, <span style="color:#ae81ff">0.7</span>])
</span></span><span style="display:flex;"><span> play <span style="color:#f92672">=</span> bayesian_strategy(pull, num_bandits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">100</span>):
</span></span><span style="display:flex;"><span> choice, reward, num_rewards, num_trials <span style="color:#f92672">=</span> next(play)
</span></span></code></pre></div><p>Here, <code>pull</code> returns the result of pulling on the <code>arm</code>‘th bandit, and
<code>make_bandits</code> is just a factory function for <code>pull</code>.</p>
<p>The <code>bayesian_strategy</code> function actually implements the algorithm. We only need
to keep track of the number of times we win and the number of times we played
(<code>num_rewards</code> and <code>num_trials</code>, respectively). It samples from all current
<code>np.random.beta</code> priors (where the original prior was a $\text{Beta}(2,
2)$, which is symmetric about 0.5 and explains the odd-looking <code>a=2+</code> and
<code>b=2+</code> there), picks the <code>np.argmax</code>, <code>pull</code>s that specific bandit, and updates
<code>num_rewards</code> and <code>num_trials</code>.</p>
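<p>To spell out just how literal the counting is, here is a tiny worked example (the numbers are invented for illustration):</p>

```python
# Beta-Bernoulli update: start from a Beta(2, 2) prior and
# simply add the counts after observing 7 wins in 10 pulls.
wins, trials = 7, 10
a = 2 + wins              # prior a=2, plus number of successes
b = 2 + (trials - wins)   # prior b=2, plus number of failures
posterior_mean = a / (a + b)  # 9 / 14, about 0.64
```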
<p>I’ve omitted the data visualization code here, but if you want to see it, check
out the <a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">Jupyter notebook on my
GitHub</a>.</p>
<figure>
<a href="https://www.georgeho.org/assets/images/beta-binomial.png"><img style="float: middle" src="https://www.georgeho.org/assets/images/beta-binomial.png" alt="Posterior distribution after several pulls for the Beta-Binomial model"></a>
</figure>
<h3 id="generalizing-to-conjugate-models">Generalizing to conjugate models</h3>
<p>In fact, this algorithm isn’t just limited to Bernoulli-distributed rewards: it
will work for any <a href="https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions">conjugate
model</a>!
Here I implement the Gamma-Poisson model (that is, Poisson distributed rewards,
with a Gamma conjugate prior) to illustrate how extensible this framework is.
(Who cares about Poisson distributed rewards, you ask? Anyone who worries about
returning customers, for one!)</p>
<p>Here’s what we need to change:</p>
<ul>
<li>The rewards distribution in the <code>pull</code> function (in practice, you don’t get
to pick this, so <em>technically</em> there’s nothing to change if you’re doing this
in production!)</li>
<li>The sampling from the prior in <code>bayesian_strategy</code></li>
<li>The variables you need to keep track of and the update rule in <code>bayesian_strategy</code></li>
</ul>
<p>Without further ado:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">make_bandits</span>(params):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">pull</span>(arm, size<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Poisson distributed rewards</span>
</span></span><span style="display:flex;"><span> reward <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>poisson(lam<span style="color:#f92672">=</span>params[arm], size<span style="color:#f92672">=</span>size)
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">yield</span> reward
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> pull, len(params)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">bayesian_strategy</span>(pull, num_bandits):
</span></span><span style="display:flex;"><span> num_rewards <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>ones(num_bandits)
</span></span><span style="display:flex;"><span> num_trials <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>ones(num_bandits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample from the bandits' priors, and choose largest</span>
</span></span><span style="display:flex;"><span> choice <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>argmax(np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>gamma(num_rewards, scale<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span> <span style="color:#f92672">/</span> num_trials))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample the chosen bandit</span>
</span></span><span style="display:flex;"><span> reward <span style="color:#f92672">=</span> next(pull(choice))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Update</span>
</span></span><span style="display:flex;"><span> num_rewards[choice] <span style="color:#f92672">+=</span> reward
</span></span><span style="display:flex;"><span> num_trials[choice] <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">yield</span> choice, reward, num_rewards, num_trials
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">"__main__"</span>:
</span></span><span style="display:flex;"><span> pull, num_bandits <span style="color:#f92672">=</span> make_bandits([<span style="color:#ae81ff">4.0</span>, <span style="color:#ae81ff">4.5</span>, <span style="color:#ae81ff">5.0</span>])
</span></span><span style="display:flex;"><span> play <span style="color:#f92672">=</span> bayesian_strategy(pull, num_bandits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">100</span>):
</span></span><span style="display:flex;"><span> choice, reward, num_rewards, num_trials <span style="color:#f92672">=</span> next(play)
</span></span></code></pre></div><figure>
<a href="https://www.georgeho.org/assets/images/gamma-poisson.png"><img style="float: middle" src="https://www.georgeho.org/assets/images/gamma-poisson.png" alt="Posterior distribution after several pulls for the Gamma-Poisson model"></a>
</figure>
<p>This really demonstrates how lean and mean conjugate models can be, especially
considering how much of a pain MCMC or approximate inference methods would be,
compared to literal <em>counting</em>. Conjugate models aren’t just textbook examples:
they’re <em>(gasp)</em> actually useful!</p>
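<p>A quick sanity check of the counting claim (simulated data, not part of the original notebook): after enough pulls of a single Poisson bandit, the Gamma posterior mean should land near the true rate.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_lam = 4.5
rewards = rng.poisson(lam=true_lam, size=2000)

# Gamma(1, 1) prior, as in the code above: the shape accumulates
# total reward, the rate accumulates the number of pulls.
shape = 1 + rewards.sum()
rate = 1 + len(rewards)
posterior_mean = shape / rate
```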
<h3 id="generalizing-to-arbitrary-rewards-distributions">Generalizing to arbitrary rewards distributions</h3>
<p>OK, so if we have a conjugate model, we can use Thompson sampling to solve the
multi-armed bandit problem. But what if our rewards distribution doesn’t have a
conjugate prior, or what if we don’t even <em>know</em> our rewards distribution?</p>
<p>In general this problem is very difficult to solve. Theoretically, we could
place some fairly uninformative prior on our rewards, and after every pull we
could run MCMC to get our posterior, but that doesn’t scale, especially for the
online algorithms that we have in mind. Luckily a recent paper by Agrawal and
Goyal<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> gives us some help, <em>if we assume rewards are bounded on the interval
$[0, 1]$</em> (of course, if we have bounded rewards, then we can just normalize
them by their maximum value to get rewards between 0 and 1).</p>
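<p>The normalization itself is a one-liner (a sketch, assuming the upper bound <code>r_max</code> is known ahead of time):</p>

```python
import numpy as np


def normalize_rewards(rewards, r_max):
    # Map rewards known to lie in [0, r_max] onto [0, 1],
    # as the bounded-rewards trick requires.
    return np.asarray(rewards, dtype=float) / r_max
```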
<p>This solution bootstraps the earlier Beta-Bernoulli model to this new situation.
Here’s what happens:</p>
<ol>
<li>Sample a random variable $X_b$ from the (Beta) prior of bandit $b$, for
all $b$.</li>
<li>Select the bandit with largest sample, i.e. select bandit $B =
\text{argmax}(X_b)$.</li>
<li>Observe the reward $R$ from bandit $B$.</li>
<li><strong>Observe the outcome $r$ from a Bernoulli trial with probability of success $R$.</strong></li>
<li>Update posterior of $B$ with this observation $r$.</li>
<li>Repeat!</li>
</ol>
<p>Here I do this for the logit-normal distribution (i.e. a random variable whose
logit is normally distributed). Note that <code>expit</code> (from
<code>scipy.special</code>) is the inverse of the logit function.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">make_bandits</span>(params):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">pull</span>(arm, size<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>):
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Logit-normal distributed returns (or any distribution with finite support)</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># `expit` is the inverse of `logit`</span>
</span></span><span style="display:flex;"><span> reward <span style="color:#f92672">=</span> expit(np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>normal(loc<span style="color:#f92672">=</span>params[arm], scale<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, size<span style="color:#f92672">=</span>size))
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">yield</span> reward
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">return</span> pull, len(params)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">bayesian_strategy</span>(pull, num_bandits):
</span></span><span style="display:flex;"><span> num_rewards <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>zeros(num_bandits)
</span></span><span style="display:flex;"><span> num_trials <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>zeros(num_bandits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample from the bandits' priors, and choose largest</span>
</span></span><span style="display:flex;"><span> choice <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>argmax(
</span></span><span style="display:flex;"><span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>beta(<span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> num_rewards, <span style="color:#ae81ff">2</span> <span style="color:#f92672">+</span> num_trials <span style="color:#f92672">-</span> num_rewards)
</span></span><span style="display:flex;"><span> )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample the chosen bandit</span>
</span></span><span style="display:flex;"><span> reward <span style="color:#f92672">=</span> next(pull(choice))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Sample a Bernoulli with probability of success = reward</span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Remember, reward is normalized to be in [0, 1]</span>
</span></span><span style="display:flex;"><span> outcome <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>binomial(n<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, p<span style="color:#f92672">=</span>reward)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#75715e"># Update</span>
</span></span><span style="display:flex;"><span> num_rewards[choice] <span style="color:#f92672">+=</span> outcome
</span></span><span style="display:flex;"><span> num_trials[choice] <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">yield</span> choice, reward, num_rewards, num_trials
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74">"__main__"</span>:
</span></span><span style="display:flex;"><span> pull, num_bandits <span style="color:#f92672">=</span> make_bandits([<span style="color:#ae81ff">0.2</span>, <span style="color:#ae81ff">1.8</span>, <span style="color:#ae81ff">2</span>])
</span></span><span style="display:flex;"><span> play <span style="color:#f92672">=</span> bayesian_strategy(pull, num_bandits)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span> <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">100</span>):
</span></span><span style="display:flex;"><span> choice, reward, num_rewards, num_trials <span style="color:#f92672">=</span> next(play)
</span></span></code></pre></div><figure>
<a href="https://www.georgeho.org/assets/images/bounded.png"><img style="float: middle" src="https://www.georgeho.org/assets/images/bounded.png" alt="Posterior distribution after several pulls with an arbitrary reward distribution (e.g. the logit normal)"></a>
</figure>
<h2 id="final-remarks">Final Remarks</h2>
<p>None of this theory is new: I’m just advertising it! See Cam Davidson-Pilon’s
great blog post about Bayesian bandits<sup id="fnref2:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> for a much more in-depth treatment,
and of course, read around papers on arXiv if you want to go deeper!</p>
<p>Also, if you want to see all the code that went into this blog post, check out
<a href="https://github.com/eigenfoo/wanderings/blob/afcf37a8c6c2a2ac38f6708c1f3dd50db2ebe71f/bayes/bayesian-bandits.ipynb">the notebook
here</a>.</p>
<blockquote>
<p>This is the first of a two-part series about Bayesian bandit algorithms. Check
out the second post <a href="https://www.georgeho.org/bayesian-bandits-2/">here</a>.</p>
</blockquote>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>I’ve hopped on board the functional programming bandwagon, and couldn’t
help but think that to demonstrate this idea, I didn’t need a framework, a
library or even a class. Just two functions! <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2">
<p>Davidson-Pilon, Cameron. “Multi-Armed Bandits.” DataOrigami, 6 Apr. 2013,
<a href="https://dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits">dataorigami.net/blogs/napkin-folding/79031811-multi-armed-bandits</a> <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a> <a href="#fnref1:2" class="footnote-backref" role="doc-backlink">↩︎</a> <a href="#fnref2:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3">
<p><a href="https://arxiv.org/abs/1111.1797">arXiv:1111.1797</a> [cs.LG] <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>Cookbook — Bayesian Modelling with PyMC3https://www.georgeho.org/bayesian-modelling-cookbook/2018-06-24T00:00:00Z2018-06-24T00:00:00Z<p>Recently I’ve started using <a href="https://github.com/pymc-devs/pymc3">PyMC3</a> for
Bayesian modelling, and it’s an amazing piece of software! The API only exposes
as much of the heavy machinery of MCMC as you need — by which I mean, just the
<code>pm.sample()</code> method (a.k.a., as <a href="http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/">Thomas
Wiecki</a> puts it, the
<em>Magic Inference Button™</em>). This really frees up your mind to think about your
data and model, which is really the heart and soul of data science!</p>
<p>That being said however, I quickly realized that the water gets very deep very
fast: I explored my data set, specified a hierarchical model that made sense to
me, hit the <em>Magic Inference Button™</em>, and… uh, what now? I blinked at the
angry red warnings the sampler spat out.</p>
<p>So began my long, rewarding and ongoing exploration of Bayesian modelling. This
is a compilation of notes, tips, tricks and recipes that I’ve collected from
everywhere: papers, documentation, peppering my <a href="https://twitter.com/twiecki">more
experienced</a>
<a href="https://twitter.com/aseyboldt">colleagues</a> with questions. It’s still very much
a work in progress, but hopefully somebody else finds it useful!</p>
<p><img src="https://www.georgeho.org/assets/images/pymc-logo.png" alt="PyMC logo"></p>
<div>
<h2>Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#for-the-uninitiated">For the Uninitiated</a>
<ul>
<li><a href="#bayesian-modelling">Bayesian modelling</a></li>
<li><a href="#markov-chain-monte-carlo">Markov-chain Monte Carlo</a></li>
<li><a href="#variational-inference">Variational inference</a></li>
</ul>
</li>
<li><a href="#model-formulation">Model Formulation</a>
<ul>
<li><a href="#hierarchical-models">Hierarchical models</a></li>
</ul>
</li>
<li><a href="#model-implementation">Model Implementation</a></li>
<li><a href="#mcmc-initialization-and-sampling">MCMC Initialization and Sampling</a></li>
<li><a href="#mcmc-trace-diagnostics">MCMC Trace Diagnostics</a>
<ul>
<li><a href="#fixing-divergences">Fixing divergences</a></li>
<li><a href="#other-common-warnings">Other common warnings</a></li>
<li><a href="#model-reparameterization">Model reparameterization</a></li>
</ul>
</li>
<li><a href="#model-diagnostics">Model Diagnostics</a></li>
</ul>
</nav>
</div>
<h2 id="for-the-uninitiated">For the Uninitiated</h2>
<ul>
<li>First of all, <em>welcome!</em> It’s a brave new world out there — where statistics
is cool, Bayesian and (if you’re lucky) even easy. Dive in!</li>
</ul>
<blockquote>
<p><strong>EDIT (1/24/2020):</strong> I published a <a href="https://www.georgeho.org/bayesian-inference-reading/">subsequent blog
post</a> with a reading list
for Bayesian inference and modelling. Check it out for reading material in
addition to the ones I list below!</p>
</blockquote>
<h3 id="bayesian-modelling">Bayesian modelling</h3>
<ul>
<li>
<p>If you don’t know any probability, I’d recommend <a href="https://betanalpha.github.io/assets/case_studies/probability_theory.html">Michael
Betancourt’s</a>
crash-course in practical probability theory.</p>
</li>
<li>
<p>For an introduction to general Bayesian methods and modelling, I really liked
<a href="http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/">Cam Davidson Pilon’s <em>Bayesian Methods for
Hackers</em></a>:
it really made the whole “thinking like a Bayesian” thing click for me.</p>
</li>
<li>
<p>If you’re willing to spend some money, I’ve heard that <a href="https://sites.google.com/site/doingbayesiandataanalysis/"><em>Doing Bayesian Data
Analysis</em> by
Kruschke</a> (a.k.a.
<em>“the puppy book”</em>) is for the bucket list.</p>
</li>
<li>
<p>Here we come to a fork in the road. The central problem in Bayesian modelling
is this: given data and a probabilistic model that we think models this data,
how do we find the posterior distribution of the model’s parameters? There are
currently two good solutions to this problem. One is Markov-chain Monte Carlo
sampling (a.k.a. MCMC sampling), and the other is variational inference
(a.k.a. VI). Both methods are mathematical Death Stars: extremely powerful but
incredibly complicated. Nevertheless, I think it’s important to get at least a
hand-wavy understanding of what these methods are. If you’re new to all this,
my personal recommendation is to invest your time in learning MCMC: it’s been
around longer, we know that there are sufficiently robust tools to help you,
and there’s a lot more support/documentation out there.</p>
</li>
</ul>
<h3 id="markov-chain-monte-carlo">Markov-chain Monte Carlo</h3>
<ul>
<li>
<p>For a good high-level introduction to MCMC, I liked <a href="https://www.youtube.com/watch?v=DJ0c7Bm5Djk&feature=youtu.be&t=4h40m9s">Michael Betancourt’s
StanCon 2017
talk</a>:
especially the first few minutes where he provides a motivation for MCMC, that
really put all this math into context for me.</p>
</li>
<li>
<p>For a more in-depth (and mathematical) treatment of MCMC, I’d check out his
<a href="https://arxiv.org/abs/1701.02434">paper on Hamiltonian Monte Carlo</a>.</p>
</li>
</ul>
<h3 id="variational-inference">Variational inference</h3>
<ul>
<li>
<p>VI has been around for a while, but it was only in 2017 (a year before this
writing) that <em>automatic differentiation variational inference</em> was
invented. As such, variational inference is undergoing a renaissance and is
currently an active area of statistical research. Since it’s such a nascent
field, most resources on it are very theoretical and academic in nature.</p>
</li>
<li>
<p>Chapter 10 (on approximate inference) in Bishop’s <em>Pattern Recognition and
Machine Learning</em> and <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">this
tutorial</a>
by David Blei are excellent, if a bit mathematically-intensive, resources.</p>
</li>
<li>
<p>The most hands-on explanation of variational inference I’ve seen is the docs
for <a href="http://pyro.ai/examples/svi_part_i.html">Pyro</a>, a probabilistic
programming language developed by Uber that specializes in variational
inference.</p>
</li>
</ul>
<h2 id="model-formulation">Model Formulation</h2>
<ul>
<li>
<p>Try thinking about <em>how</em> your data would be generated: what kind of machine
has your data as outputs? This will help you both explore your data, as well
as help you arrive at a reasonable model formulation.</p>
</li>
<li>
<p>Try to avoid correlated variables. Some of the more robust samplers can cope
with <em>a posteriori</em> correlated random variables, but sampling is much easier
for everyone involved if the variables are uncorrelated. By the way, the bar
is pretty low here: if the jointplot/scattergram of the two variables looks
like an ellipse, that’s usually okay. It’s when the ellipse starts looking like
a line that you should be alarmed.</p>
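<p>As a quick numerical check (a sketch; <code>samples</code> here stands in for whatever array of posterior draws your trace gives you):</p>

```python
import numpy as np


def max_abs_correlation(samples):
    # samples: (num_draws, num_params) array of posterior draws.
    corr = np.corrcoef(samples, rowvar=False)
    off_diag = corr - np.eye(corr.shape[0])
    # Values near 1 mean the joint posterior's ellipse is
    # collapsing into a line: time to reparameterize.
    return np.abs(off_diag).max()
```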
</li>
<li>
<p>Try to avoid discrete latent variables, and discrete parameters in general.
There is no good method to sample them in a smart way (since discrete
parameters have no gradients); and with “naïve” samplers (i.e. those that do
not take advantage of the gradient), the number of samples one needs to make
good inferences generally scales exponentially in the number of parameters.
For an instance of this, see <a href="https://docs.pymc.io/notebooks/marginalized_gaussian_mixture_model.html">this example on marginal Gaussian
mixtures</a>.</p>
</li>
<li>
<p>The <a href="https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations">Stan GitHub
wiki</a> has
some excellent recommendations on how to choose good priors. Once you get a
good handle on the basics of using PyMC3, I <em>100% recommend</em> reading this wiki
from start to end: the Stan community has fantastic resources on Bayesian
statistics, and even though their APIs are quite different, the mathematical
theory all translates over.</p>
</li>
</ul>
<h3 id="hierarchical-models">Hierarchical models</h3>
<ul>
<li>
<p>First of all, hierarchical models can be amazing! <a href="https://docs.pymc.io/notebooks/GLM-hierarchical.html">The PyMC3
docs</a> opine on this at
length, so let’s not waste any digital ink.</p>
</li>
<li>
<p>The poster child of a Bayesian hierarchical model looks something like this
(equations taken from
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Wikipedia</a>):</p>
<p><img style="float: center"
src="https://wikimedia.org/api/rest_v1/media/math/render/svg/765f37f86fa26bef873048952dccc6e8067b78f4"
alt="Example Bayesian hierarchical model equation #1"></p>
<p><img style="float: center"
src="https://wikimedia.org/api/rest_v1/media/math/render/svg/ca8c0e1233fd69fa4325c6eacf8462252ed6b00a"
alt="Example Bayesian hierarchical model equation #2"></p>
<p><img style="float: center"
src="https://wikimedia.org/api/rest_v1/media/math/render/svg/1e56b3077b1b3ec867d6a0f2539ba9a3e79b45c1"
alt="Example Bayesian hierarchical model equation #3"></p>
<p>This hierarchy has 3 levels (some would say it has 2 levels, since there are
only 2 levels of parameters to infer, but honestly whatever: by my count there
are 3). 3 levels is fine, but add any more levels, and it becomes much harder
to sample. Try out a taller hierarchy to see if it works, but err on the side
of 3-level hierarchies.</p>
</li>
<li>
<p>If your hierarchy is too tall, you can truncate it by introducing a
deterministic function of your parameters somewhere (this usually turns out to
just be a sum). For example, instead of modelling your observations as drawn
from a 4-level hierarchy, maybe your observations can be modeled as the sum of
three parameters, where these parameters are drawn from a 3-level hierarchy.</p>
</li>
<li>
<p>More in-depth treatment here in <a href="https://arxiv.org/abs/1312.0906">(Betancourt and Girolami,
2013)</a>. <strong>tl;dr:</strong> hierarchical models all
but <em>require</em> you to use Hamiltonian Monte Carlo; also included are some
practical tips and goodies on how to do that stuff in the real world.</p>
</li>
</ul>
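<p>To make the three levels concrete, here’s a small NumPy sketch (with
made-up hyperparameter values of my own choosing) of the generative process
that the equations above describe: hyperparameters at the top, group-level
parameters in the middle, observations at the bottom:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Level 1: fixed hyperparameters (assumed values, for illustration only).
mu, tau = 0.0, 1.0

# Level 2: one latent mean per group, drawn from the hyperprior.
n_groups = 3
theta = rng.normal(mu, tau, size=n_groups)

# Level 3: observations within each group, drawn around its latent mean.
n_obs = 100
y = rng.normal(theta[:, None], 0.5, size=(n_groups, n_obs))

print(y.shape)  # (3, 100)
```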
<h2 id="model-implementation">Model Implementation</h2>
<ul>
<li>
<p>At the risk of overgeneralizing, there are only two things that can go wrong
in Bayesian modelling: either your data is wrong, or your model is wrong. And
it is a hell of a lot easier to debug your data than it is to debug your
model. So before you even try implementing your model, plot histograms of your
data, count the number of data points, drop any NaNs, etc. etc.</p>
</li>
<li>
<p>PyMC3 has one quirky piece of syntax, which I tripped up on for a while. It’s
described quite well in <a href="http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/#comment-2213376737">this comment on Thomas Wiecki’s
blog</a>.
Basically, suppose you have several groups with different numbers of
observations each, and you want one variable per group, shared across all
observations in that group. Then you need to use the quirky <code>variables[index]</code>
notation. I suggest using <code>scikit-learn</code>’s <code>LabelEncoder</code> to easily create the
index. For example, to make normally distributed heights for the iris dataset:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Different numbers of examples for each species</span>
</span></span><span style="display:flex;"><span>species <span style="color:#f92672">=</span> (<span style="color:#ae81ff">48</span> <span style="color:#f92672">*</span> [<span style="color:#e6db74">'setosa'</span>] <span style="color:#f92672">+</span> <span style="color:#ae81ff">52</span> <span style="color:#f92672">*</span> [<span style="color:#e6db74">'virginica'</span>] <span style="color:#f92672">+</span> <span style="color:#ae81ff">63</span> <span style="color:#f92672">*</span> [<span style="color:#e6db74">'versicolor'</span>])
</span></span><span style="display:flex;"><span>num_species <span style="color:#f92672">=</span> len(list(set(species))) <span style="color:#75715e"># 3</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># One variable per group</span>
</span></span><span style="display:flex;"><span>heights_per_species <span style="color:#f92672">=</span> pm<span style="color:#f92672">.</span>Normal(<span style="color:#e6db74">'heights_per_species'</span>,
</span></span><span style="display:flex;"><span> mu<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>, sd<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>, shape<span style="color:#f92672">=</span>num_species)
</span></span><span style="display:flex;"><span>idx <span style="color:#f92672">=</span> sklearn<span style="color:#f92672">.</span>preprocessing<span style="color:#f92672">.</span>LabelEncoder()<span style="color:#f92672">.</span>fit_transform(species)
</span></span><span style="display:flex;"><span>heights <span style="color:#f92672">=</span> heights_per_species[idx]
</span></span></code></pre></div></li>
<li>
<p>You might find yourself in a situation in which you want to use a centered
parameterization for a portion of your data set, but a noncentered
parameterization for the rest of your data set (see below for what these
parameterizations are). There’s a useful idiom for you here:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>num_xs <span style="color:#f92672">=</span> <span style="color:#ae81ff">5</span>
</span></span><span style="display:flex;"><span>use_centered <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>array([<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>]) <span style="color:#75715e"># len(use_centered) = num_xs</span>
</span></span><span style="display:flex;"><span>x_sd <span style="color:#f92672">=</span> pm<span style="color:#f92672">.</span>HalfCauchy(<span style="color:#e6db74">'x_sd'</span>, sd<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>x_raw <span style="color:#f92672">=</span> pm<span style="color:#f92672">.</span>Normal(<span style="color:#e6db74">'x_raw'</span>, mu<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>, sd<span style="color:#f92672">=</span>x_sd<span style="color:#f92672">**</span>use_centered, shape<span style="color:#f92672">=</span>num_xs)
</span></span><span style="display:flex;"><span>x <span style="color:#f92672">=</span> pm<span style="color:#f92672">.</span>Deterministic(<span style="color:#e6db74">'x'</span>, x_sd<span style="color:#f92672">**</span>(<span style="color:#ae81ff">1</span> <span style="color:#f92672">-</span> use_centered) <span style="color:#f92672">*</span> x_raw)
</span></span></code></pre></div><p>You could even experiment with allowing <code>use_centered</code> to be <em>between</em> 0 and
1, instead of being <em>either</em> 0 or 1!</p>
</li>
<li>
<p>I prefer to use the <code>pm.Deterministic</code> function instead of simply using normal
arithmetic operations (e.g. I’d prefer to write <code>x = pm.Deterministic('x', y + z)</code> instead of <code>x = y + z</code>). This means that you can index the <code>trace</code> object
later on with just <code>trace['x']</code>, instead of having to compute it yourself with
<code>trace['y'] + trace['z']</code>.</p>
</li>
</ul>
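<p>Incidentally, if you’d rather avoid the scikit-learn dependency,
<code>np.unique</code> builds the same integer index (a sketch, reusing the
species list from the example above):</p>

```python
import numpy as np

species = 48 * ['setosa'] + 52 * ['virginica'] + 63 * ['versicolor']

# return_inverse gives each label's position among the sorted unique labels,
# which is exactly the index that LabelEncoder would produce.
labels, idx = np.unique(species, return_inverse=True)

print(labels)   # ['setosa' 'versicolor' 'virginica']
print(idx[:3])  # [0 0 0]
```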
<h2 id="mcmc-initialization-and-sampling">MCMC Initialization and Sampling</h2>
<ul>
<li>
<p>Have faith in PyMC3’s default initialization and sampling settings: someone
much more experienced than us took the time to choose them! NUTS is the most
efficient MCMC sampler known to man, and <code>jitter+adapt_diag</code>… well, you get
the point.</p>
</li>
<li>
<p>However, if you’re truly grasping at straws, a more powerful initialization
setting would be <code>advi</code> or <code>advi+adapt_diag</code>, which uses variational inference
to initialize the sampler. An even better option would be to use
<code>advi+adapt_diag_grad</code> <del>which is (at the time of writing) an experimental
feature in beta</del>.</p>
</li>
<li>
<p>Never initialize the sampler with the MAP estimate! In low dimensional
problems the MAP estimate (a.k.a. the mode of the posterior) is often quite a
reasonable point. But in high dimensions, the MAP becomes very strange. Check
out <a href="http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/">Ferenc Huszár’s blog
post</a>
on high-dimensional Gaussians to see why. Besides, at the MAP all the derivatives
of the posterior are zero, and that isn’t great for derivative-based samplers.</p>
</li>
</ul>
<h2 id="mcmc-trace-diagnostics">MCMC Trace Diagnostics</h2>
<ul>
<li>You’ve hit the <em>Magic Inference Button™</em>, and you have a <code>trace</code> object. Now
what? First of all, make sure that your sampler didn’t barf itself, and that
your chains are safe for consumption (i.e., analysis).</li>
</ul>
<ol>
<li>
<p>Theoretically, run the chain for as long as you have the patience or
resources for. In practice, just use the PyMC3 defaults: 500 tuning
iterations, 1000 sampling iterations.</p>
</li>
<li>
<p>Check for divergences. PyMC3’s sampler will spit out a warning if there are
diverging chains, but the following code snippet may make things easier:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Display the total number and percentage of divergent chains</span>
</span></span><span style="display:flex;"><span>diverging <span style="color:#f92672">=</span> trace[<span style="color:#e6db74">'diverging'</span>]
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">'Number of Divergent Chains: </span><span style="color:#e6db74">{}</span><span style="color:#e6db74">'</span><span style="color:#f92672">.</span>format(diverging<span style="color:#f92672">.</span>nonzero()[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>size))
</span></span><span style="display:flex;"><span>diverging_pct <span style="color:#f92672">=</span> diverging<span style="color:#f92672">.</span>nonzero()[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>size <span style="color:#f92672">/</span> len(trace) <span style="color:#f92672">*</span> <span style="color:#ae81ff">100</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">'Percentage of Divergent Chains: </span><span style="color:#e6db74">{:.1f}</span><span style="color:#e6db74">'</span><span style="color:#f92672">.</span>format(diverging_pct))
</span></span></code></pre></div></li>
<li>
<p>Check the traceplot (<code>pm.traceplot(trace)</code>). You’re looking for traceplots
that look like “fuzzy caterpillars”. If the trace moves into some region and
stays there for a long time (a.k.a. there are some “sticky regions”), that’s
cause for concern! That indicates that once the sampler moves into some
region of parameter space, it gets stuck there (probably due to high
curvature or other bad topological properties).</p>
</li>
<li>
<p>In addition to the traceplot, there are <a href="https://docs.pymc.io/api/plots.html">a ton of other
plots</a> you can make with your trace:</p>
<ul>
<li><code>pm.plot_posterior(trace)</code>: check if your posteriors look reasonable.</li>
<li><code>pm.forestplot(trace)</code>: check if your variables have reasonable credible
intervals, and Gelman–Rubin scores close to 1.</li>
<li><code>pm.autocorrplot(trace)</code>: check if your chains are impaired by high
autocorrelation. Also remember that thinning your chains is a waste of
time at best, and deluding yourself at worst. See Chris Fonnesbeck’s
comment on <a href="https://github.com/pymc-devs/pymc/issues/23">this GitHub
issue</a> and <a href="https://twitter.com/junpenglao/status/1009748562136256512">Junpeng Lao’s
reply to Michael Betancourt’s
tweet</a></li>
<li><code>pm.energyplot(trace)</code>: ideally the energy and marginal energy
distributions should look very similar. Long tails in the distribution of
energy levels indicate deteriorated sampler efficiency.</li>
<li><code>pm.densityplot(trace)</code>: a souped-up version of <code>pm.plot_posterior</code>. It
doesn’t seem to be wildly useful unless you’re plotting posteriors from
multiple models.</li>
</ul>
</li>
<li>
<p>PyMC3 has a nice helper function to pretty-print a summary table of the
trace: <code>pm.summary(trace)</code> (I usually tack on a <code>.round(2)</code> for my sanity).
Look out for:</p>
<ul>
<li>the $\hat{R}$ values (a.k.a. the Gelman–Rubin statistic, a.k.a. the
potential scale reduction factor, a.k.a. the PSRF): are they all close to
1? If not, something is <em>horribly</em> wrong. Consider respecifying or
reparameterizing your model. You can also inspect these in the forest plot.</li>
<li>the sign and magnitude of the inferred values: do they make sense, or are
they unexpected and unreasonable? This could indicate a poorly specified
model. (E.g. parameters of the unexpected sign that have low uncertainties
might indicate that your model needs interaction terms.)</li>
</ul>
</li>
<li>
<p>As a drastic debugging measure, try to <code>pm.sample</code> with <code>draws=1</code>,
<code>tune=500</code>, and <code>discard_tuned_samples=False</code>, and inspect the traceplot.
During the tuning phase, we don’t expect to see friendly fuzzy caterpillars,
but we <em>do</em> expect to see good (if noisy) exploration of parameter space. So
if the sampler is getting stuck during the tuning phase, that might explain
why the trace looks horrible.</p>
</li>
<li>
<p>If you get scary errors that describe mathematical problems (e.g. <code>ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.</code>), then you’re <del>shit out of luck</del> exceptionally unlucky: those kinds of
errors are notoriously hard to debug. I can only point to the <a href="http://andrewgelman.com/2008/05/13/the_folk_theore/">Folk Theorem of
Statistical Computing</a>:</p>
<blockquote>
<p>If you’re having computational problems, probably your model is wrong.</p>
</blockquote>
</li>
</ol>
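<p>For intuition about the $\hat{R}$ check above: the Gelman–Rubin statistic
compares between-chain and within-chain variance. Here’s a bare-bones NumPy
version (the classic estimator, not necessarily the exact one PyMC3 computes,
which may split and rank-normalize the chains):</p>

```python
import numpy as np

def gelman_rubin(chains):
    """Classic potential scale reduction factor for a (n_chains, n_samples)
    array of draws of a single parameter."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)

# Four well-mixed chains targeting the same distribution...
good = rng.normal(size=(4, 1000))
# ...versus four "stuck" chains sitting at different locations.
bad = rng.normal(size=(4, 1000)) + np.arange(4)[:, None]

print(gelman_rubin(good))  # close to 1
print(gelman_rubin(bad))   # well above 1
```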
<h3 id="fixing-divergences">Fixing divergences</h3>
<blockquote>
<p><code>There were N divergences after tuning. Increase 'target_accept' or reparameterize.</code></p>
<p>— The <em>Magic Inference Button™</em></p>
</blockquote>
<ul>
<li>
<p>Divergences in HMC occur when the sampler finds itself in regions of extremely
high curvature (such as the opening of a hierarchical funnel). Broadly
speaking, the sampler is prone to malfunction in such regions, causing it
to fly off towards infinity. This ruins the chains by heavily biasing the
samples.</p>
</li>
<li>
<p>Remember: if you have even <em>one</em> diverging chain, you should be worried.</p>
</li>
<li>
<p>Increase <code>target_accept</code>: usually 0.9 is a good number (currently the default
in PyMC3 is 0.8). This will help get rid of false positives from the test for
divergences. However, divergences that <em>don’t</em> go away are cause for alarm.</p>
</li>
<li>
<p>Increasing <code>tune</code> can sometimes help as well: this gives the sampler more time
to 1) find the typical set and 2) find good values for the step size, mass
matrix elements, etc. If you’re running into divergences, it’s always possible
that the sampler just hasn’t started the mixing phase and is still trying to
find the typical set.</p>
</li>
<li>
<p>Consider a <em>noncentered</em> parameterization. This is an amazing trick: it all
boils down to the familiar equation $X = \sigma Z + \mu$ from STAT 101, but
it honestly works wonders. See <a href="http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/">Thomas Wiecki’s blog
post</a>
on it, and <a href="https://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html">this page from the PyMC3
documentation</a>.</p>
</li>
<li>
<p>If that doesn’t work, there may be something wrong with the way you’re
thinking about your data: consider reparameterizing your model, or
respecifying it entirely.</p>
</li>
</ul>
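<p>That STAT 101 equation is worth seeing in action: instead of drawing $x$
directly from $\mathcal{N}(\mu, \sigma)$, draw a standard normal $z$ and
shift-and-scale it deterministically. A NumPy sanity check (my own sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n = 3.0, 2.0, 200_000

# Centered: sample x directly from N(mu, sigma).
centered = rng.normal(mu, sigma, size=n)

# Noncentered: sample z from N(0, 1), then shift and scale deterministically.
z = rng.normal(size=n)
noncentered = mu + sigma * z

# Both recover the same distribution.
print(centered.mean(), noncentered.mean())  # both near 3.0
print(centered.std(), noncentered.std())    # both near 2.0
```

<p>In a model, the win is that the sampler explores $z$, whose geometry
doesn’t depend on $\mu$ or $\sigma$, and that’s what flattens the funnel.</p>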
<h3 id="other-common-warnings">Other common warnings</h3>
<ul>
<li>
<p>It’s worth noting that far and away the worst warning to get is the one about
divergences. While a divergent chain indicates that your inference may be
flat-out <em>invalid</em>, the rest of these warnings indicate that your inference is
merely (lol, “merely”) <em>inefficient</em>.</p>
</li>
<li>
<p>It’s also worth noting that the <a href="https://mc-stan.org/misc/warnings.html">Brief Guide to Stan’s
Warnings</a> is a tremendous resource for
exactly what kinds of errors you might get when running HMC or NUTS, and how
you should think about them.</p>
</li>
<li>
<p><code>The number of effective samples is smaller than XYZ for some parameters.</code></p>
<ul>
<li>Quoting <a href="https://discourse.pymc.io/t/the-number-of-effective-samples-is-smaller-than-25-for-some-parameters/1050/3">Junpeng Lao on
<code>discourse.pymc.io</code></a>:
“A low number of effective samples is usually an indication of strong
autocorrelation in the chain.”</li>
<li>Make sure you’re using an efficient sampler like NUTS. (And not, for
instance, Gibbs or Metropolis–Hastings.)</li>
<li>Tweak the acceptance probability (<code>target_accept</code>) — it should be large
enough to ensure good exploration, but small enough to not reject all
proposals and get stuck.</li>
</ul>
</li>
<li>
<p><code>The gelman-rubin statistic is larger than XYZ for some parameters. This indicates slight problems during sampling.</code></p>
<ul>
<li>When PyMC3 samples, it runs several chains in parallel. Loosely speaking,
the Gelman–Rubin statistic measures how similar these chains are. Ideally it
should be close to 1.</li>
<li>Increasing the <code>tune</code> parameter may help, for the same reasons as described
in the <em>Fixing Divergences</em> section.</li>
</ul>
</li>
<li>
<p><code>The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.</code></p>
<ul>
<li>NUTS puts a cap on the depth of the trees that it evaluates during each
iteration, which is controlled through the <code>max_treedepth</code> parameter. Reaching the
maximum allowable tree depth indicates that NUTS is prematurely pulling the
plug to avoid excessive compute time.</li>
<li>Yeah, what the <em>Magic Inference Button™</em> says: try increasing
<code>max_treedepth</code> or <code>target_accept</code>.</li>
</ul>
</li>
</ul>
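<p>To see why autocorrelation eats your effective sample size, here’s a crude
NumPy estimate of ESS as $N / (1 + 2\sum_k \rho_k)$, truncating the sum at the
first non-positive autocorrelation (a sketch of the idea, not PyMC3’s actual
estimator):</p>

```python
import numpy as np

def effective_sample_size(x):
    """Crude ESS estimate: truncate the autocorrelation sum at the first
    non-positive lag (an initial-positive-sequence heuristic)."""
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode='full')[n - 1:] / n
    rho = acov / acov[0]
    total = 0.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        total += rho[k]
    return n / (1 + 2 * total)

rng = np.random.default_rng(3)
n = 2000

iid = rng.normal(size=n)  # no autocorrelation

ar1 = np.empty(n)         # strongly autocorrelated AR(1) "chain"
ar1[0] = rng.normal()
for t in range(1, n):
    ar1[t] = 0.95 * ar1[t - 1] + rng.normal()

print(effective_sample_size(iid))  # close to n
print(effective_sample_size(ar1))  # far smaller than n
```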
<h3 id="model-reparameterization">Model reparameterization</h3>
<ul>
<li>
<p>Countless warnings have told you to engage in this strange activity of
“reparameterization”. What even is that? Luckily, the <a href="https://github.com/stan-dev/stan/releases">Stan User
Manual</a> (specifically the
<em>Reparameterization and Change of Variables</em> section) has an excellent
explanation of reparameterization, and even some practical tips to help you do
it (although your mileage may vary on how useful those tips will be to you).</p>
</li>
<li>
<p>Aside from meekly pointing to other resources, there’s not much I can do to
help: this stuff really comes from a combination of intuition, statistical
knowledge and good ol’ experience. I can, however, cite some examples to give
you a better idea.</p>
<ul>
<li>The noncentered parameterization is a classic example. If you have a
parameter whose mean and variance you are also modelling, the noncentered
parameterization decouples the sampling of mean and variance from the
sampling of the parameter, so that they are now independent. In this way, we
avoid “funnels”.</li>
<li>The <a href="http://proceedings.mlr.press/v5/carvalho09a.html"><em>horseshoe
distribution</em></a> is known to
be a good shrinkage prior, as it is <em>very</em> spiky near zero, and has <em>very</em>
long tails. However, modelling it using one parameter can give multimodal
posteriors — an exceptionally bad result. The trick is to reparameterize and
model it as the product of two parameters: one to create spikiness at zero,
and one to create long tails (which makes sense: to sample from the
horseshoe, take the product of samples from a normal and a half-Cauchy).</li>
</ul>
</li>
</ul>
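<p>The horseshoe factorization mentioned above is easy to sample directly (a
sketch of the idea; for simplicity I fix the global scale at 1 and compare
against plain normal draws):</p>

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100_000

# Horseshoe draws as a product: a half-Cauchy local scale times a standard
# normal. The normal creates the spike at zero; the half-Cauchy, the tails.
lam = np.abs(rng.standard_cauchy(n))  # half-Cauchy local scales
z = rng.normal(size=n)                # standard normal
beta = lam * z

normal = rng.normal(size=n)           # reference: plain normal draws

# More mass very near zero than a normal...
print(np.mean(np.abs(beta) < 0.1), np.mean(np.abs(normal) < 0.1))
# ...and far heavier tails.
print(np.mean(np.abs(beta) > 10), np.mean(np.abs(normal) > 10))
```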
<h2 id="model-diagnostics">Model Diagnostics</h2>
<ul>
<li>Admittedly the distinction between the previous section and this one is
somewhat artificial (since problems with your chains indicate problems with
your model), but I still think it’s useful to make this distinction because
these checks indicate that you’re thinking about your data in the wrong way
(i.e. you made a poor modelling decision), and <em>not</em> that the sampler is
having a hard time doing its job.</li>
</ul>
<ol>
<li>
<p>Run the following snippet of code to inspect the pairplot of your variables
one pair at a time (if you have a plate of variables, it’s fine to pick a couple
at random). It’ll tell you if the two random variables are correlated, and
help identify any troublesome neighborhoods in the parameter space (divergent
samples will be colored differently, and will cluster near such
neighborhoods).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>pm<span style="color:#f92672">.</span>pairplot(trace,
</span></span><span style="display:flex;"><span> sub_varnames<span style="color:#f92672">=</span>[variable_1, variable_2],
</span></span><span style="display:flex;"><span> divergences<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span> color<span style="color:#f92672">=</span><span style="color:#e6db74">'C3'</span>,
</span></span><span style="display:flex;"><span> kwargs_divergence<span style="color:#f92672">=</span>{<span style="color:#e6db74">'color'</span>: <span style="color:#e6db74">'C2'</span>})
</span></span></code></pre></div></li>
<li>
<p>Look at your posteriors (either from the traceplot, density plots or
posterior plots). Do they even make sense? E.g. are there outliers or long
tails that you weren’t expecting? Do their uncertainties look reasonable to
you? If you had <a href="https://en.wikipedia.org/wiki/Plate_notation">a plate</a> of
variables, are their posteriors different? Did you expect them to be that
way? If not, what about the data made the posteriors different? You’re the
only one who knows your problem/use case, so the posteriors better look good
to you!</p>
</li>
<li>
<p>Broadly speaking, there are four kinds of bad geometries that your posterior
can suffer from:</p>
<ul>
<li>highly correlated posteriors: this will probably cause divergences or
traces that don’t look like “fuzzy caterpillars”. Either look at the
jointplots of each pair of variables, or look at the correlation matrix of
all variables. Try using a centered parameterization, or reparameterize in
some other way, to remove these correlations.</li>
<li>posteriors that form “funnels”: this will probably cause divergences. Try
using a noncentered parameterization.</li>
<li>heavy tailed posteriors: this will probably raise warnings about
<code>max_treedepth</code> being exceeded. If your data has long tails, you should
model that with a long-tailed distribution. If your data doesn’t have long
tails, then your model is ill-specified: perhaps a more informative prior
would help.</li>
<li>multimodal posteriors: right now this is pretty much a death blow. At the
time of writing, all samplers have a hard time with multimodality, and
there’s not much you can do about that. Try reparameterizing to get a
unimodal posterior. If that’s not possible (perhaps you’re <em>modelling</em>
multimodality using a mixture model), you’re out of luck: just let NUTS
sample for a day or so, and hopefully you’ll get a good trace.</li>
</ul>
</li>
<li>
<p>Pick a small subset of your raw data, and see what exactly your model does
with that data (i.e. run the model on a specific subset of your data). I find
that a lot of problems with your model can be found this way.</p>
</li>
<li>
<p>Run <a href="https://docs.pymc.io/notebooks/posterior_predictive.html"><em>posterior predictive
checks</em></a> (a.k.a.
PPCs): sample from your posterior, plug it back in to your model, and
“generate new data sets”. PyMC3 even has a nice function to do all this for
you: <code>pm.sample_ppc</code>. But what do you do with these new data sets? That’s a
question only you can answer! The point of a PPC is to see if the generated
data sets reproduce patterns you care about in the observed real data set,
and only you know what patterns you care about. E.g. how close are the PPC
means to the observed sample mean? What about the variance?</p>
<ul>
<li>For example, suppose you were modelling the levels of radon gas in
different counties in a country (through a hierarchical model). Then you
could sample radon gas levels from the posterior for each county, and take
the maximum within each county. You’d then have a distribution of maximum
radon gas levels across counties. You could then check if the <em>actual</em>
maximum radon gas level (in your observed data set) is acceptably within
that distribution. If it’s much larger than the maxima, then you would know
that the actual likelihood has longer tails than you assumed (e.g. perhaps
you should use a Student’s T instead of a normal?)</li>
<li>Remember that how well the posterior predictive distribution fits the data
is of little consequence (e.g. the expectation that 90% of the data should
fall within the 90% credible interval of the posterior). The posterior
predictive distribution tells you what values for data you would expect if
we were to remeasure, given that you’ve already observed the data you did.
As such, it’s informed by your prior as well as your data, and it’s not its
job to adequately fit your data!</li>
</ul>
</li>
</ol>Understanding Hate Speech on Reddit through Text Clusteringhttps://www.georgeho.org/reddit-clusters/2018-03-18T00:00:00Z2018-03-18T00:00:00Z<blockquote>
<p>Note: the following article contains several examples of hate speech
(including but not limited to racist, misogynistic and homophobic views).</p>
</blockquote>
<p>Have you heard of <code>/r/TheRedPill</code>? It’s an online forum (a subreddit, but I’ll
explain that later) where people (usually men) espouse an ideology predicated
entirely on gender. “Swallowers of the red pill”, as they call themselves,
maintain that it is <em>men</em>, not women, who are socially marginalized; that feminism
is something between a damaging ideology and a symptom of societal retardation;
that the patriarchy should actively assert its dominance over female
compatriots.</p>
<p>Despite being shunned by the world (or perhaps, because of it), <code>/r/TheRedPill</code>
has grown into a sizable community and evolved its own slang, language and
culture. Let me give you an example.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>Cluster #14:
</span></span><span style="display:flex;"><span>Cluster importance: 0.0489376285127
</span></span><span style="display:flex;"><span>shit: 2.433590
</span></span><span style="display:flex;"><span>test: 1.069885
</span></span><span style="display:flex;"><span>frame: 0.396684
</span></span><span style="display:flex;"><span>pass: 0.204953
</span></span><span style="display:flex;"><span>bitch: 0.163619
</span></span></code></pre></div><p>This is a snippet from a text clustering of <code>/r/TheRedPill</code> — you don’t really
need to understand the details right now: all you need to know is that each
cluster is simply a bunch of words that frequently appear together in Reddit
posts and comments. Following each word is a number indicating its importance in
the cluster, and on line 2 is the importance of this cluster to the subreddit
overall.</p>
<p>As it turns out, this cluster has picked up on a very specific meme on
<code>/r/TheRedPill</code>: the concept of the <em>shit test</em>, and how your frame can <em>pass</em> the
<em>shit tests</em> that life (but predominantly, <em>bitches</em>) can throw at you.</p>
<p>There’s absolutely no way I could explain this stuff better than the swallowers
of the red pill themselves, so I’ll just quote from a post on <code>/r/TheRedPill</code> and
a related blog.</p>
<p>The concept of the shit test is very broad:</p>
<blockquote>
<p>… when somebody “gives you shit” and fucks around with your head to see how
you will react, what you are experiencing is typically a (series of) shit
test(s).</p>
</blockquote>
<p>A shit test is designed to test your temperament, or more colloquially,
<em>“determine your frame”</em>.</p>
<blockquote>
<p>Frame is a concept which essentially means “composure and self-control”.</p>
<p>… if you can keep composure/seem unfazed and/or assert your boundaries
despite a shit test, generally speaking you will be considered to have passed
the shit test. If you get upset, offended, doubt yourself or show weakness in
any discernible way when shit tested, it will be generally considered that you
failed the test.</p>
</blockquote>
<p>Finally, not only do shit tests test your frame, but they also serve a specific,
critical social function:</p>
<blockquote>
<p>When it comes right down to it shit tests are typically women’s way of
flirting.</p>
<p>… Those who “pass” show they can handle the woman’s BS and is “on her
level”, so to speak. This is where the evolutionary theory comes into play:
you’re demonstrating her faux negativity doesn’t phase you [sic] and that
you’re an emotionally developed person who isn’t going to melt down at the
first sign of trouble. Ergo you’ll be able to protect her when threats to
her safety emerge.</p>
</blockquote>
<p>If you want to learn more, I took all the above quotes from
<a href="https://www.reddit.com/r/TheRedPill/comments/22qnmk/newbies_read_this_the_definitive_guide_to_shit/">here</a>
and <a href="https://illimitablemen.com/2014/12/14/the-shit-test-encyclopedia/">here</a>:
feel free to toss yourself down that rabbit hole (but you may want to open those
links in Incognito mode).</p>
<p>Clearly though, the cluster did a good job of identifying one topic of
discussion on <code>/r/TheRedPill</code>. In fact, not only can clustering pick up on a
general topic of conversation, but also on specific memes, motifs and vocabulary
associated with it.</p>
<p>Interested? Read on! I’ll explain what I did, and describe some of my other
results.</p>
<hr>
<p>Reddit is — well, it’s pretty hard to describe what Reddit <em>is</em>, mainly because
Reddit comprises several thousand communities, called <em>subreddits</em>, which center
around topics broad (<code>/r/Sports</code>) and niche (<code>/r/thinkpad</code>), delightful
(<code>/r/aww</code>) and unsavory (<code>/r/Incels</code>).</p>
<p>Each subreddit is a unique community with its own rules, culture and standards.
Some are welcoming and inclusive, and anyone can post and comment; others, not
so much: you must be invited to even read their front page. Some have pliant
standards about what is acceptable as a post; others have moderators willing to
remove posts and ban users upon any infraction of community guidelines.</p>
<p>Whatever Reddit is though, two things are for certain:</p>
<ol>
<li>
<p>It’s widely used. <em>Very</em> widely used. At the time of writing, it’s the <a href="https://www.alexa.com/topsites/countries/US">fourth
most popular website in the United
States</a> and the <a href="https://www.alexa.com/topsites">sixth most popular
globally</a>.</p>
</li>
<li>
<p>Where there is free speech, there is hate speech. Reddit’s hate speech
problem is <a href="https://www.wired.com/2015/08/reddit-mods-handle-hate-speech/">well
documented</a>,
the <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">center of recent
controversy</a>,
and even <a href="https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/">the subject of statistical
analysis</a>.</p>
</li>
</ol>
<p>Now, there are many well-known hateful subreddits. The three that I decided to
focus on were <code>/r/TheRedPill</code>, <code>/r/The_Donald</code>, and <code>/r/CringeAnarchy</code>.</p>
<p>The goal here is to understand what these subreddits are like, and expose their
culture for people to see. To quote <a href="https://www.inverse.com/article/43611-reddit-ceo-steve-huffman-hate-speech">Steve Huffman, Reddit’s
CEO</a>:</p>
<blockquote>
<p>“I believe the best defense against racism and other repugnant views, both
on Reddit and in the world, is instead of trying to control what people
can and cannot say through rules, is to repudiate these views in a free
conversation, and empower our communities to do so on Reddit.”</p>
</blockquote>
<p>And there’s no way we can refute and repudiate these deplorable views without
knowing what those views are. So instead of spending hours on each of these
subreddits ourselves, let’s have a machine learn what gets talked about on
them.</p>
<hr>
<p>Now, how do we do this? With <em>clustering</em>, a machine learning
technique in which we’re given data points and tasked with grouping them in
some way. A picture will explain better than words:</p>
<figure>
<a href="https://www.georgeho.org/assets/images/clusters.png"><img src="https://www.georgeho.org/assets/images/clusters.png" alt="Illustration of clustering"></a>
</figure>
<p>The clustering algorithm was hard to decide on. After exploring several dead
ends, I settled on non-negative matrix factorization of the document-term
matrix, featurized using tf-idf. I don’t really want to go into the technical
details now: suffice to say that this technique is <a href="http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html">known to work well for this
application</a>
(perhaps I’ll write another piece on this in the future).</p>
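<p>As a rough illustration of that pipeline (tf-idf featurization followed by
non-negative matrix factorization), here is a minimal scikit-learn sketch; the
toy comments and the number of clusters are illustrative stand-ins, not the
actual Reddit data or settings:</p>

```python
# Minimal sketch of the clustering approach: tf-idf features,
# factorized with non-negative matrix factorization (NMF).
# The corpus and n_components are toy stand-ins.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "the wall will be built brick by brick",
    "net neutrality benefits everyone on the internet",
    "alpha males versus beta males",
    "repealing net neutrality hurts the internet",
]

# Featurize each comment as a tf-idf vector.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(comments)

# Factorize the document-term matrix into a document-topic matrix W
# and a topic-term matrix H (both non-negative).
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
doc_topic = nmf.fit_transform(tfidf)

# Assign each comment to its highest-weight cluster.
clusters = doc_topic.argmax(axis=1)
```

<p>Printing the top-weighted terms in each row of <code>nmf.components_</code>
is then one way to label the clusters: it is essentially what the word clouds
below visualize.</p>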
<p>Finally, we need our data points: <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments">Google
BigQuery</a>
has all posts and comments across all of Reddit, from the beginning of
Reddit right up until the end of 2017. We decided to focus on the last two
months for which there is data: November and December, 2017.</p>
<p>I could talk at length about the technical details, but right now, I want to
focus on the results of the clustering. What follows are two hand-picked
clusters from each of the three subreddits, visualized as word clouds (you can
think of word clouds as visual representations of the code snippet above), as
well as an example comment from each of the clusters.</p>
<h2 id="rtheredpill"><code>/r/TheRedPill</code></h2>
<p>You already know <code>/r/TheRedPill</code>, so let me describe the clusters in more detail:
a good number of them are about sex, or about how to approach girls. Comments in
these clusters tend to give advice on how to pick up girls, or describe the
social/sexual exploits of the commenter.</p>
<p>What is interesting is that, as sex-obsessed as <code>/r/TheRedPill</code> is, many
swallowers (of the red pill) profess that sex is <em>not</em> the purpose of the
subreddit: the point is to become an “alpha male”. Even more interestingly,
there is more talk about what an alpha male <em>is</em>, and what kind of people
<em>aren’t</em> alpha, than there is about how people can <em>become</em> alpha. This is the
first cluster shown below, and comprises around 3% of all text on
<code>/r/TheRedPill</code>.</p>
<p>The second cluster comprises around 6% of all text on <code>/r/TheRedPill</code>, and
contains comments that expound theories on the role of men, women and feminism
in today’s society (it isn’t pretty). Personally, the most repugnant views that
I’ve read are to be found in this cluster.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>I feel like the over dramatization of beta qualities in media/pop
</span></span><span style="display:flex;"><span>culture is due to the fact that anyone representing these qualities is
</span></span><span style="display:flex;"><span>already Alpha by default.
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>The actors who play the white knight lead roles, the rock stars that
</span></span><span style="display:flex;"><span>sing about pining for some chick... these men/characters are already
</span></span><span style="display:flex;"><span>very Alpha in both looks and status, so when beta BS comes from their
</span></span><span style="display:flex;"><span>mouths, it’s seen as attractive because it balances out their already
</span></span><span style="display:flex;"><span>alpha state into that "mostly alpha but some beta" balance that makes
</span></span><span style="display:flex;"><span>women swoon.
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><figure>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/13_3.21%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/13_3.21%25.png" alt="/r/TheRedPill cluster #13"></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/06_6.41%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/TheRedPill/06_6.41%25.png" alt="/r/TheRedPill cluster #6"></a>
</figure>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Since the dawn of humanity men were always in control, held all the
</span></span><span style="display:flex;"><span>power and women were happy because of it. But now men are forced to
</span></span><span style="display:flex;"><span>lose their masculinity and power or else they'll be killed/punished by
</span></span><span style="display:flex;"><span>other pussy men with big guns and laws who believe feminism is the
</span></span><span style="display:flex;"><span>right path for humanity.
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Feminism is really a blessing in disguise because it's a wake up call
</span></span><span style="display:flex;"><span>for men and a hidden cry for help from women for men to regain their
</span></span><span style="display:flex;"><span>masculinity, integrity and control over women.
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><h2 id="rthe_donald"><code>/r/The_Donald</code></h2>
<p>You may have already heard of <code>/r/The_Donald</code> (a.k.a. the “pro-Trump cesspool”),
famed for their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Conflict_with_Reddit_management">takeover of the Reddit front
page</a>,
and their <a href="https://en.wikipedia.org/wiki//r/The_Donald#Controversies">involvement in several recent
controversies</a>. It
may therefore be surprising to learn that an iota of lucid discussion goes on
there, albeit in a jeering, bullying tone.</p>
<p><code>/r/The_Donald</code> is the subreddit that has developed the most in-group language and inside
jokes: from “nimble navigators” to “swamp creatures”, “spezzes” to the
“Trumpire”… Explaining these memes would take too long: reach out, or Google, if
you really want to know.</p>
<p>The first cluster accounts for 5% of all text on <code>/r/The_Donald</code>, and contains
(relatively) coherent arguments both for and against net neutrality. The second
cluster accounts for 1% of all text on <code>/r/The_Donald</code>, and actually comes from
the subreddit’s <code>MAGABrickBot</code>, a bot that keeps count of how many times
the word “brick” has been used in comments by automatically generating this
comment.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>So much misinformation perpetuated by the Swamp... Abolishing Net
</span></span><span style="display:flex;"><span>Neutrality would benefit swamp creatures with corporate payouts but
</span></span><span style="display:flex;"><span>would be most damaging to conservatives long term.
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Net Neutrality was NOT created by Obama, it was actually in effect
</span></span><span style="display:flex;"><span>from the very beginning...
</span></span></code></pre></div><figure>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/00_5.19%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/00_5.19%25.png" alt="/r/The_Donald cluster #0"></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/02_1.26%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/The_Donald/02_1.26%25.png" alt="/r/The_Donald cluster #2"></a>
</figure>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>**FOR THE LOVE OF GOD GET THIS PATRIOT A BRICK! THAT'S 92278 BRICKS
</span></span><span style="display:flex;"><span>HANDED OUT!**
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>We are at **14.3173880911%** of our goal to **BUILD THE WALL**
</span></span><span style="display:flex;"><span>starting from Imperial Beach, CA to Brownsville, Texas! Lets make sure
</span></span><span style="display:flex;"><span>everyone gets a brick in the United States! For every Centipede a
</span></span><span style="display:flex;"><span>brick, for every brick a Centipede!
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>At this rate, the wall will be **1071.35224786 MILES WIDE** and
</span></span><span style="display:flex;"><span>**353.552300867 FEET HIGH** by tomorrow! **DO YOUR PART!**
</span></span></code></pre></div><h2 id="rcringeanarchy"><code>/r/CringeAnarchy</code></h2>
<p>On the Internet, <em>cringe</em> is the second-hand embarrassment you feel when someone
acts extremely awkwardly or uncomfortably. And on <code>/r/CringeAnarchy</code> you can find
memes about the <em>real</em> cringe, which is, um, liberals and anyone else who
advocates for an inclusionary, equitable ideology. Their morally grey jokes run
the gamut of delicate topics: gender, race, sexuality, nationality…</p>
<p>In some respects, the clustering provided very little insight into this
subreddit: each such delicate topic had one or two clusters, and there’s nothing
really remarkable about any of them. This speaks to the inherent difficulty of
training a topic model on memes: I rant at greater length about this topic on
<a href="https://www.georgeho.org/lda-sucks/">one of my blog posts</a>.</p>
<p>Both clusters below comprise around 3% of text on <code>/r/CringeAnarchy</code>: one is to do
with race, and the other is to do with homosexuality.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>Has anyone here, non-black or otherwise, ever wished someone felt
</span></span><span style="display:flex;"><span>sorry for being black? Maybe it's just where I live... the majority is
</span></span><span style="display:flex;"><span>black. It's whatever.
</span></span></code></pre></div><figure>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/08_3.10%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/08_3.10%25.png" alt="/r/CringeAnarchy cluster #8"></a>
<a href="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/12_2.92%25.png"><img src="https://raw.githubusercontent.com/eigenfoo/reddit-clusters/master/wordclouds/images/CringeAnarchy/12_2.92%25.png" alt="/r/CringeAnarchy cluster #8"></a>
</figure>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Also, the distinction between bisexual and gay is academic. If you do
</span></span><span style="display:flex;"><span>a gay thing, you have done a gay thing. That's what "being gay" means
</span></span><span style="display:flex;"><span>to a LOT of people. Redefining it is as useful as all the other things
</span></span><span style="display:flex;"><span>SJWs are redefining.
</span></span></code></pre></div><hr>
<p>As much information as that might have been, this was just a glimpse into what
these subreddits are like: I made 20 clusters for each subreddit, and you could
argue that (for somewhat technical reasons) 20 clusters isn’t even enough!
Moreover, there is just no way I could distill everything I learned about these
communities into one Medium story: I’ve curated just the more remarkable or
provocative results to put here.</p>
<p>If you still have the stomach for this stuff, scroll through the complete log
files
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/clustering/nmf/results">here</a>,
or look through images of the word clouds
<a href="https://github.com/eigenfoo/reddit-clusters/tree/master/wordclouds/images">here</a>.</p>
<p>Finally, as has been said before, “Talk is cheap. Show me the code.” For
everything I’ve written to make these clusters, check out <a href="https://github.com/eigenfoo/reddit-clusters">this GitHub
repository</a>.</p>
<hr>
<p><strong>Update (2018-11-08):</strong> If you’re interested in the technical, data science side
of the project, check out the slide deck and speaker notes from <a href="https://www.georgeho.org/reddit-slides/">my recent
talk</a> on exactly that!</p>Why Latent Dirichlet Allocation Suckshttps://www.georgeho.org/lda-sucks/2018-03-06T00:00:00Z2018-03-06T00:00:00Z<p>As I learn more and more about data science and machine learning, I’ve noticed
that a lot of resources out there go something like this:</p>
<blockquote>
<p>Check out this thing! It’s great at this task! The important task! The one
that was impossible/hard to do before! Look how well it does! So good! So
fast!</p>
<p>Take this! It’s our algorithm/code/paper! We used it to do the thing! And now
you can do the thing too!</p>
</blockquote>
<p>Jokes aside, I do think it’s true that a lot of research and resources focus on
what things <em>can</em> do, or what things are <em>good</em> at doing. Whenever I actually
implement the hyped-up “thing”, I’m invariably frustrated when it doesn’t
perform as well as originally described.</p>
<p>Maybe I’m not smart enough to see this, but after I learn about a new technique
or tool or model, it’s not immediately obvious to me when <em>not</em> to use it. I
think it would be very helpful to learn what things <em>aren’t</em> good at doing, or
why things just plain <em>suck</em> at times. Doing so not only helps you understand
the technique/tool/model better, but also sharpens your understanding of your
use case and the task at hand: what is it about your application that makes it
unsuitable for such a technique?</p>
<p>Which is why I’m writing the first of what will (hopefully) be a series of posts
on <em>“Why [Thing] Sucks”</em>. The title is provocative but reductive: a better name
might be <em>When and Why [Thing] Might Suck</em>… but that doesn’t have quite the
same ring to it! In these articles I’ll be outlining what I tried and why it
didn’t work: documenting my failures and doing a quick post-mortem, if you will.
My hope is that this will be useful to anyone else trying to do the same thing
I’m doing.</p>
<hr>
<p>So first up: topic modelling. Specifically, <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">latent Dirichlet
allocation</a>, or LDA
for short (not to be confused with <a href="https://www.georgeho.org/lda/">the other
LDA</a>, which I wrote a blog post about before).</p>
<p>If you’ve already encountered LDA and have seen <a href="https://en.wikipedia.org/wiki/Plate_notation">plate
notation</a> before, this picture
will probably refresh your memory:</p>
<p><img src="https://www.georgeho.org/assets/images/latent-dirichlet-allocation.png" alt="Latent Dirichlet allocation"></p>
<p>If you don’t know what LDA is, fret not, for there is
<a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">no</a>
<a href="http://obphio.us/pdfs/lda_tutorial.pdf">shortage</a>
<a href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/">of</a>
<a href="https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html">resources</a>
<a href="http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation">about</a>
<a href="https://radimrehurek.com/gensim/models/ldamodel.html">this</a>
<a href="https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation">stuff</a>.
I’m going to move on to when and why LDA isn’t the best idea.</p>
<p><strong>tl;dr:</strong> <em>LDA and topic modelling don’t work well with a) short documents,
in which there isn’t much text to model, or b) documents that don’t coherently
discuss a single topic.</em></p>
<p>Wait, what? Did George just say that topic modelling sucks when there’s not much
topic, and not much text to model? Isn’t that obvious?</p>
<p><em>Yes! Exactly!</em> Of course it’s <a href="https://en.wikipedia.org/wiki/Egg_of_Columbus">obvious in
retrospect</a>! Which is why I was
so upset when I realized I spent two whole weeks faffing around with LDA when
topic models were the opposite of what I needed, and so frustrated that more
people aren’t talking about when <em>not</em> to use/do certain things.</p>
<p>But anyways, <code>&lt;/rant&gt;</code>, and let’s move on to why I say what I’m saying.</p>
<p>Recently, I’ve taken up a project in modelling the textual data on Reddit using
NLP techniques. There are, of course, many ways one could take this, but
something I was interested in was finding similarities between subreddits,
clustering comments, and visualizing these clusters somehow: what does Reddit
talk about on average? Of course, I turned to topic modelling and dimensionality
reduction.</p>
<p>The techniques that I came across first were LDA (<a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">latent Dirichlet
allocation</a>) and
t-SNE (<a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed stochastic neighbor
embedding</a>).
Both techniques are well known and well documented, but I can’t say that using
them together is a popular combination. However, there have been
some successes. For instance, <code>ShuaiW</code> had some success with this method <a href="https://web.archive.org/web/20171219104016/https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html">when
using it on the 20 newsgroups
dataset</a><sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>;
some work done by Kagglers has <a href="https://www.kaggle.com/ykhorramz/lda-and-t-sne-interactive-visualization">yielded reasonable
results</a>,
and <a href="https://stats.stackexchange.com/questions/305356/plot-latent-dirichlet-allocation-output-using-t-sne">the StackExchange community doesn’t think it’s a ridiculous
idea</a>.</p>
<p>The dataset that I applied this technique to was the <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit">Reddit dataset on Google
BigQuery</a>, which contains
data on all subreddits, posts and comments for as long as Reddit has been around.
I limited myself to the top 10 most active subreddits in December 2017 (the most
recent month for which we have data, at the time of writing), and chose 20 to be
the number of topics to model (any choice is as arbitrary as any other).</p>
<p>I ran LDA and t-SNE exactly as Shuai described on <a href="https://web.archive.org/web/20171219104016/https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html">this blog
post</a><sup id="fnref1:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>,
except using the great <a href="https://radimrehurek.com/gensim/"><code>gensim</code></a> library to
perform LDA, which was built with large corpora and efficient online algorithms
in mind. (Specifically, <code>gensim</code> implements online variational inference with
the EM algorithm, instead of the MCMC-based algorithms that <code>lda</code> uses. It
seems that variational Bayes scales better to very large corpora than collapsed
Gibbs sampling.)</p>
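<p>For a sense of what this looks like in code: scikit-learn’s
<code>LatentDirichletAllocation</code> implements the same online variational
Bayes approach, so here is a minimal sketch with a toy corpus (the actual
pipeline used <code>gensim</code> on the full Reddit data):</p>

```python
# Sketch of LDA fit with online variational Bayes, as implemented in
# scikit-learn (the post itself used gensim's equivalent LdaModel).
# Corpus and settings are toy stand-ins.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "great game last night, what a win for the team",
    "the election results and the vote counts",
    "that movie was a great film with a good story",
    "taxes, government spending, and the law",
]

# LDA models raw term counts, not tf-idf.
counts = CountVectorizer(stop_words="english").fit_transform(comments)

lda = LatentDirichletAllocation(
    n_components=2,
    learning_method="online",  # online variational Bayes
    random_state=0,
)
doc_topic = lda.fit_transform(counts)

# Each row is a per-document topic distribution (sums to 1); the
# dominant topic is the one with the greatest probability.
dominant_topic = doc_topic.argmax(axis=1)
```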
<p>Here are the results:</p>
<p><img src="https://www.georgeho.org/assets/images/lda-sucks.png" alt="LDA followed by t-SNE on the Reddit dataset"></p>
<p>Horrible, right? Nowhere near the well-separated clusters that Shuai got with
the 20 newsgroups. In fact, the tiny little huddles of around 5 to 10 comments
are probably artifacts of the dimensionality reduction done by t-SNE, so those
might even just be noise! You might say that there are at least 3 very large
clusters, but even that’s bad news! If they’re clustered together, you would
hope that they have the same topics, and that’s definitely not the case here!
These large clusters tell us that a lot of comments have roughly the same topic
distribution (i.e. they’re close to each other in the high-dimensional
topic-space), but their dominant topics (i.e. the topic with greatest
probability) don’t end up being the same.</p>
<p>By the way, t-SNE turns out to be <a href="https://distill.pub/2016/misread-tsne/">a really devious dimensionality reduction
technique</a>, and you really need to
experiment with the perplexity values in order to use it properly. I used the
default <code>perplexity=30</code> from sklearn for the previous plot, but I repeated the
visualizations for multiple other values and the results aren’t so hot either.
Note that I did these on a random subsample of 1000 comments, so as to reduce
compute time.</p>
<figure>
<a href="https://www.georgeho.org/assets/images/perplexity50.png"><img src="https://www.georgeho.org/assets/images/perplexity50.png" alt="t-SNE with perplexity value of 50"></a>
<a href="https://www.georgeho.org/assets/images/perplexity100.png"><img src="https://www.georgeho.org/assets/images/perplexity100.png" alt="t-SNE with perplexity value of 100"></a>
<figcaption>t-SNE with perplexity values of 50 and 100, respectively.</figcaption>
</figure>
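<p>Sweeping the perplexity is cheap to script. Here is a minimal sketch using
random stand-in data rather than the actual document-topic vectors:</p>

```python
# Sweep t-SNE's perplexity instead of trusting the default; the
# 100x20 Gaussian matrix stands in for LDA document-topic vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

embeddings = {}
for perplexity in (5, 30, 50):  # must stay below the sample count
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
```

<p>Plotting each 2-D embedding side by side makes it obvious how much the
apparent cluster structure depends on this one knob.</p>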
<p>So, what went wrong? There’s a <a href="https://stackoverflow.com/questions/29786985/whats-the-disadvantage-of-lda-for-short-texts">nice StackOverflow
post</a>
that describes the problem well.</p>
<p>Firstly, latent Dirichlet allocation and other probabilistic topic models are
very complex and flexible. While this means that they have very high variance
and low bias, it also means that they need a lot of data (or data with a decent
signal-to-noise ratio) for them to learn anything meaningful. Particularly for
LDA, which infers topics on a document-by-document basis, if there aren’t enough
words in a document, there simply isn’t enough data to infer a reliable topic
distribution for that document.</p>
<p>Secondly, Reddit comments are by their nature very short and very
context-dependent, since they respond to a post or another comment. So not only are
Reddit comments just short: it’s actually worse than that! They don’t even
discuss a certain topic coherently (by which I mean, they don’t necessarily use
words that pertain to what they’re talking about). I’ll give an example:</p>
<pre tabindex="0"><code>"I'm basing my knowledge on the fact that I watched the fucking rock fall."
</code></pre><p>Now, stopwords compose a little less than half of this comment, and they would
be stripped before LDA even looks at it. But that aside, what is this comment
about? What does the rock falling mean? What knowledge is this user claiming?
It’s a very confusing comment, but probably made complete sense in the context
of the post it responded to and the comments that came before it. As it is,
however, it’s impossible for <em>me</em> to figure out what topic this comment is about,
let alone an algorithm!</p>
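<p>To make the stopword point concrete, here is a quick sketch using
scikit-learn’s built-in English stopword list (a stand-in for whatever list a
real pipeline would use):</p>

```python
# Strip stopwords from the example comment the way a topic-model
# preprocessing step would, and see how little content survives.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

comment = (
    "I'm basing my knowledge on the fact that "
    "I watched the fucking rock fall."
)

# Crude tokenization: split on whitespace, strip punctuation, lowercase.
tokens = [word.strip(".,!?").lower() for word in comment.split()]
content = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
```

<p>The handful of surviving content words still gives an algorithm almost
nothing topical to latch onto.</p>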
<p>Also, just to drive the point home, here are the top 10 words in each of the 20
topics that LDA came up with, on the same dataset as before:</p>
<pre tabindex="0"><code>Topic #0:
got just time day like went friend told didn kids
Topic #1:
just gt people say right doesn know law like government
Topic #2:
removed com https www https www tax money http watch news
Topic #3:
people don just like think really good know want things
Topic #4:
years time did great ago ve just work life damn
Topic #5:
movie like love just really school star movies film story
Topic #6:
like just fucking shit head car looks new makes going
Topic #7:
game team season year good win play teams playing best
Topic #8:
right thing yeah don think use internet ok water case
Topic #9:
going like work just need way want money free fuck
Topic #10:
better just play games make ve ll seen lol fun
Topic #11:
like don know did feel shit big man didn guys
Topic #12:
deleted fuck guy year old man amp year old state lmao
Topic #13:
sure believe trump wrong saying comment post mueller evidence gt
Topic #14:
gt yes https com good oh wikipedia org en wiki
Topic #15:
think like good 10 look point lebron just pretty net
Topic #16:
gt said fucking american agree trump thanks obama states did
Topic #17:
trump vote party republicans election moore president republican democrats won
Topic #18:
war world country israel countries china military like happy does
Topic #19:
reddit message askreddit post questions com reddit com subreddit compose message compose
</code></pre><p>Now, it’s not entirely bad: topic 2 seems like it’s collecting the tokens from links
(I didn’t stopword those out, oops), topic 7 looks like it’s about football or
some other sport, topic 13 is probably about American politics, and topic 18 looks like
it’s about world news, etc.</p>
<p>But almost all other topics are just collections of words: it’s not immediately
obvious to me what each topic represents.</p>
<p>So yeah, there you have it, LDA really sucks sometimes.</p>
<hr>
<p><strong>Update (8/12/2018):</strong> In retrospect, I think that this whole blog post is
summarized well in the following tweet thread. Clustering algorithms will give
you clusters because that’s what they do, not because there actually <em>are</em>
clusters. In this case, extremely short and context-dependent documents make it
hard to justify that there are topic clusters in the first place.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Algorithms that have to report something will always report something, even if it's a bad idea. Please do not use these algorithms unless you have principled reasons why there should be something. <a href="https://t.co/kzxZiuBfmm">https://t.co/kzxZiuBfmm</a></p>— \mathfrak{Michael Betancourt} (@betanalpha) <a href="https://twitter.com/betanalpha/status/1026619046626828288?ref_src=twsrc%5Etfw">August 7, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p><a href="https://github.com/ShuaiW"><code>ShuaiW</code></a> has since taken down his blog, so I
am linking to the Internet Archive of his blog post instead. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a> <a href="#fnref1:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</div>~~Fruit~~ Loops and Learning - The LUPI Paradigm and SVM+https://www.georgeho.org/lupi/2018-01-30T00:00:00Z2018-01-30T00:00:00Z<p>Here’s a short story you might know: you have a black box, whose name is
<em>Machine Learning Algorithm</em>. It’s got two modes: training mode and testing
mode. You set it to training mode, and throw in a lot (sometimes <em>a lot</em> a lot)
of ordered pairs $(x_i, y_i), 1 \leq i \leq l$. Here, the $x_i$ are called
the <em>examples</em> and the $y_i$ are called the <em>targets</em>. Then, you set it to
testing mode and throw in some more examples, for which you don’t have the
corresponding targets. You hope the $y_i$s that come out are in some sense
the “right” ones.</p>
<p>Generally speaking, this is a parable of <em>supervised learning</em>. However, Vapnik
(the inventor of the
<a href="https://en.wikipedia.org/wiki/Support_vector_machine">SVM</a>) recently described
a new way to think about machine learning (e.g.
<a href="http://jmlr.csail.mit.edu/papers/volume16/vapnik15b/vapnik15b.pdf">here</a>):
<em>learning using privileged information</em>, or <em>LUPI</em> for short.</p>
<p>This post is meant to introduce the LUPI paradigm of machine learning to
people who are generally familiar with supervised learning and SVMs, and are
interested in seeing the math and intuition behind both things extended to the
LUPI paradigm.</p>
<h2 id="what-is-lupi">What is LUPI?</h2>
<p>The main idea is that instead of two-tuples $(x_i, y_i)$, the black box is fed
three-tuples $(x_i, x_i^{*}, y_i)$, where the $x^{*}$s are the so-called
<em>privileged information</em> that is only available during training, and not during
testing. The hope is that this information will train the model to better
generalize during the testing phase.</p>
<p>Vapnik offers many examples in which LUPI can be applied in real life: in
bioinformatics and proteomics (where advanced biological models, which the
machine might not necessarily “understand”, serve as the privileged
information), in financial time series analysis (where future movements of the
time series are the unknown at prediction time, but are available
retrospectively), and in the classic MNIST dataset, where the images were
converted to a lower resolution, but each annotated with a “poetic description”
(which was available for the training data but not for the testing data).</p>
<p>Vapnik’s team ran tests on well-known datasets in all three application areas
and found that his newly-developed LUPI methods performed noticeably better than
classical SVMs in both convergence time (i.e. the number of examples necessary
to achieve a certain degree of accuracy) and estimation of a good predictor
function. In fact, Vapnik’s proof-of-concept experiments are so whacky that
they actually <a href="https://nautil.us/issue/6/secret-codes/teaching-me-softly">make for an entertaining read
</a>!</p>
<h2 id="classical-svms-separable-and-non-separable-case">Classical SVMs (separable and non-separable case)</h2>
<p>There are many ways of thinking about SVMs, but I think that the one that is
most instructive here is to think of them as solving the following optimization
problem:</p>
<blockquote>
<p>Minimize $ \frac{1}{2} |w|^2 $</p>
<p>subject to $y_i [ w \cdot x_i + b ] \geq 1, 1 \leq i \leq l$.</p>
</blockquote>
<p>Basically all this is saying is that we want to find the hyperplane that
separates our data by the maximum margin. More technically speaking, this finds
the parameters ($w$ and $b$) of the maximum margin hyperplane, with $l_2$
regularization.</p>
<p>In the non-separable case, we concede that our hyperplane may not classify all
examples perfectly (or that it may not be desirable to do so: think of
overfitting), and so we introduce a so-called <em>slack variable</em> $\xi_i \geq 0$
for each example $i$, which measures the severity of misclassification of that
example. With that, the optimization becomes:</p>
<blockquote>
<p>Minimize $\frac{1}{2} |w|^2 + C\sum_{i=1}^{l}{\xi_i}$</p>
<p>subject to $y_i [ w \cdot x_i + b ] \geq 1 - \xi_i, \xi_i \geq 0, 1
\leq i \leq l$.</p>
</blockquote>
<p>where $C$ is some regularization parameter.</p>
<p>This says the same thing as the previous optimization problem, but now allows
points to be (a) classified properly ($\xi_i = 0$), (b) within the margin but
still classified properly ($0 < \xi_i < 1$), or (c) misclassified
($1 \leq \xi_i$).</p>
<p>In both the separable and non-separable cases, the decision rule is simply
$\hat{y} = \text{sign}(w \cdot x + b)$.</p>
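<p>To make the optimization problems concrete, here is a minimal sketch using
scikit-learn’s <code>SVC</code> on a synthetic non-separable problem; the two-blob
dataset and the choice $C = 1$ are illustrative assumptions, not anything from
the text above:</p>

```python
# Soft-margin linear SVM on two overlapping Gaussian blobs (synthetic data).
# C plays the same role as in the optimization problem: it weights the slack sum.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# The decision rule is sign(w . x + b), which reproduces clf.predict.
pred = np.sign(X @ w + b).astype(int)
```

<p>Examples with $\xi_i > 0$ are exactly those that land inside the margin or on
the wrong side of the hyperplane.</p>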
<p>An important thing to note is that, in the separable case, the SVM uses $l$
examples to estimate the $n$ components of $w$, whereas in the nonseparable
case, the SVM uses $l$ examples to estimate $n+l$ parameters: the $n$
components of $w$ and $l$ values of slacks $\xi_i$. Thus, in the
non-separable case, the number of parameters to be estimated is always larger
than the number of examples: it does not matter here that most of the slacks may be
equal to zero: the SVM still has to estimate all of them.</p>
<p>The way both optimization problems are actually <em>solved</em> is fairly involved (they
require <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange
multipliers</a>), but in terms
of getting an intuitive feel for how SVMs work, I think that examining the
optimization problems suffices!</p>
<h2 id="what-is-svm">What is SVM+?</h2>
<p>In his paper introducing the LUPI paradigm, Vapnik outlines <em>SVM+</em>, a
modified form of the SVM that fits well into the LUPI paradigm, using privileged
information to improve performance. It should be emphasized that LUPI is a
paradigm - a way of thinking about machine learning - and not just a collection
of algorithms. SVM+ is just one technique that interoperates with the LUPI
paradigm.</p>
<p>The innovation of the SVM+ algorithm is that it uses the privileged information
to estimate the slack variables. Given the training three-tuple $(x, x^{*},
y)$, we map $x$ to the feature space $Z$, and $x^{*}$ to a separate feature
space $Z^{*}$. Then, the decision rule is $\hat{y} = \text{sign}(w \cdot x +
b)$ and the slack variables are estimated by $\xi = w^{*} \cdot x^{*} +
b^{*}$.</p>
<p>In order to find $w$, $b$, $w^{*}$ and $b^{*}$, we solve the following
optimization problem:</p>
<blockquote>
<p>Minimize $\frac{1}{2} (|w|^2 + \gamma |w^{*}|^2) +
C \sum_{i=1}^{l}{(w^{*} \cdot x_i^{*} + b^{*})}$</p>
<p>subject to $y_i [ w \cdot x_i + b ] \geq 1 - (w^{*} \cdot x_i^{*} + b^{*}),
(w^{*} \cdot x_i^{*} + b^{*}) \geq 0, 1 \leq i \leq l$.</p>
</blockquote>
<p>where $\gamma$ indicates the extent to which the slack estimation should be
regularized in comparison to the SVM. Notice how this optimization problem is
essentially identical to the non-separable classical SVM, except the slacks
$\xi_i$ are now estimated with $w^{*} \cdot x_i^{*} + b^{*}$.</p>
<p>Again, the method of actually solving this optimization problem involves
Lagrange multipliers and quadratic programming, but I think the intuition is
captured in the optimization problem statement.</p>
<h2 id="interpretation-of-svm">Interpretation of SVM+</h2>
<p>The SVM+ has a very ready interpretation. Instead of a single feature space, it
has two: one in which the non-privileged information lives (where decisions are
made), and one in which the privileged information lives (where slack variables
are estimated).</p>
<p>But what’s the point of this second feature space? How does it help us? Vapnik
terms this problem <em>knowledge transfer</em>: it’s all well and good for us to learn
from the privileged information, but it’s all for naught if we can’t use this
newfound knowledge in the test phase.</p>
<p>The way knowledge transfer is resolved here is by assuming that <em>examples in the
training set that are hard to separate in the privileged space, are also hard to
separate in the regular space</em>. Therefore, we can use the privileged information
to obtain an estimate for the slack variables.</p>
<p>Of course, SVMs are a technique with many possible interpretations, of which my
presentation (in terms of the optimization of $w$ and $b$) is just one. For
example, it’s possible to think of SVMs in terms of kernels functions, or as
linear classifiers minimizing hinge loss. In all cases, it’s possible and
worthwhile to understand that interpretation of SVMs, and how the LUPI paradigm
contributes to or extends that interpretation. I’m hoping to write a piece later
to explain these exact topics.</p>
<p>Vapnik also puts a great emphasis on analyzing SVM+ based on its statistical
learning theoretic properties (in particular, analyzing its rate of convergence
via the <a href="https://en.wikipedia.org/wiki/VC_dimension">VC dimension</a>). Vapnik was
one of the main pioneers behind statistical learning theory, and has written an
<a href="https://www.amazon.com/Statistical-Learning-Theory-Vladimir-Vapnik/dp/0471030031">entire
book</a>
on this stuff <del>which I have not read</del>, so I’ll leave that part aside for now. I
hope to understand this stuff one day.</p>
<h2 id="implementation-of-svm">Implementation of SVM+</h2>
<p>There’s just one catch: SVM+ is actually a fairly inefficient algorithm, and
definitely will not scale to large data sets. What’s so bad about it? <em>It has
$n$ training examples but $2n$ variables to estimate.</em> This is twice as many
variables to estimate as the standard formulation of the <a href="https://en.wikipedia.org/wiki/Support_vector_machine#Computing_the_SVM_classifier">vanilla
SVM</a>.
This isn’t something that we can patch: the problem is inherent to the
Lagrangian dual formulation that Vapnik and Vashist proposed in 2009.</p>
<p>Even worse, the optimization problem has constraints that are very different
from those of the standard SVM. In essence, this means that efficient,
out-of-the-box solvers for the standard SVM (e.g.
<a href="https://www.csie.ntu.edu.tw/~cjlin/libsvm/">LIBSVM</a> and
<a href="https://www.csie.ntu.edu.tw/~cjlin/liblinear/">LIBLINEAR</a>) can’t be used to
train an SVM+ model.</p>
<p>Luckily, <a href="https://www.researchgate.net/publication/301880839_Simple_and_Efficient_Learning_using_Privileged_Information">a recent paper by Xu et
al.</a>
describes a neat mathematical trick to implement SVM+ in a simple and efficient
way. With this amendment, the authors rechristen the algorithm as SVM2+.
Essentially, instead of using the hinge loss when training SVM+, we will instead
use the <em>squared</em> hinge loss. It turns out that changing the loss function in
this way leads to a tiny miracle.</p>
<p>This (re)formulation of SVM+ becomes <em>identical</em> to that of the standard SVM,
except we replace the Gram matrix (a.k.a. kernel matrix) $\bf K$ by $\bf K +
\bf Q_\lambda \odot (\bf y y^T)$, where</p>
<ul>
<li>$\bf y$ is the target vector</li>
<li>$\odot$ denotes the Hadamard product</li>
<li>$\bf{Q_\lambda}$ is given by $Q_\lambda = \frac{1}{\lambda} (\tilde{K}
(\frac{\lambda}{C} I_n + \tilde{K})^{-1} \tilde{K})$, and</li>
<li>$\bf \tilde{K}$ is the Gram matrix formed by the privileged information</li>
</ul>
<p>So by replacing the hinge loss with the squared hinge loss, the SVM+ formulation
can now be solved with existing libraries!</p>
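<p>The whole correction fits in a few lines of NumPy. The sketch below is an
illustrative toy (synthetic data, linear kernels, arbitrary $\lambda$ and $C$),
not Xu et al.’s reference implementation: it builds $\bf Q_\lambda$ from the
privileged Gram matrix and hands the corrected matrix to an off-the-shelf
precomputed-kernel SVM:</p>

```python
# SVM2+ sketch: train a standard SVM on the corrected Gram matrix
# K + Q_lambda ⊙ (y yᵀ), where Q_lambda is built from the privileged kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n, lam, C = 80, 1.0, 1.0

# Toy setup: x is a noisy view of a latent signal, x_star (privileged,
# training-only) is a much cleaner view of the same signal.
signal = rng.normal(size=(n, 1))
y = np.where(signal.ravel() > 0, 1, -1)
X = np.hstack([signal + 0.8 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
X_star = signal + 0.1 * rng.normal(size=(n, 1))

K = X @ X.T                    # linear kernel on the regular features
K_tilde = X_star @ X_star.T    # linear kernel on the privileged features

# Q_lambda = (1/lam) * K_tilde (lam/C I + K_tilde)^-1 K_tilde
Q = K_tilde @ np.linalg.solve((lam / C) * np.eye(n) + K_tilde, K_tilde) / lam
K_plus = K + Q * np.outer(y, y)  # elementwise (Hadamard) product with y yᵀ

svm2plus = SVC(kernel="precomputed", C=C).fit(K_plus, y)

# At test time there is no privileged information, so test rows use the plain
# kernel K(x_test, x_train); here we score the training inputs themselves.
train_acc = svm2plus.score(K, y)
```

<p>By the Schur product theorem the correction term is positive semi-definite, so
the corrected matrix is still a valid kernel matrix.</p>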
<h2 id="extensions-to-svm">Extensions to SVM+</h2>
<p>In his paper, Vapnik makes it clear that LUPI is a very general and abstract
paradigm, and as such there is plenty of room for creativity and innovation -
not just in researching and developing new LUPI methods and algorithms, but also
in implementing and applying them. It is unknown how to best go about supplying
privileged information so as to get good performance. How should the data be
feature engineered? How much signal should be in the privileged information?
These are all open questions.</p>
<p>Vapnik himself opens up three avenues to extend the SVM+ algorithm:</p>
<ol>
<li><em>a mixture model of slacks:</em> when slacks are estimated by a mixture of a
smooth function and some prior</li>
<li><em>a model where privileged information is available only for a part of the
training data:</em> where we can only supply privileged information on a small
subset of the training examples</li>
<li><em>multiple-space privileged information:</em> where the privileged information we
can supply does not all share the same features</li>
</ol>
<p>Clearly, there’s a lot of potential in the LUPI paradigm, as well as a lot of
reasons to be skeptical. It’s very much a nascent perspective of machine
learning, so I’m interested in keeping an eye on it going forward. I’m hoping
to write more posts on LUPI in the future!</p>Linear Discriminant Analysis for Startershttps://www.georgeho.org/lda/2017-12-30T00:00:00Z2017-12-30T00:00:00Z<p><em>Linear discriminant analysis</em> (commonly abbreviated to LDA, and not to be
confused with <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">the other
LDA</a>) is a very
common dimensionality reduction technique for classification problems. However,
that’s something of an understatement: it does so much more than “just”
dimensionality reduction.</p>
<p>In plain English, if you have high-dimensional data (i.e. a large number of
features) from which you wish to classify observations, LDA will help you
transform your data so as to make the classes as distinct as possible. More
rigorously, LDA will find the linear projection of your data into a
lower-dimensional subspace that optimizes some measure of class separation. The
dimension of this subspace is necessarily strictly less than the number of
classes.</p>
<p>This separation-maximizing property of LDA makes it so good at its job that it’s
sometimes considered a classification algorithm in and of itself, which leads to
some confusion. <em>Linear discriminant analysis</em> is a form of dimensionality
reduction, but with a few extra assumptions, it can be turned into a classifier.
(Avoiding these assumptions gives its relative, <em>quadratic discriminant
analysis</em>, but more on that later). Somewhat confusingly, some authors call the
dimensionality reduction technique “discriminant analysis”, and only prepend the
“linear” once we begin classifying. I actually like this naming convention more
(it tracks the mathematical assumptions a bit better, I think), but most people
nowadays call the entire technique “LDA”, so that’s what I’ll call it.</p>
<p>The goal of this post is to give a comprehensive introduction to, and
explanation of, LDA. I’ll look at LDA in three ways:</p>
<ol>
<li>LDA as an algorithm: what does it do, and how does it do it?</li>
<li>LDA as a theorem: a mathematical derivation of LDA</li>
<li>LDA as a machine learning technique: practical considerations when using LDA</li>
</ol>
<p>This is a lot for one post, but my hope is that there’s something in here for
everyone.</p>
<div>
<h2>Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#lda-as-an-algorithm">LDA as an Algorithm</a>
<ul>
<li><a href="#problem-statement">Problem statement</a></li>
<li><a href="#solution">Solution</a></li>
</ul>
</li>
<li><a href="#lda-as-a-theorem">LDA as a Theorem</a></li>
<li><a href="#lda-as-a-machine-learning-technique">LDA as a Machine Learning Technique</a>
<ul>
<li><a href="#regularization-aka-shrinkage">Regularization (a.k.a. shrinkage)</a></li>
<li><a href="#lda-as-a-classifier">LDA as a classifier</a></li>
<li><a href="#close-relatives-pca-qda-anova">Close relatives: PCA, QDA, ANOVA</a></li>
</ul>
</li>
</ul>
</nav>
</div>
<h2 id="lda-as-an-algorithm">LDA as an Algorithm</h2>
<h3 id="problem-statement">Problem statement</h3>
<p>Before we dive into LDA, it’s good to get an intuitive grasp of what LDA
tries to accomplish.</p>
<p>Suppose that:</p>
<ol>
<li>You have very high-dimensional data, and that</li>
<li>You are dealing with a classification problem</li>
</ol>
<p>This could mean that the number of features is greater than the number of
observations, or it could mean that you suspect there are noisy features that
contain little information, or anything in between.</p>
<p>Given that this is the problem at hand, you wish to accomplish two things:</p>
<ol>
<li>Reduce the number of features (i.e. reduce the dimensionality of your feature
space), and</li>
<li>Preserve (or even increase!) the “distinguishability” of your classes or the
“separatedness” of the classes in your feature space.</li>
</ol>
<p>This is the problem that LDA attempts to solve. It should be fairly obvious why
this problem might be worth solving.</p>
<p>To judiciously appropriate a term from signal processing, we are interested in
increasing the signal-to-noise ratio of our data, by both extracting or
synthesizing features that are useful in classifying our data (amplifying our
signal), and throwing out the features that are not as useful (attenuating our
noise).</p>
<p>Below is a simple illustration I made, inspired by <a href="https://www.quora.com/Can-you-explain-the-comparison-between-principal-component-analysis-and-linear-discriminant-analysis-in-dimensionality-reduction-with-MATLAB-code-Which-one-is-more-efficient">Sebastian
Raschka</a>,
that may help our intuition about the problem:</p>
<p><img src="https://www.georgeho.org/assets/images/lda-pic.png" alt="Projections of two-dimensional data (in two clusters) onto the x and y axes"></p>
<p>A couple of points to make:</p>
<ul>
<li>LD1 and LD2 are among the projections that LDA would consider. In reality, LDA
would consider <em>all possible</em> projections, not just those along the x and y
axes.</li>
<li>LD1 is the one that LDA would actually come up with: this projection gives the
best “separation” of the two classes.</li>
<li>LD2 is a horrible projection by this metric: both classes get horribly
overlapped… (this actually relates to PCA, but more on that later)</li>
</ul>
<p><strong>UPDATE:</strong> For another illustration, Rahul Sangole made a simple but great
interactive visualization of LDA
<a href="https://rsangole.shinyapps.io/LDA_Visual/">here</a> using
<a href="https://shiny.rstudio.com/">Shiny</a>.</p>
<h3 id="solution">Solution</h3>
<p>First, some definitions:</p>
<p>Let:</p>
<ul>
<li>$n$ be the number of classes</li>
<li>$\mu$ be the mean of all observations</li>
<li>$N_i$ be the number of observations in the $i$th class</li>
<li>$\mu_i$ be the mean of the $i$th class</li>
<li>$\Sigma_i$ be the <a href="https://en.wikipedia.org/wiki/Scatter_matrix">scatter
matrix</a> of the $i$th class</li>
</ul>
<p>Now, define $S_W$ to be the <em>within-class scatter matrix</em>, given by</p>
<p>$$
\begin{align*}
S_W = \sum_{i=1}^{n}{\Sigma_i}
\end{align*}
$$</p>
<p>and define $S_B$ to be the <em>between-class scatter matrix</em>, given by</p>
<p>$$
\begin{align*}
S_B = \sum_{i=1}^{n}{N_i (\mu_i - \mu) (\mu_i - \mu)^T}
\end{align*}
$$</p>
<p><a href="https://en.wikipedia.org/wiki/Diagonalizable_matrix">Diagonalize</a> $S_W^{-1}
S_B$ to get its eigenvalues and eigenvectors.</p>
<p>Pick the $k$ largest eigenvalues, and their associated eigenvectors. We will
project our observations onto the subspace spanned by these vectors.</p>
<p>Concretely, what this means is that we form the matrix $A$, whose columns are the
$k$ eigenvectors chosen above. $A$ will allow us to transform our
observations into the new subspace via the equation $y = A^T x$, where $y$ is
our transformed observation, and $x$ is our original observation.</p>
<p>And that’s it!</p>
<p>For a more detailed and intuitive explanation of the LDA “recipe”, see
<a href="http://sebastianraschka.com/Articles/2014_python_lda.html">Sebastian Raschka’s blog post on
LDA</a>.</p>
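<p>The recipe above is short enough to sketch directly in NumPy (on synthetic
two-cluster data; in practice you would reach for scikit-learn instead):</p>

```python
# LDA as an algorithm: build S_W and S_B, diagonalize S_W^-1 S_B, and project
# onto the subspace spanned by the top-k eigenvectors.
import numpy as np

def lda_projection(X, labels, k):
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)   # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)     # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]   # largest eigenvalues first
    A = eigvecs[:, order[:k]].real           # columns are the chosen eigenvectors
    return X @ A                             # y = A^T x, applied row-wise

# Two well-separated 3-D clusters, projected down to one dimension:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 3)), rng.normal(4.0, 1.0, (40, 3))])
labels = np.array([0] * 40 + [1] * 40)
Y = lda_projection(X, labels, k=1)
```

<p>On this data the projected clusters stay far apart relative to their spread,
which is exactly the separation the eigendecomposition optimizes for.</p>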
<h2 id="lda-as-a-theorem">LDA as a Theorem</h2>
<p><strong>Sketch of Derivation:</strong></p>
<p>In order to maximize class separability, we need some way of measuring it as a
number. This number should be bigger when the between-class scatter is bigger,
and smaller when the within-class scatter is larger. There are many such
formulas/numbers that have this property: <a href="https://www.elsevier.com/books/introduction-to-statistical-pattern-recognition/fukunaga/978-0-08-047865-4">Fukunaga’s <em>Introduction to
Statistical Pattern
Recognition</em></a>
considers no less than four! Here, we’ll concern ourselves with just one:</p>
<p>$$ J_1 = tr(S_{WY}^{-1} S_{BY}) $$</p>
<p>where I denote the within and between-class scatter matrices of the projected
vector $Y$ by $S_{WY}$ and $S_{BY}$, to avoid confusion with the
corresponding matrices for the original vector $X$.</p>
<p>Now, a standard result from probability is that for any random variable $X$
and matrix $A$, we have $cov(A^T X) = A^T cov(X) A$. We’ll apply this
result to our projection $y = A^T x$. It follows that</p>
<p>$$ S_{WY} = A^T S_{WX} A $$</p>
<p>and</p>
<p>$$ S_{BY} = A^T S_{BX} A $$</p>
<p>where $S_{BX}$ and $S_{BY}$ are the between-class scatter matrices, and
$S_{WX}$ and $S_{WY}$ are the within-class scatter matrices, for $X$
and its projection $Y$, respectively.</p>
<p>It’s now a simple matter to write $J_1$ in terms of $A$, and maximize
$J_1$. Without going into the details, we set $\frac{\partial J_1}{\partial
A} = 0$ (whatever that means), and use the fact that <a href="https://math.stackexchange.com/questions/546155/proof-that-the-trace-of-a-matrix-is-the-sum-of-its-eigenvalues">the trace of a matrix is
the sum of its
eigenvalues</a>.</p>
<p>I don’t want to go into the weeds with this here, but if you really want to see
the algebra, Fukunaga is a great resource. The end result, however, is the same
condition on the eigenvalues and eigenvectors as stated above: in other words,
the optimization gives us LDA as presented.</p>
<p>There’s one more quirk of LDA that’s very much worth knowing. Suppose you have
10 classes, and you run LDA. It turns out that the <em>maximum</em> number of features
LDA can give you is one less than the number of classes, so in this case, 9!</p>
<p><strong>Proposition:</strong> $S_W^{-1} S_B$ has at most $n-1$ non-zero eigenvalues, which
implies that LDA must reduce the dimension to <em>at most</em> $n-1$.</p>
<p>To prove this, we first need a lemma.</p>
<p><strong>Lemma:</strong> Suppose $\{v_i\}_{i=1}^{n}$ is a set of linearly dependent vectors, and
let $\alpha_i$ be $n$ coefficients. Then, $M = \sum_{i=1}^{n}{\alpha_i v_i
v_i^{T}}$, a linear combination of outer products of the vectors with
themselves, is rank deficient.</p>
<p><strong>Proof:</strong> The row space of $M$ is generated by the set of vectors ${v_1, v_2,
…, v_n}$. However, because this set of vectors is linearly dependent, it must
span a vector space of dimension strictly less than $n$, or in other words
less than or equal to $n-1$. But the dimension of the row space is precisely
the rank of the matrix $M$. Thus, $rank(M) \leq n-1$, as desired.</p>
<p>With the lemma, we’re now ready to prove our proposition.</p>
<p><strong>Proof:</strong> We have that</p>
<p>$$
\begin{align*}
\frac{1}{n} \sum_{i=1}^{n}{\mu_i} = \mu \implies \sum_{i=1}^{n}{\mu_i-\mu} = 0
\end{align*}
$$</p>
<p>So $\{\mu_i-\mu\}_{i=1}^{n}$ is a linearly dependent set. Applying our lemma, we
see that</p>
<p>$$ S_B = \sum_{i=1}^{n}{N_i (\mu_i-\mu)(\mu_i-\mu)^{T}} $$</p>
<p>must be rank deficient. Thus, $rank(S_W) \leq n-1$. Now, $rank(AB) \leq
rank(A)rank(B)$, so</p>
<p>$$
\begin{align*}
rank(S_W^{-1}S_B) \leq \min{(rank(S_W^{-1}), rank(S_B))} = n-1
\end{align*}
$$</p>
<p>as desired.</p>
<h2 id="lda-as-a-machine-learning-technique">LDA as a Machine Learning Technique</h2>
<p>OK so we’re done with the math, but how is LDA actually used in practice? One of
the easiest ways is to look at how LDA is actually implemented in the real
world. <code>scikit-learn</code> has <a href="http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis">a very well-documented implementation of
LDA</a>:
I find that reading the docs is a great way to learn stuff.</p>
<p>Below are a few miscellaneous comments on practical considerations when using
LDA.</p>
<h3 id="regularization-aka-shrinkage">Regularization (a.k.a. shrinkage)</h3>
<p><code>scikit-learn</code>’s implementation of LDA has an interesting optional parameter:
<code>shrinkage</code>. What’s that about?</p>
<p><a href="https://stats.stackexchange.com/questions/106121/does-it-make-sense-to-combine-pca-and-lda/109810#109810">Here’s a wonderful Cross Validated
post</a>
on how LDA can introduce overfitting. In essence, matrix inversion is an
extremely sensitive operation (in that small changes in the matrix may lead to
large changes in its inverse, so that even a tiny bit of noise will be amplified
upon inverting the matrix), and so unless the estimate of the within-class
scatter matrix $S_W$ is very good, its inversion is likely to introduce
overfitting.</p>
<p>One way to combat that is through regularizing LDA. It basically replaces
$S_W$ with $(1-t)S_W + tI$, where $I$ is the identity matrix, and $t$ is
the <em>regularization parameter</em>, or the <em>shrinkage constant</em>. That’s what
<code>scikit</code>’s <code>shrinkage</code> parameter is: it’s $t$.</p>
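<p>In code, this looks like the following (note that scikit-learn only applies
shrinkage with the <code>lsqr</code> or <code>eigen</code> solvers; the default
<code>svd</code> solver ignores it):</p>

```python
# Shrinkage in scikit-learn's LDA. The synthetic setup below (few observations,
# relatively many features) is exactly the regime where S_W is poorly estimated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (15, 20)), rng.normal(0.5, 1.0, (15, 20))])
y = np.array([0] * 15 + [1] * 15)

plain = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=None).fit(X, y)
# shrinkage="auto" picks t via the Ledoit-Wolf estimate; a float in [0, 1]
# sets t directly in (1 - t) * S_W + t * I.
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
```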
<p>If you’re interested in <em>why</em> this linear combination of the within-class
scatter and the identity give such a well-conditioned estimate of $S_W$, check
out <a href="https://www.semanticscholar.org/paper/A-well-conditioned-estimator-for-large-dimensional-Ledoit-Wolf/23d8219db1aff006b41007effc696fca6fbcabcf">the original paper by Ledoit and
Wolf</a>.
Their original motivation was in financial portfolio optimization, so they’ve
also authored several other papers
(<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=433840&rec=1&srcabs=290916&alg=7&pos=6">here</a>
and
<a href="https://www.semanticscholar.org/paper/A-well-conditioned-estimator-for-large-dimensional-Ledoit-Wolf/23d8219db1aff006b41007effc696fca6fbcabcf">here</a>)
that go into the more financial details. That needn’t concern us though:
covariance matrices are literally everywhere.</p>
<p>For an illustration of this, <code>amoeba</code>’s post on Cross Validated gives a good
example of LDA overfitting, and how regularization can help combat that.</p>
<h3 id="lda-as-a-classifier">LDA as a classifier</h3>
<p>We’ve talked a lot about how LDA is a dimensionality reduction technique. But in
addition to it, you can make two extra assumptions, and LDA becomes a very
robust classifier as well! Here they are:</p>
<ol>
<li>Assume that the class conditional distributions are Gaussian, and</li>
<li>Assume that these Gaussians have the same covariance matrix (a.k.a.
assume <a href="https://en.wikipedia.org/wiki/Homoscedasticity">homoskedasticity</a>)</li>
</ol>
<p>Now, <em>how</em> LDA acts as a classifier is a bit complicated: the problem is solved
fairly easily if there are only two classes. In this case, the optimal Bayesian
solution is to classify the observation depending on whether the log of the
likelihood ratio is less than or greater than some threshold. This turns out to
be a simple dot product: $\vec{w} \cdot \vec{x} > c$, where $\vec{w} =
\Sigma^{-1} (\vec{\mu_1} - \vec{\mu_2})$. <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis#LDA_for_two_classes">Wikipedia has a good derivation of
this</a>.</p>
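<p>For the two-class case, the whole classifier is a handful of NumPy lines. The
sketch below estimates the shared $\Sigma$ by pooling the two sample
covariances and uses the midpoint between the projected class means as the
threshold $c$ (an equal-priors assumption):</p>

```python
# Two-class LDA classifier: w = Sigma^-1 (mu1 - mu2), classify by w . x vs c.
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)  # class 1
X2 = rng.multivariate_normal([2.0, 2.0], cov, size=200)  # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2  # pooled

w = np.linalg.solve(Sigma, mu1 - mu2)
c = w @ (mu1 + mu2) / 2      # midpoint threshold, assuming equal priors

pred1 = X1 @ w > c           # True -> assigned to class 1
pred2 = X2 @ w > c
accuracy = (pred1.mean() + (1.0 - pred2.mean())) / 2
```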
<p>There isn’t really a nice dot-product solution for the multiclass case. So,
what’s commonly done is to take a “one-against-the-rest” approach, in which
there are $k$ binary classifiers, one for each of the $k$ classes. Another
common technique is to take a pairwise approach, in which there are $k(k-1)/2$
classifiers, one for each pair of classes. In either case, the outputs of all
the classifiers are combined in some way to give the final classification.</p>
<h3 id="close-relatives-pca-qda-anova">Close relatives: PCA, QDA, ANOVA</h3>
<p>LDA is similar to a lot of other techniques, and the fact that they all go by
acronyms doesn’t do anyone a favor. My goal here isn’t to introduce or explain
these various techniques, but rather point out their differences.</p>
<p><em>1) Principal components analysis (PCA):</em></p>
<p>LDA is very similar to <a href="http://setosa.io/ev/principal-component-analysis">PCA</a>:
in fact, the question posted in the Cross Validated post above was actually
about whether or not it would make sense to perform PCA followed by LDA.</p>
<p>There is a crucial difference between the two techniques, though. PCA tries to
find the axes with <em>maximum variance</em> for the whole data set, whereas LDA tries
to find the axes for best <em>class separability</em>.</p>
<p><img src="https://www.georgeho.org/assets/images/lda-pic.png" alt="Projections of two-dimensional data (in two clusters) onto the x and y axes"></p>
<p>Going back to the illustration from before (reproduced above), it’s not hard to
see that PCA would give us LD2, whereas LDA would give us LD1. This makes the
main difference between PCA and LDA painfully obvious: just because a feature
has a high variance, doesn’t mean that it’s predictive of the classes!</p>
<p><em>2) Quadratic discriminant analysis (QDA):</em></p>
<p>QDA is a generalization of LDA as a classifier. As mentioned above, LDA must
assume that the class conditional distributions are Gaussian with the same
covariance matrix, if we want it to do any classification for us.</p>
<p>QDA doesn’t make this homoskedasticity assumption (assumption number 2 above),
and attempts to estimate the covariance of all classes. While this might seem
like a more robust algorithm (fewer assumptions! Occam’s razor!), this means
there is a much larger number of parameters to estimate. In fact, the number of
parameters grows quadratically with the number of features! So unless you can
guarantee that your covariance estimates are reliable, you might not want to use
QDA.</p>
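<p>A quick way to see the trade-off is to pit the two against heteroskedastic
data, where the classes share a mean but not a covariance (a synthetic setup
chosen deliberately to break LDA’s second assumption):</p>

```python
# LDA vs QDA when homoskedasticity fails: identical class means, very
# different covariances. LDA's linear boundary can't do better than chance
# here; QDA's quadratic boundary can.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=300)
X2 = rng.multivariate_normal([0.0, 0.0], [[5.0, 0.0], [0.0, 0.1]], size=300)
X = np.vstack([X1, X2])
y = np.array([0] * 300 + [1] * 300)

lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
```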
<p>After all of this, there might be some confusion about the relationship between
LDA, QDA, what’s for dimensionality reduction, what’s for classification, etc.
<a href="https://stats.stackexchange.com/questions/71489/three-versions-of-discriminant-analysis-differences-and-how-to-use-them/71571#71571">This CrossValidated
post</a>
and everything that it links to, might help clear things up.</p>
<p><em>3) Analysis of variance (ANOVA):</em></p>
<p>LDA and <a href="https://en.wikipedia.org/wiki/Analysis_of_variance">ANOVA</a> seem to have
similar aims: both try to “decompose” an observed variable into several
explanatory/discriminatory variables. However, there is an important difference
that <a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis">the Wikipedia article on
LDA</a> puts very
succinctly (my emphases):</p>
<blockquote>
<p>LDA is closely related to analysis of variance (ANOVA) and regression
analysis, which also attempt to express one dependent variable as a linear
combination of other features or measurements. However, ANOVA uses
<strong>categorical</strong> independent variables and a <strong>continuous</strong> dependent variable,
whereas discriminant analysis has <strong>continuous</strong> independent variables and a
<strong>categorical</strong> dependent variable (i.e. the class label).</p>
</blockquote>Portfolio Risk Analytics and Performance Attribution with Pyfoliohttps://www.georgeho.org/pyfolio/2017-12-16T00:00:00Z2017-12-16T00:00:00Z<p>I was lucky enough to have the chance to intern at
<a href="https://www.quantopian.com/">Quantopian</a> this summer. During that time I
contributed some exciting stuff to their open-source portfolio analytics engine,
<a href="https://github.com/quantopian/pyfolio"><code>pyfolio</code></a>, and learnt a truckload of
stuff while doing it! In this blog post, I’ll describe and walk through two of
the new features that I authored: the risk and performance attribution tear
sheets.</p>
<center>
<img
src="https://www.georgeho.org/assets/images/pyfolio-logo.png"
alt="Pyfolio logo">
</center>
<h2 id="risk-analytics">Risk Analytics</h2>
<p>A well-known truth of algorithmic trading is that it’s insufficient to merely
maximize the returns of your algorithm: you must also do so while minimizing the
risk it takes on board. This idea is probably most famously codified in the
<a href="https://en.wikipedia.org/wiki/Sharpe_ratio#Definition">Sharpe ratio</a>, which
divides by the volatility of the returns stream in order to give a measure of
the “risk-adjusted returns”.</p>
<p>However, the volatility of returns is a rather poor proxy for the amount of
“risk” that an algorithm takes on. What if our algo loaded all of its money in
the real estate sector? What if the algo shorted extremely large-cap stocks?
What if half of our portfolio is in illiquid, impossible-to-exit positions?</p>
<p>These are all “risky” behaviors for an algorithm to have, and we’d like to know
about and understand this kind of behavior before we seriously consider investing
money in the algo. However, these formulations of risk are neither captured nor
quantified by the volatility of returns (as in the Sharpe ratio). Finally,
there is no easy, free, open-source way to get this sort of analysis.</p>
<p>Enter <code>pyfolio</code>’s new risk tear sheet! It addresses all the problems outlined
above, and more. Let’s jump right in with an example.</p>
<p><img src="https://www.georgeho.org/assets/images/pyfolio-risk-tear-sheet.png" alt="Example risk tear sheet"></p>
<p>(This example risk tear sheet came from the <a href="https://github.com/quantopian/pyfolio/pull/391">original pull
request</a>, and may therefore be
out of date)</p>
<p>The first 4 plots show the exposure to common style factors: specifically, the
size of the company (natural log of the market cap), mean reversion (measured
by the <a href="http://www.investopedia.com/terms/m/macd.asp">MACD Signal</a>), long-term
momentum, and volatility.
A style factor is best explained with examples: mean reversion, momentum,
volatility and the Fama-French canonical factors (SMB, HML, UMD) are all
examples of style factors. They are factors that indicate broad market trends
(instead of being characteristic of individual stocks, like sectors or market
caps) and characterize a particular <em>style</em> of investing (e.g. mean reversion,
trend-following strategies, etc.).</p>
<p>The analysis is not limited to 4 style factors, though: <code>pyfolio</code> will handle
as many as you pass in (but see below for a possible complication). As we can
see, the algorithm has a significant exposure to the MACD signal, which may or
may not worry us. For instance, it wouldn’t worry us if we knew that it was a
mean-reversion algo, but we would raise some eyebrows if it was something
else… perhaps the author <em>wanted</em> to write a wonderful, event-driven
sentiment algo, but inadvertently <em>ended up</em> writing a mean reversion algo!</p>
<p>One important caveat here is that <code>pyfolio</code> requires you to supply your own
style factors, for every stock in your universe. This is an unfortunately large
complication for the average user, as it would require you to formulate and
implement your own risk model — I explain this in greater detail below.</p>
<p>The next 3 plots show the exposures to sectors. The first plot shows us how much
the algorithm longed or shorted a specific sector: above the x-axis if it
longed, and below if it shorted. The second plot simply shows the gross exposure
to each sector: taking the absolute value of the positions before normalizing.
The last plot shows the net exposure to each sector: taking the long position
<em>less the short position</em> before normalizing. This particular algo looks
beautiful: it is equally exposed to all sectors, and not overly exposed to any
one of them. Evidently, this algo must be taking into account its sector exposures
in its trading logic: given what we know from above, perhaps it is longing the
top 10 most “mean reverting” stocks in each sector at the start of every
week… This analysis requires no additional data other than your algorithm’s
positions: you can supply your own sectors if you like, but if not, the analysis
will default to the <a href="https://www.quantopian.com/help/fundamentals#asset-classification">Morningstar sector
mappings</a>
(specifically, the <code>morningstar_sector_code</code> field), available for free on the
Quantopian platform.</p>
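<p>As a rough sketch of what these three sector plots compute (the tickers, dollar positions and sector labels below are invented for illustration, and this is not pyfolio’s actual implementation), the long, short, gross and net exposures per sector might look like this:</p>

```python
import pandas as pd

# Hypothetical end-of-day dollar positions and a sector mapping.
positions = pd.Series({"AAPL": 500.0, "MSFT": -200.0, "XOM": 300.0, "CVX": -100.0})
sectors = pd.Series({"AAPL": "Tech", "MSFT": "Tech", "XOM": "Energy", "CVX": "Energy"})

grouped = positions.groupby(sectors)
long_exposure = grouped.apply(lambda p: p[p > 0].sum())   # longs only
short_exposure = grouped.apply(lambda p: p[p < 0].sum())  # shorts only
gross_exposure = positions.abs().groupby(sectors).sum()   # |long| + |short|
net_exposure = grouped.sum()                              # long less short

# Normalize by total gross exposure to get fractions of the portfolio.
gross_fraction = gross_exposure / positions.abs().sum()
```

Normalizing by total gross exposure is what lets the plots show each sector as a fraction of the whole book rather than in raw dollars.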
<p>The next 3 plots show the exposures to market caps. In every other respect, it
is identical to the previous 3 plots. These plots look fairly reasonable: most
algos hold most of their positions in large and mega cap names, and have almost
no positions in micro cap stocks. (Quantopian actually discourages investing in
micro cap stocks by pushing users towards using the <a href="https://www.quantopian.com/posts/the-q500us-and-q1500us">Q500 or
Q1500</a> as a tradeable
universe). This analysis uses <a href="https://www.quantopian.com/help/fundamentals#valuation">Morningstar’s <code>market cap</code>
field</a>.</p>
<p>The last 2 plots show the portfolio’s exposure to illiquidity (or low trading
volume). This one is a bit trickier to understand: at the end of every day,
we take the number of shares held in each position and divide that by the
total volume. That gives us a number per position per day. We find the 10th
percentile of this number (i.e. the most illiquid) and plot that as a time
series. So it is a measure of how exposed our portfolio is to illiquid stocks.
The first plot shows the illiquid exposure in our long and short positions,
respectively: that is, it takes the number of shares held in each long/short
position, and divides it by the daily total volume. The second plot shows the
gross illiquid exposure, taking the absolute value of positions before
dividing. So it looks like for this particular algo, for the 10% most illiquid
stocks in our portfolio, our positions account for around 0.2–0.6% (<em>not</em>
0.002–0.006%!) of market volume, on any given day. That’s an acceptably low
number! This analysis obviously requires daily volume data per stock, but that’s
freely available on Quantopian’s platform.</p>
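<p>Here is a minimal sketch of the illiquidity computation described above, with randomly generated holdings and volumes standing in for real data: divide shares held by daily traded volume, then track a tail quantile of that ratio over time.</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-02", periods=5, freq="B")
stocks = [f"STOCK_{i}" for i in range(20)]

# Hypothetical shares held and total daily traded volume per stock.
shares_held = pd.DataFrame(rng.integers(100, 1_000, (5, 20)),
                           index=dates, columns=stocks)
daily_volume = pd.DataFrame(rng.integers(50_000, 500_000, (5, 20)),
                            index=dates, columns=stocks)

# Fraction of each stock's daily volume that our position represents.
volume_fraction = shares_held / daily_volume

# Per day, the 90th percentile of this ratio: the level reached by the
# 10% of positions consuming the largest share of traded volume.
illiquidity = volume_fraction.quantile(0.9, axis=1)
```

The resulting series is the kind of time series the tear sheet plots: one illiquidity number per day for the worst tail of the portfolio.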
<p>That’s it for the risk tear sheet! There are some more cool ideas in the
works (there always are), such as including plots to show a portfolio’s
concentration risk exposure, or a portfolio’s exposure to penny stocks. If you
have any suggestions, please file a <a href="https://github.com/quantopian/pyfolio/issues">new GitHub
issue</a> to let the dev team know!
Pyfolio is open-source and under active development, and outside contributions
are always loved and appreciated. Alternatively, if you just want to find out
more about the nuts and bolts (i.e. the math and the data) that go into the
risk tear sheet, you can dig around <a href="https://github.com/quantopian/pyfolio/tree/master/pyfolio">the source code
itself</a>!</p>
<h2 id="risk-models-and-performance-attribution">Risk Models and Performance Attribution</h2>
<p>There are two things in the discussion of the risk tear sheet that are worth
talking about in further detail:</p>
<ol>
<li>I mentioned how the computation of style factor exposures (i.e. the first 4
plots) required your own “risk model” (whatever that is), and</li>
<li>It was nice that we can guess at the inner workings of the algo, just by
seeing its exposure to common factors. E.g., I guessed that the example algo
was a sector-neutral mean reversion algo, because it was equally exposed to
all 11 sectors, and had a high (in magnitude) exposure to the MACD signal.</li>
</ol>
<p>I’ll talk about both points in order.</p>
<p>In order to find out your exposure to a style factor, you obviously must first
know how much each stock is affected by the style factor. But how do you get
that? That is what a risk model is for!</p>
<p>At the end of every period (usually every trading day), the risk model wakes
up and looks at all the pricing data and style factor data for that day.
It then tries to explain as best it can how much each stock was affected by
each style factor. The end result is that each stock will have a couple of
numbers associated with it, one for every style factor. These numbers indicate
how sensitive the stock’s returns were to movements in the style factors. These
numbers are called <em>factor loadings</em> or <em>betas</em> (although I prefer “factor
loadings” because a lot of things in quant finance are called “beta”).</p>
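<p>To make this concrete, here is a toy sketch (not any particular vendor’s risk model) of how factor loadings can be estimated: given a time series of factor returns and one stock’s returns, an ordinary least-squares regression recovers the betas. The factor and stock returns below are simulated.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_factors = 250, 3

# Hypothetical daily returns for 3 common factors
# (say momentum, mean reversion and volatility).
factor_returns = rng.normal(0.0, 0.01, (n_days, n_factors))

# Simulate one stock whose true loadings we know, plus idiosyncratic noise.
true_loadings = np.array([0.8, -0.3, 0.5])
stock_returns = factor_returns @ true_loadings + rng.normal(0.0, 0.005, n_days)

# Regress the stock's returns on the factor returns: the fitted
# coefficients are the estimated factor loadings (a.k.a. betas).
loadings, *_ = np.linalg.lstsq(factor_returns, stock_returns, rcond=None)
print(loadings)  # close to the true loadings [0.8, -0.3, 0.5]
```

A production risk model is far more elaborate (cross-sectional regressions, shrinkage, exponential weighting), but the core idea is this regression.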
<p>Even better, there’s no reason why the risk model should limit itself to style
factors! I previously made the distinction between style factors and other
factors such as sectors: theoretically, a risk model should also be able to find
out how sensitive a stock’s returns are to movements in its sector: compute a
“sector factor loading”, if you will. Collectively, all the factors that we want
the risk model to consider — be they sector, style or otherwise — are called
<em>common factors</em>.</p>
<p>Clearly, having a risk model allows us to do a whole lot of stuff! This is
because, if we want to know how style factors and other prevailing market trends
are affecting our <em>portfolio</em>, we must first know how they affect the <em>stocks</em>
in our portfolio. Or, to be a bit more ambitious, if we knew how style factors
and prevailing market trends are impacting our <em>universe</em> of stocks, then we’re
well on the way to knowing how they’re impacting our portfolio! The value of
this kind of portfolio analysis should, of course, be self-evident.</p>
<p>So, suppose we have a risk model. How do we get from a <em>stock-level</em> understanding
of how market trends are affecting us, to a <em>portfolio-level</em> understanding of the
same? The answer to this question is called <em>performance attribution</em>, and is
one of the main reasons a risk model is worth having.</p>
<p>Instead of prattling on about performance attribution, it’d just be easier to
show you the miracles it can do. Below is a (fake, made-up) example of the
analysis performance attribution can give us:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-txt" data-lang="txt"><span style="display:flex;"><span>Date: 08–23–2017
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Factor PnL ($)
</span></span><span style="display:flex;"><span>-------------- --------
</span></span><span style="display:flex;"><span>Total PnL -1,000
</span></span><span style="display:flex;"><span>Technology 70
</span></span><span style="display:flex;"><span>Real Estate -40
</span></span><span style="display:flex;"><span>Momentum -780
</span></span><span style="display:flex;"><span>Mean Reversion 100
</span></span><span style="display:flex;"><span>Volatility -110
</span></span><span style="display:flex;"><span>Stock-Specific -240
</span></span></code></pre></div><p>The table shows that today, our algo suffered a $1000 loss, and the breakdown of
that loss indicates that the main culprit is momentum. In other words, our poor
performance today is mostly attributable to the poor performance of the momentum
factor (hence the name, “performance attribution”). The sector factors account
for very little PnL, while the other style factors (mean reversion and
volatility) drive fairly significant profits and losses, but the real smoking
gun here is the fact that momentum completely tanked today.</p>
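<p>Mechanically, a one-day attribution like this can be sketched as follows: multiply each factor exposure by that factor’s return to get per-factor PnL, and treat whatever the factors cannot explain as stock-specific PnL. All exposures and factor returns below are invented for illustration.</p>

```python
import pandas as pd

# Hypothetical end-of-day dollar exposures to each common factor...
exposures = pd.Series({
    "Technology": 12_000.0, "Real Estate": -8_000.0,
    "Momentum": 50_000.0, "Mean Reversion": -20_000.0, "Volatility": 15_000.0,
})

# ...and the (made-up) returns of those factors on the same day.
factor_returns = pd.Series({
    "Technology": 0.01, "Real Estate": 0.01,
    "Momentum": -0.015, "Mean Reversion": -0.005, "Volatility": -0.01,
})

# PnL attributed to each factor, and the unexplained residual.
factor_pnl = exposures * factor_returns
total_pnl = -1_000.0  # total PnL reported by the backtest
specific_pnl = total_pnl - factor_pnl.sum()
```

Here, as in the table, momentum is the smoking gun: it accounts for the bulk of the day’s loss, while the residual is the stock-specific PnL.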
<p>There are a few more useful summary statistics that performance attribution can
give us! Traditional computations for the alpha and the Sharpe ratio of a
strategy usually take into account the performance of the market: i.e., the
traditional alpha is a measure of how much our strategy <em>outperformed</em> the
market, and the traditional Sharpe ratio is a measure of the same, but
accounting for the volatility of returns. These may be dubbed <em>single-factor
alphas</em>, because they only measure performance once one factor has been
accounted for — namely, the market. In reality, we would like to not only
account for the market, but also any other common factors, such as style or
sector. This leads to the concept of the <em>multi-factor alpha and Sharpe ratio</em>,
which is exactly the same as the alpha and Sharpe ratio we’re familiar with, but
taking into account a lot more factors. In other words, whereas the returns in
excess of the market are quantified by the single-factor alpha, the returns in
excess of the market, momentum, mean reversion, volatility etc. are
quantified by the multi-factor alpha. The same goes for the single-factor and
multi-factor Sharpe ratios, in the case of risk-adjusted returns.</p>
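<p>One common way to estimate a multi-factor alpha (sketched here with simulated data; a real implementation would use your actual strategy and factor returns) is to regress the strategy’s returns on all the common factor returns at once: the fitted intercept is the multi-factor alpha.</p>

```python
import numpy as np

rng = np.random.default_rng(7)
n_days = 252

# Hypothetical daily returns for the market and two style factors.
common_returns = rng.normal(0.0, 0.01, (n_days, 3))

# Simulated strategy: factor-driven returns plus a true daily alpha.
true_alpha = 0.0005
strategy_returns = (
    common_returns @ np.array([0.9, 0.2, -0.1])
    + true_alpha
    + rng.normal(0.0, 0.002, n_days)
)

# Regress strategy returns on a constant plus all common factors:
# the intercept is the multi-factor alpha, the rest are factor loadings.
X = np.column_stack([np.ones(n_days), common_returns])
coefs, *_ = np.linalg.lstsq(X, strategy_returns, rcond=None)
multi_factor_alpha = coefs[0]
```

The single-factor alpha is the special case where the design matrix contains only the market’s returns.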
<p>Adding performance attribution capabilities to <code>pyfolio</code> is an active project! A
couple of pull requests have already been merged to this effect, so definitely
stay tuned! A new version of <code>pyfolio</code> will probably be made once performance
attribution is up and running. As always, feel free to
<a href="https://github.com/quantopian/pyfolio">contribute to <code>pyfolio</code></a>, be it by
making feature requests, issues with bugs, or submitting a pull request!</p>
<hr>
<p><strong>Update (12–16–2017):</strong> Quantopian recently launched their risk model for
anyone to use — this is a great resource that usually only large and
deep-pocketed financial institutions have access to. Check it out
<a href="https://www.quantopian.com/risk-model">here</a>.</p>
<p><strong>Update (05–11–2018):</strong> Quantopian’s now integrated pyfolio analytics into
their backtest engine! This makes it much easier to see how your algorithm
stacks up against expectations. Check out the announcement
<a href="https://www.quantopian.com/posts/improved-backtest-analysis">here</a>.</p>
<p><strong>Update (05–29–2018):</strong> Quantopian recently published a white paper on how the
risk model works! Read all about it
<a href="https://www.quantopian.com/papers/risk">here</a>.</p>
<p><strong>Update (12-16-2020):</strong> <a href="https://www.bloomberg.com/news/articles/2020-12-16/quant-trading-platform-quantopian-closes-down">Quantopian has been acquired by
Robinhood.</a>
Sorry for all the broken links to <code>www.quantopian.com</code>.</p>Modelling Hate Speech on Reddit — A Three-Act Play (Slide Deck)https://www.georgeho.org/reddit-slides/2017-11-08T00:00:00Z2017-11-08T00:00:00Z<p>This is a follow-up post to my first post on a recent project to <a href="https://www.georgeho.org/reddit-clusters/">model hate
speech on Reddit</a>. If you haven’t
taken a look at my first post, please do!</p>
<p>I recently gave a talk on the technical, data science side of the project,
describing not just the final result, but also the trajectory of the whole
project: stumbling blocks, dead ends and all. Below is the slide deck: enjoy!</p>
<h2 id="abstract">Abstract</h2>
<p>Reddit is the one of the most popular discussion websites today, and is
famously broad-minded in what it allows to be said on its forums: however,
where there is free speech, there are invariably pockets of hate speech.</p>
<p>In this talk, I present a recent project to model hate speech on Reddit. In
three acts, I chronicle the thought processes and stumbling blocks of the
project, with each act applying a different form of machine learning:
supervised learning, topic modelling and text clustering. I conclude with the
current state of the project: a system that allows the modelling and
summarization of entire subreddits, and possible future directions. Rest
assured that both the talk and the slides have been scrubbed to be safe for
work!</p>
<h2 id="slides">Slides</h2>
<blockquote class="embedly-card"><h4><a href="https://speakerdeck.com/_eigenfoo/modelling-hate-speech-on-reddit-a-three-act-play">Modelling Hate Speech on Reddit - A Three-Act Play</a></h4><p>Reddit is the one of the most popular discussion websites today, and is famously broad-minded in what it allows to be said on its forums: however, where there is free speech, there are invariably pockets of hate speech. In this talk, I present a recent project to model hate speech on Reddit.</p></blockquote>
<script async src="//cdn.embedly.com/widgets/platform.js" charset="UTF-8"></script>Hello World!https://www.georgeho.org/hello/2017-07-29T00:00:00Z2017-07-29T00:00:00Z<p><a href="https://en.wikipedia.org/wiki/Utah_teapot">The Utah teapot!</a> (Basically the
“hello world” of computer graphics).</p>
<p><img src="https://www.georgeho.org/assets/images/utah-teapot.png" alt="The Utah teapot"></p>
<p>This is the first post of what will (hopefully) be a cool and interesting blog.
Hope you like it!</p>
<p>For those who are interested, this website is based off <a href="https://mmistakes.github.io/minimal-mistakes/">the Minimal Mistakes
theme</a> by Michael Rose, generated
with <a href="https://jekyllrb.com">Jekyll</a>, hosted by <a href="https://pages.github.com/">GitHub
Pages</a> and served using
<a href="https://www.cloudflare.com/">Cloudflare</a>. I’ve had no complaints with this
blogging stack: the only thing I pay for is the custom
<a href="https://eigenfoo.xyz/"><code>eigenfoo.xyz</code></a> domain name, which costs the same as
maybe two or three cups of coffee a year.</p>
<hr>
<p><strong>Update (2022-03-06):</strong> I’ve since made several changes to this blog, which
you can read about in <a href="https://www.georgeho.org/migrating-to-hugo">a subsequent blog post</a>.</p>