Natural language processing is in a curious place right now. It’s not immediately obvious how successful it’s been, or how close the field is to viable, production-ready techniques (in the same way that, say, computer vision is). For example, Sebastian Ruder predicted that the field is close to a watershed moment, and that within a year we’ll have downloadable language models (that was six months ago, and, uh…). However, Ana Marasović points out that there is a tremendous amount of work demonstrating that:
“despite good performance on benchmark datasets, modern NLP techniques are nowhere near the skill of humans at language understanding and reasoning when making sense of novel natural language inputs”.
I am confident that I am very bad at making lofty predictions about the future. Instead, I’ll try to talk about something I know a bit about: simple solutions to concrete problems, with some Bayesianism thrown in for good measure.
This blog post will summarize some literature on probabilistic and Bayesian matrix factorization methods, keeping an eye out for applications to one specific task in NLP: text clustering. It’s exactly what it sounds like, and there’s been a fair amount of success in applying text clustering to many other NLP tasks (e.g. check out these examples in document organization, corpus summarization and document classification).
A natural question is: why is matrix factorization[1] a good technique to use for text clustering? Because it is both a clustering and a feature engineering technique: it gives us a better view of the latent space! Matrix factorization lives an interesting double life: clustering technique by day, feature transformation technique by night. Aggarwal and Zhai suggest that chaining matrix factorization with some other clustering technique (e.g. agglomerative clustering or topic modelling) is common practice and is called concept decomposition, but I haven’t seen any other source back this up.
What follows is a literature review of three matrix factorization techniques for machine learning: one classical, one probabilistic and one Bayesian. Dive in!
Non-Negative Matrix Factorization (NMF)
Factorize your (entrywise non-negative) $m \times n$ matrix $X$ as $X \approx WH$, where $W$ is $m \times p$ and $H$ is $p \times n$, both entrywise non-negative. $p$ is the dimensionality of your latent space, and each latent dimension usually comes to quantify something with semantic meaning. There are several algorithms to compute this factorization, but Lee and Seung’s multiplicative update rule (originally published in NIPS 2000) is most popular.
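To make the multiplicative update rule concrete, here is a minimal NumPy sketch for the squared-error objective. The names (`nmf_multiplicative`, `X`, `W`, `H`, `p`), the random initialization, the iteration count and the toy document-term matrix are all my own assumptions for illustration, not anything taken from the paper.

```python
import numpy as np


def nmf_multiplicative(X, p, n_iter=500, eps=1e-10, seed=0):
    """Approximate a non-negative (m x n) matrix X as W @ H.

    Uses Lee and Seung's multiplicative updates for the squared-error
    objective. Initialization and iteration count are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random start, so the updates never get stuck at zero.
    W = rng.random((m, p)) + eps
    H = rng.random((p, n)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio, so W and H
        # stay non-negative. eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy document-term matrix: rows are documents, columns are words.
# Documents 0-1 only use words 0-1; documents 2-3 only use words 2-3.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
])

W, H = nmf_multiplicative(X, p=2)

# Clustering by day: assign each document to its dominant latent dimension.
clusters = W.argmax(axis=1)
# Feature transformation by night: the rows of W are the new document features.
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The rows of `W` can be read either as soft cluster memberships (hence the `argmax`) or as low-dimensional features to hand to some other clustering method.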
Fairly simple: enough said, I think.
Probabilistic Matrix Factorization (PMF)
Originally introduced as a paper at NIPS 2007, probabilistic matrix factorization is essentially the exact same model as NMF, but with uncorrelated (a.k.a. “spherical”) multivariate Gaussian priors placed on the rows and columns of $U$ and $V$. Expressed as a graphical model, PMF would look like this:
[Figure: graphical model for PMF]
Note that the priors are placed on the rows of the $U$ and $V$ matrices.
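In case the graphical model is hard to picture, the generative story can also be written out. The notation here is my own recollection of the paper's and should be read as an assumption: $R$ is the $N \times M$ matrix being factorized, $U_i$ and $V_j$ are $D$-dimensional rows of $U$ and $V$, and $\sigma^2$, $\sigma_U^2$ and $\sigma_V^2$ are fixed variances.

$$
R_{ij} \mid U, V \sim \mathcal{N}\left(U_i^T V_j, \sigma^2\right), \qquad
U_i \sim \mathcal{N}\left(0, \sigma_U^2 I\right), \qquad
V_j \sim \mathcal{N}\left(0, \sigma_V^2 I\right)
$$

Taking the negative log of the resulting posterior is what turns the Gaussian likelihood into a squared-error term and the two Gaussian priors into quadratic penalties, with weights $\lambda_U = \sigma^2 / \sigma_U^2$ and $\lambda_V = \sigma^2 / \sigma_V^2$.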
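The authors then (somewhat disappointingly) proceed to find the MAP estimate of the $U$ and $V$ matrices. They show that maximizing the posterior is equivalent to minimizing the sum-of-squared-errors loss function with two quadratic regularization terms:
$$\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left(R_{ij} - U_i^T V_j\right)^2 + \frac{\lambda_U}{2} \sum_{i=1}^{N} \|U_i\|_{Fro}^2 + \frac{\lambda_V}{2} \sum_{j=1}^{M} \|V_j\|_{Fro}^2$$
where $R$ is the $N \times M$ matrix being factorized, $U_i$ and $V_j$ are rows of $U$ and $V$, $\|\cdot\|_{Fro}$ denotes the Frobenius norm, and $I_{ij}$ is 1 if document $i$ contains word $j$, and 0 otherwise.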
This loss function can be minimized via gradient descent, and implemented in your favorite deep learning framework (e.g. TensorFlow or PyTorch).
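You don't strictly need a deep learning framework to try this out. Below is a minimal sketch in plain NumPy, with the gradients of the regularized squared-error loss derived by hand rather than by autodiff. The function name `pmf_map`, the hyperparameter values, the learning rate and the toy data are all assumptions of mine, chosen only so the example runs quickly.

```python
import numpy as np


def pmf_map(R, I, D, lam_U=0.1, lam_V=0.1, lr=0.01, n_iter=2000, seed=0):
    """MAP estimate of U (N x D) and V (M x D) by plain gradient descent.

    R is the data matrix and I a 0/1 mask of observed entries. The
    regularization weights and learning rate are illustrative guesses.
    """
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = 0.1 * rng.standard_normal((N, D))
    V = 0.1 * rng.standard_normal((M, D))
    losses = []
    for _ in range(n_iter):
        E = I * (R - U @ V.T)  # residuals on observed entries only
        loss = (0.5 * np.sum(E ** 2)
                + 0.5 * lam_U * np.sum(U ** 2)
                + 0.5 * lam_V * np.sum(V ** 2))
        losses.append(loss)
        # Hand-derived gradients of the loss with respect to U and V.
        grad_U = -E @ V + lam_U * U
        grad_V = -E.T @ U + lam_V * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V, losses


# Hypothetical toy data: a 4 x 4 matrix with two unobserved entries.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])
I = np.ones_like(R)
I[0, 3] = 0.0  # pretend these two entries were never observed
I[2, 1] = 0.0

U, V, losses = pmf_map(R, I, D=2)
R_hat = U @ V.T  # includes predictions for the masked-out entries
```

In a framework like PyTorch you would write only the loss and let autodiff produce `grad_U` and `grad_V`. Either way, what comes out is a single point estimate.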
The problem with this approach is that while the MAP estimate is often a reasonable point in low dimensions, it becomes very strange in high dimensions, and is usually not informative or special in any way. Read Ferenc Huszár’s blog post for more.
Bayesian Probabilistic Matrix Factorization (BPMF)
Strictly speaking, PMF is not a Bayesian model. After all, the priors have fixed hyperparameters and the posterior is never explored: all we get is a MAP estimate. Bayesian probabilistic matrix factorization, originally published by researchers from the University of Toronto, is a fully Bayesian treatment of PMF.
Instead of saying that the rows/columns of $U$ and $V$ are normally distributed with zero mean and some precision matrix, we place hyperpriors on the mean vectors and precision matrices. The specific priors are Wishart priors on the precision matrices (with scale matrix $W_0$ and $\nu_0$ degrees of freedom), and Gaussian priors on the means (with mean $\mu_0$ and covariance tied to the matrix drawn from the Wishart prior). Expressed as a graphical model, BPMF would look like this:
[Figure: graphical model for BPMF]
Note that, as above, the priors are placed on the rows of the $U$ and $V$ matrices, and that $D$ is the dimensionality of latent space (i.e. the number of latent dimensions in the factorization).
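Written out, the hierarchy looks roughly like this. I am reconstructing the notation from my memory of the paper, so treat the symbols as assumptions: $\alpha$ is the observation precision, $\beta_0$ is a scale factor that ties the prior on the mean to the sampled precision matrix, and in this parameterization the Wishart is placed on the precision matrix $\Lambda_U$ rather than on the covariance. The priors on $V_j$, $\mu_V$ and $\Lambda_V$ mirror the ones shown for $U$.

$$
\begin{aligned}
R_{ij} \mid U, V &\sim \mathcal{N}\left(U_i^T V_j, \alpha^{-1}\right) \\
U_i \mid \mu_U, \Lambda_U &\sim \mathcal{N}\left(\mu_U, \Lambda_U^{-1}\right) \\
\mu_U \mid \Lambda_U &\sim \mathcal{N}\left(\mu_0, (\beta_0 \Lambda_U)^{-1}\right) \\
\Lambda_U &\sim \mathcal{W}\left(W_0, \nu_0\right)
\end{aligned}
$$

The only structural difference from PMF is the bottom two lines: the mean and precision of the rows are themselves random, instead of being fixed at zero and a hand-picked value.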
The authors then sample from the posterior distribution of $U$ and $V$ using a Gibbs sampler. Sampling takes several hours: somewhere between 5 and 180, depending on how many samples you want. Nevertheless, the authors demonstrate that BPMF can achieve more accurate and more robust results than PMF on the Netflix data set.
I would propose two changes to the original paper:
- Use an LKJ prior on the covariance matrices instead of a Wishart prior. According to Michael Betancourt and the PyMC3 docs, this is more numerically stable, and will lead to better inference.
- Use a more robust sampler such as NUTS (instead of a Gibbs sampler), or even resort to variational inference. The paper makes it clear that BPMF is a computationally painful endeavor, so any speedup to the method would be a great help. It seems to me that for practical real-world applications to collaborative filtering, we would want to use variational inference. Netflix ain’t waiting 5 hours for their recommendations.
[1] Which is, by the way, a severely underappreciated technique in machine learning.