
Augur's Black Swans

In my last post, I discussed how a common statistical technique called Principal Component Analysis (PCA) can function as a lie detector. While there's a fair bit of math involved, the core insight is simply that lies are more variable than truths. If you take a set of questions -- each of which has an "obvious" (objectively and easily determinable) answer -- and collect several people's answers to them, then by cross-referencing those answers you can figure out who is contributing most to the variability in the collection of answers. PCA provides a score that tells you precisely this information.
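To make this concrete, here is a minimal sketch of the idea -- a toy example of my own, not Augur's actual consensus algorithm -- that uses a small invented reports matrix and a plain SVD to score how much each reporter contributes to the overall variability:

```python
# A minimal sketch (not Augur's production algorithm): score each reporter by
# how strongly their answers load on the first principal component of the
# centered reports matrix.  Rows are reporters, columns are questions;
# 1 = "yes", 0 = "no".
import numpy as np

reports = np.array([
    [1, 0, 1, 1, 0],   # honest reporter
    [1, 0, 1, 1, 0],   # honest reporter
    [1, 0, 1, 1, 0],   # honest reporter
    [0, 1, 0, 0, 1],   # reporter answering against the consensus
], dtype=float)

centered = reports - reports.mean(axis=0)       # center each question (column)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = np.abs(U[:, 0] * S[0])                 # magnitude of each reporter's
                                                # projection onto the first axis

print(scores)  # the last reporter dominates the overall variability
```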
Or does it? One problem with PCA is that it is a second-order statistical technique. This means that it accounts for pairwise coupling of signals -- "how dispersed were each user's answers to question A? how much did user X's answers vary with user Y's answers?" -- but it is blind to three-way, four-way, etc. interactions. Now, in many cases, "cheating" or collusion between users will affect the second as well as the higher moments; for these cases, PCA should work fine. However, this is not always the case: certain kinds of feedback loops can alter the fourth-order statistics, while leaving the second-order statistics untouched. PCA cannot see these. Collusion structured in the right way is invisible to PCA.
Four-way interactions are of particular interest, because the fourth standardized moment (the kurtosis) quantifies the amount of positive feedback present in the system. "Excess kurtosis" indicates that extreme occurrences are significantly more common than you might expect, based on "intuition", or, more precisely, by assuming that the system's underlying dynamics are driven by a random walk. (Nassim Taleb's The Black Swan is an awesome, non-technical introduction to this concept.)
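To make "excess kurtosis" a bit more concrete, here's a quick toy illustration (my own example, nothing Augur-specific): scipy's kurtosis function reports excess kurtosis by default, so a Gaussian sample should score near zero while a heavy-tailed sample scores well above it.

```python
# Toy illustration of excess kurtosis: compare a Gaussian sample with a
# heavy-tailed Student's t sample.  scipy's kurtosis() returns *excess*
# kurtosis by default, so the Gaussian should land near 0.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=5, size=100_000)  # extreme draws are far more common

print(kurtosis(gaussian))       # ~0
print(kurtosis(heavy_tailed))   # ~6 in theory for t with 5 degrees of freedom
```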
So, can we use a fourth-order "version" of PCA -- perhaps in combination with ordinary, second-order PCA -- to arrive at a more comprehensive measure of nonconformity? In fact, there is a (literal) higher-order version of PCA, called Independent Component Analysis, or ICA. The difficulty with ICA is that, like PCA, it relies on diagonalization to find the "best" multi-linear decomposition of a joint moment tensor. For PCA, diagonalizing the covariance matrix (i.e., the second-order joint moment tensor) gives you a set of uncorrelated axes -- these are the covariance matrix's eigenvectors, since the off-diagonal elements of the eigenvalue matrix are 0. ICA is a more powerful technique: it finds a set of axes which are mutually statistically independent (rather than merely uncorrelated). Unfortunately, diagonalizing higher-order tensors is much messier than it is for matrices, and often these calculations fail to converge. (The most widely-used implementation of ICA, FastICA, uses fixed-point iteration rather than tensor decomposition, but nevertheless has the same convergence problems.) This makes ICA impractical for use in a decentralized system, since the report matrices it must operate on will be essentially random.
Let's take a step back. Since we're really interested in the users' scores -- that is, projections onto the independent axes, rather than the axes themselves -- we should be able to avoid ICA and the troublesome issues of tensor diagonalization altogether. What if we just summed across three of the dimensions of the fourth joint moment tensor (the cokurtosis tensor)? The remaining dimension would then contain a direct measure of each user's contribution to the overall cokurtosis. This simpler method does not run into issues with convergence or numerical stability. However, it still has a problem with complexity -- it turns out the cokurtosis tensor takes a prohibitively long time to calculate.

Happily, there's a much better way of arriving at the result we want (sums over the cokurtosis tensor's dimensions). First, a few definitions:

X: centered, standardized matrix of Augur event reports, with dimensions Ne × Nr (one row per event, one column per reporter)
K: cokurtosis tensor
Ne: number of events
Nr: number of reporters
xi: the ith signal (column) of the centered data matrix X

K is the cokurtosis tensor, which has dimensions Nr × Nr × Nr × Nr; visually, this would be a tesseract (a 4-D cube). Each element of K is a joint fourth moment of four reporters' signals:

$$K_{ijkl} = \frac{1}{N_e} \sum_{m=1}^{N_e} X_{mi}\, X_{mj}\, X_{mk}\, X_{ml},$$

or, equivalently, writing $X_{m\cdot}$ for the $m$th row of $X$ (every reporter's report on event $m$),

$$K = \frac{1}{N_e} \sum_{m=1}^{N_e} X_{m\cdot} \otimes X_{m\cdot} \otimes X_{m\cdot} \otimes X_{m\cdot},$$

where ⊗ is the tensor product. However, we're not actually interested in K itself; our goal is instead to collapse K to c, the nonconformity vector:

$$c_l = \sum_{i=1}^{N_r} \sum_{j=1}^{N_r} \sum_{k=1}^{N_r} K_{ijkl}, \qquad l = 1, \ldots, N_r.$$
The nonconformity vector is the heart of Augur's decentralized lie detection mechanism. The elements of c quantify each reporter's contribution to the overall cokurtosis. High values indicate a large contribution -- and therefore a high likelihood of cheating. I've recently spent some time doing quantitative comparisons of several different lie detection algorithms, and my conclusion so far is that the strongest lie detector, by far, is simply each user's contribution to the cokurtosis tensor.
(My next blog post will go into detail on how different mechanisms are compared with one another. Here is the very short version: treat the lie detector algorithms as binary classifiers, run a bunch of Monte Carlo simulations, then see which algorithm has the best Matthews correlation coefficient. Spoiler: the cokurtosis method kicks butt.)
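For reference, the scoring step of such a comparison might look something like the sketch below. The function and variable names are placeholders of mine; the actual simulation framework is the subject of the next post.

```python
# Sketch of the scoring step: treat a lie detector as a binary classifier over
# reporters (liar vs. honest) and summarize its accuracy with the Matthews
# correlation coefficient.  The threshold and label arrays are placeholders.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def score_lie_detector(nonconformity, true_liars, threshold):
    """nonconformity: one score per reporter; true_liars: ground-truth labels
    known from the Monte Carlo simulation; threshold: flagging cutoff."""
    flagged = np.asarray(nonconformity) > threshold
    return matthews_corrcoef(true_liars, flagged)
```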
K gives Augur's lie detector a great deal of resolving power, but there's a catch: calculating the nonconformity vector by building the cokurtosis tensor, then summing over its dimensions, is prohibitively expensive. Naive construction of K requires O(Nr^4) operations.
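To see why, here's what the brute-force computation looks like -- an illustrative sketch (variable names are mine, not Augur's production code) that builds the full tensor from the two equations above and then sums over three of its dimensions:

```python
# Brute-force computation (illustrative sketch, not production code): build the
# full Nr x Nr x Nr x Nr cokurtosis tensor from the centered, standardized
# report matrix, then sum over three of its four dimensions to get c.
import numpy as np

def cokurtosis_tensor(X):
    """X: centered, standardized reports, shape (num_events, num_reporters)."""
    num_events = X.shape[0]
    # K[i, j, k, l] = (1 / Ne) * sum_m X[m, i] * X[m, j] * X[m, k] * X[m, l]
    return np.einsum('mi,mj,mk,ml->ijkl', X, X, X, X) / num_events

def collapse_naive(X):
    """Nonconformity vector c: sum K over its first three dimensions."""
    return cokurtosis_tensor(X).sum(axis=(0, 1, 2))
```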
However, it should be possible to bypass this expensive operation, as we don't do anything with the elements of K aside from sum over them. (The final dimension, which is not summed over, simply indexes the elements of c.) After spending some time writing out these sums, something occurred to me that was in retrospect pretty obvious: K is symmetric, so the collapsed indices should contribute the same way to every element of c.
After tinkering a bit with the algebra, it became clear that this "same way" is the sum across each row of the data matrix, raised to the third power:

$$s_m = \left( \sum_{j=1}^{N_r} X_{mj} \right)^{3}, \qquad m = 1, \ldots, N_e.$$
(The method described here works for collapsing joint moment tensors of any order, not just fourth-order. The power to which the row sums are raised is always one less than the tensor's order.)
So, how do we actually calculate the vector c? Notice that, if we break apart the brute-force calculation of c, the ith element of c is the dot product of this vector of cubed row sums with column i of the data matrix X. Since this dot product is carried out for every column, c can be calculated using a simple matrix product:

$$c = \frac{1}{N_e}\, X^{\top} s = \frac{1}{N_e}\, X^{\top} \left( X \mathbf{1}_{N_r} \right)^{\circ 3},$$

where $\mathbf{1}_{N_r}$ is a vector of $N_r$ ones and $(\,\cdot\,)^{\circ 3}$ denotes the element-wise cube.
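In code, the fast collapse is essentially a one-liner. The sketch below (again, my own variable names) computes c directly from X and should agree with the brute-force version above up to floating-point error:

```python
# Fast collapse: cube each row sum of X, then take a single matrix product.
# This gives the same c as summing over the cokurtosis tensor, without ever
# building the tensor itself.
import numpy as np

def collapse_fast(X):
    """X: centered, standardized reports, shape (num_events, num_reporters)."""
    num_events = X.shape[0]
    row_sums_cubed = X.sum(axis=1) ** 3        # one cubed sum per event (row)
    return X.T @ row_sums_cubed / num_events   # one score per reporter (column)

# Consistency check against the brute-force version (small sizes only):
# X = np.random.randn(20, 5)
# assert np.allclose(collapse_fast(X), collapse_naive(X))
```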
This equation's time complexity should be linear with respect to both Ne and Nr. To verify this numerically, I ran a set of Monte Carlo simulations, where the numbers of events and reporters were gradually cranked up from 10 to 100. Each data point shown here is the average time elapsed over 250 simulations, and the error bars represent +/- the standard deviation of the mean:
[Figure: average time to compute c as the numbers of events and reporters increase]
As expected, this method of calculating c has O(Ne Nr) complexity -- a huge improvement from where we started, and fast enough for use in decentralized systems!
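If you want to reproduce the scaling experiment, a stripped-down version of the timing loop might look like the sketch below (my own reconstruction of the setup described above, not the original benchmark code; it reuses collapse_fast from the earlier sketch):

```python
# Stripped-down timing experiment: time the fast collapse on random report
# matrices of increasing size, averaging over repeated trials.
# Reuses collapse_fast from the sketch above.
import time
import numpy as np

def average_runtime(num_events, num_reporters, num_trials=250):
    elapsed = []
    for _ in range(num_trials):
        X = np.random.randn(num_events, num_reporters)
        start = time.perf_counter()
        collapse_fast(X)
        elapsed.append(time.perf_counter() - start)
    return np.mean(elapsed)

for n in (10, 25, 50, 100):
    print(n, average_runtime(n, n))
```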
Originally posted at The Ramble.