
I, Robot… Will Be The Judge:
The Case for Artificial (Intelligence) Referees as Helpers

By Peronet Despeignes
Editor's Note: The following article is an overview of a speculative, long-term, hard-core research and development initiative within Augur. It's part of our ongoing efforts (1, 2) to make the existing REP holder reporting system massively more scalable than it is today -- to enhance it, not replace it.
The outcome of this special project is unclear, but it represents just one example of how Augur is thinking ahead, way ahead, and will continue to do so, relentlessly, as part of the ongoing work of the Forecast Foundation.

Machines are getting better than humans at figuring out who to hire, who’s in a mood to pay a little more for that sweater, and who needs a coupon to nudge them toward a sale. In applications around the world, software is being used to predict whether people are lying, how they feel and whom they’ll vote for… To crack these cognitive and emotional puzzles, computers needed not only sophisticated, efficient algorithms, but also vast amounts of human-generated data, which can now be easily harvested from our digitized world. The results are dazzling. Most of what we think of as expertise, knowledge and intuition is being deconstructed and recreated as an algorithmic competency, fueled by big data.

The capabilities [IBM’s] Watson has demonstrated using deep analytics and natural language processing are truly stunning. The technologies that will develop from this will no doubt help the world with many of its significant problems. Not least of these is dealing with the vast, escalating volumes of data our modern world generates.

When it comes to investment advice, would you trust a financial professional or a robot? A growing number of people are choosing the latter, on the belief that algorithms can provide rational and dispassionate advice at a cost well below that of traditional advisors. A handful of automated investment startups created in the past few years now have more than $4 billion in assets under management, according to Forrester Research. It’s a small segment of a trillion-dollar wealth management industry but growing at a red-hot pace.
Watching John with the machine, it was suddenly so clear.
The Terminator would never stop, it would never leave him… it would always be there. And it would never hurt him, never shout at him or get drunk and hit him, or say it couldn’t spend time with him because it was too busy. And it would die to protect him. Of all the would-be fathers who came and went over the years, this thing, this machine, was the only one who measured up. In an insane world, it was the sanest choice.
We've said it before: Prediction markets are only as good as their referees – and the confidence people place in them.
Referees are needed to decide whether any given prediction came true by the time set. That determines who wins, who loses and whether the price signals are finely tuned to the odds that the predicted event will happen (instead of being distorted by speculation that referees won't judge fairly). The referee system has to be honest, reliable and scalable – and it should render its judgements as quickly as possible.
If it’s perceived as unfair or unreliable, the market eventually loses confidence, integrity and value. And if the referee system isn’t scalable (able to cost-effectively handle rapid growth and increasingly large sets of questions to be decided) and the quickest method available, then the system’s potential for growth is limited, and future profits are at risk. The type of referee system used is a critical design feature.
That is why Augur intends to relentlessly refine and enhance the capability and scalability of its referee system, which we label Reputation (reputation being the voting clout each referee holds, depending on their financial stake in the system and their historical record of reporting honestly and staying aligned with the group's consensus about which predictions happened or didn't happen across multiple decisions).
To better understand where we might be going with this and how, let's take a big step back and look at all the options available for judging whether a prediction happened or not, including the "lie detector" system we've been enhancing and plans for a new Artificial Intelligence-powered backup "helper" agent.
The options we have before us are:
    • a single human judge
    • a fixed group of human judges
    • a company
    • a changing group of humans whose votes are not "free," not equal and not permanent but instead come at a cost and are subjected to a formula designed to dynamically adjust each judge's voting clout on an ongoing basis according to their compliance with the consensus across multiple decisions.
    • a centralised “robot” that decides the outcome
    • a decentralised “robot” that decides the outcome
    • a hybrid of the fourth and sixth options, mixed with "machine learning," where the AI goes through a learning period of "auto-suggesting" and "learning" from the suggestions human referees accepted and rejected.
Let’s take a closer look at each option to help better understand the context for our selection.
Option 1: A lone human judge
Well, this option at least has the virtue of being simple. But individuals can cheat, especially when there’s lots of money at stake. A single judge can err or be corrupted, or simply flake out. One individual also cannot reliably handle the hundreds of prediction market decisions we expect to be posted to our platform in the weeks after it goes live.
Option 2: A “jury” of human judges where every vote counts equally

This makes more sense than a single judge. Instead of “putting all our eggs in one basket” — putting all our faith in one person to judge fairly, impartially and consistently — we diversify and put our faith in a group decision. An individual might err, cheat or flake out. A group is collectively less likely to err, cheat or flake out. That’s the basic assumption behind the concept of a jury-based judicial system and democracy in general.

Another virtue of this option is that, if we allow for quorums (where it’s OK if not everyone votes on every decision, as long as some minimum number of judges does), it can handle more decisions.

The maximum number of judgements the system can handle is a bit higher than Option 1, but it’s still limited. Just look at the slow-moving US justice system or any event or competition involving multiple judgments by juries and panels of judges. They can get overwhelmed.

There’s an additional complication when it comes to applying such a jury system to a decentralized, cryptocurrency-based network such as bitcoin, where users are pseudonymous by default and can take measures to make themselves completely anonymous. All things being equal, no public identity means people are less accountable for their actions (or inaction).

Option 3: A company

Predictious, Fairlay and other bitcoin-based betting sites are examples where a company makes the decision on whether a prediction happened. Intrade was another one. This option has the virtue of accountability: an established company has deeper pockets than an individual or small group of individuals, and more at stake financially (future profits, brand reputation).

But it’s centralized, which means this option leaves us open to the risk of error, corruption or some other kind of corporate failure that messes up the votes. With this option, all eggs are put in one basket.

This is similar to Option 1 in that a single party is making the decision, forcing users to trust that one party. Companies can fail (like Intrade), disappear (like betsofbitco.in) or err (both Intrade and betsofbitco.in did before they suspended operations).

When there’s a lot of money at stake over whether a prediction happened or not, how do you stop any group of humans from erring, cheating or disappearing, or at least prevent this from affecting event judgements?

The basic question for any judge or referee system is, "Who Watches the Watchmen"? That brings us to option 4.

Option 4: A group in which each judge's voting clout is not "free," not equal and not fixed, but dynamically adjusted based on each judge's financial stake in the system's integrity and their historical commitment to the group's consensus (the truth)
Sounds familiar, doesn't it (1, 2)?
Yes, that's a general description of Augur's current system.  
But we'll never be satisfied with the status quo, because we can't afford to be. If this system succeeds in the way we all hope, the lives and livelihoods of a great many people all over the world will depend on the Augur platform's integrity, performance and reliability. We take that responsibility very seriously, with an eye toward perpetual refinement of the platform.
We've seen plenty of evidence that machines, in a growing array of different (though still limited) contexts, can be more reliable than humans. So, of course, we have our eye on potential avenues for future enhancement down the road that make use of smart algorithms.
Option 5: A centralised Artificial Intelligence System (such as IBM’s Watson)
This would be an Artificial Intelligence system such as IBM’s Watson, which at some point will be made available to scan news sources across the Internet and make judgements. IBM Watson did an incredibly good job during the 2011 Jeopardy competition – not only in its rate of correct answers, but also in the rate at which it would have answered correctly yet chose to remain silent because its “best guess” did not meet its own internal minimum confidence thresholds.

 “As impressive as Watson’s final cash score was, what I think was more remarkable was its answer success rate. In the first match, out of sixty clues, Watson rang in first and answered 38 correctly, with five errors. This is an 88.4% success rate. If only the 30 questions in the Double Jeopardy portion are considered, this jumps to a whopping 96%. You’ll notice I’ve left the Final Jeopardy question out of these calculations. This is because this question had to be answered regardless of the machine’s low confidence level of 14%. It’s important to the competition, but actually indicates the success of the machine’s algorithms… While the second game (Day 3) wasn’t quite as impressive as the first, Watson still won by a significant margin. Considering it was competing against the two best human Jeopardy players of all time, it’s safe to say IBM met its goal and then some.”

And that was more than four years ago.

Like all programs, NLP algorithms are automated processes guided by fixed sets of rules and instructions. They are impossible to bribe, tempt or otherwise corrupt. 

But a centrally hosted program is exposed to the risks of corruption and incompetence of the individual or company that hosts it, has control over it and is responsible for securing it. A program is only as good and robust as the “rules” encoded into it, how well those rules interact with reality, who has control over the program, where it is hosted and how well that facility is secured.

Through existing APIs, we only know what IBM “tells” us Watson “thinks” and says. IBM has only recently made its API available, but much of the code remains closed-source, with the hardware completely under IBM’s direct control. Results cannot be cryptographically, or in any other way independently, verified for non-tampering – not in any way that we are aware of.

The more immediate problem with IBM’s Watson is that its available knowledge domain, for now, is limited to health care and travel questions.

Option 6: A decentralized Artificial Intelligence referee system (codename: ARGUS)

Since February, I've been brainstorming with top Natural Language Processing experts about the idea of a “decentralised Watson” as a default prediction market referee option, emulating many of the inner workings and design methodologies of IBM’s Watson.


In early September, NLP expert Petr Baudis was tapped to head up the initiative. Petr is an authority on the subject as the team leader of YodaQA, a widely recognized global initiative to build a lighter-weight, open-source Q&A alternative to IBM Watson. He's fluent in Computer Go, continuous black-box optimization, combinatorial game theory, Monte Carlo simulations, information extraction from unstructured text, C++, Java, Python, Perl, Linux, and Unix. He has a Master's in Theoretical Computer Science from Charles University and is currently a PhD candidate at Czech Technical University. He has worked for Novartis, Novell and Sirius Labs and is participating in this year's National Institute of Standards and Technology annual Text Retrieval Conference, which this year focuses on LiveQA.

The codename of this special project within the Augur Project is ARGUS. Why "ARGUS"?
Argus Panoptes (or Argos The All-Seeing) is the name of the 100-eyed giant in Greek mythology… whose epithet, “Panoptes”, “all-seeing”, led to his being described with multiple, often one hundred, eyes. The epithet Panoptes… was described in a fragment of a lost poem Aigimios, attributed to Hesiod: “…sleep never fell upon his eyes; but he kept sure watch always.”… According to Ovid, to commemorate her faithful watchman, Hera had the hundred eyes of Argus preserved forever, in a peacock’s tail.

We’re not big fans of Greek (or Roman) mythology, but it can make for cool metaphors and codenames (like, say, Augur). That’s why we called our initiative to build an always-all-seeing news and event monitoring system ARGUS. ARGUS is currently under development. 

The system is envisioned to be transparently open source and decentralized like the rest of Augur, so that anyone can review or modify the encoded rules as they see fit. It would be under no one person’s and no one company’s control, just like the rest of the system, just like Ethereum and just like bitcoin.

The challenge for such a system would be easier than the one facing IBM Watson, which fields questions of all sorts. The simpler challenge for our system would be to determine whether something happened or not – “yes” or “no” – and to triangulate the truth of what happened (or didn’t) across multiple sources and semantic variations to reinforce its findings and make them robust.

Please see the working paper we've officially released to the public today. It provides technical details and a general roadmap. This is some of the hard-core research and development we want to support with a portion of the upcoming crowdsale.


One of the things that will get us to the accuracy rates we'd like to achieve is a system (codenamed SYPHON) that restricts, in real time, the predictions users can input based on what the system determines it can reliably decide, which is in turn based on real-time scans of its database (corpus). The system will scan user entries for vocabulary, diction, structure and other semantic elements, auto-correct user input and propose variations for approval, guiding the user toward what the system believes it can reliably handle as a referee challenge, based on internal scans of its own decentralized library of news stories, facts and word relationships.

The basic idea is that we'd look at the most common semantic structures of predictions, including field-specific structures, boil them down to their essential semantic elements, make sure that structure "jibes" with the elements commonly found in the facts/news database/corpus for that subject matter, and restrict user submissions to that content and structure.
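
To make this concrete, here is a minimal, hypothetical sketch (in Python) of the kind of gating step SYPHON might perform. The template pattern, the corpus vocabulary and the function names are illustrative assumptions made for this article, not the actual design.

```python
import re

# Hypothetical whitelist of prediction templates the system is assumed to
# handle reliably, e.g. "Will <subject> <event> ... by <YYYY-MM-DD>?"
TEMPLATES = [
    re.compile(r"^will .+ (win|reach|exceed|be elected|happen) .+ by \d{4}-\d{2}-\d{2}\?$", re.I),
]

def syphon_check(prediction, corpus_vocabulary):
    """Return (accepted, suggestions). A prediction is accepted only if it
    matches a known template AND its content words appear in the news corpus,
    i.e. the system believes it could later resolve it from news sources."""
    words = {w.lower().strip(".,?") for w in prediction.split()}
    unknown = sorted(words - corpus_vocabulary)
    matches_template = any(t.match(prediction.strip()) for t in TEMPLATES)

    if matches_template and not unknown:
        return True, []

    suggestions = []
    if not matches_template:
        suggestions.append("Rephrase as: 'Will <subject> <event> by <YYYY-MM-DD>?'")
    if unknown:
        suggestions.append("Terms not found in the news corpus: " + ", ".join(unknown))
    return False, suggestions

# Toy corpus vocabulary; in practice this would come from scans of the
# decentralized library of news stories.
vocab = {"will", "the", "democratic", "candidate", "win", "us", "presidential",
         "election", "by", "2016-11-09"}
print(syphon_check(
    "Will the Democratic candidate win the US presidential election by 2016-11-09?",
    vocab))
```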

In our initial version, we count on several well-founded assumptions about how journalists are conditioned to prepare news stories (note the AP Style Book and Economist Style Guide):

    • The most important content of a story (the “who, what, where, when and why”) is typically included within the first three paragraphs — often mostly in the first paragraph
    • Sentences tend to be broken down into standard core elements — a subject, verb and the rest of the predicate (including direct and indirect objects) — along with the date of the article.
    • A widely reported event will be written about in similar but different ways, using different diction and slightly different sentence structures, which establishes a kind of “Bell Curve” around the truth that makes it more easily identifiable as truth.

Below is a sample of how various news organizations reported Barack Obama’s presidential victory. We color-code commonalities in how the sentence elements are structured and would be identified by our system:

[Image: color-coded comparison of lead sentences from multiple news organizations reporting Barack Obama's presidential victory]
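
As a rough illustration of that parsing step, the sketch below uses the open-source spaCy library (our choice here purely for illustration) to pull subject/verb/object triples out of the first few sentences of several article leads and count how often the same core assertion recurs across sources. The example leads and the simple agreement rule are invented for the example; the actual ARGUS pipeline is described in the working paper.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model

def lead_triples(article_text, max_sents=3):
    """Extract (subject, verb, object) triples from the first few sentences,
    which journalistic convention says carry the who/what/when."""
    doc = nlp(article_text)
    triples = []
    for sent in list(doc.sents)[:max_sents]:
        for token in sent:
            if token.dep_ == "ROOT" and token.pos_ == "VERB":
                subj = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
                if subj and obj:
                    triples.append((subj[0].lemma_.lower(),
                                    token.lemma_.lower(),
                                    obj[0].lemma_.lower()))
    return triples

# Invented example leads; a real run would use scraped article text.
leads = [
    "Barack Obama won the presidential election on Tuesday.",
    "Obama wins re-election, defeating Mitt Romney.",
    "President Barack Obama won a second term last night.",
]

# Count how often the same (subject, verb) core shows up across sources;
# broad agreement is treated as evidence the event actually happened.
counts = Counter((s, v) for lead in leads for (s, v, o) in lead_triples(lead))
print(counts.most_common())
```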
ARGUS would scan at least 100 top news sources for each event decision, never relying on just one, making it more robust against errors, hacks or other information-distorting issues at any one individual news source. Each article is scraped at least three times (at random intervals) over the course of two weeks after the event's decision date has elapsed, in order to confirm the integrity of initial readings.
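
Here is a minimal sketch of what that re-checking schedule could look like; the source list, the number of checks and the use of content hashes to detect tampering are assumptions made for illustration.

```python
import hashlib
import random
from datetime import datetime, timedelta

def schedule_checks(decision_date, window_days=14, n_checks=3):
    """Pick random re-check times inside the two-week window that follows the
    event's decision date, so an attacker cannot know when a source will be read."""
    window_seconds = window_days * 24 * 3600
    offsets = sorted(random.uniform(0, window_seconds) for _ in range(n_checks))
    return [decision_date + timedelta(seconds=o) for o in offsets]

def content_fingerprint(article_html):
    """Hash the fetched article so later reads can be compared with the first;
    a changed hash flags a possibly tampered-with or rewritten story."""
    return hashlib.sha256(article_html.encode("utf-8")).hexdigest()

# Stand-ins for the 100+ outlets ARGUS would actually monitor.
sources = ["nytimes.com", "bbc.co.uk", "reuters.com"]
plan = {s: schedule_checks(datetime(2015, 11, 9)) for s in sources}
for source, times in plan.items():
    print(source, [t.isoformat(timespec="seconds") for t in times])
```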

To succeed in breaking our system, hackers would have to:

    • figure out which news sources will be the judges for the event – a randomly selected group no one will know in advance,
    • hack enough of those organizations, and
    • sustain the hack for the entire two-week period, because they won't know the three specific random times at which the system will check any particular reference for tampering.

The system would check each source not once but three times – at random times – over the course of the two-week period to reduce the impact of news story tampering by hackers or others. Hackers would need to maintain a consistent false story across several major newspapers for the full two weeks to thwart the system.

According to security solutions provider Prolexic, the average duration of a DDoS attack shrank by more than half, to only 17 hours, in the second quarter of 2014, from 34 hours a year earlier. As DDoS attacks grow in sophistication, more organisations keep hardening their defenses using CloudFlare, Prolexic and other services.

The ARGUS concept essentially proposes pushing back the “circle of trust” from an outer layer of human “middleman” judges interpreting and reporting the news to the system to an inner layer that is closer to the primary sources themselves — the central idea here is that automated algorithms can be relied upon to scan and faithfully interpret primary news sources in a consistent manner.

In our systems, multiple news sources populated by multiple parties interested in preserving the credibility of the news source (reporters, editors, publishers, investors) are essentially the “judges” – our algorithm is merely charged with interpreting their “judgements.”

But what about the risk of a conspiracy of newspapers?

In the news business, credibility is the most important asset and the most important currency. Less credibility means less circulation. You would have to get a large number of people on board at even a single newspaper – from the reporter whose name is on the article and whose reputation is on the line, to the final editor, whose job is on the line, and sometimes the publisher, whose job and/or investment is on the line and who will be mindful of shareholders watching the bottom line – to go along with something like that.

It’s possible at one newspaper, but virtually impossible at multiple papers at the same time over an extended period – there is too much competition for credibility. Losing it is the kiss of death for any news organisation, especially in the Internet Age.

Option 4 places its faith in a limited number of judges – maybe 20, maybe 50, maybe 1,000 or more at first – plus a statistical tool (a very good one) to sift truth-tellers from fiction-tellers.

ARGUS would put its very guarded skepticism in at least that many ‘judges’ at each newspaper, multiplied by the number of interested parties at each newspaper, and modified by the lie-detector tool we currently use, so as to elevate the voting clout of the most consistently “truth-telling” newspaper sources.
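
That weighting idea can be sketched as a simple multiplicative update, loosely analogous to how Reputation adjusts the clout of human reporters. The update rule, learning rate and source names below are illustrative assumptions, not the production algorithm.

```python
def update_source_weights(weights, reports, learning_rate=0.1):
    """Nudge each news source's voting clout up when it agrees with the
    weight-adjusted majority verdict and down when it does not.
    weights: {source: clout}, reports: {source: True/False on the event}."""
    yes = sum(w for s, w in weights.items() if reports.get(s) is True)
    no = sum(w for s, w in weights.items() if reports.get(s) is False)
    consensus = yes >= no  # weighted majority verdict on the event

    new_weights = {}
    for source, w in weights.items():
        if source not in reports:            # silent sources keep their clout
            new_weights[source] = w
        elif reports[source] == consensus:   # agreed with consensus: clout grows
            new_weights[source] = w * (1 + learning_rate)
        else:                                # disagreed: clout shrinks
            new_weights[source] = w * (1 - learning_rate)

    total = sum(new_weights.values())
    return {s: w / total for s, w in new_weights.items()}  # renormalise to 1

weights = {"nytimes.com": 0.34, "bbc.co.uk": 0.33, "reuters.com": 0.33}
reports = {"nytimes.com": True, "bbc.co.uk": True, "reuters.com": False}
print(update_source_weights(weights, reports))
```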

But there's a problem with this concept: the core tech just isn't ready for prime time yet.
Our ARGUS crew believes it can achieve at least a 66% success rate. That's just not good enough; the minimum target we'd be aiming for is a 95+% success rate over time. Hence the need for R&D, and for a seventh option that takes the best of options four and six and relearns the lessons of the IBM Watson team's early failings.
Option 7: Help Out While You Learn, Hal... (and open the pod bay doors, please)
Before delving into this option, we need to provide some context. If you have a spare hour, see below a video about the incredible story of human imagination, tenacity... and some old-fashioned shortcut-taking that lies at the heart of the story of IBM Watson's birth. If you don't have an hour, just jump to the most relevant bit, which starts at 5:40 and ends at 22:50. If you don't have more than 5 minutes, jump to 17:47 through 22:50. No? Gotta head out soon? OK, then just jump to 19:27 – it's the humdinger. It's how IBM "cracked the code."
Instead of trying to program the endless array of rules (many of them arbitrary, many of which we, as humans, are not even consciously aware of) and interactions among rules that define the English language and the relationships between words, the Watson team just dumped in a massive amount of raw examples and definitions from which Watson could find patterns, infer rules and train itself. That's an oversimplified definition of machine learning.
This opens the door to a possible seventh option that would give the AI referee system time to learn while it serves a useful role to reduce the workload on our human referees.
We feel it's best to quickly summarize this option by pulling an excerpt from the ARGUS working paper:
Before ARGUS goes fully live as a possible independent referee option, it could be part of a hybrid in which human referees report on the outcome with the aid of ARGUS. ARGUS scans news sources, provides its determinations on which predictions are true / false / uncertain with links to supporting sources (perhaps with annotated excerpts) that make it clear how ARGUS came to its determinations. ARGUS would then cumulatively learn via machine learning based on which of its decisions were accepted and which were rejected by the human referees.
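A toy version of that feedback loop is sketched below using scikit-learn; the bag-of-words features, the logistic-regression model and the class names are assumptions chosen for illustration, not a description of how ARGUS will actually be built.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

class ArgusAssistant:
    """Suggests outcomes to human referees, then retrains on whichever
    suggestions they accepted or rejected (a crude stand-in for the
    machine-learning loop sketched in the working paper)."""

    def __init__(self):
        self.vectorizer = CountVectorizer()
        self.model = LogisticRegression()
        self.texts, self.labels = [], []
        self.fitted = False

    def suggest(self, evidence_text):
        """Return ('yes'/'no', confidence) or None if not yet trained."""
        if not self.fitted:
            return None  # stay silent until there is enough training data
        X = self.vectorizer.transform([evidence_text])
        p_yes = self.model.predict_proba(X)[0][1]
        return ("yes" if p_yes > 0.5 else "no", p_yes)

    def record_referee_decision(self, evidence_text, outcome_is_yes):
        """Every human ruling, whether it confirms or overrides the suggestion,
        becomes a new training example."""
        self.texts.append(evidence_text)
        self.labels.append(int(outcome_is_yes))
        if len(set(self.labels)) > 1:  # need both classes before fitting
            X = self.vectorizer.fit_transform(self.texts)
            self.model.fit(X, self.labels)
            self.fitted = True

assistant = ArgusAssistant()
assistant.record_referee_decision("Obama wins re-election, networks project", True)
assistant.record_referee_decision("No winner declared as vote counting continues", False)
print(assistant.suggest("Obama declared winner of the presidential election"))
```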
Basically, as the Augur platform (we hope) massively grows in traffic and the number of markets and predictions to referee balloons, ARGUS could play a very helpful role in massively reducing the time demands on Augur's human referees.
Again, to see the full working paper, click here. Development of the system is expected to be supported by the upcoming Reputation token crowdsale. Depending on the crowdsale results, a very "early draft" or demo of the ARGUS system would be slated for release within two months of the end of the sale.