Machines are getting better than humans at figuring out who to hire, who’s in a mood to pay a little more for that sweater, and who needs a coupon to nudge them toward a sale. In applications around the world, software is being used to predict whether people are lying, how they feel and whom they’ll vote for… To crack these cognitive and emotional puzzles, computers needed not only sophisticated, efficient algorithms, but also vast amounts of human-generated data, which can now be easily harvested from our digitized world. The results are dazzling. Most of what we think of as expertise, knowledge and intuition is being deconstructed and recreated as an algorithmic competency, fueled by big data.
The capabilities [IBM’s] Watson has demonstrated using deep analytics and natural language processing are truly stunning. The technologies that will develop from this will no doubt help the world with many of its significant problems. Not least of these is dealing with the vast, escalating volumes of data our modern world generates.
The Terminator would never stop, it would never leave him… it would always be there. And it would never hurt him, never shout at him or get drunk and hit him, or say it couldn’t spend time with him because it was too busy. And it would die to protect him. Of all the would-be fathers who came and went over the years, this thing, this machine, was the only one who measured up. In an insane world, it was the sanest choice.
- a single human judge
- a fixed group of human judges
- a company
- a changing group of humans whose votes are not "free," not equal and not permanent, but instead come at a cost and are subject to a formula that dynamically adjusts each judge's voting clout on an ongoing basis according to their compliance with the consensus across multiple decisions (see the sketch after this list)
- a centralised “robot” that decides the outcome
- a decentralised “robot” that decides the outcome
- a hybrid of the fourth and sixth option, mixed with "machine learning" where the AI goes through a learning period of "auto-suggesting" and "learning" from the suggestions human referees accepted and rejected.
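To make the fourth option concrete, here is a minimal sketch of how such a formula might work. The multiplicative update rule and its 0.1 rate are illustrative assumptions on our part, not the actual formula from the Augur design:

```python
# Illustrative sketch of option 4: reputation-weighted voting in which each
# judge's clout is adjusted after every decision based on agreement with the
# weighted consensus. The multiplicative update and the 0.1 rate are
# assumptions for illustration, not Augur's actual formula.

def weighted_consensus(votes, clout):
    """Return the clout-weighted majority outcome (True/False)."""
    yes = sum(clout[j] for j, v in votes.items() if v)
    no = sum(clout[j] for j, v in votes.items() if not v)
    return yes >= no

def update_clout(votes, clout, rate=0.1):
    """Boost judges who voted with the consensus, penalise the rest."""
    outcome = weighted_consensus(votes, clout)
    for judge, vote in votes.items():
        clout[judge] *= (1 + rate) if vote == outcome else (1 - rate)
    total = sum(clout.values())
    for judge in clout:  # renormalise so total clout stays constant
        clout[judge] /= total
    return outcome

clout = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
print(update_clout({"a": True, "b": True, "c": False}, clout))  # True
print(clout)  # "c" now carries less weight than "a" and "b"
```

The intuition is that judges who repeatedly vote against the consensus bleed clout, so a dishonest minority's dissent carries less and less weight over time.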
The second option makes more sense than a single judge. Instead of “putting all our eggs in one basket” — putting all our faith in one person to judge fairly, impartially and consistently — we diversify and put our faith in a group decision. An individual might err, cheat or flake out. A group is collectively less likely to err, cheat or flake out. That’s the basic assumption behind jury-based judicial systems and democracy in general.
Another virtue of this option is that, if we allow for quorums (where not everyone has to vote on every decision, so long as some minimum number of judges does), it can handle more decisions.
The maximum number of judgements the system can handle is a bit higher than under Option 1, but it’s still limited. Just look at the slow-moving US justice system, or any event or competition involving multiple judgments by juries and panels of judges. They can get overwhelmed.
There’s an additional complication when it comes to applying such a jury system to a decentralized, cryptocurrency-based network such as bitcoin, where users are pseudonymous by default and can take measures to make themselves completely anonymous. All things being equal, no public identity means people are less accountable for their actions (or inaction).
Predictious, Fairly and other bitcoin-based betting sites are examples where a company decides whether a prediction happened. Intrade was another. This option has the virtue of accountability: an established company has deeper pockets than an individual or small group of individuals, and more at stake financially (future profits, brand reputation).
But it’s centralized, which means this option leaves us open to the risk of error, corruption or some other kind of corporate failure that messes up the votes. With this option, all eggs are put in one basket.
This is similar to Option 1 in that a single party makes the decision, forcing users to trust that party. Companies can fail (like Intrade), disappear (like betsofbitco.in) or err (both Intrade and betsofbitco.in did before they suspended operations).
When there’s a lot of money at stake over whether a prediction happened or not, how do you stop any group of humans from erring, cheating or disappearing, or at least prevent this from affecting event judgements?
The basic question for any judge or referee system is, “Who watches the watchmen?” That brings us to option 4.
“As impressive as Watson’s final cash score was, what I think was more remarkable was its answer success rate. In the first match, out of sixty clues, Watson rang in first and answered 38 correctly, with five errors. This is an 88.4% success rate. If only the 30 questions in the Double Jeopardy portion are considered, this jumps to a whopping 96%. You’ll notice I’ve left the Final Jeopardy question out of these calculations. This is because this question had to be answered regardless of the machine’s low confidence level of 14%. It’s important to the competition, but actually indicates the success of the machine’s algorithms… While the second game (Day 3) wasn’t quite as impressive as the first, Watson still won by a significant margin. Considering it was competing against the two best human Jeopardy players of all time, it’s safe to say IBM met its goal and then some.”
And that was more than four years ago.
Since then, IBM Watson has only gotten better, more widely used, more battle-tested and more sophisticated, and improving the system has only become a bigger financial priority for Big Blue (heck, a necessity). Meanwhile, the booming science of natural language processing (NLP) is being put to great use for everything from analyzing equity-market sentiment to refining search results to powering personal digital assistants like Apple’s Siri and Google Now.
Like all programs, NLP algorithms are automated processes guided by fixed sets of rules and instructions. They are impossible to bribe, tempt or otherwise corrupt.
But a centrally hosted program is exposed to the risks of corruption and incompetence of the individual or company that hosts it, controls it and is responsible for securing it. A program is only as good and robust as the “rules” encoded into it, how well those rules interact with reality, who controls the program, where it is hosted and how well that facility is secured.
The more immediate problem with IBM’s Watson is that its available knowledge domain, for now, is limited to health care and travel questions.
Since February, I've been brainstorming with top Natural Language Processing experts about the idea of a “decentralised Watson” as a default prediction market referee option, emulating many of the inner workings and design methodologies of IBM’s Watson.
In early September, NLP expert Petr Baudis was tapped to head up the initiative. Petr is an authority on the subject as the team leader of YodaQA, a widely recognized global initiative to build a lighter-weight, open-source Q&A alternative to IBM Watson. His expertise spans Computer Go, continuous black-box optimization, combinatorial game theory, Monte Carlo simulations and information extraction from unstructured text, and he is fluent in C++, Java, Python, Perl, Linux and Unix. He holds a Master’s in Theoretical Computer Science from Charles University and is currently a PhD candidate at Czech Technical University. He has worked for Novartis, Novell and Sirius Labs, and is participating in this year’s National Institute of Standards and Technology Text Retrieval Conference, which this year focuses on LiveQA.
We’re not big fans of Greek (or Roman) mythology, but it can make for cool metaphors and codenames (like, say, Augur). That’s why we called our initiative to build an always-all-seeing news and event monitoring system ARGUS. ARGUS is currently under development.
The system is envisioned to be transparently open source and decentralized like the rest of Augur, for anyone to review or modify the encoded rules as they see fit. It would be under no one person’s and no one company’s control — just like the rest of the system, just like Ethereum and just like bitcoin.
The challenge for such a system would be easier than the one facing IBM Watson, which fields questions of all sorts. Our system merely needs to determine whether something happened or not – “yes” or “no” – and triangulate the truth of what happened (or didn’t) across multiple sources and semantic variations to reinforce its findings and make them robust.
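As a toy illustration of that triangulation step (our sketch, not code from the working paper; the source names, confidence scores and thresholds are invented):

```python
# Toy illustration of triangulating a binary outcome across sources.
# Each reading is (source, answer, confidence); the source names, scores
# and thresholds below are invented for the example.

def triangulate(readings, min_sources=3, min_margin=0.5):
    """Return "yes", "no" or "undecided" from per-source readings."""
    if len(readings) < min_sources:
        return "undecided"  # not enough independent coverage
    yes = sum(conf for _, answer, conf in readings if answer == "yes")
    no = sum(conf for _, answer, conf in readings if answer == "no")
    total = yes + no
    if total == 0 or abs(yes - no) / total < min_margin:
        return "undecided"  # sources conflict too much to call
    return "yes" if yes > no else "no"

readings = [("paper_a", "yes", 0.9), ("paper_b", "yes", 0.8),
            ("paper_c", "yes", 0.7), ("paper_d", "no", 0.2)]
print(triangulate(readings))  # "yes"
```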
Please see the working paper we've officially released to the public today. It provides technical details and a general roadmap. This is some of the hard-core research and development we want to support with a portion of the upcoming crowdsale.
One of the things that will get us to the accuracy rates we’d like to achieve is a system (codenamed SYPHON) that restricts, in real time, the predictions users can input, based on what the system determines it can reliably decide, which is in turn based on real-time scans of its database (corpus). The system will scan user entries for vocabulary, diction, structure and other semantic elements, auto-correct user input, and propose variations for approval, guiding the user toward what the system believes it can reliably handle as a referee challenge, based on internal scans of its own decentralized library of news stories, facts and word relationships.
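Here is a minimal sketch of the SYPHON idea, assuming a simple vocabulary-coverage gate; the toy vocabulary, the coverage threshold and the difflib-based suggestions are our illustrative stand-ins for the real corpus scans:

```python
# Minimal sketch of SYPHON-style input gating: accept a proposed prediction
# only if enough of its content words are covered by the corpus vocabulary,
# and suggest the nearest known term otherwise. The toy vocabulary, the 0.8
# coverage threshold and the difflib matcher are illustrative assumptions.
import difflib

CORPUS_VOCAB = {"obama", "wins", "election", "president", "november"}
STOPWORDS = {"the", "a", "an", "in", "on", "by", "will"}

def vet_prediction(text, vocab=CORPUS_VOCAB, min_coverage=0.8):
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    if not words:
        return "rejected", {}
    unknown = [w for w in words if w not in vocab]
    coverage = 1 - len(unknown) / len(words)
    if coverage >= min_coverage:
        return "accepted", {}
    # Propose the closest in-vocabulary term for each word we can't handle.
    suggestions = {w: difflib.get_close_matches(w, vocab, n=1) for w in unknown}
    return "rejected", suggestions

print(vet_prediction("Obama wins the election in November"))
print(vet_prediction("Obama winns the electionn"))  # rejected, with fixes
```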
In our initial version, we rely on several well-founded assumptions about how journalists are conditioned to prepare news stories (see the AP Stylebook and The Economist Style Guide):
- The most important content of a story (the “who, what, where, when and why”) is typically included within the first three paragraphs — often mostly in the first paragraph
- Sentences tend to be broken down into standard core elements — a subject, verb and the rest of the predicate (including direct and indirect objects) — along with the date of the article.
- A widely reported event will be written about in similar but not identical ways, using different diction and slightly different sentence structures, establishing a kind of “bell curve” around the truth that makes the truth easier to identify (see the sketch after this list).
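To illustrate how those journalistic conventions could be exploited, here is a sketch that pulls rough subject-verb-object triples from a story’s lead sentence. spaCy is used here as a stand-in dependency parser; it is not necessarily what ARGUS will use:

```python
# Sketch: extract rough (subject, verb, object) triples from a story's lead
# sentence, leaning on the convention that journalists front-load the
# "who, what, when". spaCy is a stand-in parser, not necessarily what ARGUS
# uses. Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def lead_triples(lead_paragraph):
    """Pull subject-verb-object triples from the first sentence."""
    doc = nlp(lead_paragraph)
    sentence = next(doc.sents)  # leads carry the key facts up front
    triples = []
    for token in sentence:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            triples += [(s.text, token.lemma_, o.text)
                        for s in subjects for o in objects]
    return triples

print(lead_triples("Barack Obama won the presidential election on Tuesday."))
# e.g. [('Obama', 'win', 'election')]
```

Triples extracted this way from many outlets can then be compared against one another, which is the “bell curve around the truth” idea in practice.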
Below is a sample of how various news organizations reported Barack Obama’s presidential victory. We color-code commonalities in how the sentence elements are structured and would be identified by our system:
To succeed in breaking our system, hackers would have to:

- figure out which papers will be the judges for the event – a randomly selected group no one will know in advance,
- hack enough of those organizations, and
- sustain the hack for the entire monitoring window, because they won’t know the three specific random times at which the system will check any particular reference for tampering.
The system would check each source not once but three times, at random times, over the course of a one-week period to reduce the impact of news-story tampering by hackers or others. Hackers would need to maintain a consistent false story in several major newspapers for a full week to thwart the system.
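A minimal sketch of that checking schedule, with content hashing standing in for whatever tamper-detection ARGUS actually applies:

```python
# Sketch of the anti-tampering check: pick three random instants inside a
# one-week window and compare a content fingerprint of each source across
# checks. Fetching articles is out of scope here; the SHA-256 fingerprint
# and the scheduling are illustrative assumptions.
import hashlib
import random
from datetime import datetime, timedelta

def schedule_checks(start, window_days=7, n_checks=3):
    """Return n_checks random instants inside the window, in time order."""
    seconds = window_days * 24 * 3600
    offsets = sorted(random.uniform(0, seconds) for _ in range(n_checks))
    return [start + timedelta(seconds=o) for o in offsets]

def fingerprint(article_text):
    """Stable fingerprint of a story; any edit changes the hash."""
    return hashlib.sha256(article_text.encode("utf-8")).hexdigest()

def consistent(snapshots):
    """True only if every snapshot of the source hashed identically."""
    return len({fingerprint(s) for s in snapshots}) == 1

print(schedule_checks(datetime.utcnow()))  # three hidden check times
print(consistent(["Obama wins."] * 3))     # True: untampered
print(consistent(["Obama wins."] * 2 + ["Obama loses."]))  # False: tampered
```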
According to security solutions provider Prolexic, the average duration of a DDoS attack shrank by more than half, to only 17 hours in the second quarter of 2014 from 34 hours a year earlier. As DDoS attacks grow in sophistication, more organisations keep hardening their defenses using CloudFlare, Prolexic and other services.
The ARGUS concept essentially proposes pushing back the “circle of trust” from an outer layer of human “middleman” judges interpreting and reporting the news to the system to an inner layer that is closer to the primary sources themselves — the central idea here is that automated algorithms can be relied upon to scan and faithfully interpret primary news sources in a consistent manner.
In our system, the real “judges” are the multiple news sources, each populated by multiple parties interested in preserving the credibility of that source (reporters, editors, publishers, investors); our algorithm is merely charged with interpreting their “judgements.”
But what about the risk of a conspiracy of newspapers?
In the news business, credibility is the most important asset and the most important currency: less credibility equals less circulation. To corrupt a judgement, you would have to get a large number of people, even at a single newspaper, to go along with it: the reporter whose name is on the article and whose reputation is on the line, every editor up to the final one, whose jobs are on the line, and sometimes the publisher, whose job and/or investment is on the line and who will be mindful of shareholders watching the bottom line.
It’s possible at one newspaper, but virtually impossible at multiple papers at the same time over an extended period; there is too much competition for credibility. Losing credibility is the kiss of death for any news organisation, especially in the Internet age.
Option 4 places its faith in a limited number of judges (maybe 20, maybe 50, maybe 1,000 or more at first), plus a statistical tool, a very good one, to sift truthtellers from fiction-tellers.
ARGUS, by contrast, would place its very guarded trust in at least that many “judges” at each newspaper, multiplied by the number of interested parties at each newspaper, and modified by the lie-detector tool we currently use to elevate the voting clout of the most consistently “truthtelling” newspaper sources.
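As a closing sketch of what “elevating the voting clout” of truthtelling sources could look like, here each source’s weight is its smoothed historical rate of agreement with the consensus; the smoothing and the weighted vote are assumptions on our part, not the statistical tool described in the working paper:

```python
# Sketch of elevating the clout of consistently "truthtelling" sources: each
# source's weight is its Laplace-smoothed historical rate of agreement with
# the final consensus. The smoothing and the weighted vote are assumptions,
# not the statistical tool described in the working paper.
from collections import defaultdict

class SourceReliability:
    def __init__(self):
        self.agreed = defaultdict(int)  # times a source matched the consensus
        self.total = defaultdict(int)   # times a source reported at all

    def weight(self, source):
        """Smoothed agreement rate; unseen sources start at a neutral 0.5."""
        return (self.agreed[source] + 1) / (self.total[source] + 2)

    def verdict(self, reports):
        """Weighted vote over {source: bool} reports, then update history."""
        yes = sum(self.weight(s) for s, v in reports.items() if v)
        no = sum(self.weight(s) for s, v in reports.items() if not v)
        outcome = yes >= no
        for source, vote in reports.items():
            self.total[source] += 1
            self.agreed[source] += int(vote == outcome)
        return outcome

tracker = SourceReliability()
print(tracker.verdict({"paper_a": True, "paper_b": True, "tabloid": False}))
print(tracker.weight("tabloid"))  # drops below 0.5 after disagreeing once
```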