The Center for Data Innovation spoke with David Rothschild, an economist at Microsoft Research in New York City and creator of an event forecasting tool called PredictWise. Rothschild discussed how market data can be used to forecast political and sporting events, and how the polling industry has been slow to adopt modern technologies that can greatly improve analytics and event forecasting.
This interview has been edited.
Joshua New: At first glance, PredictWise seems to be just another polling service, but that is not really the case, is it? Could you talk about what is going on underneath the surface that makes this so different?
David Rothschild: With PredictWise, there are two main considerations: data collection and data analytics. When it comes to data collection, I’ve done a lot of experimental polling and prediction game design. In 2012, we developed a system called “widetube,” a prediction game in which the probabilities for every outcome were linked combinatorially, meaning that as one value moves, all the other values move in response. At the same time, I was working with Xbox, polling the narrow demographic that makes up Xbox players and still producing extremely accurate results for the general population. We accomplished this by combining general survey questions with really strong analytics and adjustments to the raw data.
PredictWise incorporates these approaches and, more generally, tries to answer the questions of “what is all the data out there, what are the raw numbers saying, and can we actually turn these into forecasts?” A lot of what we see out there in polls of the general population is not really a forecast; it’s just the raw data. I don’t care about just the number of search results or retweets, I care about how these correlate with outcomes. PredictWise tries to make this jump for you. Too often, you’re seeing just counts and raw data, and we translate that into results about things you care about.
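One common way to adjust a demographically narrow sample toward the general population is post-stratification: average the responses within each demographic cell, then weight each cell by its share of the target population. The interview does not spell out which adjustments were used, so the following is only a minimal sketch of that general idea. The function name, the cell labels, and all of the numbers are hypothetical.

```python
from collections import defaultdict


def poststratify(responses, population_shares):
    """Estimate population-level support from a skewed sample.

    responses: list of (cell, answer) pairs, where answer is 0 or 1.
    population_shares: dict mapping cell -> share of the target population.
    """
    totals = defaultdict(int)
    yes = defaultdict(int)
    for cell, answer in responses:
        totals[cell] += 1
        yes[cell] += answer

    estimate = 0.0
    covered = 0.0
    for cell, share in population_shares.items():
        if totals[cell] == 0:
            continue  # no respondents in this cell, so it cannot contribute
        estimate += share * yes[cell] / totals[cell]
        covered += share
    if covered == 0:
        raise ValueError("no overlap between sample cells and population cells")
    # Renormalize over the cells that actually had respondents.
    return estimate / covered


# Hypothetical sample that over-represents younger respondents.
sample = (
    [("under_30", 1)] * 32 + [("under_30", 0)] * 48
    + [("30_plus", 1)] * 12 + [("30_plus", 0)] * 8
)
# Hypothetical shares of each cell in the general population.
shares = {"under_30": 0.25, "30_plus": 0.75}

raw_mean = sum(answer for _, answer in sample) / len(sample)
adjusted = poststratify(sample, shares)
print(f"raw sample mean: {raw_mean:.2f}, adjusted estimate: {adjusted:.2f}")
```

In this made-up example the raw sample mean understates support because the over-represented group is less supportive; reweighting by population share moves the estimate toward the under-sampled group.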
New: How did you decide which sources to pull data from? Why these sources, and not others?
Rothschild: If you try to forecast any given event, you need to consider several things. First, what is the exact question you really want answered? For example, when predicting an election, you want to know the probable electoral college votes, rather than the outcome of the national popular vote. Sometimes this means moving away from approaches that have historically been used for the sake of expediency.
Second, what is the most accurate set of data you want to tackle? This can range from market prices to Twitter posts to polls to search results. With these datasets, you need to see how well that data correlates with outcomes to your question, and how robust those correlations are over time. Then, you need to figure out how timely this data is and how important timeliness is in what you’re trying to deliver. Finally, you need to understand how scalable these sources are.
For some predictions, I build large models that factor in a lot of historical data, or sometimes I focus in on polling or Twitter data. In general though, market data is accurate, timely, flexible, and most importantly it’s very scalable: there’s usually a lot of it, and it tends to look similar regardless of the domain. It’s a lot easier to use this as the basis of a forecast than other data that requires a lot of extra work to make it comparable or accurate. Sometimes those other sources are better, but there can be pretty huge tradeoffs. That’s why PredictWise uses market data so heavily.
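The check described in this answer, seeing how well a data source correlates with outcomes and whether that correlation holds up over time, can be sketched as a per-period correlation. This is an illustrative sketch only, not PredictWise's method: the function names, the choice of Pearson correlation, and every number below are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_period(records):
    """Group (period, signal, outcome) records and correlate within each period."""
    grouped = {}
    for period, signal, outcome in records:
        signals, outcomes = grouped.setdefault(period, ([], []))
        signals.append(signal)
        outcomes.append(outcome)
    return {p: pearson(s, o) for p, (s, o) in grouped.items()}


# Hypothetical records: a signal (say, a normalized share of mentions)
# paired with whether the event happened (1) or not (0).
records = [
    ("period_1", 0.2, 0), ("period_1", 0.4, 0),
    ("period_1", 0.6, 1), ("period_1", 0.8, 1),
    ("period_2", 0.3, 0), ("period_2", 0.5, 1),
    ("period_2", 0.7, 0), ("period_2", 0.9, 1),
]

by_period = correlation_by_period(records)
for period, r in by_period.items():
    print(f"{period}: r = {r:.2f}")
```

A source whose correlation swings widely from one period to the next is a weaker foundation for a forecast than one whose relationship to outcomes stays stable, even if its best period looks impressive.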
New: Currently, PredictWise focuses on four categories—politics, sports, entertainment, and the economy/finance, with the first two being the largest. Does each topic require different analytical methodologies?
Rothschild: Ideally, we want to have the same underlying process regardless of the domain. As I mentioned, one of my big concerns is scalability. You want to approach an economic indicator in the same way you approach an election or the World Series. So, the approach is the same, but forecasting different outcomes puts different weights on the data. An optimized forecast in one domain will have different weights on different variables, and even different variables entirely, and you have to figure out what the right balance is for each question.
New: Interestingly, with most questions, PredictWise assigns a monetary value to each potential outcome. Where does this come from and what does it mean? When we met at the Microsoft TechFair, you mentioned your interest in the gamification of polling. Is this related?
Rothschild: When I look at data from markets or bookies or any situation where people are trading and thinking in market terms for outcomes, I try to shift these values into something like a contract that is worth one dollar if it happens, and zero dollars if it doesn’t happen. So, regardless of what’s actually going on in the market or in the game, this data translates into the question, “what is the derived price that people in this market are willing to pay for such a contract?” It’s a tool to make a bunch of disparate data actually comparable. It’s fascinating to compare and contrast and it helps aggregate these outcomes together.
I also think there’s huge potential to bring this kind of gamification to prediction markets and I think it can bring a lot of value to games. Think of polling: every day, hundreds of thousands of people opt in to some type of polling. By asking the right questions and gathering the data efficiently, could these actions be useful? It’s the same with fantasy sports and video games: could you ask more meaningful questions in these activities so that users’ actions have forecasting value? I’m really interested in seeing how this technology will develop.
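The one-dollar contract framing in this answer can be illustrated with bookmaker odds. A standard first step is to take the reciprocal of each outcome's decimal odds as its implied probability, then rescale so the prices across all outcomes sum to one dollar, which removes the bookmaker's margin. The interview does not describe the actual conversion PredictWise applies, so treat this as a simplified sketch; the function name and the odds are hypothetical.

```python
def contract_prices(decimal_odds):
    """Convert decimal odds into prices for contracts that pay $1 if the outcome occurs.

    decimal_odds: dict mapping outcome -> decimal odds
    (total payout per $1 staked, stake included).
    """
    for outcome, odds in decimal_odds.items():
        if odds <= 1.0:
            raise ValueError(f"decimal odds for {outcome!r} must be greater than 1")
    # Reciprocal of the odds is the raw implied probability.
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    # Raw values usually sum to more than 1 because of the bookmaker's margin;
    # rescaling makes the set of prices sum to exactly one dollar.
    total = sum(raw.values())
    return {outcome: p / total for outcome, p in raw.items()}


# Hypothetical odds for a two-outcome event.
prices = contract_prices({"Team A": 1.6, "Team B": 2.5})
for outcome, price in prices.items():
    print(f"{outcome}: ${price:.2f}")
```

Once every source is expressed on this common zero-to-one-dollar scale, prices from different markets or bookmakers can be compared and aggregated directly.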
New: How do these predictions stack up compared to more traditional polls, such as a Gallup poll? If PredictWise’s algorithms add such insight, I imagine this would be very valuable to journalists, political analysts, sports fans, and so on. Can you talk about any plans PredictWise may have to work with these groups?
Rothschild: PredictWise isn’t involved in partnerships at this point and we’ll see what happens in the future, but what I’m most interested in is being part of the conversation in how the underlying data collection and analytics will change dramatically. For generations, forecasting has really centered around theoretically perfect data collection without much thought for analytics, how to aggregate this data, or how to turn this into actionable market intelligence. The polling industry is really based on taking random samples of Americans and stopping there. They publish these snapshots that are nothing more useful than raw data. Of course this data is important, but there’s so much room for improvement. I want to develop new methods of data collection and analytics that can actually be used by stakeholders to make more efficient decisions. Essentially, I want to disrupt an industry that, as far as I’m concerned, has been quite slow to innovate in a world full of computers and the Internet that allows for faster, cheaper, and more flexible ways to reach people and analyze data.