Want to know who will win the next US election, which films will bomb or which stocks will rise? Researchers are mining social media to find the answers.
Peter Gloor does not own a crystal ball. Nor does he read tea leaves or the lines on your hand, or claim to speak to people “from the other side”. Yet the research scientist at the Massachusetts Institute of Technology (MIT) claims he can predict the future.
Take the ongoing Republican nomination race. Late last year, Newt Gingrich was surging in the polls and some pundits thought he might well overtake frontrunner Mitt Romney. But Gloor predicted he would not.
It was true that chatter on Twitter seemed to be giving an edge to the former House speaker, but after analysing edits to Wikipedia, Gloor predicted that Romney would beat him. Gloor ended up being right: Romney beat Gingrich by a wide margin on the night.
Of course, you could argue his prediction was just a fluke. But Gloor says his forecasts have worked time and again. In some cases, his analysis can even help predict electoral results where polls fail. For example, in 2009 Switzerland voted to bar the construction of new minarets on mosques. The polls wrongly indicated that people would reject the measure, while Gloor’s forecast, based on an analysis of social media, correctly predicted it would pass. “People lied to the pollsters, because they didn’t want to appear racist,” says Gloor.
His research is part of a growing body of scientific work, often called predictive analytics, which involves using software and computer algorithms to mine emails, social media and other public websites to help make predictions about the future. It is a field that has caught the attention of everyone from movie moguls, who want help identifying which films will be popular, to captains of industry, who want advance information on which stocks are winners. And for those interested in politics, it is being used to forecast who will win in upcoming elections.
‘Enter the Swarm’
The idea of mining publicly-available data to help predict the future is nothing new, but past efforts focused primarily on the news media. In the run-up to World War II, the US and British governments monitored world media to help predict the course of hostilities, and during the Cold War, the US government sponsored academics to come up with mathematical models that would analyse media reports to help predict the actions of the Soviet Union.
Over the past decade, however, the advent of social media sites like Facebook and Twitter has created a real-time flood of news about the world. Kalev Leetaru, a computer scientist at the University of Illinois, describes social media as something akin to “a gateway for humanity” because of the sheer abundance of data.
“There are three billion items posted to Facebook every day and 200 million tweets,” he says. “One of my favourite figures is that right now, every day, there are more words posted to Twitter than were in the New York Times over the last 60 years.”
The trick is in understanding what that data means, and which data is important for what topics.
For example, Gloor says that Twitter is very good for predicting behaviour that is influenced by the crowd – the general public. When it comes to seeing a new film, for instance, people are heavily influenced by what the crowd seems to say: is it a good or bad movie?
It is a finding that has been put to the test by computer scientist Hsinchun Chen at the University of Arizona. Chen and his group have looked at data from hundreds of Hollywood movies and have attempted to correlate ticket sales with online data generated by social media users. The model worked “beautifully”, suggests Chen.
His lab’s models also correctly predicted that the 2011 Mel Gibson movie The Beaver, about a man who speaks through a hand puppet, would bomb; so too would Jim Carrey’s Mr. Popper’s Penguins. Neither received much pickup on social media, according to Chen.
The lesson learned from looking at social media data and movies is that it does not matter who the star is, how much you spend on the movie, or even how good the movie is: what matters is what people are saying about it online. “The movie industry is buzz,” Chen suggests. “It’s not content.”
Whilst this crowd approach works nicely for films, it starts to break down when applied to political events. Instead, researchers have to look for different types of data. This is where the “swarm” comes in, says Gloor. The swarm is a group of more neutral experts, such as those who regularly edit Wikipedia. “There are lots of Wikipedians, but just two or three thousand do most of the work,” he says. “We track those, how well respected they are, who’s editing what.”
In the case of the Republican primaries, Gloor’s analysis in December noted that even though the masses on Twitter seemed to indicate a Gingrich victory, the swarm on Wikipedia was pointing to Romney.
While those working in predictive analytics acknowledge that the wealth of information provided by social media is important, they are circumspect about its ability to be applied to all areas. Some events cannot be predicted well using social media, namely those which people simply don’t talk about online. “We probably cannot find any crimes,” says Gloor. “They will not be discussed in public.”
Leetaru is even more wary of overly relying on social media to make predictions, arguing that in many cases even seemingly public events, like protests, have a hidden side to them. “If you look at the UK riots [in 2011], the first thing everyone said was [look at] Facebook and Twitter. But when they checked further, they realised that actually the rioters were using encrypted peer-to-peer BlackBerry messages.”
While Leetaru is also involved in forecasting social and political events, his current work focuses more on culling information from traditional media, including a retroactive analysis of news reports, which he said located Osama bin Laden’s hideout within a radius of 125 miles (200km).
Social media, while providing a wealth of data, does not necessarily provide first-hand information, or better information. “All the work that is coming out seems to suggest that social media is more of a sounding box,” says Leetaru. “Something happens and social media reports on it.”
For example, in looking at Kenyan election violence several years ago, Leetaru suggests that most of the social media messages coming out of the country were not necessarily first-hand reports about what people personally saw, but rather people retweeting or rebroadcasting news media reports. “So it wasn’t that they were reporting that they saw a tank heading down the street, they were basically using it almost as a real-time bulletin board,” he says.
In many of these more complex cases, researchers are combining social media with news reports and other public data to help hone their forecasts. For example, Swedish-American firm Recorded Future, now based in Cambridge, Massachusetts, trawls hundreds of thousands of pieces of data, from government filings to social media, hunting for clues to the future.
According to Christopher Ahlberg, the CEO of the firm, the company’s proprietary software grew out of a decade of previous work on large datasets. He and his coworkers grew interested in looking at ways to organise data in a time-based manner, allowing people to do Google-like searches on future events.
“Media is full of these facts about the future,” suggests Ahlberg. “So, what we asked ourselves is could we build a machine – we used to jokingly call it a DVR for everything spoken and written about the future – and organise that data meticulously into a dataset, and set it up as a cool user experience?”
The company did just that, and today, Recorded Future’s software is being used by private customers and government clients interested in data-mining the future.
One example of how Recorded Future’s technology could be used, according to Ahlberg, might be someone following pharmaceutical stocks. The user clicks on a specific date, and Recorded Future shows all of the data mined from public information about things that are supposed to take place on that date, such as a review by the US Food and Drug Administration, the release of a new drug, or the expiration of a drug patent.
All this information can hold vital clues for those in the know. And it could be a step towards predictive analytics’ ultimate prize: picking winning stocks. But Ahlberg and other researchers know this is a long way off. “Sometimes I find people want magic predictions,” Ahlberg says. “That doesn’t reflect reality.”
Others agree this is still a tricky area. Gloor, who has used his models to try to outsmart the markets, has so far had only limited success. However, the work has thrown up at least one intriguing result: the models work best with alternative energy stocks. “There we have the tree-huggers, and tree-huggers are honest,” he says. “They talk about new developments that correspond to alternative energy.”
Chen also concedes that financial markets are harder to predict than movies. “It’s not easy to predict a stock return: You can predict the movement, volume and volatility,” he says. “The return is still the Holy Grail.”
Even if such analysis cannot always make precise forecasts, its potential for forecasting trends and events has increasingly attracted interest from the US government. Recorded Future, whose technology can also mine social media to forecast political protests, such as Occupy Wall Street, or track criminal activity, such as cyber attacks, has already attracted interest from the national security community. In-Q-Tel, the venture capital firm founded by the CIA, has invested in the company.
In fact, the intelligence community’s interest in predictive modelling, particularly based on social media, has been growing over the past few years, especially in light of the Arab Spring protests, which were at least partly fuelled by social media. Last year, the Intelligence Advanced Research Projects Activity (Iarpa), a research and development arm of the US intelligence community, launched a project called Open Source Indicators, designed to mine information from social media and other public data and to come up with predictions.
The Pentagon has also launched a number of forecasting projects in recent years, hoping, for example, to predict insurgent behaviour. Mark Maybury, the US Air Force chief scientist, likens this sort of human-data collection – whether from social media, foreign news, or elsewhere – to the images collected by drones flying over Afghanistan. Only instead of information about bombs, it is collecting information about how people behave.
“At the strategic national level, you’d like to do things like predict state failure,” said Maybury of the modelling work the Air Force is doing. “More tactically, you’d like to be able to do things like discover illicit shipping routes, human trafficking routes, and narcotics routes.”
Yet one question that is only just starting to be discussed is whether the public is aware that their tweets, their Facebook posts and their Wikipedia edits are being sucked up by academics, private companies, the Pentagon, and even the CIA. In the modern world, people often do not think twice about tweeting their dinner plans, broadcasting their political opinion, or posting updates about a street protest.
Even people who avoid Twitter or Facebook may be contributing information in ways they didn’t realise: a product review on Amazon, a comment on a news site or even something as simple as a search on their smart phone can all be collected and analysed.
“People have become jaded to the information overload,” says Mark Abdollahian, a political scientist at Claremont Graduate University, who also works with a private company that creates political forecasting models. And it is not just social media: Abdollahian points to the latest iPhone’s popular new Siri voice application, which sends data back to Apple, where it is analysed along with other user information.
All of this data, whether from social media or smart phone searches, is being analysed by someone for some purpose, whether to predict a political protest, a stock price, or even just to figure out what restaurant you’re trying to find.
“Your queries go in there to make a better user experience. What do people do with the information?” Abdollahian says. “That’s the question we all need to ask.”