You can learn a lot about the world from Wikipedia, sometimes without reading the articles.
Kalev Leetaru, a researcher at the University of Illinois, has been looking at the capacious volunteer-written encyclopedia as a Big Data resource, concentrating on the connections between cities around the globe over time. To understand these connections, he focuses on the type of language used to talk about a particular place, to see whether the writers have a generally positive or negative sentiment toward the place at that time.
The result is an interesting historical atlas of the rise of globalization and warfare. His technical sponsor in data mining, Silicon Graphics International, hopes the work is also an advertisement for S.G.I.’s decidedly noncloud style of technical computing for some kinds of number crunching.
Mr. Leetaru scanned Wikipedia’s 37 gigabytes of data in English, extracting mentions of 80 million locations and 40 million dates, scattered across four million pages of copy. (Wikipedia has many more pages than that, but most of the rest are redirects.)
“I put every coordinate on a map with a date stamp,” he said, adding that he then linked it with every other location mentioned that year. “It gave a map of how the world is connected.” He then color-coded it for the sentiment used to describe a place. Red meant the writer was describing something bad, green meant something good, or at least neutral.
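The linking and color-coding steps described above can be sketched roughly as follows. This is a minimal illustration and not Mr. Leetaru’s actual pipeline: the `(article_id, year, place, tone)` record layout, the numeric tone scores, the function name `build_links` and the rule that a negative average tone means red are all assumptions made for the example. It also assumes two places are linked only when the same article mentions both for the same year, which the article does not spell out.

```python
from collections import defaultdict
from itertools import combinations


def build_links(mentions):
    """Link places co-mentioned in the same article and year, and color each link.

    `mentions` is an iterable of (article_id, year, place, tone) tuples.
    The layout and the tone scale (negative = bad) are hypothetical.
    Returns a dict mapping (year, place_a, place_b) to "red" or "green".
    """
    # Collect the tone scores of every place, per article and year.
    groups = defaultdict(lambda: defaultdict(list))
    for article_id, year, place, tone in mentions:
        groups[(article_id, year)][place].append(tone)

    # Accumulate [tone sum, tone count] for every pair of co-mentioned places.
    totals = defaultdict(lambda: [0.0, 0])
    for (_, year), places in groups.items():
        for a, b in combinations(sorted(places), 2):
            tones = places[a] + places[b]
            entry = totals[(year, a, b)]
            entry[0] += sum(tones)
            entry[1] += len(tones)

    # Red for a negative average tone; green for positive or neutral.
    links = {}
    for key, (total, count) in totals.items():
        links[key] = "red" if total / count < 0 else "green"
    return links


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ("article-1", 1863, "Gettysburg", -0.8),
        ("article-1", 1863, "Washington", -0.2),
        ("article-2", 1990, "Berlin", 0.5),
        ("article-2", 1990, "Paris", 0.1),
    ]
    for link, color in sorted(build_links(sample).items()):
        print(link, color)
```

Grouping by year before pairing places keeps each year’s links separate, which is what would let the results be laid out frame by frame as a time-lapse.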
The result, which was later laid out as a time-lapse movie, was generated in 30 minutes, he said, then followed by a day of tweaking. That is a fast return, made possible, he explained, because the S.G.I. machine, which has 4,096 computing cores and can store 64 terabytes of data in its main memory, does not send data and computing resources to several different locations for cloud-style parallel processing. Such a computer, which costs from $30,000 to more than $1 million, could be cost-effective for certain data-intensive tasks.
Mr. Leetaru’s research is as interesting for what it says about Wikipedia as for what it says about the world. For one thing, the connections between places build slowly, tracing the course of immigration and empire, up to the present era, where globalization makes so many lines that the map is an unreadable solid green.
In addition, the areas of red are notably few: a little during the Napoleonic Wars, a lot during the American Civil War, and less during World Wars I and II. That seems to show a United States preoccupation: as bad as the Civil War was, in terms of loss of life it does not even rank in the top 20 conflicts worldwide since 1800, and it had a relatively small effect on other nations. World War II has far more red in the United States, which had virtually no fighting on its territory, and more in Europe than in Asia, except for the Philippines, where English is more commonly spoken and there were strong ties to the United States.
Mr. Leetaru said his work was meant to be a snapshot of how Wikipedia writers of today view the world, and not a decisive verdict of history. Examining the content of books and other print media written over a longer period might well register different, and changing, sentiments about historical events over time.
He has done similar analyses with books, online news and social media, which he calls Culturomics 2.0. The long-term goal, he said, is to create a new way of making economic and political judgments. “If the tone of description is stable, the country is stable,” he said. “The world is too complex to model. But we can crowdsource the global pulse.”
Source: bits.blogs.nytimes.com