Tuesday, September 06, 2011

Who writes Wikipedia?

I hacked together a tool that retrieves the size and date of each past version of any Wikipedia article and charts size against date, showing how the article grew over time. Then I ran it on a sample of articles from the Time ‘100 best English language novels’ list. The study would need to be formalised and extended before publication, but the initial results are surprising.
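For readers who want to try something similar, here is a minimal sketch of how such a tool might collect its data. It is not the code I used. It assumes the public MediaWiki API at en.wikipedia.org (the `revisions` property with `rvprop=timestamp|size`), and the function names `parse_revisions` and `fetch_size_history`, as well as the User-Agent string, are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime

# Assumption: the standard MediaWiki API endpoint for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def parse_revisions(response):
    """Extract (timestamp, size_in_bytes) pairs from one decoded API response."""
    history = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            if "timestamp" not in rev or "size" not in rev:
                continue  # skip revisions whose metadata is hidden or missing
            when = datetime.strptime(rev["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            history.append((when, rev["size"]))
    return history


def fetch_size_history(title):
    """Download the full size history of one article, oldest revision first."""
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "prop": "revisions", "titles": title,
        "rvprop": "timestamp|size", "rvlimit": "max", "rvdir": "newer",
    }
    history = []
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; use one that identifies your own script.
        request = urllib.request.Request(
            url, headers={"User-Agent": "size-history-sketch/0.1"})
        with urllib.request.urlopen(request) as reply:
            response = json.load(reply)
        history.extend(parse_revisions(response))
        if "continue" not in response:
            return history
        # The API returns results in batches; feed its continuation
        # parameters back into the next request.
        params.update(response["continue"])
```

Calling `fetch_size_history("The Sun Also Rises")` should return a list of (date, size) pairs that any charting tool can plot as size against date.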

I wanted to test the idea discussed here. The official ‘crowdsourcing’ doctrine of Wikipedia is that editors are easily replaceable units of work, each of whose contributions is equally valuable. Supposedly, large numbers of small edits will, over time, make an article ‘drift’ towards quality and accuracy, even if each individual edit improves the article only imperceptibly. This philosophy has determined the way Wikipedia is administered. Those who perform purely administrative work – categorising, formatting and (mostly) vandal fighting – are rewarded by promotion within the hierarchy. Those who produce content, by contrast, receive no formal recognition, on the crowdsourcing assumption that no one person can be identified with any single article, and that content producers are replaceable anyway.

My study flatly contradicts this official doctrine. Growth in a genuinely crowdsourced article would look like Brownian motion with upward drift, as thousands of minor edits gradually ‘stick’ in a Darwinian competition for survival. This is by no means the case. In the majority of articles sampled, there is a pronounced ‘staircase’ appearance to the growth of the article. The size increases rapidly, often within a day and a handful of edits. Then it flattens as the changes stabilise, with minor growth for months or years. Then another editor (or often the same one) adds more content and the size grows rapidly, to be followed by another flat period and so on. It is not unusual for an article that has had thousands of edits to have reached its current size through only a handful of real edits. The majority of the other edits are vandalism followed by reversion of vandalism, minor formatting changes, or the addition of categories. Many articles have effectively only one editor.
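One rough way to put a number on the staircase effect is to ask what share of an article's growth came from its few largest edits. The sketch below is only an illustration, not the method behind my charts. It assumes a list of revision sizes in date order, and the names `step_gains` and `top_edit_share` are invented here. To keep vandalism and its reversion from counting as growth, it credits an edit only with the amount by which it lifts the article above its previous largest size. This is a crude heuristic: junk that inflates an article before being reverted would still be counted.

```python
def step_gains(sizes):
    """For each revision, the bytes by which it exceeds every earlier size."""
    gains = []
    high = 0  # largest article size seen so far
    for size in sizes:
        # Restoring a blanked page only returns to the old high, so it gains 0.
        gains.append(max(size - high, 0))
        high = max(high, size)
    return gains


def top_edit_share(sizes, top_n=5):
    """Fraction of all new-high growth contributed by the top_n largest gains."""
    gains = step_gains(sizes)
    total = sum(gains)
    if total == 0:
        return 0.0
    return sum(sorted(gains, reverse=True)[:top_n]) / total
```

A share close to 1 for a small `top_n` indicates a staircase: a handful of edits account for nearly all the content. Growth by gradual drift would spread the gains over many edits and give a much lower share.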

Another observation is that most of the growth occurred in the period from 2004 until 2007-8. What explains this? It is well known that the overall number of Wikipedia editors has been decreasing since then. One theory is that Wikipedia is ‘full up’: most of the ‘useful knowledge’ has already been captured. So is it that each of the articles about the ‘100 greatest novels’ reached its optimum or ideal length in 2007, and no further work is needed? No. Most of the articles in this series are short, about 10k bytes. But some are longer, and a handful are as large as 80k (which is about as long as an article should be, for practical purposes). So most of the articles are well below the length they could be: Wikipedia articles on great novels are not ‘full up’.

Then could it be that articles about novels have an optimum size, determined by their notability? Well, no. One of the longer articles (60k) is about Hemingway’s classic The Sun Also Rises. This is indisputably a great work, possibly Hemingway’s greatest. But is it any more notable than Great Expectations, whose article weighs in at a mere 40k? Or Pride and Prejudice, universally acknowledged to be one of the great classics of English literature (a paltry 36k)? Of course not. The article on Hemingway’s book was written by a single Wikipedia editor, and was a mere stub before he or she got to work on it. Given the small number of editors who work on these articles, a large article reflects the interest of some editor who put in a lot of work to make it that way, rather than genuine notability. A small article is the result of mere chance.

I shall publish some of these charts in subsequent posts.
