
Linguistic vitality on the web

wed2aug2006—31w214d58%— 14h19m00s—0utc

As I said in a previous post, I believe Spanish, my mother tongue, has a low status on the web. And as I lay there pondering the subjectivity of my assessment, I remembered Mihaly Csikszentmihalyi’s fascinating account of how (and why) he became a scientist (it appears in John Brockman’s excellent Curious Minds, a compilation of similar tales by top-notch scientists and a sure recommendation to anyone).

The particular anecdote that came to mind was when he and a friend quarrelled over whose neighborhood was the more communist (the matter was relevant because he was living in Italy and the country was then in political turmoil). Their brilliant analytic idea for settling the question was to count the circulation of the left- and right-leaning newspapers at the newsstands in each of their neighborhoods. This of course led them into all sorts of interesting statistical considerations, but it put them on the path to finding the subtle answers to their question, and it was certainly better than “the hocus-pocus most adults rely on to bolster their arguments”.

So I want to try something similar with my question: what is the linguistic vitality on the web of 14 languages? This post will be the beginning of my investigation. For reasons of practicality and personal bias, the 14 languages I’m going to settle on are: English, German, French, Polish, Japanese, Dutch, Italian, Swedish, Portuguese, Spanish, Farsi, Chinese, Esperanto, and Hindi.

The first investigation I conducted (and the one that set the order of the above list of languages) is how many articles the Wikipedia of each language had. It’s interesting to compare these data with those for language population, both total and online. The anomalies, Spanish, Chinese, and Farsi on one side, Swedish and Dutch on the other side, jump out immediately.

As anecdotal evidence, the English language article on Vicente Fernandez, a very popular Mexican singer of Spanish folk songs, is almost ten times as large as the Spanish one as of August 1, 2006.

And even for an event as widely covered by the media as the recent Mexican elections (July 2, 2006), 8 days after the election the English language article was still longer, more detailed, and more polished than the Spanish one (as I’ve already chronicled).

My second investigation explores a different path: how interested are outside webizens in learning the language? To try to answer this, I resorted to 43Things, a popular English website where people publicly keep lists of 43 things they want to accomplish. One can search through them and, with a query like “learn french”, get a reasonably accurate estimate of the number of people on the website interested in learning the language. Things are not so simple, though: since people can give whatever title they want to each thing they want to do, many synonyms occur (like “learn french” and “learn to speak french”). Fortunately, most of them are retrieved with the simple query “learn french”, so it’s just a matter of counting the number of people under each heading (for this research, I stopped at the 100th synonym) and adding them together. The interesting results follow.

Contrary to what I would have expected, Spanish is the language most people on 43Things are interested in learning. The results must be taken with a grain of salt, to be sure. Since 43Things is an English community to begin with, the number of members who say they’re interested in learning English is naturally low (and yet not so low). But still, it’s interesting that Spanish came out first.

If someone wants to keep investigating down this path, here’s the tiny script I used to spider 43Things (Ain’t Ruby just lovely?).

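    require 'open-uri'

    # Sum the people counts found on one page of 43Things search results.
    def counter(lang, page)
      ar = []
      # URI.open is the open-uri entry point on current Rubies; older ones accepted a bare open.
      URI.open("http://www.43things.com/search/query?page=#{page}&q=learn+#{lang}&type=goal&list=1") do |ws|
        ws.each_line do |line|
          # Collect every figure followed by "people" or "person" on this line.
          line.scan(/(\d+) *(?:people|person)/) { |n| ar += [n[0].to_i] }
        end
      end
      ar.inject(0) { |sum, n| sum + n }
    end

    # Add up the first four result pages for each language.
    %w{English German French Polish Japanese Dutch Italian Swedish Portuguese Spanish Chinese Esperanto}.each do |lang|
      puts lang + ": " + (1..4).inject(0) { |sum, n| sum + counter(lang, n) }.to_s + " people"
    end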
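For anyone who wants to check the counting step without hitting the site, here is a minimal sketch that pulls the regex out into its own function and runs it on a string. The helper name count_people and the sample lines are my own inventions for illustration; they are not real 43Things output.

```ruby
# Hypothetical helper: sums every "<number> people" / "<number> person"
# figure found in a chunk of text, using the same pattern as the spider.
def count_people(text)
  text.scan(/(\d+) *(?:people|person)/).sum { |match| match[0].to_i }
end

# Made-up sample lines, not real 43Things output.
sample = <<~TEXT
  learn french 12 people
  learn to speak french 1 person
  learn french fluently 3 people
TEXT

puts count_people(sample)
```

On this made-up sample it prints 16, the sum of the three headings. Keeping the parsing apart from the fetching also makes it easy to swap in a different pattern if the page layout differs from what the regex assumes.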

This shall do for now (there’s plenty else to do!). I’ll look further into my original question in a couple of weeks. In the meantime, if you have ideas for analytic investigations that could measure linguistic vitality on the web, I’d be thrilled to hear them.
