Word-boundaries across languages

Guadalajara, Mexico
sun4sep2016—35w248d67%— 23h52m13s—-5utc

A simple exploration into the beginnings & endings of words across languages.

Alphabets are remarkably powerful & discrete symbolic systems (~30 symbols to capture all the sounds in a language!). You can easily see evidence of deep linguistic features (like how meaning is broken down into pieces or how sound is encoded) in something as simple as the beginning and ending letters of words.


Letter clouds for the ENDING letters of the most common words in English, Spanish, French, Italian, Portuguese, Latin, Esperanto, German, Swedish, and Russian

Already in these clouds you can tell that there’s something different in the English, German and Russian ones: they’re more crowded and no single letter dominates as strongly as in the other clouds.

The clouds were built from the following ending-letter percentages:

ENDING-letter percentages of the most common words in English, Spanish, French, Italian, Portuguese, Latin, Esperanto, German, Swedish, and Russian

It turns out 62% of Spanish words do end in -a or -o! (Endings which, of course, determine the grammatical gender of nouns in Spanish.) As a native speaker of Spanish I had never really noticed such astounding regularity. Adding an -a or -o to an English word to turn it into Spanish, as so many Americans do in jest, turns out to be a very reasonable thing to do.
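To make that computation concrete, here is a minimal Python sketch of how ending-letter percentages can be tallied. It is not the Mathematica code used for this post; the function name and the tiny word sample are assumptions made up for illustration, so the resulting numbers say nothing about real Spanish.

```python
from collections import Counter

def ending_letter_percentages(words):
    """Return {letter: percent} for the last letter of each word, most common first."""
    endings = Counter(w.strip().lower()[-1] for w in words if w.strip())
    total = sum(endings.values())
    return {letter: 100 * count / total for letter, count in endings.most_common()}

# Hypothetical toy sample, NOT the word list used in the post.
spanish_sample = ["casa", "perro", "gato", "mesa", "libro",
                  "ciudad", "comer", "azul", "puerta", "cielo"]

percentages = ending_letter_percentages(spanish_sample)

# Combined share of words ending in -a or -o within the toy sample.
share_a_o = percentages.get("a", 0) + percentages.get("o", 0)
print(percentages, share_a_o)
```

Run against a full list of common words per language, the same tally would give the kind of table the clouds were built from.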

A line plot of cumulative percentages of the letter endings allows us to compare the distribution shapes better across languages. Notice how the Romance languages (in yellow-reds) cluster together towards the upper-left corner of extreme regularity. English and Russian (and to a lesser extent German & Dutch) stand out clearly.
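The curve behind such a plot is just a running total over the letters ranked from most to least common. A rough Python sketch, again with hypothetical names and toy word lists rather than the data or code behind the plot:

```python
from collections import Counter
from itertools import accumulate

def cumulative_ending_percentages(words):
    """Rank ending letters by frequency and return [(letter, cumulative percent), ...]."""
    counts = Counter(w[-1].lower() for w in words if w)
    ranked = counts.most_common()
    total = sum(counts.values())
    running = accumulate(count for _, count in ranked)
    return [(letter, 100 * cum / total) for (letter, _), cum in zip(ranked, running)]

# Hypothetical toy samples, chosen only to show a steep curve and a flat one.
regular_sample = ["casa", "mesa", "puerta", "gato", "libro", "ciudad"]
diverse_sample = ["dog", "cat", "house", "run", "with", "my"]

regular_curve = cumulative_ending_percentages(regular_sample)
diverse_curve = cumulative_ending_percentages(diverse_sample)
print(regular_curve)
print(diverse_curve)
```

A curve that climbs quickly (a few letters cover most words) sits towards the upper-left corner; a curve that climbs slowly hugs the diagonal.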

The Romance languages (those derived from Latin) seem to be inordinately fusional, that is, fond of inflections, tenses and all sorts of suffixes, rather than word composition. This seems to be the reason why they impose such strong ending regularities. Italian (with only 21 letters, the smallest alphabet in the sample) is the most regular language in the sample (other than Esperanto). Swedish, a North Germanic language, is surprising in its Romance-like regularity (why?).

The theory that Romance languages have regular endings because they’re gendered is belied by German, Dutch & Russian which are also (tri-) gendered and yet have diverse endings.

Esperanto, a constructed language that deliberately seeks consistency and regularity (to do mind-expanding things with it, similar to how Arabic numerals revolutionized arithmetic compared with Roman numerals), is accordingly the language with the most regular endings: the top 3 letters alone account for 97.63% of all endings. In fact, in Esperanto endings are always meaningful and consistent: every noun ends in -o (-j is the plural), every adjective in -a, every infinitive verb in -i (conjugated verbs end in -s or -u), (almost) every adverb in -e (or -ux). Esperanto is a fascinating experiment in agglutination and synthesis from Indo-European roots.
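Because those ending rules are so regular, they can be written down as a lookup table. The Python sketch below is a naive heuristic built only from the rules listed above; the names are hypothetical, and it will mislabel short function words (articles, pronouns, conjunctions) that happen to end in the same letters.

```python
# Ending letter -> word class, taken from the rules described in the text.
ENDING_TO_CLASS = {
    "o": "noun",
    "a": "adjective",
    "i": "infinitive verb",
    "s": "conjugated verb",
    "u": "conjugated verb",
    "e": "adverb",
}

def guess_word_class(word):
    """Guess an Esperanto word class from its final letter (naive heuristic)."""
    w = word.strip().lower()
    plural = w.endswith("j")
    if plural:
        w = w[:-1]  # drop the plural marker and look at the letter before it
    label = ENDING_TO_CLASS.get(w[-1:], "unknown")
    return f"plural {label}" if plural and label != "unknown" else label

for example in ["hundo", "hundoj", "bela", "paroli", "parolas", "rapide"]:
    print(example, "->", guess_word_class(example))
```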

I don’t know enough Russian to guess why its endings are such an extreme anomaly. All I can think of is that its Cyrillic alphabet, with 33 letters, is the longest one in the sample.

With English, on the other hand, I have many theories to explain its ending diversity. It has shed most cases & tenses over its evolution, but it keeps a lot of frozen morphological inflections it imported from Latin via French. It naturally follows the Germanic penchant for word composition (agglutination) and imports words voraciously like no other language, often preserving their spelling. Finally, English uses 26 letters to represent 44 phonemes (sound pieces), which it does through (very inconsistent) letter combinations.


Letter clouds for the BEGINNING letters of the most common words in English, Spanish, French, Italian, Portuguese, Latin, Esperanto, German, Swedish, and Russian

Beginnings are much less structured than endings. That’s immediately apparent in their letter clouds, but even more so in the line plot of cumulative percentages of letter beginnings: there’s much less variance across languages, and the shapes tend to approximate (with a bend) the diagonal of even distribution.
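One way to put a number on "closer to the diagonal" is to measure the largest vertical gap between a cumulative curve and the straight line of an even distribution. The Python sketch below is only an illustration: the gap metric is my own choice for this example, the word list is a toy, and the alphabet size of 27 is an assumption passed in by hand.

```python
from collections import Counter
from itertools import accumulate

def cumulative_shares(words, position):
    """Cumulative fraction of words covered by the k most common letters
    at `position` (0 = first letter, -1 = last letter)."""
    counts = sorted(Counter(w[position].lower() for w in words if w).values(),
                    reverse=True)
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

def max_gap_from_diagonal(shares, alphabet_size):
    """Largest vertical distance between the curve and the diagonal k / alphabet_size."""
    return max(share - k / alphabet_size for k, share in enumerate(shares, start=1))

# Hypothetical toy sample; 27 is an assumed alphabet size for this example.
sample = ["casa", "mesa", "gato", "libro", "perro", "puerta", "cielo", "todo"]
ending_gap = max_gap_from_diagonal(cumulative_shares(sample, -1), alphabet_size=27)
beginning_gap = max_gap_from_diagonal(cumulative_shares(sample, 0), alphabet_size=27)
print(ending_gap, beginning_gap)
```

A smaller gap means the letters are spread more evenly; on a sample this small the numbers are only a demonstration of the mechanics.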

Morphology is by far a suffix game in Western languages (almost all meaning inflections are specified by varying the ending of words, NOT their beginnings — why?) and this seems to mean that beginnings are under few constraints.

The words come from the lists of common words in Mathematica’s WordList, so they appear in dictionary form. A deeper study would draw on real-life, representative corpora exhibiting all the morphological permutations words allow (plurals, tenses, genders, cases, inflections…).

This was all done in Mathematica 11. I’m finally starting to grok it after flirting with it since college (almost a decade ago!). It’s especially valuable to me as a tool for exploration, while the architecture and brilliance of such a complex system is a constant source of joy and inspiration.

Other than presentation (which was as painful as it was rewarding), this was all the Mathematica code needed to compute this:
