"Inatissentis sentere querum" is a Latin expression meaning, of course, nothing, because those aren't real Latin words. In fact, they are fake Latin words generated by training an algorithm on a Latin text. These words look to me, a non-Latin-knower, like realistic Latin words, and hopefully they do to you, too. Of course, if you know Latin, they probably look terrible, and you're probably already mad at me. Instead, take a look at some of the other languages that I've modeled!
This project explores using Markov chains to model word structure in an alphabet-based language, after being trained on a text in that language. This model is then used to generate realistic-looking fake words and to detect "foreign-looking" words.
What is a "word?"
There are a lot of ways to define a "word." For this project, I'll define a word to be a sequence of letters and certain approved punctuation marks. As far as approved punctuation, I have included the apostrophe in languages that use them for contractions. This means that a contraction like "don't" will be counted as "don't", rather than being counted as "dont" or as two separate words, "don" and "t."
For languages with accented letters, I've chosen to model accented letters as if each one were its own distinct letter. So, my French "alphabet" includes 'E,' 'É,' 'È,' 'Ê,' and 'Ë' as separate "letters." Confusingly, it is possible for a dictionary or text to contain letters that are not in the alphabet! For example, the letters 'J,' 'K,' 'W,' 'X,' and 'Y' are not considered to be part of the Italian alphabet, but they occur frequently in loan words. I use the '?' character to model all non-alphabet characters in each language. I have omitted ligatures because I just don't like them.
Below are the alphabets that I've defined for each language.
Modeling with a Markov chain
What constitutes a good fake word? "Mait" seems like a good fake French word to me, whereas "xkuq" does not seem like a good fake French word. How can we differentiate between these words?
There are a couple good clues in here. One important aspect is letter frequency. 'R,' 'A,' 'I,' and 'T' are all commmon letters in French, whereas 'X,' and 'K' are not. Another important thing is sequences of letters. The letter sequences "XK," "UQ," "XKU," and "KUQ" are all exceedingly uncommon. It's also realatively uncommon for a word to start with "X," or to end with "Q."
This can be used to create a simple language model wherein words with common sequences of letters are judged to be likely, and words with uncommon sequences are judged to be unlikely. I have chosen to use a Markov chain to model the interactions between letters. Instead of modeling each letter individually, now each word is modeled as a sequence of states, where each state corresponds to a letter. This means that there is a parameter for the probability of each transition between any two states. These parameters can still be estimated from the training document. Note that this is just a vanilla discrete Markov chain where the states are the letters themselves; there is no hidden state.
First, I have a letter frequency model that looks at each letter individually. For this, the likelihood of a word is the product of the likelihood of each letter in the word. This can be thought of a degenerate version of a Markov model, i.e. a Markov chain of order 0.
For an N-letter alphabet, there are N parameters, which are estimated using maximum likelihood estimates from training documents.
For this model, I treat the word as a simple Markov chain where each state corresponds to a single letter. There is a parameter from the probability of transitioning from any given state to any other state. For an N-letter alphabet, there are now N² state transition parameters to learn. This will create a lot of cases where the parameters aren't covered by the training data. For example, A Tale of Two Cities doesn't contain the letter sequences "XZ", "ZX," or "VF," amongst many others. In order to prevent this from zeroing out the likelihoods, I've used add-one smoothing, adding one to any bigram that doesnt appear in the training text. A slightly more principled way to do this would be to add a smoothing parameter α, instead of 1, to each count. This parameter α could be learned by doing cross-validation on another text. If I weren't too lazy to do it, that is.
I'm handling the starts and ends of words specially in order to make the model generate more realistic words. For example, the model should capture that it's extremely uncommon for an English word to end with the letter "Q." To do this, I added Markov chain states for start and end tokens. This idea is taken from computational linguistics, where it is common to put these tokens at the start and end of sentences when modeling sentence structure. These tokens behave more or less like letters, except that they must appear at the start and the end of each word, and they may not appear in the middle of the words. So, the word "in" is represented as, ["start token", "I", "N", "end token"].
For this model, I treat the word as a Markov chain of order 2, where each state corresponds to a single letter, but it depends on the two previous states, not just the one previous state. This lets the model know about common and uncommon trigrams.
Final multiplicative model
I'm using the two Markov models alongside the letter frequency model from the previous section. To calculate the proportional likelihood of a word this way, I take the likelihood of the word in each of the three models above. This results in a function that is not an actual likelihood, but is proportional to the likelihood, which works fine for what I'm trying to do. To get the actual likelihood, it would be necessary to divide this proportional likelihood by a very nasty constant.
Multiplying these three models together provides a smoothing effect that counteracts the sparsity of training data for the bigram and trigram parameters. For example, even if the training document didn't have any instances of the trigram, "ERN," we would still be able to see that its consitituent letters and bigrams are fairly common, so the sequence "ERN" would have a relatively high likelihood. This idea is stolen from Eugene Charniak's computational linguistics class.
Here's the training document for each language, and its length.
The code below prints out the highest-frequency n-grams in each language. Start and end tokens are shown with the ⋅ character. So, "e⋅", the most frequent English bigram, occurs when a word ends with the letter "E."
Since a unigram is the same as a letter, the estimated unigram frequencies should match generally-accepted letter frequencies. The English letters with highest estimated frequencies are, in order, "ETAONI." This is close to the generally-accepted list of most frequent English letters, "ETAOIN." The discrepancy may be because I'm using a slightly old text (A Tale of Two Cities), or just because the text is too short. Overall, the letter frequency ordering looks good for most languages.
Below is a short method to print out the most likely words for each language, for each length of word. Since this model calculates word likelihood as a product of letter likelihoods, shorter words will tend to have higher likelihoods. In any language, the most likely word will be the empty string.
Since an M-letter alphabet can form $M^N$ words of length N, exploring the search space of all words is expensive when N is large. I have (slightly) optimized this search by pruning the search if the likelihood of the word I'm at is already lower than the likelihood of one of the top words that's already been found. I tried to create a heuristic pruning strategy, but it did not work well. With my current search, though, I can only generate words of up to four letters in a sane amount of time.
One strange fake word that I notice here is "st," in fake-German. This doesn't look like a good German word to me, because it does not have any vowels. Since the model never looks at the word as a whole, it has no way of "counting" the vowels, consonants, or any other category of letter to ensure that they are occuring in appropriate proportions. Adding in a counting mechanism can make the model a lot more difficult to perform computation on, because it means that every letter has a dependency on every other letter. If you add a vowel at the start of the word, you may need to remove a vowel somewhere else in the word.
Generating fake words, finally!
A functional system to model word probabilities can be used for many different things. To start with, we can find fake words, which are sequences of letters that look realistic, but aren't found in a dictionary. I'll just repeat the previous exercise, but using a dictionary for each language to eliminate sequences of letters that are already real words.
For each language, I have found an online dictionary. There are some discrepancies in things such as how accents are treated, and what does or doesn't count as a word. Some of these things are unavoidable and language-specific, e.g. in German the "ß" character is used only after short vowels except in Switzerland, Lichtenstein, and Namibia, and you don't even want to know what the rules were before 1996.
The less familiar I am with a language, the better the fake words look.
Sampling words progressively
What's the fun of making fake words if they can only be four letters long? Especially with German? As I mentioned above, the challenge is that the search space of N-letter words is too big when N > 4. With Latin's measly 23-letter alphabet, there are only six million possible five-letter words. French, though, with 50 total letters, allows 312 million five-letter words! So, instead of trying to cover the entire search space of long words, why not sample?
For my first sampler, I sample words progressively from left to right, beginning with the start token and sampling each successive letter conditioned on the letters to its left. It should then be possible to generate N-letter words by doing rejection sampling, i.e. throwing away all samples that are not of length N.
Unfortunately, this sampler is spending a lot of time generating short words, and also many repeated words. I suspect that the distribution of word length has an exponential tail. This means that generating long words with rejection sampling will be tragically inefficient.
Sampling words with MCMC
How can we cover our sample space more effectively? This question always seems to have the same answer: Markov chain Monte Carlo. Note that the "Markov chain" in "Markov chain Monte Carlo" is a Markov chain where each state is a sampled word, and the chain is a series of samples. This is different than the Markov chain used to model word likelihood, which is a Markov chain where each state is a letter, and the chain of letters forms a word.
As such, I made a Metropolis-Hastings sampler. I'm initializing each run with an N-letter word comprised of randomly uniformly chosen letters. The proposal distribution makes a 50/50 random choice between the following two options: it can replace a single letter of the previous sample with a randomly uniformly chosen letter, or it can swap two letters. Note that both of these moves keep the word at length N.
Below, the "sample" words represent samples, taken from the sampler at regular intervals (not sequentially, in order to reduce autocorrelation). They are presented in the order that they were sampled. The "top" words represent the highest-likelihood words that the sampler finds. They are presented with highest likelihood first. I restart the sampler from random initializations many times, in case it tends to get stuck in certain areas of the distribution.
Encouragingly, the top-likelihood sampled words match the exhaustively-searched words well! Also encouragingly, the "top" words found by the sampler tend to be somewhat consistent between runs. These clues indicate that the sampler is probably not completely broken.
These words look pretty good. Once they get long, though, certain three-letter sequences of letters start to repeat (e.g. "dendendenden" in fake-German). This makes a lot of sense, since the model is only aware of the three-letter neighborhood next to each letter.
Using a sampling algorithm to search for high-likelihood points is a slight abuse of the sampler, since that's not what a sampler does. It works decently here, probably because the areas it's exploring aren't very large. To get better top-likelihood words, it would be best to use a real search algorithm.
Detecting foreign-looking words
Time for some more fun! For this tangent, I'll go through the dictionary of each language to find words that seem to belong to other languages. Some of these are obviously loan words, but other ones (e.g. for Latin) just look foreign.
In order to get a ranking, I've simply taken the ratio of the likelihood in the destination languages to the likelihood in the source language. Note that this causes longer words to be chosen, which is ranking the words by a kind of statistical significance. A more Bayesian approach would entail dividing this ratio by word length, which would more closely measure effect size.
All right, now I'm having fun!
I always feel slightly offended when I see the words that look English. "Whiskey." "Sweepstake." "Branchad." "Spathad." "Overbooking!" Is this what English looks like to non-English-speakers? To make myself feel better, I imagine a heavy-set German man with a disproportionately large mustache saying, "braunschweiger," "breitschwanz," and "weltschmerzes" very seriously.
I also enjoy that the algorithm classifies "jeans" as French. Which is true, just ask any pair of Jeans.
Detecting distinctive words
A similar application of this model is to detect distinctive-looking words, i.e. words that are likely in their source language but very unlikely in all other languages.
To model this, I'm ranking each word with the ratio between its likelihood in its source language to its likelihood in any other language. As above, since I'm not dividing by word length, this tends to select longer words.
Note that these distinctive-looking words are inherently relative to the other languages modeled. Adding another Germanic language, for example, might change the words that were reported as most distinctively German.
"Geschwindigkeitsbeschränkung," WOOOO! That means "speed limit."
Hopefully this article has illustrated the strengths and limitations of using a Markov model for word structure. It seems to work nicely for short words, but for long words it might need to be augmented by 4-grams or 5-grams, or it might need to be replaced by a more complex model.
This is a good start, but this is just the tip of the iceberg in terms of what could be done with a word structure model. Here are a couple ideas:
* Easily and automatically generate plausible wrong answers for multiple-choice spelling tests. This could be useful for language learning.
* Train the model on a fake language, and then sample words to create new words in that language.
* Generate a name for your startup (I believe that Wordoid may already be using a Markov model!).
There are also many NLP problems that could make use of word structure modeling: optical character recognition, speech recognition, swipe typing, spell checking, web search, and machine translation come to mind. I haven't found any examples of word structure modeling, perhaps because people communicate primarily using a vocabulary that can be easily stored in a dictionary. A word structure model could be useful to help to keep up with changing vocabularies, and to handle things such as names, which follow common structural patterns but are too numerous to fit into a dictionary.
I didn't do much research on the topic before writing this, so I'm probably ignorant of a lot of great work out there. If you know of something that does word structure modeling, please let me know!
A super fun paper from the Malaviya National Institute of Technology in Jaipur that uses a letter-level HMM to translate SMS abbreviations into their standard English equivalents.
Peter Norvig wrote a classic blog post on how to write an extremely simple spelling correction algorithm.
Here is a paper on word structure modeling for OCR applications, by Kolak, Byrne, and Resnik.
Andrej Karpathy wrote a blog post about using recurrent neural networks to do character-level language modeling.
Thanks to Eric Kernfeld, Victor Jakubiuk, Brandon Reiss, Kristen Tracey, Rémi Carton, Chris Norris-LeBlanc and Max Livingston for proof-reading and suggesting many improvements to this! Also to all the people who compiled and made available so many nice word lists and texts.