In this article, I will present a simple algorithm, based on Markov Chains, to generate random sentences. At the end, I provide the full source code and you will be able to test it directly in your browser.
Let's be honest, the sentences won't make a lot of sense in most cases, but it's entertaining. This kind of algorithm have been used with success in IRC chat rooms or on Twitter. There are many bots that take as input Trump speeches and generate tweets that he could have said...
Here's the definition of a Markov Chain according to the Oxford Dictionary:
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
The two main properties are:
- From a given state, a probability is associated to each possible event;
- The distribution of probability only depends on the current state, the past events are not taken into consideration (memoryless).
Markov Chains can be used to model a lot of systems. This is helpful to compute statistical properties on such systems. However, for our application, we only need to walk through a Markov Chain.
From the initial state, we know the probability distribution to reach any other state. Let the probability to reach state , with from to . To select the next state, we can pick a random number between and and apply the discrete Inverse Transform Method to find the corresponding state: if then the selected state is .
Markov Chain Chatbot
First, from a set of meaningful sentences, we construct a Markov Chain. This step is only done once. Depending on the size of the input dataset, this can be quite slow and could consume a lot of memory.
Once we have a Markov Chain, we will simply simulate the markov chain to produce a sequence of states. From this sequence, we deduce the generated sentence.
Building the Markov Chain
The first step is to build the Markov Chain from a list of sentences. To illustrate this, let's use these 3 sentences:
- This is an example of a Markov Chain.
- To illustrate this, we need an example.
- We saw a video of a cute cat.
Here is the Markov Chain that corresponds to this example:
As you can see, each state correspond to two consecutive words and the outgoing arrows indicate the next possible states. For example, "of a" can be followed by the state "a Markov" and "a cute". The empty state represents the end of a sentence.
To represent this chain, we store for each consecutive pair of words the list of next possible words associated with the probability to pick that word. Actually, instead of storing the probability, I've decided to store for each state the number of times I've encountered that state and the number of times I've encountered each successive word (the probability is the ratio between the two).
Note that we use couple of words in each state. However, it's also possible to use triplets. This gives better sentences but it also requires a bigger dataset to avoid to fall into paths with too few alternatives (sparse graph).
Generating a sentence
To generate a sentence, we randomly pick a state and then we choose a random successor in the chain until the sentence is long enough or if we reach a terminal state.
An example of a random sentence for this Markov Chain is the following:
We need an example of a cute cat.
Of course, we would need a bigger Markov Chain to avoid reusing long parts of the original sentences.
The only difficult part here is to select a random successor while taking into consideration the probability to pick it. The algorithm is the following: I pick a random integer
ran between 0 and the number of times I've encountered this state. Then I iterate over all the successors and subtract the number of occurrences of that successor while
ran is greater than it. See the following diagram and code:
ran = random.randint(0, self._total - 1)
for key, value in self._successors.items():
if ran < value:
ran -= value
Show Me The Code
Here's the full code to generate random sentences:
In order to improve the quality of the generated sentences, we should increase the amount of input data. Here I've only used ~200 quotes from Douglas Adams, Aristotle, Einstein, JFK, Shakespeare and Darwin (from Wikiquote).
When used in chat rooms (e.g. IRC), it's a good idea to feed the bot with the conversation logs.