Recently, I saw a Kalshi market trading on which words Fed Chair Jerome Powell would say in his next press conference in January. Intrigued by the thought of retiring my parents early, I began thinking of ways to calculate the probabilities of words being said.
Dirichlet Distribution
At first, I thought of fitting a Dirichlet distribution (commonly referred to as the "generalization" of the Beta distribution) over the individual words that appear throughout the press conference transcripts. Modelling each word's appearance via its underlying probability parameter lets us calculate the probability that a specific word appears somewhere over the length of a press conference (~6898 words).
So, starting with a uniform prior across all the distinct words in the historical Fed transcripts, I updated each word's underlying probability accordingly (the words Powell says are assumed to be drawn from a Multinomial distribution). With the updated posterior, the probability that a word appears at least once follows from the complement: one minus the probability that it never appears across the whole transcript.
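As a sketch of this update (the function name, the pseudocount `alpha`, and the toy inputs are my own assumptions, not from the original analysis), the posterior mean under a symmetric Dirichlet prior and the per-transcript appearance probability look like:

```python
from collections import Counter

def word_appearance_probs(transcripts, alpha=1.0, n_words=6898):
    """Dirichlet-multinomial sketch: posterior mean word frequencies,
    then P(word appears at least once in an n_words-long conference)."""
    counts = Counter(w for t in transcripts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    probs = {}
    for word, c in counts.items():
        # Posterior mean of the word's frequency under a symmetric Dirichlet(alpha) prior
        p = (c + alpha) / (total + alpha * vocab)
        # Complement trick: P(at least one occurrence in n_words independent draws)
        probs[word] = 1 - (1 - p) ** n_words
    return probs
```

Note that with counts this raw, frequent filler words soak up almost all of the probability mass, which pushes the appearance probability of any multi-word target phrase toward zero.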
Results
Here are the results based on the Dirichlet posterior probabilities, ordered from the largest to the smallest absolute difference between the computed probabilities and the market probabilities.
| Phrase | Market Probability | Computed Probability | Absolute Difference |
|---|---|---|---|
| Good Afternoon | 0.98 | 0.0016 | 0.98 |
| Balance of Risk | 0.95 | 0.00013 | 0.95 |
| Balance Sheet | 0.85 | 0.0043 | 0.85 |
| Layoff | 0.83 | 0.063 | 0.77 |
| Artificial Intelligence | 0.67 | 3.9×10⁻⁷ | 0.67 |
| Goods Inflation | 0.61 | 0.016 | 0.59 |
| Unchanged | 0.92 | 0.378 | 0.54 |
| Median | 0.43 | 0.963 | 0.53 |
| Shut Down | 0.49 | 0.00025 | 0.49 |
| Recession | 0.24 | 0.667 | 0.49 |
Obviously, these results are not fantastic. The model ignores many important factors, such as the dependencies between word appearances and real-world context. Moreover, the posterior probabilities were dominated by common words like "the", "a", and "and", which made the probability of generating a target phrase extremely small.
Dependency between words
One shortcoming of using a Dirichlet distribution over individual words is that word generation is highly context-dependent. For example, the word that completes the sentence prefix "With the increase in inflation, the attitude towards the economy was ______" is most likely an adjective, and one with a negative connotation (something like "pessimistic"). This means that, given the context seen so far, some words are far more likely to appear than others.
Having taken an NLP course this semester, this inspired me to consider using an LLM. At an LLM's core is a conditional probability distribution P(w_t | w_1, …, w_{t-1}), which gives the likelihood of the next token w_t, where w_1, …, w_{t-1} is the given context. Armed with this, my strategy was as follows:
- First take an open source model (Qwen 30B Instruct model). This is our prior probability, akin to taking a random English speaker from the planet.
- Finetune the open source model on FED press conference transcripts. This would take our random English speaker and make them sound like Jerome Powell talking.
- Run a BFS algorithm to find the probability that the model would say a specific phrase across the length of a transcript.
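The core quantity the search needs is the model's probability of emitting a target phrase at a given position, obtained by chaining next-token conditionals. Here is a generic sketch of that chain rule, not the actual implementation; `next_token_logprobs` is a hypothetical stand-in for a real model call that returns log-probabilities over the vocabulary:

```python
import math

def phrase_logprob(next_token_logprobs, context, phrase_tokens):
    """Chain rule: log P(phrase | context) is the sum of per-token
    conditional log-probabilities, extending the context token by token."""
    ctx = list(context)
    total = 0.0
    for tok in phrase_tokens:
        logprobs = next_token_logprobs(ctx)  # hypothetical: token -> log P(token | ctx)
        total += logprobs.get(tok, float("-inf"))  # unseen token => probability 0
        ctx.append(tok)
    return total
```

Working in log space avoids underflow, since per-token probabilities multiplied over thousands of positions quickly fall below floating-point range.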
The algorithm is as follows:
The tree search expands as follows: