Salmon Run: Coding A Better Strategy for Hangman

Happy New Year! In this post, I will describe a fun little project that came about while going through the lectures in The Caltech-JPL Summer School on Big Data Analytics course on Coursera (really a dump of the lecture notes and videos from the actual week-long sessions rather than a formal course), I came across an interesting assignment - implementing the ideas outlined in this lifehacker article - A Better Strategy for Hangman. Hangman is a simple word-guessing game played by two players, one of whom (the Proposer) thinks of a word and the other (the Solver) tries to guess it. The guessing is done one letter at a time - the strategy described in the article uses letter frequencies to inform these guesses. The article goes into much more detail, and you should read it to get a full understanding of the strategy.

In real life, the game is played between two people, each of whom have their own (probably different sized) vocabularies. In our code, we use the words from our /usr/share/dict/words file (in Ubuntu, and likely available for other Unix based systems as well). So both Proposer and Solver use the same list of words as their "vocabulary". The Proposer selects a random word from the list and the Solver uses letter frequencies to make letter guesses until he guesses the word or the allowed number of guesses (11) run out. Here is the code for the game.

# Source: src/hangman/game.py
# -*- coding: utf-8 -*-
import operator
from random import random

MAX_GUESSES_PER_GAME = 11
MAX_WORDS_IN_SUBSET = 25

########################### Preprocessing ############################

non_ascii_mappings = list([
    ('', 'a'), ('', 'a'), ('', 'a'), ('', 'a'), ('', 'a'), ('', 'a'),
    ('', 'c'), ('', 'c'),
    ('', 'e'), ('', 'e'), ('', 'e'), ('', 'e'),
    ('', 'i'),
    ('', 'n'),
    ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'),
    ('', 'u'), ('', 'u'),
    ('', '\''), ('', '"')
])

def ascii_fold(s):
    for x in non_ascii_mappings:
        s = s.replace(x[0], x[1])
    return s

def preprocess(dictfile):    
    fwords = open(dictfile, 'rb')
    wset = set()
    for line in fwords:
        word = line.strip().lower()
        word = ascii_fold(word)
        if word.endswith("'s"):
            word = word[:-2]
        word = word.replace(" ", "").replace("'", "").replace("\"", "")
        wset.add(word)
    fwords.close()
    return list(wset)

############################# Proposer Side #############################
    
def select_secret_word(words):
    widx = int(random() * len(words))
    return words[widx]
    
def find_all_match_positions(secret_word, guess_char):
    positions = []
    curr_pos = 0
    while curr_pos < len(secret_word):
        curr_pos = secret_word.find(guess_char, curr_pos)
        if curr_pos < 0:
            break
        positions.append(curr_pos)
        curr_pos += 1
    return positions    
    
def update_guessed_word(guessed_word, matched_positions, guessed_char):
    for pos in matched_positions:
        guessed_word[pos] = guessed_char
    
def is_solved(guessed_word):
    chars_remaining = len(filter(lambda x: x == '_', guessed_word))
    return chars_remaining == 0
    

############################# Solver Side ###############################

letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", 
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]

def most_frequent_char(words, previously_guessed):
    cmap = {x:0 for x in letters}
    for word in words:
        cset = set()
        for c in word:
            if c not in previously_guessed:
                cset.add(c)
        for c in cset:
            cmap[c] += 1
    return sorted(map(lambda x: (x, cmap[x]), cmap.keys()), 
                 key=operator.itemgetter(1), reverse=True)[0][0]
    
def best_guess(words, word_len, bad_guesses, good_guesses, guessed_word):
    temp_words = filter(lambda x: len(x) == word_len, words)
    if len(bad_guesses) > 0:
        for bad_guess in bad_guesses:
            temp_words = filter(lambda x: x.find(bad_guess) == -1, temp_words)
    if len(good_guesses) > 0:
        for good_guess in good_guesses:
            temp_words = filter(lambda x: x.find(good_guess) > -1, temp_words)
    previously_guessed = set(filter(lambda x: x != '_', guessed_word))
    return temp_words, most_frequent_char(temp_words, previously_guessed)
    
def init_guess(wordlen):
    initial_guess = []
    for i in range(wordlen):
        initial_guess.append("_")
    return initial_guess

def match_words_against_template(words, guessed_word):
    if len(words) > MAX_WORDS_IN_SUBSET:
        return words
    matched_words = []
    for word in words:
        word_chars = [c for c in word]
        merged = zip(guessed_word, word_chars)
        diff = len(filter(lambda x: x[0] != x[1], 
                          filter(lambda x: x[0] != '_', merged)))
        if diff == 0:
            matched_words.append(word)
        if len(matched_words) > 1:
            break
    return matched_words

def replace_guessed_word(guessed_word, matched_word):
    matched_chars = [c for c in matched_word]
    for i in range(len(matched_chars)):
        guessed_word[i] = matched_chars[i]
    
################################# Game ###################################
    
def single_round(words, debug=False):
    solver_wins = False
    secret_word = select_secret_word(words)
    if debug:
        print "secret word:", secret_word
    word_len = len(secret_word)
    bad_guesses = set()
    good_guesses = set()
    guessed_word = init_guess(word_len)
    for num_guesses in range(MAX_GUESSES_PER_GAME):
        filtered_words, guess_char = best_guess(words, word_len, bad_guesses, 
                                                good_guesses, guessed_word)
        if debug:
            print "guessed char:", guess_char
        matched_positions = find_all_match_positions(secret_word, guess_char)
        if len(matched_positions) == 0:
            bad_guesses.add(guess_char)
        else:
            good_guesses.add(guess_char)
        update_guessed_word(guessed_word, matched_positions, guess_char)
        matched_words = match_words_against_template(filtered_words, guessed_word)
        if len(matched_words) == 1:
            replace_guessed_word(guessed_word, matched_words[0])
        if debug:
            print "#", num_guesses, "guess:", " ".join(guessed_word)
        if is_solved(guessed_word):
            solver_wins = True
            break
    return len(secret_word), solver_wins, num_guesses
    
def multiple_rounds(words, num_games, report_file):
    fdata = open(report_file, 'wb')
    for i in range(num_games):
        word_len, solver_wins, num_guesses = single_round(words, False)
        fdata.write("%d,%d,%d\n" % (word_len, 1 if solver_wins else 0, num_guesses))
    fdata.close()

################################# Main ###################################

words = preprocess("/usr/share/dict/words")

#single_round(words, True)

multiple_rounds(words, 10000, "hangman.csv")

The /usr/share/dict/words file has 99,171 English words, and contains several letters with diacritics for words that are adopted from other languages. We intially normalize these words so they are all in lowercase, and replace letters with diacritics with their corresponding ASCII letters from the range 'a' through 'z'. We also replace words ending with 's with their base forms, and remove duplicates. At the end of this, we are left with 72,167 words.

At the beginning of the game, the solver knows the number of letters in the word, so in order to maximize the chances of getting a letter, it makes sense to guess the most frequent letter for that size of word in the dictionary. If we get the letter, we are obviously one step closer to a win. However, even if we don't get a letter, it still gives us some information which we can use for subsequent guesses - ie, our subset of word candidates do not contain this letter.

This approach of consistently choosing the most frequent letter (minus any letters known to not exist in the word) works upto a point. Often, it is possible to use the word template generated from previous guesses to match against the word corpus. If the template matches to a single word, we can just guess that word and win. I have implemented this strategy in the match_words_against_template() function in the code above.

An optimization for cases where the template matches multiple words would be to select a letter that resolves the ambiguity (such as a guess of "s" for the pattern "_ e t t l e" where the word could be either "settle" or "kettle") rather than the most frequent character.

Here are traces from a few games. You can run a single game by calling the single_round() function with debug set to True. The word is selected randomly from the word list, but it is quite simple to make the change to take in a specific word instead if you want to do that.

secret word: undergo
guessed char: e
# 0 guess: _ _ _ e _ _ _
guessed char: s
# 1 guess: _ _ _ e _ _ _
guessed char: r
# 2 guess: _ _ _ e r _ _
guessed char: a
# 3 guess: _ _ _ e r _ _
guessed char: i
# 4 guess: _ _ _ e r _ _
guessed char: o
# 5 guess: _ _ _ e r _ o
guessed char: d
# 6 guess: _ _ d e r _ o
guessed char: l
# 7 guess: _ _ d e r _ o
guessed char: n
# 8 guess: _ n d e r _ o
guessed char: u
# 9 guess: u n d e r _ o
guessed char: b
# 10 guess: u n d e r g o

secret word: steakhouses
guessed char: e
# 0 guess: _ _ e _ _ _ _ _ _ e _
guessed char: i
# 1 guess: _ _ e _ _ _ _ _ _ e _
guessed char: r
# 2 guess: _ _ e _ _ _ _ _ _ e _
guessed char: s
# 3 guess: s _ e _ _ _ _ _ s e s
guessed char: a
# 4 guess: s _ e a _ _ _ _ s e s
guessed char: l
# 5 guess: s _ e a _ _ _ _ s e s
guessed char: t
# 6 guess: s t e a _ _ _ _ s e s
guessed char: n
# 7 guess: s t e a _ _ _ _ s e s
guessed char: o
# 8 guess: s t e a k h o u s e s

secret word: coffers
guessed char: e
# 0 guess: _ _ _ _ e _ _
guessed char: s
# 1 guess: _ _ _ _ e _ s
guessed char: r
# 2 guess: _ _ _ _ e r s
guessed char: a
# 3 guess: _ _ _ _ e r s
guessed char: i
# 4 guess: _ _ _ _ e r s
guessed char: o
# 5 guess: _ o _ _ e r s
guessed char: t
# 6 guess: _ o _ _ e r s
guessed char: c
# 7 guess: c o _ _ e r s
guessed char: u
# 8 guess: c o _ _ e r s
guessed char: k
# 9 guess: c o _ _ e r s
guessed char: p
# 10 guess: c o _ _ e r s

secret word: soho
guessed char: a
# 0 guess: _ _ _ _
guessed char: e
# 1 guess: _ _ _ _
guessed char: o
# 2 guess: _ o _ o
guessed char: s
# 3 guess: s o _ o
guessed char: t
# 4 guess: s o _ o
guessed char: l
# 5 guess: s o _ o
guessed char: b
# 6 guess: s o _ o
guessed char: n
# 7 guess: s o _ o
guessed char: p
# 8 guess: s o _ o
guessed char: d
# 9 guess: s o _ o
guessed char: w
# 10 guess: s o _ o

Now that the game had been reduced to closed-world (ie vocabulary drawn from the same list) interactions between two entities which had computers for minds, I wanted to see how successful the Solver was against the Proposer. For this, I ran a small experiment and let the code run for 10,000 games and recorded their results. The data is generated by calling the multiple_rounds() function with the number of runs desired and the output filename.

Here is some code that reads the resulting file and generates the necessary charts for visualization.

# Source: src/hangman/gamestats.py
# -*- coding: utf-8 -*-
from __future__ import division
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("hangman.csv", header=None, 
                 names=["WORD_LEN", "SOLVER_WINS", "NUM_GUESSES"])

# wins vs losses
nloss = df[df["SOLVER_WINS"] == 0].count()["SOLVER_WINS"]
nwins = df[df["SOLVER_WINS"] == 1].count()["SOLVER_WINS"]
print "probability of winning=", nwins / (nwins + nloss)
print "probability of losing=", nloss / (nwins + nloss)

## probability of winning and losing for different word lengths
df2 = df.drop("NUM_GUESSES", 1)
df2_wins = df2[df2["SOLVER_WINS"] == 1].groupby("WORD_LEN").count().reset_index()
df2_losses = df2[df2["SOLVER_WINS"] == 0].groupby("WORD_LEN").count().reset_index()
df2_losses.rename(columns={"SOLVER_WINS": "SOLVER_LOSES"}, inplace=True)
df2_merged = df2_wins.merge(df2_losses, how="inner", on="WORD_LEN")
df2_merged.plot(kind="bar", stacked=True, x="WORD_LEN", 
                title="Win/Loss Counts by Word Length") 
plt.show()
df2_merged["NUM_GAMES"] = df2_merged["SOLVER_WINS"] + df2_merged["SOLVER_LOSES"]
df2_merged["SOLVER_WINS"] = df2_merged["SOLVER_WINS"] / df2_merged["NUM_GAMES"]
df2_merged["SOLVER_LOSES"] = df2_merged["SOLVER_LOSES"] / df2_merged["NUM_GAMES"]
df2_merged.drop("NUM_GAMES", axis=1, inplace=True)
df2_merged.plot(kind="bar", stacked=True, x="WORD_LEN", 
                title="Win/Loss Probabilities by Word Length") 
plt.show()

# how number of guesses to win varies with word length (winning games only)
df3 = df[df["SOLVER_WINS"] == 1]
df3.drop("SOLVER_WINS", 1)
df3_grouped = df3.drop("SOLVER_WINS", 1).groupby("WORD_LEN").mean().reset_index()
df3_grouped.plot(kind="bar", x="WORD_LEN", 
                 title="Avg Guesses for Different Word Lengths")
plt.show()

The empirical probability of winning the game, based on this data, is about 0.66. As expected, the distribution of word lengths thought up by the Proposer follows a roughly normal distribution. However, shorter words have a lower probability of getting solved than longer words. This is more easily seen in the chart showing win/loss probabilities. So treating win probability as a proxy for difficulty, winning at Hangman is actually harder for shorter words than for longer words.

I also plotted the average number of guesses the solver made to guess different sized words thought up by the Proposer. This time the distribution is more uniform, although it falls off to the right for long words.

This is all I have for today. I haven't been posting these past few weeks because I was out of town on vacation. Hopefully I should get back to a more regular schedule in the New Year.

Salmon Run

Sunday, January 11, 2015

Coding A Better Strategy for Hangman

No comments:

Post a Comment