Playing Codenames in Rust with word vectors

    In a previous post I implemented the game of Codenames in Rust, allowing a human player to play against a computer agent that picks words at random. Now let's implement a smarter computer agent, using word vectors.

    Word vectors (or word embeddings) are a way of representing words as high-dimensional vectors of numbers. Each word gets a long list of numbers associated with it, and those numbers aren't random: words that are related usually end up close to each other in the vector space. Training these vectors from raw text takes a long time, but there are many pretrained embeddings on the internet you can just use, along with libraries that help you find the words closest to a target word.
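
    To make "closeness" concrete, here is a minimal sketch of cosine similarity, the usual measure of relatedness between two vectors; the three-dimensional vectors below are made up for illustration, while real embeddings have hundreds of dimensions:

    // Cosine similarity: 1.0 means the vectors point the same way,
    // values near 0.0 mean the words are unrelated.
    fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (norm_a * norm_b)
    }

    fn main() {
        // Toy vectors: "cat" and "kitten" deliberately point in similar
        // directions, "car" in a different one.
        let cat = [0.8, 0.2, 0.1];
        let kitten = [0.7, 0.3, 0.2];
        let car = [0.1, 0.9, 0.4];
        println!("cat/kitten: {}", cosine_similarity(&cat, &kitten));
        println!("cat/car: {}", cosine_similarity(&cat, &car));
    }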

    Word vectors in Rust

    Machine learning has embraced the Python programming language, so most ML tools, libraries and frameworks are in Python, but some are starting to show up in Rust as well. Rust's focus on performance is attractive here, because ML is usually computationally intensive.

    There is one library in Rust that does exactly what we want: finalfusion. It provides a set of pretrained word vectors (quite fancy ones, with subword embeddings) and a library to load them and run efficient nearest-neighbor queries.

    The pretrained embeddings come in a 5 GB file, because they pretty much literally have everything, including the kitchen sink, so the download will take a while. Let's start using the library, after adding it to our Cargo.toml with something like this (the version is illustrative; check crates.io for the current one):
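
    [dependencies]
    # The version is illustrative; check crates.io for the current release.
    finalfusion = "0.17"

    Now we can get the nearest neighboring words for "cat":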

    use std::io::BufReader;
    use std::fs::File;
    use finalfusion::prelude::*;
    use finalfusion::similarity::WordSimilarity;
    
    fn main() {
        let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());
    
        // Read the embeddings.
        let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
            Embeddings::read_embeddings(&mut reader)
                .unwrap();
        println!("{:?}", embeddings.word_similarity("cat", 5).unwrap());
    }

    After running it we get the following output:

    [WordSimilarityResult { similarity: NotNan(0.81729543), word: "cats" },
    WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
    WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
    WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" },
    WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }]

    Sidenote:

    Loading a 5 GB file from disk will take some time. If you have enough RAM, the file should be in the OS's file cache after the first run, so it will load faster. Also, compiling the program with --release (turning on optimizations and dropping debug information) will speed it up significantly:
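
    cargo run --release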

    One of the rules of Codenames is that hints can't be any of the words on the board or direct derivatives of them (such as plural forms). The finalfusion library has support for masking words out of the similarity search, but to get the plural forms I resorted to another library called inflector, which has a to_plural method that does exactly what it says on the tin.

    use std::io::BufReader;
    use std::fs::File;
    use finalfusion::prelude::*;
    use finalfusion::similarity::EmbeddingSimilarity;
    use std::collections::HashSet;
    use inflector::string::pluralize::to_plural;
    
    fn main() {
        let mut reader = BufReader::new(File::open("resources/ff.fifu").unwrap());
    
        // Read the embeddings.
        let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
            Embeddings::read_embeddings(&mut reader)
                .unwrap();
        let word = "cat";
        let embed = embeddings.embedding(word).unwrap();
        let mut skip: HashSet<&str> = HashSet::new();
        skip.insert(&word);
        let pluralized = to_plural(word);
        skip.insert(&pluralized);
        let words = embeddings.embedding_similarity_masked(embed.view(), 5, &skip).unwrap();
        println!("{:?}", words);
    }

    This is a slightly lower-level interface: we first obtain the embedding of the word, build the set of words to skip, and then search for the words most similar to the vector we pass in. The output is:

    [WordSimilarityResult { similarity: NotNan(0.812261), word: "kitten" },
    WordSimilarityResult { similarity: NotNan(0.7768222), word: "feline" },
    WordSimilarityResult { similarity: NotNan(0.7760824), word: "kitty" }, 
    WordSimilarityResult { similarity: NotNan(0.7667354), word: "dog" }, 
    WordSimilarityResult { similarity: NotNan(0.7471396), word: "kittens" }]

    It's better. Ideally, we could also somehow remove all composite words based on words from the board, but that's a bit more complicated.
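
    A crude first cut, shown here only as a sketch and not used in what follows, would be to reject any candidate hint that contains a board word as a substring:

    // Rejects compounds like "catfish" when "cat" is on the board. This
    // over-filters ("category" would also be rejected for "cat"), which is
    // part of why a proper solution is more complicated.
    fn is_composite(candidate: &str, board_words: &[&str]) -> bool {
        board_words.iter().any(|&w| candidate.contains(w))
    }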

    Since this is a common use case, it can be wrapped in a function (Embedding is a type alias for the embeddings type, defined in the next section):

    fn find_similar_words<'a>(word: &str, embeddings: &'a Embedding, limit: usize) -> Vec<WordSimilarityResult<'a>> {
        let embed = embeddings.embedding(&word).unwrap();
        let mut skip: HashSet<&str> = HashSet::new();
        skip.insert(&word);
        let pluralized = to_plural(&word);
        skip.insert(&pluralized);
        embeddings.embedding_similarity_masked(embed.view(), limit, &skip).unwrap()
    }
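
    With the embeddings loaded as before, the earlier example then reduces to:

    let words = find_similar_words("cat", &embeddings, 5);
    println!("{:?}", words);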
    

    Implementing the first spymaster

    Let's implement our first spymaster that uses word vectors! First, let's define the type alias for the embedding type that we already used in find_similar_words, because the full name is long and we'll need it many times.

    type Embedding = Embeddings<VocabWrap, StorageViewWrap>;

    Our spymaster will have two fields: the embeddings and the color of the player. The Spymaster trait requires only the give_hint function to be implemented.

    pub struct BestWordVectorSpymaster<'a> {
        pub embeddings: &'a Embedding,
        pub color: Color,
    }
    
    impl Spymaster for BestWordVectorSpymaster<'_> {
        fn give_hint(&mut self, map: &Map) -> Hint {
            let enemy_color = opposite_player(self.color);
            let remaining_words = map.remaining_words_of_color(enemy_color);
            let mut best_sim = NotNan::new(-1f32).unwrap();
            let mut best_word = "";
            for word in remaining_words {
                let words = find_similar_words(&word, self.embeddings, 1);
                let hint = words.get(0).unwrap();
                if hint.similarity > best_sim {
                    best_sim = hint.similarity;
                    best_word = hint.word;
                }
            }
            return Hint{count: 1, word: best_word.to_string()};
        }
    }

    This spymaster uses a simple greedy algorithm: it takes each word that has to be guessed, finds the most similar word to it, and keeps track of the highest similarity seen. It returns as the hint the word that had the highest similarity to any of the words belonging to the opposite team.

    How does it do? I drew some random boards with a fixed seed and ran this spymaster on them. If you hover over a hint, it shows the words it's based on.

    We have a problem: the word embeddings we use are a bit too noisy. Word embeddings are usually trained on large text corpora crawled from the internet, such as Wikipedia, the Common Crawl project, or the CoNLL 2017 dataset (the one used above). The problem with these large corpora is that they are not perfectly cleaned; for example, "-pound" is considered a word. Let's try the Common Crawl (CC) embeddings:

    Unfortunately, the CC embeddings give even worse results.

    Cleaning up the embeddings

    My fix for this was to write a script that prunes the embeddings down to only "real" words (ones made up entirely of lowercase letters). First, I had to collect the set of all these words.

        // Count the vocabulary and keep only the words made up entirely
        // of lowercase letters.
        let words = embeddings.vocab().words();
        let mut total = 0;
        let mut lowercase = 0;
        let mut select = HashSet::new();
        for w in words {
            total += 1;
            if w.chars().all(char::is_lowercase) {
                lowercase += 1;
                select.insert(w.clone());
            }
        }
        println!("{} {}", total, lowercase);

    Then I had to get the embeddings and the norms for each of these words:

        let mut selected_vocab = Vec::new();
        let mut selected_storage = Array2::zeros((select.len(), embeddings.dims()));
        let mut selected_norms = Array1::zeros((select.len(),));
    
        for (idx, word) in select.into_iter().enumerate() {
            match embeddings.embedding_with_norm(&word) {
                Some(embed_with_norm) => {
                    selected_storage
                        .row_mut(idx)
                        .assign(&embed_with_norm.embedding);
                    selected_norms[idx] = embed_with_norm.norm;
                }
                None => panic!("Cannot get embedding for: {}", word),
            }
    
            selected_vocab.push(word);
        }
    

    And finally write the now much smaller embedding file:

        let new_embs = Embeddings::new(
            None,
            SimpleVocab::new(selected_vocab),
            NdArray::from(selected_storage),
            NdNorms::new(selected_norms),
        );
        let f = File::create("resources/smaller_embs.fifu").unwrap();
        // Buffer the writes and flush the smaller embeddings to disk.
        let mut writer = BufWriter::new(f);
        new_embs.write_embeddings(&mut writer).unwrap();

    On the embeddings trained on the CoNLL dataset the reduction is about 6x: from 1,336,558 words down to 233,453.

    Let's give our Spymaster another shot with these embeddings, simply by changing the file from which we load the embeddings:
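
    let mut reader = BufReader::new(File::open("resources/smaller_embs.fifu").unwrap());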

    "superheroic" and "marched" look kinda against the rules, being too close to one of the words on the board, but "movie" is a really good one word hint.

    Implementing a field operative

    Now let's implement the other half of the AI team: the field operative, which has to guess which words on the board belong to the enemy, based on the hints the spymaster gives.

    pub struct SimpleWordVectorFieldOperative<'a> {
        embeddings: &'a Embedding,
    }
    
    impl FieldOperative for SimpleWordVectorFieldOperative<'_> {
        fn choose_words<'a>(&mut self, hint: &Hint, words: &[&'a str]) -> Vec<&'a str> {
            let hint_emb = self.embeddings.embedding(&hint.word).unwrap();
            let hint_embedding = hint_emb.view();
            let mut similarities = vec![];
            for w in words {
                let new_embed = self.embeddings.embedding(&w).unwrap();
                let similarity: f32 = new_embed.view().dot(&hint_embedding);
                similarities.push((w, similarity));
            }
            // `sorted_by` is provided by the itertools crate.
            similarities.iter()
                .sorted_by(|(_, e), (_, e2)| e.partial_cmp(e2).unwrap())
                .rev().take(hint.count).map(|x| *x.0).collect()
        }
    }

    The field operative is even simpler: we go through all the words still on the board, compute a similarity score between each of them and the hint, sort the words by similarity, and take the top "count" ones.

    Let's see how it does on the same maps as before. If you hover over the field operative, you can see the guesses it makes.

    It looks like the two AI players are a good fit for each other: the field operative always guesses the word that the spymaster based the hint on. Now, let's try to make the spymaster give hints for multiple words.

    Improving the spymaster

    My first idea is to generate the top n closest words for each word of a color and see if any candidate is suggested for more than one of them. n will be a tunable parameter: lower values will give hints that are closer to the words but match fewer of them; higher values will match more words, at the cost of potentially worse hints. The new spymaster gets an extra field for n, next to the embeddings and the color:

    pub struct DoubleHintVectorSpymaster<'a> {
        pub embeddings: &'a Embedding,
        pub color: Color,
        pub n: usize,
    }

    impl Spymaster for DoubleHintVectorSpymaster<'_> {
        fn give_hint(&mut self, map: &Map) -> Hint {
            let enemy_color = opposite_player(self.color);
            let remaining_words = map.remaining_words_of_color(enemy_color);

            let mut sim_words = HashMap::new();
            for word in remaining_words {
                let words = find_similar_words(&word, self.embeddings, self.n);
                for w in words {
                    let count = sim_words.entry(w.word).or_insert(0);
                    *count += 1;
                }
            }
            let best_word = sim_words.iter()
                    .max_by_key(|(_, &count)| count).unwrap();

            return Hint{count: *best_word.1 as usize, word: best_word.0.to_string()};
        }
    }

    We store how often each suggested word appears across all the suggestions in a hashmap, with the key being the word and the value being the occurrence count. After all the words are added, we simply read out the maximum by occurrence count. In Rust, HashMap iteration order is nondeterministic, so if multiple words occur the same number of times, which one is returned is not guaranteed.
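
    If reproducible hints matter (say, for testing), one option, just a sketch rather than what the code above does, is to break ties explicitly by comparing on a (count, word) pair, so equal counts fall back to alphabetical order:

    // Ties on the count fall back to alphabetical order on the word,
    // making the chosen hint deterministic across runs.
    let best_word = sim_words.iter()
            .max_by_key(|&(word, &count)| (count, word)).unwrap();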

    Around n=40, we start seeing hints for multiple words. At n=80 we get two-word hints on all three maps. At n=400 we get triple hints on two of the maps. But starting with n=80, the field operative no longer guesses all of the source words correctly. Sometimes it's because the association between the words is weird, but more often it's because the spymaster only takes into account the words it should hint towards and ignores the words from which the hint should stay dissimilar.

    There are several ways to address this issue, from simply checking whether the hint word is too close to a bad word and rejecting it, to more complex approaches such as finding the max-margin plane that separates the good words from the bad ones and looking for hint words near it. But this post is already long, so that will come in part 3.
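
    As a taste of the simplest of those approaches, the rejection check could look something like this sketch, where threshold is a hypothetical parameter that would need tuning and the dot product serves as the similarity, as in the field operative above:

    // Sketch only: reject a hint that is too similar to any word we must
    // avoid. `threshold` is a hypothetical tuning parameter.
    fn hint_too_close(embeddings: &Embedding, hint: &str, avoid: &[&str], threshold: f32) -> bool {
        let hint_emb = embeddings.embedding(hint).unwrap();
        avoid.iter().any(|w| {
            let emb = embeddings.embedding(w).unwrap();
            emb.view().dot(&hint_emb.view()) > threshold
        })
    }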

    The whole code I have so far can be seen on GitHub.