Pride & Prejudice & Word Embedding Distance

A "quickie" word2vec/t-SNE vis by Lynn Cherny (@arnicas)
An experiment: Train a word2vec model on Jane Austen's books, then replace the nouns in P&P with the nearest word in that model. The graph shows a 2D t-SNE distance plot of the nouns in this book, original and replacement. Mouse over the blue words!

I downloaded the texts for all Jane Austen novels from Project Gutenberg and reduced the files to just the main book text (no table of contents, etc.).

I trained a word2vec model on the full text using gensim.
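As a rough sketch of the prep work: gensim's Word2Vec takes an iterable of tokenized sentences, so the raw text has to be split up first. The splitter below is a naive stand-in (the function name `to_sentences` and the regexes are my assumptions for illustration, not the code used in this project), and the training call is left as a comment because the exact parameters are not given here.

```python
import re

def to_sentences(text):
    """Split raw text into lowercase token lists, one list per sentence."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    # Abbreviations like "Mr." will be split incorrectly by this rule.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for raw in raw_sentences:
        tokens = re.findall(r"[a-z']+", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

sample = ("It is a truth universally acknowledged. "
          "However little known the feelings may be!")
sentences = to_sentences(sample)

# With gensim 4.x installed, training would look roughly like this
# (older gensim releases call the dimension parameter `size`):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, vector_size=200, min_count=1)
```

A real run would feed in every sentence from all the novels rather than a two-sentence sample.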

Then I replaced all nouns inside Pride and Prejudice with their closest match according to the model's similarity function. This means closest based on use of words in the whole Austen oeuvre!
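To make "closest match" concrete, here is a minimal sketch of a top-1 nearest-neighbor lookup by cosine similarity. The vectors are made-up 3-dimensional toys and `nearest_word` is a hypothetical helper; with a trained gensim model, the `most_similar` method on the model's word vectors would play this role.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_word(word, vectors):
    """Return the vocabulary word most similar to `word`, excluding itself."""
    target = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Toy vectors for illustration only; real ones come from the trained model.
toy_vectors = {
    "sister": [0.9, 0.1, 0.0],
    "daughter": [0.8, 0.2, 0.1],
    "carriage": [0.0, 0.9, 0.4],
    "horses": [0.1, 0.8, 0.5],
}

replacement = nearest_word("sister", toy_vectors)
```

Sticking to the single top match means the replacement is not always a sensible substitute in context, only the word used most similarly across the corpus.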

I used a Python t-SNE library to reduce the 200 feature dimensions for each word to 2 dimensions and plotted them in matplotlib. I saved out the x/y coordinates for each word in the book, so that I can show those words on the graph as you mouse over the replaced (blue) words.
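Saving the coordinates could look something like the sketch below. It assumes the 2D points have already been computed by t-SNE; the column names (`word`, `x`, `y`) and the sample values are invented for illustration, not taken from the actual csv file.

```python
import csv
import io

def write_coords(rows, handle):
    """Write (word, x, y) tuples as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["word", "x", "y"])
    for word, x, y in rows:
        writer.writerow([word, f"{x:.3f}", f"{y:.3f}"])

# Placeholder coordinates; real values would come from the t-SNE output.
coords = [("sister", 12.5, -3.25), ("daughter", 11.0, -2.75)]

buffer = io.StringIO()  # swap for open("coords.csv", "w", newline="") on disk
write_coords(coords, buffer)
csv_text = buffer.getvalue()
```

Keeping one row per word makes it easy for the front end to look up a position by word on mouseover.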

The UI uses the novel text preprocessed in Python (where I wrote the 'span' tag around each noun), a csv file for the word locations, and a PNG with dots for all word locations with a transparent background. The D3 SVG works on top of that (this is the coolest hack in the project, IMO).
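The span-wrapping step might be sketched as follows. It assumes the text has already been run through a part-of-speech tagger that emits Penn Treebank style tags (noun tags starting with `NN`); the tagger, the `replaced` class name, and the `data-orig` attribute are all my assumptions rather than the markup this page actually uses.

```python
import html

def wrap_nouns(tagged_tokens, replacements):
    """Swap each noun for its replacement and wrap it in a span tag."""
    parts = []
    for token, tag in tagged_tokens:
        if tag.startswith("NN") and token in replacements:
            new_word = replacements[token]
            parts.append(
                '<span class="replaced" data-orig="{}">{}</span>'.format(
                    html.escape(token, quote=True), html.escape(new_word)
                )
            )
        else:
            parts.append(html.escape(token))
    # Joining on spaces is a simplification; it ignores punctuation spacing.
    return " ".join(parts)

tagged = [("my", "PRP$"), ("dear", "JJ"), ("sister", "NN")]
markup = wrap_nouns(tagged, {"sister": "daughter"})
```

Storing the original word on the span gives the mouseover handler both words of the pair without a second lookup.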

No, it's not a good read. I stuck to the "top match" in the model, deciding this is about the themes of Jane's oeuvre, not a good novel mashup. Also, the vis illustrates the occasionally indirect relationship between closeness in word2vec and closeness in the t-SNE 2D plot, although frequently (with this model) the word pairs appear near each other on the graph.

For more on how/why I got here and my sillier mistakes, visit my blog post and the git repo.