
Word Embeddings

What Are Word Embeddings?

Word embeddings are a way of turning words into mathematical representations. Words, and language in general, are subjective entities; word embeddings turn them into numerical form so that mathematical operations can be performed on them.

Typically, this mathematical representation takes the form of a vector. By transforming words into vectors, word embeddings capture the idea that the more linguistically similar two words are (i.e. the more they are used in the same contexts), the closer they sit in the vector space. For example, when two word vectors are added together, the word whose vector lies closest to the resultant is the word most linguistically similar to the combination of the two.

The most classic example of this would be that inside a well-trained model: 

king – man + woman = queen

Word embeddings are specific to the corpus they are trained on, so the associations we discover reflect the particular language used in our corpus. The relationship demonstrated above is therefore not always reproducible in other models, depending on the texts they were trained on, though the same fundamental idea is at play.

For example, in our Virginian texts model, when we add the vectors for ‘powhatan’ and ‘daughter’ together, the word vector most similar to the result is that for ‘pocahontas’. Historically, this checks out: Pocahontas (her real name was Matoaka, though the English did not know it) was the daughter of Powhatan, paramount chief of the Powhatan confederacy. This vector addition therefore tells us that our mathematical representations of words accurately reflect their meaning in context.
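In code, such an analogy query is a single call. The sketch below assumes a Word2Vec model trained with Gensim and saved to disk; the file name 'virginia.model' is a placeholder, not our actual file.

from gensim.models import Word2Vec

# Load a previously trained model (the path is a placeholder).
model = Word2Vec.load("virginia.model")

# Which word vectors lie closest to powhatan + daughter?
# most_similar() adds the 'positive' vectors, subtracts any 'negative' ones,
# and ranks the remaining vocabulary by cosine similarity to the result.
print(model.wv.most_similar(positive=["powhatan", "daughter"], topn=5))

# The classic analogy: king - man + woman ≈ queen
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))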

Word embeddings make it easy to track the relationships between our keywords and the connotations with which they are used. Using the Gensim library and its Word2Vec implementation, we vectorized our entire corpus and trained eleven separate models, each reflecting a distinct five- to ten-year time period. By turning the language in these texts into vectors, we can measure how similar words are in context to one another, whether through Euclidean distance or cosine similarity. This lets us track how the language surrounding the groups we are analyzing changes over time, pinpoint where it changes, and surface associations that would otherwise be difficult to find. For our project, we are interested not only in how the language developed over time but also in the contexts in which it was used, and in whether the groups it describes are represented positively or negatively. This is why word embeddings were particularly useful in our analysis.
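A simplified sketch of that training step is shown below; the tokenization, period labels, and hyperparameters are illustrative assumptions rather than our exact settings.

from gensim.models import Word2Vec

# texts_by_period maps a period label to a list of tokenized sentences drawn
# from texts published in that window. How the corpus is split and tokenized
# is assumed here purely for illustration.
texts_by_period = {
    "1605-1609": [["spanish", "enemy", "force"], ["king", "james", "peace"]],
    # ... one entry per five- to ten-year slice, eleven in total
}

models = {}
for period, sentences in texts_by_period.items():
    # Placeholder hyperparameters; a real corpus would warrant a higher
    # min_count and tuned vector_size/window values.
    models[period] = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    models[period].save(f"model_{period}.model")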

Cosine Similarity

Using word vectors, we can calculate the cosine similarity of any two vectors. Cosine similarity measures how similar two vectors are based on the angle between them, i.e. their direction and orientation. It disregards magnitude, which can instead be captured by measuring the Euclidean distance between two vectors. The score ranges from -1 to 1, with values closer to 1 indicating that the words appear in more similar contexts. As for what a negative cosine similarity means, that is the subject of ongoing debate in computational linguistics and beyond the scope of our project, but a large negative score does not necessarily mean two words are unrelated.

Essentially, the smaller the angle between the vectors, the more similar the two words are within the set of texts being analyzed. The similarity between two keywords is assessed not by how frequently each word appears, but by the contexts in which it is used. This way, we were able to see how strongly associated certain keywords were with each other and analyze how the language developed.
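For concreteness, the score can be computed directly from two word vectors or requested from a trained Gensim model; the snippet below sketches both, with the model path and word pair as placeholders.

import numpy as np
from gensim.models import Word2Vec

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); the result ranges from -1 to 1.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

model = Word2Vec.load("virginia.model")  # placeholder path

# Direct computation from the raw vectors...
score = cosine_similarity(model.wv["irishman"], model.wv["beggarly"])

# ...or the equivalent built-in call.
print(score, model.wv.similarity("irishman", "beggarly"))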

To visualize these scores, we used two visualization types: heatmaps and cosine-similarity-over-time graphs. To see all of our cosine similarity graphs, check out our R Shiny app. To learn more about how they were generated, and for documentation on specific graphs, read on below.

Heatmaps

Heatmaps let us see the similarities between multiple keywords across a corpus at a glance and show how strongly they are correlated with each other. The darker the shade of the square that two keywords share, the more similar they are to one another. Here is an example of a heatmap:

Here, we can see the correlation between multiple keywords in our Irish subcorpus. Some notable observations are the association of ‘Irishman’ with very negative terms, like ‘beggarly’ and ‘uncivil’, and the link between ‘Tyrone’ and ‘rebellion’ as two closely related keywords. This makes sense, as a major conflict in Ireland during our period was Tyrone’s Rebellion. Heatmaps are a great visualization tool because they make it easy for the user to see how similar certain terms are to each other in a visually appealing way.
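Our heatmaps are served through the R Shiny app, but the underlying computation is simply a matrix of pairwise cosine similarities. The Python sketch below illustrates the idea; the model path, keyword list, and colormap are assumptions for demonstration.

import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec

model = Word2Vec.load("irish_subcorpus.model")  # placeholder path
keywords = ["irishman", "beggarly", "uncivil", "tyrone", "rebellion"]  # example terms

# Pairwise cosine similarity matrix between the keywords.
matrix = np.array([[model.wv.similarity(a, b) for b in keywords] for a in keywords])

# With a sequential colormap, darker cells mean higher similarity.
fig, ax = plt.subplots()
im = ax.imshow(matrix, cmap="Blues")
ax.set_xticks(range(len(keywords)))
ax.set_xticklabels(keywords, rotation=45, ha="right")
ax.set_yticks(range(len(keywords)))
ax.set_yticklabels(keywords)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()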

Cosine Similarity Graph

With cosine similarity graphs, we can see the correlation between a keyword and several other words, and how much that correlation changes over time. This way, we can see whether the language surrounding the keyword is developing in a positive or negative direction. Here is an example:

Here, we can see how similar the word “Spanish” is to several other words and how those similarities rise or fall over time. Notably, during 1605-1609 the similarity between “Spanish” and “enemy” remains high, even though King James I was actually trying to improve relations with Spain during this period. However, if we look at cosine similarity scores for key terms related to the Spanish across the whole corpus over time, rather than only in texts that discuss Spain, we can observe a different trend:


The trend here is that the cosine similarity scores between “Spanish” and the terms “tyranny”, “enemy”, “catholic”, and “force” all decline in the period immediately after James I’s coronation. This tells us that the texts in the focused subcorpus that deal exclusively with Spain must be more anti-Spanish in character than the full set of texts available for that time period.
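Producing these over-time graphs amounts to repeating one similarity query across the period models. A minimal sketch follows; the period labels, file names, and term list are placeholders for illustration.

import matplotlib.pyplot as plt
from gensim.models import Word2Vec

periods = ["1600-1604", "1605-1609", "1610-1614"]   # placeholder slices
targets = ["tyranny", "enemy", "catholic", "force"]

# Load one model per time slice (placeholder file names).
models = {p: Word2Vec.load(f"model_{p}.model") for p in periods}

for target in targets:
    # Leave a gap (None) where either word fell below the frequency threshold.
    scores = [
        models[p].wv.similarity("spanish", target)
        if "spanish" in models[p].wv and target in models[p].wv else None
        for p in periods
    ]
    plt.plot(periods, scores, marker="o", label=target)

plt.xlabel("time period")
plt.ylabel("cosine similarity with 'spanish'")
plt.legend()
plt.show()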