N-grams are sequences of n words that appear directly next to each other in a text. They range from uni-grams (a single word) to bi-grams (two adjacent words) and tri-grams (three adjacent words), and they are used to see which words most often appear close to keywords. For our purposes, we used uni-grams and bi-grams.
Bi-grams
To calculate bi-grams, we used a Python script written by the previous ECBC 2022 cohort and edited it to fit our purposes. We specify which keywords we would like to search for in our corpus, and the script generates a list of every word associated with each keyword, along with how frequently each pairing occurs. We then ran the script on sub-corpora split into separate time periods to see how these word associations changed over time, including whether negative or positive word associations rose or declined.
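The keyword search described above can be sketched roughly as follows. This is a minimal illustration, not the cohort's actual script: the function names, the tiny stop-word list, the underscore-joined output format, and the toy texts are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the project's real list is not shown.
STOPWORDS = {"the", "of", "with", "is", "a", "and"}

def keyword_bigrams(text, keywords):
    """Count word pairs that are adjacent after stop-word removal and include a keyword."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left in keywords or right in keywords:
            counts[f"{left}_{right}"] += 1
    return counts

def bigrams_by_period(periods, keywords):
    """Sum keyword bi-gram counts over every text in each time period."""
    return {
        period: sum((keyword_bigrams(text, keywords) for text in texts), Counter())
        for period, texts in periods.items()
    }

# Hypothetical toy texts, not drawn from the project corpus.
toy_periods = {
    "1595-1599": [
        "The wild Irish resist the conquest.",
        "War with the wild Irish continues.",
    ],
    "1600-1604": ["The conquest of the Irish is declared."],
}
toy_counts = bigrams_by_period(toy_periods, {"irish"})

for period, counts in toy_counts.items():
    print(period, counts.most_common(3))
```

On this toy data, "wild_irish" is counted twice in the first period and not at all in the second, which is the kind of change over time the per-period split is meant to expose.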
Keywords were selected based on how frequently we had seen them in the texts we had close-read and how important they were in the sub-corpora overall. Here is an example of bi-grams from our Spanish sub-corpus for the period 1615-1619, using Spanish-specific keywords:
Above are just a few of the unique bi-grams from this sub-corpus. For visualizations, we picked the most interesting n-grams to track over time. We used R-shiny to generate some of our graphs and to add context to them. Below is the n-gram frequency graph for Ireland:
Here, we can see a major rise in the bi-gram “conquest_ireland” because of Tyrone’s Rebellion, a rebellion by Irishmen against English colonization. Other notable bi-grams are “wild_irish” and “war_irish”.
Uni-grams
Since uni-grams are simply single-word frequencies, we generated them only for the most important keywords we found through TF-IDF, close reading of certain texts, or historical analysis. The uni-gram visualizations were also generated separately for each sub-corpus and split into distinct five-year time periods. Below is a uni-gram graph over time:
To see more n-gram visualizations for every sub-corpus, as well as visualizations for certain categories within the sub-corpora, check out our R-shiny app!
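As a rough illustration of the uni-gram counts described in this section, the sketch below groups texts into five-year periods and counts keyword occurrences in each. The function names, the (year, text) input format, and the sample texts are hypothetical; the bins are assumed to start on years divisible by five, which matches the 1615-1619 period mentioned earlier.

```python
import re
from collections import Counter

def period_label(year, width=5):
    """Map a year to its five-year bin, e.g. 1617 -> '1615-1619'."""
    lower = year - (year % width)
    return f"{lower}-{lower + width - 1}"

def unigram_counts_by_period(docs, keywords):
    """Count keyword occurrences per five-year period.

    docs is an iterable of (year, text) pairs (an assumed input format).
    """
    counts = {}
    for year, text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        bucket = counts.setdefault(period_label(year), Counter())
        bucket.update(t for t in tokens if t in keywords)
    return counts

# Hypothetical toy texts, not drawn from the project corpus.
toy_docs = [
    (1596, "Ireland and the Irish"),
    (1598, "Ireland at war"),
    (1601, "Spain and Ireland"),
]
toy_unigrams = unigram_counts_by_period(toy_docs, {"ireland", "spain"})

for period in sorted(toy_unigrams):
    print(period, dict(toy_unigrams[period]))
```

Counts like these, one series per keyword, are what a frequency-over-time graph plots.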
Limitations
The sub-corpora we used to generate these n-grams were significantly smaller than those used for word embeddings because they were generated just from TF-IDF, as the n-grams code ran significantly better on a smaller corpus. One limitation was that, because these sub-corpora were split into five-year time ranges and some of the TF-IDF sub-corpora had fewer than thirty texts, many time periods contained few, if any, texts. In addition, since we are tracking specific word pairs, not every pair occurs frequently or is used consistently over time. When we began to generate visualizations of our n-grams, this resulted in some time periods showing a sharp decrease for every keyword, or zero mentions altogether. This does not mean that English writers weren’t writing about these groups of people or that these groups weren’t being affected significantly, only that a specific word pair wasn’t used in the limited corpus we generated.
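One way to keep this limitation visible when reading the graphs is to label each period by how many texts it contains before interpreting a zero. The sketch below is only an illustration of that reasoning: the function name, the min_texts threshold of 5, and the toy numbers are assumptions, not part of our pipeline.

```python
def describe_period(num_texts, pair_count, min_texts=5):
    """Label a period so that an empty sub-corpus is not read as a real decline.

    min_texts is a hypothetical threshold chosen for illustration only.
    """
    if num_texts == 0:
        return "no texts"
    if num_texts < min_texts:
        return "too few texts"
    return "zero mentions" if pair_count == 0 else "observed"

# Hypothetical (number of texts, bi-gram count) values per period.
toy_summary = {
    "1595-1599": (12, 4),
    "1600-1604": (0, 0),
    "1605-1609": (3, 0),
    "1610-1614": (12, 0),
}
toy_labels = {
    period: describe_period(num_texts, pair_count)
    for period, (num_texts, pair_count) in toy_summary.items()
}

for period, label in toy_labels.items():
    print(period, label)
```

Only the "zero mentions" label reflects a word pair that is truly absent from a reasonably sized set of texts; the other two labels mark periods where the corpus itself is too thin to say anything.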