
Our Sub-Corpora

All of the texts that we used for data analysis come from the Early Print Library. The whole database holds around 52k texts, which were uploaded to the Duke Compute Cluster (DCC). Because we examine only the period 1590-1639, we reduced the corpus to that range, leaving around 8.1k texts. We then batched these texts into four- to five-year periods (depending on the historical significance of certain dates) and into ten-year periods, so that we could track language over time.
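The date filtering and ten-year batching described above might be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the record format (pairs of text ID and publication year), the function names, and the sample IDs are all assumptions made for the example. The four- to five-year batches are left out because their boundaries depend on historical judgments that are not listed here.

```python
from collections import defaultdict

# Study period stated in the text.
START_YEAR, END_YEAR = 1590, 1639


def in_period(year):
    """Return True if the year falls inside the study period (inclusive)."""
    return START_YEAR <= year <= END_YEAR


def decade_label(year):
    """Map a year to its ten-year batch, counted from START_YEAR."""
    start = year - (year - START_YEAR) % 10
    return f"{start}-{start + 9}"


def batch_by_decade(records):
    """Group text IDs by ten-year batch, dropping out-of-period texts.

    `records` is an iterable of (text_id, year) pairs (assumed format).
    """
    batches = defaultdict(list)
    for text_id, year in records:
        if in_period(year):
            batches[decade_label(year)].append(text_id)
    return dict(batches)


# Hypothetical IDs and years, for illustration only.
sample = [
    ("A00001", 1589),
    ("A00002", 1590),
    ("A00003", 1607),
    ("A00004", 1639),
    ("A00005", 1640),
]
batches = batch_by_decade(sample)
```

Here the texts dated 1589 and 1640 fall outside the period and are dropped, while the remaining three land in the 1590-1599, 1600-1609, and 1630-1639 batches.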

 

Since we are looking at how language changed over time for specific groups of people and places, we divided our corpus further into nine distinct groups based on their relation to the Virginia Company. To do this, we combined the keyword tags in the Early Print metadata sheet with TF-IDF, using a base text that contained only the keywords associated with each group. Because these sub-corpora were very specific, each had fewer than 100 texts. The sub-corpora are as follows:

 

  1. Enslaved Africans

  2. Native Virginia Nations

  3. Ireland

  4. The Netherlands

  5. Indentured Servitude

  6. The West Indies

  7. The Spanish

  8. The Portuguese

  9. Bellarmine/Anti-Catholic ideology
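
One way the TF-IDF step could work is to score each text against a keyword-only base text and rank the texts by similarity. The sketch below is an assumption-laden illustration rather than the project's actual code: it uses a hand-rolled smoothed TF-IDF with cosine similarity, whitespace tokenization, and invented keywords and sample texts. How a cutoff was chosen for including a text in a sub-corpus is not stated in the text, so none is applied here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({t: term_freq[t] * idf[t] for t in term_freq})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_against_base(base_text, texts):
    """Rank texts (a dict of id -> text) by similarity to the base text."""
    ids = list(texts)
    vectors = tfidf_vectors([base_text] + [texts[i] for i in ids])
    base_vec = vectors[0]
    scores = {i: cosine(base_vec, v) for i, v in zip(ids, vectors[1:])}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical base keywords and texts, for illustration only.
base = "ireland irish ulster plantation"
texts = {
    "t1": "the plantation of ulster in ireland",
    "t2": "a sermon on faith and grace",
    "t3": "irish rebels and the english crown",
}
ranked = rank_against_base(base, texts)
```

In this toy example, `t1` shares three keywords with the base text and ranks first, `t3` shares one and ranks second, and `t2` shares none and scores zero. In practice the top-ranked texts would still need to be checked against the metadata keyword tags before being assigned to a group.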