
Part of Speech Tagging

Overview

From the Early Print (EP) Text-Creation Partnership (TCP) XML files, we extracted the parts of speech (POS) of the ten words surrounding each keyword we chose to examine, across every instance of that keyword in our corpus. This let us discover the most common adjectives, verbs, and nouns used around keywords for our groups of interest. We tracked shifts over time in these most common words by segmenting our corpus into 10-year windows, following 14 keywords of interest and generating 215 visualizations in total. We identified the most popular surrounding words in each decade, beginning in 1590 and ending in 1639. In our visualizations, words are ranked in descending order of prevalence with a color gradient: the darker a word’s color, the more often it was found beside the keyword in question. Here is an example of a visualization generated for the nouns found around the word ‘Virginia’ in our corpus:

To see more visualizations for an array of keywords, check out our R-Shiny app!
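In outline, the extraction walks the adorned word tokens in each XML file, locates each occurrence of a keyword, and tallies the POS tags of the five words on either side. The sketch below illustrates the idea in Python; the element tag ('w'), the attribute names ('lemma', 'pos'), the TEI namespace, and the decade lookup are assumptions about the adorned EP TCP XML, not the exact schema we used.

    # Sketch: tally the POS tags of the ten words surrounding a keyword.
    # Element/attribute names and the decade lookup are illustrative assumptions.
    from collections import Counter
    from lxml import etree

    TEI = "{http://www.tei-c.org/ns/1.0}"

    def window_pos_counts(xml_path, keyword, year, window=5):
        words = etree.parse(xml_path).findall(f".//{TEI}w")  # adorned word tokens
        decade = (year // 10) * 10                           # e.g. 1607 -> 1600
        counts = Counter()
        for i, w in enumerate(words):
            token = (w.get("lemma") or w.text or "").lower()
            if token != keyword:
                continue
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for neighbor in words[lo:i] + words[i + 1:hi]:
                if neighbor.get("pos"):
                    counts[neighbor.get("pos")] += 1
        return decade, counts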

Methodology

Standardization and Lemmatization

Much like the text cleaning process we used to find N-grams, conduct TF-IDF analysis, and compute cosine similarities, we standardized and lemmatized the XML files from which these parts of speech were extracted. This process included standardizing variant spellings found through manual analysis of key texts, as well as lemmatizing words that were not already partially lemmatized in the EP TCP files. It also included removing stop words and lowercasing all text to simplify processing.
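As a rough illustration, the cleaning pass can be thought of as a small token-by-token pipeline. The sketch below assumes hand-built lookup tables for variant spellings and lemmas; the entries shown are hypothetical placeholders, not our actual tables.

    # Sketch of the cleaning steps described above. VARIANTS and LEMMAS stand in
    # for the project's actual lookup tables; the entries shown are hypothetical.
    STOP_WORDS = {"the", "and", "of", "to", "a", "in"}   # abbreviated stop word list
    VARIANTS = {"vertue": "virtue"}                      # hypothetical variant spelling
    LEMMAS = {"planted": "plant"}                        # hypothetical lemma mapping

    def clean_tokens(tokens):
        cleaned = []
        for tok in tokens:
            tok = tok.lower()                # lowercase all text
            tok = VARIANTS.get(tok, tok)     # standardize known variant spellings
            tok = LEMMAS.get(tok, tok)       # lemmatize words not already lemmatized
            if tok not in STOP_WORDS:        # drop stop words
                cleaned.append(tok)
        return cleaned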

Through this, we managed to catch most transcription errors and omissions (where ASCII carets appear in lieu of letters) in relevant vocabulary, though this brute-force method is not perfect. You can see where some missing transcriptions were not completely standardized in a few of our visualizations, like this one:

As can be seen, the tenth most popular verb in that period is “^nd.” This is likely because “Virginia” appeared only a handful of times in this period, which predates the founding of Jamestown, and this indeterminate word form showed up in the few instances that did exist.
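One way such indeterminate forms could be screened out before ranking is a simple check for the caret character; a minimal sketch:

    # Sketch: drop indeterminate forms (tokens containing the ASCII caret used
    # for illegible letters) before ranking, so forms like "^nd" do not surface.
    def drop_illegible(tokens):
        return [tok for tok in tokens if "^" not in tok]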

NUPOS Tagging

The POS tags provided by EP are machine-generated by the MorphAdorner algorithm using the NUPOS tag set devised by Martin Mueller. MorphAdorner takes XML files, designates sentence and word boundaries, and then tags each word with its NUPOS identification. Though NUPOS has 241 total tags for the parts of speech that can exist in the most granular analyses of English, we chose to focus on only 3 of its 17 major word classes: verbs, adjectives, and nouns. MorphAdorner is trained on roughly 6 million words drawn from a blend of cultural touchstone texts, like the complete works of Shakespeare, and popular 19th century texts:

  • The complete works of Chaucer and Shakespeare
  • Spenser’s Faerie Queene
  • North’s translation of Plutarch’s Lives
  • Mary Wroth’s Urania
  • Jane Austen’s Emma
  • Dickens’ Bleak House and The Old Curiosity Shop
  • Emily Bronte’s Wuthering Heights
  • Thackeray’s Vanity Fair
  • Mrs. Gaskell’s Mary Barton
  • Frances Trollope’s Michael Armstrong
  • George Eliot’s Adam Bede
  • Scott’s Waverley
  • Harriet Beecher Stowe’s Uncle Tom’s Cabin
  • Melville’s Moby Dick

Click here to read more about MorphAdorner and NUPOS!
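For our purposes, the fine-grained NUPOS tags only need to be collapsed into the three major classes we track. A minimal sketch of that mapping follows; the prefix convention it assumes (n for nouns, v for verbs, j for adjectives) should be verified against the full NUPOS documentation.

    # Sketch: collapse fine-grained NUPOS tags into the three major word classes
    # we track. The prefix convention assumed here (n* = noun, v* = verb,
    # j* = adjective) should be checked against the full NUPOS tag list.
    def major_class(nupos_tag):
        tag = nupos_tag.lower()
        if tag.startswith("n"):
            return "noun"
        if tag.startswith("v"):
            return "verb"
        if tag.startswith("j"):
            return "adjective"
        return None   # other classes (adverbs, pronouns, etc.) are ignored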

Limitations and Complications

There are, however, a few limitations to this method. Due to the nature of the texts we are dealing with, standardization and lemmatization are an endless uphill battle, for there are always new and bizarre variant spellings to be discovered in 17th century texts. There are also always transcription errors or omissions, because these texts are digitized by humans dealing with centuries-old paper and ink.

MorphAdorner is also trained mostly on 19th century texts, whose language differs substantially from that of the 17th century. And while 6 million words sounds like a large number, it is likely not enough to produce a perfectly accurate POS-tagging algorithm given the complexity of language, even for the 19th century texts MorphAdorner is trained on. Applying an imperfect algorithm to imperfect texts, of course, yields imperfect results: MorphAdorner was not designed to assess 17th century language, and thus it makes a number of errors.

For example, while EP has assigned a POS tag to each word in the XML files, some words are tagged incorrectly, like “Walter” being tagged as an adjective instead of a noun. In general, a sizeable number of nouns appear as adjectives and vice versa.

This limitation is likely caused by MorphAdorner’s accuracy threshold and the fact that it is not trained to deal with 17th century texts. We look forward to addressing this limitation in the future to fine-tune our POS tagging even further.

We have also noticed that the words “Good”, “Great”, and “Certain” tend to dominate the adjectives around nearly every keyword. We do not yet have an answer as to why these words are so dominant across the corpus for all the keywords we examined, or whether this is simply a byproduct of how common they are in the English language in general.
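One way to probe this question would be to compare how often each adjective appears near a keyword with how often it appears in the corpus overall; a ratio close to one would point toward general frequency rather than anything keyword-specific. A minimal sketch, using hypothetical count tables:

    # Sketch: compare an adjective's frequency near a keyword against its
    # frequency in the corpus overall. near_counts and corpus_counts are
    # hypothetical Counters of adjective occurrences inside keyword windows
    # and across the whole corpus, respectively.
    def relative_prevalence(adjective, near_counts, corpus_counts):
        near_rate = near_counts[adjective] / max(1, sum(near_counts.values()))
        corpus_rate = corpus_counts[adjective] / max(1, sum(corpus_counts.values()))
        # A ratio near 1.0 suggests the dominance is a by-product of general frequency.
        return near_rate / corpus_rate if corpus_rate else float("inf")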