What is TF-IDF?
TF-IDF, or Term Frequency-Inverse Document Frequency, is a method that lets us measure the similarity between texts in a large corpus based on the keywords they share. A base text is chosen that the user wants to test for similarity across the corpus. TF-IDF takes the most frequent words in that text (for our initial purposes, we took the top 150) and assigns each a score for how frequent it is in that text alone. It then finds how many times each of these frequent keywords is used in other texts throughout the corpus, and assigns a score for how often each keyword is used in a non-base text compared to the base text. More often than not, a given text shares few, if any, of the keywords from the base text. However, if a text shares a significant number of keywords, and those keywords have high TF-IDF scores, we can presume that the texts are similar in content and usage. TF-IDF is also a great tool for locating hard-to-find but important keywords that may not have been standardized or lemmatized.
Here is an example of what a TF-IDF spreadsheet looks like, using the base text 13201, or “Orders and directions, together with a commission for the better administration of iustice, and more perfect information of His Maiestie how, and by whom the lawes and statutes tending to the reliefe of the poor…”, with just a few of the many terms used to check for similarity over the corpus:

For further in-depth reading on TF-IDF and the coding process, check out Early Print’s documentation on the process.
How we used TF-IDF for our sub-corpus
We used TF-IDF to generate our sub-corpora. After clustering methods proved unsuccessful, we decided that the best way to proceed was to gather our sub-corpora manually. To do this, we first found strong base texts related to the group we wanted to analyze, so that we could identify the keywords most often seen in each text and then check whether those keywords were also frequent, across our whole corpus, in texts relating to that sub-corpus. For example, one text used to find terms relating to Virginia was John Smith’s “A true relation of such occurrences and accidents of noate as hath hapned in Virginia since the first planting of that collony, which is now resident in the south part thereof, till the last returne from thence written by Captaine Smith [Cor]one[ll] of the said collony, to a worshipfull friend of his in England”.
From these TF-IDFs we then gathered topical keywords that would describe each sub-corpus, along with words we had found in our close readings of texts or that were sent by our Project Leads, and put them in an empty text file to serve as a new base text. That way, we would be working with a much smaller, more focused set of words pertaining specifically to the groups whose surrounding language we hoped to analyze. After these TF-IDF spreadsheets were created, we went through each keyword and found the TCP IDs with the highest TF-IDF scores (usually around 4-8 IDs, depending on how significant the word was), and added those IDs to a list. Once we had gone through each keyword and documented the highest-scoring TCP IDs, we counted how many times each ID was repeated in the list. The more times an ID was repeated, the more likely it was that its contents dealt with the sub-corpus we wanted to examine. If an ID was repeated enough times (around 6-8, depending on how many keywords were used to run TF-IDF), it was added to our sub-corpus.
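The ID-counting step above can be sketched as follows. The keyword-to-IDs mapping and the TCP IDs here are invented placeholders (the real lists came from our TF-IDF spreadsheets), and the threshold would be tuned to the number of keywords, as described above:

```python
# Sketch of the sub-corpus selection step: for each keyword, keep its
# top-scoring TCP IDs, then count how often each ID recurs across
# keywords. IDs below are hypothetical placeholders.
from collections import Counter

top_ids_per_keyword = {
    "collony":  ["A12345", "A67890", "A11111", "A22222"],
    "virginia": ["A12345", "A67890", "A33333", "A44444"],
    "planting": ["A12345", "A55555", "A67890", "A66666"],
    "plantation": ["A12345", "A67890", "A77777", "A88888"],
}

# Count how many keyword lists each TCP ID appears in.
counts = Counter(tcp_id for ids in top_ids_per_keyword.values() for tcp_id in ids)

THRESHOLD = 3  # would scale with the number of keywords used
subcorpus = sorted(id_ for id_, n in counts.items() if n >= THRESHOLD)
print(subcorpus)  # IDs recurring across enough keyword lists
```

An ID that ranks highly for many different keywords is much more likely to be genuinely about the target group than one that scores highly for a single keyword by chance, which is the intuition behind the repetition threshold.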