Biblical Citations in Sermon Marginalia

By Amy Weng

About Our Corpora: 

Our primary dataset of texts comes from the EarlyPrint Lab (EP), a joint project between Northwestern University and Washington University in St. Louis which produced over 52,000 lemmatized and partially spelling-standardized versions of the XML files from the Text Creation Partnership (TCP). The TCP, a collaboration between ProQuest and several libraries and non-profits, transcribed over 60,000 texts from Early English Books Online (EEBO), a comprehensive database of English texts dating from 1470 to 1700 found in libraries worldwide. In this project, we use both EP and TCP files—the former for text analysis and the latter for marginalia extraction.

      • Bitbucket Repository of EP XML files with download instructions: Link 
      • TCP’s Dropbox folder: Link (Download the P4 XML files in both Phase 1 and Phase 2) 

Marginalia Extraction & Standardization: 

Out of 60,327 TCP XML files, 5,801 texts have either the word “sermon” in their title or subject heading or some variation of “preached” + preposition in their title. Of these, 5,348 are found in EP’s repository of enhanced XML files, and 4,370 of them have at least one Biblical citation in their marginal notes.

Why would one care to analyze sermon marginalia? The margins of early modern printed sermons can include abundant information from both the preacher and publisher, serving as a valuable “pedagogical resource” about the sermon itself and occasionally as an advertising space for the publisher (Rigney). Marginalia added by the preacher include citations of Biblical passages quoted or invoked in the sermon, additional Greek or Latin quotations, and translations of foreign quotations within the sermon. Because sermons are theological in nature and preachers’ arguments rest upon the elucidation of one scriptural verse or topic using related texts, we can garner vital information about a sermon’s topics of concern just by examining the scriptural citations it contains. However, marginalia may not be present in all printed sermons, and most importantly, they are prone to errors: 

margins are one of the least stable parts of a text, peculiarly resistant to textual analysis. Marginal references are frequently missing, routinely erroneous, and quite often to be found wandering errantly around the page, either higher or lower in the margin or within the text as opposed to in the margin (Rhatigan 437).  

These defects limit the usefulness of studying sermons by the distribution of their marginal scriptural citations. Our method does nothing to correct erroneous citations, but the positioning of marginalia on the page is at least irrelevant to our investigation. Moreover, we have found that marginal notes are ample in quantity and form one avenue through which we can analyze the contents of sermons using computational tools.

We are focusing only on a subset of 70 printed sermons that were preached by six notable early modern English clergymen, but there is a wealth of information to be explored in the broader set of sermons, as discussed later in this article. See the “Sermons and the Public” page for more details on these sermons and their preachers, as well as our efforts at text analysis.

Our particular set of sermons (hereafter referred to as the charity sermons dataset) was identified as having some engagement with the topic of charity through research done beforehand by Professor Giugni. One such example is Thomas Gataker’s The decease of Lazarus Christ’s Friend, a sermon he preached at the funeral of a London merchant. As we can see in the lower margins of the left page and the upper right corner of the right page below (within the added gray boxes), the printers of this sermon annotated that Gataker refers to certain passages of Matthew 25 in this part of his sermon. 


In the full-text version of this sermon on ProQuest’s EEBO interface, these marginal notes are represented as footnotes, as the following shows:

In the TCP XML files, these marginalia are encoded within <note> tags:

The extraction and spelling standardization of these Biblical citations is a tedious and inefficient process, heavily reliant on pattern matching using regular expression rules and a standardizer dictionary of Bible book abbreviations. I compiled and revised the rules and dictionary manually through many iterations of the procedure, but I still cannot guarantee that the output is completely faithful to the original in the scanned pages.

I only process the annotations within the sermons themselves, which means that I exclude the various citations that occur within any dedicatory material. The formats of citations within the marginalia of these books are also highly variable, so I standardize them into single “<book> <chapter>:<line>” citations. For example, I convert “2 King. 6. 22.—9. 24.—13 15.” into “2 Kings 6:22”, “2 Kings 9:24”, and “2 Kings 13:15”, and “Genesis 3 9-11” into “Genesis 3:9”, “Genesis 3:10”, and “Genesis 3:11”. This approach excludes book-only and chapter-only citations, as well as citations that have illegible characters.
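As a rough illustration of the range expansion described above, here is a minimal sketch. The function name `expand_citation` and its signature are hypothetical and are not taken from bibleMarginalia.py; the sketch assumes the book and chapter have already been identified and only the verse portion remains to be expanded.

```python
import re

def expand_citation(book, chapter, verses):
    """Expand a verse spec such as '9-11' into single 'Book C:V' citations.

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", verses.strip())
    if match is None:
        return []  # unparseable spec; a caller could log it for manual review
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    if end < start:
        return []  # reversed ranges are treated as unparseable here
    return [f"{book} {chapter}:{verse}" for verse in range(start, end + 1)]

print(expand_citation("Genesis", 3, "9-11"))
# ['Genesis 3:9', 'Genesis 3:10', 'Genesis 3:11']
```

Returning an empty list for anything unparseable mirrors the idea of setting aside citations that cannot yet be converted, rather than guessing at them.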

Feel free to visit the commented source code on the extraction, preprocessing, and formatting of marginal notes here: bibleMarginalia.py

      1. Input: The file path to a single TCP XML sermon file 
      2. Procedure: After preprocessing each XML <note> tag, the code finds and properly formats Biblical citations.  
        1. Preprocessing involves making everything lower-case and removing all characters except for alphabetical letters, numbers, commas, ampersands, and hyphens. I remove periods because they complicate the regex search expressions, and they are unnecessary for interpreting the majority of citations. However, for overly ambiguous cases, I had to either hard code the citations after making guesses based on the original phrasing found in the margins of the PDF version of the sermon, or entirely exclude the instance. The last step of preprocessing is standardizing the spellings of all Bible book abbreviations.
      3. Outputs: (1) a list of the properly formatted singular line citations, (2) a list of the citations that cannot be converted to singular citations at the moment, and (3) a list of possible Biblical abbreviations that are currently missing from the standardizer dictionary. The latter two outputs are useful in continuously updating the rules and dictionaries for greater accuracy.
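The preprocessing step can be sketched roughly as follows. This is not the code in bibleMarginalia.py: the tiny `ABBREVIATIONS` dictionary, the function names, and the toy XML snippet are all assumptions made for illustration. The sketch also assumes that notes are lower-case `<note>` elements without namespaces, and that removed characters are replaced by a space so that adjacent numbers do not run together.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a standardizer dictionary; the real one is much larger.
ABBREVIATIONS = {"king": "kings", "gen": "genesis", "mat": "matthew"}

def note_texts(xml_string):
    """Return the text content of every <note> element in the document."""
    root = ET.fromstring(xml_string)
    return ["".join(note.itertext()) for note in root.iter("note")]

def preprocess(text):
    """Lower-case, keep only letters, digits, commas, ampersands and hyphens,
    then standardize book abbreviations token by token."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9,&\-\s]", " ", text)  # periods etc. become spaces
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)

# Toy input, invented for this example.
sample = "<text><p>Body <note>2 King. 6. 22.</note> more <note>Gen. 3. 9-11</note></p></text>"
for raw in note_texts(sample):
    print(preprocess(raw))
# 2 kings 6 22
# genesis 3 9-11
```

After this step, a regular expression can look for a book name followed by chapter and verse numbers; tokens that look like abbreviations but are missing from the dictionary could be collected for the third output described above.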

Marginal Citations in the Charity Sermons Dataset: 

Within the charity sermons, we encounter 7,391 citations of New Testament passages and 8,329 citations of lines in the Old Testament. The three books most commonly cited in the Old Testament are Psalms, Isaiah, and Genesis, whereas Matthew, Romans, and Luke are the most frequently occurring New Testament books. Psalms is overwhelmingly the most cited book with more than twice the number of the next most commonly cited book, Matthew. Of all the books of the Bible, only Philemon, 3 John, and Obadiah do not appear in any citations within this dataset.  

Top 10 Occurrences of Cited Books:

1) Psalms: 2339 instances
2) Matthew: 1103 instances
3) Romans: 771 instances
4) Isaiah: 691 instances
5) Genesis: 668 instances
6) Proverbs: 608 instances
7) Luke: 559 instances
8) 1 Corinthians: 532 instances
9) John: 494 instances
10) Hebrews: 432 instances


Top 10 Occurrences of Cited Chapters:

1) Psalms 119: 224 instances
2) Romans 8: 162 instances
3) Matthew 5: 114 instances
4) Matthew 26: 102 instances
5) James 1: 100 instances
6) Matthew 6: 97 instances
7) Romans 5: 90 instances
8) 1 John 3: 89 instances
9) 2 Corinthians 5: 89 instances
10) Philippians 3: 75 instances

Top 10 Occurrences of Cited Lines:

1) James 1:17: 21 instances
2) Revelation 20:6: 19 instances
3) Romans 14:17: 17 instances
4) Hebrews 13:5: 17 instances
5) 1 Peter 5:8: 17 instances
6) Job 13:15: 17 instances
7) Romans 5:3: 16 instances
8) Genesis 2:7: 16 instances
9) Philippians 3:8: 15 instances
10) Matthew 26:41: 15 instances
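
Tallies like the ones above can be produced with a simple counter once every citation is in the standardized “<book> <chapter>:<line>” form. The following is a minimal sketch; the function name `tally` and the sample citations are invented for illustration and are not drawn from the dataset.

```python
from collections import Counter

def tally(citations):
    """Count citations at the book, chapter, and line level."""
    books, chapters, lines = Counter(), Counter(), Counter()
    for citation in citations:
        # "1 John 3:17" -> book_chapter "1 John 3", then book "1 John"
        book_chapter, _, _verse = citation.rpartition(":")
        book, _, _chapter = book_chapter.rpartition(" ")
        books[book] += 1
        chapters[book_chapter] += 1
        lines[citation] += 1
    return books, chapters, lines

# Invented sample, not real counts from the corpus.
sample = ["Psalms 119:1", "Psalms 119:105", "Psalms 23:1", "1 John 3:17", "1 John 3:17"]
books, chapters, lines = tally(sample)
print(books.most_common(1))     # [('Psalms', 3)]
print(chapters["Psalms 119"])   # 2
print(lines["1 John 3:17"])     # 2
```

Splitting from the right keeps numbered books such as “1 John” intact.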

We can then explore the unsupervised clustering of documents that have marginal citations in this corpus, using those citations as features. The citations are converted into numerical vectors based on the frequency of each feature in a document relative to its frequency across the entire corpus. This form of vectorization is called TF-IDF (term frequency-inverse document frequency): a citation that occurs multiple times in only one sermon receives a higher score, signifying that it is a highly distinctive feature of that sermon. We also use the logarithm of a term’s frequency in each document, instead of the raw count, to dampen the scores of the most frequently occurring terms within that document, which are most likely to be uninformative about what makes a document distinct from others in the same corpus. Given numerical representations of all the citations in each sermon, we can then compare and contrast the output of different clustering techniques to identify sermons whose marginal citations are more similar to each other. A file of all the marginal citations for each document can be accessed here: marginalia.all.sermons.txt
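
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF with a logarithmic term frequency. It is an illustration under stated assumptions, not the project's code: the smoothing follows the formula scikit-learn documents for its default settings, the usual vector normalization is omitted, and the two sample "sermons" are invented.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: dict mapping a document id to its list of citations.

    Returns a dict of {doc_id: {citation: weight}} using
    tf = 1 + ln(count) and idf = ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for citations in documents.values():
        doc_freq.update(set(citations))  # count each citation once per document
    vectors = {}
    for doc_id, citations in documents.items():
        counts = Counter(citations)
        vectors[doc_id] = {
            citation: (1 + math.log(count))
            * (math.log((1 + n_docs) / (1 + doc_freq[citation])) + 1)
            for citation, count in counts.items()
        }
    return vectors

# Invented toy corpus: "James 1:17" is distinctive of sermon A.
docs = {
    "A": ["James 1:17", "James 1:17", "Psalms 23:1"],
    "B": ["Psalms 23:1"],
}
weights = tfidf(docs)
print(weights["A"]["James 1:17"] > weights["A"]["Psalms 23:1"])  # True
```

A citation shared by every document keeps a baseline weight, while one concentrated in a single sermon is weighted more heavily, which is the behavior the paragraph above describes.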

K-means clustering assigns input samples, which are the documents in our case, to a user-defined number of clusters by minimizing the sum of squared distances between each point and the mean of the cluster it belongs to (that is, the within-cluster variance). This algorithm is not guaranteed to find the optimal clusters, so one often needs to run multiple trials to estimate the number of clusters to look for, if any well-defined clusters exist at all. In our case, plotting the document vectors in 2D space using the dimensionality reduction technique of principal component analysis allows us to see that there are obvious groups of sermons whose distributions of marginal citations are more similar to each other than to those of the sermons in other groups:


Each point in the plot above represents a document in our corpus, and we can summarize each group with the document IDs, preachers, scholarly-assigned subject headings from the English Short Title Catalogue, and distribution of overall and charity-related citations: clustering_output.md. There are 53 documents with marginal citations in total. Moreover, we can look at the most frequent bigrams and trigrams involving keywords related to charity or poverty within the sermons themselves to get a glimpse of the phrases used to discuss money, material goods, labor and industriousness, the less fortunate demographics, and forms of giving. N-grams are sequences of n words that occur in order within a document; four-word phrases such as “give_to_the_poor” are included as trigrams when one of the words is a stopword (i.e., a very commonly occurring word that is unessential to a phrase’s meaning, e.g., “the”). Because these n-grams are produced from EP files, the words in the n-grams below have been lemmatized to their dictionary forms. See the “Sermons and the Public” page for a more comprehensive overview of the textual contents of these sermons.
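
One possible reading of the stopword rule for n-grams is sketched below. The exact procedure is not spelled out here, so this is an assumption: stopwords do not count toward n but are kept in the printed phrase, and an n-gram is retained only if it contains a keyword. The stopword and keyword sets are hypothetical.

```python
def content_ngrams(tokens, n, stopwords, keywords):
    """Return n-grams of n non-stopword tokens, keeping any stopwords that
    fall between them, and only if the phrase contains a keyword."""
    # Positions of the tokens that count toward n.
    content = [i for i, token in enumerate(tokens) if token not in stopwords]
    ngrams = []
    for start in range(len(content) - n + 1):
        first, last = content[start], content[start + n - 1]
        window = tokens[first:last + 1]
        if any(token in keywords for token in window):
            ngrams.append("_".join(window))
    return ngrams

tokens = "give to the poor and needy".split()
print(content_ngrams(tokens, 3, stopwords={"the"}, keywords={"poor"}))
# ['give_to_the_poor', 'to_the_poor_and', 'poor_and_needy']
```

With “the” as the only stopword, the four-word phrase “give_to_the_poor” is counted as a trigram, matching the example in the text.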

From the plot, we find that most documents with marginal citations are grouped tightly together in the cluster labeled 0, meaning that the distributions of their most significant citations closely resemble one another. We can infer that the overall themes and topics of the sermons in this subset are also highly related, although these sermons were preached at a variety of locations and for different occasions, from Parliament fast-day sermons to one preached at an elite academy to ones preached at funerals and marriage ceremonies. More interesting are the clear outliers: the other groups, which contain only one or two documents. In particular, we find that the documents in these groups feature very large numbers of marginal citations, and four out of five of them come from the same preacher.
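
For readers unfamiliar with k-means, the core loop can be sketched in a few lines of plain Python. This is a teaching sketch of the standard iterative procedure with fixed starting centroids, not the project's code, which presumably relied on a library implementation such as scikit-learn's. The 2D points are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its members."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = mean(members)
    # Sum of squared distances to the assigned centroid (the quantity minimized).
    inertia = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids, inertia = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 1, 1]
print(inertia)  # 1.0
```

Because the result depends on the starting centroids, running several trials with different starting points and comparing the final sum of squared distances is a common way to guard against a poor local optimum.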



For a better view of which sermons relate to each other in terms of their most distinctive marginal citations, we explore the results of hierarchical clustering, which seeks to produce a pairwise grouping of the inputs in a way that resembles a phylogenetic tree in biology. The diagram above is the product of agglomerative clustering of the marginal citations, a bottom-up approach in which each document starts as its own cluster and clusters are then successively merged into larger ones. For consistency with the parameters for k-means clustering, we use Ward linkage to minimize the variance within each cluster. The benefit of computing a hierarchy of these sermons is that it reveals similarities in citations between individual documents that the flat scatter plot above cannot. Indeed, from the labels on the left, we see that the sermons by each preacher are usually clustered close to one another. The results for the outliers are generally consistent, except for Gataker_A01529, which was grouped together with Gataker_A01554 in the k-means output but appears in a subtree that is noticeably distant from the part of the tree that contains the latter. Again, we see that Gataker_A72143 and Gouge_A01979 are very dissimilar to each other and to the other documents in the corpus.
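
The bottom-up merging with Ward linkage can be illustrated with a naive pure-Python sketch. It is not the project's code and is far slower than a library routine; the labels and 2D points are invented. At each step it merges the pair of clusters whose union increases the within-cluster sum of squares the least.

```python
def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_merges(labeled_points):
    """Start with one cluster per document and record each merge in order."""
    clusters = {label: [point] for label, point in labeled_points.items()}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
        a, b = min(pairs, key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]))
        merges.append((a, b))
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges

# Invented toy documents.
docs = {"s1": (0.0, 0.0), "s2": (0.0, 1.0), "s3": (5.0, 5.0), "s4": (9.0, 9.0)}
for step in ward_merges(docs):
    print(step)
# ('s1', 's2')
# ('s3', 's4')
# ('(s1+s2)', '(s3+s4)')
```

The recorded merge order is what a dendrogram draws: documents joined early sit close together in the tree, while clusters joined last are the most dissimilar.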


Charity-Related Citations 

To gain insight into discourses around charity and poverty from Biblical citations, we searched the extracted citations for a collection of relevant passages, a list named ‘bible_charity’ in this file: charity_citations.py. For annotations on the citations in each category, feel free to visit this document: Charity-Related Citations. As seen in the code file, we organized the hits into six themes, which together provide a broad survey of the ideas surrounding charity and the poor:

    • Helping: On helping the needy, examples of such good deeds, or explicit commands to help the poor. These include commands to sell all of one’s possessions to give alms, giving lodging to strangers, feeding the hungry, helping neighbors, and giving clothes to the naked. One should help widows, orphans, people with disabilities and illnesses, and those who are imprisoned. Moreover, acts of kindness need to be done with cheerfulness and liberality. 
      • [‘Acts 2:45’, ‘Matthew 19:21’, ‘Deuteronomy 15:7’, ‘Galatians 2:10’, ‘Isaiah 58:7’, ‘Isaiah 61:1’, ‘James 2:15’, ‘James 2:16’, ‘Job 29:12’, ‘Luke 11:41’, ‘Luke 12:33’, ‘Luke 14:12’, ‘Matthew 25:35’, ‘Philippians 2:4’, ‘Proverbs 3:27’, ‘Romans 12:8’, ‘Romans 12:13’, ‘Titus 3:14’, ‘1 John 3:17’, ‘Matthew 6:20’, ‘1 Corinthians 16:1’, ‘Luke 4:18’, ‘James 1:27’, ‘1 Corinthians 16:2’, ‘Matthew 25:36’, ‘Proverbs 3:28’, ‘Acts 11:29’, ‘Acts 11:30’, ‘2 Corinthians 8:3’, ‘Galatians 6:10’, ‘2 Corinthians 8:11’, ‘Isaiah 1:17’]
    • Performativity: On avoiding performative, insincere acts of kindness. One should not expect monetary rewards or valuable goods in return for generosity to the poor, nor should one do public displays of charity for praise. Charitable giving should be through actions, not words. 
      • [‘Matthew 6:1’, ‘Matthew 6:2’, ‘Matthew 6:3’, ‘Matthew 6:4’, ‘1 John 3:18’, ‘1 Corinthians 13:3’, ‘Romans 2:8’]
    • Punishment: On punishments for not helping when one is able to. Bodily pain during life and eternal torment after life awaits those who deliberately refuse to help the poor but have the means to help when entreated upon by others.    
      • [‘Matthew 25:42’, ‘Job 31:21’, ‘Proverbs 21:13’, ‘Matthew 25:41’, ‘Matthew 25:43’, ‘Ezekiel 16:49’, ‘Matthew 25:46’, ‘Job 31:22’, ‘Proverbs 17:5’]
    • Kinship: On all people being kin regardless of socioeconomic status. The rich and poor are both equal before God, and one should be ready to sacrifice oneself to help one’s brethren. 
      • [‘Isaiah 58:7’, ‘Proverbs 22:2’, ‘1 John 3:16’, ‘Hebrews 13:2’]
    • Solutions: On solutions to poverty and crime, such as reforming thieves by putting them to manual labor so that they can benefit others. 
      • [‘Ephesians 4:28’]
    • Godly: On charitable giving being a feature of godly character, as well as on notions of divine rewards or blessings for generosity. Attending to the poor is a form of thanksgiving to God, and one will be divinely rewarded for one’s cheerful liberality. 
      • [‘Acts 10:2’, ‘Hebrews 13:16’, ‘Matthew 10:42’, ‘Proverbs 14:31’, ‘Proverbs 11:25’, ‘Proverbs 19:17’, ‘Proverbs 11:24’, ‘Proverbs 11:26’, ‘Acts 20:35’, ‘2 Corinthians 9:6’, ‘2 Corinthians 9:7’, ‘2 Corinthians 9:9’, ‘2 Corinthians 9:8’, ‘2 Corinthians 9:10’, ‘2 Corinthians 9:11’, ‘2 Corinthians 9:12’, ‘2 Corinthians 9:13’, ‘Luke 14:14’]
There are 147 citations of charity-related passages.
      • Lines cited only once: [‘1 Corinthians 16:1’, ‘Galatians 2:10’, ‘Ephesians 4:28’, ‘Ezekiel 16:49’, ‘1 Corinthians 13:3’, ‘Acts 10:2’, ‘James 2:15’, ‘1 Corinthians 16:2’, ‘2 Corinthians 9:6’, ‘Job 29:12’, ‘Hebrews 13:2’, ‘Proverbs 3:28’, ‘Acts 11:29’, ‘Acts 11:30’, ‘2 Corinthians 8:3’, ‘Luke 14:14’, ‘Luke 11:41’, ‘Proverbs 21:13’, ‘Proverbs 11:25’, ‘Luke 12:33’, ‘Acts 2:45’, ‘2 Corinthians 8:11’, ‘1 John 3:16’, ‘Luke 14:12’, ‘Isaiah 1:17’, ‘Titus 3:14’, ‘Luke 4:18’, ‘Job 31:22’, ‘Proverbs 14:31’, ‘Proverbs 17:5’, ‘Proverbs 11:26’, ‘Deuteronomy 15:7’, ‘Proverbs 3:27’, ‘Matthew 19:21’, ‘Acts 20:35’]
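
The lookup of charity-related passages by theme can be sketched as a set-membership test. This is a simplified illustration, not the contents of charity_citations.py: the `THEMES` dictionary holds only a small excerpt of the passages listed above, and the sample sermon citations are invented.

```python
from collections import Counter

# Small excerpt of the themed passages listed above (not the full lists).
THEMES = {
    "Helping": {"Matthew 25:35", "James 1:27", "1 John 3:17"},
    "Performativity": {"Matthew 6:1", "1 John 3:18"},
    "Solutions": {"Ephesians 4:28"},
}

def theme_hits(citations, themes=THEMES):
    """Count how many of a sermon's citations fall under each theme."""
    hits = Counter()
    for citation in citations:
        for theme, passages in themes.items():
            if citation in passages:
                hits[theme] += 1
    return hits

# Invented citations for a single hypothetical sermon.
sermon = ["Matthew 25:35", "Psalms 23:1", "Matthew 6:1", "Matthew 25:35"]
print(theme_hits(sermon))
# Counter({'Helping': 2, 'Performativity': 1})
```

A passage assigned to more than one theme, as “Isaiah 58:7” is in the lists above, would be counted once under each theme by this approach.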

 

When we cluster the documents by their charity citations, we find that only 27 documents have charity-related scriptural citations, but the distribution of the clusters is only slightly different from the results of clustering all citations. The charity citations for each document can be found here: marginalia.charity.sermons.txt. Again, we have noticeable outlier documents, whose summary information of location, audience, n-grams, and charity-related citations can be found here: charity_citations_clustering_outliers.md. The most distant outliers are all located within one subtree of the dendrogram, clearly dissimilar from the others because they are the ones with the most instances of charity-related citations.

Recommendations for further research

    • One can refine the formatting rules and abbreviation dictionary on different datasets. The most ambitious project would be to process the over four thousand sermons with marginal Biblical citations in TCP. With a larger dataset, the value of computational techniques becomes clearer: one can trace Biblical citations over time, analyzing trends in the popularity and distribution of Old versus New Testament passages and topics. Moreover, many sermon titles contain information about their original preaching locations, which means that we can also analyze the spatial distribution of Biblical citations in printed English sermons.
    • Instead of treating each TCP XML file as its own document, we can split up sermon anthologies and compare only individual sermons.
    • One can analyze the marginalia for more than just instances of scriptural citations because sermon margins often contain accurate structural indicators of the sermon’s divisions, which are the preacher’s argumentative breakdown of a target scriptural text or topic (Rhatigan 443). Thus, applying text analysis to the English portions of a sermon’s marginalia may reveal important details about the structure and argument of the sermon itself.

References:

Early English Books Online (EEBO) TCP – Text Creation Partnership. https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/

“EarlyPrint.” EarlyPrint, https://earlyprint.org/

GENEVA BIBLE 1599. http://www.genevabible.org/Geneva.html

Rhatigan, Emma. “Margins of Error: Performance, Text, and the Editing of Early Modern Sermons.” The Library: The Transactions of the Bibliographical Society, vol. 21, no. 4, 2020, pp. 423–44.

Rigney, James. “Sermons into Print.” The Oxford Handbook of the Early Modern Sermon, edited by Hugh Adlington et al., Oxford University Press, 2011. Silverchair, https://doi.org/10.1093/oxfordhb/9780199237531.013.0011.

Scikit-Learn: Machine Learning in Python — Scikit-Learn 1.2.2 Documentation. https://scikit-learn.org/stable/

The Decease of Lazarus Christ’s Friend A Funerall Sermon on Iohn. Chap. 11. Vers. 11. Preached at the Buriall of Mr. John Parker Merchant and Citizen of London. By Tho. Gataker B. of D. and Rector of Rotherhith. – Early English Books Online – ProQuest. https://www.proquest.com/eebo/docview/2248543073/99835846/2C321355CC224EECPQ/2?parentSessionId=IfuDMCjoCnAzHpZTAJ9nvIkUsY%2BoMgpPJp7i8csNAuE%3D.