Like most of the tough questions we have attempted to tackle so far this semester, I think the answer is both yes and no when it comes to the question of whether literature is data. The first thing that needs to be done when addressing this question is defining what exactly we mean when we say data. Our discussion in class today revealed that there is some disagreement even among ourselves about how we should define data in this context. I think data here should refer to information (raw material) that can be fed into an algorithm, or otherwise systematically processed, so that generalized conclusions about a set can be reached.

I think the Google Books services alone are enough to prove that literature must, to some extent, be data. Digitizing books has allowed Google to analyze large amounts of literature in complex ways, revealing all kinds of interesting trends. One of the more basic examples, which we have talked about extensively in class, is the Ngram Viewer. Ngram quantifies writing by letting users chart how often a word appears in books over time. This tool is useful because a rise or fall in a word’s frequency can be lined up against historical events and trends. However, Ngram is really only turning words in isolation into data, not literature per se.
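To make the idea of "a word's frequency over time" concrete, here is a minimal sketch of the kind of counting involved. It is not how Google actually builds the Ngram Viewer; the `(year, text)` corpus format, the function names, and the three toy sentences are all my own assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency_by_year(corpus, word):
    """Return {year: share of that year's tokens that equal `word`}.

    `corpus` is a list of (year, text) pairs, a hypothetical toy format.
    """
    word = word.lower()
    hits = defaultdict(int)    # occurrences of `word` per year
    totals = defaultdict(int)  # all tokens per year
    for year, text in corpus:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        hits[year] += Counter(tokens)[word]
    return {
        year: hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Invented sentences standing in for digitized books.
toy_corpus = [
    (1850, "The whale rose and the sea fell silent"),
    (1850, "A whale is a whale"),
    (1950, "The rocket rose over the silent sea"),
]
trend = relative_frequency_by_year(toy_corpus, "whale")
```

Notice that the sketch only ever looks at one word at a time and ignores everything around it, which is exactly the limitation described above.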

The forensic analysis of J.K. Rowling’s secret book provides a better example of how literature is data. Two forensic linguistics experts ran the same tests on the book in question, on one of Rowling’s known works, and on three other British crime novels, and then compared the results. They used a generic word frequency test, similar in idea to Ngram. However, the experts also employed more complex tests, such as ones focusing on concepts and syntactic style. The ability to run literary elements like these through a computer algorithm is evidence enough for me that literature has to be thought of as data in some ways. This kind of technology could augment scholarship in the future by giving us the potential to assign new sorts of quantified characteristics to writers. For example, rather than just saying, “this writer is known for his gloomy style, long prose, etc.,” we will be able to say, “this writer falls into this particular category because he uses this certain arrangement of related words with x frequency.”
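As a rough illustration of what a word frequency comparison between texts might look like, the sketch below builds a frequency profile for each text and scores how similar the profiles are. I am not claiming this is the test the experts used: the short list of common words, the cosine similarity measure, and all of the function names are assumptions I picked to keep the example small.

```python
import math
import re
from collections import Counter

# A tiny, hand-picked list of common words (an assumption, not a standard list).
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "but"]


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency profiles (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_candidate(disputed_text, candidates):
    """Score each candidate text against the disputed one; return (best name, all scores)."""
    target = profile(disputed_text)
    scores = {
        name: cosine_similarity(target, profile(text))
        for name, text in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Invented snippets; real comparisons would use whole novels.
best, scores = closest_candidate(
    "the cat and the dog and the bird",
    {
        "A": "the sun and the moon and the stars",
        "B": "it was a dream that it was in a box",
    },
)
```

A score near 1 means two texts lean on common words in similar proportions, while a score near 0 means they barely overlap. With snippets this short the result means very little, but the same counting applied to full books is the sort of quantified characteristic I have in mind.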

At the same time, I don’t think that computers will ever be capable of fully analyzing the more sophisticated elements of good literature. Computers might become good predictors of symbolism, for example, just by analyzing past associations and running probabilities. But they will never be able to output anything that can represent the aesthetics and more intangible qualities and emotions of deep writing or poetry.