Big Data: Big Promises (with Big Problems?)
On the wonderful Tuesday evening of October 9th, the Unis present were treated to two lectures by two of our own graduate student Unis: James Johndrow, a graduate student in Statistics, and Allen Riddell, a graduate student in Literature. The topic of interest was “big data”, and we’re talking BIG (James casually mentioned that a 15-minute neuroscience experiment using an fMRI will generate, oh, about 1 billion data points).
The first lecture, by James, was titled “Big Data and the Future of Human Knowledge” and, lest you think this was going to be an optimistic paean to the endless possibilities of expanding human knowledge, he started us off with a quote from Carl Sagan’s The Demon-Haunted World. Through Sagan’s writing, we immediately receive a warning that our understanding may be lagging even as ubiquitous technological innovations such as the iPod grow ever more complex. But before the “future of human knowledge”, we get a quick overview of the history of “big data”. Big Data requires big storage space, and our capacity for storing information has grown quite rapidly, from the punch card to the floppy disk to (skipping a few years) the storage of information ‘in the cloud’ (on networks maintained through enormous server farms and accessible everywhere; think Dropbox). Finding themselves in possession of this treasure trove of information and of ever-increasing computational capacity, smart people started looking for ways of using it (a.k.a. making money off of it). The first people to find a way to profit from the new information: you guessed it, Wall Street (the ‘quants’). Soon after, a number of online businesses, such as Amazon, started intelligently using information on people like us to sell us goods. But the real explosion in the use of Big Data came in the first decade of the new millennium, when, one by one, the fields of human inquiry jumped on the bandwagon. James’s interdisciplinary streak is visible in the survey he provided of a number of these fields (he must be a Uni): biology, neuroscience, health care, government, business, Netflix, etc.
Despite the ever-expanding amount of data we have been collecting, however, there’s a catch: our computational power lags behind our data storage capacity! In fact, it lags quite far behind. James provided all the relevant formulas, and I’m sure he could share them with any Uni wanting to dig deeper into the limits of human interaction with these data monstrosities. Long story short, we can only look at the relationships between a fairly limited set of variables, and that’s bad news if your information covers any large subset of the roughly 30,000 human genes and you’re dying to measure the “dependence” among them. With this caveat and many others, we finally receive a glimpse of the future, and the future is about prediction, not estimation. Yes, dear fellow Unis, the demand for revealing the future to us in the present is as high as it was in the days of seers and soothsayers. We want to know what stocks will do in the future, what will happen to the price of oil, which people are going to need treatment for which diseases in the near future and, well, pretty much everything else! However, James warned us that people tend to ruin even the most beautiful prediction by learning about it and then acting on that knowledge in new and unpredictable ways. This serves as a natural limit on how much we can anticipate human behavior. Additionally, the most important events to predict are the really rare ones, and most of our models have built-in assumptions that tend to deal badly with the occasional financial crisis or oil shock (some of the blame goes to our affection for the beautiful-to-work-with normal distribution). James ended his talk on a note of realism, dividing our questions about the future into categories according to the difficulty of prediction (easy, medium and hard).
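For any Uni who wants a concrete feel for those two warnings, here is a minimal Python sketch of my own (not one of James’s formulas from the lecture; the “6-sigma shock” comparison below is simply an illustrative assumption). It counts how many pairwise dependence checks 30,000 genes would demand, and compares how likely a thin-tailed normal model and a heavier-tailed model think an extreme event is.

```python
# A back-of-the-envelope sketch of two of James's warnings (my own illustration,
# not a formula from the lecture; the "6-sigma shock" is an invented example).
import math
from scipy import stats

# 1) Dependence blows up combinatorially: checking every pair of ~30,000 genes
#    already means hundreds of millions of comparisons.
n_genes = 30_000
n_pairs = math.comb(n_genes, 2)
print(f"{n_pairs:,} pairwise dependence checks")  # 449,985,000

# 2) Rare events: a thin-tailed normal model says a 6-standard-deviation shock
#    is essentially impossible, while a heavier-tailed Student-t model
#    (3 degrees of freedom) treats it as rare but very much real.
p_normal = stats.norm.sf(6)    # ~1e-9
p_heavy = stats.t.sf(6, df=3)  # ~5e-3, millions of times larger
print(f"normal: {p_normal:.1e}   student-t(3): {p_heavy:.1e}")
```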
The second lecture, by Allen, was titled “Big Data and the Humanities: How to Read 22,198 Journal Articles”. You may expect the answer to the ‘how to read’ part to be ‘very slowly’, but Allen has been working out a very interesting way to overcome the great difficulties of this task. There are many very large collections in the great human cultural edifice, from collections of ancient texts or religious manuscripts to collections of novels, letters and diaries from particular times and places. These large collections have traditionally demanded that scholars dedicate an entire life to reading them in order to absorb the information collected within. One famous example of someone performing this task of ‘direct reading’ (as Allen refers to it) is Laurel Thatcher Ulrich, whose book “A Midwife’s Tale: The Life of Martha Ballard, Based on Her Diary, 1785–1812” won a Pulitzer Prize. The book grew out of the author’s nearly ten-year-long reading of the diary of Martha Ballard, a woman who wrote a diary entry virtually every single day for decades. While such gargantuan tasks are to be admired, one can easily see how frustrating they can be to undertake. An alternative strategy developed by scholars is called ‘collaborative reading’ and has been part of the Genre Evolution Project – a database studying portrayals of technology in various short stories and classifying them according to broad categories (a positive or negative portrayal of technology, for example). Collaborative reading again involves the actual act of reading the short stories, but a group of researchers is assigned partly overlapping and partly independent reading lists, significantly reducing the time required. However appealing these first two methods might be when the works under consideration are fascinating literature, there are many cases in which scholars either simply cannot physically read, or probably should not dedicate the time to reading, the enormous amount of material required to answer a question.
This very impediment was present in Allen’s own research on trends in the history of German Studies. Collating the four main German Studies journals, with articles dating from around 1928 to the present, he was faced with the task of reading 22,198 journal articles. And, as if this weren’t bad enough, the kind people at JSTOR (for copyright and other reasons) provided him with only the word counts for each article, in a completely unordered pile and frequently with disastrous transcriptions of the original German. He was faced with an enormous ‘bag of words’. Fortunately, Allen represents a rare breed, the humanist with quantitative skills, and so he implemented a model with a suitably sophisticated name (I believe it was ‘Latent Dirichlet Allocation’) in order to make sense of this very big dataset of words. As an example, Allen masterfully demonstrated how one can use the chapter-by-chapter counts of the words “Elizabeth” and “Darcy” in “Pride and Prejudice” to get a sense of the importance of the characters as we move through the novel. Mind you, Allen is not advocating that we substitute the word-count method for ‘direct reading’ on a mass scale! It is merely meant to serve as a useful tool for uncovering trends and supporting one’s arguments empirically. In the case of studying trends in German Studies, the dataset is significantly vaster than two words, which is where the previously named Dirichlet comes in. Quite miraculously for the uninitiated into these statistical mysteries, Allen succeeds in obtaining angles between different words represented as vectors, and then plots for us various trends in the history of German Studies, showing us a marked decline in the study of the German language and a marked rise in literary analysis. The Grimm Brothers also make an appearance, showing a large upward swing in importance around their bicentennial and confirming that the method works quite well.
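For the statistically curious, here is a toy Python sketch (my own, not Allen’s actual pipeline; the four miniature “articles” are invented stand-ins for the JSTOR word counts). It shows the two ingredients described above: fitting a Latent Dirichlet Allocation topic model to a bag of words, and measuring the angle between words represented as count vectors.

```python
# A toy sketch of the two ingredients above (my own, not Allen's actual pipeline;
# the four miniature "articles" are invented stand-ins for the JSTOR word counts).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "grammar syntax language teaching classroom exercises drills",
    "novel narrative irony Goethe literary analysis interpretation",
    "fairy tale Grimm folklore narrative motif interpretation",
    "language acquisition vocabulary grammar teaching exercises",
]

# The 'bag of words': word order is thrown away and only per-article counts
# survive, which is exactly the unordered pile JSTOR supplied.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(articles)

# Fit a two-topic Latent Dirichlet Allocation model; each article becomes a
# mixture of topics (rows sum to 1), e.g. language pedagogy vs. literary analysis.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
article_topics = lda.fit_transform(counts)
print(article_topics.round(2))

# "Angles between words represented as vectors": treat each word as a vector of
# counts across articles and compare directions via cosine similarity.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

X = counts.toarray()
col = vectorizer.vocabulary_
print(cosine(X[:, col["grammar"]], X[:, col["language"]]))   # similar usage -> near 1
print(cosine(X[:, col["grammar"]], X[:, col["narrative"]]))  # disjoint usage -> near 0
```

On real data, each row of the count matrix would be one of the 22,198 articles, and tracking the topic shares year by year is, presumably, roughly how trend lines like the decline of language study get drawn.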
The lectures were great, but unfortunately question time had to be cut a little short. The themes raised included quantum computing, protein structures, stock prices and academic writing. Just what you would expect from an interdisciplinary group like the Unis!