Topic Modeling for beginners

Ever since I attended the Digital Design Research workshop, I’ve been hearing a lot about Mallet. So when I wrapped up my previous topic, I decided to make Mallet my next experiment. When I talked to Will about this, he told me, “Mallet is a useful tool when you want to do topic modeling on a large corpus of data. Say you have thousands of documents and you want to find the recurring topics in them: Mallet is your best option.” He went on to say, “But the documentation of Mallet is too technical and slightly complex for a non-technical person to understand.” After hearing this, my goal was to create a tool guide in layman’s terms, but before that I had to understand the basics.

What is Topic Modeling? Who is it useful for?

Topic modeling is a way to analyze large volumes of unlabeled text to discover “topics”. A “topic” consists of a cluster of words that frequently occur together. Topic modeling is predominantly used by scholars and researchers. For example, when you have a large data set and do not have the time or resources to read through it manually, topic modeling will help you discover the recurring topics, and you can explore these topics rather than reading through every document.

After reading this, I was thinking to myself, “How does Mallet do this?” So I started reading through the Mallet tutorials, and there were instances when I didn’t understand how things worked. For example, when I was looking into text representation, I came across hidden Markov models. I was updating Liz on what I’d found and she went, “Wait. What is this model? How does it work?” I was at a loss because I couldn’t quite explain it in simple terms. So I looked it up and found that it is a statistical model in which a system moves between hidden states that cannot be observed directly. Each state produces an output that can be observed, and the next state depends only on the current state, not on the states that came before it. The official Mallet documentation assumes that everyone is aware of these mathematical models.
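To make the “current state only” idea concrete, here is a minimal Python sketch of a toy hidden Markov model. The states, words, and probabilities are all made up for illustration; this is not how Mallet is implemented, just the general idea.

```python
import random

# Hypothetical toy model: the hidden state is what a text is "about",
# and the observed output is a single word.
TRANSITIONS = {
    "sports": {"sports": 0.8, "politics": 0.2},
    "politics": {"sports": 0.3, "politics": 0.7},
}
EMISSIONS = {
    "sports": {"goal": 0.6, "team": 0.3, "vote": 0.1},
    "politics": {"vote": 0.6, "law": 0.3, "team": 0.1},
}


def pick(distribution, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def next_state_distribution(history):
    """Markov property: only the last state in the history matters."""
    return TRANSITIONS[history[-1]]


def generate(length, start="sports", seed=0):
    """Return (hidden_states, observed_words), each with `length` items."""
    rng = random.Random(seed)
    states, words = [start], []
    for _ in range(length):
        # The current hidden state emits a visible word...
        words.append(pick(EMISSIONS[states[-1]], rng))
        # ...and then the model moves on, looking only at the current state.
        states.append(pick(next_state_distribution(states), rng))
    return states[:-1], words


if __name__ == "__main__":
    hidden, observed = generate(8)
    print("observed words:", observed)
    print("hidden states: ", hidden)
```

Someone looking at the output of a real model only sees the words; the states are what the model has to infer.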

Working with Mallet

Now that I had the hang of how Mallet functions theoretically, I wanted to experiment with the tool and gauge my level of understanding. Since Mallet is a command-line-only program, it requires some familiarity with the Unix shell or the Windows command prompt. Here’s a nice tutorial that I came across, which is specifically designed for those with no previous experience using the command line. This tutorial will give you a step-by-step rundown of how to run your first topic modeling routine.

Basic topic modeling using Mallet
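For orientation, a basic run boils down to two Mallet commands: one that imports a folder of text files and one that trains the topics. The sketch below assembles both in Python. The folder and file names are hypothetical placeholders, and it assumes Mallet is already installed and that the script is run from inside the Mallet directory (on Windows the launcher would be `bin\mallet` instead).

```python
import subprocess


def import_command(input_dir, mallet_file, mallet="bin/mallet"):
    """Build the command that turns a folder of .txt files into a Mallet file."""
    return [
        mallet, "import-dir",
        "--input", input_dir,
        "--output", mallet_file,
        "--keep-sequence",       # keep word order, needed for topic modeling
        "--remove-stopwords",    # drop very common words such as "the"
    ]


def train_command(mallet_file, num_topics=20, mallet="bin/mallet"):
    """Build the command that trains the topic model on the imported file."""
    return [
        mallet, "train-topics",
        "--input", mallet_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", "topic_keys.txt",   # hypothetical file name
        "--output-doc-topics", "doc_topics.txt",   # hypothetical file name
    ]


def run_basic_topic_model(input_dir, num_topics=20):
    """Run both steps in order; stops with an error if either step fails."""
    mallet_file = "corpus.mallet"  # hypothetical intermediate file name
    steps = (
        import_command(input_dir, mallet_file),
        train_command(mallet_file, num_topics),
    )
    for command in steps:
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_basic_topic_model("my_texts")` would, under these assumptions, leave the top words for each topic in `topic_keys.txt` and the per-document topic proportions in `doc_topics.txt`.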

While following these steps, I hit a few snags that sent me off to look up a bunch of documents to figure out how to proceed. I found this process very time-consuming and strenuous, and it got me thinking, “How difficult would this be for researchers who don’t have in-depth knowledge of Mallet and have never worked with the command line before?” I threw this question out and Liz asked, “Is there a tool with some kind of an interface that might be helpful for first-time Mallet users?” I spent the next 15 minutes looking for this “tool” and came across a graphical user interface tool for topic modeling.

Topic Modeling GUI tool

Topic Modeling using an interface

This is a relatively simple tool for topic modeling. After downloading the tool, you specify the document or directory of documents you want to run topic modeling on and the place where you want the results to be stored. The tool offers default values, but you can also choose the number of topics you want to display, the number of iterations (if it is set to 5, the tool makes 5 passes over the documents before settling on the topics), and the topic proportion threshold (if the threshold is set to 0.05, a topic is listed for a document only when it makes up at least 5% of that document). Now all you have to do is hit the “Learn Topics” button, and the tool will automatically start training on the documents and give you an HTML/CSV output file with a list of topics. Varying the number of topics and the topic proportion threshold refines the resulting topics. I found this method easier than working with Mallet because there is no command-line work involved and you do not have to remember any specific instructions.
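Here is a small Python sketch of what the topic proportion threshold does for a single document. The topic names and proportions are invented for illustration, and the function is my own stand-in rather than the tool’s actual code.

```python
def topics_above_threshold(doc_topic_proportions, threshold=0.05):
    """Keep only topics that make up at least `threshold` of one document.

    `doc_topic_proportions` maps a topic name to the share of the document
    assigned to that topic (the shares add up to roughly 1.0).
    """
    kept = {
        topic: share
        for topic, share in doc_topic_proportions.items()
        if share >= threshold
    }
    # List the strongest topics first, as a results table usually would.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical proportions for one document.
    document = {"topic_0": 0.62, "topic_1": 0.30, "topic_2": 0.05, "topic_3": 0.03}
    for topic, share in topics_above_threshold(document, threshold=0.05):
        print(topic, share)
```

With a threshold of 0.05, `topic_3` (3% of the document) is dropped and the other three are kept. Raising the threshold to 0.1 would also drop `topic_2`.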

List of topics generated using the tool

Ultimately, if you are dealing with topic modeling for the first time, I would recommend that you work with the Topic Modeling Tool rather than Mallet, at least until you are familiar with how topic modeling works.