Beck Center Grad Fellow Sara Palmer tells us how MALLET works and what you can (and cannot) do with Topic Modeling
This past summer I enjoyed working with a tool called MALLET (MAchine Learning for LanguagE Toolkit) that can generate a series of potential topics in a corpus of texts. Created by Andrew McCallum at the University of Massachusetts, it is an open source toolkit that is fairly easy to download and install. While the setup instructions on the website are minimal, Shawn Graham has written a helpful overview of how to get started.
Topic modeling is based on the idea that within a set of related texts, certain words will tend to appear together in the same documents. MALLET works on documents like its acronymic namesake by shattering texts into an array of words. It then applies a statistical method called Latent Dirichlet Allocation to put related words into clusters.
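To make the "shattering" step concrete, here is a minimal sketch in Python of turning a text into word counts. It is not MALLET's own code and it does not perform Latent Dirichlet Allocation; the function name, the tiny stopword list, and the sample sentence are all invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "with", "her", "she"}

def bag_of_words(text):
    """Lowercase the text, split it into words, drop stopwords, and count what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS)

# Invented sample sentence, not taken from any corpus discussed here.
sample = "She threaded the needle and began stitching the fabric; the needle was sharp."
counts = bag_of_words(sample)
print(counts.most_common(3))
```

Word order is thrown away at this point: only which words appear, and how often, is kept for the clustering step.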
These related groups of words, or ‘word bags’ as they are often called, can be interpreted as making up a topic. For example, the group of words ‘needle,’ ‘stitching,’ and ‘fabric’ could reasonably be labeled as the topic ‘sewing.’ Of course, scholars familiar with a corpus can readily name a list of topics, but what if you’re working with a large set of texts and don’t have time to read all of them carefully? MALLET allows you to quickly obtain a general idea of what topics may be present.
Perhaps the most prominent example of topic modeling with MALLET is Cameron Blevins’s work on the diary of Martha Ballard, an American midwife who wrote daily entries for 27 years. Blevins’s topic model identified key themes in the diary such as ‘death,’ ‘shopping’ and ‘gardening’ and graphed their prominence over the course of Ballard’s life. These charts often corresponded with known biographical details. For example, the topic ‘emotion’ peaked between 1803 and 1804 when her husband was imprisoned for debt and her son was indicted for fraud.
The field of DH itself has also been put under the MALLET “macroscope.” Elijah Meeks’s work visualizes the network of associated ideas in the self-definitions of digital humanists, while Matt Jockers analyzes the work performed in the field by culling themes from the blog posts of 117 digital humanists on a single day (March 18, 2010). The Maryland Institute for Technology in the Humanities (MITH) has an excellent overview of how topic modeling has been used in the humanities. Other prominent topic modeling projects featured on the MITH blog include Travis Brown’s work on Austen and Byron and Jeff Drouin’s work on Proust.
Interest in topic modeling has grown at DiSC since it hosted a talk by Robert Nelson from the University of Richmond this past January. In his research on the American Civil War, Nelson ran MALLET on issues of the Richmond Daily Dispatch newspaper from 1860 to 1865. He was able to graph and contextualize topics such as fugitive slave ads and military recruitment, the results of which are illustrated beautifully on the project’s website. For each topic Nelson has a separate page displaying graphs of its prevalence over time, the most prominent keywords, and the articles, ranked by compositional percentage, that most clearly exemplify the topic.
These two components, keywords and compositional percentage, can be derived from the MALLET output files ‘topic-keys’ and ‘doc-topics,’ respectively. At the Beck Center we wanted to see how this output might be used for understanding the journal Southern Changes, published monthly by the Atlanta-based Southern Regional Council from 1979 through 2003. With 110 issues containing 978 articles, it is considerably smaller than the data set Nelson worked with but still large enough to make it a good fit for MALLET’s method. We hoped to identify key terms with which to catalogue the collection and also to generate topical groups of articles that could be featured on the website. The results were encouraging: I found that the themes suggested by the topic keyword lists roughly corresponded with the subjects provided by Allen Tullos, editor of Southern Changes from 1982 through 2003. Tullos writes:
“The articles in Southern Changes range across many subjects: racial justice and the freedom struggle, voting rights, educational opportunity, economic democracy, social equality and inclusion, women's rights, environmental justice, critical regional studies, regional-global issues, and popular culture.”
After discovering a ten-topic run of MALLET to be a bit broad, I opted for 20 topics. In THIS TABLE I have labeled the topic keyword lists with my overall impressions in black. The generally correspondent Tullos subjects—labeled in green—are included for most of the MALLET topics. This is, of course, a very rough sketch but it does appear to suggest correlation between MALLET data and expert knowledge.
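For readers who want to try a 20-topic run themselves, the sketch below assembles a MALLET `train-topics` command in Python. It only builds the argument list and does not run anything. It assumes the corpus has already been imported into MALLET's own format; the file names and the `bin/mallet` path are placeholders, not the ones used for this project.

```python
def build_train_topics_command(corpus_file, num_topics, num_iterations,
                               keys_file, doc_topics_file):
    """Return the argument list for a MALLET train-topics run (nothing is executed)."""
    return [
        "bin/mallet", "train-topics",           # path assumes you are in the MALLET folder
        "--input", corpus_file,                 # corpus previously imported into MALLET format
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", keys_file,       # top keywords for each topic
        "--output-doc-topics", doc_topics_file  # topic composition of each document
    ]

# Hypothetical file names, chosen only for this example.
cmd = build_train_topics_command(
    "southern_changes.mallet", 20, 10000, "topic-keys.txt", "doc-topics.txt"
)
print(" ".join(cmd))
```

The printed line can be pasted into a terminal, or the list can be handed to Python's `subprocess.run(cmd, check=True)` on a machine where MALLET is installed.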
As a model built on sampling and probability, MALLET naturally does not generate exactly the same output each time it is run. The topics and keywords do vary but are generally similar from one run to the next. Below is a table comparing the keywords from the ‘Voting Rights’ topic in the previous example to those produced in five additional runs. The original run was based on 10,000 iterations, but for the sake of time I used 1,000 iterations on the additional trials. VIEW THOSE RESULTS HERE
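One rough way to put a number on "generally similar" is to measure how many keywords two runs share. The sketch below is my own illustration rather than part of MALLET: it assumes each topic-keys line holds a topic number, a weight, and a space-separated keyword list, divided by tabs, and the two sample lines are invented, not the Southern Changes results.

```python
def parse_topic_keys_line(line):
    """Split one topic-keys line into (topic_id, keywords).

    Assumed layout: topic id, a weight, then the keywords, separated by tabs.
    """
    topic_id, _weight, words = line.rstrip("\n").split("\t")
    return int(topic_id), words.split()

def jaccard(words_a, words_b):
    """Share of distinct words common to both lists (1.0 means identical sets)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Invented lines for illustration only.
run_1 = "7\t0.25\tvoting rights election black county voters"
run_2 = "3\t0.25\tvoting election voters district rights court"

_, keys_1 = parse_topic_keys_line(run_1)
_, keys_2 = parse_topic_keys_line(run_2)
print(round(jaccard(keys_1, keys_2), 2))
```

Note that the topic numbers differ in the two sample lines. Numbering is not stable from run to run, so topics have to be matched by their keywords rather than by their ID.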
While the topic-keys file outlines the word composition of the topics, the doc-topics file indicates the topic composition of each document. It simply lists, in descending order, the ID number of each topic and the percentage of the document that it makes up. CLICK HERE for a table of the documents in which ‘Voting Rights’ comprises over 40% of the composition.
From this one can get a clearer picture of what kind of content is picked up by MALLET for the ‘Voting Rights’ topic and see that the topic is most prevalent in the journal around the 1980 and 2000 election cycles.
In sum, MALLET did what we wanted it to do: we obtained key terms and a general overview of topics present. My own attempts to graph the topics over time did not illustrate trends with nearly as much clarity as I would have hoped. However, my objective is not to prove that MALLET is some kind of magic bullet, ripping through a corpus to the heart of its intelligibility. Rather, I found that the output, when interpreted cautiously, can offer an excellent starting point for more detailed content analysis.