In my last post I discussed how Paper Machines, the text analysis add-on for Zotero, can help you visualize your research. Some of Paper Machines' features are pretty self-explanatory, but others are less intuitive. Here I've tried to expand on some of the potentially complicated aspects of Paper Machines to supplement the documentation available on the developer's site.
Paper Machines is available for Zotero Standalone and Mozilla Firefox. To install the Paper Machines add-on in Firefox, download the XPI file, then load it by navigating to tools → add-ons → install add-ons → get add on from file. In Zotero Standalone, navigate to tools → add-ons → gear icon → install add-on from file.
Once you've installed the add-on, you can adjust various default settings.
You can analyze the contents of your Zotero library by right clicking on any collection and selecting “Extract Text for Paper Machines.” Once the text is extracted, you have the option of running various processes and viewing the corresponding visualizations.
Paper Machines’ default word cloud is automatically displayed in the lower-left corner of the Zotero pane. You can also compare sets of text using multiple word clouds, which can be divided either chronologically or by subcollection. For this option you need to choose one of several filter methods:
- None produces a simple word cloud based on raw frequency.
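- Tf*idf weights each word by how often it appears in a text relative to how many texts in the corpus contain it, so words that are common everywhere are filtered out as unimportant.
- Dunning’s log-likelihood measures how much more (or less) often a word occurs in one corpus of text than its frequency in another corpus would lead you to expect.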
- Mann-Whitney U assesses how consistently a given term appears in one corpus versus another. Here’s a good post about the differences between Dunning’s log-likelihood and Mann-Whitney U.
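To make the Tf*idf and Dunning's log-likelihood filters a little more concrete, here is a minimal Python sketch of the arithmetic behind them. It is not Paper Machines' own code: the function names `tf_idf` and `dunning_g2` are my own, the texts are assumed to be already split into lists of words, and the log-likelihood uses the common two-term form of the statistic.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Term frequency times the log of the inverse document frequency.
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def dunning_g2(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one word in corpus A versus corpus B."""
    # Expected counts if the word were equally frequent in both corpora.
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical two-document corpus, invented for illustration.
docs = [["whale", "ship", "sea"], ["whale", "garden", "tea"]]
scores = tf_idf(docs)
```

In this toy corpus "whale" appears in both documents, so its Tf*idf score is zero and it would drop out of a filtered word cloud, while "ship" keeps a positive score. Likewise, `dunning_g2` returns a value near zero when a word is equally frequent in both corpora and grows as the frequencies diverge.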
Using the MALLET toolkit, Paper Machines can determine what topics (derived from groups of words that appear together) arise most frequently in your text. Topics can be charted over time (in days), within specific subcollections, or by mutual information. You can also adjust the topic modeling settings, including:
- Tf*idf (See above.)
- Porter stemming modifies words by removing their suffixes. “Worked” and “working,” for example, would both be counted under the word “work.”
- JSTOR for Data Research uses data from JSTOR to supplement the data in your Zotero library. You must have a JSTOR account to use this function.
- Number of Iterations (under "Advanced Options"): Paper Machines defaults to 1000; the larger the number of iterations, the longer the sampling will take; smaller numbers will produce lower-quality models.
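As a rough illustration of what stemming does to word counts, here is a toy Python sketch. It is deliberately much simpler than the real Porter algorithm (which applies several ordered rule sets), and the suffix list and function names are my own assumptions for the example.

```python
from collections import Counter

# Suffixes to strip, longest first. This is a toy list for illustration,
# NOT the rule set of the real Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_counts(words):
    """Count words after stemming, so inflected forms share one entry."""
    return Counter(toy_stem(word) for word in words)


counts = stemmed_counts(["Worked", "working", "work", "works", "paper"])
# counts["work"] is 4 and counts["paper"] is 1
```

With stemming turned on, the four forms of "work" are counted together as one word, which is the same effect the Porter stemming option has on the words that feed into a topic model.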
There are a number of other adjustable fields under "Advanced Options," but the default settings should work well for almost everyone. If you're interested in delving into the mechanics of topic modeling, I'd suggest starting with this post from The Programming Historian 2, as well as "A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences," a book chapter by Douglas Oard.
Certain Paper Machines functions, such as Periodical PDF Import and Classifier, are still in the experimental phase, so I'll explore them after they've been updated further. Be sure to select "automatically update" under the add-on preferences so you can benefit from the functionality that's continually being added to Paper Machines.