Digital Scholarship Commons

Software Training Series: Lynda.com Working Group

We covered Microsoft Excel during our first Lynda.com Working Group Series



Related Story:

Software Training with Lynda.com


5:00pm usually signifies the end of the workday, but for dozens of Emory faculty, staff, and students, the early evening has become an opportunity to squeeze in some much-needed supplemental software training.

During the month of October, 40 people from across the university gathered in DiSC weekly for the first ever Lynda.com working group.

Lynda.com is a video learning site stocked with tutorials on hundreds of websites, software applications, and computing functions. Emory has maintained a paid subscription to Lynda.com for some time, but only a handful of people have become regular users.

For the inaugural group, we chose to focus on Microsoft Excel, an application useful in areas ranging from humanities and social science research to human resources and administration. Though dozens of hours' worth of Excel training are available on Lynda.com, the sessions focused on broadly applicable functions like sorting, filtering, pivot tables, and database management.

"I've known that Lynda is available since I started working at Emory," notes one attendee, "but I never got around to using it until I signed up for the scheduled sessions. It's easier to stick to a training schedule this way."

Emory is currently transitioning to a workstation-based subscription to Lynda.com. Beginning later this month, anyone with an Emory ID will be able to access the site through designated workstations in the Woodruff Library (in ECIT and the Research Commons) and at the Health Sciences Library.

We're currently accepting suggestions for our next training series. If you'd like focused training on a particular piece of software, let us know.

Authored By: 

Sarita Alami is a Graduate Fellow at DiSC.

Lynda.com is a video learning service that provides tutorials for a wide range of software.

Supercharge Your Zotero Library Using Paper Machines: Part II


Visualization of the first ten topics extracted from a Zotero collection.



Related Story:

Supercharge your Zotero Library Using Paper Machines: Part I

Related Links: 

Paper Machines

Zotero


In my last post I discussed how Paper Machines, the text analysis add-on for Zotero, can help you visualize your research. Some of Paper Machines' features are pretty self-explanatory, but others are less intuitive. Here I've tried to expand on some of the potentially complicated aspects of Paper Machines to supplement the documentation available on the developer's site.

Getting Started
Paper Machines is available for Zotero Standalone and Mozilla Firefox. To install the Paper Machines add-on in Firefox, download the XPI file, then load it by navigating to Tools → Add-ons → Install Add-ons → Get Add-on From File. In Zotero Standalone, navigate to Tools → Add-ons → gear icon → Install Add-on From File.

Once you've installed the add-on, you can adjust various default settings.

  

You can analyze the contents of your Zotero library by right-clicking on any collection and selecting “Extract Text for Paper Machines.” Once the text is extracted, you have the option of running various processes and viewing the corresponding visualizations.

Word Cloud
Paper Machines’ default word cloud is automatically displayed at the lower left corner of the Zotero pane. You can also compare sets of text using multiple word clouds, which can be divided either chronologically or by subcollection. This option requires that you select among multiple filter methods:

  • None produces a simple word cloud based on raw frequency.
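  • Tf*idf (term frequency × inverse document frequency) down-weights words that appear in most documents of the corpus, so terms that do little to distinguish one text from another drop out.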
  • Dunning’s log-likelihood measures the probability of a word occurring in one corpus of text versus another.
  • Mann-Whitney U assesses how consistently a given term appears in one corpus versus another. Here’s a good post about the differences between Dunning’s log-likelihood and Mann-Whitney U.
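
To make two of these filters more concrete, here is a minimal Python sketch of tf*idf and of Dunning's log-likelihood. It is an illustration under stated assumptions, not Paper Machines' own code: the function names and sample texts are made up, tf*idf uses the plain tf × log(N/df) weighting, and the log-likelihood uses the two-term form common in corpus linguistics.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document (each doc is a list of tokens)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Term frequency times inverse document frequency.
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G2) for one word in corpus A versus corpus B."""
    # Expected counts if the word were equally common in both corpora.
    combined_rate = (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, total in ((count_a, total_a), (count_b, total_b)):
        expected = total * combined_rate
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical two-document corpus.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
scores = tf_idf(docs)
print(scores[0])                         # "the" scores zero: it is in every document
print(dunning_g2(50, 1000, 5, 1000))     # large value: the word favors corpus A
print(dunning_g2(10, 1000, 10, 1000))    # about zero: no difference between corpora
```

A word that appears in every document gets a tf*idf score of zero, which is why the filter strips common words from a word cloud. Dunning's score grows as a word's frequency in one corpus diverges from its frequency in the other.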

Topic Modeling
Using the MALLET toolkit, Paper Machines can determine what topics (derived from groups of words that appear together) arise most frequently in your text. Topics can be charted over time (in days), within specific subcollections, or by mutual information. You can also adjust the topic modeling settings, including:

  • Tf*idf (See above.)
  • Porter stemming modifies words by removing their suffixes. “Worked” and “working,” for example, would both be counted under the word “work.”
  • JSTOR for Data Research uses data from JSTOR to supplement the data in your Zotero library. You must have a JSTOR account to use this function.
  • Number of Iterations (under "Advanced Options"): Paper Machines defaults to 1000. The larger the number of iterations, the longer the sampling will take, but smaller numbers will produce lower-quality models.
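
The effect of stemming on word counts can be sketched in a few lines of Python. Note that this is a deliberately naive suffix stripper and not the Porter algorithm itself, which applies a much longer series of rules (for example, reducing "running" to "run"). The function name and word list are hypothetical.

```python
from collections import Counter


def naive_stem(word):
    """Strip a few common English suffixes. NOT the full Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words like "red" survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical tokens: four forms of "work" plus one unrelated short word.
tokens = ["Worked", "working", "work", "works", "red"]
counts = Counter(naive_stem(token) for token in tokens)
print(counts)
```

With stemming switched on, the four forms of "work" collapse into a single entry, so a topic model or word cloud treats them as one term.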


Right click on a Zotero collection for the basic Paper Machines menu.


There are a number of other adjustable fields under "Advanced Options," but the default settings should work well for almost everyone. If you're interested in delving into the mechanics of topic modeling, I'd suggest starting with this post from The Programming Historian 2, as well as "A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences," a book chapter by Douglas Oard.

Certain Paper Machines functions (for example, Periodical PDF Import and Classifier) are still in the experimental phase, so I'll explore them after they've been updated further. Be sure to select "automatically update" under the add-on preferences so you can benefit from the functionality that's continually being added to Paper Machines.

Authored By: 

Sarita Alami is a Graduate Fellow at DiSC.

How to use Paper Machines, the add-on that incorporates a range of text visualization tools into your Zotero library.

DiSCussion with Amy Earhart: Digital Canon(s) and Lost Texts



Related Links: 

DiSC Events Calendar


For the first Digital Scholarship Commons (DiSC) talk of the year, Amy Earhart, Asst. Professor of English at Texas A&M, will speak on "Recovering the Recovered Text: Digital Canon(s) and Lost Texts." The talk will take place at 4pm on 30 October in the Research Commons of the Woodruff Library.

In short, Earhart describes the difficulty of recovering the many digital projects from the 1990s that featured the work of people outside the canon, since many of these projects no longer work. See an abstract for the talk below.

In addition to her lecture, Earhart will also lead a lunchtime discussion / Q&A with graduate students about preparing for careers in the digital humanities. That discussion will take place at noon on 30 October in the Research Commons, and lunch will be provided. Please RSVP using this form if you plan to attend so we can plan for the right amount of food.

Abstract

This talk examines the state of the current digital humanities canon, provides a historical overview of the decline of early digitally recovered texts designed to expand the literary canon, and offers suggestions for how the field might expand the digital canon. The early wave of small recovery projects has slowed and, even more troubling, the extant projects have begun to disappear. We should find it troubling that the digital canon is losing the very texts that mirrored the revised literary canon of the 1980s. If we lose a large volume of these texts, and traditional texts such as Whitman, Rossetti, and Shakespeare are the highlighted digital literary texts, we will be returning to a new critical canon that is incompatible with current understandings of literature.

We need to examine the canon that we, as digital humanists, are constructing, a canon that skews toward traditional texts and excludes crucial work by women, people of color, and the GLBTQ community. We need to reinvigorate the spirit of previous scholars who believed that textual recovery was crucial to their work, who saw the digital as a way to enact changes in the canon. Preservation of existing digital recovery projects needs to begin immediately.

Authored By: 

Brian Croxall, Digital Humanities Strategist and Lecturer of English
brian.croxall@emory.edu

Research Commons, Robert W. Woodruff Library
540 Asbury Circle, Atlanta, GA 30322


Supercharge Your Zotero Library Using Paper Machines: Part I


Topic Modeling output for a Zotero collection using Paper Machines


Paper Machines, the add-on that integrates a range of text analysis tools into Zotero, has generated quite a buzz in the short time since its release. For those of us who store notes, citation information, PDFs, and article links in huge Zotero libraries, Paper Machines has the potential to be a game-changer in terms of how we visualize our research.

Because Paper Machines is so new, it's being updated with added functionality every few days. I'll provide step-by-step documentation for how to use specific components of Paper Machines in Part II of this post. For now, I'll discuss whether or not Paper Machines might be a good fit for your research, the tools that it offers, and how it might help your work.

Paper Machines provides a broad range of text analysis tools, but it's not meant for everyone's research. You'll probably benefit most from Paper Machines if you:

  1. Already use Zotero to manage your sources. Paper Machines draws on a number of open source tools available elsewhere on the web. If you want to visualize your data but aren't already comfortable using Zotero, you might want to look elsewhere.
  2. Have a relatively large or robust Zotero library. At the time of this posting, Paper Machines incorporates the full text of Web snapshots and OCR'd PDF files into its text analysis, as well as the title, place, date, and subcollection of a source. The option to include notes, tags, and links to live websites will be available shortly.
  3. Are collaborating on a Zotero library with a group. Paper Machines is very good at helping you figure out the contents of a collection. If you're working on a collection with multiple group members, it's a quick way to visualize what kinds of material your collaborators are adding.

What kinds of analysis tools does Paper Machines employ?

  • A word cloud with the option to filter out commonly used words.
  • Phrase nets, which allow you to visualize relationships between common words in your text (for example, x and y; x is y).
  • A Geoparser, which uses location information to produce beautiful visualizations of the places mentioned in your texts.
  • DBpedia Annotation, which produces a visualization of what people, places, and things are mentioned in your texts.
  • MALLET-based topic modeling, which generates visualizations based on commonly occurring topics in your texts. The author offers some additional information about Paper Machines' use of topic modeling here.

What can Paper Machines help you do?

  • Assess the contents of a collection. Looking through the Paper Machines results is a helpful way to get to know the contents of a group library or to get reacquainted with a collection that you haven't used for a while.
  • Identify gaps in your material. Reviewing the MALLET output for a specific collection in my Zotero library (canonical works in US history) I noticed a surge in books about women's labor history (which MALLET identified using the terms women, labor, work, and activism) during the 1980s. I also noticed a lack of items in my library about these topics since 2000.
  • Compare collections. Analyzing two collections with Paper Machines makes similarities and differences evident. Using topic modeling, for example, I could see what subjects came up most frequently in the two collections and if they coincided. The word cloud function is the easiest way to identify concurrent terms and subjects at a glance.
  • Find patterns in your collections. Using the "Phrase Net" function, I conducted an "x is y" analysis on one of my collections. I was surprised to see that "democracy is necessary" and "Cold War is necessary" were recurring phrases in a number of sources.

The Geoparser links texts in a Zotero collection to the places that they mention.

Additional examples of Paper Machines visualizations are available on the developer's site. The add-on is available for Firefox and Zotero Standalone, and visualizations can also be saved as HTML files. While the occasional error or puzzling result is inevitable early on, the creator of Paper Machines is constantly tweaking the interface in response to feedback from users.

Authored By: 

Sarita Alami is a Graduate Fellow at DiSC.

Paper Machines is an add-on that incorporates a broad range of text visualization tools into your Zotero library.

Topic Modeling with MALLET

 

 


 

Beck Center Grad Fellow Sara Palmer tells us how MALLET works and what you can (and cannot) do with Topic Modeling

Related Story:

DiSCussion with Lauren Klein: Archival Silence


This past summer I enjoyed working with a tool called MALLET (MAchine Learning for LanguagE Toolkit) that can generate a series of potential topics in a corpus of texts.  Created by Andrew McCallum at the University of Massachusetts, it is an open source toolkit that is fairly easy to download and install.  While the setup instructions on the website are minimal, Shawn Graham has written a helpful overview of how to get started.

Topic modeling is based on the idea that within a set of related texts, certain words will occur near each other with statistically significant frequency. MALLET works on documents like its acronymic namesake by shattering texts into an array of words.  It then applies a statistical method called Latent Dirichlet Allocation to put related words into clusters. 

These related groups of words, or ‘word bags’ as they are often called, can be interpreted as making up a topic.  For example, the group of words ‘needle,’ ‘stitching,’ and ‘fabric’ could reasonably be labeled as the topic ‘sewing.’  Of course, scholars familiar with a corpus can readily name a list of topics but what if you’re working with a large set of texts and don’t have time to read all of them carefully?  MALLET allows you to quickly obtain a general idea of what topics may be present.  
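
For readers who want to see the mechanics, the following is a toy Python sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, one common way to fit such a model. It is not MALLET's implementation, which is a far more optimized Java program. The function name, the smoothing values alpha and beta, and the four sample "documents" are all assumptions made for illustration.

```python
import random
from collections import Counter


def toy_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model; return (top words per topic, per-document topic counts)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # words per topic in each doc
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # total words per topic

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        topics = [rng.randrange(n_topics) for _ in doc]
        for word, z in zip(doc, topics):
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document uses it and
                # how strongly the topic is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    top_words = [
        [word for word, count in topic_word[k].most_common(3) if count > 0]
        for k in range(n_topics)
    ]
    return top_words, doc_topic


# Hypothetical mini-corpus: two documents about sewing, two about gardening.
docs = [
    "needle stitching fabric thread needle".split(),
    "fabric thread stitching needle fabric".split(),
    "seed soil garden harvest seed".split(),
    "garden soil harvest seed garden".split(),
]
for seed in (1, 2, 3):
    top_words, doc_topic = toy_lda(docs, n_topics=2, seed=seed)
    print(seed, top_words, doc_topic)
```

Because the sampler relies on random draws, different seeds can yield different word bags. The model only groups words. Labeling a group "sewing" is still the reader's job.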

Perhaps the most prominent example of topic modeling with MALLET is Cameron Blevins’s work on the diary of Martha Ballard, an American midwife who wrote daily entries for 27 years.  Blevins’s topic model identified key themes in the diary such as ‘death,’ ‘shopping’ and ‘gardening’ and graphed their prominence over the course of Ballard’s life.  These charts often corresponded with known biographical details.  For example, the topic ‘emotion’ peaked between 1803 and 1804 when her husband was imprisoned for debt and her son was indicted for fraud.   

The field of DH itself has also been put under the MALLET “macroscope.”  Elijah Meeks’s work visualizes the network of associated ideas in the self-definitions of digital humanists while Matt Jockers analyzes the work performed in the field by culling themes from the blog posts of 117 digital humanists on a single day (March 18, 2010).  The Maryland Institute for Technology in the Humanities (MITH) has an excellent overview of how topic modeling has been used in the humanities.  Other prominent topic modeling projects featured on the MITH blog include Travis Brown’s work on Austen and Byron and Jeff Drouin’s work on Proust.

Interest in topic modeling has grown at DiSC since it hosted a talk by Robert Nelson from the University of Richmond this past January.  In his research on the American Civil War, Nelson ran MALLET on issues of the Richmond Daily Dispatch newspaper from 1860-1865.  He was able to graph and contextualize topics such as fugitive slave ads and military recruitment, the results of which are illustrated beautifully on the project’s website.  For each topic Nelson has a separate page displaying graphs of its prevalence over time, the most prominent keywords, and the articles, ranked by compositional percentage, that most clearly exemplify the topic.

These two components, keywords and compositional percentage, can be respectively derived from the MALLET output files ‘topic-keys’ and ‘doc-topics.’  At the Beck Center we wanted to see how this output might be used for understanding the journal Southern Changes, published monthly by the Atlanta-based Southern Regional Council from 1979 through 2003.  Comprised of 110 issues containing 978 articles, it is considerably smaller than the data set Nelson worked with but still large enough to make it a good fit for MALLET’s method.  We hoped to identify key terms with which to catalogue the collection and also to generate topical groups of articles that could be featured on the website.  The results were encouraging as I found the themes suggested by the topic keyword lists to roughly correspond with the subjects provided by Allen Tullos, editor of Southern Changes from 1982 through 2003.  Tullos writes:

 “The articles in Southern Changes range across many subjects: racial justice and the freedom struggle, voting rights, educational opportunity, economic democracy, social equality and inclusion, women's rights, environmental justice, critical regional studies, regional-global issues, and popular culture.”

After discovering a ten-topic run of MALLET to be a bit broad, I opted for 20 topics.  In THIS TABLE I have labeled the topic keyword lists with my overall impressions in black.  The generally correspondent Tullos subjects—labeled in green—are included for most of the MALLET topics.  This is, of course, a very rough sketch but it does appear to suggest correlation between MALLET data and expert knowledge.

As a model built on sampling and probability, MALLET naturally does not generate the same exact output each time it is run.  The topics and keywords do vary but are generally similar from one run to the next.  Below is a table comparing the keywords from the ‘Voting Rights’ topic in the previous example to those produced in five additional runs.  The original run was based on 10,000 iterations but for the sake of time, I used 1,000 iterations on the additional trials.  VIEW THOSE RESULTS HERE

While the topic-keys file outlines the word composition of the topics, the doc-topics file indicates the topic composition of each document.  For each document it simply lists, in descending order, the ID number of each topic and the percentage of the document that it makes up.  CLICK HERE for a table of the documents in which ‘Voting Rights’ comprises over 40% of the composition.
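
A filter like the 40% cut-off can be sketched in Python as follows. The sketch assumes a whitespace-separated layout matching the description above (document index, document name, then topic ID and proportion pairs in descending order). Other MALLET versions may write a different layout, so check your own file first. The file names, the proportions, and the use of topic ID 4 for ‘Voting Rights’ are invented for illustration.

```python
def parse_doc_topics(text):
    """Map each document name to a {topic_id: proportion} dict."""
    docs = {}
    for line in text.splitlines():
        # Skip blank lines and the header line that starts with "#".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        name, pairs = fields[1], fields[2:]
        docs[name] = {
            int(pairs[i]): float(pairs[i + 1])
            for i in range(0, len(pairs) - 1, 2)
        }
    return docs


def docs_above(docs, topic_id, threshold):
    """List documents where the topic exceeds the threshold, strongest first."""
    hits = [
        (proportions[topic_id], name)
        for name, proportions in docs.items()
        if proportions.get(topic_id, 0.0) > threshold
    ]
    return [name for _, name in sorted(hits, reverse=True)]


# Hypothetical doc-topics excerpt; real file names and values will differ.
SAMPLE = """#doc name topic proportion ...
0 article_001.txt 4 0.62 11 0.20 2 0.18
1 article_002.txt 11 0.55 4 0.41 2 0.04
2 article_003.txt 2 0.70 4 0.25 11 0.05
"""

voting_rights_docs = docs_above(parse_doc_topics(SAMPLE), topic_id=4, threshold=0.40)
print(voting_rights_docs)
```

Splitting on whitespace is a shortcut that breaks if document names contain spaces. Splitting on tabs would be the safer choice if your file is tab-delimited.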

From this one can get a clearer picture of what kind of content is picked up by MALLET for the ‘Voting Rights’ topic and see that the topic is most prevalent in the journal around the 1980 and 2000 election cycles. 

In sum, MALLET did what we wanted it to do: we obtained key terms and a general overview of topics present.  My own attempts to graph the topics over time did not illustrate trends with nearly as much clarity as I would have hoped.  However, my objective is not to prove that MALLET is some kind of magic bullet, ripping through a corpus to the heart of its intelligibility.  Rather, I found that the output, when interpreted cautiously, can offer an excellent starting point for more detailed content analysis.  

Authored By: 

Sara Palmer


Mapping with OpenHeatMap and Geocommons


Related Story:

Visualizing Words with Voyant 

Related Links: 

Tweeting #OWS

Geocommons

OpenHeatMap


Even if data visualization isn't the primary goal of a project, adding an animated or interactive map can be an effective way to enrich a presentation, article, or lecture--and it doesn't have to take up huge swaths of time. As part of the Tweeting Occupy Wall Street project, I tested web-based mapping tools that would allow us to plot some of the 10 million tweets related to Occupy Wall Street.

Dozens of geographic data visualization tools, many of them open-source, are available on the web, but for this particular project I investigated tools that are 1) powerful enough to handle large data sets; 2) relatively easy to learn and share; and 3) free. Here's a rundown of the two tools that I found to be most effective in the Tweeting #OWS project: OpenHeatMap and Geocommons.

OpenHeatMap allows users to create static or animated heat maps (also called intensity or choropleth maps) based on data uploaded through a Google Doc or Excel spreadsheet. Heat maps plot values in a range of colors that indicate intensity, similar to a meteorological radar map. OpenHeatMap was one of the most user-friendly tools that I tested: all it requires to generate a map is 1) location information, in the form of latitude and longitude or state/country abbreviations; 2) a column of values (used to plot intensity); and 3) if you want an animated heat map, a column marked "time."
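
As a rough illustration of preparing such a spreadsheet, the Python sketch below counts tweets per state per day and writes the three columns as CSV. The tweet records are invented, and the column headers "state" and "value" are assumptions (only "time" is named above), so check OpenHeatMap's documentation for the headers it actually recognizes.

```python
import csv
import io
from collections import Counter

# Hypothetical geolocated tweets: (US state abbreviation, date).
tweets = [
    ("NY", "2011-10-01"), ("NY", "2011-10-01"), ("CA", "2011-10-01"),
    ("NY", "2011-10-02"), ("GA", "2011-10-02"), ("CA", "2011-10-02"),
    ("CA", "2011-10-02"),
]


def heatmap_rows(records):
    """Count tweets per (state, day) and return one row per combination."""
    counts = Counter(records)
    return [
        {"state": state, "value": n, "time": day}
        for (state, day), n in sorted(counts.items())
    ]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["state", "value", "time"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(heatmap_rows(tweets))
print(csv_text)
```

The resulting text can be saved as a .csv file and opened in Excel or imported into a Google spreadsheet before uploading.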

Customization options in OpenHeatMap allow creators to control features like the color and size of the data and map. After customizing, a user can simply use an autogenerated code to embed the map into a website. Alternatively, OpenHeatMap offers an option to host the map on your own site and fully customize it using the Heatmap API. My attempt to do so was unsuccessful, however, and a number of sources suggest simply using the embed code to store and share a map.

More than a site for creating maps, Geocommons is a robust data analysis, management, and visualization platform. As its name suggests, Geocommons embraces the open-source model and strongly encourages users to make their maps and data public (20 megabytes of private data storage is also available with a free membership). This means that, along with uploading data, users can access hundreds of data sets including census data, zip code and county maps, and much more.

It's possible to run analyses on data from within Geocommons, but I found it much faster to do the processing in Excel or Google Docs first and upload a finished dataset containing the values I wanted to plot. Geocommons makes it easy to aggregate data into non-map geographic visualizations, like this chart I made of the top states with Twitter activity related to Occupy Wall Street.

Geocommons serves up simple embed codes for its maps, as well as a customizable JavaScript API geared toward site developers. As with OpenHeatMap, embedding maps is either straightforward and noncustomizable or tricky and flexible. I'd suggest using the standard embed code unless you have some experience with JavaScript.

Authored By: 

Sarita Alami is a Graduate Fellow at DiSC.

Visualizing Words with Voyant



Authored By: 

Moya Bailey

When the graduate students arrived at DiSC for the fall semester, we were tasked with creating a visualization of the Emory Library Occupy Wall Street archive. We brainstormed with Jay Varner, our resident solutions analyst, about what might be the best way to highlight what could be done with such a massive amount of data. We decided that using the subset of geolocated tweets would provide an opportunity for some unique visualizations that would entice others to learn more and want to use the archive.
