The Center for Data Innovation spoke with David Mimno, assistant professor at Cornell University. Mimno discussed the field of digital humanities and the impact machine learning has had on it.
Joshua New: What is digital humanities? How does it differ from, say, traditional humanities or library sciences?
David Mimno: I use the term “digital humanities” because it’s the most recognizable way to indicate something about my interests, but it’s not one I like. My personal goal has been to make digital methods invisible, as they are in other fields. There’s no “digital astronomy,” for example. There are just certain processes for which it makes sense to use computers, like mapping the sky with the Sloan Digital Sky Survey, or looking for tiny fluctuations in light that indicate the existence of exoplanets.
In the same way, if you’re interested in subtle changes in culture and history over hundreds of years from thousands of sources, you’re going to want to use computers. And scholars have been doing computational things for centuries. One of my favorite examples is the concordance. These are big books that show you the context of every occurrence of every word in, say, the Bible or Homer. It’s kind of like the 19th-century equivalent of search engine snippets.
Constructing a concordance used to take years, but now I can assign it as a one-day in-class programming project for undergraduates. And because of that we can also ask even more complicated questions, like are there words that appear in similar contexts, and how do those similarities change between collections and over time?
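A concordance of this sort really is a small exercise today. Here is a minimal sketch in Python; the tokenizing regex and the context width of four words are illustrative choices, not part of any actual assignment:

```python
import re
from collections import defaultdict

def build_concordance(text, width=4):
    """Build a keyword-in-context concordance: for every word in the text,
    record each occurrence with `width` words of context on either side."""
    words = re.findall(r"[a-z']+", text.lower())
    concordance = defaultdict(list)
    for i, word in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        concordance[word].append(f"{left} [{word}] {right}")
    return concordance

sample = ("In the beginning God created the heaven and the earth. "
          "And the earth was without form, and void.")
conc = build_concordance(sample)
for line in conc["earth"]:
    print(line)
```

From this index, the more complicated questions follow naturally: with every word’s contexts collected in one place, comparing contexts across words, collections, or time periods becomes a matter of comparing these lists.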
New: You used to work for the Perseus Project at Tufts University, which aims to make history more accessible. Beyond digitizing historical texts and making them easily accessible online, what does this entail?
Mimno: Perseus is a fantastic resource, and I’m incredibly grateful to its editor, Greg Crane, for giving me my start in this field. What makes Perseus such a vital tool for anyone interested in the ancient Mediterranean world is that it offers multiple levels of support. The first level is simply access. By digitizing pretty much all the Latin and Greek that still exists from the ancient world and putting it online, it enables people to connect with scholarship anywhere and throughout their lives—the majority of hits were not from .edu domains. The second level is that we integrated a lot of linguistic and cultural resources to help people understand texts from cultures they don’t know in languages they don’t speak well. It’s not anything you couldn’t get from a grammar book and a dictionary and an atlas, but having everything integrated in one clickable link makes a huge difference. Being part of something so influential was wonderful. The reason I left is that I saw a need for a third level—using new language and machine learning technologies to provide new ways of exploring collections.
New: You are the chief maintainer of the MAchine Learning for LanguagE Toolkit (Mallet). What does Mallet do that other toolkits do not?
Mimno: At this point I’m not sure if there’s anything that Mallet does that other packages don’t, but I suspect there are a few reasons for its longevity. First, what it does it does well. My colleagues, especially Hanna Wallach, and I put a lot of work into making things fast and useful. Second, it is mostly self-contained. A lot of researchers put out code that assumes your input data is in some specific numerical format, and they don’t give users a lot of help in converting what they have into that format. Mallet still takes some input manipulation, but it’s less than most.
Running an open-source project is a lot of work, and I’m not able to put in nearly as much time as I’d like. But it’s still a really useful thing for me. Having a software package is a great way to find out what people are really doing, and whether you are actually helping anyone! A good way to decide what to work on is to notice when three or four people working in different areas ask you the same question, and you don’t have a good answer. I don’t know how to get that kind of feedback without putting something out there.
New: You published a paper detailing a method for evaluating user-provided responses on online communities that can serve as an alternative to systems where useful or helpful responses get overshadowed by popular or polarizing responses. Could you explain this approach?
Mimno: My student Moontae Lee, now a professor at the University of Illinois at Chicago, has a background in psychology, and he wanted to see if he could use social media data to test some theories. One is that when people can see whether an opinion is unpopular, they’re more likely to keep it to themselves. This could have a big impact in online forums: if one post gets voted up, it can quickly appear to become a “runaway favorite” even if it’s not that much better than another. One mathematical tool for self-reinforcing dynamics is called the “Chinese restaurant process.” The name comes from a thought experiment where people entering a cafeteria tend to want to sit at tables that already have a lot of people sitting at them. We extended this metaphor to up-or-down voting on question answers in Stack Exchange forums, including the popular Stack Overflow site.

We found that we could describe the culture of different forums on two axes: “trendiness,” which is high if users tend to agree on a single answer, and “conformity,” which is high if almost all votes are positive. The trendiest forums are also pretty positive, and include management and web development. The less trendy forums, where there’s more diversity of popular opinions, have a bigger spread: more positive forums are often about theoretical subjects or hobbies, such as math education or chess, while the more negative are about religion, language, and meta-discussion about rules.
New: How significantly has machine learning changed the field of digital humanities, and how do you see it shaping the field as AI gets more advanced?
Mimno: I wouldn’t say that machine learning has changed humanities that much. The incentive in machine-learning research is always for more novelty and more complexity, but when your goal is to build evidence that will enable you to make convincing arguments, the incentive is exactly the opposite: simpler, more reliable, more easily explained methods are better. I think this is true of a lot of fields. The big difference is whether the goal is some operational improvement or knowledge. If you just want to increase click-through rate or cancer screening accuracy, then you’re not going to be that worried about whether you can follow the logic of the system. Performance is first; interpretability is great but negotiable. But in a lot of fields, especially history and literature, the interpretation is the primary objective, and “performance,” to the extent that it’s even meaningful, is negotiable. In fact in many cases it’s the “failures” that are most interesting. For example, if I create a system that tries to guess whether a novel was written in the 19th or 20th century, and it gets a book wrong, I can then look at the classifier and ask, “what are the characteristics of that book that made it look out of place?” Or, “what does that tell me about the book, and about the two centuries?”
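That inspect-the-failures workflow can be sketched with a toy log-odds classifier: learn a per-word weight from two corpora, then, for any document, list which words pushed it toward each class. The two-document “corpora” and the words in the test are invented for illustration and are not drawn from any real 19th- or 20th-century texts:

```python
import math
from collections import Counter

def train_weights(docs_a, docs_b):
    """Learn per-word log-odds weights from two corpora, with
    add-one smoothing. Positive weight = evidence for class A."""
    counts_a = Counter(w for d in docs_a for w in d.split())
    counts_b = Counter(w for d in docs_b for w in d.split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    return {w: math.log((counts_a[w] + 1) / total_a)
               - math.log((counts_b[w] + 1) / total_b)
            for w in vocab}

def explain(doc, weights):
    """Classify a document and return each word's total contribution
    to the score -- the 'why did it think that?' view, which is where
    an unexpected classification becomes interesting to a scholar."""
    contributions = Counter()
    for w in doc.split():
        contributions[w] += weights.get(w, 0.0)
    score = sum(contributions.values())
    label = "A" if score > 0 else "B"
    return label, contributions.most_common()
```

When a book lands on the “wrong” side, the contribution list shows exactly which vocabulary made it look out of place, which is the starting point for the interpretive question rather than the end of the analysis.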
If there’s one area of AI, another term I would gladly never use, that I think is poised to make a big splash in humanities, it’s image analysis. Machine learning is really all about finding good mathematical representations of complex real-world things. A lot of work so far has happened in text, because it’s easy to digitize and work with: counting words is an incredibly good representation, as limited as it is. But a lot of culture is visual, and just in the past year or so we’ve gained the ability to project images into good vector space representations without needing giant GPU clusters or arcane expertise. I think we’ll see some very cool work coming out in the next few years.