The Center for Data Innovation spoke with Patrick Juola, a professor of computer science at Duquesne University in Pittsburgh and an authority on computational stylometry, the study of linguistic style through text analysis. Juola is also the CEO of text analysis company Juola & Associates. He discussed what stylistic factors can reveal a person’s identity, as well as some emerging applications for stylometry in computer security.
This interview has been lightly edited.
Travis Korte: Can you introduce Juola & Associates, what you do, and what markets you serve?
Patrick Juola: Juola & Associates is a startup specializing in the application of text analysis to various industries including computer security and litigation support. One of our major abilities is the analysis of language to determine qualities of the author, such as the identity of the author—or barring that, the gender, education level, age, and even in some cases psychological profiles.
So, for example, if you’re a lawyer and you’re dealing with a case of a disputed email, we can look at the text of the email and say, “This email was probably not written by the defendant, and in fact, was probably written by a native speaker of Hungarian.”
TK: You’ve done a lot of work using text analysis to identify authors of particular pieces of writing. Can you discuss some of your past projects, and, if you can, current ones?
PJ: Our most high-profile case so far is probably last year’s unmasking of J.K. Rowling as the true author of Robert Galbraith’s detective novel The Cuckoo’s Calling. More recently, we did an analysis for Forbes magazine about Bitcoin and the authorship of “Satoshi Nakamoto’s” basic design documents. We’ve also been involved in a number of other projects, such as an asylum case involving the author of a number of anonymous newspaper columns critical of a foreign government, and a disputed will case that was more or less straight out of Agatha Christie: did she or didn’t she write the will just before the accident?
Our major current project is part of DARPA’s Active Authentication program, applying this technology for computer security. The basic idea is that rather than authenticating you when you log on, the computer will continually look at the behavior of the person at the keyboard right now, so if you get up and get a cup of coffee, no one else can sit down and use your credentials. If the person sending email from your account isn’t you, that’s a problem—and a problem we can help the Department of Defense solve.
TK: Your program, the Java Graphical Authorship Attribution Program, can help non-expert users conduct stylometric analysis. What sorts of things does it look for in comparing writing styles? What traits can give a writer’s identity away?
PJ: It looks for a lot of things, partly because it’s a research tool to help us figure out what kind of traits are the best indicators. Basically, almost anything can give you away: word choice, preferred syntax, even punctuation and spacing. But the feature that we use most often is simple character clusters. These would be, for example, all groups of four consecutive characters. This deceptively simple feature set nevertheless captures an extraordinary depth and complexity of information about the writer.
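The character clusters Juola describes can be illustrated with a short sketch. This is not the Java Graphical Authorship Attribution Program's code; it is a minimal Python example, with hypothetical function names `char_ngrams` and `ngram_profile`, that assumes a "cluster" is every run of four consecutive characters, spaces and punctuation included, and that a writer's profile is the relative frequency of each cluster.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Return every run of n consecutive characters in text, in order."""
    # Hypothetical helper for illustration; texts shorter than n yield [].
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=4):
    """Map each character n-gram to its relative frequency in text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


sample = "the cat sat on the mat"
profile = ngram_profile(sample)

# Show the three most frequent four-character clusters in the sample.
for gram, freq in sorted(profile.items(), key=lambda kv: -kv[1])[:3]:
    print(repr(gram), round(freq, 3))
```

Because the clusters overlap and span word boundaries, a single profile picks up spelling, common short words, punctuation, and spacing habits at once, which may be part of why such a simple feature carries so much information.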
TK: In response to the successes of computational stylometry, some researchers have worked on ways to fool systems such as yours. How do those generally work? When can writing style be suppressed, and when can it not be?
PJ: Stylometric deception is somewhat dependent on what you’re trying to deceive. If the analyst is looking at punctuation, you can try punctuating differently than you normally do. One of the advantages of character clusters is that they contain so much information that it’s hard to hide everything.
Basically, you would be doing the equivalent of wiping your fingerprints off a doorknob, and, yes, you can succeed in wiping that doorknob. What our research has found, however, is that this kind of deception is itself detectable—we can tell that the metaphorical doorknob has been wiped, even if we can’t tell who wiped it. So we’re still getting some useful information.
But you’re right, this is a very important research area, and I expect the answers to this question to change, possibly in the next few months as a new paper comes out.
TK: What are some future applications you foresee for stylometry that have not yet been fully explored?
PJ: I think there’s a lot of application to behavioral biometrics. I don’t like passwords; they’re too easy to steal and too easy to forget, which is basically the worst of both worlds. It’s hard to forget your own language and almost as hard to steal it. I think the DARPA project will open the door to a lot of new applications, and I look forward to working on them.