The Center for Data Innovation spoke with Harald Gölles, CTO and co-founder of SearchInk, a German startup that produces a machine-learning tool for computerized searches of scanned handwritten texts. Gölles discussed the challenges handwriting poses to computers and the opportunities that machine learning unlocks for researchers.
Nick Wallace: SearchInk makes it possible to scan handwritten documents and search the digitized files. What is the key challenge of digitizing handwritten texts?
Harald Gölles: The problem with handwritten text is that it is more difficult, and sometimes impossible, to read individual characters in isolation—which is how computers typically interpret print. If you look at different writing styles, you will see the same letters written many different ways. Sometimes they’re clearly written, but often they’re just implied. An “S”, for example, may just be a line or a curve. This means some letters may be indistinguishable from a dash or a hyphen, and so can’t be understood in isolation. You cannot decode handwriting character by character in the way you can print. You have to interpret a whole line of writing in order to get the meaning.
This is not the same as an electronic pen on a computer. There, the problem is already solved, because the machine can see the motion of the writing. It can see the text being created—it has a lot more information to work with. But with ink writing, the machine only has the finished product, which makes it harder to interpret.
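The idea that an ambiguous stroke only becomes readable in context can be illustrated with a deliberately simplified sketch. This is a toy analogy, not SearchInk's actual engine: the stroke names, the candidate table, and the mini-lexicon are all invented for illustration. A character-by-character decoder gives up on an ambiguous stroke, while a whole-word decoder resolves it by checking which known words are consistent with every stroke at once.

```python
# Toy sketch (assumed, not SearchInk's actual method): each handwritten
# stroke shape maps to several plausible letters, so it is ambiguous alone.
CANDIDATES = {
    "curve": {"s", "e", "c"},
    "dot":   {"i", "j"},
    "tall":  {"l", "t", "b"},
}

# A small lexicon standing in for everything the engine has learned.
LEXICON = {"sit", "lit", "ten", "jet"}

def decode_char(stroke):
    """Character-in-isolation decoding: ambiguous strokes stay unresolved."""
    letters = CANDIDATES.get(stroke, set())
    return next(iter(letters)) if len(letters) == 1 else "?"

def decode_word(strokes):
    """Whole-word decoding: keep only lexicon words consistent with every stroke."""
    matches = [
        word for word in LEXICON
        if len(word) == len(strokes)
        and all(letter in CANDIDATES[stroke]
                for letter, stroke in zip(word, strokes))
    ]
    return matches[0] if len(matches) == 1 else None
```

In isolation, a "curve" stroke could be an "s", "e", or "c", so `decode_char` returns "?"; but given the full stroke sequence `["curve", "dot", "tall"]`, only "sit" fits, so `decode_word` can read it.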
Wallace: How do you teach the software to recognize all the different varieties of handwriting out there?
Gölles: We use a technique called transfer learning. The neural network breaks words down in order to classify their components and then recognize them again in other words. Think of all the ways of writing the letter “S”: there may be a lot, but they do follow common patterns and styles. If you feed enough handwritten words into the machine, the neural network can quickly learn all of the common ways that letter can appear in different words, even if it isn’t recognizable on its own. Just as a human isn’t confused by one strangely written “S” in the middle of a word, the machine can identify the letter based on the context in which it appears.
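The reuse of letter knowledge across words can be sketched in a deliberately simplified, pure-Python analogy. This is an assumption-laden illustration, not the actual SearchInk model: real transfer learning operates on learned neural features, not a lookup table. Here, letter variants seen only inside training words are "transferred" so the same variants can be read inside words never seen before.

```python
from collections import defaultdict

# Toy analogy for reusing learned letter variants across words
# (invented for illustration; not the actual SearchInk pipeline).

def learn_variants(labelled_words):
    """Map each letter to every stroke shape observed for it in training."""
    variants = defaultdict(set)
    for shapes, text in labelled_words:
        for shape, letter in zip(shapes, text):
            variants[letter].add(shape)
    return variants

def read_word(shapes, variants):
    """Decode an unseen word using the learned letter-variant table."""
    inverse = defaultdict(set)
    for letter, shape_set in variants.items():
        for shape in shape_set:
            inverse[shape].add(letter)
    return "".join(
        next(iter(inverse[s])) if len(inverse[s]) == 1 else "?"
        for s in shapes
    )

# Two different "s" shapes appear only inside the training word "sun" ...
training = [(["s1", "u1", "n1"], "sun"),
            (["s2", "u1", "n1"], "sun"),
            (["n1", "e1", "t1"], "net")]
variants = learn_variants(training)
# ... yet the second "s" variant can now be read inside the unseen word "set".
```

The point of the sketch mirrors the interview: the unusual "s" shape was never labelled on its own, only inside whole words, yet it can still be recognized when it turns up in a new word.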
But the most exciting thing is that the neural network can learn things about handwriting that are not obvious to a human reader. For example, we found the software can identify, with 80 percent accuracy, whether a name is male or female just by the way it’s written. This isn’t just down to the names themselves. It seems there is something in the handwriting that reveals what the writer knows about the subject, such as whether a person is male or female. We don’t even know what it is, but the machine can spot it.
Wallace: How does SearchInk’s ability to interpret handwriting compare to that of a human? For example, can it help us finally figure out what doctors’ scribbles are all about?
Gölles: The machine is better than humans at reconstructing texts that are just badly written. The software engine reduces everything that’s written to its base variations, whereas a human looks for meaning straight away—this gives the machine something of an advantage when it comes to “illegible” texts. So with a doctor’s note, it may well be that the text is incomprehensible to humans. But that doctor probably writes many, many prescriptions and sick notes, so the software can reduce it to its basic elements and compare it with all the other information in the neural network, giving the machine a better chance of understanding than a human.
But the software can only do this better than humans as long as the text adheres to some basic conventions of writing that the machine has seen before. For example, if we found an old book not written according to modern “rules”—perhaps in an earlier version of the language, or when writing standards were very different—the machine might need more information before it could interpret it, whereas a human may still be able to figure it out.
Wallace: SearchInk is owned by Qidenus, another company you co-founded, which applies digital technologies to books and archives, and also produces robotic book scanners. What do you think the impact could be of the automated ability to digitize and search any kind of text?
Gölles: The robots can flick through the pages of a whole book, without damaging it, and scan the entire thing. This makes the task of digitizing information much easier. When you add the ability to make it searchable, the most important implications are for research. Being able to search the content of books, instead of just their metadata, doesn’t just make it easier to find the specific book you’re looking for. It also allows you to use data-mining to ask broader questions. For example, if you’re researching a historical event, you can mine the archival records for very detailed information about things that happened, and quantify it. Also, at the moment you need access to the archives themselves to get at those records. By digitizing them, and making them searchable, you can open them up to everyone—like open data—and then anyone can do their own research.
Wallace: Where do you want to see the technology go next? What is the next frontier?
Gölles: Semantics are the next frontier. Machines still don’t understand the meanings of words in the way humans do. Once you come to the point of being able to detect meanings, then you really have an extremely powerful tool on your hands. It could be applied to more or less anything, not just text, to search and mine for extremely complex information. For example, you could use it to search pictures: that means not just having software that can tell you what’s in the picture, but what the picture actually shows. Today you already have software that can spot a dog in a picture; but a machine with a grasp of meaning could tell you that it’s not just a dog, it’s a dog playing catch.