The Center for Data Innovation spoke with Bryan Catanzaro, senior researcher at Baidu’s Silicon Valley Artificial Intelligence Lab. Catanzaro discussed how the concept of “data parallelism” enables powerful deep learning programs and how artificial intelligence will transform how humans and computers interact.
This interview has been edited.
Joshua New: While you were an intern at Nvidia in 2008, you developed Copperhead, an embedded language in Python—essentially an extension of an existing programming language—to support data parallel computation. What is data parallelism and why was there a need for Copperhead to support it?
Bryan Catanzaro: Data parallelism is somewhat of an overloaded term; it means different things in different contexts. In the context of programming models, data parallelism refers to parallelism that arises naturally from bulk operations over collections. A few important operations are useful for many algorithms: element-wise computation, sorting, reductions, parallel-prefix sum, and indirect addressing, for example. Many algorithms that would be difficult to express directly using threads (small pieces of a program that can execute independently) and mutexes (locks that ensure only one thread at a time modifies a shared resource, such as a data set) are simple in data parallel programming.
This leads to programs that are generally scalable and free of the complicated bugs that plague many parallel programs. By understanding a few data parallel primitives—essentially just basic elements—programmers can solve many important problems while getting good parallel efficiency. Data parallelism is therefore a great target for higher-level parallel programming models that aim to improve programmer productivity.
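To make those primitives concrete, here is a minimal sketch using only Python's standard library (the data and variable names are illustrative, not taken from Copperhead). Each operation is a bulk transformation over a collection, with no explicit threads or locks:

```python
from functools import reduce
from itertools import accumulate
from operator import add

# Hypothetical data set: lengths of five records.
lengths = [3, 1, 4, 1, 5]

# Element-wise computation (map): square every element independently.
squares = [x * x for x in lengths]          # [9, 1, 16, 1, 25]

# Reduction: combine all elements with an associative operator.
total = reduce(add, lengths)                # 14

# Parallel-prefix sum (scan): running totals, useful for computing
# output offsets before a parallel write.
offsets = list(accumulate(lengths))         # [3, 4, 8, 9, 14]
```

Because each primitive has well-understood parallel implementations, a compiler or runtime (as Copperhead did when targeting GPUs) can parallelize programs written this way without the programmer managing threads directly.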
Data parallel programming is not the best programming model for some applications: if you can’t cast the algorithm in terms of a series of bulk operations on collections, you’re probably better off with some other programming model. However, it is an attractive programming model for many workloads.
We use this kind of data parallelism extensively at Baidu for deep learning. We use frameworks that expose data parallel primitives to programmers, and we also use data parallel ideas to simplify parallel implementation of algorithms with complex dependence patterns. For example, the Connectionist Temporal Classification (CTC) loss function enables end-to-end deep learning for time series data where the alignment between input and output is not known. Our recently open-sourced Warp-CTC library, which provides a fast implementation of the CTC loss function, uses data parallel primitives internally to build an efficient implementation of the dynamic programming algorithm at the heart of CTC. Our speech recognition models use this because our training data sets do not come with alignments between sound recordings of speech and the characters that form the transcription. CTC uses dynamic programming to avoid combinatorially checking every possible alignment between input and output sequences. Parallelizing this algorithm is difficult, which is why no graphics processing unit (GPU) implementations of CTC were available before Warp-CTC. Warp-CTC exists because we were able to parallelize the algorithm using data parallel primitives, which dramatically simplified the implementation.
I wrote Copperhead because I was interested in higher-level programming models that help programmers parallelize their computation successfully, and I believe data parallelism is one of the most important ways to do this. Awareness of data parallelism is growing in the programming community. For example, there is great work underway in the C++ standards committee to add data parallel algorithms to the C++ standard library; I hope it lands in the upcoming revision, C++17. Overall, data parallelism is easy to understand, encourages parallel algorithm design, and maps efficiently onto modern processors, which is why I chose it as the programming model for Copperhead.
New: Many machine learning applications rely on GPUs, rather than central processing units (CPUs), even though these applications may have nothing to do with graphics. Why is this?
Catanzaro: In processor design, architects must make choices between favoring a latency-oriented design or a throughput-oriented design. Latency-oriented processors are built to solve a single task as quickly as possible. To do this they exploit speculation of many kinds to execute a sequential instruction stream as fast as possible. In contrast, throughput-oriented processors are built to execute a huge number of tasks simultaneously, maximizing the overall throughput of the processor rather than the time it takes to execute any given instruction sequence. The tradeoff between latency-oriented design and throughput-oriented design is quite fundamental, and leads to very different processors: the CPU is latency-oriented, while the GPU is throughput-oriented. This makes sense, given the historical roots of each of these processors: the CPU was built originally to run only one instruction stream, and it wasn’t until the past decade or so that it became common for multi-core CPUs to execute several instruction streams. The GPU, on the other hand, was built to run a program for every pixel of a display, for every frame. This is a natively throughput-oriented task.
Nowadays, GPUs have become programmable, so we can use them for things besides graphics, as long as those tasks are throughput-oriented. Machine learning applications are often throughput-oriented because we have large amounts of data and computation that we can process in parallel. Even pushing one piece of data through a neural network can create a large amount of parallel work. GPUs are good at this, which is why we use them: for our workloads, the GPUs we use get about seven times more work done than CPUs for the same energy consumption. We use GPUs because we need to perform extremely computationally intensive, throughput-oriented tasks as quickly as possible.
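A toy example of why even a single input exposes parallel work (the sizes and values here are made up): in a fully connected layer, every output element is an independent dot product.

```python
def dense_layer(x, W, b):
    # Each output element is an independent dot product, so all
    # len(b) of them could run simultaneously on a throughput-oriented
    # processor such as a GPU.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

x = [1.0, 2.0]                              # one input vector
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 outputs, 2 inputs
b = [0.0, 0.5, 0.0]
print(dense_layer(x, W, b))                 # [1.0, 2.5, 3.0]
```

In a production-scale network the layers have thousands of outputs and inputs are processed in batches, multiplying this parallelism by several orders of magnitude.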
New: You’ve talked before about the value of scaling deep learning systems. Why does scale matter? Is it just about performing calculations faster or increasing accuracy, or does scale influence the kinds of problems a machine learning system can solve?
Catanzaro: Scale changes everything in machine learning. Deep learning, for example, has been extensively studied for decades and the core ideas behind deep learning have been very well understood for quite some time. However, they didn’t become practically useful until the last few years. This is because we didn’t have access to large enough training sets or computational resources until recently. Progress for many difficult artificial intelligence (AI) problems has been far more rapid than I would have predicted as a graduate student, and I believe scale is the dominant reason why these methods have achieved such success. In some sense, I believe deep learning has been successful precisely because it is well positioned to take advantage of scale—it maps well onto parallel computers, and because training deep models is so computationally intensive, deep learning has an insatiable appetite for computational throughput. This is a good fit for what modern semiconductor processes can provide: it becomes relatively ever more expensive to move data around a chip, but the energy cost of performing a math operation continues to fall.
Deep learning is therefore a technology at the confluence of several important factors: the growth of large, high-quality data sets thanks to the Internet, the increase in computational throughput afforded by modern semiconductor manufacturing, and the maturation of deep learning algorithms that have been incubated for several decades. Training one of our neural networks for speech recognition takes about 20 ExaFLOPs of work, which equates to 20 billion billion math operations. We train each network on 12,000 hours of speech data, which is 500 days, nearly a year and a half, of continuous audio. And we train these models in about a week. Such experiments were simply not possible until recently, and in order to improve our models, we run many of these experiments constantly. All of this is made possible by scale.
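A quick back-of-the-envelope check of these figures (the 20 ExaFLOPs and the one-week training time are from the interview; the sustained rate is derived from them):

```python
total_flops = 20e18              # 20 ExaFLOPs of training work
train_seconds = 7 * 24 * 3600    # about one week of training
sustained = total_flops / train_seconds
print(f"{sustained / 1e12:.0f} TFLOP/s sustained")   # roughly 33 TFLOP/s

audio_hours = 12_000
audio_days = audio_hours / 24
print(f"{audio_days:.0f} days of audio")             # 500 days
```

Sustaining tens of teraFLOP/s for a week is exactly the kind of throughput-oriented workload that motivates the GPU clusters discussed above.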
New: At Baidu, you focus on making it easier for humans to interact with computers. How can deep learning improve user interfaces?
Catanzaro: I think the future of human computer interaction will be driven by AI. For example, speech recognition, emotional recognition, natural language understanding, dialog systems, and speech synthesis are all undergoing rapid change driven by new AI techniques. All of these tasks are fundamental to the user interfaces of the future. Andrew Ng, Baidu’s chief scientist, said recently, “In the future, I would love for us to be able to talk to all of our devices and have them understand us. I hope to someday have grandchildren who are mystified at how, back in 2016, if you were to say ‘Hi’ to your microwave oven, it would rudely sit there and ignore you.” I feel the same way. To achieve this vision, we’ll need to make more progress on all aspects of conversational interfaces, which are all difficult and interesting AI problems.
One such problem is that of context. Although we may not always be conscious of it, our interactions with other people occur against a deep background of contextual information. One of the things that frustrates me most about current human computer interfaces is that they rarely have the right context. For example, when I’m in my car and I make a phone call, chances are it is to my spouse. However, my car has to ask me every time who I want to call. There’s a lot of work going into rule-based interfaces that can help solve these problems by augmenting computers with background information. These are helpful, but I’m looking forward to a future where computers can model context without the need for scripts and other rules. I think this will allow us to build systems that are significantly more natural to interact with. There has been some good progress along this dimension from projects like Google Now and Apple’s Siri, but I believe we’re just at the beginning.
New: What does the future of human-machine interaction look like 10 years from now? What will new technologies and increased computing be able to accomplish?
Catanzaro: It’s always dangerous to predict the future, but as I mentioned, I’m really excited about conversational interfaces, which I think will be running many of the things we interact with on a daily basis and increasing our productivity. I’m also looking forward to autonomous vehicles, which I hope will be available, at least in certain restricted conditions, within 10 years.
I hope that we’re able to use AI to reduce the drudgery in our lives—the combination of AI with robots will solve a lot of irritating problems in the physical world, and AI will help us solve a lot of mental drudgery as well. I’m an optimist, so I believe this will make us happier and more productive as a society, although I acknowledge that we’ll need to do a better job taking care of people who have their jobs displaced. I hope we’ll do so in a way that frees everyone to follow their dreams, giving people more opportunity. If we can, I believe the world will be a better place. This is what gets me up in the morning to work on AI.