The Center for Data Innovation spoke with Daryn Nakhuda, co-founder of Mighty AI, a training data-as-a-service company based in Seattle, Washington. Nakhuda discussed how his company crowdsources the annotation of training data, as well as the possibility of one day automatically creating training data.
Joshua New: Mighty AI sells “training data as a service.” Could you explain what this means, and how it works?
Daryn Nakhuda: We work with companies developing AIs that need human-annotated training data to teach their computer vision or natural language systems to perform certain tasks.
For example, say an automaker needs to train a self-driving car to detect objects in the roadway. We have a mobile app and website called Spare5 that we use to push microtasks—like identifying and labeling objects such as trees and pedestrians in a photo of a street scene—to a community of people to complete in their spare time. We then use our own machine learning algorithms to assess how they did, make sure it’s accurate, and deliver the labeled data to the automaker to train and retrain the car’s computer vision capabilities.
The “as a service” part refers to our fully managed approach and software stack which manages the process from beginning to end, so our customers can focus on their models.
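One way to picture the quality-check step described above is a simple consensus rule over several annotators' answers. The sketch below is an illustration only, not Mighty AI's method: the interview says the company uses its own machine learning algorithms, which are not described here. The function name `aggregate_labels`, the `min_agreement` threshold, and the sample item IDs are all hypothetical.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority-vote each item's labels and flag items with weak consensus.

    annotations: dict mapping an item ID to the list of labels that
    different annotators submitted for it (hypothetical data shape).
    """
    accepted = {}
    needs_review = []
    for item_id, labels in annotations.items():
        if not labels:
            # No answers yet, so nothing can be accepted.
            needs_review.append(item_id)
            continue
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            # Annotators disagree too much; send back for more labeling.
            needs_review.append(item_id)
    return accepted, needs_review


# Hypothetical microtask results from three annotators per image.
annotations = {
    "img_001": ["pedestrian", "pedestrian", "tree"],
    "img_002": ["tree", "pedestrian", "sign"],
}
accepted, needs_review = aggregate_labels(annotations)
print(accepted)
print(needs_review)
```

In this toy setup, items that clear the agreement threshold would be delivered to the customer, while the rest would be routed back to the community for additional annotation.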
New: How many people do you have annotating and labeling training data for you?
Nakhuda: We have more than 300,000 people across 155 countries who do annotation tasks as part of our Spare5 community. We affectionately call them our “Fives” and do everything we can to make our app and website a fun place to spend a spare five minutes. For instance, we’ve built game-like user interfaces with experience levels and points they can earn. We’ve also created a Facebook page and online forum where people can ask questions, share stories, get access to mentors, and more, all to help foster a sense of community.
New: One of Mighty AI’s main training data offerings focuses on autonomous driving. How in-demand is this kind of data right now?
Nakhuda: The automotive industry is probably one of the hottest and most advanced industries applying AI today. Making fully autonomous vehicles a reality will require an extraordinary amount of training data. In fact, most of our customers developing autonomous vehicles tell us that not having clean, labeled data is the number one blocker to getting to higher levels of autonomy. Pair this with the fact that estimates show a typical vehicle will be collecting upwards of 40 terabytes of data per day across all sensor types, and it’s a tremendous hurdle to clear.
New: Mighty AI also offers training data for natural language processing. Is this harder to generate than training data for computer vision? I imagine image classification is more straightforward.
Nakhuda: Every type of training data poses its own challenges. For natural language, beyond the fluency needed to understand a language’s idioms and colloquialisms, there’s a strong need for cultural awareness to truly understand sentiment and context. Images may seem easier, but similar challenges exist, especially in the terms used to describe objects in different parts of the world and in regional differences in the appearance of common objects. Take, for example, traffic signs, which look different in the United States than they do in Europe, Asia, and so on. Or consider trucks, which someone in the United Kingdom might describe as a “lorry,” whereas someone in the United States might describe one as a “semi.”
New: Will we always need humans to generate training data? Or could this process one day be automated?
Nakhuda: While it is becoming easier to generate synthetic data, which is data produced by models as opposed to data that has been gathered in the real world, it will never replace human-annotated training data. Generative models and simulations are fundamentally limited and cannot show you anything truly novel that goes beyond their initial training data.
For example, a generative model trained on images of cars is useless if you suddenly need images of stop signs. To train a new model to generate stop sign images, you first need to collect a large set of real-world stop sign images from which to learn. And to obtain images of stop signs, you need to comb through large sets of images and identify the ones that include stop signs. To do that, you can either use human annotators or try a pretrained image classifier to identify whether a given image contains a stop sign. However, training a supervised classifier like that requires a large volume of (you guessed it) human-annotated training data.
It’s also worth noting that today, many of the tasks humans do to train AIs are fairly basic and straightforward. But as AIs become more advanced, people will move into roles where they focus on more creative, higher-level tasks while we lean on AIs to automate trivial, rote ones. So while the role for humans in training AIs will evolve, it will never disappear entirely.