In March 2019, IBM created the “Diversity in Faces” dataset, a collection of photos of people’s faces across a range of ages and ethnicities, to help reduce bias in facial recognition systems. Even though IBM compiled the dataset from photos people had shared online under licenses that allow others to use the images for any purpose, some people strongly objected because IBM did not explicitly ask their permission to include their photos. NBC News even called the dataset “facial recognition’s ‘dirty little secret.’” While this characterization is profoundly misleading (the dataset was an effort to reduce bias in facial recognition, which is hardly “dirty,” and IBM was very public about the source of the data), the controversy highlights the challenge organizations face in creating datasets for AI, even when they have lawful access to the data, and the need for government to play a larger role in compiling data for computational uses.
IBM created the Diversity in Faces dataset from images uploaded to the photo-sharing site Flickr, which allows users to post images under a variety of licenses. All of the photos in the IBM dataset carried a Creative Commons license, which content owners assign to their work to permit reuse with minimal, if any, restrictions. The widespread use of these licenses has been a tremendous boon to society and the economy, creating a wealth of valuable content that others can freely use and adapt for their own purposes, and the Diversity in Faces dataset is a prime example of the benefits openly licensed works generate. AI systems need good training data to be effective, and they suffer performance problems when that data is incomplete, inaccurate, or unrepresentative. Training a facial recognition system, for example, on a dataset disproportionately composed of white male faces will cause the system to perform poorly for women and people of color. IBM compiled the Diversity in Faces dataset as a free resource for developers who want to reduce bias in their facial recognition systems.
The backlash against using publicly available photos to train facial recognition systems highlights a misunderstanding of how U.S. copyright law permits the use of copyrighted works for computational purposes, such as training a machine learning system. Even if these images were not openly licensed, fair use would allow companies to collect and use images they have access to without seeking additional permissions from copyright holders. Companies routinely use images scraped from public websites to train image recognition systems. DeepMind, for example, offers a dataset called “Kinetics” containing 650,000 links to short clips of public YouTube videos showing human actions, such as bowling, picking fruit, or yawning.
It is clear that IBM can lawfully distribute images carrying Creative Commons licenses. Some people may oppose facial recognition technology and dislike that their images were used to train a company’s algorithms, but that does not mean copyright law is broken or needs to be changed, or that IBM did anything wrong. As Ryan Merkley, the chief executive officer of Creative Commons, notes, “copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online.” It would be unfortunate if general public angst about AI led to the popularization of licensing agreements that explicitly prohibit the computational use of data. Platforms like Flickr should resist any such pressure and continue to offer technology-neutral licenses, ensuring that any data a human can access, a computer can also access. And companies like IBM should be encouraged to continue packaging datasets for public use.
Unfortunately, misunderstandings about how open licenses work are commonplace and fuel new waves of outrage. More recently, an October New York Times article called attention to MegaFace, another facial recognition training dataset compiled from openly licensed Flickr photos. People included in the database expressed similar distaste and frustration, failing to acknowledge that their (or, for people whose childhood photos were included, their parents’) choice of open licenses explicitly allows this kind of use. Ultimately, there are limits to how much control individuals have over content they share publicly, whether online, in print, or in person. There is no need to rewrite the rules of copyright simply because AI systems are now among the potential users of this publicly shared content.