Developing AI requires large amounts of data, and in many cases this data comes from third parties. But organizations willing to share data for computational uses have lacked easy-to-use licenses for distributing it. Many common licenses, such as the Creative Commons licenses, were developed without consideration for how data could be used for machine learning. The absence of model data-sharing agreements has deterred many data owners who would otherwise be eager to share their data, thus hindering AI development. To address this problem, Microsoft has published three model data use agreements that specify whether and how data may be used for AI development.
The first model data use agreement, the Open Use of Data Agreement (O-UDA), is similar to the existing Creative Commons No Rights Reserved (CC0) license: it places no restrictions on the use of the data or of any outputs (products developed with the use of this data). It also clarifies that an AI model created from the data is considered an output and is thus unrestricted as well.
The second model data use agreement, the Computational Use of Data Agreement (C-UDA), allows a data holder to share data for computational purposes only. Because computational use includes machine learning, C-UDA provides a mechanism to share data for AI development even when datasets contain copyrighted works. For example, a local newspaper may want to let AI developers use its corpus of articles but not allow them to otherwise use or distribute the content. The C-UDA lets a data holder share data without giving up its rights to limit access and use for non-computational purposes.
The third model data use agreement, the Data Use Agreement for Open AI Model Development (DUA-OAI), resembles C-UDA in that it allows data sharing for computational use only, with the key distinction that any AI model trained on the data must be made publicly available under an open-source license. DUA-OAI is designed for situations where data holders wish to share data to advance AI development, but only if the outputs of this sharing are available to all rather than proprietary.
These model agreements are a valuable contribution to the public debate about how to make it easier for organizations to share data for AI development. Federal agencies, such as OMB and NIST, should work with the private sector and other stakeholders, including state and local governments, to evaluate how to incorporate these types of agreements into future government data releases and to standardize the licenses used to release data for AI.