The Center for Data Innovation spoke with Ed Kearns, chief data officer at the U.S. National Oceanic and Atmospheric Administration (NOAA). Kearns discussed NOAA’s Big Data Project, as well as NOAA’s new National Water Model.
Joshua New: NOAA collects more data than probably any other government agency, and likely more than almost any private sector organization as well. Just how much data does NOAA collect on a daily basis? How do you expect this to change as new data technologies, such as more powerful satellites and the Internet of Things, proliferate?
Ed Kearns: NOAA collects many observations of the atmosphere and oceans, and also generates large volumes of data from computer models that use those observations. Many of the basic, routine observations from satellites and radars are processed into higher level information products that help NOAA meet its mission. On top of that, NOAA’s ships, floats, gliders, and aircraft collect large volumes of data, but not always on a daily basis. Our biologists are collecting not only the traditional fisheries data but are now generating high volumes of genomics information as well. NOAA is a fascinating organization with a complex mission, and its data collection systems are a window into that. So, “how much data” is collected every day is a rather more complex question than it may initially seem!
One way to estimate the total volume is to look at what goes into the NOAA archive. The routine volume going into NOAA’s archives today is about 7 terabytes (TB) per day, from several hundred input streams, adding to the total archived volume of around 25 petabytes (PB). Weather and climate computer models in NOAA’s operational and research environments generate very large volumes, which I would estimate are now in the 100 TB/day range, and are usually stored near the computation sites for practical reasons.
How will this change? Of course, our volumes will go up. New satellite systems such as JPSS-1 and GOES-S will be launching soon, adding another 5 TB/day or so. Higher-resolution computer models, new ensemble techniques, and additional genomics advances will produce many times the current data output. In the years ahead, new observing technologies, such as NOAA’s phased-array weather radars, will be deployed in operations and will bring even more data with them.
But the main data challenge facing NOAA today will not change. And that challenge is how to make all those open data available and usable to everyone who wants to use them! The volumes, variety of data, and the always-increasing demand for access to NOAA’s data put a significant strain on our networks, IT systems, and budgets. And that’s why we are engaging with industry in NOAA’s Big Data Project.
New: Could you explain NOAA’s Big Data Project? How is it coming along?
Kearns: We started NOAA’s Big Data Project in April 2015 through three-year, extendable Cooperative Research and Development Agreements, or CRADAs, between NOAA and Amazon, Google, IBM, Microsoft, and the Open Commons Consortium. This is an experiment to discover ways for NOAA to work even smarter through partnerships with American industry, and to leverage the value inherent in NOAA’s data to enable their availability on modern cloud platforms. We hope the availability of NOAA’s data on those cloud platforms will also create new business and research opportunities using federal open data. The market for the data should help determine, and also help pay for, the ways in which the data are made accessible. So, it’s really more of a business experiment than a data experiment.
Under this agreement, the data that NOAA is delivering are still free and open, meaning that the data aren’t being sold and you don’t need to pay to access them. Instead, industry is seeking to develop ways of using, delivering, or building upon those free NOAA data. They are seeking to monetize those new data services, to subsidize the basic free access to the NOAA data they are providing to everyone. The Open Commons Consortium is a non-profit academic entity, so they are playing a slightly different role in the project by encouraging new research results through their Environmental Data Commons.
While over a dozen NOAA datasets are at some level of delivery through the project to our CRADA collaborators, NOAA’s NEXRAD Doppler weather radar data were among the first data to be delivered through the agreement. And so far, we have learned the most about how this can work from the radar experience. NOAA transferred the complete Doppler radar historical archive, almost 300 TB, from its internal systems, with the help of the Cooperative Institute for Climate and Satellites from North Carolina State University. Amazon was the first collaborator, along with their partner Unidata, to make those radar data freely available on their platform, and found after a year that utilization had increased more than 100 percent by volume, and that up to 8,000 distinct users per month were accessing those NOAA data on Amazon Web Services. This has far exceeded all of our expectations. Amazon has revealed that about 50 percent of their users are keeping and using the data directly on their platform, because they find the services provided by Amazon Web Services to be valuable and useful, and that use generates the revenue to keep the access going. The other 50 percent of users are downloading the data for free. And remember, all this is at no net cost to the American taxpayer.
Additionally, the time required to develop new information products based on weather radar data has decreased drastically. Products that took years or months to produce now take only days or hours on Amazon’s platform. For NOAA, Amazon, and many businesses, this is the kind of win-win-win scenario we are seeking.
Google recently released a video in which one of its data scientists, who is not a meteorologist or climate expert, was able to do a quick assessment of global climate variability by using NOAA’s temperature data that were integrated into its BigQuery tool. In this example, NOAA brings its high-quality data and expertise to the project, Google brings its technical tools and capacities, and then any scientist can more easily overcome the barriers to using those data to answer their questions. It works.
The project is continuing to deliver other datasets, including temperature and precipitation records, fisheries catch data, the new National Water Model data, climate forecast models, advanced weather radar products, fisheries genomics information, and NOAA’s new geostationary satellite data. IBM is currently exploring a number of these NOAA datasets and has recently begun to make several, including the Rapid Refresh weather model and some types of fisheries data, available through their Earth Systems Data portal on their Bluemix platform. We’re excited to see what kinds of applications it can support.
The high-resolution geostationary imagery from NOAA’s GOES-16 satellite will become publicly available in June of this year, and we are all very excited to see how the Big Data Project can make those data easier for everyone to use. This will be a major test case for the project, with large data rates and velocities, as well as incredible demand from the user community.
New: Is NOAA exploring any other kinds of CRADAs to improve how it uses or publishes data?
Kearns: Yes. And we are actively investigating how to sustain and operationalize the ideas behind the Big Data Project CRADA. We are discovering that, in order for the business relationship to work in support of data usability, both NOAA and its CRADA collaborators often need some defined level of service that we can both count on, in order to develop data services that make business sense. We’re asking the question, “What kind of public-private relationships centered on open data are most sustainable, and make the most sense in the long term?”
We are currently exploring all options. The traditional idea that NOAA’s budget should cover all the costs of data dissemination simply does not scale with rapidly increasing volumes and demand for those data. We need a solution that scales, and does not place the entire financial burden squarely on the backs of the American taxpayer.
If we all need to learn more about how this can work for more types of data, NOAA is open to extending the existing Big Data CRADAs. And NOAA is always open to new ideas for partnering with industry and academia through new CRADAs, too.
One of our guiding principles is a fair and level playing field: if we offer data or services to one collaborator, we offer them to all collaborators. There is no privileged access or special service for any one collaborator. And we want to foster all new ideas about how to make NOAA’s open data as useful as possible, and to break down any barriers and obstacles to their use.
New: Last year, NOAA launched the National Water Model, which uses a supercomputer to forecast changes in the United States’ water supply to help improve flood forecasting. How effective has the model proven?
Kearns: The more I learn about NOAA’s new National Water Model, the more excited I am about its availability. I personally believe it has the potential to be a truly transformative information product. It will change the way we all look at water resources all across the country. It provides an integration of many different kinds of information, at a resolution that simply wasn’t available before, for the entire nation to use. The National Water Model, using the most powerful Cray supercomputer NOAA has, produces that information at 2.7 million different points in the contiguous United States, every hour. That’s a huge improvement over what information was available previously. And it’s only going to keep getting better, and more useful to everyone who is concerned about our water resources. That is a pretty big group: anyone interested in agriculture, flooding, insurance, water management, and on and on and on. The National Weather Service is on a path towards fully incorporating the model into its operations, and I’d expect to see it be adopted into various operational products as they continue to evaluate its capabilities.
One of the things that limits taking full advantage of the new water model is the ability to make those data widely and easily accessible. Like I said, making NOAA’s information available and usable to everyone is really our biggest data challenge! Many of the Big Data Project’s collaborators are now gearing up to make the National Water Model data available on their platforms. Since the daily output is almost 1 TB/day, and the retrospective time series is almost 40 TB in size, simply moving those data around to everyone who needs to use them is a big challenge. Instead of NOAA trying to send the data to so many consumers, we hope our Big Data Project collaborators can offer new ways of bringing the consumers to those same data on their cloud platforms. I think that is proving to be a much more effective strategy.
New: The Doppler Radar National Mosaic is consistently one of the most popular datasets on Data.gov. What other NOAA data assets do you think would be highly valuable once made open?
Kearns: Well, I’d like to say “once more easily usable,” since almost all of NOAA’s datasets are now open. And they are usually available, online, now, if you can discover where to find them, are able to move them and can understand what they mean. Those are significant barriers to use, unfortunately, and we continue to work on that challenge.
NOAA’s done a good job making many of its complex data, like the Doppler weather radar through that National Mosaic, accessible and usable. And it’s continuing to do it with datasets like the National Water Model and GOES-16. But I think NOAA is just scratching the surface of what’s possible, and we need our partners in the private and research sectors to help.
The climate prediction model data, from the Climate Forecast System and the National Multi-Model Ensemble, are very valuable but have significant complexities and volumes of hundreds of terabytes that are barriers to their wider use. NOAA’s National Marine Fisheries Service is continuing to make its catch and by-catch data available. Though those data are not large by volume, the service is following a careful process by which those many datasets can be properly and lawfully released. NOAA’s marine genomics data will be very useful to many outside of the earth science community as well when those data are made more widely accessible.
There are some valuable NOAA datasets that are not open by definition, since their intellectual property is jointly held by NOAA’s cooperators according to law and agreement. One great example is the Multi-Radar Multi-Sensor (MRMS) weather time series, which describes precipitation and severe weather using advanced techniques and at very high resolution in space and time. NOAA uses the so-called MRMS product for its federal mission, as is permitted by law, but open distribution of those data has legal limitations. In this case, NOAA is currently discussing options with the University of Oklahoma (OU), the intellectual property rights holder, to ensure that the value of those data to the nation’s well-being is fully realized while respecting OU’s well-earned rights as well. The MRMS products have been openly discussed in scientific circles for many years, so their potential is recognized by many, and it is exciting to think about their wider use.