The Center for Data Innovation spoke with Mike Olson, co-founder and chief strategy officer at Cloudera, a data management and analytics firm headquartered in Palo Alto, California. Olson discussed the value of the open-source software Hadoop, Cloudera’s work on the Precision Medicine Initiative, and the major milestones in the history of the big data industry.
Joshua New: In 2009, Cloudera became the first commercial vendor of Apache Hadoop, which at the time was a little-known open-source software framework for data management and analytics. Why use open-source software? Was Hadoop just particularly well-suited to your business, or did it being open-source have other unique benefits?
Mike Olson: When we started Cloudera, nobody was talking about “big data,” and the only people who knew about Apache Hadoop were wild-eyed engineers working in the consumer Internet. Amr Awadallah, Jeff Hammerbacher, Christophe Bisciglia, and I decided to build a business on this technology because we believed in the transformative power of Hadoop’s parallel processing and flexible storage architecture. We also fundamentally believed in the power of open source software to accelerate innovation and free enterprises from information technology (IT) infrastructure lock-in.
Fast forward to today. You cannot find a company that is operating a data center without the use of open source technology. No dominant platform-level software infrastructure has emerged in the last ten years in closed-source, proprietary form. We’re finding more and more that conversations about big data and its real-world use cases are becoming pervasive all the way into the boardroom.
The real unfair advantage of open source is the community that creates it. The community is global: No single organization could hire all the smart, motivated people worldwide who collaborate on open-source projects. The community is diverse in interest: Different teams or individuals have different interests, and concentrate on the innovations and improvements that matter most to them. The community is sustainable: Newcomers can join, work on small things, gain mastery, and become leaders over time. The community can grow: No single organization needs to hire, train, pay, and coordinate the activity of the team, so none places an artificial ceiling on size. These are emergent properties of good open source communities. Look at the decade-plus of success of projects like Hadoop and Linux. It’s really tough for single proprietary vendors to compete.
Cloudera has deeply invested in open source technologies from the beginning. Many of our employees are part of the thriving community. They have contributed projects such as Impala, Kafka, Parquet, Sentry, and many others. They are serious committers to the most innovative new projects such as Spark and Arrow. We are heavily involved in this ecosystem because this is where the next generation of big data technology is developing. Projects being built and continuously improved upon are speeding up and extending what is possible with data.
New: Your colleague and Cloudera CEO Tom Reilly has described Cloudera’s goal as getting traditional companies to “operate more like Google.” What does that mean exactly, and how does Cloudera go about accomplishing that?
Olson: Cloudera’s goal is to help any organization drive better outcomes through fast, easy, and secure management and analytics on all of their data. This includes companies in established sectors such as finance, healthcare, manufacturing, and telecommunications. Some of those companies are hesitant to adopt new technology. Companies in these sectors face strict security and compliance requirements. They also face the complications that go along with introducing a new tool into an already operating infrastructure composed of various legacy systems. Meanwhile, companies like Google, Facebook, and Yahoo have been working with big data from the outset; they’ve built their businesses on data. They are using it to transform customer service, predict market trends, and deeply analyze entire industries, all capabilities that are core to their success. Cloudera has set out to help traditional enterprises become data-driven, and we’re doing this through comprehensive online and hands-on training programs, while also making sure our platform meets their unique security needs.
Since starting the company in 2008, we have eliminated many of the barriers that kept traditional companies from taking advantage of big data. For one, we have educated and trained over 40,000 people through Cloudera University, helping to fill the big data skills gap. Our Cloudera certification is now the most recognized credential in the industry. Second, I am proud to say that Cloudera is the only Hadoop vendor with both PCI and HIPAA compliance certifications. Those are the regulatory frameworks that govern use of private data in finance and healthcare, respectively. That means companies like MasterCard, Capital One, Children’s Healthcare of Atlanta, and Cerner can get the benefits of powerful new analytics over customer and patient data. Education, security, and ease of use are key to leveling the playing field for traditional companies and bringing meaningful use cases to fruition.
New: In February 2016, Cloudera announced it would be providing its platform for free to 50 organizations working on the Precision Medicine Initiative (PMI), President Obama’s genomics-focused project to develop personalized medicine. Why is data analytics such a critical aspect of the PMI?
Olson: We’ve long believed that big data and analytics are powerful tools for understanding disease, improving outcomes, containing cost, and delivering better care where it’s needed the most. That’s proven true. Genomic data has become much more economical to store and analyze. A decade ago, the first human genome was sequenced for $300 million. We can now do it for 100,000 times less, and much faster. The technology needed to combine these massive volumes of genomic data with historical data about millions of patients is ready. Healthcare professionals, academia, researchers, and government officials now realize how crucial data analytics is in making sense of all the data we’ve collected from patients over the years, which includes complicated clinical data, imagery from diagnostics, physician notes, and more. With analytics we can find patterns in this mass of information to unlock new ways of designing custom treatments. In the future, we will no longer build treatments for the average patient; healthcare providers will be able to target the treatments they deliver to the individual patient, which will radically improve their rate of success and uncover cures for once-fatal diagnoses.
Our three-year commitment to the PMI includes providing software to 50 organizations, training one thousand researchers, and encouraging collaboration across the community on important projects. In collaboration with the brightest minds in academic and government research institutions, we will help the healthcare community apply precision medicine findings in their routine diagnostic recommendation and care delivery.
New: Cloudera has customers in both the public and private sector. Could you speak to some of the differences you’ve observed between how your government and industry clients approach data and analytics? Are there lessons that they could learn from each other?
Olson: There’s a really big shift underway in government procurement. I talk to a lot of agency chief information officers and chief technology officers, and they are rethinking their technology strategies and expressly looking for open source software and cloud platforms. They’re doing that for the same reasons the private sector is: they’re concerned about lock-in, they need to control costs, and they want the flexibility and elasticity that they get from subscription- and consumption-based deployment and use.
So government looks a lot like the private sector, if you’re a software vendor. But they’re also leading in interesting ways, not just following. Vendors and government procurement people used to talk about software in terms of “GOTS,” which stands for “government off-the-shelf,” and “COTS,” which stands for “commercial off-the-shelf.” GOTS was generally built and delivered by big contracting firms. It was custom-designed for government buyers, supporting compliance and regulatory needs peculiar to that market. Sometimes, of course, the customization was just a way to bend the software to existing usage patterns. It didn’t need to be custom because they could have changed the way they worked, but it’s sometimes easy to be organizationally lazy and spend more money.
The government buyers we work with are increasingly adopting the commercial analogues of those compliance and regulatory regimes, and recasting their use to take advantage of common, instead of custom, features. Where they really do need something specific that’s not in our product yet, they’re pushing us to add it to our standard offering, so they’re running the same bits that our bank and hospital customers do. They’re driving innovation and new capabilities that we’re then able to roll out to the commercial sector.
New: You are a longtime veteran of the big data industry, having worked at and led database and analytics companies for over 20 years. What has surprised you the most about how the industry has changed, and what has you the most excited about where you think it’s going?
Olson: I spent 25 years building and selling relational database technology before co-founding Cloudera in 2008. Over that quarter century, we made an enormous number of incremental improvements in databases, but no really fundamental advances. The ideas that Ted Codd and Chris Date laid out in the 1970s, and products like System R out of IBM and Ingres out of University of California, Berkeley had been huge advances. We spent decades polishing the chrome and tightening the screws on those systems, but nobody fundamentally rethought them.
In 2003, Google published a research paper on its storage platform, the Google File System. A year later, it followed up with a second paper about a new scale-out processing and analysis system it had built called MapReduce. Those two things together inspired Mike Cafarella and Doug Cutting to create the Apache Hadoop project.
You have to think about that for a minute to understand how shocking it is. The relational database industry, even in 2000, was generating tens of billions of dollars a year in revenue for a huge number of companies. Major research universities had research efforts underway on databases. And that whole industrial and academic machine totally whiffed on big data. Google had to invent that technology because it just couldn’t buy a product that was scalable enough to ingest the entire Internet.
I read that MapReduce paper in 2004. Lots of us in the database industry did. We all thought it was a joke. We’d been building transaction processing for banks forever; this Google thing could never debit a checking account and credit a savings account. It just didn’t look anything like a database system to us. We were so focused on our existing customers and their workloads that we missed the most important innovation in data management in forty years. Total innovator’s dilemma stuff.
So my answer is, at the platform level, Hadoop. Hadoop is the most amazing piece of core technology to emerge since the web server, based on economic impact. If you want to talk more broadly, above the technology layer, then the thing that I find most amazing is the explosion of data, and the incredible innovation in analytic algorithms that can work with it. The huge advances we’ve seen in applied machine learning, the application of powerful statistical methods to oceans of data that were never available before: it seriously takes your breath away. Novel diagnostic and treatment regimes for deadly disease; cheap and effective improvements in crop yield; efficient production and distribution of energy; these are all problems that we’re solving with big data.
And the most amazing thing of all: This whole thing is ten years old. Seriously. We’re only getting started.