The Center for Data Innovation spoke with Arno Candel, chief architect, physicist, and hacker at H2O, a data science platform company based in Mountain View, California. Candel discussed H2O’s variety of use cases and how his background in particle physics gave rise to his data science career.
This interview has been lightly edited.
Joshua New: You are a lead on H2O, a data science platform that the GitHub community ranked as the best open-source Java machine learning project. What makes H2O stand out compared to other similar projects?
Arno Candel: H2O aims to be the fastest, most accurate, and easiest-to-use distributed machine learning platform. Our team members are leading experts in computer science, distributed systems, machine learning, visualization, and data science. We also work in close collaboration with world-renowned academics in machine learning and data science.
New: What are some of your favorite use cases of H2O?
Candel: H2O is currently used by many thousands of users worldwide, so it’s hard to enumerate all the use cases. Many of our customers are using H2O for fraud prevention, churn prediction, digital marketing, healthcare, process optimization, and pricing engines, while others are using it for predicting financial markets or automating tasks such as digging through resumes. My favorite use cases are for healthcare and medical research, where I hope H2O can make a difference to all our lives. Another compelling use case is data science competitions: Team H2O is currently leading a Kaggle challenge on rain prediction. As our CEO SriSatish Ambati likes to say, “ML (machine learning) is the new SQL,” so you can expect the use cases for H2O and similar tools to grow rapidly.
New: In your previous job at the Department of Energy’s SLAC National Accelerator Laboratory, you are credited with authoring the “first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons” and scaling it to thousands of computer nodes. What does this mean in layman’s terms, and why is it important?
Candel: The highly successful theory of quantum mechanics has driven major advances in the understanding of particles and matter in the last century, and it is now increasingly difficult to push the frontier of experimental particle physics. New colliders and light sources cost hundreds of millions or billions of dollars and require decades of planning and design, so every piece has to be carefully designed to meet the strict requirements. High-fidelity numerical simulations are needed to meet these standards.
While at SLAC, I wrote a highly optimized computer program that utilized the world’s fastest supercomputers and novel mathematical techniques to accurately predict the dynamics of rapidly accelerated electrons in the strong electromagnetic fields of a particle accelerator. It’s like taking your microwave at home and driving it with 50,000 times the input power for a very short time. Very strong electromagnetic fields rip electrons out of a metal plate and accelerate them to 99 percent of the speed of light within a few inches, similar to a surfboard on a tsunami. My code is used to predict the critical initial part of this dynamic journey of the electrons that determines the performance of the entire accelerator.
New: Your current title is “chief architect, physicist, and hacker,” and your resume shows an extensive background in particle physics: working at the SLAC National Accelerator Laboratory, collaborating with the European Organization for Nuclear Research (CERN), and holding a PhD in physics. How do these skills translate to your expertise in machine learning and data science?
Candel: Physicists are trained to translate complex mathematical problems into actionable insights, and they develop a skill for estimating the importance and scale of the most critical parts when solving a problem. Large computational efforts are often closely related to some physical problem, whether it’s drug research, weather forecasts, microchip design, or searching for oil. When these problems get translated into computer code, it’s important to pick the right numerical algorithms. The single best way to speed up computer code is by using a better algorithm to solve the same problem. For example, solving a given set of mathematical equations on a workstation with lots of memory might require a different algorithm than when solving them on 20 interconnected servers with less memory each. Often, limiting data movement is key and smart data structures and careful design of memory access patterns go a long way towards that goal.
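Candel’s point about algorithm choice over micro-optimization can be illustrated with a minimal Python sketch of our own (the function names and data sizes are illustrative, not from H2O): answering many range-sum queries by re-summing each slice does the same work over and over, while precomputing a prefix-sum array once answers every query in constant time.

```python
import time

def range_sums_naive(data, queries):
    # Re-sum the slice for every query: O(n) work per query.
    return [sum(data[a:b]) for a, b in queries]

def range_sums_prefix(data, queries):
    # Precompute prefix sums once, then answer each query in O(1).
    prefix = [0]
    for x in data:
        prefix.append(prefix[-1] + x)
    return [prefix[b] - prefix[a] for a, b in queries]

if __name__ == "__main__":
    data = list(range(10_000))
    queries = [(i, i + 5_000) for i in range(5_000)]

    t0 = time.perf_counter()
    naive = range_sums_naive(data, queries)
    t1 = time.perf_counter()
    fast = range_sums_prefix(data, queries)
    t2 = time.perf_counter()

    assert naive == fast  # same problem, same answers
    print(f"naive:  {t1 - t0:.3f}s")
    print(f"prefix: {t2 - t1:.3f}s")
```

Both functions solve the same problem, but the second replaces repeated summation with a smarter data structure, typically running orders of magnitude faster at this scale. The same principle, restated for distributed systems, is why limiting data movement and designing memory layouts carefully matters so much.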
New: What do you hope to see for the future of machine learning as supercomputing advances?
Candel: I am hopeful that machine learning and smarter applications will continue to enrich people’s lives, make them more productive, and allow them to spend their time on more interesting things. It will also increase our safety and well-being and allow us to connect with our families and friends in more meaningful and satisfying ways. Faster hardware, better algorithms, and new generations of students in machine learning will help to make this a reality.