**Published on** December 20th, 2013 | *by Travis Korte*

# 5 Q’s for the Creators of BayesDB, a Database Built for Data Science

The Center for Data Innovation spoke with the creators of BayesDB, a database designed to automate certain data science techniques such as prediction and simulation. Researchers at the University of Louisville (UofL) and the Massachusetts Institute of Technology (MIT) released an alpha version of BayesDB earlier this month.

*Answers were provided by the BayesDB team, comprising UofL Professor of Psychology and Computer Science Patrick Shafto, UofL graduate student Baxter Eaves, MIT Research Scientist Vikash Mansinghka, MIT Research Engineer Dan Lovell, and MIT graduate student Jay Baxter. Responses have been edited for length and clarity.*

**Travis Korte:** For those who may be unfamiliar, can you briefly introduce BayesDB and CrossCat?

**BayesDB Team**: BayesDB is a Bayesian database table, designed to let users query the probable implications of their tabular data as naturally as an SQL database lets them query the data itself. BayesDB makes it possible for users to solve basic data science problems such as detecting predictive relationships between variables, inferring missing values, simulating probable observations and identifying statistically similar rows, without requiring them to do custom statistical modeling.

Users interact with BayesDB using our Bayesian Query Language (BQL), an SQL-like language augmented with commands for accessing the results of inference. For example, in addition to being able to retrieve data using SELECT, you can INFER values that might not be observed: `INFER annual_income, num_dependents FROM tax_return_summary WHERE employed = True AND age > 30 WITH CONFIDENCE 0.95`.

This fills in annual_income and num_dependents with the results of inference, whenever BayesDB can do so with sufficient confidence, and will also find rows where employment status and age are unknown but can be inferred.
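To make the idea of filling in a value only "with sufficient confidence" concrete, here is a minimal Python sketch. It is not BayesDB's implementation: the function name `infer_with_confidence`, the sample lists, and the rule "fill in the most frequent predicted value if its share of the draws meets the threshold" are all assumptions chosen for illustration, and the rule only makes sense for discrete columns such as num_dependents.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def infer_with_confidence(
    samples: Sequence[Hashable], confidence: float
) -> Optional[Hashable]:
    """Return the most frequent sampled value if its share of the samples
    meets `confidence`; otherwise return None (leave the cell blank).

    Hypothetical helper: an illustrative rule, not BayesDB's actual logic.
    """
    if not samples:
        return None
    value, count = Counter(samples).most_common(1)[0]
    return value if count / len(samples) >= confidence else None


# Hypothetical predictive draws for one missing `num_dependents` cell.
confident_draws = [2] * 97 + [3] * 3               # 97% of draws agree on 2
uncertain_draws = [0] * 40 + [1] * 35 + [2] * 25   # no value dominates

filled = infer_with_confidence(confident_draws, 0.95)
left_blank = infer_with_confidence(uncertain_draws, 0.95)
print(filled, left_blank)
```

Under this rule the first cell is filled in and the second is left missing, which mirrors the behavior described above: values appear only where the model's predictions are concentrated enough.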

CrossCat is the Bayesian machine learning method currently used by BayesDB to answer most user queries. Unlike typical regression models, CrossCat does not assume that a given variable can be predicted from the others, nor does it assume that predictive relationships have a simple mathematical form. It is flexible enough to give robust answers on a broad class of data tables. (Readers familiar with statistics may be interested to know that CrossCat estimates the full joint distribution over the variables in the table. Take a look at the FAQ for some information on its statistical properties.) Over time, we expect BayesDB will incorporate additional models optimized for commonly occurring types of data.

**TK:** Twitter reacted pretty enthusiastically at the news of your alpha release, and a lot of commentators seemed to have high hopes for BayesDB helping lower the barriers to entry for conducting sophisticated data analysis. Do you see it this way? Is BayesDB going to make machine learning more accessible, or was it conceived more for power users?

**BDB:** Yes, you (and Twitter) got it exactly right.

BayesDB can enable users to draw robust inferences from their data tables without needing to first become experts in computational statistics. The hope is that this is a step towards making rigorous data science ubiquitous, similarly to how the relational database made reliable data storage and efficient retrieval ubiquitous. We think that doing this requires both an intuitive, simple language—ideally one that will be familiar to the traditional IT and emerging data science communities—and an underlying suite of statistical methods that are robust enough for use by non-experts.

Of course, there will always be problems where it would be irresponsible to make a forecast without consulting a professional statistician and acquiring a solid understanding of the statistical argument they are using. In these settings, BayesDB could help a domain expert user get a sense of the landscape, ask more informed questions of a professional statistician, and help the statistician save time and effort. Our FAQ discusses this issue in more depth, including ways we’ve tried to make BayesDB safer than many traditional statistical methods. We’re looking forward to engaging more with the statistics and econometrics communities around these questions.

There are potential benefits for statistics and machine learning experts, too. BayesDB could help these users quickly get first-pass results for typical problems, and work more effectively in settings where typical methods struggle, such as when there are lots of variables, few rows, and lots of missing values. Also, over time, we think the Bayesian Query Language will evolve to support customization of the underlying probabilistic model—just like SQL databases let experts customize the indexes used to make retrieval efficient. We’re interested in connecting with experts who have opinions about what these features should be.

**TK:** One of BayesDB’s most interesting capabilities is the INFER command, which can estimate and fill in missing data. The usefulness of this functionality seems clear, but have you given any thought to discouraging scientific data fabrication?

**BDB:** Thanks—INFER is one of our favorites, also. 🙂

It’s true that BayesDB might make it easier to plausibly replace subsets of data. This is similar to the fraud risk posed by computer graphics and image editing tools. It’s possible to build in some protections to BayesDB to inhibit the most direct approaches, and to study techniques for distinguishing real from filled-in data; once the core system is more robust, these might make sense to incorporate. We’re also currently working on improving the treatment of missing values—inferring the pattern of censoring, and incorporating that information into BayesDB’s predictions—in ways that might additionally mitigate this kind of risk.

**TK:** Another thing BayesDB can do is SIMULATE new data from the observed distribution. Combined with INFER, this seems like it could be a powerful tool for organizations (government agencies, for example) looking to open up access to data that might be sensitive in its raw form. Can you speak a little bit to the value of simulated data that BayesDB enables, and what sorts of applications might benefit from it?

**BDB:** As you point out, SIMULATE could be used to generate proxy datasets from sensitive sources that license similar population-level inferences. We’d love to hear from potential collaborators who might be interested in exploring this further and doing quantitative evaluations, along the lines of what’s been done in statistical genetics.

SIMULATE helps make BayesDB more transparent: users can explore what BayesDB has found by simulating from its learned models under various contingencies and comparing the results to their prior expectations.

SIMULATE also facilitates making decisions on the basis of uncertain predictions. Consider a fraud detection problem. A transaction that is almost certainly not fraudulent but which might indicate a very large fraud could be handled differently than a transaction that almost certainly represents a very small fraud, even if the expected loss is the same. In this case, it is helpful to have access to the entire simulated distribution over fraud amounts, rather than the kind of single estimate summary produced by INFER.
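The fraud example can be made concrete with a small Python sketch. The two lists of draws are invented numbers standing in for what a SIMULATE-style query might return for two transactions; the `summarize` helper and the 10,000-dollar threshold are likewise assumptions for illustration.

```python
from statistics import mean

# Hypothetical simulated fraud amounts (in dollars) for two transactions.
rare_but_large = [0] * 99 + [95_000]        # almost certainly clean, small chance of a big fraud
likely_but_small = [1_000] * 95 + [0] * 5   # almost certainly a small fraud


def summarize(draws, large_loss=10_000):
    """Summarize a list of simulated losses (hypothetical helper)."""
    return {
        "expected_loss": mean(draws),
        # Share of draws at or above the chosen "large loss" threshold.
        "p_large_loss": sum(d >= large_loss for d in draws) / len(draws),
        "worst_case": max(draws),
    }


summary_a = summarize(rare_but_large)
summary_b = summarize(likely_but_small)
print(summary_a)
print(summary_b)
```

Both transactions have the same expected loss of 950 dollars, so a single-number estimate cannot tell them apart. The tail probability and worst case, which are only available from the full set of draws, are what would justify handling them differently.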

This kind of decision-making application can be used to inform decisions about what data to gather. Say the probable effect of a policy intervention in a given county depends on demographics that are expensive to obtain. SIMULATE could be used to assess the expected value of measuring each demographic, by giving users a sense of what demographic values would be most likely and how much information about the effect of the policy would be gained by actually measuring them.
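One way to read "expected value of measuring" is the standard decision-theoretic comparison between acting on the best guess and acting after the unknown has been observed. The Python sketch below works through that comparison on made-up numbers. The demographic levels, the payoffs, and the function `value_of_measuring` are all hypothetical, and the draws stand in for what a SIMULATE-style query might produce for the unmeasured demographic.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical simulated values of an unmeasured county demographic.
draws = ["high"] * 40 + ["low"] * 60

# Hypothetical payoff of each action under each demographic value.
payoff = {
    "intervene": {"high": 100, "low": -50},
    "do_nothing": {"high": 0, "low": 0},
}


def value_of_measuring(draws, payoff):
    """Expected gain from observing the demographic before choosing an action."""
    probs = {v: Fraction(c, len(draws)) for v, c in Counter(draws).items()}
    # Best expected payoff when committing to one action without measuring.
    best_without = max(
        sum(p * payoff[action][v] for v, p in probs.items()) for action in payoff
    )
    # Expected payoff when the best action is chosen after measuring.
    best_with = sum(
        p * max(payoff[action][v] for action in payoff) for v, p in probs.items()
    )
    return best_with - best_without


evpi = value_of_measuring(draws, payoff)
print(evpi)
```

With these numbers, intervening without measuring is worth 10 in expectation, while measuring first is worth 40, so the measurement is worth paying for only if it costs less than 30 in the same units. Repeating the calculation per demographic gives a rough ranking of which ones to gather first.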

We’re curious to see what other applications of SIMULATE people find, and to integrate the most popular ones into future versions of BQL.

**TK:** CrossCat has only been around a couple of years, but aside from being the BayesDB workhorse, it also underlies [2013 Salesforce acquisition] Prior Knowledge’s predictive analytics technology. Are you working on (or planning) any other applications you can speak about? What are your next directions, besides working on BayesDB?

**BDB:** Our students at the University of Louisville and at MIT are currently taking a look at large-scale social surveys as well as genomic data from a new cancer research effort. Another extension we’ve done a little preliminary work on is handling longitudinal and time-series data: processing a stream of timestamped database updates, rather than a table, and supporting something like FORECAST in addition to INFER.

BayesDB is also part of a broader research program in probabilistic computing: software (and even some hardware) designed to explore alternative explanations for ambiguous data and identify the probable ones, as opposed to calculating the logical consequences of precise assumptions. Modern applications of computing have already shifted in this direction, but the building blocks of software and hardware haven’t caught up. So far, in addition to BayesDB, this work has produced new general-purpose probabilistic programming technology, including languages like Church and Venture. Over time, we expect BQL to incorporate ideas from these languages to enable experts to customize the models that BayesDB is using. We have also begun to put probabilistic programming and probabilistic hardware to use in computer vision. Please visit us at http://probcomp.csail.mit.edu if you’re interested in learning more, or email probcomp@lists.csail.mit.edu to get added to our mailing list for announcements.

Our whole team has been humbled and inspired by the interest in BayesDB, and we are working hard to make the system more robust, scalable and flexible. We’re actively looking for collaborators with interesting use cases, especially in areas relevant to policy, so please don’t hesitate to contact us!