5 Q’s for StumbleUpon Principal Data Scientist Debora Donato
The Center for Data Innovation spoke with Debora Donato, the Principal Data Scientist at San Francisco-based content recommendation company StumbleUpon. Donato spoke about some of StumbleUpon’s insights into different demographics’ interests, as well as the unique opportunities and challenges the mobile environment brings for data scientists.
This interview has been lightly edited for clarity.
Travis Korte: Can you introduce StumbleUpon, what it makes, and who uses it?
Debora Donato: StumbleUpon is the easiest way to discover new and interesting things across the web. Over 30 million people turn to StumbleUpon across desktop and mobile to be informed, entertained and surprised by content and information that is recommended on the basis of their declared interests and their activity.
Partners use StumbleUpon to distribute their content to influential and socially active audiences by maintaining active accounts, sharing links, employing StumbleUpon badges, and creating StumbleUpon lists. More than 100,000 publishers, brands, and marketers leverage StumbleUpon’s native advertising and promotions platform, Paid Discovery, to tell their stories, distribute their content, and sell their products and services.
Serendipity Search is the main focus of StumbleUpon. Users are not motivated by an explicit information need that they want to fulfill. Our goal is to inform and entertain people by recommending relevant content; succeeding in such a task is substantially more difficult than traditional search.
We have a diverse user base, evenly distributed between male and female users in the US, and skewing heavily toward young males in the rest of the world. The vast majority of users are in the US, but the international market is growing, and StumbleUpon now recommends content in more than 15 languages to users in Canada, the UK, India, Australia, Mexico, and Europe.
TK: Can you share some of the best predictors that a StumbleUpon user will like a particular piece of content?
DD: StumbleUpon's personalized recommendation engine uses many different signals (or predictors) at both the content and user levels. At the content level, domain and semantic features are pretty good indicators for interests like News, Video, Photography, and Food/Cooking. However, user-level features are more important than content-level ones: StumbleUpon tailors recommendations on the basis of user behavior.
Generally speaking, collaborative filtering methods perform quite well. Users are presented with content previously rated by like-minded users—that is, users who are similar on the basis of chosen interests and rated pages. We also have methods designed to exploit recent user actions, which may recommend content similar to that in which the user has recently expressed interest.
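The user-based collaborative filtering Donato describes can be sketched in a few lines. Everything below—the rating matrix, the cosine-similarity choice, the scoring rule—is an illustrative toy, not StumbleUpon's actual system:

```python
import numpy as np

# Hypothetical toy data: rows = users, columns = pages,
# 1 = thumbs-up, -1 = thumbs-down, 0 = not yet rated.
ratings = np.array([
    [1,  1,  0,  0],   # user 0: the user we recommend for
    [1,  0,  1, -1],   # user 1: similar tastes to user 0
    [-1, -1, -1, 1],   # user 2: opposite tastes
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return (a @ b) / denom if denom else 0.0

def recommend(user, ratings, k=1):
    """Score unrated pages by similarity-weighted votes of other users."""
    sims = np.array([cosine_sim(ratings[user], ratings[v]) if v != user else 0.0
                     for v in range(len(ratings))])
    scores = sims @ ratings                 # weighted vote per page
    scores[ratings[user] != 0] = -np.inf    # exclude already-rated pages
    return np.argsort(scores)[::-1][:k]     # top-k unrated pages

print(recommend(0, ratings))  # → [2]: the page liked by like-minded user 1
```

Here user 0's nearest neighbor (user 1) liked page 2 and disliked page 3, so page 2 wins; a production system would add interest-based similarity and the recency signals mentioned above.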
TK: StumbleUpon lets users select “interests” that influence what content your system recommends. Can you tell me some insights you’ve gleaned from these interests? Are there certain groups of interests that are especially popular with StumbleUpon users? Are there groups of users who use StumbleUpon for totally different purposes?
DD: Interests are vital for StumbleUpon, since they give us insights from an "infographic" perspective and also allow a better characterization of large segments of the population.
The article, “March: Time to party or to get in shape? Top cities to do both according to Stumbles,” published last year on our blog, is a good example of how interests can be used to infer peculiar characteristics of large population segments. In particular, by aggregating stumbles in the “Alcoholic Drinks” and “Nightlife” categories by device location, we discovered the top partying cities in the United States; meanwhile, by monitoring Stumble volume in the Fitness and Health categories, we could identify the cities with the healthiest habits.
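The aggregation behind that blog post amounts to counting category stumbles per city. The sketch below invents a tiny log and field layout purely for illustration; it is not StumbleUpon's schema or data:

```python
from collections import Counter

# Hypothetical stumble log entries: (city, interest_category).
stumbles = [
    ("Austin", "Nightlife"), ("Austin", "Alcoholic Drinks"),
    ("Austin", "Fitness"),   ("Denver", "Fitness"),
    ("Denver", "Health"),    ("Denver", "Fitness"),
    ("Austin", "Nightlife"),
]

def top_cities(stumbles, categories, n=1):
    """Rank cities by stumble volume within the given categories."""
    counts = Counter(city for city, cat in stumbles if cat in categories)
    return counts.most_common(n)

print(top_cities(stumbles, {"Alcoholic Drinks", "Nightlife"}))  # → [('Austin', 3)]
print(top_cities(stumbles, {"Fitness", "Health"}))              # → [('Denver', 3)]
```

A real analysis would normalize by each city's total stumble volume so large cities don't dominate the ranking.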
Interests are also important from a user-understanding point of view. We trained models to be able to predict with high accuracy users’ age and gender simply by looking at the combination of subscribed interests.
For example, teenagers prefer nature-related topics (Animal, Pets, Exotic Animals). They also rate content in Comic Books and Computer Graphics, but surprisingly they rate content in categories like Mathematics and Writing as well; such ratings are likely used to bookmark useful resources for future homework.
Young women subscribe and rate family-related topics (Babies, Kids, Parenting) but also work-related pages like Programming, Computer Networks, Embedded Systems. Cats, Dogs, Pets, and Ambient Music are the topics that collect the highest number of (normalized) ratings across all gender and age segments. Topic preference information is leveraged during the signup process to determine the set of potential interests to be pulled from past data.
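Predicting age and gender from subscribed interests, as described above, can be sketched as a classifier over interest sets. The segments, interest names, and training examples below are all invented for illustration, and the scoring rule (overlap with each segment's known interests) is far simpler than a trained model:

```python
from collections import defaultdict

# Toy training data: (demographic segment, set of subscribed interests).
train = [
    ("teen",        {"Exotic Animals", "Comic Books", "Mathematics"}),
    ("teen",        {"Pets", "Computer Graphics", "Writing"}),
    ("young_woman", {"Parenting", "Programming", "Babies"}),
    ("young_woman", {"Kids", "Embedded Systems", "Programming"}),
]

def segment_scores(interests, train):
    """Score each segment by overlap with that segment's subscribed interests."""
    seen = defaultdict(set)
    for seg, subs in train:
        seen[seg] |= subs
    return {seg: len(interests & subs) for seg, subs in seen.items()}

def predict(interests, train):
    """Return the segment whose known interests best match the user's."""
    scores = segment_scores(interests, train)
    return max(scores, key=scores.get)

print(predict({"Programming", "Babies"}, train))  # → young_woman
```

In practice one would use normalized ratings (as Donato notes for Cats, Dogs, Pets, and Ambient Music) and a proper probabilistic classifier rather than raw overlap counts.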
TK: StumbleUpon has recently made a push into the mobile space. In what ways is the mobile environment different from the ordinary web environment, from a data science perspective?
DD: The mobile environment is a constant challenge for data science. Mobile technology has evolved relentlessly, and this evolution has clearly affected user behavior, continuously shifting users' needs and expectations.
For example, at the end of 2012, we had evidence that users were unlikely to consume videos (and multimedia content in general) on mobile devices. Less than a year later, the trend was reversed, motivating the step the company made in the mobile space. The rapid pace of mobile technological changes forces data scientists to keep questioning what they know: this is the real challenge from a data science perspective.
Nowadays, mobile users activate and retain better than traditional web users. Mobile Stumbles, which account for about 40 percent of our total traffic, are going to play a decisive role in the future.
TK: Can you speak a little about the ways your machine learning processes for content recommendation are supported by human judgment? In what types of situations are humans still needed?
DD: In general we prefer not to use human evaluators, instead hoping to fully exploit data we can collect from our users. Users explicitly interact with StumbleUpon in two main ways: submitting new content and rating recommended content.
Curated content is valuable. With the exception of content submitted for self-promotion and by spammers/bots (which we are able to filter out right away), content discovered by users performs pretty well. Within each interest, we can identify a subset of users (internally called "Experts") whose submitted and rated content is usually well received by other users. Furthermore, we are able to select users who are trustworthy and likely to provide high-quality feedback, and weight their contributions more heavily than average.
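Flagging "Experts" whose submissions are consistently well received can be sketched as a threshold over per-user approval rates. The log format, names, and thresholds below are assumptions for illustration only:

```python
# Toy submission log: (submitter, rating), rating +1/-1 from other users.
log = [
    ("alice", 1), ("alice", 1), ("alice", 1), ("alice", -1),
    ("bob", 1),   ("bob", -1),  ("bob", -1),
    ("carol", 1),
]

def experts(log, min_submissions=3, min_approval=0.7):
    """Flag users whose submissions are consistently well received."""
    stats = {}  # user -> (positive ratings, total ratings)
    for user, rating in log:
        ok, total = stats.get(user, (0, 0))
        stats[user] = (ok + (rating > 0), total + 1)
    return sorted(u for u, (ok, total) in stats.items()
                  if total >= min_submissions and ok / total >= min_approval)

print(experts(log))  # → ['alice']: 3/4 approval over enough submissions
```

The minimum-submissions floor keeps a single lucky submission (carol's) from conferring expert status; a production system would likely smooth these estimates and compute them per interest.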
DD: From a machine learning perspective, we rely heavily on explicit ratings and other forms of implicit and explicit feedback to determine content quality. Although recent studies have attempted to automatically quantify the beauty, novelty, and interestingness of content, we are not interested in judging quality without considering context—that is, without assessing the content's relevance for each segment of our user base.