The Center for Data Innovation spoke with Scott Sorensen, Chief Technology Officer of the genealogy company Ancestry.com, about how the 30-year-old company uses Hadoop to manage its billions of genealogical records and how it has recently branched out into DNA testing.
Travis Korte: Ancestry was something of a “big data” company before the term existed. How did you cope with the massive datasets back then, and how have you modernized your operation since?
Scott Sorensen: Ancestry.com is a big data company that you might not normally think about, but we had massive amounts of data, both structured and unstructured, before the term "big data" came around. Today we have 10 petabytes of content, which includes 12 billion records (many of them handwritten), 55 million family trees, over 175 million user-contributed photos, stories, and scanned documents, and DNA data.
For years we used proprietary technology that we created at Ancestry.com to deliver value to our customers from this large trove of data, but as solutions like Hadoop and MapReduce have become available, we have been able to leverage these technologies to great advantage. Five years ago, a record-linking problem led us to create a machine-learned algorithm for record linking, and that brought Hadoop into our toolset. At a certain point, we had so many variables, or features, that we were trying to tune in our record-linking algorithm that it became impossible to manage by hand, so we turned to machine learning. We started to hire data scientists for machine learning about three or four years ago, and they insisted on using the tools of the trade (Hadoop, MapReduce, and R) instead of our proprietary software.
Hadoop has been an ideal solution since it allows us to parallelize the process of training an algorithm by giving every node a sample of the training data and then aggregating the results back into a global model. We use Hadoop for machine learning and DNA pipeline processing, but we also use it for advanced analytics like predictive modeling.
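The data-parallel pattern Sorensen describes can be sketched in a few lines. This is a hypothetical illustration, not Ancestry's code: each "node" fits a trivial local model (here, just the mean label) on its shard of the training data, and a reduce step combines the local models, weighted by shard size, into one global model.

```python
# Hypothetical sketch of MapReduce-style training: each node trains on its
# own shard of the data, and local models are aggregated into a global one.

def train_shard(shard):
    """Map step: fit a trivial one-parameter model (the mean label) locally."""
    labels = [label for _, label in shard]
    return sum(labels) / len(labels)

def aggregate(local_models, shard_sizes):
    """Reduce step: combine local models, weighted by shard size."""
    total = sum(shard_sizes)
    return sum(m * n for m, n in zip(local_models, shard_sizes)) / total

# Toy training data: (feature, label) pairs split across three "nodes".
shards = [
    [(1, 0.9), (2, 1.1)],
    [(3, 2.0), (4, 2.2), (5, 1.8)],
    [(6, 3.0)],
]
local = [train_shard(s) for s in shards]
global_model = aggregate(local, [len(s) for s in shards])
print(round(global_model, 2))  # → 1.83, the weighted mean over all shards
```

Weighting by shard size makes the aggregate identical to training on the pooled data, which is what makes this simple averaging scheme attractive for models where it applies.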
TK: You’ve also had to contend with digitizing and structuring very old, handwritten records. How much of these efforts are optical character recognition- (OCR-) based, and how much do you rely on human readers?
SS: The large majority of the digitized records on Ancestry.com have been transcribed by hand by our keying partners. We’ve also been able to leverage OCR technology and entity extraction to make certain collections searchable without having them indexed by hand, which can be time-consuming and costly. Take an obituary, for example. Rather than simply allowing term search, we can know that a place is a place, a name is a name, and a date is a date. In addition, we can determine relationships. For an obituary, we know which name represents the deceased, and we know the relationships between the other people in the obituary—for example, the children of the deceased. Searching a “bag of words” is far less effective than searching for terms that have semantic meaning.
By adding semantic search technology to low-performing record collections that have not been indexed by hand, we’ve been able to increase their value, in some cases bumping them up into our top ten most valuable collections out of 30,000.
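The difference between a "bag of words" search and an entity-aware search can be shown with a toy record. This is a hypothetical sketch; the field names and obituary text are invented for illustration and are not Ancestry's schema. A plain term match fires on any mention of a name, while the entity-aware match only fires when the name appears in the requested role.

```python
# Hypothetical obituary record with extracted entities (illustrative schema).
obituary = {
    "text": "John Smith of Provo died March 3, 1921, survived by his "
            "daughter Mary Smith.",
    "entities": {
        "deceased": "John Smith",
        "place": "Provo",
        "date": "1921-03-03",
        "relatives": [{"name": "Mary Smith", "relation": "daughter"}],
    },
}

def bag_of_words_match(record, term):
    # Matches any occurrence of the term, regardless of its role.
    return term.lower() in record["text"].lower()

def semantic_match(record, name=None, role="deceased"):
    # Matches only when the name appears in the requested role.
    if role == "deceased":
        return record["entities"]["deceased"] == name
    return any(r["name"] == name and r["relation"] == role
               for r in record["entities"]["relatives"])

# "Mary Smith" appears in the text, but she is not the deceased:
print(bag_of_words_match(obituary, "Mary Smith"))           # → True
print(semantic_match(obituary, name="Mary Smith"))          # → False
print(semantic_match(obituary, "Mary Smith", "daughter"))   # → True
```

A genealogist searching for an ancestor's obituary wants the first query behavior only as a fallback; the typed fields are what let a search distinguish the deceased from a surviving relative of the same name.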
Here is another example of OCR and entity extraction on our Tech Roots blog: Big Data From Historical Sources
TK: Aside from data specifically collected for genealogical records, what other sources of data has Ancestry ingested to inform the creation of family trees?
SS: In addition to records and DNA data, Ancestry.com has a product feature called Facebook Connect which builds out a family tree for a user in less than a minute, based on the relationships already defined in their account. From there, our systems begin providing record hint recommendations to help them further their discoveries. Large amounts of general historical data also allow us to put genealogical data into context.
TK: Ancestry now also offers a DNA testing service. Is that independent of the genealogical data, or do you combine the two in some way?
SS: What makes the AncestryDNA data set so unique is that it’s combined with the 12 billion records and over 55 million family trees on Ancestry.com. AncestryDNA leverages a unique collection of documented family trees, along with DNA samples, to conduct innovative research in population genetics and provide users with a breakdown of ethnicity percentages and distant cousin matches.
If we determine that you have a fourth cousin, then you likely share a common ancestor with that person between 7 and 10 generations ago, or roughly 150 to 300 years ago. The AncestryDNA product experience will then overlay your family history data with that of other potential cousins to help find the common ancestor.
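The generation-to-year conversion above is back-of-the-envelope arithmetic. A minimal sketch, assuming roughly 20 to 30 years per generation (the per-generation span is an assumption, not a figure from the interview), reproduces a range close to the quoted 150 to 300 years:

```python
# Convert a generation range to an approximate year range, assuming
# 20-30 years per generation (an illustrative assumption).

def years_back(min_generations, max_generations,
               years_per_gen_low=20, years_per_gen_high=30):
    """Return a (low, high) estimate in years for a generation range."""
    return (min_generations * years_per_gen_low,
            max_generations * years_per_gen_high)

low, high = years_back(7, 10)
print(f"Common ancestor roughly {low}-{high} years ago")  # → 140-300 years
```

The 7-generation lower bound times 20 years gives 140, close to the interview's 150; the 10-generation upper bound times 30 gives exactly 300.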
TK: To your knowledge, has the Ancestry database been mined for other information, aside from family relationships?
SS: Occasionally we have worked with PhD candidates to crawl our data. We’ve then hired them, because it’s extremely hard to find data scientists, especially ones who have experience with your data.
We are also working with the University of Minnesota to pull contextual information from the 1940 census. See more information here.