Published on March 1, 2014
Learning from Data: Who Do You Think You Are? DNA Scott Sorensen and Leonid Zhukov
Ancestry.com Mission
Discoveries: It’s the “aha” moment of a discovery that drives our business!
World’s largest online family history resource: Historical Content • Over 30,000 historical content collections • 11 billion records and images • Records dating back to the 16th century
World’s largest online family history resource: User Contributed Content • 45 million family trees • More than 4 billion profiles • 200 million stories and photos
DNA Data • Over 120,000 DNA samples • 700,000 SNPs for each sample • 2,000,000 4th cousin matches [Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism)] “Spit in a tube, pay $99, learn your past” (Derrick Harris, GigaOm)
User Behavior Data • 40 million searches / day • 10 million people added to trees / day • 5 million Hints accepted / day • 3.5 million records attached / day [Charts: time axis from 1/12 to 12/12]
Real-time data feed
Technology: Machine Learning
Person and record search • Search query
Hint suggestions system • Hints: suggestions to attach a record
Record linkage
• Record linkage: finding and matching records in multiple data sets with non-unique identifiers
• Goal: bring together information about the same person
• Some non-unique identifiers:
– Names: first name, last name (John Smith: 300,000 records)
– Dates: date of birth, date of death
– Places: place of birth, residence, place of death
– Extra: family members, life events
• Records are often incomplete
• Records contain mistakes
• Exact and fuzzy match
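The difference between an exact and a fuzzy match can be sketched as follows. This is a minimal illustration, not the matching logic used at Ancestry: the field names (`first_name`, `last_name`, `birth_year`), the similarity threshold, and the year window are all assumptions, and the standard-library `difflib.SequenceMatcher` stands in for whatever string distance the production system uses.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def exact_match(person: dict, record: dict, fields: tuple) -> bool:
    """True only if every listed field is present in both and identical."""
    return all(
        person.get(f) is not None and person.get(f) == record.get(f)
        for f in fields
    )


def fuzzy_match(person: dict, record: dict,
                name_threshold: float = 0.75, year_window: int = 2) -> bool:
    """Tolerate misspellings, nearby years, and missing fields.

    name_threshold and year_window are hypothetical tuning parameters.
    """
    for field in ("first_name", "last_name"):
        p, r = person.get(field), record.get(field)
        # A missing value cannot contradict the match, so it is skipped.
        if p and r and name_similarity(p, r) < name_threshold:
            return False
    p_year, r_year = person.get("birth_year"), record.get("birth_year")
    if p_year is not None and r_year is not None:
        if abs(p_year - r_year) > year_window:
            return False
    return True


person = {"first_name": "John", "last_name": "Smith", "birth_year": 1870}
record = {"first_name": "Jon", "last_name": "Smyth", "birth_year": 1871}
```

Here `exact_match` rejects the pair because of the spelling and year differences, while `fuzzy_match` accepts it; an incomplete record with only a last name also passes the fuzzy test.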
Life events in collections
• Life events
– Birth: 2.59 bln
– Marriage: 114 mln
– Census: 2.74 bln
– Death: 467 mln
• Total: 5.91 bln events
Candidate set funnel: exact match • John Smith: 300,000 • John Smith, 1870: 2,200 • John Smith, 1870, Boston, MA: 10 • Search: high precision
Candidate set funnel: fuzzy match • John Smith: 380,000 • John Smith, 1870: 97,000 • John Smith, 1870, Boston, MA: 1,400 • Exploration: large recall
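The two funnels can be illustrated on toy data. This sketch assumes hypothetical record fields (`name`, `birth_year`, `place`) and hypothetical fuzzy rules (a `difflib` name ratio of at least 0.7, a birth year within two years, a missing place counted as compatible); the six records are invented for the example and have nothing to do with the counts on the slides.

```python
from difflib import SequenceMatcher


def funnel(records, stages):
    """Apply filters cumulatively; return the candidate count after each stage."""
    counts = []
    candidates = list(records)
    for label, keep in stages:
        candidates = [r for r in candidates if keep(r)]
        counts.append((label, len(candidates)))
    return counts


def similar_name(name, target="john smith", threshold=0.7):
    """Hypothetical fuzzy name rule based on difflib's similarity ratio."""
    return SequenceMatcher(None, target, name.lower()).ratio() >= threshold


records = [
    {"name": "John Smith", "birth_year": 1870, "place": "Boston, MA"},
    {"name": "John Smith", "birth_year": 1870, "place": "Albany, NY"},
    {"name": "John Smith", "birth_year": 1852, "place": "Boston, MA"},
    {"name": "Jon Smyth", "birth_year": 1871, "place": "Boston, MA"},
    {"name": "J. Smith", "birth_year": 1869, "place": None},
    {"name": "Mary Jones", "birth_year": 1870, "place": "Boston, MA"},
]

exact_stages = [
    ("name", lambda r: r["name"] == "John Smith"),
    ("name + year", lambda r: r["birth_year"] == 1870),
    ("name + year + place", lambda r: r["place"] == "Boston, MA"),
]

fuzzy_stages = [
    ("name", lambda r: similar_name(r["name"])),
    ("name + year", lambda r: abs(r["birth_year"] - 1870) <= 2),
    # A missing place is treated as compatible rather than as a mismatch.
    ("name + year + place", lambda r: r["place"] in (None, "Boston, MA")),
]

exact_counts = funnel(records, exact_stages)
fuzzy_counts = funnel(records, fuzzy_stages)
```

At every stage the fuzzy funnel keeps at least as many candidates as the exact one, which is the precision versus recall trade-off the two slides describe.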
Results set [Figure labels: names edit distance, exact match, short names, initials, extended dates, missing fields]
Hints suggestion system • User feedback loop: – Accept suggestion – Reject suggestion
A place for machine learning
• Supervised machine learning
• Learn a person-to-record similarity measure (how to combine identifiers)
• Training & testing sets:
– User accepts, rejects
• Features (> 500):
– First and last name, DOB, POB, DOD, POD
– Parents, children, siblings, spouses
– Fuzzy matches
• Similar to “learning to rank” problem
[Diagram labels: ML, suggest, candidate k-set]
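Turning accept/reject feedback into a supervised training set might look roughly like this. It is a sketch under stated assumptions: the six features and all field names are hypothetical stand-ins for the 500+ features mentioned on the slide, and `-1.0` is an arbitrary sentinel chosen here for a missing birth year.

```python
from difflib import SequenceMatcher

# Hypothetical, tiny feature set; the slide mentions more than 500 features.
FEATURE_NAMES = [
    "first_name_exact", "last_name_exact", "last_name_sim",
    "birth_year_diff", "birth_place_exact", "shared_relatives",
]


def exact(a, b) -> float:
    """1.0 if both values are present and equal, else 0.0."""
    return 1.0 if a is not None and a == b else 0.0


def similarity(a, b) -> float:
    """0..1 string similarity; 0.0 when either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(person: dict, record: dict) -> dict:
    """Describe how well one tree person agrees with one historical record."""
    p_year, r_year = person.get("birth_year"), record.get("birth_year")
    year_diff = -1.0  # sentinel for "unknown" (an assumption of this sketch)
    if p_year is not None and r_year is not None:
        year_diff = float(abs(p_year - r_year))
    shared = set(person.get("relatives", ())) & set(record.get("relatives", ()))
    return {
        "first_name_exact": exact(person.get("first_name"), record.get("first_name")),
        "last_name_exact": exact(person.get("last_name"), record.get("last_name")),
        "last_name_sim": similarity(person.get("last_name"), record.get("last_name")),
        "birth_year_diff": year_diff,
        "birth_place_exact": exact(person.get("birth_place"), record.get("birth_place")),
        "shared_relatives": float(len(shared)),
    }


def training_rows(feedback):
    """Turn (person, record, accepted) triples into feature rows X and labels y."""
    X, y = [], []
    for person, record, accepted in feedback:
        features = pair_features(person, record)
        X.append([features[name] for name in FEATURE_NAMES])
        y.append(1 if accepted else 0)
    return X, y
```

The resulting `X` and `y` are in the row-per-example shape that most classifier libraries accept, so a random forest or any other model could be trained on them.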
Similarity measure learning [Diagram labels. Training: label, Person ID, Record ID, feature generation, ML (random forest), model. Scoring: Person ID, index, top-k records candidate set, feature generation, model, ranked list. Data sources: Ancestry collections, member trees. Platform: Hadoop, Hive.]
Large scale machine learning [Diagram labels: Hadoop HDFS, Hadoop streaming, Random forest (R) ×4, Model]
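The scoring half of the diagram, going from a candidate set to a ranked list of the top-k records, can be sketched as below. The slide names a random forest as the model; `toy_score` here is only a hypothetical placeholder with the same role (it maps a person and a record to a number), and the field names are assumptions.

```python
import heapq


def rank_candidates(person, candidates, score_fn, k=3):
    """Score every candidate record and return the k best as (record, score)."""
    scored = ((score_fn(person, rec), i, rec) for i, rec in enumerate(candidates))
    # Sort by score; on ties prefer the record that appeared first.
    top = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [(rec, score) for score, _, rec in top]


def toy_score(person, record):
    """Hypothetical stand-in for a trained model: fraction of agreeing fields."""
    fields = ("first_name", "last_name", "birth_year", "birth_place")
    agree = sum(
        1 for f in fields
        if person.get(f) is not None and person.get(f) == record.get(f)
    )
    return agree / len(fields)


person = {"first_name": "John", "last_name": "Smith",
          "birth_year": 1870, "birth_place": "Boston, MA"}
candidates = [
    {"id": "r3", "first_name": "John", "last_name": "Jones",
     "birth_year": 1850, "birth_place": "Albany, NY"},
    {"id": "r1", "first_name": "John", "last_name": "Smith",
     "birth_year": 1870, "birth_place": "Boston, MA"},
    {"id": "r2", "first_name": "John", "last_name": "Smith",
     "birth_year": 1870, "birth_place": "Salem, MA"},
]
ranked = rank_candidates(person, candidates, toy_score, k=2)
```

Swapping `toy_score` for a trained model's predicted acceptance probability would leave `rank_candidates` unchanged, which is why the scoring function is passed in as a parameter.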
Data: Big Data – Big Picture
Family tree • User generated family trees: – 45 mln family trees – 4.9 bln profiles
Family tree as a graph (DAG) • 2020 nodes • 572 marriage edges • 2910 family edges
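One way to hold such a graph in memory is sketched below. It assumes that family edges are directed parent-to-child links (which is what makes the structure a DAG) and that marriage edges are undirected; the slide does not spell out either convention, and the class and method names are hypothetical.

```python
from collections import defaultdict


class FamilyTree:
    """Toy family graph: directed parent->child edges plus undirected marriages."""

    def __init__(self):
        self.people = set()
        self.children = defaultdict(set)  # parent -> set of children
        self.spouses = defaultdict(set)   # person -> set of spouses

    def add_family_edge(self, parent, child):
        self.people.update((parent, child))
        self.children[parent].add(child)

    def add_marriage_edge(self, a, b):
        self.people.update((a, b))
        self.spouses[a].add(b)
        self.spouses[b].add(a)

    def family_edge_count(self):
        return sum(len(kids) for kids in self.children.values())

    def marriage_edge_count(self):
        # Each marriage is stored once per spouse, so halve the total.
        return sum(len(s) for s in self.spouses.values()) // 2

    def generations(self):
        """Number of people on the longest parent->child chain (assumes no cycles)."""
        depth = {}

        def longest_from(person):
            if person not in depth:
                kids = self.children.get(person, ())
                depth[person] = 1 + max((longest_from(c) for c in kids), default=0)
            return depth[person]

        return max((longest_from(p) for p in self.people), default=0)


tree = FamilyTree()
tree.add_marriage_edge("grandfather", "grandmother")
tree.add_family_edge("grandfather", "mother")
tree.add_family_edge("grandmother", "mother")
tree.add_marriage_edge("mother", "father")
tree.add_family_edge("mother", "child")
tree.add_family_edge("father", "child")
```

The recursive depth computation is fine for a tree of a few thousand nodes; a very deep tree would need an iterative traversal to stay within Python's recursion limit.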
Family trees
Family trees statistics • “Power law” distribution • 44 mln trees
History from family trees • 500 nodes • 700 edges • 55 generations [Axis: time]
Historical immigration to the US
• Immigration is the movement of people into a country or region to which they are not native in order to settle there
• Immigrants are those who were born outside the US and died in the US
• Based on family tree profiles:
– Birth/death dates range 1500-1990
– Select only complete profiles with FLN, POB, DOB, POD, DOD
– Perform de-duplication, remove same ancestors from different family trees
– Select only those with POB != US, POD == US
• 15 mln profiles (0.3% of 4.9 bln profiles)
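The selection steps listed above can be sketched as a single filter. The field names, the use of years instead of full dates, and de-duplication by an exact key are all assumptions of this sketch; the slides do not say how duplicate ancestors across trees were actually detected.

```python
# Hypothetical field names for FLN, POB, DOB, POD, DOD (years instead of dates).
REQUIRED = ("first_name", "last_name", "birth_place", "birth_year",
            "death_place", "death_year")


def select_immigrants(profiles, home="US", first_year=1500, last_year=1990):
    """Keep complete, de-duplicated profiles born outside `home` who died in it."""
    seen = set()
    immigrants = []
    for p in profiles:
        # Complete profiles only.
        if any(p.get(field) in (None, "") for field in REQUIRED):
            continue
        # Birth and death years must fall inside the studied range.
        if not (first_year <= p["birth_year"] <= p["death_year"] <= last_year):
            continue
        # POB != US and POD == US.
        if p["birth_place"] == home or p["death_place"] != home:
            continue
        # The same ancestor may appear in several trees; keep one copy.
        key = (p["first_name"].lower(), p["last_name"].lower(),
               p["birth_year"], p["birth_place"], p["death_year"])
        if key in seen:
            continue
        seen.add(key)
        immigrants.append(p)
    return immigrants


profiles = [
    {"first_name": "Anna", "last_name": "Berg", "birth_place": "Sweden",
     "birth_year": 1850, "death_place": "US", "death_year": 1920},
    # Same ancestor entered in a second tree with different capitalisation.
    {"first_name": "anna", "last_name": "berg", "birth_place": "Sweden",
     "birth_year": 1850, "death_place": "US", "death_year": 1920},
    {"first_name": "John", "last_name": "Smith", "birth_place": "US",
     "birth_year": 1870, "death_place": "US", "death_year": 1940},
    {"first_name": "Luigi", "last_name": "Rossi", "birth_place": "Italy",
     "birth_year": 1880, "death_place": "US", "death_year": None},
    {"first_name": "Karl", "last_name": "Weiss", "birth_place": "Germany",
     "birth_year": 1900, "death_place": "Germany", "death_year": 1970},
]
immigrants = select_immigrants(profiles)
```

Of the five toy profiles only the first survives: the second is a duplicate, the third was born in the US, the fourth is incomplete, and the fifth did not die in the US.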
Immigration to the USA 1500-1990
Immigration map
Ports of arrival (1800-1980)
Data Science • Ancestry is building a data science team • We work on product data and BI • We are hiring • Special thanks to Mercator Group for infographics
Presentation at Big Data Summit, April 2013, SF