Big Data at

Published on March 1, 2014

Author: leonidz



Presentation at Big Data Summit, April 2013, SF

Learning from Data: Who Do You Think You Are? DNA
Scott Sorensen and Leonid Zhukov

Mission

Discoveries
It’s the “aha” moment of a discovery that drives our business!

World’s largest online family history resource: Historical Content
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to the 16th century

World’s largest online family history resource: User Contributed Content
• 45 million family trees
• More than 4 billion profiles
• 200 million stories and photos

DNA Data
• Over 120,000 DNA samples
• 700,000 SNPs for each sample
• 2,000,000 4th cousin matches
Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://
Spit in a tube, pay $99, learn your past (Derrick Harris, GigaOm)

User Behavior Data
• 40 million searches / day
• 10 million people added to trees / day
• 5 million Hints accepted / day
• 3.5 million Records attached / day
[Charts: time axis from 1/12 to 12/12]

Real-time data feed

Technology
Machine Learning

Person and record search
• Search query

Hint suggestions system
• Hints: suggestions to attach a record

Record linkage
• Record linkage: finding and matching records in multiple data sets with non-unique identifiers
• Goal: bring together information about the same person
• Some non-unique identifiers:
  – Names: first name, last name (John Smith: 300,000 records)
  – Dates: date of birth, date of death
  – Places: place of birth, residence, place of death
  – Extra: family members, life events
• Records are often incomplete
• Records contain mistakes
• Exact and fuzzy match
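The difference between an exact and a fuzzy name match can be illustrated with a minimal Python sketch. It assumes edit (Levenshtein) distance as the fuzzy criterion; the function names and the `max_edits` threshold are hypothetical and are not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def exact_match(query: str, candidate: str) -> bool:
    return normalize(query) == normalize(candidate)


def fuzzy_match(query: str, candidate: str, max_edits: int = 1) -> bool:
    # max_edits is an assumed tolerance; a real system would tune it.
    return edit_distance(normalize(query), normalize(candidate)) <= max_edits
```

With these definitions, "Jon Smith" fails the exact match against "John Smith" but passes the fuzzy match, since the two names are one edit apart.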

Life events in collections
• Life events
  – Birth: 2.59 bln
  – Marriage: 114 mln
  – Census: 2.74 bln
  – Death: 467 mln
• Total: 5.91 bln events

Candidate set funnel: exact match
• John Smith: 300,000
• John Smith, 1870: 2,200
• John Smith, 1870, Boston, MA: 10
Search: high precision

Candidate set funnel: fuzzy match
• John Smith: 380,000
• John Smith, 1870: 97,000
• John Smith, 1870, Boston, MA: 1,400
Exploration: large recall
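The funnel idea (each added identifier narrows the candidate set, and fuzzy matching narrows it less) can be sketched on a toy data set. Everything here is an assumption made for illustration: the `Record` fields, the 0.8 name-similarity threshold, the 5-year date tolerance, and the rule that fuzzy matching keeps records with missing fields.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Record:
    name: str
    birth_year: Optional[int]
    birth_place: Optional[str]


def name_ok(query: str, name: str, fuzzy: bool) -> bool:
    q, n = query.lower(), name.lower()
    if not fuzzy:
        return q == n
    return SequenceMatcher(None, q, n).ratio() >= 0.8  # assumed threshold


def year_ok(query: int, year: Optional[int], fuzzy: bool) -> bool:
    if year is None:
        return fuzzy  # fuzzy mode keeps records with a missing date
    return abs(query - year) <= (5 if fuzzy else 0)  # assumed tolerance


def place_ok(query: str, place: Optional[str], fuzzy: bool) -> bool:
    if place is None:
        return fuzzy  # fuzzy mode keeps records with a missing place
    return query.lower() == place.lower()


def funnel(records: List[Record], name: str, year: int, place: str,
           fuzzy: bool) -> List[int]:
    """Return candidate-set sizes after the name, date, and place filters."""
    by_name = [r for r in records if name_ok(name, r.name, fuzzy)]
    by_year = [r for r in by_name if year_ok(year, r.birth_year, fuzzy)]
    by_place = [r for r in by_year if place_ok(place, r.birth_place, fuzzy)]
    return [len(by_name), len(by_year), len(by_place)]


# Toy records, invented for this example.
RECORDS = [
    Record("John Smith", 1870, "Boston, MA"),
    Record("John Smith", 1870, "Chicago, IL"),
    Record("John Smith", 1845, "Boston, MA"),
    Record("Jon Smith", 1872, "Boston, MA"),
    Record("John Smyth", 1870, None),
    Record("Mary Jones", 1870, "Boston, MA"),
]

exact_sizes = funnel(RECORDS, "John Smith", 1870, "Boston, MA", fuzzy=False)
fuzzy_sizes = funnel(RECORDS, "John Smith", 1870, "Boston, MA", fuzzy=True)
```

On this toy data the exact funnel ends with fewer candidates than the fuzzy one at every stage, which mirrors the precision versus recall trade-off on the two slides.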

Results set
• Exact match
• Name edit distance
• Short names, initials
• Extended dates
• Missing fields

Hints suggestion system
• User feedback loop:
  – Accept suggestion
  – Reject suggestion

A place for machine learning
• Supervised machine learning
• Learn a similarity measure: Person ↔ Record (how to combine identifiers)
• Training & testing sets:
  – User accepts, rejects
• Features (> 500):
  – First name, last name, DOB, POB, DOD, POD
  – Parents, children, siblings, spouses
  – Fuzzy matches
• Similar to a “learning to rank” problem
[Diagram labels: Candidate k-set, ML, suggest]
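How user accepts and rejects become a labeled training set can be sketched as follows. This is a simplified illustration, not the production feature set: the four field names, the 1.0 / 0.0 / 0.5 encoding, and the helper names are all hypothetical, whereas the slide mentions more than 500 features.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Profile = Dict[str, Optional[str]]

# Hypothetical subset of identifiers; the slide lists many more.
FIELDS = ("first_name", "last_name", "dob", "pob")


def pair_features(person: Profile, record: Profile) -> List[float]:
    """One feature per field: 1.0 match, 0.0 mismatch, 0.5 if a side is missing."""
    feats = []
    for field in FIELDS:
        p, r = person.get(field), record.get(field)
        if p is None or r is None:
            feats.append(0.5)
        else:
            feats.append(1.0 if p.strip().lower() == r.strip().lower() else 0.0)
    return feats


def training_rows(
    feedback: Iterable[Tuple[Profile, Profile, bool]]
) -> List[Tuple[List[float], int]]:
    """Turn (person, record, accepted) feedback into (features, label) rows."""
    return [(pair_features(p, r), 1 if accepted else 0)
            for p, r, accepted in feedback]
```

Each accepted hint yields a positive example and each rejected hint a negative one; the resulting rows could then be fed to any supervised learner.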

Similarity measure learning
• Training (diagram labels): Label, Person ID, Record ID, Feature generation, Index, Ancestry collections, Member trees, ML, Random forest, Hadoop, Hive
• Scoring (diagram labels): Person ID, Top-k records candidate set, Feature generation, Model, Ranked List
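The scoring path (candidate set, feature vectors, model score, ranked list) can be sketched in a few lines of Python. The `mean_score` function is a stand-in for the trained model and is purely an assumption; the slides name a random forest, which is not reproduced here.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, List[float]]  # (record id, feature vector)


def mean_score(features: List[float]) -> float:
    """Stand-in for a trained model: the average of the features."""
    return sum(features) / len(features) if features else 0.0


def rank_candidates(candidates: Sequence[Candidate],
                    score: Callable[[List[float]], float],
                    k: int) -> List[str]:
    """Score every candidate and return the ids of the k best, best first."""
    scored = ((score(feats), record_id) for record_id, feats in candidates)
    return [record_id for _, record_id in heapq.nlargest(k, scored)]
```

Using `heapq.nlargest` avoids sorting the whole candidate set when only the top k records are needed.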

Large scale machine learning
[Diagram labels: Hadoop HDFS, Hadoop streaming, Random forest (R) ×4, Model]

Data
Big Data – Big Picture

Family tree
• User generated family trees:
  – 45 mln family trees
  – 4.9 bln profiles

Family tree as a graph (DAG)
• 2020 nodes
• 572 marriage edges
• 2910 family edges
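One way to work with a family tree as a DAG is to count generations as the longest parent-to-child path. The sketch below assumes that family edges are directed (parent, child) pairs and that marriage edges are kept separately and do not affect depth; the slides do not spell out either convention.

```python
from collections import defaultdict, deque
from typing import Iterable, Tuple


def generations(family_edges: Iterable[Tuple[str, str]]) -> int:
    """Number of people on the longest parent -> child path."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in family_edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Topological traversal starting from people with no recorded parents.
    depth = {n: 1 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            depth[child] = max(depth[child], depth[node] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    if visited != len(nodes):
        raise ValueError("family edges contain a cycle, so this is not a DAG")
    return max(depth.values(), default=0)


# Toy tree, invented for this example.
marriage_edges = [("Ann", "Bob")]
family_edges = [("Ann", "Cal"), ("Bob", "Cal"), ("Cal", "Dee"), ("Eve", "Dee")]
```

The cycle check is useful on user-generated data, where an entry error can make a person their own ancestor.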

Family trees

Family trees statistics
• “Power law” distribution
• 44 mln trees

History from family trees
• 500 nodes
• 700 edges
• 55 generations
[Figure axis: time]

Historical immigration to the US
• Immigration is the movement of people into a country or region to which they are not native in order to settle there
• Immigrants are those who were born outside the US and died in the US
• Based on family tree profiles:
  – Birth/death dates range 1500-1990
  – Select only complete profiles with FLN, POB, DOB, POD, DOD
  – Perform de-duplication, remove same ancestors from different family trees
  – Select only those with POB != US, POD == US
• 15 mln profiles (0.3% of 4.9 bln profiles)
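The selection steps above can be sketched as a single filter. The profile layout (a dict with `name`, `pob`, `dob`, `pod`, `dod`, with dates as integer years and places as country codes) is an assumption, and de-duplication by an exact key is a simplification; matching the same ancestor across trees would in practice need fuzzier record linkage.

```python
from typing import Dict, Iterable, List

REQUIRED = ("name", "pob", "dob", "pod", "dod")  # FLN, POB, DOB, POD, DOD


def select_immigrants(profiles: Iterable[Dict], home: str = "US",
                      first_year: int = 1500,
                      last_year: int = 1990) -> List[Dict]:
    """Keep complete, de-duplicated profiles born outside `home` who died in it."""
    seen = set()
    selected = []
    for p in profiles:
        # 1. Complete profiles only.
        if any(p.get(field) in (None, "") for field in REQUIRED):
            continue
        # 2. Birth and death years inside the studied range.
        if not (first_year <= p["dob"] <= last_year
                and first_year <= p["dod"] <= last_year):
            continue
        # 3. De-duplicate the same ancestor appearing in several trees.
        key = (p["name"].strip().lower(), p["pob"], p["dob"], p["pod"], p["dod"])
        if key in seen:
            continue
        seen.add(key)
        # 4. Born outside the home country, died inside it.
        if p["pob"] != home and p["pod"] == home:
            selected.append(p)
    return selected
```

Running the filter over profiles from many trees returns each qualifying ancestor once, regardless of how many trees include that person.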

Immigration to the USA 1500-1990


Immigration map

Ports of arrival (1800-1980)

Data Science
• Ancestry is building a data science team
• We work on product data and BI
• We are hiring
• Special thanks to Mercator Group for infographics
