Big Data at Ancestry.com

Information about Big Data at Ancestry.com
Technology

Published on March 1, 2014

Author: leonidz

Source: slideshare.net

Description

Presentation at Big Data Summit, April 2013, SF

Learning from Data: Who Do You Think You Are?
Scott Sorensen and Leonid Zhukov

Ancestry.com Mission

Discoveries
It’s the “aha” moment of a discovery that drives our business!

World’s largest online family history resource
Historical Content
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to 16th century

World’s largest online family history resource
User Contributed Content
• 45 million family trees
• More than 4 billion profiles
• 200 million stories and photos

DNA Data
• Over 120,000 DNA samples
• 700,000 SNPs for each sample
• 2,000,000 4th cousin matches
[Figure: DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Singlenucleiotide_polymorphism)]
“Spit in a tube, pay $99, learn your past” (Derrick Harris, GigaOm)

User Behavior Data
• 40 million searches / day
• 10 million people added to trees / day
• 5 million Hints accepted / day
• 3.5 million Records attached / day

Real-time data feed

Technology
Machine Learning

Person and record search
• Search query

Hint suggestions system
• Hints: suggestions to attach a record

Record linkage
• Record linkage: finding and matching records in multiple data sets with non-unique identifiers
• Goal: bring together information about the same person
• Some non-unique identifiers:
  – Names: first name, last name (John Smith: 300,000 records)
  – Dates: date of birth, date of death
  – Places: place of birth, residence, place of death
  – Extra: family members, life events
• Records are often incomplete
• Records contain mistakes
• Exact and fuzzy match
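The difference between an exact and a fuzzy match on a non-unique identifier can be illustrated with a minimal Python sketch. It is not the presenters' implementation: the helper names, the normalization step, and the 0.8 threshold are assumptions chosen for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher` ratio rather than whatever Ancestry uses.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def exact_match(a, b):
    """True only when the normalized strings are identical."""
    return normalize(a) == normalize(b)


def fuzzy_match(a, b, threshold=0.8):
    """True when the similarity ratio reaches the (hypothetical) threshold."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(exact_match("John Smith", "Jon Smith"))   # spelling variant fails the exact test
    print(fuzzy_match("John Smith", "Jon Smith"))   # but passes the fuzzy test
```

A fuzzy test like this tolerates the transcription mistakes mentioned above, at the cost of admitting more false candidates.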

Life events in collections
• Life events
  – Birth: 2.59 bln
  – Marriage: 114 mln
  – Census: 2.74 bln
  – Death: 467 mln
• Total: 5.91 bln events

Candidate set funnel: exact match
• John Smith: 300,000
• John Smith, 1870: 2,200
• John Smith, 1870, Boston, MA: 10
Search: high precision

Candidate set funnel: fuzzy match
• John Smith: 380,000
• John Smith, 1870: 97,000
• John Smith, 1870, Boston, MA: 1400
Exploration: large recall
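The two funnels can be mimicked on a toy data set. The sketch below is only an illustration under stated assumptions: the five records, the field names, the 0.75 name-similarity threshold, and the two-year date tolerance are all made up, and only the name and year filters are relaxed in the fuzzy run.

```python
from difflib import SequenceMatcher

# Hypothetical toy records; the real collections hold billions of rows.
RECORDS = [
    {"name": "John Smith", "birth_year": 1870, "place": "Boston, MA"},
    {"name": "John Smith", "birth_year": 1870, "place": "Salem, MA"},
    {"name": "John Smith", "birth_year": 1852, "place": "Boston, MA"},
    {"name": "Jon Smyth", "birth_year": 1871, "place": "Boston, MA"},
    {"name": "Mary Jones", "birth_year": 1870, "place": "Boston, MA"},
]


def similar(a, b, threshold):
    """Name test: a threshold of 1.0 means the names must be identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def funnel(records, name, year, place, name_threshold=1.0, year_tolerance=0):
    """Return candidate counts after each successive filter (name, year, place)."""
    by_name = [r for r in records if similar(r["name"], name, name_threshold)]
    by_year = [r for r in by_name if abs(r["birth_year"] - year) <= year_tolerance]
    by_place = [r for r in by_year if r["place"] == place]
    return [len(by_name), len(by_year), len(by_place)]


exact_counts = funnel(RECORDS, "John Smith", 1870, "Boston, MA")
fuzzy_counts = funnel(RECORDS, "John Smith", 1870, "Boston, MA",
                      name_threshold=0.75, year_tolerance=2)

if __name__ == "__main__":
    print("exact:", exact_counts)
    print("fuzzy:", fuzzy_counts)
```

As on the slides, the fuzzy funnel keeps more candidates at every stage, which suits exploration, while the exact funnel narrows quickly, which suits search.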

Results set
[Figure: result set regions labeled exact match, name edit distance, short names, initials, extended dates, missing fields]

Hints suggestion system
• User feedback loop:
  – Accept suggestion
  – Reject suggestion

A place for machine learning
• Supervised machine learning
• Learn similarity measure (how to combine identifiers)
• Training & testing sets:
  – User accepts, rejects
• Features (> 500):
  – First, last name, DOB, POB, DOD, POD
  – Parents, children, siblings, spouses
  – Fuzzy matches
• Similar to “learning to rank” problem
[Diagram: Person ? Record; ML suggest; Candidate k-set]
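How accept/reject feedback might become a supervised training set can be sketched as follows. This is a hedged illustration, not the production feature pipeline: the slide mentions more than 500 features, while this sketch builds only five, and every field name, the two-year tolerance, and the `pair_features` and `training_rows` helpers are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(person, record):
    """Build a small feature vector for one (person, record) pair."""
    year_gap = abs(person["birth_year"] - record["birth_year"])
    return [
        name_similarity(person["first_name"], record["first_name"]),
        name_similarity(person["last_name"], record["last_name"]),
        1.0 if year_gap == 0 else 0.0,   # exact date match
        1.0 if year_gap <= 2 else 0.0,   # fuzzy date match (assumed tolerance)
        1.0 if person["birth_place"] == record["birth_place"] else 0.0,
    ]


def training_rows(feedback):
    """Turn user feedback into (features, label) rows: accept -> 1, reject -> 0."""
    return [
        (pair_features(person, record), 1 if action == "accept" else 0)
        for person, record, action in feedback
    ]
```

Rows of this shape are what a supervised learner, such as the random forest named later in the deck, would consume.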

Similarity measure learning
[Diagram, Training: member trees and Ancestry collections (index) in Hadoop / Hive; label, person ID, record ID; feature generation; ML (random forest)]
[Diagram, Scoring: person ID; top-k records candidate set; feature generation; model; ranked list]
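The scoring path (candidate set in, ranked list out) can be sketched in a few lines of Python. The `toy_score` function is a hypothetical stand-in for the trained model, and the field names are assumptions; only the top-k selection with `heapq.nlargest` is the point of the example.

```python
import heapq


def rank_top_k(person, candidates, score, k=3):
    """Return the k highest-scoring candidate records, best first."""
    return heapq.nlargest(k, candidates, key=lambda record: score(person, record))


def toy_score(person, record):
    """Hypothetical stand-in for the trained model: fraction of matching fields."""
    fields = ("last_name", "birth_year", "birth_place")
    return sum(person[f] == record.get(f) for f in fields) / len(fields)
```

In a real system the score would come from the learned similarity measure, and the candidate set from the fuzzy-match funnel.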

Large scale machine learning
[Diagram: Hadoop HDFS; Hadoop streaming; four Random forest (R) instances; Model]
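The slide names HDFS, Hadoop streaming, several random forests in R, and a single model, but it does not say how the pieces are combined. One plausible reading is that a sub-model is trained on each data partition and the sub-models are then averaged. The toy Python sketch below illustrates only that idea: it runs in one process, and a one-feature threshold learner (`train_stump`, a hypothetical helper) stands in for the R random forest.

```python
def train_stump(rows):
    """Stand-in learner: threshold halfway between the class means of one feature."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x >= threshold else 0.0


def partition(rows, n_parts):
    """Split rows round-robin, as separate workers might each see a slice."""
    return [rows[i::n_parts] for i in range(n_parts)]


def train_ensemble(rows, n_parts=4):
    """Train one sub-model per partition and average their votes at prediction time."""
    models = [train_stump(part) for part in partition(rows, n_parts)]

    def predict(x):
        return sum(model(x) for model in models) / len(models)

    return predict
```

Each partition must contain both accepted and rejected examples for this stand-in learner to work.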

Data
Big Data: Big Picture

Family tree
• User generated family trees:
  – 45 mln family trees
  – 4.9 bln profiles

Family tree as a graph (DAG)
• 2020 nodes
• 572 marriage edges
• 2910 family edges
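A family tree can be held as a small graph structure. The sketch below assumes that "family edges" are directed parent-to-child links and that marriage edges are undirected pairs; the slide does not define either, and the five-person tree is invented. It counts nodes and computes the number of generations as the longest parent-to-child path.

```python
from functools import lru_cache

# Hypothetical toy tree: (parent, child) family edges and (spouse, spouse) marriage edges.
FAMILY_EDGES = [
    ("grandpa", "father"), ("grandma", "father"),
    ("father", "child"), ("mother", "child"),
]
MARRIAGE_EDGES = [("grandpa", "grandma"), ("father", "mother")]


def build_children(edges):
    """Adjacency map from each person to the list of their children."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
    return children


def generations(children):
    """Number of generations = number of nodes on the longest parent-to-child path."""
    @lru_cache(maxsize=None)
    def depth(node):
        return 1 + max((depth(c) for c in children[node]), default=0)

    return max(depth(node) for node in children)


if __name__ == "__main__":
    tree = build_children(FAMILY_EDGES)
    print(len(tree), "nodes,", len(MARRIAGE_EDGES), "marriage edges,",
          len(FAMILY_EDGES), "family edges,", generations(tree), "generations")
```

Because parent-to-child links never loop back, the family edges form a directed acyclic graph, which is what makes the recursive depth computation terminate.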

Family trees

Family trees statistics
• “Power law” distribution
• 44 mln trees
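One common way to check a power-law claim is to estimate the exponent from the data. The sketch below uses the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)), on synthetic data. It assumes the quantity being measured is tree size, which the slide does not state; real tree sizes are integers, so the continuous form is only an approximation, and the exponent 2.5 is an arbitrary test value, not a figure from the talk.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Inverse-transform sampling from p(x) proportional to x^(-alpha), x >= x_min."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def estimate_alpha(sizes, x_min):
    """Maximum-likelihood exponent estimate using only values at or above x_min."""
    tail = [x for x in sizes if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
    print(round(estimate_alpha(sizes, x_min=1.0), 2))
```

On a log-log plot of the size histogram, a power law appears as a roughly straight line whose slope is the negative of this exponent.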

History from family trees
• 500 nodes
• 700 edges
• 55 generations
[Figure axis: time]

Historical immigration to the US
• Immigration is the movement of people into a country or region to which they are not native in order to settle there
• Immigrants are those who were born outside the US and died in the US
• Based on family tree profiles:
  – Birth/death dates range 1500-1990
  – Select only complete profiles with FLN, POB, DOB, POD, DOD
  – Perform de-duplication, remove same ancestors from different family trees
  – Select only those with POB != US, POD == US
• 15 mln profiles (0.3% of 4.9 bln profiles)
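The selection steps listed above can be sketched as a small filter pipeline. This is an illustration under assumptions: the field names, the use of years instead of full dates, the country-level place values, and the de-duplication key (all five fields together) are hypothetical, and the real job would run over billions of profiles rather than an in-memory list.

```python
REQUIRED = ("name", "pob", "dob", "pod", "dod")  # FLN, POB, DOB, POD, DOD


def select_immigrants(profiles, first_year=1500, last_year=1990):
    """Apply the slide's steps: complete, in date range, de-duplicated, POB != US, POD == US."""
    complete = [p for p in profiles
                if all(p.get(field) not in (None, "") for field in REQUIRED)]
    in_range = [p for p in complete
                if first_year <= p["dob"] and p["dod"] <= last_year]
    unique = {}
    for p in in_range:
        # Assumed de-duplication key: the same ancestor entered in different trees.
        key = (p["name"].lower(), p["dob"], p["pob"], p["dod"], p["pod"])
        unique.setdefault(key, p)
    return [p for p in unique.values() if p["pob"] != "US" and p["pod"] == "US"]
```

The 0.3% figure on the slide follows from the counts: 15 mln selected profiles out of 4.9 bln is about 0.003.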

Immigration to the USA 1500-1990


Immigration map

Ports of arrival (1800-1980)

Data Science
• Ancestry is building a data science team
• We work on product data and BI
• We are hiring
• Special thanks to Mercator Group for infographics
