Google Cluster Innards

50 %
50 %
Information about Google Cluster Innards

Published on May 20, 2007

Author: stone


Slide1:  Google Cluster Innards Martin Dvorak Agenda:  Agenda Inventing Google The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page (founders) / 1998 Cluster Anatomy Web Search for a Planet: The Google Cluster Architecture Luiz André Barroso, Jeffrey Dean and Urs Hoelzle / 2003 Google's secret of success? Dealing with failure Urs Hoelzle (Vice President of Engineering and Operations) / 2004 Programming for Google Cluster MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat (Google staff) / 2004 Slide3:  Inventing Google Inventing Google:  Inventing Google Sergey & Larry - Ph.D. students at Stanford University Prototype (1998) 24,000,000 pages (8,058,044,651 today) Google “We chose our system name, Google, because it is a common spelling of googol, or 10100 and fits well with our goal of building very large-scale search engines.” Page Rank An objective measure of its citation importance that corresponds well with people’s subjective idea of importance. Inventing Google: Foundation:  Inventing Google: Foundation PageRank*: We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) *) Larry Page Inventing Google: Foundation:  Inventing Google: Foundation Page Rank formula informally PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. High PR has a page if… there are many pages that point to it or if there are some pages that point to it and have a high PR Note recursive weight propagation through web link structure. Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one. Damping factor d is the probability at each page the "random surfer" will get bored and request another random page. Personalization  Inventing Google: Foundation:  Inventing Google: Foundation PageRank relevancy tuning Page title Anchor text Meta Font Size Weight Capitalization … Inventing Google: Anatomy:  Inventing Google: Anatomy Inventing Google: Anatomy:  Inventing Google: Anatomy URL Server Providers list of URLs to be fetched to crawlers Google Crawlers (GoogleBot) Multiple distributed crawlers Own DNS cache 300 connections open at once Send fetched pages to Store Server Originally written in Python Store Server Compresses and stores files to repository. DOCID is created for each page. Repository Stores fetched pages for further processing by Indexer Inventing Google: Anatomy:  Inventing Google: Anatomy Indexer Reads pages from Repository (uncompress) Parses each document (Flex on top of own stack): Page converted to set of Hits (position, font, capitalization, title/achor/meta) / 2B Added to Document Index Hits are distributed to Barrels (i.e. one document to multiple barrels) Every link found in page is stored to Anchors file Forward and Inverted Barrels (2*64) Forward Index Barrel keeps range of Hits sorted by DOCIDs (DOCID, (WORDID, word’s Hit reference+)+) Processed by Sorter: Generates inverted index from forward index – sorts Hits by WORDIDs Creates (WORDID, offsets) used by Lexicon Inverted Index (short/full) (WORDID, (DOCID reference, Hit list reference)+)) Short: DOCIDs sorted by/contains just quality Hits (word in title, anchor,...); optimal single word search Full: DOCIDs sorted by DOCID; optimal Hit lists merging i.e. multi-word search Anchors file Anchor (from, to, text) URL Resolver Reads anchors file: Relation 2 absolute URL conversion + DOCID assignment Creates links file Links file (url, target: DOCID) Inventing Google: Anatomy:  Inventing Google: Anatomy Searcher uses… Lexicon Keeps map saying which Barrel to use. Originally kept in memory (256MB). IMHO now must be used something like Multi-level VM Page Table It is is/was of fixed size (14,000,000 words) Barrels Each barrel keeps range of WORDIDs WORID 2 DOCID map PageRank pool Keeps counted page rank for each DOCID Doc Index DOCID ordered information about each document (DOCID, status, repository pointer, checksum, stat, URL, title) Slide12:  Cluster Innards Cluster Innards: Global Google:  Cluster Innards: Global Google Over 30 Google clusters around the world. DNS based & geo location driven load-balancing: Domain Name: GOOGLE.COM Registrar: ALLDOMAINS.COM INC. Whois Server: Referral URL: Name Server: NS2.GOOGLE.COM Name Server: NS1.GOOGLE.COM Name Server: NS3.GOOGLE.COM Name Server: NS4.GOOGLE.COM Status: REGISTRAR-LOCK Updated Date: 03-oct-2002 Creation Date: 15-sep-1997 Expiration Date: 14-sep-2011 2005, May 7: Google DNS hack speculations Total PCs > 5,000 in 2000 >15,000 in 2003 >79,000* in 2004 *) I’m not sure about this number, it was taken from an external resource. Cluster Innards: HW:  Cluster Innards: HW Basics cluster design insights Reliability in SW rather then server-class HW. Commodity PCs used to build high-end computing cluster at a low end prices. Example: $287,000 – 176x 2GHz Xeon, 176GB RAM, 7TB HDD $758,000 – 8x 2GHZ Xeon, 64GB RAM, 8TB HDD Design is tailored for best aggregate request throughput, not peak server response time – individual request parallelization. Google has inexpensively built out its computing infrastructure by using thousands of "commodity" servers <2,000 servers in single cluster. Dual-processor x86 servers (starting at 533MHz Celeron) with 2-4 GB of memory per machine, 1+ 80GB IDE drive. Rack: 40-80 of x86-based servers. Cluster Innards: HW:  Cluster Innards: HW Optimistically, a consumer PC might crash once in three years from a software glitch or hardware problem. "At Google scale...if you have thousands of PCs, you can expect one (failure) a day,…" 1,000,000s not 1,000,000,000s of dollars. “The trick is to make these racks of hardware work together and to ensure that the failure of one machine doesn't derail an operation.” Switched Ethernet Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth. Locality optimizations (GFS) Cluster Innards: SW:  Cluster Innards: SW Stripped-down version of Linux, which is based on the Red Hat distribution but is really just the operating system kernel modified for Google. Google File System is optimized for handling large blocks of data. 64MB block The file system was designed to assume that a failure, such as a failed disk or unplugged network cable, can happen at any time. Data is replicated in three places, and there is a "master" machine that can locate copies of a piece of data, such as a keyword index, if the original is out of commission. Google has created "batch" job scheduling software that acts as a sort of taskmaster for millions of operations called the Global Work Queue. Another important engineering feat done by Google is to make writing programs that run across thousands of servers very straightforward… Slide17:  Programming for Cluster Programming For Cluster:  Programming For Cluster Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets. Automates the task of recovering a program in case of a failure. It is critical to keeping the company's costs down. MR in brief: Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Features Functional style programming. Automatic parallelization. Programming For Cluster:  Programming For Cluster Map Reduce runtime… takes care of the details of partitioning the input data scheduling the program's execution across a set of machines, handling machine failures managing the required inter-machine communication. …and more. MR hides machines the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Therefore even programmers without any experience with parallel and distributed systems can easily utilize the resources of a large distributed system. Numbers… TB of data processed (>20) On 1,000s of machines 100s of MR programs in place Programming For Cluster:  Programming For Cluster Special purpose computations examples: Inputs: crawled documents web request logs … Outputs: inverted indices various representations of the graph structure of web documents summaries of the number of pages crawled per host (dumper) the set of most frequent queries in a given day, etc. Programming For Cluster:  Programming For Cluster LISP roots Remind map and reduce primitives: map(func, list, ...) Creates new list from the results of applying func to each element of each list. There must be one list per argument to the function. map(lambda x, y: x+y, [1,2],[3,4]) --> [4,6] map(None, [1,2],[3,4]) --> [[1,3],[2,4]] reduce(func, list {,init}) Applies func to each pair of items in turn. The results are accumulated. reduce(lambda x, y: x+y, [1,2,3,4],5) --> 15 reduce(lambda x, y: x&y, [1,0,1]) --> 0 reduce(None, [], 1) --> 1 Programming model Key/values 2 key/values Map & reduce functions written by user are linked with MR library. map (k1,v1)  list(k2,v2) reduce (k2,list(v2))  list(v2) Input/output file, tuning parameters, … Programming For Cluster:  Programming For Cluster Example (pseudocode): map(String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); Programming For Cluster:  Programming For Cluster Example solves the problem of counting the number of occurrences of each word in a large collection of documents. More examples: Distributed Grep Map: (URL, true); Reduce: id Count of URL Access Frequency Access page log is used as input for map. Map: (URL,1); Reduce: (URL; total count) Reverse Web-Link Graph Link database is processed by map. Map: (target; source); Reduce: (target; list(source)) Term-Vector per Host A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs. Hostname is determined for each document by map and term vector created (so there is multiple entries); reduce function then merges all entries associated with particular host and throws away infrequent terms. Map: (hostname; term vector); Reduce: (hostname; term vector) Programming For Cluster:  Programming For Cluster Programming For Cluster:  Programming For Cluster Also… Multiple tasks performed by single worker (load-balancing) Master idle, in-progress, completed (workers) worker failure (re-execution) master failure (rare, restart) MR locality optimization Save network bandwidth GFS 64MB block replication MR scheduler takes GFS replicas location into account Stragglers Overload, HW problems, etc. Backup task – fork twin execution for straggler Task granularity Number of workers driven by M&R pairs number For example: M=200,000; R=5,000 using 2,000 workers Slide26:  Putting Things Together I’m Feeling Lucky:  I’m Feeling Lucky Pre-phase: Browser requests e.g. DNS-based load-balancing selects cluster according to the geographical location of the user & actual cluster utilization The rest of the evaluation is entirely local to the that cluster Phase 1: Index servers... Parse the query Perform spell-check and fork Ad task Convert words into WORDIDs Choose inverted Barrel(s) using Lexicon Barrel index is formed by number of servers whose data are randomly distributed and replicated (full index/index shards) so search is highly parallelizable Inverted barrel maps each query word to a matching list of documents (Hit list) Seek to the start of the document list in the short barrel for every word (multiple tasks) Scan through document list until there is document that matches all search terms If we are in the short barrels and at the end of any document list, seek to the start of the document list to the full barrel for every word and go to the step 1 If we are not at the end of any document list, go to the step 1 Sort the DOCIDs that have matched Phase 2: Document servers... For each DOCID compute actual title, URL and query-specific document summary (matched words context). Document servers are used to dispatch this completion – also documents are randomly distributed and replicated, so the completion is highly parallelizable Slide28:  Bonus Stanford lab (around 1996):  Stanford lab (around 1996) The Original Google Storage: 10x4GB (1996):  The Original Google Storage: 10x4GB (1996) Google San Francisco (2004):  Google San Francisco (2004) A cluster of coolness Google History :  A cluster of coolness Google History Google Results Page Per Day :  Google Results Page Per Day References:  References Sergey Brin, Lawrence Page; The Anatomy of a Large-Scale Hypertextual Web Search Engine; 1998 Luiz André Barroso, Jeffrey Dean and Urs Hoelzle:Web Search for a Planet: The Google Cluster Architecture; 2003 Urs Hoelzle:Google's secret of success? Dealing with failure; 2004 Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters; 2004 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System; 2003 GoogleBot Uniquely Google ™ Stanford Gadgets Google hacked?

Add a comment

Related presentations

Related pages

Google Alerts Aktienkurse - forex scalping markt tiefe

google alerts aktienkurse Wir verfügen über umfangreiche Erfahrungen mit Währungen und habe noch ... Innards Cluster Dispatch google alerts ...
Read more

MindForger: 7/10/05 - 7/17/05

As a part of my personal research related to Vodyanoi, I created a presentation describing Google cluster internals - SW, HW and programming methodology.
Read more 17 Distributed Systems and Web ...

17 Distributed Systems and Web Scalability Resources. ... Google Cluster Innards
Read more

CassandraCluster using FailoverOperator - Google Groups

CassandraCluster using FailoverOperator Showing 1-4 of 4 messages. ... innards, etc. I did this because I was running out of time, and was getting
Read more

Eric Brosch | Facebook

Eric Brosch is on Facebook. Join Facebook to connect with Eric Brosch and others you may know. ... Living Towers, Google Cluster Innards, ...
Read more

Blog de Sistemas Distribuidos - HPC-HA ...

12t14 7203972022966351912 askqtp blogger blogspot cluster clusterknoppix column colums curso distribuidos express finally google hardware innards liate ...
Read more

Saiyed Atiq Raza (@merlin1x) | Twitter

The latest Tweets from Saiyed Atiq Raza (@merlin1x): " #INDvAUS"
Read more