Information about Measurement and modeling of the web and related data sets

- Web Measurement

- Self similarity on the web

- Extraction of information from large graphs

- A word on evolution

- Self similarity on the web

- Extraction of information from large graphs

- A word on evolution

Setup This hour: data analysis on the web Next hour: probabilistic generative models, particularly focused on models that generate distributions that are power laws in the limit

This hour: data analysis on the web

Next hour: probabilistic generative models, particularly focused on models that generate distributions that are power laws in the limit

Context Data Analysis on the web… … as a hyperlinked corpus Note: Many areas of document analysis are highly relevant to the web, and should not be ignored (but will be): Supervised/unsupervised classification (Jon – combinatorial side) Machine learning (Jon – a little) Information retrieval (Jon – dimensionality reduction) Information extraction NLP Discourse analysis Relationship induction etc

Data Analysis on the web…

… as a hyperlinked corpus

Note: Many areas of document analysis are highly relevant to the web, and should not be ignored (but will be):

Supervised/unsupervised classification (Jon – combinatorial side)

Machine learning (Jon – a little)

Information retrieval (Jon – dimensionality reduction)

Information extraction

NLP

Discourse analysis

Relationship induction

etc

Focus Areas Web Measurement Self similarity on the web Extraction of information from large graphs A word on evolution

Web Measurement

Self similarity on the web

Extraction of information from large graphs

A word on evolution

One view of the Internet: Inter-Domain Connectivity Core: maximal clique of high-degree nodes Shells: nodes in 1-neighborhood of core, or of previous shell, with degree > 1 Legs: 1-degree nodes Core Shells: 1 2 3 [Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]

Core: maximal clique of high-degree nodes

Shells: nodes in 1-neighborhood of core, or of previous shell, with degree > 1

Legs: 1-degree nodes

Another view of the web: the hyperlink graph Each static html page = a node Each hyperlink = a directed edge Currently ~10 10 nodes (mostly junk), 10 11 edges

Each static html page = a node

Each hyperlink = a directed edge

Currently ~10 10 nodes (mostly junk), 10 11 edges

Getting started – structure at the hyperlink level Measure properties of the link structure of the web. Study a sample of the web that contains a reasonable fraction of the entire web. Apply tools from graph theory to understand the structure. [Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]

Measure properties of the link structure of the web.

Study a sample of the web that contains a reasonable fraction of the entire web.

Apply tools from graph theory to understand the structure.

Terminology SCC – strongly connected component WCC – “weakly connected component” – connected component in the underlying undirected graph

SCC – strongly connected component

WCC – “weakly connected component” – connected component in the underlying undirected graph

Data Altavista crawls, up to 500M pages Ran strong and weak connected component algorithms Ran random directed breadth-first searches from 1000 starting nodes, both forwards and backwards along links

Altavista crawls, up to 500M pages

Ran strong and weak connected component algorithms

Ran random directed breadth-first searches from 1000 starting nodes, both forwards and backwards along links

Breadth-first search from random starts How many vertices are reachable from a random vertex?

How many vertices are reachable from a random vertex?

A Picture of (~200M) pages.

Some distance measurements Pr[ u reachable from v ] ~ 1/4 Max distance between 2 SCC nodes: 28 Max distance between 2 nodes (if there is a path) > 900 Avg distance between 2 SCC nodes: 16

Pr[ u reachable from v ] ~ 1/4

Max distance between 2 SCC nodes: 28

Max distance between 2 nodes (if there is a path) > 900

Avg distance between 2 SCC nodes: 16

Facts (about the crawl). Indegree and Outdegree distributions satisfy the power law. Consistent over time and scale. The distribution of indegrees on the web is given by a Power Law --- Heavy-tailed distribution, with many high-indegree pages (eg, Yahoo)

Indegree and Outdegree distributions satisfy the power law. Consistent over time and scale.

Analysis of power law Pr [ page has k inlinks ] =~ k -2.1 Pr [ page has > k inlinks ] =~ 1/ k Pr [ page has k outlinks ] =~ k -2.7 Corollary:

Component sizes. Component sizes are distributed by the power law.

Component sizes are distributed by the power law.

Other observed power laws in the web Depths of URLs Sizes of sites Eigenvalues of adjacency matrix of hyperlink graph [Mihail and Papadimitriou shed some light here] Many different traffic measures Linkage between hosts and domains Many of the above measures on particular subsets of the graph … [Faloutsos, Faloutsos, Faloutsos 99] [Bharat, Chang, Henzinger, Ruhl 02]

Depths of URLs

Sizes of sites

Eigenvalues of adjacency matrix of hyperlink graph [Mihail and Papadimitriou shed some light here]

Many different traffic measures

Linkage between hosts and domains

Many of the above measures on particular subsets of the graph

…

More Characterization: Self-Similarity

Ways to Slice the Web Domain (*.it) Host (www.ibm.com) Geography (pages with a geographical reference in the Western US) Content Keyword: Math, subdivided by Math Geometry Keyword: MP3, subdivided by MP3 Napster We call these slices “Thematically Unified Communities”, or TUCs

Domain (*.it)

Host (www.ibm.com)

Geography (pages with a geographical reference in the Western US)

Content

Keyword: Math, subdivided by Math Geometry

Keyword: MP3, subdivided by MP3 Napster

Self-Similarity on the Web Pervasive: holds for all reasonable characteristics Robust: holds for all reasonable slices “ Theorem:” TUCs share properties with the web at large TUCs are linked by a “navigational backbone”

Pervasive: holds for all reasonable characteristics

Robust: holds for all reasonable slices

“ Theorem:”

TUCs share properties with the web at large

TUCs are linked by a “navigational backbone”

In particular… All TUCs have: Power laws for degree, SCC, and WCC distributions Similar exponents for power laws Similar “bow tie” structure Large number of dense subgraphs

All TUCs have:

Power laws for degree, SCC, and WCC distributions

Similar exponents for power laws

Similar “bow tie” structure

Large number of dense subgraphs

Is this surprising? YES (for downsampling general graphs). Example: This graph has 1 SCC containing all nodes Remove any nonzero fraction of edges – graph has n components of size 1 Generally: random subset of size n 1/2 in a graph with O( n ) edges will have only constant number of edges

YES (for downsampling general graphs). Example:

This graph has 1 SCC containing all nodes

Remove any nonzero fraction of edges – graph has n components of size 1

Generally: random subset of size n 1/2 in a graph with O( n ) edges will have only constant number of edges

A structural explanation Each TUC has a “bow tie” – how do they relate?

Each TUC has a “bow tie” – how do they relate?

The Navigational Backbone Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs

Information Extraction from Large Graphs

Overview WWW Distill KB1 KB2 KB3 Goal: Create higher-level "knowledge bases" of web information for further processing. [Kumar, Raghavan, Rajagopalan, Tomkins 1999]

Many approaches to this problem Databases over the web: Web SQL, Lore, ParaSite, etc Data mining A priori, Query flocks, etc Information foraging Community extraction [Lawrence et al] Authority-based search HITS, and variants

Databases over the web:

Web SQL, Lore, ParaSite, etc

Data mining

A priori, Query flocks, etc

Information foraging

Community extraction

[Lawrence et al]

Authority-based search

HITS, and variants

General approach It’s hard (though getting easier) to analyze the content of all pages on the web It’s easier (though still hard) to analyze the graph How successfully can we extract useful semantic knowledge (ie, community structure) from links alone?

It’s hard (though getting easier) to analyze the content of all pages on the web

It’s easier (though still hard) to analyze the graph

How successfully can we extract useful semantic knowledge (ie, community structure) from links alone?

Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP Different communities appear to have very different structure.

Web Communities Fishing Outdoor Magazine Bill's Fishing Resources Linux Linux Links LDP But both contain a common “footprint”: two pages ( ) that both Point to three other pages in common ( )

Communities and cores Example K 2,3 Definition: A "core" K ij consists of i left nodes, j right nodes, and all left->right edges. Critical facts: 1. Almost all communities contain a core [expected] 2. Almost all cores betoken a community [unexpected]

Other footprint structures Newsgroup thread Web ring Corporate partnership Intranet fragment

Subgraph enumeration Goal: Given a graph-theoretic "footprint" for structures of interest, find ALL occurrences of these footprints.

Goal: Given a graph-theoretic "footprint" for structures of interest, find ALL occurrences of these footprints.

Enumerating cores a a belongs to a K 2,3 if and only if some node points to b1, b2, b3. b2 b1 b3 Inclusion/Exclusion Pruning Clean data by removing: mirrors (true and approximate) empty pages, too-popular pages, nepotistic pages Preprocessing When no more pruning is possible, finish using database techniques Postprocessing

Results for cores 3 5 7 9 0 20 40 60 80 100 Thousands i=3 i=4 i=5 i=6 Number of cores found by Elimination/Generation 3 5 7 9 0 20 40 60 80 Thousands i=3 i=4 Number of cores found during postprocessing

The cores are interesting (1) Implicit communities are defined by cores. (2) There are an order of magnitude more of these. (10 5+ ) (3) Can grow the core to the community using further processing. Explicit communities. Yahoo!, Excite, Infoseek webrings news groups mailing lists Implicit communities japanese elementary schools turkish student associations oil spills off the coast of japan australian fire brigades

Yahoo!, Excite, Infoseek

webrings

news groups

mailing lists

japanese elementary schools

turkish student associations

oil spills off the coast of japan

australian fire brigades

Elementary Schools in Japan The American School in Japan The Link Page ‰ ªès—§ˆä“c¬ŠwZƒz[ƒƒy[ƒW Kids' Space ˆÀés—§ˆÀé¼•”¬ŠwZ ‹ {é‹³ˆç‘åŠw•‘®¬ŠwZ KEIMEI GAKUEN Home Page ( Japanese ) Shiranuma Home Page fuzoku-es.fukui-u.ac.jp welcome to Miasa E&J school _“ÞìŒ§E‰¡•ls—§’†ì¼¬ŠwZ‚Ìƒy http://www...p/~m_maru/index.html fukui haruyama-es HomePage Torisu primary school goo Yakumo Elementary,Hokkaido,Japan FUZOKU Home Page Kamishibun Elementary School... schools LINK Page-13 “ ú–{‚ÌŠwZ a‰„¬ŠwZƒz[ƒƒy[ƒW 100 Schools Home Pages (English) K-12 from Japan 10/...rnet and Education ) http://www...iglobe.ne.jp/~IKESAN ‚ l‚f‚j¬ŠwZ‚U”N‚P‘g•¨Œê ÒŠ—’¬—§ÒŠ—“Œ¬ŠwZ Koulutus ja oppilaitokset TOYODA HOMEPAGE Education Cay's Homepage(Japanese) – y“ì¬ŠwZ‚Ìƒz[ƒƒy[ƒW UNIVERSITY ‰ J—³¬ŠwZ DRAGON97-TOP Â‰ª¬ŠwZ‚T”N‚P‘gƒz[ƒƒy[ƒW ¶µ°é¼ÂÁ© ¥á¥Ë¥å¡¼ ¥á¥Ë¥å¡¼

The American School in Japan

The Link Page

‰ ªès—§ˆä“c¬ŠwZƒz[ƒƒy[ƒW

Kids' Space

ˆÀés—§ˆÀé¼•”¬ŠwZ

‹ {é‹³ˆç‘åŠw•‘®¬ŠwZ

KEIMEI GAKUEN Home Page ( Japanese )

Shiranuma Home Page

fuzoku-es.fukui-u.ac.jp

welcome to Miasa E&J school

_“ÞìŒ§E‰¡•ls—§’†ì¼¬ŠwZ‚Ìƒy

http://www...p/~m_maru/index.html

fukui haruyama-es HomePage

Torisu primary school

goo

Yakumo Elementary,Hokkaido,Japan

FUZOKU Home Page

Kamishibun Elementary School...

schools

LINK Page-13

“ ú–{‚ÌŠwZ

a‰„¬ŠwZƒz[ƒƒy[ƒW

100 Schools Home Pages (English)

K-12 from Japan 10/...rnet and Education )

http://www...iglobe.ne.jp/~IKESAN

‚ l‚f‚j¬ŠwZ‚U”N‚P‘g•¨Œê

ÒŠ—’¬—§ÒŠ—“Œ¬ŠwZ

Koulutus ja oppilaitokset

TOYODA HOMEPAGE

Education

Cay's Homepage(Japanese)

– y“ì¬ŠwZ‚Ìƒz[ƒƒy[ƒW

UNIVERSITY

‰ J—³¬ŠwZ DRAGON97-TOP

Â‰ª¬ŠwZ‚T”N‚P‘gƒz[ƒƒy[ƒW

¶µ°é¼ÂÁ© ¥á¥Ë¥å¡¼ ¥á¥Ë¥å¡¼

So… Possible to extract order-of-magnitude more communities than currently known. Few (4%) of these appear coincidental. Entirely automatic extraction. Open question: how to use implicit communities?

Possible to extract order-of-magnitude more communities than currently known.

Few (4%) of these appear coincidental.

Entirely automatic extraction.

Open question: how to use implicit communities?

A word on evolution

A word on evolution Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity” Model source as multi-state Each state has certain emission properties Traversal between states is controlled by a Markov Model Determine most likely underlying state sequence over time, given observable output [Kleinberg02]

Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity”

Model source as multi-state

Each state has certain emission properties

Traversal between states is controlled by a Markov Model

Determine most likely underlying state sequence over time, given observable output

Example Time I’ve been thinking about your idea with the asparagus… Uh huh I think I see… Uh huh Yeah, that’s what I’m saying… So then I said “Hey, let’s give it a try” And anyway she said maybe, okay? Most likely “hidden” sequence: 0.005 1 2 0.01 State 1: Output rate: very low State 2: Output rate: very high Pr[2] ~ 10 Pr[2] ~ 10 Pr[2] ~ 7 Pr[2] ~ 2 Pr[2] ~ 5 Pr[2] ~ 2 Pr[2] ~ 5 Pr[1] ~ 2 Pr[1] ~ 1 Pr[1] ~ 2 Pr[1] ~ 10 Pr[1] ~ 5 Pr[1] ~ 10 Pr[1] ~ 1 2 2 2 1 1 1 1

More bursts Infinite chain of increasingly high-output states Allows hierarchical bursts Example 1: email messages Example 2: conference titles

Infinite chain of increasingly high-output states

Allows hierarchical bursts

Example 1: email messages

Example 2: conference titles

Integrating bursts and graph analysis Wired magazine publishes an article on weblogs that impacts the tech community Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption [KNRT03] Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace Number of blog pages that belong to a community Number of blog communities

IMA Tutorial (part III): Generative and probabilistic models of data May 5, 2003 Title slide

Probabilistic generative models Observation: These distributions have the same form: Fraction of laptops that fail catastrophically during tutorials, by city Fraction of pairs of shoes that spontaneously de-sole during periods of stress, by city Conclusion: The distribution arises because the same stochastic process is at work, and this process can be understood beyond the context of each example

Observation: These distributions have the same form:

Fraction of laptops that fail catastrophically during tutorials, by city

Fraction of pairs of shoes that spontaneously de-sole during periods of stress, by city

Conclusion: The distribution arises because the same stochastic process is at work, and this process can be understood beyond the context of each example

Models for Power Laws Power laws arise in many different areas of human endeavor, the “hallmark of human activity” (they also occur in nature) Can we find the underlying process (processes?) that accounts for this prevalence?

Power laws arise in many different areas of human endeavor, the “hallmark of human activity”

(they also occur in nature)

Can we find the underlying process (processes?) that accounts for this prevalence?

An Introduction to the Power Law Definition: a distribution is said to have a power law if Pr[ X >= x ] cx Normally: 0< <=2 (Var(X) infinite) Sometimes: 0< <=1 (Mean(X) infinite) Exponentially-decaying distribution Power law distribution

Definition: a distribution is said to have a power law if Pr[ X >= x ] cx

Normally: 0< <=2 (Var(X) infinite)

Sometimes: 0< <=1 (Mean(X) infinite)

Early Observations: Pareto on Income [Pareto1897] observed that the random variable I denoting the income of an individual has a power law distribution More strongly, he observed that Pr[ X>x ] = ( x/k) For density function f , note that ln f ( x ) = (- -1)ln( x ) + c for constant c Thus, in a plot of log(value) versus log(probability), power laws display a linear tail, while Pareto distributions are linear always.

[Pareto1897] observed that the random variable I denoting the income of an individual has a power law distribution

More strongly, he observed that Pr[ X>x ] = ( x/k)

For density function f , note that ln f ( x ) = (- -1)ln( x ) + c for constant c

Thus, in a plot of log(value) versus log(probability), power laws display a linear tail, while Pareto distributions are linear always.

Early Observations: Yule/Zipf [Yule26] observed (and explained) power laws in the context of number of species within a genus [Zipf32] and [Estoup16] studied the relative frequency of words in natural language, beginning a cottage industry that continues to this day. A “Yule-Zipf” distribution is typically characterized by rank rather than value: The i th most frequent word in English occurs with probability proportional to 1/i . This characterization relies on finite vocabulary

[Yule26] observed (and explained) power laws in the context of number of species within a genus

[Zipf32] and [Estoup16] studied the relative frequency of words in natural language, beginning a cottage industry that continues to this day.

A “Yule-Zipf” distribution is typically characterized by rank rather than value:

The i th most frequent word in English occurs with probability proportional to 1/i .

This characterization relies on finite vocabulary

Early Observations: Lotka on Citations [Lotka25] presented the first occurrence of power laws in the context of graph theory, showing a power law for the indegree of the citation graph

[Lotka25] presented the first occurrence of power laws in the context of graph theory, showing a power law for the indegree of the citation graph

Ranks versus Values Commonly encountered phrasings of the power law in the context of word counts: Pr[word has count >= W ] has some form Number of words with count >= W has some form The frequency of the word with rank r has some form The first two forms are clearly identical. What about the third?

Commonly encountered phrasings of the power law in the context of word counts:

Pr[word has count >= W ] has some form

Number of words with count >= W has some form

The frequency of the word with rank r has some form

The first two forms are clearly identical.

What about the third?

Equivalence of rank versus value formulation Given: number of words occurring t times ~ t Approach: Consider single most frequent word, with count T Characterize word occurring t times in terms of T Approximate rank of words occurring t times by counting number of words occurring at each more frequent count. Conclusion: Rank- j word occurs (c j + d) times (power law) But... high ranks correspond to low values – must keep straight the “head” and the “tail” [Bookstein90, Adamic99]

Given: number of words occurring t times ~ t

Approach:

Consider single most frequent word, with count T

Characterize word occurring t times in terms of T

Approximate rank of words occurring t times by counting number of words occurring at each more frequent count.

Conclusion: Rank- j word occurs (c j + d) times (power law)

But... high ranks correspond to low values – must keep straight the “head” and the “tail”

Early modeling work The characterization of power laws is a limiting statement Early modeling work showed approaches that provide the correct form of the tail in the limit Later work introduced the rate of convergence of a process to its limiting distribution

The characterization of power laws is a limiting statement

Early modeling work showed approaches that provide the correct form of the tail in the limit

Later work introduced the rate of convergence of a process to its limiting distribution

A model of Simon Following Simon [1955], described in terms of word frequences Consider a book being written. Initially, the book contains a single word, “the.” At time t , the book contains t words. The process of Simon generates the t+1 st word based on the current book.

Following Simon [1955], described in terms of word frequences

Consider a book being written. Initially, the book contains a single word, “the.”

At time t , the book contains t words. The process of Simon generates the t+1 st word based on the current book.

Constructing a book: snapshot at time t When in the course of human events, it becomes necessary… Current word frequencies: Let f(i,t) be the number of words of count i at time t Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”

The Generative Model Assumptions: Constant probability that a neologism will be introduced at any timestep Probability of re-using a word of count i is proportional to if(i,t) , that is, number of occurrences of count i words. Algorithm: With probability a new word is introduced into the text With remaining probability, a word with count i is introduced with probability proportional to if(i,t)

Assumptions:

Constant probability that a neologism will be introduced at any timestep

Probability of re-using a word of count i is proportional to if(i,t) , that is, number of occurrences of count i words.

Algorithm:

With probability a new word is introduced into the text

With remaining probability, a word with count i is introduced with probability proportional to if(i,t)

Constructing a book: snapshot at time t Current word frequencies: Let f(i,t) be the number of words of count i at time t Pr[“the”] = (1- ) 1000 / K Pr[“of”] = (1- ) 600 / K Pr[some count-1 word] = (1- ) 1 * f(1,t) / K K = if(i,t) Count Word Rank 11,325 4,791 … 3 2 1 “ ...” “ ...” 5 “ necessary” 1 “ neccesary” … “ ...” 300 “ from” 600 “ of” 1000 “ the”

What’s going on? One unique word (which occurs 1 or more times) 1 2 3 4 5 6 Each word in bucket i occurs i times in the current document … .

What’s going on? 1 With probability a new word is introduced into the text 2 3 4 5 6

What’s going on? 1 4 How many times do words in this bucket occur? With probability 1- an existing word is reused 2 3 5 6

What’s going on? 2 3 4 Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t ? ? Must show: fraction of balls in 3 rd bucket approaches some limiting value

Models for power laws in the web graph Retelling the Simon model: “preferential attachment” Barabasi et al Kumar et al Other models for the web graph: [Aiello, Chung, Lu], [Huberman et al]

Retelling the Simon model: “preferential attachment”

Barabasi et al

Kumar et al

Other models for the web graph:

[Aiello, Chung, Lu], [Huberman et al]

Why create such a model? Evaluate algorithms and heuristics Get insight into page creation Estimate hard-to-sample parameters Help understand web structure Cost modeling for query optimization To find “surprises” means we must understand what is typical .

Evaluate algorithms and heuristics

Get insight into page creation

Estimate hard-to-sample parameters

Help understand web structure

Cost modeling for query optimization

To find “surprises” means we must understand what is typical .

Random graph models G(n,p) Web indeg > 1000 k23's 4-cliques 0 0 0 100000 125000 many Traditional random graphs [Bollobas 85] are not like the web! Is there a better model?

Desiderata for a graph model Succinct description Insight into page creation No a priori set of "topics", but... ... topics should emerge naturally Reflect structural phenomena Dynamic page arrivals Should mirror web's "rich get richer" property, and manifest link correlation.

Succinct description

Insight into page creation

No a priori set of "topics", but...

... topics should emerge naturally

Reflect structural phenomena

Dynamic page arrivals

Should mirror web's "rich get richer" property, and manifest link correlation.

Page creation on the web Some page creators will link to other sites without regard to existing topics, but Most page creators will be drawn to pages covering existing topics they care about, and will link to pages within these topics Model idea: new pages add links by "copying" them from existing pages

Some page creators will link to other sites without regard to existing topics, but

Most page creators will be drawn to pages covering existing topics they care about, and will link to pages within these topics

Generally, would require… Separate processes for: Node creation Node deletion Edge creation Edge deletion

Separate processes for:

Node creation

Node deletion

Edge creation

Edge deletion

A specific model Nodes are created in a sequence of discrete time steps e.g. at each time step, a new node is created with d 1) out-links Probabilistic copying links go to random nodes with probability copy d links from a random node with probability 1-

Nodes are created in a sequence of discrete time steps

e.g. at each time step, a new node is created with d 1) out-links

Probabilistic copying

links go to random nodes with probability

copy d links from a random node with probability 1-

Example New node arrives With probability , it links to a uniformly-chosen page

Example To copy, it first chooses a page uniformly Then chooses a uniform out-edge from that page Then links to the destination of that edge ("copies" the edge) Under copying, your rate of getting new inlinks is proportional to your in-degree. With probability (1- ), it decides to copy a link.

Degree sequences in this model Pr[page has k inlinks] =~ k Heavy-tailed inverse polynomial degree sequences. Pages like netscape and yahoo exist. Many cores, cliques, and other dense subgraphs ( = 1/11 matches web) -(2- ) (1- )

Model extensions Component size distributions. More complex copying. Tighter lower tail bounds. More structure results.

Component size distributions.

More complex copying.

Tighter lower tail bounds.

More structure results.

A model of Mandelbrot Key idea: Generate frequencies of English words to maximize information transferred per unit cost Approach: Say word i occurs with probability p(i) Set the transmission cost of word i to be log( i) Average information per word: – p(i) log(p(i)) Cost of a word with probability p(j): log (j) Average cost per word: p(j) log(j) Choose probabilities p(i) to maximize information/cost Result: p(j) = c j

Key idea: Generate frequencies of English words to maximize information transferred per unit cost

Approach:

Say word i occurs with probability p(i)

Set the transmission cost of word i to be log( i)

Average information per word: – p(i) log(p(i))

Cost of a word with probability p(j): log (j)

Average cost per word: p(j) log(j)

Choose probabilities p(i) to maximize information/cost

Result: p(j) = c j

Discussion of Mandelbrot’s model Trade-offs between communication cost ( log(p(j) ) and information. Are there other tradeoff-based models that drive similar properties?

Trade-offs between communication cost ( log(p(j) ) and information.

Are there other tradeoff-based models that drive similar properties?

Heuristically Optimized Trade-offs Goal: construction of trees (note: models to generate trees with power law behavior were first proposed in [Yule26]) Idea: New nodes must trade off connecting to nearby nodes, and connecting to central nodes. Model: Points arrive uniformly within the unit square New point arrives, and computes two measures for candidate connection points j d(j) : distance from new node to existing node j (“nearness”) h(j) : distance from node j to root of tree (“centrality”) New destination chosen to minimize d(j) + h(j) Result: for a wide variety of values of , distribution of degrees has a power law [Fabrikant, Koutsoupias, Papadimitriou 2002]

Goal: construction of trees (note: models to generate trees with power law behavior were first proposed in [Yule26])

Idea: New nodes must trade off connecting to nearby nodes, and connecting to central nodes.

Model:

Points arrive uniformly within the unit square

New point arrives, and computes two measures for candidate connection points j

d(j) : distance from new node to existing node j (“nearness”)

h(j) : distance from node j to root of tree (“centrality”)

New destination chosen to minimize d(j) + h(j)

Result: for a wide variety of values of , distribution of degrees has a power law

Monkeys on Typewriters Consider a creation model divorced form concerns of information and cost Model: Monkey types randomly, hits space bar with probability q , character chosen uniformly with remaining probability Result: Rank j word occurs with probability qj log(1-q)-1 = c j

Consider a creation model divorced form concerns of information and cost

Model:

Monkey types randomly, hits space bar with probability q , character chosen uniformly with remaining probability

Result:

Rank j word occurs with probability qj log(1-q)-1 = c j

Other Distributions “Power law” means a clean characterization of a particular property on distribution upper tails Often used to mean “heavy tailed,” meaning bounded away from an exponentially decaying distribution There are other forms of heavy-tailed distributions A commonly-occurring example: lognormal distribution

“Power law” means a clean characterization of a particular property on distribution upper tails

Often used to mean “heavy tailed,” meaning bounded away from an exponentially decaying distribution

There are other forms of heavy-tailed distributions

A commonly-occurring example: lognormal distribution

Quick characterization of lognormal distributions Let X be a normally-distributed random variable Let Y = ln X Then Y is lognormal Properties: Often occur in situations of multiplicative growth Prop2 Concern: There is a growing sequence of papers dating back several decades questioning whether certain observed values are best described by power law or lognormal (or other) distributions.

Let X be a normally-distributed random variable

Let Y = ln X

Then Y is lognormal

Properties:

Often occur in situations of multiplicative growth

Prop2

Concern: There is a growing sequence of papers dating back several decades questioning whether certain observed values are best described by power law or lognormal (or other) distributions.

One final direction… The Central Limit Theorem tells us how sums of independent random variables behave in the limit Example: ln X j = ln X 0 + ln F j X j well-approximated by a lognormal variable Thus, lognormal variables arise in situations of multiplicative growth Examples in biology, ecology, economics,… Example: [Huberman et al]: growth of web sites Similarly: the product The same result applies to the product of lognormal variables Each of these generative models is evolutionary What is the role of time?

The Central Limit Theorem tells us how sums of independent random variables behave in the limit

Example: ln X j = ln X 0 + ln F j

X j well-approximated by a lognormal variable

Thus, lognormal variables arise in situations of multiplicative growth

Examples in biology, ecology, economics,…

Example: [Huberman et al]: growth of web sites

Similarly: the product The same result applies to the product of lognormal variables

Each of these generative models is evolutionary

What is the role of time?

Descriptive statistics generally characterizes or describes a set of data ... the data might be pairs of measurements ... and nominee are related ...

Read more

The data set may comprise data for one or ... to refer to the data in a collection of closely related ... kinds described as a level of measurement.

Read more

The Digital Marketing & Measurement Model provides a 5 step ... well before they think data or ... 7 Incredible Web Design, Branding, Digital Marketing ...

Read more

Conjoint Analysis, Related Modeling, ... to new forms of data collection including web ... Conjoint analysis is based on measurements that inform ...

Read more

Further examining the data set in ... necessary for deriving results related to methods of ... (statistics) Structural equation modelling;

Read more

Data + Design A simple ... The term nominal is related to the Latin word ... you should collect the data at the highest level of measurement that you think ...

Read more

Congressional Budget Data ... or other site s identified as related to S&T ... Measurement and Modeling of Slip Transfer Across ...

Read more

CCSS.Math.Content.2.MD.D.9 Generate measurement data by measuring ... display a data set of measurements in ... Measurement & Dimension; Modeling ...

Read more

Data, Tools and Statistics ... This public data set contains information about services and ... to sources of more data, and to related web pages ...

Read more

## Add a comment