C gros-webscience-talk

Published on October 21, 2011

Author: froubeac

Source: slideshare.net

Description

Talk by C. Gros on Websciences, the statistics of public file sizes and the files search engine FindFiles.net

How do I develop and found a search engine?
Files and file search in the Internet from the perspective of FindFiles.net
Claudius Gros
Institute for Theoretical Physics, Goethe University Frankfurt, Germany
http://www.findfiles.net

overview
• data in the Internet
  – Mime types and statistics
  – the file search engine FindFiles.net
• science with data files
  – neuropsychological constraints to human data production

Internet – Statistics

Internet hosts
Hosts – Domains – Sites [source: Netcraft.com]
• 2011: ∼ 100 Mio active domains

Internet users – email
users in 2010
• 2 · 10^9 worldwide
• 825 · 10^6 Asia
• 475 · 10^6 Europe
• 266 · 10^6 North America
emails in 2010
• 107 · 10^12 – number of emails sent
• 89% – share of spam emails (success rate: 1:12 Mio ?)
• 1.9 · 10^9 – number of email users
[source: Pingdom.com]

Internet – social media
2010
• 152 · 10^6 – number of blogs
• 25 · 10^9 – number of tweets on Twitter
• 600 · 10^6 – Facebook accounts
  30 · 10^9 – pieces of content: links, notes, images, ...
  20 · 10^6 – number of activated apps (per day)
[source: Pingdom.com]

social media – images and videos
streaming videos – tubes
• 2 · 10^9 – videos watched per day on Youtube (one per Internet user)
• 35 – hours of video uploaded (every minute)
images, pictures
• 5 · 10^9 – photos hosted by Flickr
• 3000 – photos uploaded (per minute)
[source: Pingdom.com]

social media – blogs & bookmarking
blogs are everywhere
• 2010: 50–100 · 10^6 blogs
social is everything
• Digg, Mister Wong, Delicious, ...
• social shopping, ...
http://www.delicious.com

Internet – rules of thumb
2010
• 1 movie per day per user (Youtube)
• 1 search query per day for every 2 users (Google)
• every Internet user uses email
• 5 ‘true’ emails per day per Internet user
• 25% of Internet users use novel social media
• most domains are blogs

Internet Startups

slow beginnings for startups in the Internet
Twitter
• linear scale – exponential or linear growth?
• Apr 2011 – 155 daily tweets

growth is generically not exponential
Google
[figure: Google yearly revenues (billions, US-$) vs. year, 2000 to 2010]
• linear scale

Internet: the winner takes all
flow of attention in complex networks
[figure: network of links between www.small.net, www.small.org, www.big.com, www.medium.com, www.small.com, www.small.de]
• in-degree distribution p_k: heavy tails
• preferential attachment
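How preferential attachment produces a heavy-tailed in-degree distribution can be illustrated with a small simulation. This is only a sketch under assumptions that are not part of the talk: every new node creates exactly one link, and an existing node is chosen with probability proportional to its in-degree plus one. The function name `preferential_attachment` is hypothetical.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a directed network in which every new node links to one existing
    node, chosen with probability proportional to (in-degree + 1).
    Assumed toy model, not the growth rule of the real Internet."""
    rng = random.Random(seed)
    in_degree = [0]   # node 0 starts without incoming links
    # Each node appears in `targets` (in-degree + 1) times, so a uniform
    # draw from this list is a draw proportional to in-degree + 1.
    targets = [0]
    for new_node in range(1, n_nodes):
        chosen = rng.choice(targets)
        in_degree[chosen] += 1
        targets.append(chosen)     # the chosen node gained one incoming link
        in_degree.append(0)
        targets.append(new_node)   # baseline weight of the new node
    return in_degree

in_degree = preferential_attachment(20000)
histogram = Counter(in_degree)
print("total links:", sum(in_degree))
print("largest in-degree:", max(in_degree))
print("nodes without incoming links:", histogram[0])
```

In such a run a few early nodes typically collect a large share of all links while most nodes receive none, which is the "winner takes all" pattern named on the slide.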

in-degree distribution
power law – scale invariant
[figure: log(number of hosts) vs. number of incoming links, 10^0 to 10^6; linear fit, slope -2.2: 7.51 - 2.2*x] [source: Findfiles.net]
• scaling constant for 20 years – starting at one!
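A linear fit on doubly logarithmic axes, as quoted in the figure, can be sketched with an ordinary least-squares fit. The data below are synthetic points generated from the quoted line 7.51 - 2.2*x, not the FindFiles.net measurements, and `fit_line` is a hypothetical helper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic (not measured) data following the fit quoted on the slide.
links = [10 ** e for e in range(7)]   # 1 ... 10^6 incoming links
hosts = [10 ** (7.51 - 2.2 * math.log10(k)) for k in links]

# A power law is a straight line once both axes are logarithmic.
slope, intercept = fit_line([math.log10(k) for k in links],
                            [math.log10(h) for h in hosts])
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With real crawl data the counts would first be binned, and the fit range would have to be chosen by hand; both steps are left out here.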

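diverging in-degree distribution
  p_k \propto \frac{1}{k^{\alpha}}, \qquad \langle k \rangle \propto \int^{\infty} \frac{k}{k^{\alpha}}\, dk \sim \left. \frac{k^{2-\alpha}}{2-\alpha} \right|_{K_c}
• diverging mean in-degree: \lim_{\alpha \to 2} \langle k \rangle \to \infty
  Internet: α ≈ 1.9 – 2.2
  dominating tail: » winners take all «
• makes life difficult for small startups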
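The divergence of the mean in-degree near α = 2 can be checked numerically. The sketch below assumes a discrete power law on k = 1 ... k_max, where `k_max` is a hypothetical upper cutoff introduced for illustration; whether it corresponds to K_c on the slide is an assumption.

```python
def mean_in_degree(alpha, k_max):
    """Mean of p_k proportional to k**(-alpha) on k = 1 ... k_max,
    normalized numerically. k_max is an assumed upper cutoff."""
    norm = sum(k ** -alpha for k in range(1, k_max + 1))
    first_moment = sum(k ** (1 - alpha) for k in range(1, k_max + 1))
    return first_moment / norm

# Compare exponents above, at, and below the critical value alpha = 2.
for alpha in (2.2, 2.0, 1.9):
    means = [mean_in_degree(alpha, 10 ** e) for e in (2, 3, 5)]
    print(alpha, [round(m, 1) for m in means])
```

For α = 2.2 the mean settles towards a finite value as the cutoff grows, while for α = 2.0 and α = 1.9 it keeps growing with the cutoff, so the largest hosts dominate the average.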

the two big uphill fights
a new Internet startup needs to ...
• fight for attention
• fight for novelty
traffic and quality
• the heavy-tailed in-degree distribution makes it difficult to attract traffic
• extremely high service standards act as effective entry barriers

FindFiles.net – a new file search engine

public data on the Internet
• 280 Million domains in 2011
• 10-30 data files per domain
Internet Media type – Mime type
• categorization of all file types
  – email attachments
  – browser add-ons
• about 600 Mime types in use

Mime types
major Mime categories
  33.2%  application/
   2.9%  audio/
  58.0%  image/
   5.1%  text/
   0.7%  video/
• together: 99%
Mime types – examples
  application/pdf
  application/msword
  application/vnd.android.package-archive
  application/vnd.ms-powerpoint
  application/jar
  application/x-deb
  application/x-gzip
  audio/mpeg
  audio/midi
  chemical/x-pdb
  image/jpeg
  image/vnd.djvu
  text/xml
  model/vrml
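Sorting files into the major Mime categories can be sketched with Python's standard `mimetypes` module, which guesses a type from the file extension. This is only an illustration: the file names are made up, and how the FindFiles.net crawler actually determines Mime types is not stated in the talk.

```python
import mimetypes
from collections import Counter

def mime_category(filename):
    """Return the top-level Mime category ('image', 'audio', ...) guessed
    from the file extension, or 'unknown' if the extension is not recognized."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "unknown"
    return mime_type.split("/")[0]

# Hypothetical file names, for illustration only.
files = ["paper.pdf", "photo.jpg", "song.mp3", "logo.gif", "notes.txt"]
counts = Counter(mime_category(name) for name in files)
for category, n in counts.most_common():
    print(f"{category}/: {100 * n / len(files):.1f}%")
```

A crawler could equally rely on the Content-Type header sent by the server; the extension-based guess is used here only because it needs no network access.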

FindFiles.net
search engine for data files
G. Kaczor & C. Gros 2011
• supports all Mime types
http://www.findfiles.net

FindFiles.net – some stats
[figure: daily queries; source: FindFiles.net]
• 400 Mio data files
• 20 Mio hosts crawled
• 10 Million mp3 files
• 10 000 apps for Symbian/Android smartphones
...

blogs, legal issues & financing
blog & press coverage
http://www.findfiles.net/publicrelations
copyright & non-legal files
• files protected by copyright/licence are not indexed (nofollow)
• links to pirate files removed from index
financing
• network – Unibator
• banks are cautious – most startups fail

Science with Data Files

the Wikipedia/DMOZ corpus
all outgoing links of
• Wikipedia (all languages)
• DMOZ – open directory project (all languages)
  7.7 Mio hosts (domains)
  252 Mio data files (FindFiles.net crawler)
analysis of file size distribution
• tails & scaling behaviour

number of files per domain
[figure: files per host vs. in-degree]
• most files hosted on small domains

file size distribution
number of files of given size
[figure: log(number of files) vs. file size [Bytes], 10 B to 10 G; curves: all Mime categories, Mime category application/, audio/, image/, text/, video/]
• 252 Mio files in total – 9 orders of magnitude
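A distribution spanning nine orders of magnitude is usually counted in logarithmically spaced bins. The following sketch shows one way to do that; the function `log_binned_counts`, the bin layout, and the sample sizes are assumptions for illustration, and the binning used for the original figure is not given in the talk.

```python
import math
from collections import Counter

def log_binned_counts(sizes, bins_per_decade=1):
    """Count files in logarithmically spaced size bins.
    Returns {lower bin edge in bytes: number of files}."""
    counts = Counter()
    for size in sizes:
        if size <= 0:
            continue  # empty files have no place on a logarithmic axis
        index = math.floor(math.log10(size) * bins_per_decade)
        counts[index] += 1
    return {10 ** (i / bins_per_decade): n for i, n in sorted(counts.items())}

# Made-up file sizes in bytes.
sizes = [12, 950, 1_024, 4_096, 52_000, 1_500_000, 3_000_000, 0]
binned = log_binned_counts(sizes)
for lower_edge, n in binned.items():
    print(f"from {lower_edge:>12,.0f} B: {n} files")
```

The sketch returns raw counts per bin. Dividing by the bin width would turn them into a density, which changes the apparent slope of a power law by one.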

power-law scaling of image-size distribution
• compression – gif: lossless; jpeg: lossy
[figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; legend: all Mime categories; Mime type image/jpeg; linear fit, slope -2; linear fit, slope -4; Mime type image/gif; linear fit, slope -2.45]
• kink at 4 Mbytes: amateur – professional

lognormal multimedia size distribution
all audio and video Mime types
[figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; legend: all Mime categories; Mime category video/ with quadratic fit (lognormal distribution); Mime category audio/ with quadratic fit (lognormal distribution)]
• quadratic fit – lognormal distribution

lognormal distribution vs. power-law scaling
file-size distribution p(s)
  lognormal: e^{-[\log(s)-\mu]^2/\sigma^2}        power law: s^{-\alpha}
not a Taylor-series correction
  log(p(s)) ∝ α log(s) − β log^2(s)
  images: α < 0, β = 0
  audio/video: α > 0, β > 0
[figures: the image-size distribution with linear fits and the audio/video size distribution with quadratic fits, log(number of files) vs. file size [Bytes], 1 K to 1 G]
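The difference between the two cases can be made concrete by fitting log(p(s)) = c0 + α log(s) − β log^2(s) and inspecting β. The sketch below uses synthetic data with made-up coefficients, not the measured file counts, and solves the least-squares problem with a small hand-written routine (`fit_quadratic`, a hypothetical helper).

```python
def fit_quadratic(xs, ys):
    """Least-squares fit y = c0 + c1*x + c2*x**2 via the normal equations."""
    # Power sums S[j] = sum x**j and right-hand sides T[i] = sum y * x**i.
    S = [sum(x ** j for x in xs) for j in range(5)]
    T = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    # Augmented 3x3 system, solved by Gauss-Jordan elimination with pivoting.
    A = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(3):
            if row != col:
                factor = A[row][col] / A[col][col]
                A[row] = [a - factor * b for a, b in zip(A[row], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]

# Synthetic log-log data: x = log10(size), y = log10(number of files).
xs = [3.0 + 0.5 * i for i in range(13)]                   # 1 K ... 1 G
power_law = [8.0 - 2.0 * x for x in xs]                   # alpha = -2, beta = 0
lognormal = [-20.0 + 7.0 * x - 0.5 * x * x for x in xs]   # alpha = 7, beta = 0.5

for name, ys in (("power law", power_law), ("lognormal", lognormal)):
    c0, alpha, c2 = fit_quadratic(xs, ys)
    print(f"{name}: alpha = {alpha:.2f}, beta = {-c2:.2f}")
```

A fitted β compatible with zero points to power-law scaling, while a clearly positive β points to a lognormal form.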

one vs. two-dimensional cost functions
economic cost functions for data production
• size
  – storage costs
  – production costs
psychophysical cost functions for data production
• size (images)
  – time needed to take an image is independent of resolution
• size and time (audio & video)
  – time and resolution are psychophysically distinct variables

Weber-Fechner law
• neuropsychological cost functions are logarithmic in sensory stimulus intensity
  – number of objects
  – time perception
  music: tone pitch ∝ log(frequency) (octave)
  photometry: brightness ∝ log(intensity) (lumen)
  acoustics: sound level ∝ log(intensity) [decibel]
  information production: number of objects / time

information entropy
Shannon information entropy
  -\int p(s) \log(p(s))\, ds, \qquad \int p(s)\, ds = 1
for a distribution function p(s)
• a measure for the information content
Shannon coding theorem
  The minimal number of bytes needed to encode a transmission is given by the information entropy of the signal statistics
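For a discrete signal the entropy integral becomes a sum over symbol frequencies. The sketch below computes it for the byte values of a byte string. It assumes that bytes are treated as independent symbols, which is a simplification, and the function name `shannon_entropy_bits` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy_bits(data: bytes) -> float:
    """Entropy, in bits per byte, of the empirical byte-value distribution."""
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

uniform = bytes(range(256)) * 4   # every byte value equally frequent
repetitive = b"a" * 1024          # a single repeated symbol
print(shannon_entropy_bits(uniform))
print(shannon_entropy_bits(repetitive))
```

The uniform sample reaches the maximum of 8 bits per byte and cannot be compressed symbol by symbol, while the repetitive sample has zero entropy.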

neuropsychological cost functions
conditional entropy maximization
  \delta \left[ -\int p(s) \log(p(s))\, ds - \lambda \int p(s)\, c(s)\, ds \right] = 0
  Shannon information entropy: -\int p(s) \log(p(s))\, ds
  cost function: c(s)
  file size distribution: p(s)
maximal file size distributions: p(s) ∝ e^{-λ c(s)} ∼
  exponential    c(s) ∝ s           physical
  power law      c(s) ∝ log(s)      1-dim neuro
  lognormal      c(s) ∝ log^2(s)    2-dim neuro
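The step from the variational condition to the three distributions can be written out as follows. This is a sketch of the standard argument, not taken from the slides: the additional multiplier \nu for the normalization constraint is an assumption made here, and the proportionality constants of the cost functions are absorbed into \lambda.

```latex
\begin{align*}
0 &= \frac{\delta}{\delta p(s)} \left[ -\int p \log p \, ds
      - \lambda \int p \, c \, ds - \nu \left( \int p \, ds - 1 \right) \right]
   = -\log p(s) - 1 - \lambda c(s) - \nu \\
\Rightarrow \quad p(s) &= e^{-1-\nu} \, e^{-\lambda c(s)} \propto e^{-\lambda c(s)} \\
c(s) = s &: \quad p(s) \propto e^{-\lambda s} \\
c(s) = \log(s) &: \quad p(s) \propto s^{-\lambda} \\
c(s) = \log^2(s) &: \quad p(s) \propto e^{-\lambda \log^2(s)}
\end{align*}
```

The last line is a Gaussian in log(s), which is the lognormal form; a logarithmic cost turns the exponential into a power law with exponent \lambda.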

physical vs. neuropsychological cost functions
images
• physical: exponential [not seen]
• 1-dim neuro: power law [linear]
[figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; Mime types image/jpeg and image/gif with linear fits, slopes -2, -4 and -2.45]
audio/video
• physical: exponential [not seen]
• 2-dim neuro: lognormal [quadratic]
[figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; Mime categories video/ and audio/ with quadratic fits (lognormal distribution)]

global human data production
basic assumptions
• information production as underlying driving force
  – information entropy as a suitable measure
• law of large numbers
  – average over production processes / producing agents
  – compression/technology correspond to rescaling
data production on a global level is characterized by neuropsychological cost functions and not by economic constraints

the Internet & complex system theory
complex system theory – still an emergent field
  many models and paradigms yet to be formulated
  network theory / game theory / allocation problems
  macroecology / systems biology / cognitive systems theory ...
• information entropy maximization
  – human data production on a global level
  – neuropsychological cost functions ...

graduate level textbook
• Information theory and complexity
• Phase transitions and self-organized criticality
• Life at the edge of chaos and punctuated equilibrium
• Cognitive system theory and diffusive emotional control
second edition 2010
