Investigating the Change of Web Pages’ Titles Over Time

Published on July 14, 2009

Author: martinklein0815

Source: slideshare.net

Investigating the Change of Web Pages’ Titles Over Time
Martin Klein and Michael L. Nelson
Old Dominion University
{mklein,mln}@cs.odu.edu
InDP 2009, Austin, TX, 06/19/2009

The Problem
http://www.pspcentral.org/events/annual_meeting_2003.html


The Environment
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, MSN Live) and their caches
• Research projects (CiteSeer)
• Web archives (Internet Archive)
[McCown07] F. McCown, “Lazy Preservation: Reconstructing Websites from the Web Infrastructure”, PhD thesis, Old Dominion University, 2007.

The Bigger Picture
[Flowchart with steps numbered (1) to (6); boxes listed in reading order]
· Query for URL in search engine caches and the Internet Archive; present results; DONE if user is satisfied
· Otherwise: identify dissimilar pages, extract titles, generate LSs, obtain tags
· Query search engines; if results are found, present results; DONE if user is satisfied
· If no results: include link neighborhood, relevance feedback, user interaction (request keywords, change number of terms in LS, add/delete term from LS, advanced search operators)
· Present results; DONE
Annotations added to the flowchart:
• System catches 404 “Page not found” errors
• Discovers copy of missing page in WI and provides it to user
• Obtains further data about missing page (LS, title, tags) and feeds that back into WI
• Provides page at its new location or “good enough” alternative page
• More sophisticated methods needed if unsuccessful so far
• REAL TIME!!!
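The staged fallback in the flowchart can be sketched as a loop over recovery stages, tried in order until the user accepts a candidate. This is only an illustration of the control flow: `recover`, `stages` and `is_satisfied` are hypothetical names, and each stage (cache lookup, title query, link neighborhood, ...) is passed in as a plain callable rather than implemented here.

```python
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical types: a stage maps the missing URL to candidate pages.
Stage = Callable[[str], Iterable[str]]


def recover(url: str,
            stages: Sequence[Stage],
            is_satisfied: Callable[[str], bool]) -> Optional[str]:
    """Try each stage in order; stop at the first candidate the user accepts."""
    for stage in stages:
        for candidate in stage(url):
            if is_satisfied(candidate):
                return candidate  # corresponds to a DONE exit in the flowchart
    # Every stage was exhausted without an accepted result.
    return None
```

Ordering the stages from cheap (caches, archive) to expensive (interactive refinement) mirrors the order of the boxes on the slide.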

Search Engine Queries
• Lexical signatures (LSs)
  • Small set of terms capturing the “aboutness” of a document
  • Generated following the TF-IDF scheme
  • Phelps and Wilensky assumed 5 terms [Phelps00]
  • We have shown that 5- and 7-term LSs perform best [Klein08]
BUT:
• IDF can only be estimated when the entire web is the corpus
• Expensive to generate
Web pages’ titles
[Phelps00] T. A. Phelps and R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, Technical Report, 2000.
[Klein08] M. Klein and M. L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008.
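A minimal sketch of TF-IDF based LS generation, assuming a tiny in-memory corpus in place of the web. The tokenizer, the smoothing term and the function names are illustrative choices, not the authors' implementation; with the web as corpus, the document frequencies would have to be estimated, which is exactly the cost named above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of strings standing in for the web (an assumption).
    """
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(doc_sets)

    def idf(term):
        df = sum(1 for terms in doc_sets if term in terms)
        # The +1 avoids division by zero for terms unseen in the corpus.
        return math.log(n_docs / (1 + df))

    scores = {term: count * idf(term) for term, count in tf.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

The resulting terms would then be joined into a single search engine query.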

Web Pages’ Titles
• Easier/cheaper to obtain than LSs
• High availability (only 1-2% of web pages have no title)
• Also capture the “aboutness” of a web page
• We have shown that LSs decay over time and their retrieval performance decreases [Klein08]
• Investigate change of titles over time
  • General frequency of change
  • Degree of change as Levenshtein score
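To illustrate why titles are cheap to obtain and how a degree of change could be scored, here is a sketch using only the Python standard library. `extract_title`, `levenshtein` and `normalized_change` are illustrative names; the normalization by the longer title's length is an assumption, since the slides only say "Levenshtein score".

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed ('' if there is none)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser._parts).split())


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_change(old_title, new_title):
    """Assumed normalization: distance divided by the longer title's length."""
    longest = max(len(old_title), len(new_title))
    return levenshtein(old_title, new_title) / longest if longest else 0.0
```

Comparing the titles of two archived copies of the same URL this way gives 0.0 for identical titles and 1.0 for titles sharing no aligned characters.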

Dataset
• 6k URLs randomly sampled from DMOZ
• Parsed the pages and extracted up to three URLs referencing in-domain pages
• Filtered out:
  • Inaccessible pages
  • Pages not containing any links
  • Pages not in the .com, .net, .org or .edu domain
  • Pages without copies in the IA
1090 URLs and more than 100K observations
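Two of these steps can be sketched offline: the top-level-domain filter and the selection of up to three in-domain links. The function names are hypothetical, and "in-domain" is interpreted here as "same host name", which is an assumption. The accessibility check and the Internet Archive lookup need network access and are left out.

```python
from urllib.parse import urljoin, urlparse

ALLOWED_TLDS = (".com", ".net", ".org", ".edu")


def has_allowed_tld(url):
    """True if the URL's host ends in one of the four sampled TLDs."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_TLDS)


def in_domain_links(page_url, hrefs, limit=3):
    """Resolve hrefs against page_url and keep up to `limit` same-host links."""
    host = urlparse(page_url).hostname
    picked = []
    for href in hrefs:
        if len(picked) >= limit:
            break
        absolute = urljoin(page_url, href)
        # Skip off-site links, non-HTTP schemes without a host, and repeats.
        if urlparse(absolute).hostname == host and absolute not in picked:
            picked.append(absolute)
    return picked
```

The `hrefs` argument is expected to hold the raw `href` values already extracted from the parsed page.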

Dataset
Length = 1: foo.bar/ and foo.bar/index.html
Length = 2: foo.bar/bar/ and foo.bar/bar/index.html
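One reading of these examples is that a URL's length is one plus the number of directories in its path, with a trailing file name not counted. The sketch below encodes that reading; `url_length` is a hypothetical name, and treating a final segment that contains a dot as a file name is a heuristic assumption.

```python
from urllib.parse import urlparse


def url_length(url):
    """Length as suggested by the slide's examples: 1 + number of directories."""
    if "://" not in url:
        # urlparse only recognizes the host when a scheme is present.
        url = "http://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Assumption: a last segment containing a dot is a file, not a directory.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return 1 + len(segments)
```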

Frequency of Change
[Figure: Number of Changes and Observations in the IA. x-axis: URLs (0 to 1000), in increasing order by 1) observations, 2) changes; y-axis: Number of Changes/Observations (ticks at 1, 10, 100, 1000, 10000); series: Number of Changes, Number of Observations]
• generally low number of changes
• max changes: 25
• number of observations does not impact the number of changes
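Counting title changes for one URL could look like the sketch below, assuming each archived observation is available as a (timestamp, title) pair. The pair layout and the function name are assumptions; a change is counted whenever two consecutive observations carry different titles.

```python
def count_title_changes(observations):
    """Count title differences between consecutive observations.

    `observations` is an iterable of (timestamp, title) pairs in any order;
    timestamps only need to be sortable (ISO date strings, datetimes, ...).
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    titles = [title for _, title in ordered]
    return sum(1 for prev, curr in zip(titles, titles[1:]) if prev != curr)
```

Under this definition a URL with thousands of observations can still have zero changes, which is consistent with the observation count not driving the change count.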

Times of Change
[Figure: Mean Time Delta Between Changes and Time Span Between First and Last Observation in the IA; y-axis: Number of Changes/Observations (ticks at 1000 and 10000); legend: Changes, Observations, Mean Time Delta, Time Span]
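The two quantities named in the figure can be sketched as follows, assuming observations are (date, title) pairs. The slides do not define the mean time delta precisely; here it is taken to be the mean gap, in days, between the dates at which a changed title was first observed. All function names are illustrative.

```python
from datetime import date


def change_dates(observations):
    """Dates at which the title differs from the previous observation."""
    ordered = sorted(observations, key=lambda obs: obs[0])
    return [curr[0] for prev, curr in zip(ordered, ordered[1:])
            if prev[1] != curr[1]]


def mean_delta_days(dates):
    """Mean gap in days between consecutive change dates (None if < 2)."""
    if len(dates) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return sum(gaps) / len(gaps)


def time_span_days(observations):
    """Days between the first and the last observation."""
    dates = [observed for observed, _ in observations]
    return (max(dates) - min(dates)).days if dates else 0
```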
