Search Engine Spiders


Published on November 20, 2008

Author: mcjenkins

Source: slideshare.net

Description

Full tutorial on what spiders are, how they work, why we need them and how to create and maintain your own.

Search Engine Spiders (IR tutorial series: Part 2) http://scienceforseo.blogspot.com

Spiders are programs which scan the web in a methodical, automated way. They copy all the pages they visit and hand them to the search engine for indexing. Not all spiders have the same job, though: some check links, collect email addresses, or validate code, for example. Some people call them crawlers, bots, or even ants or worms. (“Spidering” means requesting every page on a site.)

A spider's architecture: a downloader fetches web pages, a store keeps the downloaded content, a queue holds the URLs waiting to be visited, and a scheduler co-ordinates these processes.
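As a rough illustration of those four parts (not from the original slides; the class and method names are invented), a spider can be sketched in Python like this:

import urllib.request
from collections import deque

class Spider:
    """Minimal skeleton of the four architectural parts."""

    def __init__(self, seeds):
        self.frontier = deque(seeds)   # queue: URLs waiting to be visited
        self.pages = {}                # store: downloaded pages kept for indexing

    def download(self, url):
        """Downloader: fetch one web page."""
        with urllib.request.urlopen(url) as response:
            return response.read()

    def store(self, url, html):
        """Store: keep a copy of the page for the search engine to index."""
        self.pages[url] = html

    def crawl(self):
        """Scheduler: co-ordinate downloading, storing and queueing."""
        while self.frontier:
            url = self.frontier.popleft()
            self.store(url, self.download(url))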

An example

The crawl list would look like this (although it would be much, much bigger than this small sample):

http://www.techcrunch.com/
http://www.crunchgear.com/
http://www.mobilecrunch.com/
http://www.techcrunchit.com/
http://www.crunchbase.com/
http://www.techcrunch.com/#
http://www.inviteshare.com/
http://pitches.techcrunch.com/
http://gillmorgang.techcrunch.com/
http://www.talkcrunch.com/
http://www.techcrunch50.com/
http://uk.techcrunch.com/
http://fr.techcrunch.com/
http://jp.techcrunch.com/

The spider also saves a copy of each page it visits in a database, and the search engine then indexes those copies. The first URLs given to the spider as a starting point are called “seeds”. The list gets bigger and bigger, and to keep the search engine index current the spider needs to re-visit those links often to track any changes. The spider therefore keeps two lists: the URLs already visited and the URLs still to visit. The list of URLs still to visit is known as “the crawl frontier”.
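A minimal sketch of that two-list bookkeeping (not from the slides; the seed URL and page limit are just examples) might look like this:

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, limit=50):
    visited = set()          # URLs already visited
    frontier = deque(seeds)  # "the crawl frontier": URLs still to visit
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue
        visited.add(url)
        # a real spider would also store the page here for indexing
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))
    return visited

# crawl(["http://www.techcrunch.com/"])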

Difficulties

The web is enormous: no search engine indexes more than 16% of the web, so the spider downloads only the most relevant pages.

The rate of change is phenomenal: a spider needs to re-visit pages often to check for updates and changes.

Server-side scripting languages often return many URLs for the spider to visit without providing unique content, which wastes the spider's time.

Solutions

Spiders use the following policies:

A selection policy that states which pages to download.

A re-visit policy that states when to check for changes to the pages.

A politeness policy that states how to avoid overloading websites (a small sketch follows this list).

A parallelization policy that states how to coordinate distributed web crawlers.
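As one illustration of a politeness policy (not from the slides; the two-second delay is an arbitrary choice), a spider can remember when it last requested each host and wait before hitting it again:

import time
from urllib.parse import urlparse

class PolitenessPolicy:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_request = {}   # host -> time of the last request

    def wait_for(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_request.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[host] = time.time()

# policy = PolitenessPolicy()
# policy.wait_for("http://www.techcrunch.com/")  # sleeps if this host was hit recently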

Build a spider

You can use any programming language you feel comfortable with, although Java, Perl, and C# are the most popular choices. You can also use these tutorials:

Java Sun spider - http://tiny.cc/e2KAy
Chilkat in Python - http://tiny.cc/WH7eh
Swish-e in Perl - http://tiny.cc/nNF5Q

Remember that a poorly designed spider can impact overall network and server performance.

Open-source spiders

You can use one of these for free (some knowledge of programming helps in setting them up):

OpenWebSpider in C# - http://www.openwebspider.org
Arachnid in Java - http://arachnid.sourceforge.net/
Java-web-spider - http://code.google.com/p/java-web-spider/
MOMspider in Perl - http://tiny.cc/36XQA

Robots.txt

This is a file that allows webmasters to give instructions to visiting spiders, which are expected to respect it; some areas of a site can be declared off-limits.

Disallow every spider from everything:

User-agent: *
Disallow: /

Records added below that let specific spiders through. Googlebot and BackRub may access everything except /private:

User-agent: Googlebot
User-agent: BackRub
Disallow: /private

And churl may access everything:

User-agent: churl
Disallow:
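To see how a spider can honour these rules in practice, here is a small sketch (not from the slides) using Python's standard-library robots.txt parser; the site and path are only examples:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the site's robots.txt

# Only fetch a page if the rules allow it for our user-agent.
if rp.can_fetch("Googlebot", "http://www.example.com/private/report.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")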

Spider ethics

There is a code of conduct for spiders that developers must follow, and you can read it here: http://www.robotstxt.org/guidelines.html In (very) short:

Are you sure the world needs another spider?

Identify the spider, identify yourself, and publish your documentation (see the sketch after this list).

Test locally.

Moderate the speed and frequency of runs to a given host.

Only retrieve what you can handle (format & scale).

Monitor your runs.

Share your results.

List your spider in the database: http://www.robotstxt.org/db.html
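For the "identify the spider" point above, one small sketch (not from the slides; the spider name and documentation URL are made up) is to send a descriptive User-Agent header with every request:

import urllib.request

request = urllib.request.Request(
    "http://www.example.com/",
    headers={
        # Name the spider, give a version, and point to documentation/contact details.
        "User-Agent": "ExampleSpider/0.1 (+http://www.example.com/spider-docs)"
    },
)
with urllib.request.urlopen(request) as response:
    page = response.read()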

Spider traps

Intentionally or unintentionally, traps sometimes crop up on the spider's path and stop it functioning properly. Dynamic pages, deep directories that never end, pages with special links and commands pointing the spider to other directories... anything that can put the spider into an infinite loop is an issue. You might, however, want to deploy a spider trap yourself, for example if you know that a spider visiting your site is not respecting your robots.txt, or because it is a spambot.

Fleiner's spider trap:

<html><head><title> You are a bad netizen if you are a web bot! </title></head>
<body><h1><b> You are a bad netizen if you are a web bot! </b></h1>
<!--#config timefmt="%y%j%H%M%S" --> <!-- format of the date string -->
<!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load -->
To give robots some work, here are some special links: these are
<a href=a<!--#echo var="DATE_GMT" -->.html> some links </a>
to this <a href=b<!--#echo var="DATE_GMT" -->.html> very page </a>
but with <a href=c<!--#echo var="DATE_GMT" -->.html> different names </a>
</body></html>

You can download spider traps and find out more at Fleiner's page: http://www.fleiner.com/bots/#trap
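On the crawler side, a common defence (not from the slides, just an illustrative sketch with arbitrary thresholds) is to cap crawl depth and skip URLs whose shape suggests an infinite loop:

from urllib.parse import urlparse

MAX_DEPTH = 10          # give up on suspiciously deep paths
MAX_URL_LENGTH = 200    # endlessly growing URLs are a classic trap symptom

def looks_like_trap(url):
    """Heuristic check for URLs that are likely spider traps."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) > MAX_DEPTH or len(url) > MAX_URL_LENGTH:
        return True
    # repeated path segments ("/a/b/a/b/a/b/...") suggest an infinite loop
    if len(segments) > 6 and len(segments) != len(set(segments)):
        return True
    return False

# looks_like_trap("http://www.example.com/a/b/a/b/a/b/a/b/page.html")  -> True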

Web 3.0 crawling

Web 3.0 allows machines to store, exchange, and use machine-readable information.

A website parse template is based on XML and provides an HTML structure description of web pages. This allows a spider to generate RDF triples for web pages.

An RDF triple has a subject (e.g. “the dog”), a predicate (e.g. “has colour”), and an object, the value the predicate assigns to the subject (e.g. “blue”): three is the magic number! Each RDF triple is a complete and unique fact. This way humans and machines can share the same information. (A small triple-building sketch follows this section.)

We call these crawlers “scutters” rather than spiders (or any of the other variants). They consume and act upon RDF documents, which is what makes them different.
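As a small illustration (not from the slides), the third-party Python library rdflib can build and serialise such a triple; the namespace and property names below are made up for the example:

from rdflib import Graph, Literal, Namespace, URIRef

# A made-up namespace for the example vocabulary.
EX = Namespace("http://example.org/terms/")

g = Graph()
subject = URIRef("http://example.org/things/dog")

# One complete fact: the dog has the colour "blue".
g.add((subject, EX.hasColour, Literal("blue")))

print(g.serialize(format="turtle"))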

Other resources

O'Reilly, “Spidering Hacks”

BotSpot

“Web Crawling” by Castillo

“Finding What People Want” by Pinkerton

The SPHINX crawler by Miller and Bharat

“Help web crawlers crawl your website” by IBM

Bean Software spider components and info

UbiCrawler

“Search Engines and Web Dynamics”
