When RSS Fails: Web Scraping with HTTP


Published on March 5, 2009

Author: tobias382

Source: slideshare.net


A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.

When RSS Fails: Web Scraping with HTTP
Matthew Turland, Senior Consultant, Blue Parabola LLC
February 27, 2009

What is Web Scraping? A 2 Step Process

Its Goal: Data

Obtain It

Transform It

Automate It

Step 1: Retrieval

The Client

The Server

The Request

The Response

Or In Your Case

Step 2: Analysis

Locate Desired Data

Extract It

Use It

So To Recap: A 2 Step Process
Step 1: Retrieval
    Request: GET /some/resource ...
    Response: HTTP/1.1 200 OK ...
    Result: a resource with the data you want
Step 2: Analysis
    Raw resource in, usable data out
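The two steps can be sketched in a few lines. This is only an illustration, written in Python rather than the PHP the talk targets: the retrieval step is stubbed with a hard-coded string, and the markup, the `price` class name, and the `PriceExtractor` helper are all made up for the example.

```python
from html.parser import HTMLParser

# Step 1 (retrieval) is stubbed out: a real scraper would fetch this markup
# over HTTP. The markup and the "price" class name are hypothetical.
RAW_RESOURCE = """
<html><body>
  <h1>Widgets</h1>
  <span class="price">19.99</span>
  <span class="price">4.50</span>
</body></html>
"""


class PriceExtractor(HTMLParser):
    """Step 2 (analysis): collect the text of every <span class="price">."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data.strip()))


def analyze(raw_resource):
    """Turn the raw resource into usable data (a list of floats)."""
    parser = PriceExtractor()
    parser.feed(raw_resource)
    return parser.prices


print(analyze(RAW_RESOURCE))
```

A change as small as renaming the `price` class would make `analyze` return an empty list, which is the fragility the later slides warn about.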

How Is It Different?
- Consuming Web Services: web service data formats vs. web scraping data formats
- Data Mining: focus in data mining vs. focus in web scraping

What Is It Used For?
- System integration
- Crawlers and indexers
- Integration testing


One small change to markup...

... may break your application.

Or in modern terms...

Reverse Engineering Required

Multiple Requests

No Nice Neat Data Package

Quite the Opposite, In Fact

Know enough HTTP to... Use one like this: To do this:

Know enough HTTP to... Learn to use and troubleshoot one like this:
- PEAR::HTTP_Client
- pecl_http
- Zend_Http_Client
Or roll your own!
- Filesystem + Streams
- cURL

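Let's GET Started

    GET /wiki/Main_Page HTTP/1.1
    Host: en.wikipedia.org

Request line: GET /wiki/Main_Page HTTP/1.1
- GET: method or operation
- /wiki/Main_Page: URI address for the desired resource
- HTTP/1.1: protocol version in use by the client
Header: Host: en.wikipedia.org
- Host: header name
- en.wikipedia.org: header value
More headers follow...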
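As a rough illustration of that anatomy, the sketch below assembles the same request as raw text and splits the request line back into its three parts. It is written in Python rather than PHP, and `build_get_request` and `parse_request_line` are hypothetical helper names, not part of any library.

```python
def build_get_request(host, path, extra_headers=None):
    """Return the raw text of an HTTP/1.1 GET request."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line ends the header block; this GET carries no body.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_request_line(raw_request):
    """Split the first line into method, URI, and protocol version."""
    request_line = raw_request.split("\r\n", 1)[0]
    method, uri, version = request_line.split(" ")
    return method, uri, version


raw = build_get_request("en.wikipedia.org", "/wiki/Main_Page")
print(raw)
print(parse_request_line(raw))
```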

URI vs URL
URI
1. Uniquely identifies a resource
URL
2. Indicates how to locate a resource
3. Does both and is thus human-usable.
More info in RFC 3986 Sections 1.1.3 and 1.2.2

Warning about GET
GET in principle: "Let's do this by the book."
GET in reality: "'Safe operation'? Whatever."

Query Strings
URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
Query string: title=Query_string&action=edit
- Question mark to separate the resource address and query string
- Equal signs to separate parameter names and respective values
- Ampersands to separate parameter name-value pairs
- Example parameter: title, with value: Query_string
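The same split can be done programmatically. The sketch below uses Python's standard `urllib.parse` module as a stand-in for whatever URL helpers your own language offers; only the URL itself comes from the slide.

```python
from urllib.parse import parse_qsl, urlsplit

url = "http://en.wikipedia.org/w/index.php?title=Query_string&action=edit"

parts = urlsplit(url)

# Everything before the question mark: the resource address.
resource_address = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after it: name=value pairs separated by ampersands.
params = dict(parse_qsl(parts.query))

print(resource_address)
print(parts.query)
print(params)
```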

URL Encoding
Also called percent encoding.

Parameter    Value
first        this is a field
second       is it clear enough (already)?

Query String: first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F

Handy PHP URL functions: parse_str, urlencode, urldecode
$_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
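For readers who want to try the encoding outside PHP, here is a small sketch using Python's `urllib.parse`. In this example `urlencode` plays roughly the role of `http_build_query`, `parse_qs` of `parse_str`, and `unquote_plus` of `urldecode`; that mapping is an approximation, not an exact equivalence.

```python
from urllib.parse import parse_qs, unquote_plus, urlencode

# The two form fields from the table on this slide.
fields = {
    "first": "this is a field",
    "second": "is it clear enough (already)?",
}

# Spaces become "+", and reserved characters such as "(", ")" and "?"
# become percent-encoded bytes.
query_string = urlencode(fields)
print(query_string)

# Decoding reverses the process; parse_qs returns a list per parameter
# because a name may appear more than once in a query string.
decoded = parse_qs(query_string)
print(decoded)

print(unquote_plus("%28already%29%3F"))
```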

POST Requests
Most Common HTTP Operations
1. GET
2. POST
...

POST /w/index.php: /new/resource -or- /updated/resource

    GET /some/resource HTTP/1.1
    Header: Value
    ...
    (request body: none)

    POST /some/resource HTTP/1.1
    Header: Value
    ...
    request body

POST Request Example

    POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
    Content-Type: application/x-www-form-urlencoded

    wpStarttime=20080719022313&wpEdittime=20080719022100 ...

- Content-Type: content type for data submitted via HTML form (multipart/form-data for file uploads)
- Blank line separates request headers and body
- Request body ... look familiar?

Note: Most browsers have a query string length limit. Lowest known common denominator: IE7 strlen(entire URL) <= 2,048 bytes. This limit is not standardized. It applies to query strings, but not request bodies.
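A minimal sketch of assembling such a request by hand, in Python. The `Host` and `Content-Length` headers are additions not shown on the slide (HTTP/1.1 requires `Host`, and servers generally need `Content-Length` to know where the body ends), and `build_post_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_post_request(host, path, form_fields):
    """Return the raw text of a form-encoded HTTP/1.1 POST request."""
    # The body uses the same encoding as a query string.
    body = urlencode(form_fields)
    head = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",
    ]
    # The blank line separates the request headers from the body.
    return "\r\n".join(head) + "\r\n\r\n" + body


request = build_post_request(
    "en.wikipedia.org",
    "/w/index.php?title=Wikipedia:Sandbox",
    {"wpStarttime": "20080719022313", "wpEdittime": "20080719022100"},
)
print(request)
```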

HEAD Request
Same as GET with two exceptions:
1. HEAD vs GET (the method in the request line)
2. No response body
Sometimes headers are all you want.

    HEAD /wiki/Main_Page HTTP/1.1
    Host: en.wikipedia.org

    HTTP/1.1 200 OK
    Header: Value
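To see the difference without touching a real site, the sketch below starts a throwaway server on the local machine and sends it one HEAD and one GET. Everything here is an assumption made for the demo: the `DemoHandler` class, the page content, and the use of Python's `http.server` and `http.client` modules in place of a PHP client.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Main Page</body></html>"  # made-up content


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_HEAD(self):
        # Same headers as GET, but no body is written.
        self._send_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(method, port, path="/wiki/Main_Page"):
    """Send one request and return (status, headers, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        return response.status, dict(response.getheaders()), response.read()
    finally:
        conn.close()


def compare_head_and_get():
    """Run HEAD and GET against a local server on an OS-chosen port."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("HEAD", port), fetch("GET", port)
    finally:
        server.shutdown()
        server.server_close()


head_result, get_result = compare_head_and_get()
print(head_result[0], len(head_result[2]))
print(get_result[0], len(get_result[2]))
```

Both responses carry the same `Content-Length` header, but only the GET response actually delivers that many bytes.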

Responses

    HTTP/1.0 200 OK
    Server: Apache
    X-Powered-By: PHP/5.2.5
    ...
    [body]

Status line: HTTP/1.0 200 OK
- HTTP/1.0: lowest protocol version required to process the response
- 200: response status code
- OK: response status description
Same header format as requests, but different headers are used (see RFC 2616 Section 14)
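A response can be taken apart the same way a request is put together. The following Python sketch is deliberately naive (it ignores repeated headers and folded header lines), and `parse_response` is a hypothetical helper; the body text is a placeholder.

```python
def parse_response(raw_response):
    """Split a raw HTTP response into its status line parts, headers, and body."""
    head, _, body = raw_response.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # The description may contain spaces ("Not Found"), so split at most twice.
    version, status_code, description = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(status_code), description, headers, body


raw = (
    "HTTP/1.0 200 OK\r\n"
    "Server: Apache\r\n"
    "X-Powered-By: PHP/5.2.5\r\n"
    "\r\n"
    "<html>...</html>"
)

parsed = parse_response(raw)
print(parsed)
```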

Response Status Codes
- 1xx Informational: Request received, continuing process.
- 2xx Success: Request received, understood, and accepted.
- 3xx Redirection: Client must take additional action to complete the request.
- 4xx Client Error: Request is malformed or could not be fulfilled.
- 5xx Server Error: Request was valid, but the server failed to process it.
See RFC 2616 Section 10 for more info.
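Because the class is carried by the first digit, a scraper can branch on it without listing every code. A minimal Python sketch, with `status_class` as a hypothetical helper name:

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a status code to its class using the hundreds digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


for code in (200, 304, 404, 503):
    print(code, status_class(code))
```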

Headers
- Set-Cookie, Cookie: See RFC 2109 or RFC 2965 for more info.
- Location: Watch out for infinite loops!
- Last-Modified with If-Modified-Since, OR ETag with If-None-Match: 304 Not Modified
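Two of these headers lend themselves to a short sketch: guarding against redirect loops when following `Location`, and turning cached validators into a conditional request. The Python below runs against a fake in-memory site rather than a network; `FAKE_SITE`, `follow_redirects`, `conditional_headers`, and the redirect limit of 5 are all assumptions made for the example.

```python
# A fake site: path -> (status code, response headers). "/a" and "/b"
# redirect to each other forever.
FAKE_SITE = {
    "/old": (301, {"Location": "/new"}),
    "/new": (200, {}),
    "/a": (302, {"Location": "/b"}),
    "/b": (302, {"Location": "/a"}),
}


class TooManyRedirects(Exception):
    pass


def follow_redirects(path, fetch, max_redirects=5):
    """Follow Location headers, giving up after max_redirects hops."""
    visited = []
    for _ in range(max_redirects + 1):
        visited.append(path)
        status, headers = fetch(path)
        if status in (301, 302, 303, 307) and "Location" in headers:
            path = headers["Location"]
            continue
        return status, visited
    raise TooManyRedirects(" -> ".join(visited))


def conditional_headers(cached_response_headers):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if "ETag" in cached_response_headers:
        headers["If-None-Match"] = cached_response_headers["ETag"]
    if "Last-Modified" in cached_response_headers:
        headers["If-Modified-Since"] = cached_response_headers["Last-Modified"]
    return headers


print(follow_redirects("/old", FAKE_SITE.__getitem__))
print(conditional_headers({"ETag": '"abc123"'}))
```

Calling `follow_redirects("/a", FAKE_SITE.__getitem__)` raises `TooManyRedirects` instead of spinning forever.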

More Headers
- WWW-Authenticate, Authorization: See RFC 2617 for more info.
- User-Agent: 200 OK / 403 Forbidden. Some servers perform user agent sniffing. Some clients perform user agent spoofing.
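As a rough sketch of what a client sends for these two headers, the Python below builds a Basic `Authorization` value (one of the schemes described in RFC 2617) and pairs it with a `User-Agent`. The credentials are the example pair from the RFC; the user agent string and the `basic_auth_header` helper name are made up.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an Authorization header for HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


request_headers = {
    # Hypothetical identifier; a spoofing client would copy a browser's string.
    "User-Agent": "ExampleScraper/1.0",
    "Authorization": basic_auth_header("Aladdin", "open sesame"),
}

for name, value in request_headers.items():
    print(f"{name}: {value}")
```

Basic credentials are only encoded, not encrypted, so anyone who can read the request can recover them.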

Best Practices

Simulate User Behavior

Minimize Requests

Batch Jobs, Non-Peak Hours

Questions?  No heckling... OK, maybe just a little.  I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com. </shameless_plug> Thanks for coming!
