Published on March 5, 2009
When RSS Fails: Web Scraping with HTTP. Matthew Turland, Senior Consultant, Blue Parabola LLC. February 27, 2009.
What is Web Scraping? A 2 Step Process
Its Goal: Data
Step 1: Retrieval
Or In Your Case
Step #2: Analysis
Locate Desired Data
So To Recap: 2 Step Process. Step 1 (Retrieval): send GET /some/resource ... and receive HTTP/1.1 200 OK ... with the resource containing the data you want. Step 2 (Analysis): turn the raw resource into usable data.
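The two steps can be sketched in a few lines of Python. This is only an illustration, not the speaker's code: the `retrieve` and `analyze` names, the choice of the page title as the "desired data", and the `SAMPLE` markup are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def retrieve(url):
    """Step 1 (retrieval): fetch the raw resource over HTTP."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def analyze(html):
    """Step 2 (analysis): turn the raw resource into usable data."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical sample markup so the sketch runs without network access.
SAMPLE = "<html><head><title> Main Page </title></head><body></body></html>"
print(analyze(SAMPLE))
```

In real use, `analyze(retrieve(url))` would chain the two steps; the sample string stands in for the retrieved resource here.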
How Is It Different? Consuming Web Services: web service data formats vs. web scraping data formats. Data Mining: focus in data mining vs. focus in web scraping.
What Is It Used For? System integration, crawlers and indexers, integration testing.
One small change to markup...
... may break your application.
Or in modern terms...
Reverse Engineering Required
No Nice Neat Data Package
Quite the Opposite, In Fact
Know enough HTTP to... Use one like this: To do this:
Know enough HTTP to... Learn to use and troubleshoot one like this: PEAR::HTTP_Client, pecl_http, Zend_Http_Client. Or roll your own! Filesystem + Streams, cURL.
Let's GET Started
GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
Request line: method or operation (GET), URI address for the desired resource (/wiki/Main_Page), protocol version in use by the client (HTTP/1.1). Header: header name (Host), header value (en.wikipedia.org). More headers follow...
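To make the anatomy of the request concrete, here is a minimal Python sketch that assembles the same message as text. The `build_request` helper is hypothetical, and the CRLF line endings and closing blank line reflect HTTP/1.1 message framing rather than anything shown on the slide.

```python
def build_request(method, path, host, extra_headers=None):
    """Assemble a body-less HTTP/1.1 request as text."""
    # Request line: method, URI of the desired resource, protocol version.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    # Each header is a name and a value separated by a colon.
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines end with CRLF; a blank line closes the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_request("GET", "/wiki/Main_Page", "en.wikipedia.org")
print(request)
```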
URI vs URL. URI: 1. Uniquely identifies a resource. URL: 2. Indicates how to locate a resource. 3. Does both and is thus human-usable. More info in RFC 3986 Sections 1.1.3 and 1.2.2.
Warning about GET. GET in principle: "Let's do this by the book." GET in reality: "'Safe operation'? Whatever."
Query Strings. URL: http://en.wikipedia.org/w/index.php?title=Query_string&action=edit. Query string: title=Query_string&action=edit. Question mark to separate the resource address and query string. Equal signs to separate parameter names and respective values. Ampersands to separate parameter name-value pairs.
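The three separators can be seen at work by splitting the slide's URL with Python's standard `urllib.parse` module. This is a small sketch; the variable names are chosen only for the example.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://en.wikipedia.org/w/index.php?title=Query_string&action=edit"

# The question mark separates the resource address from the query string.
parts = urlsplit(url)
resource_address = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Ampersands separate the pairs; equal signs separate names from values.
# parse_qs returns a list per name because a name may repeat.
params = parse_qs(parts.query)

print(resource_address)
print(parts.query)
print(params)
```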
URL Encoding. Also called percent encoding. Parameter "first" has the value "this is a field"; parameter "second" has the value "is it clear enough (already)?". Query string: first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F. Handy PHP URL functions: parse_str, urlencode, urldecode. $_SERVER['QUERY_STRING'] / http_build_query($_GET). More info on URL encoding in RFC 3986 Section 2.1.
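The slide names PHP helpers. As a rough cross-check, the same encoding can be sketched with Python's `urllib.parse`, where `urlencode` plays approximately the role of `http_build_query`, `quote_plus` that of PHP's `urlencode`, and `parse_qsl` that of `parse_str`. That mapping is an approximation for illustration, not an exact equivalence.

```python
from urllib.parse import parse_qsl, quote_plus, urlencode

fields = {
    "first": "this is a field",
    "second": "is it clear enough (already)?",
}

# Encode a single value: spaces become "+", reserved characters become %XX.
encoded_value = quote_plus(fields["first"])

# Encode all name-value pairs into one query string.
query_string = urlencode(fields)

# Decoding the query string should give back the original pairs.
decoded = dict(parse_qsl(query_string))

print(encoded_value)
print(query_string)
print(decoded == fields)
```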
POST Requests. Most common HTTP operations: 1. GET 2. POST. A POST to /w/index.php ... results in /new/resource -or- /updated/resource. GET /some/resource HTTP/1.1, Header: Value ..., request body: none. POST /some/resource HTTP/1.1, Header: Value ..., followed by a request body.
POST Request Example
POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
Content-Type: application/x-www-form-urlencoded

wpStarttime=20080719022313&wpEdittime=20080719022100 ...
Content-Type: content type for data submitted via HTML form (multipart/form-data for file uploads). Blank line separates request headers and body. Request body ... look familiar?
Note: Most browsers have a query string length limit. Lowest known common denominator: IE7, strlen(entire URL) <= 2,048 bytes. This limit is not standardized. It applies to query strings, but not request bodies.
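A minimal Python sketch of how such a message could be put together follows. The `build_post` helper is hypothetical, and the `Host` and `Content-Length` headers are assumptions added for a well-formed HTTP/1.1 request; they are not shown on the slide.

```python
from urllib.parse import urlencode


def build_post(path, host, fields):
    """Assemble a form-encoded POST request as text."""
    # The body uses the same encoding as a query string.
    body = urlencode(fields)
    head = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",  # assumed; not on the slide
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",  # assumed; not on the slide
    ]
    # A blank line separates the request headers from the body.
    return "\r\n".join(head) + "\r\n\r\n" + body


message = build_post(
    "/w/index.php?title=Wikipedia:Sandbox",
    "en.wikipedia.org",
    {"wpStarttime": "20080719022313", "wpEdittime": "20080719022100"},
)
print(message)
```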
HEAD Request. Same as GET with two exceptions: 1. HEAD vs GET in the request line. 2. No response body. Request: HEAD /wiki/Main_Page HTTP/1.1, Host: en.wikipedia.org. Response: HTTP/1.1 200 OK, Header: Value, headers only. Sometimes headers are all you want.
Responses
HTTP/1.0 200 OK
Server: Apache
X-Powered-By: PHP/5.2.5
...
[body]
Status line: lowest protocol version required to process the response (HTTP/1.0), response status code (200), response status description (OK). Same header format as requests, but different headers are used (see RFC 2616 Section 14).
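The parts of a response can be pulled apart with plain string operations. The sketch below is deliberately simple and assumes a complete, well-formed message with CRLF line endings; the `parse_response` helper and the sample body are hypothetical.

```python
def parse_response(raw):
    """Split a raw HTTP response into status line parts, headers, and body."""
    # A blank line separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line: protocol version, status code, status description.
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(code), reason, headers, body


RAW = (
    "HTTP/1.0 200 OK\r\n"
    "Server: Apache\r\n"
    "X-Powered-By: PHP/5.2.5\r\n"
    "\r\n"
    "<html>...</html>"
)
version, code, reason, headers, body = parse_response(RAW)
print(version, code, reason)
print(headers)
```

A real client also has to deal with repeated headers, case-insensitive header names, and chunked bodies, which this sketch ignores.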
Response Status Codes
1xx Informational: Request received, continuing process.
2xx Success: Request received, understood, and accepted.
3xx Redirection: Client must take additional action to complete the request.
4xx Client Error: Request is malformed or could not be fulfilled.
5xx Server Error: Request was valid, but the server failed to process it.
See RFC 2616 Section 10 for more info.
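Because the class is carried by the first digit, a scraper can branch on it without listing every code. A small Python sketch, with the `status_class` helper and the "Unknown" fallback being assumptions of the example:

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a status code to its class using the hundreds digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


for code in (200, 304, 404, 503):
    print(code, status_class(code))
```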
Headers. Set-Cookie (response) and Cookie (request): see RFC 2109 or RFC 2965 for more info. Location: watch out for infinite loops! Last-Modified OR ETag (response), then If-Modified-Since OR If-None-Match (request), answered with 304 Not Modified.
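One way to guard against the redirect loops the slide warns about is to remember visited URLs and cap the number of hops. The sketch below assumes a hypothetical `fetch` callable that returns a status code and a header dictionary; `FAKE_SITE` stands in for a real server so the example runs offline.

```python
def follow_redirects(fetch, url, max_hops=5):
    """Follow Location headers, refusing loops and overly long chains."""
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop at {url}")
        seen.add(url)
        status, headers = fetch(url)
        if status not in (301, 302, 303, 307) or "Location" not in headers:
            return url, status
        url = headers["Location"]
    raise RuntimeError("too many redirects")


# Hypothetical responses: /a redirects to /b; /x and /y redirect to each other.
FAKE_SITE = {
    "/a": (302, {"Location": "/b"}),
    "/b": (200, {}),
    "/x": (302, {"Location": "/y"}),
    "/y": (302, {"Location": "/x"}),
}


def fake_fetch(url):
    return FAKE_SITE[url]


print(follow_redirects(fake_fetch, "/a"))
```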
More Headers. WWW-Authenticate (response) and Authorization (request): see RFC 2617 for more info. User-Agent: some servers perform user agent sniffing (200 OK / 403 Forbidden); some clients perform user agent spoofing.
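As an illustration of these request headers, the sketch below builds an `Authorization` value for the Basic scheme described in RFC 2617 and sets a custom `User-Agent`. The slide does not say which authentication scheme is meant, so Basic is an assumption here, and the `basic_authorization` helper and the `ExampleScraper/0.1` agent string are hypothetical.

```python
import base64


def basic_authorization(username, password):
    """Build an Authorization header value for the Basic scheme."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Credentials taken from the example in RFC 2617.
request_headers = {
    "Authorization": basic_authorization("Aladdin", "open sesame"),
    "User-Agent": "ExampleScraper/0.1",  # hypothetical agent string
}
print(request_headers)
```

Basic credentials are only encoded, not encrypted, so they are readable by anyone who can see the request.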
Simulate User Behavior
Batch Jobs, Non-Peak Hours
Questions? No heckling... OK, maybe just a little. I generally blog about my experiences with web scraping and PHP at http://ishouldbecoding.com. </shameless_plug> Thanks for coming!