Scraping talk public

Published on February 28, 2014

Author: NESTA_UK

Source: slideshare.net

Getting data from the web for research
Andrew Whitby
27 February 2014

Web data projects I’ve worked on
• Examining the global trade in music
  – Website: various websites incl. Wikipedia, MusicBrainz
  – Data items: 8 million chart entries, ~50k unique artists
• Analysing promotion techniques for artists in foreign markets
  – Website: a social network
  – Data items: 5k users with 2+ million user preferences (similar to ‘likes’)
• Investigation of data skills
  – Website: university course database
  – Data items: 20,000 courses
• Modelling political orientation of various organisations*
  – Website: Twitter
  – Data items: 10ks of followers

* Not at Nesta

Do you really need to scrape?
Options, from easiest to hardest:
• Bulk download: Some sites make their data available as a download. Check!
• Use an API: A programming interface designed to expose data directly.
• Manually collect the data: For up to 100s of items, this can be quicker (intern, contract researcher?)
• Contact the site owner: For smaller sites this can be surprisingly effective.
• Scrape the website: Do this as a last resort.

Can it be scraped?
• Structured or semi-structured = Scraping
• Unstructured text = A different problem

Scrapers

Web 101
• Clients (your browser) send requests to servers (e.g. www.nesta.org.uk) using HyperText Transfer Protocol (HTTP)
• Depending on the request, the server might return
  – A web page, in HTML
  – An image (e.g. a PNG or JPG)
  – Some data, as XML or JSON
  – Etc
• Scraping and APIs both use HTTP
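To make the request/response cycle concrete, here is a minimal sketch using only Python's standard library. The function names and the user-agent string are illustrative assumptions, not part of the talk; `fetch` is defined but not called, so nothing touches the network until you call it yourself.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="research-scraper-example/0.1"):
    """Build an HTTP GET request; the user-agent string is a made-up example."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return (status, content_type, body_bytes).

    The Content-Type header says what the server returned:
    text/html for a page, image/png for an image, application/json for data.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type"), response.read()
```

Calling `fetch("http://www.nesta.org.uk")` would send a single GET request to the example server named above and return whatever it answers with.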

So how does web scraping work?
• In the (good) old days web pages were very simple, handcrafted, marked-up text
• Now most are automatically generated from databases of content according to templates, so they naturally have a repetitive structure
• Scraping exploits the regularities of this (semi-)structure to extract data using text-manipulation algorithms

Scraping example: Nesta People
• Ordinary URL that you would browse to
• Extraneous information, formatting, etc
• The data you actually want: either as a table or list here, or possibly as a link to a page per item
• Pagination, e.g. <<First <Prev 1 2 3 Next> Last>>
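Pagination is usually handled with a loop that follows the "Next" link until there is none. The sketch below assumes a caller-supplied `fetch_page` function that returns a page's items plus the next URL. The URLs and the names beyond Adam and Albert are made up, and a dictionary stands in for the real site so the example runs offline.

```python
def scrape_all_pages(first_url, fetch_page, max_pages=100):
    """Follow 'Next' links from first_url, collecting the items on each page.

    fetch_page(url) must return (items, next_url); next_url is None on the last page.
    max_pages and the visited set guard against endless or circular pagination.
    """
    items = []
    visited = set()
    url = first_url
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items


# Stand-in for a real site: three hypothetical listing pages.
FAKE_SITE = {
    "/people?page=1": (["Adam", "Albert"], "/people?page=2"),
    "/people?page=2": (["Alice"], "/people?page=3"),
    "/people?page=3": (["Anna"], None),
}


def fake_fetch_page(url):
    """Pretend to download and parse one page of the listing."""
    return FAKE_SITE[url]
```

In a real scraper, `fetch_page` would download the page and extract both the items and the "Next" link from its HTML.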

Scraping example: under the bonnet
[Page source for two entries, Adam and Albert, with these parts marked:]
• Start of an entry
• Photo link
• Link to Albert’s main page
• Name text
• End of an entry
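The parts marked in the page source (start of an entry, photo link, link to the person's page, name text, end of an entry) map directly onto parser events. The sketch below uses Python's built-in `html.parser`. The HTML is an invented stand-in, not the real Nesta markup, and the tag and class names are assumptions.

```python
from html.parser import HTMLParser

# Invented markup with the same repetitive structure as a people listing.
SAMPLE_HTML = """
<ul class="people">
  <li class="person">
    <img src="/photos/adam.jpg">
    <a href="/people/adam">Adam</a>
  </li>
  <li class="person">
    <img src="/photos/albert.jpg">
    <a href="/people/albert">Albert</a>
  </li>
</ul>
"""


class PeopleParser(HTMLParser):
    """Collect one dict per <li class="person"> entry."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._current = None   # the entry being built, if any
        self._in_link = False  # True while inside the entry's <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "person":
            self._current = {"name": "", "page": None, "photo": None}  # start of an entry
        elif self._current is not None and tag == "img":
            self._current["photo"] = attrs.get("src")                  # photo link
        elif self._current is not None and tag == "a":
            self._current["page"] = attrs.get("href")                  # link to main page
            self._in_link = True

    def handle_data(self, data):
        if self._in_link and self._current is not None:
            self._current["name"] += data.strip()                      # name text

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._current is not None:
            self.people.append(self._current)                          # end of an entry
            self._current = None


def extract_people(html):
    """Return the list of entries found in the given HTML text."""
    parser = PeopleParser()
    parser.feed(html)
    parser.close()
    return parser.people
```

Libraries such as BeautifulSoup make this shorter, but the principle is the same: the template's regular structure tells you where each field starts and ends.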

Scraping: legal considerations
• Jurisdiction issues
• Laws that have been relied upon
  – Contract: terms of service
  – Copyright law
  – EU Database Directive (research exemption?)
  – US Computer Fraud & Abuse Act
  – US Digital Millennium Copyright Act
• Case law
  – Unsettled: conflicting decisions

Bottom line: this is a grey area and not without legal risk
(Also: I’m not a lawyer, this is not legal advice)

Scraping: ethical considerations
• Remember, the site wasn’t designed for this purpose: be sympathetic to the site owner
• Avoid putting an unreasonable burden on the site
  – Some run on massive datacentres, others a single machine.
  – Rule of thumb: don’t scrape multiple items in parallel
• Ask permission if you can
  – But be realistic, and remember a lot of web traffic is scraping (Google, Bing, etc)
• Observe robots.txt
  – But this is (probably) not legally binding either way

This is before even thinking about privacy (if user data involved)

Scraping courtesy: robots.txt
If this file exists it will be at http://sitename.com/robots.txt
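Python's standard library can read a robots.txt file and answer "may I fetch this URL?". The sketch below parses an invented robots.txt from a string so it runs offline. The rules, the paths and the user-agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules: everything is allowed except /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt_text):
    """Parse robots.txt content (as text) into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


checker = make_checker(SAMPLE_ROBOTS_TXT)
allowed_people = checker.can_fetch("research-scraper-example", "http://sitename.com/people/")
allowed_private = checker.can_fetch("research-scraper-example", "http://sitename.com/private/notes")
```

Against a live site you would instead call `set_url("http://sitename.com/robots.txt")` followed by `read()` on the parser, then check `can_fetch` before each request.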

Scraping: practical issues
Sites may reject connections, or challenge your humanity with CAPTCHAs

Getting around limits
The simple options
  – Slow down requests, introduce random delays
  – Use ‘user agent’ to pretend to be human
The serious option
  – Tor (“the onion router”)
  – Anonymises your network location.
  – Ethical consideration though
    • Tor is a fragile community with better uses

If these don’t work, give up. If they’ve gone to this much trouble to prevent scraping, they’re more likely to get upset and possibly take action against you.

These aren’t the droids you’re looking for
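The first simple option, slowing down and adding random delays, can be sketched as below. The two-second base and one-second jitter are arbitrary assumptions, and `fetch` is whatever download function you already have. Requests are made one at a time, in line with the rule of thumb against scraping in parallel.

```python
import random
import time


def polite_delay(base_seconds=2.0, jitter_seconds=1.0):
    """Return a pause of base_seconds plus a random extra of up to jitter_seconds."""
    return base_seconds + random.uniform(0, jitter_seconds)


def fetch_sequentially(urls, fetch, base_seconds=2.0, jitter_seconds=1.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing for a randomised delay between requests.

    The sleep function is a parameter only so the pause can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(polite_delay(base_seconds, jitter_seconds))
        results.append(fetch(url))
    return results
```

If the site's robots.txt or terms state a minimum delay, use that as the base value.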

APIs (Application Programming Interfaces)

How do APIs work?
• Way of extracting structured data from a web site or service
  – A service intentionally made available by the data owner
• Just a set of rules for communicating / exchanging data
  – Request is usually made as a specially-constructed web address
  – Response is usually encoded as JSON or XML
• You can access an API:
  – directly in your browser (good for testing)
  – using a tool like curl
  – by programming it directly
  – by using a ‘wrapper’ in your language of choice (Python, Ruby, Java, etc)
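Both halves of that exchange can be sketched in a few lines: building the specially-constructed web address, and decoding a JSON response. The endpoint `https://api.example.org/company`, its parameters and the response fields are all hypothetical. They do not describe any real API, and the sample response is hard-coded so the example runs offline.

```python
import json
from urllib.parse import urlencode


def build_api_url(base_url, **params):
    """Append the query parameters to base_url, URL-encoding them as needed."""
    return f"{base_url}?{urlencode(params)}"


# A made-up response body, shaped the way a JSON API might answer.
SAMPLE_RESPONSE = '{"company": {"number": "01234567", "name": "EXAMPLE LTD", "status": "active"}}'


def parse_company(response_text):
    """Decode the JSON text and pull out the two fields we care about."""
    data = json.loads(response_text)
    company = data["company"]
    return company["name"], company["status"]


request_url = build_api_url("https://api.example.org/company", number="01234567", format="json")
name, status = parse_company(SAMPLE_RESPONSE)
```

Unlike scraping, no HTML parsing is needed: the response is already structured, so the fields are read by name.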

An API is a set of rules

API example: Companies House
• Specially constructed URL (‘request’)
• Structured, unformatted data returned (‘response’)
A RESTful request using HTTP with data returned in JSON format

API example: Companies House
• Formatted, human-friendly page returned
The same data rendered in a human-friendly web format.

APIs: legal issues
• Situation is simpler/safer than scraping
• Publishing an API means a data provider is encouraging use, and explicitly controlling the amount of data you can collect
• With an API you are more likely to have to expressly agree to something (“clickwrap”); with a paid API you’ll have a formal contract

APIs: ethical issues
• As with scraping, avoid putting an unreasonable burden on the site
• But often API owners will be explicit about what a reasonable burden is
  – This may be voluntary
  – Or enforced via a ‘rate limit’
• Easier for API owner to enforce, so responsibility is shifted somewhat

APIs: practical issues
• APIs will often be ‘rate limited’: that is, a limit is imposed on how many requests you can make per minute/hour.
• This can increase the elapsed time it takes to collect large quantities of information
  – But often free registration will increase your rate limit
  – And paid accounts may increase it further
  – Don’t try to work around this any other way
• APIs may not provide all the same fields web users see: they are often designed for third-party apps rather than research
  – In which case, scraping may be an option
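Staying inside a rate limit is easiest if the client spaces out its own requests. The class below is a minimal sketch under one assumption: the limit is expressed as "N requests per T seconds" and requests are spread evenly across that window. The clock and sleep functions are parameters only so the logic can be checked without actually waiting.

```python
import time


class RateLimiter:
    """Space calls so that no more than max_requests happen per per_seconds."""

    def __init__(self, max_requests, per_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = per_seconds / max_requests  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `limiter.wait()` immediately before each API request. For example, `RateLimiter(60, 60)` allows at most one request per second.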

DIY web data access (scraping and API access)
• Point and click: Import.io, Yahoo Dapper, Yahoo Pipes, various browser extensions (e.g. Chrome Scraper), Kimono?, Scraperwiki (Twitter)
• Some code: Scraperwiki, Morph.io, Krake.io
• Lots of code: Scrapy, BeautifulSoup, your language of choice (Python+Requests is good)

Also see this list of non-code scraping things to try courtesy of a pair of US journalists: here

Contracted web data access
• How much:
  – e.g. ScraperWiki: $3-10k upfront, $200-500 per month
• Think about
  – How will you receive/analyse the data?
  – What is the time period of interest?
  – Is it a well-known API (e.g. Twitter) or something exotic (e.g. Douban)?

Case study
Data: Twitter public API (~800 users, 1m tweets over Jan-Oct 2012, plus network snapshots at 3 times)
Cost: £10-15k
Time: months (limitations of API history + rate limits)
Issues
• Lack of transparency/documentation about data processing decisions (what’s in, what’s out), e.g. getting from complex to flat data structures
• Need for iteration, constant communication
• Data collection skills may not coexist with report-writing skills

Summary
1. Consider your non-scraping options
2. A legally grey area: be aware of this
3. If you scrape, scrape ethically
4. Scraping starts simply, but can get complicated
5. Life is easier with open APIs

Glossary
• API Key: a secret string that you use to identify yourself to an API
• CAPTCHA: Completely Automated Public Turing Test to tell Computers and Humans Apart
• HTML (HyperText Markup Language): the language in which web pages are constructed
• HTTP (HyperText Transfer Protocol): the communications protocol that is used to transfer web pages from the server to your browser. APIs use this too
• JSON: a very simple data format based on the JavaScript language, that is quite readable to humans too
• rate limit: a limit on how frequently you can make requests to the API
• REST: a popular semantic approach to using HTTP for APIs
• XML: a more complex data format that predates JSON
