Published on February 28, 2014
Getting data from the web for research Andrew Whitby 27 February 2014
Web data projects I’ve worked on

Project | Website | Data items
Examining the global trade in music | Various websites incl. Wikipedia, Musicbrainz | 8 million chart entries, ~50k unique artists
Analysing promotion techniques for artists in foreign markets | A social network | 5k users with 2+ million user preferences (similar to ‘likes’)
Investigation of data skills | University course database | 20,000 courses
Modelling political orientation of various organisations* | Twitter | 10ks of followers

[Scrape / API columns: tick marks not recoverable]
* Not at Nesta
Do you really need to scrape? (Easiest to hardest)
• Bulk download: Some sites make their data available as a download. Check!
• Use an API: A programming interface designed to expose data directly.
• Manually collect the data: For up to 100s of items, this can be quicker (intern, contract researcher?)
• Contact the site owner: For smaller sites this can be surprisingly effective.
• Scrape the website: Do this as a last resort.
Can it be scraped? Structured or semi-structured = Scraping Unstructured text = A different problem
Web 101
• Clients (your browser) send requests to servers (e.g. www.nesta.org.uk) using HyperText Transfer Protocol (HTTP)
• Depending on the request, the server might return
– A web page, in HTML
– An image (e.g. a PNG or JPG)
– Some data, as XML or JSON
– Etc
• Scraping and APIs both use HTTP
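To make the request/response idea concrete, here is a minimal Python sketch using only the standard library. The helper names (`build_request`, `fetch`) and the user-agent string are assumptions made for illustration; they do not come from the talk.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="research-example/0.1"):
    """Build (but do not send) an HTTP GET request for a URL.

    The user-agent string is a made-up placeholder.
    """
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the request and return (status, content type, raw body)."""
    with urlopen(build_request(url), timeout=30) as response:
        content_type = response.headers.get("Content-Type", "")
        return response.status, content_type, response.read()


request = build_request("http://www.nesta.org.uk/")
print(request.get_method(), request.full_url)
```

Calling `fetch` on a page URL would typically report an HTML content type, while an API URL would typically report JSON or XML. The mechanism is the same in both cases.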
So how does web scraping work?
• In the (good) old days web pages were very simple, handcrafted, marked-up text
• Now most are automatically generated from databases of content according to templates, so they naturally have a repetitive structure
• Scraping exploits the regularities of this (semi-)structure to extract data using text-manipulation algorithms
Scraping example: Nesta People
• Ordinary URL that you would browse to
• Extraneous information, formatting, etc
• The data you actually want: either as a table or list here, or possibly as a link to a page per item
• Pagination, e.g. <<First <Prev 1 2 3 Next> Last>>
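Pagination usually means the same listing is reachable through a predictable series of URLs. The sketch below assumes a hypothetical `page` query parameter and a placeholder base URL; real sites vary, so check the links behind the page numbers first.

```python
from urllib.parse import urlencode


def page_urls(base_url, last_page, param="page"):
    """Yield one URL per listing page, from page 1 to last_page.

    Assumes pages are selected with a query parameter such as ?page=2,
    which is a common but not universal convention.
    """
    for number in range(1, last_page + 1):
        yield f"{base_url}?{urlencode({param: number})}"


# Hypothetical listing address, for illustration only.
for url in page_urls("http://sitename.com/people", 3):
    print(url)
```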
Scraping example: under the bonnet
[Screenshot of the page’s HTML source, showing the entries for Adam and Albert, annotated with:]
• Start of an entry
• Photo link
• Link to Albert’s main page
• Name text
• End of an entry
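The annotations above (start of entry, photo link, page link, name text, end of entry) map directly onto a small parser. This is a sketch only: the sample HTML, its tag and class names, and the `PeopleParser` class are invented for illustration and are not the real markup of the Nesta page.

```python
from html.parser import HTMLParser

# Invented markup with the same repeating shape as a templated listing page.
SAMPLE_HTML = """
<ul class="people">
  <li class="person">
    <a href="/people/adam"><img src="/photos/adam.jpg" alt=""></a>
    <a class="name" href="/people/adam">Adam</a>
  </li>
  <li class="person">
    <a href="/people/albert"><img src="/photos/albert.jpg" alt=""></a>
    <a class="name" href="/people/albert">Albert</a>
  </li>
</ul>
"""


class PeopleParser(HTMLParser):
    """Collect one dict per <li class="person"> entry."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._entry = None      # the entry currently being read, if any
        self._in_name = False   # True while inside the name link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "person":
            # Start of an entry
            self._entry = {"name": "", "link": None, "photo": None}
        elif self._entry is None:
            return
        elif tag == "img":
            # Photo link
            self._entry["photo"] = attrs.get("src")
        elif tag == "a" and attrs.get("class") == "name":
            # Link to the person's main page
            self._entry["link"] = attrs.get("href")
            self._in_name = True

    def handle_data(self, data):
        if self._in_name and self._entry is not None:
            # Name text
            self._entry["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_name = False
        elif tag == "li" and self._entry is not None:
            # End of an entry
            self.people.append(self._entry)
            self._entry = None


def parse_people(html):
    parser = PeopleParser()
    parser.feed(html)
    parser.close()
    return parser.people


for person in parse_people(SAMPLE_HTML):
    print(person)
```

Libraries such as BeautifulSoup do the same job with less code. The standard-library version is used here only to keep the sketch self-contained.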
Scraping: legal considerations
• Jurisdiction issues
• Laws that have been relied upon
– Contract: terms of service
– Copyright law
– EU Databases Directive (research exemption?)
– US Computer Fraud & Abuse Act
– US Digital Millennium Copyright Act
• Case law
– Unsettled: conflicting decisions
Bottom line: this is a grey area and not without legal risk
(Also: I’m not a lawyer, this is not legal advice)
Scraping: ethical considerations
• Remember, the site wasn’t designed for this purpose: be sympathetic to the site owner
• Avoid putting an unreasonable burden on the site
– Some run on massive datacentres, others on a single machine
– Rule of thumb: don’t scrape multiple items in parallel
• Ask permission if you can
– But be realistic, and remember a lot of web traffic is scraping (Google, Bing, etc)
• Observe robots.txt
– But this is (probably) not legally binding either way
This is before even thinking about privacy (if user data is involved)
Scraping courtesy: robots.txt If this file exists it will be at http://sitename.com/robots.txt
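Python’s standard library can read robots.txt rules for you. The sketch below parses an invented robots.txt (the rules and the `my-research-bot` agent name are assumptions) rather than downloading a real one, so it runs offline. For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file instead.

```python
from urllib.robotparser import RobotFileParser

# Invented rules, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_checker(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(SAMPLE_ROBOTS_TXT)
agent = "my-research-bot"  # hypothetical user-agent name
print(checker.can_fetch(agent, "http://sitename.com/people/"))
print(checker.can_fetch(agent, "http://sitename.com/private/data"))
print(checker.crawl_delay(agent))
```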
Scraping: practical issues Sites may reject connections, or challenge your humanity with CAPTCHAs
Getting around limits
The simple options
– Slow down requests, introduce random delays
– Use ‘user agent’ to pretend to be human
The serious option
– Tor (“the onion router”): anonymises your network location
– Ethical consideration though: Tor is a fragile community with better uses
If these don’t work, give up. If they’ve gone to this much trouble to prevent scraping, they’re more likely to get upset and possibly take action against you.
[Image caption: These aren’t the droids you’re looking for]
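As an illustration of the first simple option, the sketch below fetches URLs one at a time with a randomised pause between requests. The function name, the default delay values, and the idea of passing in the `fetch` function are assumptions chosen for the example, not recommendations from the talk.

```python
import random
import time


def fetch_all(urls, fetch, base_delay=2.0, jitter=1.0, sleep=time.sleep):
    """Fetch URLs sequentially, pausing between consecutive requests.

    Each pause is base_delay plus a random extra of up to `jitter` seconds.
    `fetch` is any function that takes a URL and returns a result; `sleep`
    can be replaced when testing so that no real waiting happens.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(base_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results


# Dry run with a stand-in fetch function and no real waiting.
pauses = []
pages = fetch_all(["page-1", "page-2", "page-3"], fetch=str.upper, sleep=pauses.append)
print(pages, len(pauses))
```

Fetching sequentially also follows the rule of thumb about not scraping multiple items in parallel.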
APIs (Application Programming Interfaces)
How do APIs work?
• Way of extracting structured data from a web site or service
– A service intentionally made available by the data owner
• Just a set of rules for communicating / exchanging data
– Request is usually made as a specially-constructed web address
– Response is usually encoded as JSON or XML
• You can access an API:
– directly in your browser (good for testing)
– using a tool like curl
– by programming it directly
– by using a ‘wrapper’ in your language of choice (Python, Ruby, Java, etc)
An API is a set of rules
API example: Companies House Specially constructed URL (‘request’) Structured, unformatted data returned (‘response’) A RESTful request using HTTP with data returned in JSON format
API example: Companies House
Formatted, human-friendly page returned
The same data rendered in a human-friendly web format.
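The request/response pattern in this example can be sketched in a few lines of Python. Everything specific here is hypothetical: the base address, the `format` parameter, and the field names in the sample response are placeholders, not the actual Companies House API. Consult the provider’s documentation for the real URL scheme.

```python
import json
from urllib.parse import urlencode

# Placeholder address; not a real API endpoint.
API_BASE = "https://api.example.org/company"

# Invented response body with made-up field names.
SAMPLE_RESPONSE = (
    '{"company_number": "01234567", "name": "EXAMPLE LTD", "status": "active"}'
)


def build_api_url(company_number, fmt="json"):
    """Construct the 'specially constructed URL' for one company record."""
    return f"{API_BASE}/{company_number}?{urlencode({'format': fmt})}"


def parse_company(body):
    """Decode a JSON response body into (name, status)."""
    record = json.loads(body)
    return record["name"], record["status"]


print(build_api_url("01234567"))
print(parse_company(SAMPLE_RESPONSE))
```

Unlike the scraping case, no HTML has to be picked apart: the response is already structured data.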
APIs: legal issues
• Situation is simpler/safer than scraping
• Publishing an API means a data provider is encouraging use, and explicitly controlling the amount of data you can collect
• With an API you are more likely to have to expressly agree to something (“clickwrap”); with a paid API you’ll have a formal contract
APIs: ethical issues
• As with scraping, avoid putting an unreasonable burden on the site
• But often API owners will be explicit about what a reasonable burden is
– This may be voluntary
– Or enforced via a ‘rate limit’
• Easier for API owner to enforce, so responsibility is shifted somewhat
APIs: practical issues
• APIs will often be ‘rate limited’: that is, a limit is imposed on how many requests you can make per minute/hour
• This can increase the elapsed time it takes to collect large quantities of information
– But often free registration will increase your rate limit
– And paid accounts may increase it further
– Don’t try to work around this any other way
• APIs may not provide all the same fields web users see, as they are often designed for third-party apps rather than research
– In which case, scraping may be an option
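Staying inside a published rate limit is mostly a matter of spacing requests out. The sketch below is one simple way to do that: the `RateLimiter` class and the 30-requests-per-minute figure are assumptions for illustration. The real limit comes from the API’s documentation.

```python
import time


class RateLimiter:
    """Space calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock    # replaceable for testing
        self._sleep = sleep    # replaceable for testing
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical limit of 30 requests per minute, i.e. one every 2 seconds.
limiter = RateLimiter(max_per_minute=30)
print(limiter.min_interval)
```

In use, call `limiter.wait()` immediately before each API request.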
DIY web data access

Point and click
– Scraping: Import.io, Yahoo Dapper, Yahoo Pipes, various browser extensions (e.g. Chrome Scraper), Kimono?
– API access: Scraperwiki (Twitter)
Some code
– Scraping: Scraperwiki, Morph.io, Krake.io
– API access: Scraperwiki
Lots of code
– Scrapy, BeautifulSoup
– Your language of choice (Python+Requests is good)

Also see this list of non-code scraping things to try, courtesy of a pair of US journalists: here
Contracted web data access
• How much:
– e.g. ScraperWiki: $3-10k upfront, $200-500 per month
• Think about
– How will you receive/analyse the data?
– What is the time period of interest?
– Is it a well-known API (e.g. Twitter) or something exotic (e.g. Douban)?

Case study
Data: Twitter public API (~800 users, 1m tweets over Jan-Oct 2012, plus network snapshots at 3 times)
Cost: £10-15k
Time: months (limitations of API history + rate limits)
Issues
• Lack of transparency/documentation about data processing decisions (what’s in, what’s out) when getting from complex to flat data structures
• Need for iteration, constant communication
• Data collection skills may not coexist with report-writing skills
Summary
1. Consider your non-scraping options
2. A legally grey area: be aware of this
3. If you scrape, scrape ethically
4. Scraping starts simply, but can get complicated
5. Life is easier with open APIs
Scraping talk public from Nesta. It belongs to a growing set of ‘data’ themed skills resources Nesta is collecting. Posted in Uncategorized and tagged ...