Notes on web scraping

Well hello there. You may remember me from posts written well over a year ago! Glad to be back.

Yesterday I tweeted about web scraping, saying that it's always a bad idea. What I mean by "web scraping" is the practice of using automated tools to harvest data from a website, usually to gather data for research or for use in a product.

That was, I'm sure, a bit too strong. After all, universals are always incorrect. But I think it's basically the right stance, and I'd like to explain a bit about why that is.

The problems with scraping

Web scraping is problematic for several reasons, among them:

  • Legality, in that scraping is usually prohibited by terms of service, and the use of scraped content is usually a copyright violation barring fair-use exceptions. So in scraping you are risking running afoul of wonderful laws like (in the US) the Computer Fraud and Abuse Act (CFAA) and the Digital Millennium Copyright Act (DMCA). If you've heard of Aaron Swartz, you've heard of someone who got slammed with the CFAA for web scraping.
  • Ethics, in that regardless of the law it can be unethical to use content that doesn't belong to you, and in scraping you risk bypassing restrictions on content use that were put in place intentionally and that you may not even be aware of. Remember, the fact that information is technically accessible, or even public, doesn't mean it isn't sensitive. Example: your whereabouts on a public street are technically public information, at least in the US. Your neighbor waving at you when you leave your house is great. Someone sitting outside your house 24/7 and tweeting whenever you come and go is grounds for a restraining order.
  • Technical and financial concerns, such as denial of service. You don't know what kind of capacity the website you are scraping has available, nor how they are paying for it. It could well be that every request you send costs the website incremental money, or that their server will simply crash under the load. You can't know this ahead of time if you haven't consulted with the person or organization running the website. You can be careful, and you can even respect technical indicators like robots.txt files (see the sketch just after this list), but you still won't really know unless you coordinate.
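On that robots.txt point: checking it is table stakes, even though it is only advisory. Here is a minimal sketch using Python's standard-library robotparser; the site and bot name are placeholders, not a real target.

```python
from urllib import robotparser

# Placeholder site and user-agent name; substitute your own.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

url = "https://example.com/some/page"
if rp.can_fetch("my-research-bot", url):
    print("robots.txt permits this request")
else:
    print("robots.txt disallows this request; stop here")
```

Passing this check tells you nothing about terms of service, copyright, or server capacity. It's a floor, not a ceiling.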

Instructors: Keep in mind that when you tell your students it is OK to scrape the web to gather data for their projects, you are opening them up to real liability. Encouraging your students to violate the CFAA is no small thing, no matter how bad a law it is.

What are the alternatives?

First, a reminder: we are not generally entitled to use resources we don't own in the service of our own projects. There are certainly situations in which using resources without the owner's permission can be justified, and we'll get to that down around step 6 or 7, but all other things being equal, if there is a better option, you should use it.

My recommendation when looking to acquire bulk data for research purposes is to do the following in this order:

  1. Look for bulk download options provided by the content owner. It is very common to see people talk about scraping sites like Wikipedia, IMDb, GitHub, or Stack Overflow. Guess what! No need, as all of these sites provide bulk data download options.
  2. If no bulk download is available, see if there is an API that might provide what you need within the API terms of use. (There's a sketch of polite API use after this list.)
  3. Check the terms of service or user agreement to see if by some chance it actually explicitly allows automated scraping in some form. If it does, then go ahead, but this is a rare case.
  4. Email the organization or person running the site and ask if it would be possible to arrange access to the data you are interested in.
    1. You may want to do this well ahead of the time you'll need the data so that there is time for negotiation, legal agreements, etc.
    2. Simultaneously, you may want to ask around among your network and advisors as to whether there is a version of the bulk data available already that you may have missed.
  5. If the organization is not willing to allow use of their data, see if there is an alternative data set that could serve the same purpose. Note that this may require changing your research plan. That's fine! Part of building a good research plan is being able to execute it ethically and legally. It's a good skill to develop.
  6. Consult a lawyer. And an ethicist if there is anything conceivably sensitive about the data you are accessing.
  7. If you've determined that the merits of your need outweigh the legal and ethical issues around ownership and permission, and you are willing to accept the potential consequences of being wrong, scrape, as politely as you possibly can (see the second sketch after this list).
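To make step 2 concrete, here's a sketch of polite API use, in this case GitHub's REST API, whose rate-limit headers are documented. This assumes the third-party requests library; the repository queried is arbitrary, and the bot name and contact address are placeholders.

```python
import time

import requests

# Identify yourself; blank or anonymous user agents are rude and often blocked.
headers = {"User-Agent": "my-research-bot (contact: me@example.org)"}

resp = requests.get("https://api.github.com/repos/python/cpython", headers=headers)
resp.raise_for_status()
repo = resp.json()
print(repo["full_name"], repo["stargazers_count"])

# GitHub documents these rate-limit headers; honor them before the next call.
if int(resp.headers.get("X-RateLimit-Remaining", "1")) == 0:
    reset_at = int(resp.headers["X-RateLimit-Reset"])  # Unix timestamp
    time.sleep(max(0, reset_at - time.time()))
```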
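And if you do end up at step 7, with your lawyer's blessing, at least scrape politely: identify yourself, respect robots.txt, and rate-limit aggressively. A minimal sketch, again assuming requests, with placeholder URLs and an arbitrary delay:

```python
import time
from urllib import robotparser

import requests

BASE = "https://example.com"  # placeholder; not a real target
AGENT = "my-research-bot (contact: me@example.org)"
DELAY = 5  # seconds between requests; arbitrary, but deliberately slow

rp = robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

for path in ["/page-1", "/page-2"]:  # placeholder paths
    url = BASE + path
    if not rp.can_fetch(AGENT, url):
        continue  # robots.txt disallows this path; skip it
    resp = requests.get(url, headers={"User-Agent": AGENT})
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(DELAY)  # never hammer a server whose capacity you don't know
```

Even this is a courtesy, not a defense; none of it changes the legal or ethical analysis above.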

But...

... some people won't agree with the ethical points above. Keep in mind that I'm not a lawyer, nor do I really know much about the law in this area; I'm laying out a fairly conservative position. If you disagree, and your eyes are open to the possible consequences of being wrong, which range from a mild slap on the wrist to career suicide and possible jail time, then who am I to stop you?

But if you are an instructor teaching a class, a lead researcher advising your subordinates, or an employee or contractor doing a project for a company, there is more at stake than just yourself. You could well get your students in trouble or subject your company to a damaging lawsuit without realizing that what you are doing is problematic. So think twice and look for alternatives.