Friday, April 26, 2013

Topic Exploration Group #5: The Promise and Failings of Screen Scraping



      The link I have placed above explains in more detail what screen scraping is, what goes into it, how to write a script that scrapes data, and the impact such scripts can have. As the article explains, screen scraping is the process of using automated scripts to extract the data contained on web pages so that the data can be used on the script owner’s site. The article also walks the reader through creating a basic scraper script in Python.
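
      To give a rough idea of what such a script involves, here is a minimal sketch of the parsing half of a scraper. It is not the script from the article; it only uses Python's standard html.parser module, and the names LinkScraper and scrape_links are my own, made up for illustration. It pulls the address and text of every link out of a page that has already been downloaded.

```python
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Collects (href, text) for every <a> tag that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []            # results: list of (href, text) tuples
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []      # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def scrape_links(html):
    """Hypothetical helper: return all (href, text) pairs found in an HTML string."""
    parser = LinkScraper()
    parser.feed(html)
    parser.close()
    return parser.links
```

      A real scraper would first have to download the page, and that download is the part that puts load on the other site.
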
      Unfortunately, screen scrapers have a number of problems. Chief among them, they can severely slow down or even crash the sites they scrape. A scraper sends requests to a site just as a regular user with a browser would, but because it is a script that may run constantly, it can send far more requests than human visitors could generate. Trying to handle them all can bog down or even crash some servers. As the article above explains, a script writer can employ a number of methods to keep the script from crashing the site it scrapes, but these do not eliminate the problem of scrapers eating up bandwidth on the server side. They only reduce how much is consumed.
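
      One such method, as I understand it, is simply to space the requests out. The sketch below is my own illustration rather than code from the article: fetch_politely is a made-up name, the two-second default delay is an arbitrary assumption, and the fetch argument stands in for whatever function actually downloads a page.

```python
import time


def fetch_politely(urls, fetch, min_delay=2.0, sleep=time.sleep, clock=time.monotonic):
    """Fetch each URL in turn, leaving at least min_delay seconds
    between the start of one request and the start of the next.

    fetch is any callable that takes a URL and returns its content.
    sleep and clock are parameters only so they can be swapped out in tests.
    """
    results = {}
    last_request = None  # clock reading when the previous request started
    for url in urls:
        if last_request is not None:
            # Wait out whatever is left of the minimum delay.
            wait = min_delay - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        results[url] = fetch(url)
    return results
```

      In practice fetch could be a small wrapper around urllib.request.urlopen. Even then, the script still uses the other site's bandwidth; it just uses it more slowly.
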
      Screen scraping strikes me as a dangerous practice. It certainly has its perks, but the drawbacks make me very nervous. A poor script does not just hurt the site being scraped; it also hurts the site it is supposed to serve. If the scraped site crashes, the script cannot get its data or, worse, it could send back bad data. Either outcome would damage the reputation of the site that relies on the scraper. Websites are essentially services, and if a service is spotty, users will go elsewhere. After extended periods of spotty service from the scraped site, the website that depends on scrapers could start hemorrhaging users. In my mind, anyone who offers a service must make sure it works as often as possible, and it doesn’t seem like screen scrapers help with that. No matter how helpful the data they provide is, there is still the risk of a site crash and loss of data.
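
      To make the "no data or bad data" worry concrete, here is a small sketch of how a site might guard against it. This is purely my own assumption about what a careful site could do, not something the article describes; refresh_data, is_valid, and last_good are hypothetical names.

```python
def refresh_data(fetch, is_valid, last_good):
    """Try to scrape fresh data; fall back to the last good copy on failure.

    fetch     -- callable that performs the scrape and returns the data
    is_valid  -- callable that returns True if the data looks sane
    last_good -- the most recent data that passed validation

    Returns (data, is_fresh), where is_fresh is False when the
    fallback copy was used.
    """
    try:
        data = fetch()
    except Exception:
        # The scraped site is down or the request failed.
        return last_good, False
    if not is_valid(data):
        # The site answered, but with something we do not trust.
        return last_good, False
    return data, True
```

      Even with a guard like this, the site ends up serving stale data for as long as the other site is down, so the service is still only as reliable as the site being scraped.
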

      So, what do others think about this? Are screen scrapers worth the trouble? Which do you think is more important for a site to survive: good data or good service? Do you believe that, as long as only well-programmed screen scrapers are used, the rewards are worth the risk?

-Noel Hansen
