The link I have placed above explains in a bit more detail what screen scraping
is, what goes into it, how to write a script that scrapes data, and the impact
such scripts can have. As explained in the article, screen scraping is the
process of using automated scripts to extract the data contained on web pages
so that the data can be used on the script owner’s site. The article walks the
reader through how to create a basic scraper script in Python.
Unfortunately, such scripts are not perfect, and screen scrapers have a number
of problems. Chief among them, they can severely slow down or even crash some
sites. Scrapers that run constantly to collect data send requests to a site
just as a regular user with a browser would. But because they are scripts, they
can send far more of these requests than human visitors could ever generate.
Trying to handle all of those requests can bog down or even crash some servers.
There are a number of methods a script writer can employ to keep the script
from crashing the site it is scraping, as explained in the article above, but
they do not completely solve the problem of scrapers eating up bandwidth on the
server side. Such changes just make the scrapers eat up less.
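
To make one of those methods concrete, here is a minimal sketch in Python of
pausing between requests so the scraper never sends more than one request per
interval. The names Throttle and fetch_all, and the injectable clock and sleep
parameters, are my own assumptions for illustration and are not taken from the
linked article; fetch stands in for whatever function actually downloads a page.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests to one site."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds actually slept.
        """
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


def fetch_all(urls, fetch, throttle):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is a hypothetical callable that takes a URL and returns the page.
    """
    pages = []
    for url in urls:
        throttle.wait()
        pages.append(fetch(url))
    return pages
```

A scraper would create something like Throttle(2.0) once and call wait() before
every request. As noted above, this only reduces the load on the other site; it
does not remove it.
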
Screen scraping strikes me as a dangerous practice. It certainly has its perks,
but the drawbacks make me very nervous. A poor script does not just hurt the
site being scraped; it also hurts the site it is supposed to serve. If the site
being scraped crashes, the script cannot get its data or, worse, it could send
back bad data. Both would damage the reputation of the site that gets its data
from scrapers. Websites are essentially services, and if a service is spotty,
users will go elsewhere. After extended periods of spotty service from the site
being scraped, the website served by scrapers could start hemorrhaging users.
In my mind, if one offers a service, one must make sure that the service works
as often as possible, and it doesn’t seem like screen scrapers will help with
that. No matter how helpful the data they provide is, there is still the risk
of a site crash and loss of data.
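
One way a site that depends on a scraper might soften this risk is to keep the
last good copy of the data and fall back to it when a scrape fails or returns
nothing usable. The sketch below is only an illustration under my own
assumptions: get_data_or_fallback, fetch, parse, and the dictionary cache are
hypothetical names, and "non-empty" is a stand-in for a real sanity check.

```python
def get_data_or_fallback(fetch, parse, cache, key):
    """Return (data, is_fresh).

    Fresh data is returned and cached when the scrape succeeds and passes a
    basic sanity check; otherwise the last good copy (or None) is returned.
    """
    try:
        data = parse(fetch())
    except Exception:
        # Broad on purpose for the sketch: a crashed or unreachable source
        # site and a page that no longer parses are treated the same way.
        data = None

    if data:  # assumed sanity check: treat empty results as bad data
        cache[key] = data
        return data, True

    return cache.get(key), False
```

The second value lets the calling site tell its users that the data may be
stale. This does not fix the underlying dependence on the other site; it only
keeps one outage from immediately becoming two.
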
So, what do others think about this? Are screen scrapers worth the trouble?
What do you think is more important for a site to survive: good data or good
service? Are you of the opinion that, as long as only well-programmed screen
scrapers are used, the rewards are worth the risk?
-Noel Hansen