Everyone does web scraping wrong

Why rely on some strange library to access online data when you can just download it yourself?

March 18, 2021


In an earlier post I mentioned that I use FetchRSS.com to generate RSS feeds for sites that don't have them built in. In that post I did say that it was not the best option, but it was the best I could find. Because I'm not entirely happy with FetchRSS, I've been looking into web scraping so that I can write my own software to solve this problem for myself and for others.

How people web scrape

After familiarizing myself with how people handle web scraping, I've come to the conclusion that everyone is doing it the hard way. Here is the pattern people follow for all of their web scraping needs, generalized so that it describes the process in any language:

1. Install a web scraping library, usually one that drives a headless browser.
2. Use the library to open the page and wait for the browser to load and render it.
3. Use the library's functions to query the rendered page for the data you want, waiting on every call.

Doing this of course works, but not as well or as fast as it could. The last tutorial I looked at for doing this was in JavaScript (since it is specifically designed to get and manipulate data from HTML documents), and nearly every statement the guy wrote in his code had to be prefixed with the await operator, which hints at the fact that the functions in these libraries can take a while to return anything since they are relying on a remote server to get data.
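
To make that concrete, here is a rough sketch of what that pattern looks like in Python using the Selenium library (the tutorial I watched did the same sort of thing in JavaScript). The URL and the CSS selector are made-up placeholders:

    # A sketch of the usual pattern: drive a headless browser and query the live page.
    # Assumes Selenium plus a Chrome/chromedriver install; URL and selector are placeholders.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")          # run the browser without a window
    driver = webdriver.Chrome(options=options)  # spawn the background browser process

    driver.get("https://example.com/articles")  # load and render the page
    for heading in driver.find_elements(By.CSS_SELECTOR, "h2 a"):
        # every one of these queries goes back through the browser process
        print(heading.text, heading.get_attribute("href"))

    driver.quit()  # shut down the browser we had to keep running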

How people should web scrape

This whole process seemed ridiculous to me once I realized how it should actually be done. Here is my method:

1. Download the web page with a program like curl or wget.
2. Pull the data you want out of the downloaded file using the functionality already built into your language.

This process makes way more sense. It takes advantage of the functionality already built into most programming languages, so you don't have to learn how to use new libraries, and it does not require keeping any background processes running to stay connected to some server. Both of these things mean that a program built to handle web scraping this way needs fewer system and network resources, which means it will run much faster. This should be the obvious way to do this sort of thing.
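
To give a rough idea of what I mean, here is a sketch of this method in Python using nothing outside the standard library: it shells out to curl to download the page and then pulls the links out with the built-in html.parser module. The URL is a made-up placeholder, and a real feed generator would obviously grab more than just links:

    # A sketch of the simple method: download the page once, then parse the static
    # HTML with nothing but the standard library. The URL is a placeholder.
    import subprocess
    from html.parser import HTMLParser

    page = subprocess.run(
        ["curl", "-s", "https://example.com/articles"],
        capture_output=True, text=True, check=True,
    ).stdout

    class LinkParser(HTMLParser):
        """Collect the href of every <a> tag in the downloaded page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    parser = LinkParser()
    parser.feed(page)
    print("\n".join(parser.links))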

Why people like to do things the hard way

There are a few reasons why developers use the first method rather than the one I propose. The first few are cases where my method wouldn't work and a headless browser would be necessary; the last is a more philosophical reason that anyone would benefit from recognizing more often.

Limitations of my method

The problem with my method lies in the first step. The best way to download a web page is to use either a program called curl or a program called wget. It is possible for website developers to block these programs: there have been times when I have been blocked from using curl on a web page, and while I can't remember a time when I haven't been able to use wget, I do still believe it can be blocked. If neither of these programs worked, you'd have to use a headless web browser for your web scraping program to access the page, and you'd have to do everything the slow way. Could you use the headless browser just to download the page? Probably (I haven't actually looked into it), but I feel like once you get to the point where you need such a thing, you might as well take full advantage of it.
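
If I ever did hit that wall, I imagine the fallback would look something like this sketch: try the cheap curl download first, and only spin up a headless browser (again assuming the Selenium library) when the site blocks it. The function name is just for illustration:

    # Sketch of a fallback: try a plain curl download first, and only reach for a
    # headless browser (assuming Selenium is installed) if the site blocks curl.
    import subprocess

    def fetch(url: str) -> str:
        result = subprocess.run(["curl", "-sf", url], capture_output=True, text=True)
        if result.returncode == 0 and result.stdout:
            return result.stdout  # the cheap path worked

        # The heavy path: let a full browser fetch and render the page for us.
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options

        options = Options()
        options.add_argument("--headless")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()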

Another case where my method simply wouldn't work is when you need to access data on a web page that requires a login. You can't really do that without using a headless browser.
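
For completeness, logging in through a headless browser might look roughly like this; the URLs and the form field names are purely hypothetical, since every site lays out its login page differently:

    # Rough sketch of logging in through a headless browser before scraping.
    # The URLs and the "username"/"password" field names are hypothetical examples.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    driver.get("https://example.com/login")
    driver.find_element(By.NAME, "username").send_keys("myuser")
    driver.find_element(By.NAME, "password").send_keys("mypassword")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    # The browser session is now logged in, so the protected page can be scraped.
    driver.get("https://example.com/members-only")
    print(driver.page_source)
    driver.quit()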

Regardless of these limitations, I do think that my proposed method should be much more common. When you are working on a car you don't take out your impact driver for every nut and bolt; you only use it for the parts that need the extra torque. The same should be true when it comes to programming. If you can do something without bringing in some big library or running some heavy background process, then you should do it without that stuff. Unfortunately we live in a world where many developers feel more comfortable using tools they get from external libraries or unnecessary software packages than they do taking advantage of the functionality built into the languages and software they claim to be experts at.

Our consoomerist bias towards new things

The more important reason why the bloated way of doing web scraping is much more common lies in the fact that people, especially tech people, have a heavy bias towards using the "latest and greatest" things. In one of my programming classes we were advised not to include fancy new libraries and packages unless they were actually necessary: developers should do what is best for the customer, not what is best for themselves or what merely satisfies their curiosity.

Several months ago (back when I wasted my time watching that sort of stuff) Linus Tech Tips did a video comparing three different types of SSDs (gen 4 NVMe connections or something like that had just come out; don't quote me on that, I don't really care, I use a ThinkPad that was made back when people thought Mitt Romney could be President). In this video they found that the older style of SSD performed just as well as the newer ones did, and in some cases the older one did better. They concluded that there was really no reason to upgrade your system with one of these new SSDs.

I don't understand why so many people get so obsessed with the need to always have the newest things. What does an iPhone 12 offer you that an iPhone 11 doesn't? Or even an iPhone 8? The ability to take pictures in 4K? Sure, but do you really need that? Of course not. Increased storage capacity? Yeah, but you only need that if you fill your phone up with massive 4K pictures and videos that you don't need. The only thing that could explain why an average person would think they need this is brainwashing.

People's obsession with new movies, TV shows, books, and video games is another thing I don't understand; I've thought about writing an entire piece about this before. Why would anyone bother hyping themselves up for some movie that won't come out for several months when there exist hundreds of great movies they've never seen?

Before committing yourself to something, spend some time to see if it is really worth it.