Converting Web Pages to PDFs

Using pandoc to decrease your reliance on an internet connection.


So, I'm starting up a new semester next week and I got an email with details about one of my classes which had a link to the syllabus. When I clicked on the link I was expecting to get a PDF download, but instead I was taken to a web page which contained the syllabus. And while I actually prefer a rather standard looking web page to a crazily formatted PDF, I do believe that you should not have to rely on an internet connection to access a syllabus or other resources that you may need frequently. Thanks to pandoc, converting simple web pages into easy to read PDFs is quite simple.

I've been thinking about putting this guide together for several weeks now but never had the time or motivation to do it, but now I have both so let's get started.

Software requirements

In order to convert HTML into a PDF you will need two pieces of software: pandoc and a LaTeX compiler. Pandoc is a document conversion program which can convert all sorts of files into all sorts of formats (I recommend checking it out to see what else you can do with it). LaTeX is a markup language which is quite common in academia, and a LaTeX compiler is the engine that pandoc uses by default to convert files into PDFs (there are others that pandoc can use; I often like using pdfroff better, but we will stick to LaTeX for now). Both of these pieces of software are rather large, so be patient when downloading.

The other software that will be used should already be on your operating system. curl ships with most current operating systems, while wget comes with most Linux distributions (on other systems you may need to install it). These two commands are what we will use to download the web pages we will be converting. Either of them works, but you will soon understand why you probably want to use wget if it is available to you.
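If you are not sure what is already installed, a quick check like the one below can save a failed run later. This is only a sketch: the helper name check_tool is made up for this example, and pdflatex is listed on the assumption that you are using pandoc's default PDF engine.

```shell
#!/bin/sh
# Report whether each tool used in this guide is on the PATH.
# check_tool is a hypothetical helper name, not part of any of these programs.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# pdflatex is assumed here because it is pandoc's default PDF engine.
for tool in pandoc pdflatex curl wget; do
    check_tool "$tool"
done
```

Anything reported as missing needs to be installed before the matching step in this guide will work. You only need one of curl and wget.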

Downloading your HTML file

First you are going to want to open up a terminal or command prompt and navigate to the folder you will be working in. I suggest using an empty folder, then moving your finished PDF to a new location when you are done. All the other files that get created along the way can be deleted afterwards.

Next you need to get the URL for the page you want to convert. For this example I will use the URL to my RSS guide. Then use either curl or wget on that URL to download the page to the folder your terminal is in (the working directory).



curl >> rss_guide.html

After running one of those two commands use the ls command (or dir if you are using Windows) to make sure your file is there.
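As a rough sketch of the download step, the function below saves a page under the last part of its URL, using wget when it is installed and curl otherwise. The name fetch_page and the example.com address are placeholders made up for this example, not the real address of the guide. It uses curl's -o option, which writes the output file fresh instead of appending to an existing one the way a >> redirect does, and -L so that curl follows redirects.

```shell
#!/bin/sh
# Download a page and save it under the last segment of its URL.
# fetch_page is a hypothetical helper name used only in this sketch.
fetch_page() {
    url=$1
    outfile=${url##*/}              # e.g. rss_guide.html
    if command -v wget >/dev/null 2>&1; then
        wget -O "$outfile" "$url"
    else
        curl -L -o "$outfile" "$url"
    fi
}

# Example call (placeholder URL, substitute the page you actually want):
# fetch_page "https://example.com/guides/rss_guide.html"
```

This assumes the URL ends in a file name. If it ends in a slash, pick the output name yourself.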

Running pandoc

Once you have your file all you need to do is convert it with this command (with the correct file names of course):

pandoc rss_guide.html -o rss_guide.pdf

This syntax is quite simple: "pandoc" of course is the program you are running, the first argument is the input file, and the "-o" indicates that the next argument will be the output file. If you only care about the web page's text, you are all set.
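If you convert pages often, the command can be wrapped up so that the output name is worked out for you. The sketch below is one way to do that. The function name html_to_pdf is made up for this example, and the optional second argument is passed to pandoc's --pdf-engine option, which is how you would switch to something like pdfroff.

```shell
#!/bin/sh
# Convert page.html to page.pdf, optionally with a non-default PDF engine.
# html_to_pdf is a hypothetical helper name used only in this sketch.
html_to_pdf() {
    input=$1
    engine=${2:-pdflatex}           # pdflatex is pandoc's default engine
    output=${input%.html}.pdf       # swap the .html extension for .pdf
    pandoc "$input" -o "$output" --pdf-engine="$engine"
}

# Example calls:
# html_to_pdf rss_guide.html
# html_to_pdf rss_guide.html pdfroff
```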

What about pictures?

If you did the above procedure with a web page that had images in it (like my post about web bloat) then after running pandoc you will get warnings like these:

[jacob@t420 ~/projects/wget_test]$ pandoc 200818.html -o test1.pdf
[WARNING] Could not fetch resource '../pictures/website_pyramid.png': replacing image with description
[WARNING] Could not fetch resource '../pictures/real_website_pyramid.png': replacing image with description
[WARNING] Could not fetch resource '../pictures/william_howard_taft.jpeg': replacing image with description
[WARNING] Could not fetch resource '../pictures/add_vs_content_bandwidth_cost.png': replacing image with description
[WARNING] Could not fetch resource '../pictures/minecraft_spaceship.png': replacing image with description
[WARNING] Could not fetch resource '../pictures/call_of_duty.png': replacing image with description

What you need to do to get the pictures depends on whether you are using wget or curl.

Getting pictures with wget

It is easiest to use the recursive option in wget to retrieve the pictures, but if you are not careful you will end up downloading a lot more than you actually need, because the recursive option could download every public file on a website. To get everything needed to convert my web bloat post, complete with pictures, you would run this command:

wget -r --level=1

The "-r" specifies that we want to download things recursively, meaning that wget will download that page along with every other file that page links to. The "--level=1" tells wget that we only want to follow the links on that specific page. If the level were higher, it would also download things that are linked from the stuff your first page links to. The default level is 5, which is deep enough to download my entire site from anywhere, and you probably don't want to download that many files.

Running wget with the recursive option will create a new folder named after the domain of your site. Within that folder will be all the necessary files (plus some you don't need), structured the same way they are structured on that site. Navigate through that folder to find the HTML file you want to convert and run pandoc on it. After that you should have your PDF and will be free to move it somewhere else and to delete all the files wget got for you.
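Put together, the wget route might look like the sketch below. The function names and the example.com address are placeholders made up for this example, and it assumes a URL that ends in an .html file name with no query string. The conversion runs from inside the folder that holds the HTML file because pandoc looks up relative paths such as '../pictures/...' from the working directory.

```shell
#!/bin/sh
# Work out where "wget -r" saves a page: a folder named after the host,
# followed by the same path the page has on the site.
# local_path_for and mirror_and_convert are hypothetical helper names.
local_path_for() {
    printf '%s\n' "${1#*://}"       # drop the "https://" part
}

mirror_and_convert() {
    url=$1
    wget -r --level=1 "$url"
    saved=$(local_path_for "$url")
    page=$(basename "$saved")
    # Run pandoc next to the HTML file so relative image paths resolve.
    (cd "$(dirname "$saved")" && pandoc "$page" -o "${page%.html}.pdf")
}

# Example call (placeholder URL):
# mirror_and_convert "https://example.com/posts/200818.html"
```

The PDF ends up beside the HTML file inside the downloaded folder, so move it out before deleting the rest. wget also has a --page-requisites option aimed at fetching only the files a page needs to display; this guide does not use it, but it may be worth a look if level 1 still pulls in too much.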

Getting pictures with curl

Downloading Pictures

Using curl will be a bit more involved. Luckily the pandoc warnings already told us what we need to get, we just need to go about doing it. You'll notice that each of the resources pandoc lists as missing looks like this: '../pictures/website_pyramid.png'. These are all relative file paths from the file that you pulled from the internet. On a simple site like this one, URLs behave much like file paths, so all you need to do to get the pictures is take the URL of your page, replace the stuff after the last forward slash with the relative file path from pandoc, and curl that into a file with the same name that is at the end of the relative file path. Repeat this for all the files pandoc gave you and you've downloaded everything you need.
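The URL swapping described above can be sketched as a small script. The page URL is a placeholder made up for this example and the function names are hypothetical; the relative paths are the ones from the pandoc warnings. The resulting URL still has '../' in the middle of it, which is fine, since curl normally collapses that itself before sending the request.

```shell
#!/bin/sh
# Placeholder page URL; substitute the real address of the page you downloaded.
page_url="https://example.com/posts/200818.html"

# Swap everything after the last slash of the page URL for the relative
# path pandoc reported. image_url is a hypothetical helper name.
image_url() {
    printf '%s\n' "${1%/*}/$2"
}

# Download each picture into the current folder under its own file name.
fetch_images() {
    for rel in "$@"; do
        curl -o "${rel##*/}" "$(image_url "$page_url" "$rel")"
    done
}

# Example call, using two of the paths from the warnings:
# fetch_images '../pictures/website_pyramid.png' '../pictures/call_of_duty.png'
```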

From here you have two options: you could either edit the HTML, or move where your files are stored on your computer so that they match the relative file paths. If you already know a thing or two about relative file paths and are running Linux, option two would probably be easier, but if you are running Linux you should have just used wget instead of curl. I will only describe how to tackle the first option, especially since if you are running Windows you will need to edit the HTML to get the pictures to come up anyway.

Editing the HTML

Open your HTML file with some sort of text editor. Assuming that you put all your pictures in the same folder as your HTML file, all we need to do is change the relative file paths for the pictures to reflect that. So in my example all that needs to be done is to replace '../pictures/website_pyramid.png' with './website_pyramid.png' if you are on a Unix based system, or '.\website_pyramid.png' if you are on Windows. (Just 'website_pyramid.png' should also work, since a bare file name is itself a relative path to the same folder, but I haven't checked.)

There is a good chance that all of your relative file paths will look similar, so you will likely be able to use the find and replace feature that should be built into your text editor to make quick work of this task.
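If you would rather not open an editor at all, the same find and replace can be done with sed. This is a sketch that assumes every picture path starts with '../pictures/', as in my example; the function name and the output file name are made up for this example.

```shell
#!/bin/sh
# Print a copy of an HTML file with '../pictures/' rewritten to './'.
# rewrite_image_paths is a hypothetical helper name used only in this sketch.
rewrite_image_paths() {
    # '#' is used as the sed delimiter so the slashes need no escaping.
    sed 's#\.\./pictures/#./#g' "$1"
}

# Example call, writing the result to a new file and leaving the original alone:
# rewrite_image_paths 200818.html > 200818_local.html
```

Run pandoc on the new file afterwards, from the folder that holds both it and the pictures.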

What about links?

I typically don't bother with fixing the links in these, but if there are any internal links on the web page they will most likely be relative file paths. If you wanted to get those to work, you could edit them in the HTML file the same way we edited the picture paths in the curl section. I won't be going into any more detail on that since it is not something I've found necessary for myself.

Final Thoughts

Of course, please only do this on web pages where you are permitted to. I believe that any author of a free reference site should want you to be able to access it offline, but who knows how they'll actually react. Also remember that your PDF will not update if the web page does, so this isn't appropriate for pages which are expected to change. This method also may not work for pages containing a lot of JavaScript, or pages that are more complex. And again, don't get yourself into any trouble with this.