
I'm trying to save a Reddit page for OFFLINE viewing as a single HTML file, EXACTLY as it's displayed in the browser and after having already manually expanded some comment threads. This issue is a subset of the general question of how to save the entire web DOM in its current state while preserving the CSS effects and layout, a question asked in a multitude of posts across the Stack Exchange network.


Almost all answers are of one of the following forms:

  • Right click and select Save as... and then save as either Web Page, Complete (*.htm;*.html) or Web page, Single File (*.mhtml).

  • Open Chrome DevTools and copy the entire HTML (Copy outerHTML) from the Elements tab.

  • You'll never be able to save a file that looks exactly like the live version, because many links are "relative" links and further links to external scripts can be buried inside CSS and JS files.

  • Use a tool such as HTTrack. (As far as I know, however, HTTrack doesn't support saving everything in a single HTML file.)

  • Saving a webpage as a single HTML file exactly as it appears to the user during a live render is simply impossible for many websites.

  • Use a browser extension, such as "Single File" (the developer's GitHub page is here), "Save Page WE", or "WebScrapBook".

  • Try the "WebRecorder" Chrome extension.

Several of these answers do achieve some level of saving the webpage's layout as a single HTML file exactly as it appears when rendered live, but there is a HUGE downside: they do not save the HTML file in a way that makes it possible for the user to view the page OFFLINE. The offline viewing part is essentially what I'm after, and it is the crux of my issue.

For example, opening Chrome DevTools and saving the entire outerHTML from the Elements tab does allow the user to save the page exactly how it looks when rendered live, but as soon as the user opens the HTML file in offline mode, none of the external scripts are able to load, and the entire comment section of the Reddit page simply doesn't display. I did some manual inspection of the HTML file itself and found that the comments are actually present in the file; they just don't render when the file is loaded, because they depend on external scripts to dictate how they are displayed to the user.

A solution (almost...)

In my experience, the SingleFile Chrome extension does exactly the task I'm after (almost), and it does it best. It's able to save the page precisely as it looks to the user during a live render (even when viewed offline), and I've found it better than both the "Save Page WE" and the "WebScrapBook" extensions. SingleFile handles many sites flawlessly, but it fails miserably when attempting to save a Reddit page that has a huge comment thread. In such cases, the extension consumes too much memory and simply crashes the tab (an Out of Memory error occurs). The sad part is that the extension works well on Reddit posts with a very small comment section, but rather frustratingly, most of the posts I want to save have a very large comment section, which the SingleFile extension can't handle.

The SingleFile developer has a command-line variant of the tool on his GitHub page, but that simply launches a headless browser and downloads the requested URL. This approach is useless in my case, since I want to save the Reddit page with the modifications I've personally and manually made (i.e., with the desired comment threads manually expanded). Moreover, I've had the same Out of Memory issue with this approach.

Dirty workaround

I've found that a super dirty workaround is to simply save the page in PDF format, but I don't want a PDF. I want HTML.

Any ideas on how to save a Reddit page for offline viewing, even when the comment section is rather large?

2 Answers


TL;DR: Use WebScrapBook ≥ 2.12.0 and do NOT enable the options Style images: Save used, Fonts: Save used, or Scripts: Save/Link. (Disclaimer: I am the developer of WebScrapBook.)

The root cause of the excessive memory/volume consumption when SingleFile and many similar tools capture such a page is that Reddit pages make heavy use of shadow DOMs with shared constructed stylesheets. Both are modern, script-driven techniques, and the related content cannot be expressed directly in HTML.
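To make "shared constructed stylesheets" concrete, here is a minimal sketch of the pattern (my own simplified illustration, not Reddit's actual code): one stylesheet object is built in script and adopted by many shadow roots.

    // One constructed stylesheet, shared by many shadow roots via adoptedStyleSheets.
    const shared = new CSSStyleSheet();
    shared.replaceSync('.comment { margin: 4px 0; } /* imagine ~200 KB of rules here */');

    for (let i = 0; i < 3; i++) {
      const host = document.createElement('div');
      document.body.appendChild(host);
      const root = host.attachShadow({ mode: 'open' });
      root.innerHTML = '<p class="comment">comment ' + i + '</p>';
      // Every shadow root references the SAME sheet object; no markup is involved.
      root.adoptedStyleSheets = [shared];
    }

Plain HTML has no syntax for "adopt that shared sheet", so a capture tool has to materialize the sheet in some other way.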

Take a recent capture I've done with WebScrapBook 2.12.0 in Chrome 126 / Firefox 129 for the page provided by the OP, after scrolling down and clicking "View more comments" 20 times before invoking the capture:

The saved page is 79.1 MB and contains 1987 comments, each of which has around 29 shadow DOMs, each of which references several shared constructed stylesheets. In particular, around 6 of the shadow DOMs in each comment reference a large shared constructed stylesheet of around 200 KB.

The way SingleFile stores a constructed stylesheet is to generate a corresponding STYLE element in the bound document or shadow root. As a result, a constructed stylesheet referenced by multiple shadow DOMs is duplicated over and over.

In this case, the estimated volume of the duplicated large constructed stylesheet alone is 1987 × 6 × 200 KB ≈ 2.3 GB, and that doesn't yet include the HTML content and the other, smaller stylesheets. That's why an "out of memory" error can easily be triggered.

WebScrapBook 2.12.0 has reworked the strategy for handling constructed stylesheets so that a sheet is no longer duplicated across the shadow DOMs that reference it.

Nevertheless, certain computations during the capture can still be expensive. For WebScrapBook these are Style images: Save used and Fonts: Save used (SingleFile may have similar options), which have to check a large number of CSS rules against each referencing shadow DOM to determine whether an image or font is really used. The intermediate relational mappings generated during each run accumulate in memory until they are finally integrated into the page file, which can exhaust the CPU and memory during a capture, so these options should be avoided.
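As a hypothetical sketch of why such "used" checks are costly (this is just the shape of the problem, not WebScrapBook's actual code): every candidate rule has to be matched against every shadow root before a resource can be declared used, so the work grows with rules × shadow roots.

    // Hypothetical sketch: collect background images that are actually "used",
    // by testing every style rule's selector against every shadow root.
    function collectUsedBackgroundImages(
      roots: Array<Document | ShadowRoot>,
      sheets: CSSStyleSheet[],
    ): Set<string> {
      const used = new Set<string>();
      for (const sheet of sheets) {
        for (const rule of Array.from(sheet.cssRules)) {
          if (!(rule instanceof CSSStyleRule)) continue;
          const image = rule.style.backgroundImage; // e.g. 'url("sprite.png")'
          if (!image || image === 'none') continue;
          for (const root of roots) {
            // O(rules × shadow roots) selector matching: the expensive part.
            if (root.querySelector(rule.selectorText)) {
              used.add(image);
              break;
            }
          }
        }
      }
      return used;
    }

With roughly 1987 comments × 29 shadow DOMs each, even cheap per-rule checks add up quickly.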

Danny Lin

Reddit uses your typical "lazy loader", so you have to load the content in order to save it.
Scroll and load until there is nothing more to load, and don't scroll back up.
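If you'd rather not scroll by hand, something along these lines pasted into the DevTools console can automate the scrolling (a rough sketch; it only scrolls, so any "View more comments" buttons still need clicking):

    // Keep scrolling until the page height stops growing, i.e. the lazy loader
    // has nothing left to fetch. Waits 2 seconds between scrolls.
    (async () => {
      let lastHeight = 0;
      while (document.body.scrollHeight > lastHeight) {
        lastHeight = document.body.scrollHeight;
        window.scrollTo(0, lastHeight);
        await new Promise((resolve) => setTimeout(resolve, 2000));
      }
      console.log('Done: nothing more to lazy-load.');
    })();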

Then, you can:

  • Ctrl+A > right-click the selection (the blue highlight)
  • "View selection source". That'll take a while; go for a coffee.
  • Ctrl+A > Copy > paste into a text editor (e.g. Notepad)

Save as my-saved-post.html.

Open it with your browser.
How broken is the layout without all the external components?
Usually not too bad. You'll now have every post.

Clean up the HTML as much as you'd like. Now you have it in .html format.


If you instead use Save as... > Web Page, Complete, you'll have everything but the lazy-loaded content.


Looking at that image, I noticed it's a 2.2 MB .html file?! You might already have the lazy loader's content; you just don't have any server-side functionality.

You should try running the page with Five-Server. Once you have it installed, rename data.html to index.html, then open a terminal in that directory and type: five-server.
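If you'd rather not install Five-Server, any static file server does the same basic job here: it serves the saved page over http:// instead of file://. Here is a minimal Node sketch of that idea (the port and the index.html filename are assumptions on my part):

    // Minimal static server: serves files from the current directory over HTTP.
    // This is only a stand-in for Five-Server; it has no live reload.
    import * as http from "http";
    import * as fs from "fs";
    import * as path from "path";

    const root = process.cwd();

    http.createServer((req, res) => {
      const url = decodeURIComponent(req.url ?? "/");
      const file = path.join(root, url === "/" ? "index.html" : url);
      fs.readFile(file, (err, data) => {
        if (err) {
          res.writeHead(404);
          res.end("Not found");
          return;
        }
        res.writeHead(200); // Content-Type detection omitted for brevity
        res.end(data);
      });
    }).listen(8080, () => console.log("Serving " + root + " at http://localhost:8080"));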


I may have an idea for your dirty PDF workaround: pdftohtml. I think Ubuntu's repository has it; the link below is for the Fedora & FreeBSD versions, and you can also get the source from Poppler if preferred.

pdftohtml version 24.02.0
Copyright 2005-2024 The Poppler Developers - http://poppler.freedesktop.org

pdftohtml 'input.pdf' 'output.html' -s -nomerge -dataurls -noframes

It does a reasonable job. I tested it on a textual PDF file. (Screenshot: PDF vs. HTML output.)