Make Offline Mirror of a Site using `wget`

Sometimes you want to create an offline copy of a site that you can take and view even without internet access. Using wget you can make such a copy easily:

wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent http://example.org

Explanation of the various flags:

  • --mirror – Makes (among other things) the download recursive.
  • --convert-links – Converts all the links (including links to assets like CSS stylesheets) to relative, so the copy is suitable for offline viewing.
  • --adjust-extension – Adds suitable extensions to filenames (.html or .css) depending on their Content-Type.
  • --page-requisites – Downloads things like CSS stylesheets and images required to properly display the page offline.
  • --no-parent – When recursing, do not ascend to the parent directory. This is useful for restricting the download to only a portion of the site.
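
If you mirror sites regularly, the command is easy to wrap in a tiny script. Here is a minimal sketch, assuming GNU wget is on the PATH (the script name mirror.sh and taking the URL as the first argument are just this example’s choices):

#!/bin/sh
# mirror.sh - make an offline mirror of the URL passed as the first argument
# --mirror            recursive download with timestamping
# --convert-links     rewrite links so they work offline
# --adjust-extension  add .html/.css extensions based on Content-Type
# --page-requisites   fetch CSS, images, etc. needed to render each page
# --no-parent         never ascend above the starting directory
wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent "$1"

Invoked as, say, sh mirror.sh http://example.org, it produces the same result as the one-liner above.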

Alternatively, the command above may be shortened:

wget -mkEpnp http://example.org

Note that the last p is part of np (--no-parent), which is why p appears twice in the flags.

30 thoughts on “Make Offline Mirror of a Site using `wget`”

  1. wget usually doesn’t work very well for complete offline mirrors of websites. Due to its parser there is always something missing, e.g. stylesheets, scripts, or images. It simply isn’t the right tool for this task.
    HTTrack is much slower than wget but has a much more powerful parser. It’s GPL and available in most Linux distributions.
    Documentation and source code are available at http://www.httrack.com
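    For anyone who wants to try it, a basic HTTrack invocation looks roughly like this (the output directory ./mirror is only an example):
    httrack "http://example.org/" -O "./mirror"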

  2. I second David Wolski’s comment. HTTrack is an outstanding website mirroring tool. I like it because it performs incremental updates. Nothing like sucking down the Washington Post without adverts.

  3. I saw the comment related to HTTrack only after reading this very useful article (and successfully copying 99% of a website written in ColdFusion, the remaining 1% being embedded JavaScript which had to be done manually; also, moving everything to HTTPS took me a minute or so!).

    Unfortunately, HTTrack made sense in 2014 (when this article was written), but it stopped being developed in 2017 (last commit on GitHub) and has 112 pending issues (a bad sign — it’s probably abandoned by now). One major issue with HTTrack is the apparent lack of support for HTML5 (or at least incomplete support for the new tags).

    wget continues to be thoroughly developed, and, although I haven’t tried it personally (I’m mostly copying ‘legacy’ websites…), it seems to be able to deal with HTML5 tags as long as one ‘forces’ wget to identify itself as a recent version of, say, Chrome or Firefox; if it identifies itself by default, the webserver it connects to may simply think that a very old browser is trying to access the site and ‘simplify’ the HTML being passed back (i.e. ‘downgrade’ it to HTML4 or so). This, of course, is not an issue with wget per se, but rather with the way webservers (and web designers!) are getting more and more clever in dealing with a vast variety of users, browsers, and platforms.

    Also, contemporary versions of wget (which means mid-2019 by the time I’m writing this comment!) will have no trouble ‘digging deep’ to extract JS and CSS files etc. Obviously it cannot work miracles and doesn’t deal with everything; I had some issues with imagemaps, for instance (something nobody uses these days), as well as with HTML generated on the fly by JavaScript. And of course there is a limit to what it can actually do with very complex and dynamic websites which adjust their content to whatever browser the user has, page by page — especially in those cases where the different versions of the same page all have the same URL (a bad practice IMHO). Still, it remains useful for a lot of situations, and the results are better than what you get out of archive.org…

    I just wanted to point this out since this article is old but still relevant for today’s wget — and sadly HTTrack was abandoned and isn’t an option any longer…
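
    As a concrete sketch of the ‘identify as a modern browser’ trick mentioned above (the user-agent string below is only an example and will go stale):
    wget -mkEpnp --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0" http://example.org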

  4. I tried pavuk, which can handle JavaScript, but it got confused, so I went back to wget. Somebody may want to research pavuk with JS further.

    I tried wget with different parameters and saw a lot of errors. Your setup, -mkEpnp, is now downloading smoothly.
    alias wgetMirror="/usr/bin/wget -o wget.log -mkEpnp --wait=9 --user-agent='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' --no-check-certificate"

    I’m trying to save somebody’s wordpress site which he lost control of, so we cannot ssh into vps and tar gzip. I have wordpress on another vps for him.
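
    Once an alias like the one above is defined (in ~/.bashrc, for example), a mirror run is then simply (the URL is a placeholder):
    wgetMirror https://example.org/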

  5. After many unsuccessful parameter combination attempts,
    your command worked perfectly on my system!
    Scientific Linux release 6.10 (Carbon)
    GNU Wget 1.12

  6. Sometimes an additional --compression=auto is required to handle gzip; otherwise, you’d get a single index.html.gz.
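    Combined with the short form from the article, that would be something like (assuming a wget build new enough to support --compression):
    wget -mkEpnp --compression=auto http://example.org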

  7. I just created a static site with some 900 pages from a Ruby on Rails application using wget with the proposed parameters. Worked perfectly! The two things I really like are that it managed to transform internal links to relative (adding “../” or “../../” as needed) and that extensions were added to the file names.
    This only took a few minutes (on OS X Big Sur) with my Rails application running in development mode on the same machine.
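    For that kind of local setup the invocation is just the short form pointed at the development server (port 3000 being the Rails default; adjust if yours differs):
    wget -mkEpnp http://localhost:3000/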

  8. Hi guys,
    Thanks for the advice. It does not work for me, though. There is this company which shares some files on an https:// page where I have to log in with a login and password.
    If I enter something like this:
    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --compression=auto --http-user=user --http-password=password https://thispage.com/folder1
    (or the same without --compression=auto)
    it makes a folder thispage.
    Inside I find robots.txt, which contains:
    User-agent: *
    Disallow: /
    Next to it is the directory folder1, and in there is index.html.
    wget can nicely download each file if I enter
    wget --http-user=user --http-password=password https://thispage.com/folder1/file1
    but I do not know how to make it download all the files at once. If I load the page in a browser it gives me a listing
    File name | File size | Date
    and I could copy the names from there.
    How do I make wget (or any other tool) download all the files from the folder if I list the file names in some txt file?
    I guess I could write a bash script for it, but can some tool do it for me as well?

  9. OK, the solution was:
    wget --http-user=user --http-password=password --input-file=https://thatsite.com/folder1/
    Johnoo
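
    If the text file holds bare file names rather than full URLs, a small shell loop (a sketch; filenames.txt and the URL are placeholders) also does the job:
    while read -r name; do
        wget --http-user=user --http-password=password "https://thatsite.com/folder1/$name"
    done < filenames.txt
    If the file already contains full URLs, --input-file=filenames.txt on its own is enough.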

  10. I suggest adding “--restrict-file-names=windows” as well; it is not strictly necessary under Linux or OS X, but you can run into problems if you later want to copy the downloaded files to a Windows file system partition.
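    For example, tacked onto the short form from the article:
    wget -mkEpnp --restrict-file-names=windows http://example.org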
