Sometimes you want to create an offline copy of a site that you can take and view even without internet access. Using wget
you can make such a copy easily:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Explanation of the various flags:
--mirror – Makes (among other things) the download recursive.
--convert-links – Converts all links (also to things like CSS stylesheets) so they are relative and suitable for offline viewing.
--adjust-extension – Adds suitable extensions to filenames (html or css) depending on their content type.
--page-requisites – Downloads things like CSS stylesheets and images required to properly display the page offline.
--no-parent – When recursing, do not ascend to the parent directory. It is useful for restricting the download to only a portion of the site.
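By default wget places the copy in a directory named after the host, so when the download finishes you can open the mirrored pages straight from disk, e.g. (assuming the front page ended up as index.html):
firefox ./example.org/index.html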
Alternatively, the command above may be shortened:
wget -mkEpnp http://example.org
Note that the last p is part of np (--no-parent), hence you see p twice in the flags.
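For reference, the short flags map to the long ones as follows: -m = --mirror, -k = --convert-links, -E = --adjust-extension, -p = --page-requisites, -np = --no-parent.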
wget usually doesn’t work very well for complete offline mirrors of websites. Due to its parser there is always something missing, e.g. stylesheets, scripts, images. It simply isn’t the right tool for this task.
HTTrack is much slower than wget but has a more powerful parser. It’s GPL and available in most Linux distributions.
Documentation and source code are available at http://www.httrack.com
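A minimal HTTrack invocation looks something like this (the output directory name is only an illustration):
httrack "http://example.org/" -O "./example.org-mirror"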
I second David Wolski’s comment. HTTrack is an outstanding website mirroring tool. I like it because it performs incremental updates. Nothing like sucking down the Washington Post without adverts.
Thank you for helping us 🙂
It’s wget -mkEpnp on my server
thanks
Wget is a great tool, very helpful for making website backups for my private archive.
Thanks for the article!
You can also use --wait and --random-wait to reduce the load on the server.
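For example (the 2-second wait is just an illustration):
wget -mkEpnp --wait=2 --random-wait http://example.org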
Thank you so much Guy!
Thanks!
Thank you very much!
Otherwise, there is a RipTool script at https://www.opendesktop.org/p/1218850/
How to use this for https websites?
@Ashutosh
For https websites, just add the parameter --no-check-certificate
Example:
wget -mkEpnp --no-check-certificate https://example.com
I saw the comment related to HTTrack only after reading this very useful article (and successfully copying 99% of a website written in ColdFusion, the remaining 1% being embedded JavaScript which had to be done manually; also, moving everything to HTTPS took me a minute or so!).
Unfortunately, HTTrack made sense in 2014 (when this article was written), but it stopped being developed in 2017 (last commit on github) and has 112 pending issues (a bad sign — it’s probably abandoned by now). One major issue with HTTrack is the apparent lack of support of HTML5 (or at least incomplete support for the new tags).
wget continues to be actively developed, and, although I haven’t tried it personally (I’m mostly copying ‘legacy’ websites…), it seems to be able to deal with HTML5 tags so long as one ‘forces’ wget to identify itself as a recent version of, say, Chrome or Firefox; if it identifies itself by default, the webserver it connects to may simply think that it’s a very old browser trying to access the site and ‘simplify’ the HTML being passed back (i.e. ‘downgrading’ it to HTML4 or so). This, of course, is not an issue with wget per se but rather the way webservers (and web designers!) are getting more and more clever in dealing with a vast variety of users, browsers, and platforms.
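A sketch of how that might look (the user-agent string is only one plausible example, not a requirement):
wget -mkEpnp --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0" https://example.org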
Also, contemporary versions of wget (which means mid-2019 by the time I’m writing this comment!) will have no trouble ‘digging deep’ to extract JS and CSS files etc. Obviously it cannot work miracles and doesn’t deal with everything; I had some issues with imagemaps, for instance (something nobody uses these days), as well as HTML generated on the fly by Javascript. And of course there is a limit to what it can actually do with very complex and dynamic websites which adjust their content to whatever browser the user has, page by page, especially in those cases where the different versions of the same page all have the same URL (a bad practice IMHO). Still, it remains useful for a lot of situations, and the results are better than what you get out of archive.org…
I just wanted to point this out since this article is old but still relevant for today’s wget — and sadly HTTrack was abandoned and isn’t an option any longer…
I tried pavuk, which can handle javascript but it got confused so I went back to wget. Somebody may want to further research pavuk with js.
I tried wget with different parameters and saw a lot of errors. Your setup, -mkEpnp, is now downloading smoothly.
alias wgetMirror="/usr/bin/wget -o wget.log -mkEpnp --wait=9 --user-agent='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' --no-check-certificate"
I’m trying to save somebody’s wordpress site which he lost control of, so we cannot ssh into vps and tar gzip. I have wordpress on another vps for him.
After many unsuccessful parameter combination attempts,
your command worked perfectly on my system!
Scientific Linux release 6.10 (Carbon)
GNU Wget 1.12
Sometimes an additional --compression=auto is required to handle gzip. Otherwise, you’d get a single index.html.gz
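In full, something along these lines (a sketch; --compression=auto needs a reasonably recent wget build):
wget -mkEpnp --compression=auto https://example.org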
I just created a static site with some 900 pages from a Ruby on Rails application using wget with the proposed parameters. Worked perfectly! The two things I really like are that it managed to transform internal links to relative (adding “../” or “../../” as needed) and that extensions were added to the file names.
This only took a few minutes (on OS X Big Sur) with my Rails application running in development mode on the same machine.
hi guys,
Thanks for the advice. It does not work for me. There is this company which shares some files on an https:// page where I have to log in with a login and password.
If I enter something like this:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --compression=auto --http-user=user --http-password=password https://thispage.com/folder1
(or without --compression=auto)
it creates the folder thispage
Inside I find a robots.txt in which is written
User-agent: *
Disallow: /
Next to it is the directory folder1, and in it there is index.html.
wget can nicely download each file if I enter
wget --http-user=user --http-password=password https://thispage.com/folder1/file1
but I do not know how to make it download all the files at once. If I load the page in a browser it gives me a listing
File name File Size Date
and I could copy the names.
How do I make wget (or any other tool) download the files from the folder if I list the file names in some txt file?
I guess I could make some bash script for it, but can some tool do it for me as well?
OK, the solution was
wget --http-user=user --http-password=password --input-file=https://thatsite.com/folder1/
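(A variant, if you save the file names into a hypothetical local files.txt, one URL per line: wget --http-user=user --http-password=password --input-file=files.txt. Also note that wget honors robots.txt by default; -e robots=off overrides that.)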
Johnoo
I suggest adding --restrict-file-names=windows as well; it is not strictly necessary under Linux or OS X, but you can run into problems if you later want to copy the downloaded files to a Windows file system partition.
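For example, combined with the flags from the article (a sketch):
wget -mkEpnp --restrict-file-names=windows http://example.org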