Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so something had to be done in case the hard disk couldn't be recovered. My efforts turned to Google's cache. Google keeps a copy of the text of each web page in its cache, which is usually useful when a website is temporarily unavailable. The basic idea is to retrieve a copy of all the pages of a given site that Google has cached.
While this is easily done manually when only a few pages are cached, the task needs to be automated when several hundred pages have to be retrieved. This is exactly what the following Python script does.
#!/usr/bin/python
import urllib
import urllib2
import re
import socket
import os

socket.setdefaulttimeout(30)

# adjust the site here
search_term = "site:guyrutenberg.com"

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q=" + search_term
    regex_cache = re.compile(r'<a href="(http://\d*\.\d*\.\d*\.\d*/search\?q=cache.*?)".*?>Cached</a>')
    regex_next = re.compile('<a href="([^"]*?)"><span id=nn></span>Next</a>')

    # this is the directory we will save files to
    try:
        os.mkdir('files')
    except OSError:
        pass

    counter = 0
    pagenum = 0
    more = True

    while more:
        pagenum += 1
        print "PAGE "+str(pagenum)+": "+url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        # save a local copy of every cached page linked from this results page
        matches = regex_cache.findall(page)
        for match in matches:
            counter += 1
            tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter, ": "+match
            f = open('files/'+str(counter)+'.html', 'w')
            f.write(tmp_page)
            f.close()
        # now check if there are more result pages
        match = regex_next.search(page)
        if match is None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4
Before using the script you need to adjust the search_term variable near the beginning of the script. This variable holds the search term for which all the available cached pages will be downloaded. E.g., to retrieve the cache of all the pages of http://www.example.org you should set search_term to site:www.example.org.
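That is, the line near the top of the script would read:

search_term = "site:www.example.org"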
This just saved my bacon. Thanks!
This saved my butt like you can't imagine. If I could send you $$, I would.
I’m glad that it helped.
I got lucky in my case, as the hard-disk data recovery was successful, but this script helped me sleep better while the data was being recovered.
This is what I have been looking for for days. Recently I lost all the data of my site and want to get it back from Google's cached pages. I used a Perl program to retrieve pages, but it easily gets banned by Google, so I have to use different proxy servers. Each IP can only retrieve about 20-40 pages. Does your code have the same problem? Thanks.
I've used the script to retrieve several thousand pages and I didn't get banned. The script uses a normal user-agent string to disguise itself, which I guess helps a bit.
Good luck with retrieving the pages.
Hello, I'm trying to retrieve 5 pages from a website that is now parked at godaddy.com. A few months ago I could see the cached pages; now I can't. I've also checked archive.org's Wayback Machine, but it only goes up to 2005, and I'm looking for cached pages from April 2007 to Sept 2008. I'm not a programmer, so I don't understand your code above. Can you point me in a direction to find these cached web pages? The website I'm looking for is http://www.digital-dynamix.com
thanks!!!
Hi Allen,
I'm afraid I can't help you. Google's cache and the Wayback Machine are the caches I use to retrieve old sites. I suggest you search Google for web archives and see if you come up with some relevant results.
Hi,
This script works brilliantly except that it only downloads the first 10 (or on a couple of searches I tested, 9) results for a search. Do you know why it might be doing that and how I could get it to download all of them?
Thanks
This sounds weird; when I tested it, it let you download as many results as Google can display (which is limited to the first 1000). I actually downloaded many more than 10 results a couple of times. For which query did you try to download the cache?
It was a livejournal, so, for example, I have the line with the search term reading:
search_term="site:news.livejournal.com"
and I get either 9 or 10 results downloaded. This happens for any site, not just livejournals, so for example when I tried
search_term="site:ebay.com"
I got 10 files downloaded.
This is Python 2.3.5 on OS X 10.4.11.
Hi J,
I've rechecked what you're saying and you're right. It seems that Google changed their page markup, so the script doesn't recognize that there are more result pages.
If you want the script to work for you, you'll have to rewrite the regex_next regular expression in line 4 of main() to capture the existence of the Next button (it shouldn't be hard).
Has anyone fixed this? I’m a regex dummy and desperate to get this to work.
No one fixed the regex yet (at least that I know). Unfortunately, I’m too busy to fix it myself these days.
The following line is complete overkill, but seems to do the trick for now:
regex_next = re.compile(‘Next‘)
(of course, Google now seems to have flagged me as a malicious service…)
regex_next = re.compile(‘<a href=”([^”]*?)”><span class=”csb ch” style=”background-position:-76px 0;margin-right:34px;width:66px”></span>Next</a>’)
Sorry, that last post got stripped
Thanks Ed!
Don't worry about the malicious-service flagging; it usually disappears in a few minutes (at least it did for me when I developed the script).
errr …
Where do you run this script?
You run it using Python from the command line.
duh! Thanks.
I edited your script with Ed’s fix and ran it at root. It still only returned 10 results. (BTW – Ed’s fix has a syntax error).
Too bad it doesn’t work – I need to retrieve 129 pages – alas a lot of time wasted doing it one at a time!
Linda, if it doesn't work it means that either Ed's regex is wrong or you copied it incorrectly. Make sure that when you copy it you replace the curly single and double quotes with plain ones; this is what caused the syntax error.
Don't forget to replace &lt; with < and &gt; with > (I couldn't have both them and the quotes correct).
In that last post, also add a colon between "width" and "66px".
@Dan, Thanks for pointing it out.
I'm not a programmer but REALLY need to download an entire cache (1000 web pages) from Google. I've downloaded Python 2.6 and run the command line, but I have 3 problems:
– Python won't allow me to paste code??
– When I type the code in I get an IndentationError on os.mkdir('files')
– How do I "run" the code once it's completed?
Can someone please explain where I have gone wrong, because I am running out of time.
I'm a complete idiot when it comes to programming and scared I will lose the Google cache if I can't get this fixed? Please help??
@Lee, you should save the code to a file and then run it with python. Please see the comments above for possible changes to the code. Running python code is a bit different from one operating system to the other, but it is usually done by opening a command prompt (or a terminal) and typing:
python name_of_the_file_you_would_like_to_run.py
I have tried this script, with all the modifications mentioned, but it still saves only 10 pages 🙁
Is there anyone who has been successful in retrieving more than 10 pages? If so, please email the script to mgalus@szm.sm. Thanks a lot! I really need this 🙁
@disney: change
regex_next = re.compile(‘Next‘)
to
regex_next = re.compile(‘]*)>Next‘)
I got 26 pages this way, but there should be more. It quit with this output:
Traceback (most recent call last):
  File "/home/james/scripts/gcache-site-download.py", line 46, in <module>
    main()
  File "/home/james/scripts/gcache-site-download.py", line 33, in main
    tmp_page = urllib2.urlopen(tmp_req).read()
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError:
Oh man … it ate my text. I'm working on tweaking a few other things, so I'll post again when it's ready.
Well, since I can't post the code without WordPress rewriting it, I guess just get in touch with me via my site if you want the new script.
http://james.revillini.com/wp-content/uploads/2010/03/gcache-site-download.py_.tar.gz
^^ my updated version.
changes:
* fixed the 'cannot download more than 10 pages' bug – the regular expression that finds the 'Next' link was a little off, so I made it more liberal and it seems to work for now
* added a random delay between requests, which makes it run slower, but I thought it might help trick Google into thinking it is a regular user, not a bot (if they flag you as a bot, you are prevented from viewing cached pages for a day or so)
* creates the 'files' subfolder in /tmp/ … Windows users should probably change this to c:/files or something
* catches exceptions if there is an HTTP timeout
* timeout limit set to 60 seconds, as 30 was timing out fairly often (a rough sketch of these changes follows below)
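Roughly, the delay and timeout handling look something like this (a simplified sketch, not the exact code from the archive; fetch_page is just an illustrative helper name):

import random
import socket
import time
import urllib2

socket.setdefaulttimeout(60)  # 30 seconds was timing out fairly often

def fetch_page(req):
    # sleep a random few seconds between requests so the traffic looks less bot-like
    time.sleep(random.uniform(2, 10))
    try:
        return urllib2.urlopen(req).read()
    except (urllib2.URLError, socket.timeout):
        # HTTP timeout or other network error: skip this page instead of crashing
        return None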
Hi,
I'm running Python for Windows x64:
http://www.python.org/download/releases/3.1.2/
And when running your latest script I got the following error:
http://i44.tinypic.com/okzwqd.jpg
Please tell me what is wrong. I don't know much about Python apart from running the script!
Thanks in advance
Python 3 isn't backward-compatible with Python 2, so there are incompatible syntax changes. One of those changes concerns the print command.
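For example, the print statement the script uses becomes a function call in Python 3:

print "PAGE " + str(pagenum) + ": " + url     # Python 2 (as in the script)
print("PAGE " + str(pagenum) + ": " + url)    # Python 3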
As a result, you will have to use Python 2, or adjust the script yourself to Python 3.
Thank you very much, "Guy". I downloaded the Windows x86 MSI Installer (2.7a4), as it seems to be the latest 2.x version, and it works.
Cheers 🙂
The script just echoes
PAGE 1: http://www.google.com/search?q=site:thesite.com
and no file is saved
Same issue as Jeremiah. Nothing is saved, but I get the same type of echo message.
Google's cache is now served from a proper URL and not an IP address. You need to change regex_cache from
\d*\.\d*\.\d*\.\d*
to
\w*\.\w*\.\w*
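Applied to the original regex_cache line in the script, that change would read something like:

regex_cache = re.compile(r'<a href="(http://\w*\.\w*\.\w*/search\?q=cache.*?)".*?>Cached</a>')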
This really works as of 4 August 2010. I used James Revillini's code, with mar wilson's regex_cache change, on Windows with Python 2.7.
However, I have the following issue: I want to retrieve a sub-site selected by a key, such as
http://www.mysite.com/forum/codechange?showforum=23
What changes are required in the code to make this work?
Thanks everybody, especially Guy.
Update: Nope, only the first ten again!!
What happened?? I need to retrieve a whole website.
By adding num=100, i.e. changing line 15 to:
url = "http://www.google.com/search?num=100&q=" + search_term
you can at least fetch 100 results instead of 10.
That was enough for me, so I didn't try to fix the page-fetching any further.
Hope it helps.
Thanks to the author for this great script, it saved my life as I had a server crash and no backups! 😉
Hey, nice blog… really like it and added it to my bookmarks. Keep up the good work.
Here is the idea behind getting results with this script:
Apparently Google doesn't return more than 10 results and the code searching for the "Next" page is broken. I am regex-illiterate, but I managed to work around it using the following trick: we know that each page of results has at most 10 results, and you can ask Google for results starting at a specific offset. For example,
http://www.google.ca/search?&q=Microsoft&start=50
will get you the results starting at the 50th one!
Now all you have to do is cycle the start value, for example from 1 to 100:
http://www.google.ca/search?&q=Microsoft&start=1
http://www.google.ca/search?&q=Microsoft&start=2
http://www.google.ca/search?&q=Microsoft&start=3
.
.
.
http://www.google.ca/search?&q=Microsoft&start=100
Here is the code; it might look primitive and crippled, but it worked for me.
The code that the others pasted above has wrong indentation (spaces and tabs mixed); I used only tabs.
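In outline (a simplified sketch of the loop, not the exact code I ran; it reuses search_term, headers, and regex_cache from the original script and steps start by 10 per results page):

counter = 0
for start in range(0, 1000, 10):  # Google serves at most the first 1000 results
    url = "http://www.google.com/search?q=" + search_term + "&start=" + str(start)
    page = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    for match in regex_cache.findall(page):
        counter += 1
        cached = urllib2.urlopen(urllib2.Request(match.replace('&amp;', '&'), None, headers)).read()
        f = open('files/' + str(counter) + '.html', 'w')
        f.write(cached)
        f.close()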
Ignore my previous post; my script will download duplicate pages, as Google returns the results in a different order each time.
I just had a need for this script, but found the regexes out of date, so I tweaked them for Google's current output…
Hmmmm, okay, this blog's input doesn't escape HTML properly – here it is again…
Oh ffs…
I give up. I don't know what this blog's input filters are doing, but they aren't sane.
Fentex, I need this script. Can you please email a copy to me? Thanks in advance. Please send to: leslie_chw@hotmail.com
Hi Fentex.
If you’ve still got it, I could use a copy of that script. If you could send it to simon@seethroughweb.com
Thank you.
Simon
Hey, can someone tell me how to run this code – do you do it on a server, or through cPanel, or what? Sorry, I am not a coder and I just need to get about 300+ cached pages back from Google (a hosting nightmare caused the site to lose functionality when the database was lost or something).
Oh, never mind, I see you run it using Python at the command line – I'll have to learn that!