Retrieving Google’s Cache for a Whole Website

Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so something had to be done in case the hard disk couldn’t be recovered. My efforts turned to Google’s cache. Google keeps a copy of the text of each page in its cache, which is usually useful when a website is temporarily unavailable. The basic idea is to retrieve a copy of every page of a given site that Google has cached.

While this is easily done manually when only a few pages are cached, the task needs to be automated when several hundred pages have to be retrieved. This is exactly what the following Python script does.

#!/usr/bin/python
import urllib
import urllib2
import re
import socket
import os
socket.setdefaulttimeout(30)
#adjust the site here
search_term="site:guyrutenberg.com"
def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q="+search_term
    regex_cache = re.compile(r'<a href="(http://\d*\.\d*\.\d*\.\d*/search\?q\=cache.*?)".*?>Cached</a>')
    regex_next = re.compile('<a href="([^"]*?)"><span id=nn></span>Next</a>')

    #this is the directory we will save files to
    try:
        os.mkdir('files')
    except OSError:  # the directory may already exist
        pass
    counter = 0
    pagenum = 0
    more = True
    while(more):
        pagenum += 1
        print "PAGE "+str(pagenum)+": "+url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        matches = regex_cache.findall(page)
        for match in matches:
            counter+=1
            tmp_req = urllib2.Request(match.replace('&amp;','&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter,": "+match
            f = open('files/'+str(counter)+'.html','w')
            f.write(tmp_page)
            f.close()
        #now check if there are more pages
        match = regex_next.search(page)
        if match == None:
            more = False
        else:
            url = "http://www.google.com"+match.group(1).replace('&amp;','&')

if __name__=="__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4

Before using the script you need to adjust the search_term variable near the beginning of the script. This variable holds the search term for which all the available cached pages will be downloaded. For example, to retrieve the cache of all the pages of http://www.example.org you should set search_term to site:www.example.org.
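
For instance (www.example.org here is just a placeholder for the site you want to recover), the line near the top of the script would become:

# Download everything Google has cached for this site...
search_term = "site:www.example.org"
# ...or, since Google's site: operator also accepts a path, restrict it to a
# subsection of the site (this should work as well):
# search_term = "site:www.example.org/forum"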

71 thoughts on “Retrieving Google’s Cache for a Whole Website”

  1. I’m glad that it helped.
    I had luck in my case as the HD data-recovery was successful, but this script helped me sleep better while the data was recovered.

  2. This is what I have been looking for for days. Recently I lost all the data of my site and want to get it back from Google’s cached pages. I used a Perl program to retrieve pages, but it can easily get banned by Google, so I have to use different proxy servers. Each IP can only retrieve about 20-40 pages. Does your code have the same problem? Thanks.

  3. I’ve used the script to retrieve several thousand pages and I didn’t get banned. The script uses a normal user-agent to disguise itself, which I guess helps a bit.

    Good luck with retrieving the pages.

  4. Hello, I’m trying to retrieve 5 pages from a website that is now parked at godaddy.com – a few months ago I could see the cached pages – now I can’t. I’ve also checked archive.org – the Wayback Machine search – but it only goes to 2005, and I’m looking for cached pages from April 2007 to Sept 2008. I’m not a programmer so I don’t understand your code above. Can you point me in a direction to find these cached web pages? The website I’m looking for is http://www.digital-dynamix.com

    thanks!!!

  5. Hi Allen,

    I’m afraid I can’t help you. Google cache and the Way Back Machine are the caches I use to retrieve old sites. I suggest you search google for web archives and see if you come up with some relevant results.

  6. Hi,

    This script works brilliantly except that it only downloads the first 10 (or on a couple of searches I tested, 9) results for a search. Do you know why it might be doing that and how I could get it to download all of them?

    Thanks

  7. This sounds weird; when I tested it, it would let you download as many results as Google can display (which is limited to the first 1000). I actually downloaded many more than 10 results a couple of times. For which query did you try to download the cache?

  8. It was a livejournal, so, for example, I have the line with the search term reading:

    search_term=”site:news.livejournal.com”

    and I get either 9 or 10 results downloaded. This happens for any site, not just livejournals, so for example when I tried

    search_term=”site:ebay.com”

    I got 10 files downloaded.

    This is Python 2.3.5 on Mac OS X 10.4.11.

  9. Hi J,

    I’ve rechecked what you’re saying and you’re right. It seems that Google changed their page markup, so the script doesn’t recognize that there are more result pages.

    If you want the script to work for you, you’ll have to rewrite the regex_next regular expression in line 4 of main(), to capture the existence of the next button (it shouldn’t be hard).
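
    As a rough, untested sketch (assuming the Next link nowadays carries an id attribute such as pnnext instead of the old <span id=nn> markup; that id is only an assumption, so check it against a saved copy of a results page), the replacement could look something like this:

    import re

    # Hypothetical, untested regex_next: capture the href of an <a> tag whose
    # id looks like "pnnext" (an assumption about Google's current markup).
    regex_next = re.compile(r'<a href="([^"]*)"[^>]*id="?pnnext"?')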

  10. Has anyone fixed this? I’m a regex dummy and desperate to get this to work.

  11. No one fixed the regex yet (at least that I know). Unfortunately, I’m too busy to fix it myself these days.

  12. The following line is complete overkill, but seems to do the trick for now:

    regex_next = re.compile(‘Next‘)

    (of course, Google now seems to have flagged me as a malicious service…)

  13. regex_next = re.compile(‘<a href=”([^”]*?)”><span class=”csb ch” style=”background-position:-76px 0;margin-right:34px;width:66px”></span>Next</a>’)

    Sorry, that last post got stripped

  14. Thanks Ed!

    Don’t worry about the malicious service flagging, it usually disappears in a few minutes (at least it did for me when I developed the script).

  15. duh! Thanks.

    I edited your script with Ed’s fix and ran it at root. It still only returned 10 results. (BTW – Ed’s fix has a syntax error).

    Too bad it doesn’t work – I need to retrieve 129 pages – alas a lot of time wasted doing it one at a time!

  16. Linda, if it doesn’t work it means that either Ed’s regex is wrong or you copied it incorrectly. Make sure that when you copy it you replace the curly single and double quotes with plain ones; that is what caused the syntax error.

    regex_next = re.compile('<a href="([^"]*?)"><span class="csb ch" style="background-position:-76px 0;margin-right:34px;width:66px"></span>Next</a>')
    

    Don’t forget to replace &lt; with < and &gt; with > (I couldn’t have both them and the quotes correct).

  17. I’m not a programmer but REALLY need to download an entire cache (1000) web pages from Google. I’ve downloaded Python 2.6 and run the command line but I have 3 problems:

    – Python won’t allow me to paste code??
    – When I type the code I get an IndentationError on os.mkdir('files')
    – How do I “run” the code once completed.

    Can someone please explain where I have gone wrong because I am running out of time.
    I’m a complete idiot when it comes to programming and scared I will lose the Google cache if I can’t get this fixed. Please help!

  18. @Lee, you should save the code to a file and then run it with Python. Please see the comments above for possible changes to the code. Running Python code is a bit different from one operating system to another, but it is usually done by opening a command prompt (or a terminal) and typing:

    python name_of_the_file_you_would_like_to_run.py

  19. I have tried this script, with all the modifications mentioned, but it still saves only 10 pages 🙁
    Is there anyone who has been successful in retrieving more than 10 pages? If so, please email the script to me at mgalus@szm.sm. Thanks a lot! I really need this 🙁

  20. @disney: change

    regex_next = re.compile(‘Next‘)

    to

    regex_next = re.compile(‘]*)>Next‘)

    I got 26 pages this way, but there should be more. It quit with this output:

    Traceback (most recent call last):
      File "/home/james/scripts/gcache-site-download.py", line 46, in <module>
        main()
      File "/home/james/scripts/gcache-site-download.py", line 33, in main
        tmp_page = urllib2.urlopen(tmp_req).read()
      File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.6/urllib2.py", line 391, in open
        response = self._open(req, data)
      File "/usr/lib/python2.6/urllib2.py", line 409, in _open
        '_open', req)
      File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
        raise URLError(err)
    urllib2.URLError:

  21. http://james.revillini.com/wp-content/uploads/2010/03/gcache-site-download.py_.tar.gz

    ^^ my updated version.
    changes:
    * fixed bug ‘cannot download more than 10 pages’ – the regular expression to find the ‘next’ link was a little off, so I made it more liberal and it seems to work for now.
    * added random timeout, which makes it run slower, but I thought it might help to trick google into thinking it was a regular user, not a bot (if they flag you as a bot, you are prevented from viewing cached pages for a day or so)
    * creates ‘files’ subfolder in /tmp/ … windows users should probably change this to c:/files or something
    * catches exceptions if there is an http timeout
    * timeout limit set to 60 seconds, as 30 was timing out fairly often

  22. Python 3 isn’t backward-compatible with Python 2, so there were incompatible syntax changes. One of those changes concerns the print command. As a result, you will have to use Python 2, or adjust the script yourself for Python 3.
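
    For example, the print statements in the script use Python 2 syntax; under Python 3, print is a function (and urllib2 has been split into urllib.request), so lines like these would have to change:

    pagenum = 1
    url = "http://www.google.com/search?q=site:www.example.org"

    # Python 2, as in the script above:
    print "PAGE " + str(pagenum) + ": " + url

    # Python 3 equivalent (kept as a comment so this snippet stays valid Python 2):
    # print("PAGE " + str(pagenum) + ": " + url)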

  23. Thank you very much “Guy”. I downloaded Windows x86 MSI Installer (2.7a4) as it seems to be the latest 2.x version and it works.

    Cheers 🙂

  24. Same issue as Jeremiah. Nothing is saved, but I get the same type of echo message.

  25. The Google cache is now served up over a proper URL and not an IP address. You need to change the regex_cache from

    \d*\.\d*\.\d*\.\d*

    to

    \w*\.\w*\.\w*
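
    Applied to the regex_cache line of the original script, that change would read roughly like this (an untested sketch; the rest of the pattern is left as it was):

    import re

    # Same pattern as in the original script, with the IP-address part
    # (\d*\.\d*\.\d*\.\d*) swapped for a hostname pattern (\w*\.\w*\.\w*).
    regex_cache = re.compile(r'<a href="(http://\w*\.\w*\.\w*/search\?q\=cache.*?)".*?>Cached</a>')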

  26. This really works as of 4 August 2010. I used James Revillini’s code, with mar wilson’s regex_cache change, on Windows with Python 2.7.

    However, I have the following issue: I want to retrieve a subsite with a key such as
    http://www.mysite.com/forum/codechange?showforum=23

    What changes are required in the code to make this work?

    Thanks everybody, especially Guy

  27. Update: Nope, only the first ten again!!
    What happened? I need to retrieve a whole website.

  28. By adding num=100, i.e. changing line 15 to:
    url = "http://www.google.com/search?num=100&q="+search_term
    you can at least fetch 100 pages instead of 10.
    That was enough for me, so I didn’t try to fix the page-fetching any further.
    Hope it helps.
    Thanks to the author for this great script; it saved my life, as I had a server crash and no backups! 😉
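
    If you need more than that, combining num=100 with Google’s start offset parameter should, in principle, page through the full 1000-result limit mentioned earlier. A rough, untested sketch (www.example.org is a placeholder):

    import urllib2

    search_term = "site:www.example.org"      # same variable as in the script
    headers = {'User-Agent': 'Mozilla/5.0'}   # any browser-like user agent

    # Walk the first 1000 results, 100 at a time, via num=100 and start=N.
    for start in range(0, 1000, 100):
        url = ("http://www.google.com/search?num=100&start=" + str(start)
               + "&q=" + search_term)
        page = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
        # ...run regex_cache over `page` and download the matches,
        # exactly as in the main loop of the original script...
        print "fetched results starting at", start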

  29. Here is the idea behind getting results with this script:

    Apparently Google doesn’t return more than 10 results and the code searching for the “Next” page is broken. I am RegEx illiterate, but I managed to fix the code using the following trick: we know that each page of results has 10 results at most, and you can get the results starting with a specific page, for example

    http://www.google.ca/search?&q=Microsoft&start=50

    will get you page 50 of the results!

    Now all you have to do is cycle the code through the start values from 1 to 100:

    http://www.google.ca/search?&q=Microsoft&start=1
    http://www.google.ca/search?&q=Microsoft&start=2
    http://www.google.ca/search?&q=Microsoft&start=3
    .
    .
    .
    http://www.google.ca/search?&q=Microsoft&start=100

    Here is the code; it might look primitive and crippled, but it worked for me.
    The code that the others pasted above has broken indentation (mixed spaces and tabs); I used only tabs.

    #!/usr/bin/python
    import urllib
    import urllib2
    import re
    import socket
    import os
    import time
    import random
    socket.setdefaulttimeout(60)
    #adjust the site here
    search_term="site:www.wikipedia.com"
    def main():
        #headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
        headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.3 (KHTML, like Gecko) Chrome/5.0.360.0 Safari/533.3'}
        url = "http://www.google.com/search?num=100&q="+search_term
        # NOTE: the next two patterns were mangled by the blog's comment filter;
        # see the earlier comments for working versions of regex_cache and regex_next.
        regex_cache = re.compile(r'Cached')
        regex_next = re.compile(']*)>Next')

        #this is the directory we will save files to
        try:
            os.mkdir("D:/tmp/files")
        except OSError:
            pass
        i = 0
        while i < 100:
            i += 1
            url = "http://www.google.com/search?&start="+str(i)+"&q="+search_term
            counter = 0
            pagenum = 0
            while pagenum < 11:
                pagenum += 1
                print "PAGE "+str(pagenum)+": "+url
                req = urllib2.Request(url, None, headers)
                page = urllib2.urlopen(req).read()
                matches = regex_cache.findall(page)
                for match in matches:
                    # random pause so the requests look less like a bot
                    timeout = random.randint(3, 30)
                    print '... resting for ', timeout, ' seconds ...'
                    time.sleep(timeout)
                    counter += 1
                    tmp_req = urllib2.Request(match.replace('&amp;','&'), None, headers)
                    try:
                        tmp_page = urllib2.urlopen(tmp_req).read()
                    except IOError, e:
                        if hasattr(e, 'reason'):
                            print 'We failed to reach a server.'
                            print 'Reason: ', e.reason
                        elif hasattr(e, 'code'):
                            print 'The server couldn\'t fulfill the request.'
                            print 'Error code: ', e.code
                    else:
                        # everything is fine
                        print counter,": "+match
                        f = open('D:/tmp/files/'+str(counter+i*10)+'.html','w')
                        f.write(tmp_page)
                        f.close()
                #now check if there are more pages (next-page handling disabled)
                #match = regex_next.search(page)
                #if match == None:
                #    more = False
                #else:
                #    url = "http://www.google.com"+match.group(1).replace('&amp;','&')

    if __name__=="__main__":
        main()

    # vim: ai ts=4 sts=4 et sw=4
    
  30. Hmmmm, okay, this blog input doesn’t escape html properly – here it is again…

    regex_cache = re.compile(r'<a [^>]*href="([^"]+)[^>]+>Cached]*href="([^"]+)[^>]+id=pnnext')
    
  31. Oh ffs…

    regex_cache = re.compile(r'<a [^>]*href="([^"]+)[^>]+>Cached]*href="([^"]+)[^>]+id=pnnext')
    
  32. I give up. I don’t know what this blog’s input filters are doing, but they aren’t sane.

  33. Hey, can someone tell me how to run that code – do you do it on a server, or through cPanel, or what? Sorry, I am not a coder and I just need to get about 300+ cached pages back from Google (a hosting nightmare caused loss of site functionality from the database being lost or something).

  34. Oh, never mind, I see you run it using Python at the command line – I’ll have to learn that!
