Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so something had to be done in case the hard disk couldn’t be recovered. My efforts turned to Google’s cache. Google keeps a copy of the text of each web page in its cache, which is usually useful when a website is temporarily unavailable. The basic idea is to retrieve a copy of every page of a given site that Google has cached.
While this is easily done manually when only a few pages are cached, the task needs to be automated when several hundred pages have to be retrieved. This is exactly what the following Python script does.
#!/usr/bin/python
import urllib
import urllib2
import re
import socket
import os

socket.setdefaulttimeout(30)

# adjust the site here
search_term = "site:guyrutenberg.com"

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q=" + search_term
    regex_cache = re.compile(r'<a href="(http://\d*\.\d*\.\d*\.\d*/search\?q\=cache.*?)".*?>Cached</a>')
    regex_next = re.compile('<a href="([^"]*?)"><span id=nn></span>Next</a>')

    # this is the directory we will save files to
    try:
        os.mkdir('files')
    except OSError:
        pass

    counter = 0
    pagenum = 0
    more = True

    while more:
        pagenum += 1
        print "PAGE " + str(pagenum) + ": " + url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        matches = regex_cache.findall(page)
        for match in matches:
            counter += 1
            # un-escape the cache URL before requesting it
            tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter, ": " + match
            f = open('files/' + str(counter) + '.html', 'w')
            f.write(tmp_page)
            f.close()
        # now check if there are more result pages
        match = regex_next.search(page)
        if match is None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4
Before using the script you need to adjust the search_term variable near the beginning of the script. This variable holds the search term for which all the available cached pages will be downloaded. For example, to retrieve the cache of all the pages of http://www.example.org you should set search_term to site:www.example.org.
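For example, assuming the script is saved as retrieve_cache.py (the filename is just an example), recovering www.example.org would look roughly like this:

# point the script at the site whose cache you want to download
search_term = "site:www.example.org"

# then run it from a terminal:
#   python retrieve_cache.py
# each cached page is written to files/1.html, files/2.html, ...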
Does anyone have a script that works for 300+ pages?
If someone could email me at ddhamm@yahoo.com, I have questions – can’t get any of the scripts to work without errors (actually have installed Python 2.7, can edit the .py files, etc.) I am so close to getting this to work!
Can we use the same script in PHP?
No, you’ll have to rewrite it, it’s in Python.
Hi, we’ve just lost the content off our site, too. Is there any luck with the new automated script?
There is a slight chance, as Google changed things since it was written. You may need to update it.
Hi Guy,
Where are the files saved after running the script?
Isaac
It saves the files under a directory named ‘files’ in the current working directory.
@Fentex
please put code tags around like osx rocks
regex_cache = re.compile(r'<a [^>]*href="([^"]+)[^>]+>Cached</a>')
regex_next = re.compile(r'<a [^>]*href="([^"]+)[^>]+id=pnnext')
ok, I see the prob 🙁 shame on me!
Thank you Guy,
I have seen the files folder but it is empty.
I have a feeling I didn’t do it the correct way.
This is what I did.
a) I installed Python on my machine, which is running Windows 7.
b) Ran the script via the command line.
Thank you very much for your script. I have posted a modified version of this script in this gist: https://gist.github.com/1504425. It adds random sleeping to make sure that Google doesn’t block the crawler.
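For those who don’t want to dig through the gist, the core idea is just to pause for a random interval between requests; a minimal sketch of that (the 5–20 second range here is arbitrary, not necessarily what the gist uses):

import random
import time

# sleep a random 5-20 seconds between requests so the crawler looks
# less like an automated scraper to Google
time.sleep(random.uniform(5, 20))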
If you don’t feel like scripting it yourself (even though you can learn a lot from it), there are sites like http://recovermywebsite.com which recover your site from the cache, and I think it is free.
I’m extremely impressed with your writing skills as well as with the layout of your weblog. Is this a paid theme or did you customize it yourself? Anyway, keep up the excellent quality writing; it’s rare to see a great blog like this one these days.
This is my version; it handles 503, 504, and 404 errors without trouble (Google blocks IPs that send too many requests): https://gist.github.com/3787790
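The gist has the full details, but the general trick for surviving those errors is to catch urllib2.HTTPError and either skip the page or retry after a pause; a rough sketch of the idea (the retry count and delay are arbitrary, and the gist may do it differently):

import time
import urllib2

def fetch(req, retries=3, delay=60):
    # try the request a few times: skip pages that are gone (404) and
    # back off when Google throttles us (503/504)
    for attempt in range(retries):
        try:
            return urllib2.urlopen(req).read()
        except urllib2.HTTPError as e:
            if e.code == 404:
                return None
            if e.code in (503, 504) and attempt < retries - 1:
                time.sleep(delay)
            else:
                raise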
Thank you Thang Pham, it works!
Hey Guy
Just wondering what the latest working code is? I have tried your original one and several of the others, but I just keep getting an empty files folder on my desktop. I’ve been saving the code into a notepad file as document.py. I then double-click on it to get it to work. I’ve also tried dragging it onto the command line of the Python program. Is that the way to do it?
Brad