Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so something had to be done in case the hard disk couldn’t be recovered. My efforts turned to Google’s cache. Google keeps a copy of the text of each web page in its cache, which is usually useful when a website is temporarily unavailable. The basic idea is to retrieve a copy of every page of a given site that Google has cached.
While this is easily done manually when only a few pages are cached, the task needs to be automated when several hundred pages have to be retrieved. This is exactly what the following Python script does.
#!/usr/bin/python
import urllib
import urllib2
import re
import socket
import os

socket.setdefaulttimeout(30)

# adjust the site here
search_term = "site:guyrutenberg.com"

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q=" + search_term
    regex_cache = re.compile(r'<a href="(http://\d*\.\d*\.\d*\.\d*/search\?q\=cache.*?)".*?>Cached</a>')
    regex_next = re.compile('<a href="([^"]*?)"><span id=nn></span>Next</a>')

    # this is the directory we will save files to
    try:
        os.mkdir('files')
    except OSError:
        pass

    counter = 0
    pagenum = 0
    more = True

    while more:
        pagenum += 1
        print "PAGE " + str(pagenum) + ": " + url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        matches = regex_cache.findall(page)
        for match in matches:
            counter += 1
            # un-escape the cache URL before requesting it
            tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter, ": " + match
            f = open('files/' + str(counter) + '.html', 'w')
            f.write(tmp_page)
            f.close()
        # now check if there are more result pages
        match = regex_next.search(page)
        if match is None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4
Before using the script you need to adjust the search_term variable near the beginning of the script. This variable holds the search term for which all the available cached pages will be downloaded. For example, to retrieve the cache of all the pages of http://www.example.org you should set search_term to site:www.example.org.
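For example, assuming the script is saved as retrieve_cache.py (the filename is just an example), recovering www.example.org would look roughly like this:

# point the script at the site whose cache you want to download
search_term = "site:www.example.org"

# then run it from a terminal:
#   python retrieve_cache.py
# each cached page is written to files/1.html, files/2.html, ...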
Does anyone have a script that works for 300+ pages?
If someone could email me at ddhamm@yahoo.com, I have questions – can’t get any of the scripts to work without errors (actually have installed Python 2.7, can edit the .py files, etc.) I am so close to getting this to work!
Can we use the same script in PHP?
No, you’ll have to rewrite it, it’s in Python.
Hi, we’ve just lost the content off our site, too. Is there any luck with the new automated script?
There is a slight chance, as Google changed things since it was written. You may need to update it.
Hi Guy,
Where are the files saved after running the script?
Isaac
It saves the files under a directory named ‘files’ in the current working directory.
@Fentex
please put code tags around like osx rocks
regex_cache = re.compile(r'<a [^>]*href="([^"]+)[^>]+>Cached</a>')
regex_next = re.compile(r'<a [^>]*href="([^"]+)[^>]+id=pnnext')
ok, I see the prob 🙁 shame on me!
Thank you Guy,
I have seen the files folder but it is empty.
I have a feeling I didn’t do it the correct way.
This is what I did.
a) I installed Python on my machine, which is running Windows 7.
b) Ran the script via the command line.
Thank you very much for your script. I have posted a modified version of this script in this gist: https://gist.github.com/1504425. It adds random sleeping to make sure that Google doesn’t block the crawler.
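For those who don’t want to dig through the gist, the core idea is just to pause for a random interval between requests; a minimal sketch of that (the 5–20 second range here is arbitrary, not necessarily what the gist uses):

import random
import time

# sleep a random 5-20 seconds between requests so the crawler looks
# less like an automated scraper to Google
time.sleep(random.uniform(5, 20))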
If you don’t feel like scripting it yourself (even though you can learn a lot from it), there are sites like http://recovermywebsite.com which recover your site from the cache, and I think it is free.
I’m extremely impressed with your writing skills as well as with the layout of your weblog. Is this a paid theme or did you customize it yourself? Anyway, keep up the excellent quality writing; it’s rare to see a great blog like this one these days.
This is my version; it handles 503, 504, and 404 errors without trouble (Google blocks IPs that send too many requests): https://gist.github.com/3787790
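The gist has the full details, but the general trick for surviving those errors is to catch urllib2.HTTPError and either skip the page or retry after a pause; a rough sketch of the idea (the retry count and delay are arbitrary, and the gist may do it differently):

import time
import urllib2

def fetch(req, retries=3, delay=60):
    # try the request a few times: skip pages that are gone (404) and
    # back off when Google throttles us (503/504)
    for attempt in range(retries):
        try:
            return urllib2.urlopen(req).read()
        except urllib2.HTTPError as e:
            if e.code == 404:
                return None
            if e.code in (503, 504) and attempt < retries - 1:
                time.sleep(delay)
            else:
                raise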
Thank you Thang Pham, it works!
Hey Guy
Just wondering what the latest working code is? I have tried your original one and several of the others, but I just keep getting an empty files folder on my desktop. I’ve been saving the code into a notepad file as document.py. I then double-click on it to get it to work. I’ve also tried dragging it onto the command line of the Python program. Is that the way to do it?
Brad