Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so something had to be done in case the hard disk couldn't be recovered. My efforts turned to Google's cache. Google keeps a copy of the text of each web page in its cache, which is usually useful when a website is temporarily unavailable. The basic idea is to retrieve a copy of all the pages of a given site that Google has cached.
While this is easily done manually when only a few pages are cached, the task needs to be automated when several hundred pages have to be retrieved. This is exactly what the following Python script does.
#!/usr/bin/python
import urllib
import urllib2
import re
import socket
import os

socket.setdefaulttimeout(30)

# adjust the site here
search_term = "site:guyrutenberg.com"

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q=" + search_term
    regex_cache = re.compile(r'<a href="(http://\d*\.\d*\.\d*\.\d*/search\?q=cache.*?)".*?>Cached</a>')
    regex_next = re.compile('<a href="([^"]*?)"><span id=nn></span>Next</a>')

    # this is the directory we will save files to
    try:
        os.mkdir('files')
    except OSError:
        pass

    counter = 0
    pagenum = 0
    more = True

    while more:
        pagenum += 1
        print "PAGE "+str(pagenum)+": "+url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        # save a local copy of every cached page linked from this results page
        matches = regex_cache.findall(page)
        for match in matches:
            counter += 1
            tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter, ": "+match
            f = open('files/'+str(counter)+'.html', 'w')
            f.write(tmp_page)
            f.close()
        # now check if there are more result pages
        match = regex_next.search(page)
        if match is None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4
Before using the script you need to adjust the search_term variable near the beginning of the script. This variable holds the search term for which all the available cached pages will be downloaded. E.g., to retrieve the cache of all the pages of http://www.example.org you should set search_term to site:www.example.org.
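That is, the line near the top of the script would read:

search_term = "site:www.example.org"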
This just saved my bacon. Thanks!
This saved my butt like you can't imagine. If I could send you $$, I would.
I’m glad that it helped.
I got lucky in my case, as the hard-disk data recovery was successful, but this script helped me sleep better while the data was being recovered.
This is what I have been looking for for days. Recently I lost all the data of my site and want to get it back from Google's cached pages. I used a Perl program to retrieve pages, but it easily gets banned by Google, so I have to use different proxy servers. Each IP can only retrieve about 20-40 pages. Does your code have the same problem? Thanks.
I've used the script to retrieve several thousand pages and I didn't get banned. The script uses a normal user-agent string to disguise itself, which I guess helps a bit.
Good luck with retrieving the pages.
Hello, I'm trying to retrieve 5 pages from a website that is now parked at godaddy.com. A few months ago I could see the cached pages; now I can't. I've also checked archive.org's Wayback Machine, but it only goes up to 2005, and I'm looking for cached pages from April 2007 to Sept 2008. I'm not a programmer, so I don't understand your code above. Can you point me in a direction to find these cached web pages? The website I'm looking for is http://www.digital-dynamix.com
thanks!!!
Hi Allen,
I'm afraid I can't help you. Google's cache and the Wayback Machine are the caches I use to retrieve old sites. I suggest you search Google for web archives and see if you come up with some relevant results.
Hi,
This script works brilliantly except that it only downloads the first 10 (or on a couple of searches I tested, 9) results for a search. Do you know why it might be doing that and how I could get it to download all of them?
Thanks
This sounds weird; when I tested it, it let you download as many results as Google can display (which is limited to the first 1000). I actually downloaded many more than 10 results a couple of times. For which query did you try to download the cache?
It was a livejournal, so, for example, I have the line with the search term reading:
search_term="site:news.livejournal.com"
and I get either 9 or 10 results downloaded. This happens for any site, not just livejournals, so for example when I tried
search_term="site:ebay.com"
I got 10 files downloaded.
This is Python 2.3.5 on OS X 10.4.11.
Hi J,
I've rechecked what you're saying and you're right. It seems that Google changed their page markup, so the script doesn't recognize that there are more result pages.
If you want the script to work for you, you'll have to rewrite the regex_next regular expression in line 4 of main() to capture the existence of the Next button (it shouldn't be hard).
Has anyone fixed this? I’m a regex dummy and desperate to get this to work.
No one fixed the regex yet (at least that I know). Unfortunately, I’m too busy to fix it myself these days.
The following line is complete overkill, but seems to do the trick for now:
regex_next = re.compile(‘Next‘)
(of course, Google now seems to have flagged me as a malicious service…)
regex_next = re.compile(‘<a href=”([^”]*?)”><span class=”csb ch” style=”background-position:-76px 0;margin-right:34px;width:66px”></span>Next</a>’)
Sorry, that last post got stripped
Thanks Ed!
Don't worry about the malicious-service flagging; it usually disappears in a few minutes (at least it did for me when I developed the script).
errr …
Where do you run this script?
You run it using Python from the command line.
duh! Thanks.
I edited your script with Ed’s fix and ran it at root. It still only returned 10 results. (BTW – Ed’s fix has a syntax error).
Too bad it doesn’t work – I need to retrieve 129 pages – alas a lot of time wasted doing it one at a time!
Linda, if it doesn't work it means that either Ed's regex is wrong or you copied it incorrectly. Make sure that when you copy it you replace the curly single and double quotes with plain ones; this is what caused the syntax error.
Don't forget to replace &lt; with < and &gt; with > (I couldn't have both them and the quotes correct).
In that last post, also add a colon between "width" and "66px".
@Dan, Thanks for pointing it out.
I'm not a programmer but REALLY need to download an entire cache (1000 web pages) from Google. I've downloaded Python 2.6 and run the command line, but I have 3 problems:
– Python won't allow me to paste code??
– When I type the code in I get an IndentationError on os.mkdir('files')
– How do I "run" the code once it's completed?
Can someone please explain where I have gone wrong, because I am running out of time.
I'm a complete idiot when it comes to programming and scared I will lose the Google cache if I can't get this fixed? Please help??
@Lee, you should save the code to a file and then run it with python. Please see the comments above for possible changes to the code. Running python code is a bit different from one operating system to the other, but it is usually done by opening a command prompt (or a terminal) and typing:
python name_of_the_file_you_would_like_to_run.py
I have tried this script, with all the modifications mentioned, but it still saves only 10 pages 🙁
Is there anyone who has been successful in retrieving more than 10 pages? If so, please email the script to mgalus@szm.sm. Thanks a lot! I really need this 🙁
@disney: change
regex_next = re.compile(‘Next‘)
to
regex_next = re.compile(‘]*)>Next‘)
I got 26 pages this way, but there should be more. It quit with this output:
Traceback (most recent call last):
  File "/home/james/scripts/gcache-site-download.py", line 46, in <module>
    main()
  File "/home/james/scripts/gcache-site-download.py", line 33, in main
    tmp_page = urllib2.urlopen(tmp_req).read()
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError:
Oh man … it ate my text. I'm working on tweaking a few other things, so I'll post again when it's ready.
Well, since I can't post the code without WordPress rewriting it, I guess just get in touch with me via my site if you want the new script.
http://james.revillini.com/wp-content/uploads/2010/03/gcache-site-download.py_.tar.gz
^^ my updated version.
changes:
* fixed the 'cannot download more than 10 pages' bug – the regular expression that finds the 'Next' link was a little off, so I made it more liberal and it seems to work for now
* added a random delay between requests, which makes it run slower, but I thought it might help trick Google into thinking it is a regular user, not a bot (if they flag you as a bot, you are prevented from viewing cached pages for a day or so)
* creates the 'files' subfolder in /tmp/ … Windows users should probably change this to c:/files or something
* catches exceptions if there is an HTTP timeout
* timeout limit set to 60 seconds, as 30 was timing out fairly often (a rough sketch of these changes follows below)
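Roughly, the delay and timeout handling look something like this (a simplified sketch, not the exact code from the archive; fetch_page is just an illustrative helper name):

import random
import socket
import time
import urllib2

socket.setdefaulttimeout(60)  # 30 seconds was timing out fairly often

def fetch_page(req):
    # sleep a random few seconds between requests so the traffic looks less bot-like
    time.sleep(random.uniform(2, 10))
    try:
        return urllib2.urlopen(req).read()
    except (urllib2.URLError, socket.timeout):
        # HTTP timeout or other network error: skip this page instead of crashing
        return None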
Hi,
I'm running Python for Windows x64:
http://www.python.org/download/releases/3.1.2/
And when running your latest script I got the following error:
http://i44.tinypic.com/okzwqd.jpg
Please tell me what is wrong. I don't know much about Python apart from running the script!
Thanks in advance
Python 3 isn't backward-compatible with Python 2, so there are incompatible syntax changes. One of those changes concerns the print command.
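For example, the print statement the script uses becomes a function call in Python 3:

print "PAGE " + str(pagenum) + ": " + url     # Python 2 (as in the script)
print("PAGE " + str(pagenum) + ": " + url)    # Python 3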
As a result, you will have to use Python 2, or adjust the script yourself to Python 3.
Thank you very much, "Guy". I downloaded the Windows x86 MSI Installer (2.7a4), as it seems to be the latest 2.x version, and it works.
Cheers 🙂
The script just echoes
PAGE 1: http://www.google.com/search?q=site:thesite.com
and no file is saved
Same issue as Jeremiah. Nothing is saved, but I get the same type of echo message.
Google's cache is now served from a proper URL and not an IP address. You need to change regex_cache from
\d*\.\d*\.\d*\.\d*
to
\w*\.\w*\.\w*
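Applied to the original regex_cache line in the script, that change would read something like:

regex_cache = re.compile(r'<a href="(http://\w*\.\w*\.\w*/search\?q=cache.*?)".*?>Cached</a>')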
This really works as of 4 August 2010. I used James Revillini's code, with mar wilson's regex_cache change, on Windows with Python 2.7.
However, I have the following issue: I want to retrieve a sub-site selected by a key, such as
http://www.mysite.com/forum/codechange?showforum=23
What changes are required in the code to make this work?
Thanks everybody, especially Guy.
Update: Nope, only the first ten again!!
What happened?? I need to retrieve a whole website.
By adding num=100, i.e. changing line 15 to:
url = "http://www.google.com/search?num=100&q=" + search_term
you can at least fetch 100 results instead of 10.
That was enough for me, so I didn't try to fix the page-fetching any further.
Hope it helps.
Thanks to the author for this great script, it saved my life as I had a server crash and no backups! 😉
Hey, nice blog… really like it and added it to my bookmarks. Keep up the good work.
Here is the idea behind getting results with this script:
Apparently Google doesn't return more than 10 results and the code searching for the "Next" page is broken. I am regex-illiterate, but I managed to work around it using the following trick: we know that each page of results has at most 10 results, and you can ask Google for results starting at a specific offset. For example,
http://www.google.ca/search?&q=Microsoft&start=50
will get you the results starting at the 50th one!
Now all you have to do is cycle the start value, for example from 1 to 100:
http://www.google.ca/search?&q=Microsoft&start=1
http://www.google.ca/search?&q=Microsoft&start=2
http://www.google.ca/search?&q=Microsoft&start=3
.
.
.
http://www.google.ca/search?&q=Microsoft&start=100
Here is the code; it might look primitive and crippled, but it worked for me.
The code that the others pasted above has wrong indentation (spaces and tabs mixed); I used only tabs.
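In outline (a simplified sketch of the loop, not the exact code I ran; it reuses search_term, headers, and regex_cache from the original script and steps start by 10 per results page):

counter = 0
for start in range(0, 1000, 10):  # Google serves at most the first 1000 results
    url = "http://www.google.com/search?q=" + search_term + "&start=" + str(start)
    page = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    for match in regex_cache.findall(page):
        counter += 1
        cached = urllib2.urlopen(urllib2.Request(match.replace('&amp;', '&'), None, headers)).read()
        f = open('files/' + str(counter) + '.html', 'w')
        f.write(cached)
        f.close()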
Ignore my previous post; my script will download duplicate pages, as Google returns the results in a different order each time.
I just had a need for this script, but found the regexes out of date, so I tweaked them for Google's current output…
Hmmmm, okay, this blog's input doesn't escape HTML properly – here it is again…
Oh ffs…
I give up. I don't know what this blog's input filters are doing, but they aren't sane.
Fentex, I need this script. Can you please email a copy to me? Thanks in advance. Please send to: leslie_chw@hotmail.com
Hi Fentex.
If you’ve still got it, I could use a copy of that script. If you could send it to simon@seethroughweb.com
Thank you.
Simon
Hey, can someone tell me how to run this code – do you do it on a server, or through cPanel, or what? Sorry, I am not a coder and I just need to get about 300+ cached pages back from Google (a hosting nightmare caused the site to lose functionality when the database was lost or something).
Oh, never mind, I see you run it using Python at the command line – I'll have to learn that!