I’ve been using Python to write various bots and crawlers for a long time. A few days ago I needed to write a simple bot to remove some 400+ spam pages in Sikumuna, so I took an old script of mine (from 2006) in order to modify it. The script used ClientForm, a Python module that allows you to easily parse and fill HTML forms using Python. I quickly found that ClientForm is now deprecated in favor of mechanize. At first I was somewhat put off by the change, as ClientForm was pretty easy to use and mechanize’s documentation could use some improvement. However, I quickly changed my mind about mechanize. Its basic interface is a simple browser object that literally allows you to browse using Python. It takes care of handling cookies and the like, and it has form-filling abilities similar to ClientForm’s, but this time they are integrated into the browser object.
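Just to illustrate the browsing part in isolation, here is a tiny self-contained sketch (the URL is only a placeholder, not one of the pages the bot touches):

import mechanize

# The Browser object behaves like a small headless browser:
# it follows redirects and keeps cookies between requests automatically.
browser = mechanize.Browser()
browser.set_handle_robots(False)   # don't let robots.txt block the request
response = browser.open("http://example.com/")
print browser.title()              # title of the page we just "browsed" to
print len(response.read()), "bytes received"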
For future reference, and as another code example to supplement mechanize’s sparse documentation, I’m giving below the gist of the simple bot I wrote:
import re
import mechanize

# From the class's __init__ (the rest of the setup is omitted):
self.browser = mechanize.Browser()
self.browser.set_handle_robots(False)  # ignore robots.txt so the bot can reach admin pages

def login(self):
    # Fill and submit MediaWiki's login form
    self.browser.open(self.login_url)
    self.browser.select_form(name="userlogin")
    self.browser["wpName"] = self.username
    self.browser["wpPassword"] = self.password
    res = self.browser.submit()

def find_pages(self, prefix):
    # List all pages starting with the given prefix and return their URLs
    self.browser.open(self.find_pages_url)
    self.browser.select_form(nr=0)
    self.browser["from"] = prefix
    res = self.browser.submit()
    data = res.read()
    link_regex = re.compile('<td><a href="([^"]*)"[^<]*</a></td>')
    return link_regex.findall(data)

def delete_page(self, page_url):
    # Go straight to the page's delete form
    self.browser.open(page_url + "&action=delete")
    if "Kindle" not in self.browser.title():
        # The title doesn't look like the expected one, so ask before deleting
        print self.browser.title()
        if raw_input("Confirm: ") != "y":
            return
    self.browser.select_form(nr=0)
    self.browser["wpReason"] = "Spam"
    self.browser.submit()

def run(self, prefix):
    self.login()
    pages = self.find_pages(prefix)
    print "Found %d pages" % len(pages)
    for i, page in enumerate(pages):
        print "Deleting", i
        self.delete_page(page)
This isn’t a complete code example, as the rest of the code is just mundane, but you can clearly see how simple it is to use mechanize.
The interesting parts are:
- Initializing the browser object using mechanize.Browser()
- Opening pages: browser.open(url)
- Selecting forms: browser.select_form(name="userlogin") (selecting a form by name) or browser.select_form(nr=0) (selecting a form by its sequential number in the page)
- Filling forms by assigning values to the form fields on the browser object: browser["wpName"] = self.username
- Submitting: browser.submit()
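Putting those steps together in one self-contained snippet (the URL, form name and field names below are placeholders, not the ones used by the bot above):

import mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)

# Open a page, pick a form, fill two fields and submit it.
browser.open("http://example.com/login")
browser.select_form(name="loginform")   # or select_form(nr=0) to pick the first form
browser["user"] = "someuser"
browser["pass"] = "somepassword"
response = browser.submit()
print browser.title()                   # the page we landed on after submitting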
Hi,
I think you did not post the complete code.
Maybe some of the code at the start is missing.
It’s intentional; this code snippet has everything needed to understand it, except for some initialization code in the class, which contains confidential things like usernames and passwords.
Ervin, that’s not complete code, but it is very useful code for writing a spider.
Could you possibly post the whole code without the confidential things, like putting username=username, password=password, or something like that?
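For illustration only, the omitted initialization could look roughly like this; every URL, username and password below is a placeholder, not the real thing:

import mechanize

class WikiBot(object):
    def __init__(self, base_url, username, password):
        # Hypothetical setup; the real values are confidential
        self.login_url = base_url + "?title=Special:Userlogin"
        self.find_pages_url = base_url + "?title=Special:Allpages"
        self.username = username
        self.password = password
        self.browser = mechanize.Browser()
        self.browser.set_handle_robots(False)

    # login(), find_pages(), delete_page() and run() as in the post above

# Usage, once the methods from the post are added to the class:
# bot = WikiBot("http://example.com/index.php", "someuser", "somepassword")
# bot.run("SpamPrefix")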
Thank you Guy!
I was using Ruby, then discovered the dependency challenges and the antivirus warnings.
My interest is to create automation bots for both web and desktop automation.
I have looked into Rad Studio C++ and/or Delphi, Visual Studio and Qt.
Which of these three would you yourself prefer for creating automation bots that incorporate Python?
IronPython can be included within Visual Studio, and Qt can also incorporate Python, but I am not sure about Rad Studio Berlin 10.