mechanize – Writing Bots in Python Made Simple

I’ve been using python to write various bots and crawler for a long time. Few days ago I needed to write some simple bot to remove some 400+ spam pages in Sikumuna, I took an old script of mine (from 2006) in order to modify it. The script used ClientForm, a python module that allows you to easily parse and fill html forms using python. I quickly found that ClientForm is now deprecated in favor of mechanize. In the beginning I was partly set back by the change, as ClientForm was pretty easy to use, and mechanize‘s documentation could use some improvement. However, I quickly changed my mind about mechanize. The basic interface for mechanize is a simple browser object, that litteraly allows you to browse using python. It takes care of handling cookies and such and it got similar form-filling abilities to ClientForm, but this time they are integrated into the browser object.

For future reference for myself, and as another code example to mechanizes sparse documentation I’m giving below the gist of the simple bot I wrote:

Continue reading mechanize – Writing Bots in Python Made Simple

Bye Bye OmniCppComplete, Hello Clang Complete

For years OmniCppComplete has been the de facto standard for C++ completion in Vim. But as time progressed, I got more and more annoyed by it’s shortcomings. OmniCppComplete is based on tokenizing provided by ctags. The ctags parsing of C++ code is problematic, you can’t even run it on libstdc++ headers (you need to download modified headers). You want to use an external library? You’ll need to run ctags seperatly on each library. Not to mention it’s inablity to deduce types of anything more than trivial. The core of the problem is that OmniCppComplete isn’t a compiler and you can’t expect something that isn’t a compiler to fully understand code. This what makes Visual Studio’s IntelliSense so great: it uses the Visual C++ compiler for parsing, it isn’t making wild guess at types and what is the current scope – it knows it.
Continue reading Bye Bye OmniCppComplete, Hello Clang Complete

Starting Djano upon Reboot using Cron

Three years ago I wrote about starting services as user via cron instead of init.d (specifically Trac). However this method has a serious downside: it has no support for dependencies. This really bothered me when I used cron to start a Django server, has it would attempt to load before MySQL was running. This made the cron useless, as always after rebooting I would receive an error email saying it couldn’t connect to the MySQL server, and I would have to login and start the Django Server manually. Yesterday I got sick of it, and decided to hack something that will work properly. So my crontab had the following line:

@reboot python ./manage.py runfcgi ...options...

which I changed into:

@reboot until /sbin/status mysql | grep start/running ; do echo "Mysql isn't running yet...>&2"; sleep 1; done; python ./manage.py runfcgi ...options...

Basically it loops and checks whether the MySQL service got started. If so, it would start Django as it did before. On the other hand, if MySQL isn’t running it would just sleep for a second and repeat the check.

A small issue is that if for some reason MySQL won’t start at all, it will loop forever. If this happens, it would mean that I’ll have to manually kill that cronjob, but I would have to login anyway to see what wrong with the MySQL. So, while this method can’t support dependencies like init.d does, it does provide a good-enough solution.

Update 2012-11-23: Fixed the crontab line (it would fail when mysql was in start/post-start state).

Gmail backup: getmail vs. OfflineIMAP

I’m currently reviewing my backup plans and decided it’s a good occasion to finally start backing up my Gmail account. Firstly, I didn’t seriously consider Desktop clients as the main backup tool, as they are hard to automate. The two main options are: OfflineIMAP and getamil. Both are available from Ubuntu’s repositories, so installation is easy with both and both have good tutorials, Matt Cutts’ getmail and EnigmaCurry’s OfflineIMAP.

OfflineIMAP claims to be faster, but I haven’t really checked it (and I’m not sure how important that is giving that it runs in the background). From what I saw configuring them is mainly a task of cut-and-paste, but getmail requires to list every label you want to backup, which I consider is a major downside. As both are able to save the mails to maildir format, it should be easy to back it up using duplicity.

Conclusion: This was a short comparison, mainly to guide me in choosing the right backup for me, you may have different opinions (which, of course, I would gladly hear). I finally chose OfflineIMAP, mainly due to the labels issue.

Note on desktop clients: It seems that every decent one can be configured to work with a local maildir, so you can use them to read the backups. As I prefer Gmail’s interface, I will only use desktop clients in case I’m offline, so read-only access from desktop client seems good enough for me.

The annoying eBook vs. Paperback Pricing

I’m an avid Kindle user for more than a year. However once in a while, I come across something like this when I shopping for a new book:

As you can see, Amazon sells Kindle edition for higher price than a paperback. This book of course isn’t the only example for this ridiculous pricing method, and if one browses the Kindle store he will surely find more.

This really upsets me, as there is no honest reason to price an electronic edition higher than a real dead-tree paper edition. In both cases, the author and the publisher get their royalities and share of the profits. But the Kindle editions doesn’t have many related expenses, like storage, transportation (from the publisher to Amazon), and above all printing costs.

I don’t know who is to blame for this absurd thing, Amazon or the publisher (or even both). But the few things I know are that this bad for everyone, the customer because he pays more and Amazon/publisher as in the long run, this encourages piracy as the customer feels he’s being unfairly treated thus he will be more willing to play an unfair game as well.

Creating a Deb for an Updated Version

Say you’ve an existing package like gitg and you want to use the new version of gitg or even apply your own patches. You could directly make install but you will probably regret it as soon as you’ll want to upgrade/uninstall, and you want to create a better package than the one created by checkinstall. Apperantly, creating a deb package for a new version of already packaged deb isn’t complicated.

Getting Started

Start by pulling the sources for the already available package. I’ll by using gitg as an example throughout this tutorial.

$ apt-get source gitg

This will create a folder according to the version of the package, something like gitg-0.2.4. Extract the new version besides it and cd into its directory. The next step is to copy the debian/ directory from the old source package the code you’ve just extracted.

$ cp -R ../gitg-0.2.4/debian/ .

Dependencies

There are some dependencies you’ll need:

$ sudo apt-get install devscripts
$ sudo apt-get install dpkg-dev

You’ll probably want to do:

$ sudo apt-get build-dep gitg

in order to make sure you’ve all the relevant build time dependencies.

Update debian/ Files

The next step is to update the files under the debian/ sub-directory.

$ DEBEMAIL="Guy Rutenberg <myemail@domain.com>" debchange --nmu

This will update the debian/changelog and set the new version. --nmu will create a new “non maintainer upload” version, meaning if the current version was 0.2.4-0ubuntu1, it will change it to 0.2.4-0ubuntu1.1. This will make sure that there won’t be any collision between your package and an official one. If you update to a new upstream version. It might be more suitable to use something like this:

$ debchange --newversion 0.2.5+20111211.git.20391c4

If necessary, update the Build-Depends and Depends sections of debian/control.

Building the Package

If your building a package directly from version control and not part of an official release, you may need to run

$ ./autogen

at this point.

Now to the actual building:

$ debuild -us -uc -i -I -b

-us -uc tells the script not to sign the .dsc and .changes files accordingly. -i and -I makes the script ignore common version control files. -b tells debuild to only create binary packages. You can also pass -j followed by the number of simultaneous jobs you wish to allow (e.g. -j3, like in make) which an significantly speed things up.

Installing the Package

The package will reside in the parent directory, for example:

../gitg_0.2.5+20111211.git.20391c4_amd64.deb

At this point you’re basically done. If you want to install the package you can use

sudo debi

while you’re still inside the build directory.

Creating Source Packages

If you want to go the extra mile and create source packages, it will make things easier for others to build their own packages based on yours.

You’ll need to create a an “orig” tarball

../gitg_0.2.5+20111211.git.20391c4.orig.tar.bz2

(note the underscore between the package-name and the version). The “orig” tarball should contain the original source-code without the debian specific patches.

Now you can run the debuild command like before but without the -b flag.
This will create the following files:

../gitg_0.2.5+20111211.git.20391c4.debian.tar.gz
../gitg_0.2.5+20111211.git.20391c4.dsc

References

  • UpdatingADeb
  • Man pages for debuild, dpgk-genchanges, dpgk-buildpackage.

Using gitg without installing

I’m working on adding spell checking support to gitg. If you intend to use gitg without installing it, a little hack is necessary. You’ll need to symlink the gitg directory (the one with the source files) as ui.

ln -s gitg ui
./configure /pathto/below/gitg

The reason is that gitg will look for Glade UI files under $(datadir)/gitg/ui and in gitg’s source the UI files are in the gitg directory and not in ui.

You can see above a screenshot of gitg with spell checking enabled. Hopefully I’ll be done with the changes soon and they will be merged to upstream quickly.

Update: There are couple more things to do in order to get gsettings’ schemas right.

mkdir glib-2.0
ln -s ../data glib-2.0/schemas
glib-compile-schemas data/
XDG_DATA_DIRS=".:/usr/share/" ./gitg/gitg

For the schemas thing see glib-compile-schemas‘ man page.

Update 2011-12-17: Jesse (gitg’s maintainer) hasn’t responded to my email regarding the new feature, so I’ve open a bug (#666406) for it. If you’re willing to try the changes yourself, you can pull them from git://github.com/guyru/gitg.git spellchecker.

GNOME_COMPILE_WARNINGS(maximum) – Syntax Error in configure

I’m still encountering migration issues from Gentoo to Ubuntu. Apperantly, Gentoo is much more user friendly than Ubuntu when it comes to compiling packages. In Gentoo you’ve got almost all the major dependencies you need. In Ubuntu, on the other hand, you need to hunt them down. It’s much easier with the main ones, as they are listed. But there are some small ones which are harder to track. I came across the following error while trying to compile gitg, a GUI for Git, today:

./configure: line 14447: syntax error near unexpected token `maximum'
./configure: line 14447: `GNOME_COMPILE_WARNINGS(maximum)'

After not so short investigation I found out I was missing gnome-common

sudo apt-get install gnome-common

Why can’t be one distribution which is user-friendly like Ubuntu and in the same time developer-friendly like Gentoo?

Solving Sudoku using Python and Prolog

Two weeks ago, I add came up with an interesting algorithm for solving Hidato which basically involves decomposing the board the grid (can be square, hexagonal or any other shape), into classes of pieces and then arranging them (maybe I’ll write a detailed post on it in the future). So while pondering whether it would be interesting enough to go forward and actually implementing the algorithm compared to the work it would require, I started thinking what will be the simplest way to solve such puzzles, as opposed to efficient.

At first I’ve looked at general purpose constraint solvers, and decided to tackle Sudoku instead as it’s a bit simple to define in terms of constraints. I considered several libraries but in the end I’ve settled on plainly using Prolog. I chose Prolog because as a logic programming language, constraints are its bread and butter. I although kind of liked it as I haven’t done anything in Prolog for quite a few years.

Describing Sudoku in terms of constraints is extremely simple. You need to state that every cell is in a given range and that all rows, columns and sub-grid contain different integers. As mangling with lists in prolog isn’t fun, I’ve wrote a python program that outputs all the prolog statements with hardcoded references to the variables which build-up the board. It’s ugly but dead simple. The script gets the dimensions of the sub-grid.
Continue reading Solving Sudoku using Python and Prolog