pip list --user --outdated lists all user-installed packages that are outdated. We use jq to extract the package names from the output, and feed them to xargs.
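For readers who would rather stay in Python than reach for jq, here is a minimal sketch of the name-extraction step. It assumes pip was asked for JSON output (`--format=json`), in which case each entry is an object with a `name` key. The `outdated_names` helper and the sample data are made up for illustration.

```python
import json


def outdated_names(pip_json):
    """Return the package names found in pip's JSON listing."""
    return [pkg["name"] for pkg in json.loads(pip_json)]


# Hypothetical sample of what `pip list --user --outdated --format=json`
# might print; the packages and versions are invented for this example.
sample = (
    '[{"name": "requests", "version": "2.30.0", "latest_version": "2.31.0"},'
    ' {"name": "six", "version": "1.15.0", "latest_version": "1.16.0"}]'
)

print(outdated_names(sample))  # ['requests', 'six']
```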
So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare a few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a block of 1024 bytes until we get 1MB of data. This is very similar to what one might do when reading a large file into memory, so this test is pretty realistic.
The first implementation is the naive one.
def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
It is well known that the naive implementation is very slow: bytes is an immutable type in Python, so every concatenation has to allocate a new object and copy the data into it. Just how slow is it? About 330 times slower than the append-and-join pattern. The append-and-join pattern was a popular (and efficient) way to concatenate strings in old Python versions:
def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)
It relies on the fact that appending to a list is efficient, and that b''.join can then preallocate all the needed memory and perform the copy in one pass. As you can see below, it is much more efficient than the naive implementation.
Python 2.6 introduced bytearray as an efficient mutable sequence of bytes. Being mutable allows one to "naively" concatenate to the bytearray and still achieve great performance: more than 30% faster than the join pattern above.
def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
Comparing the naive, join and bytearray implementations. Time is for 64 iterations.
Comparing the join, bytearray, preallocated bytearray and memoryview implementations. Time is for 8192 iterations.
What about preallocating the memory?
def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
While this sounds like a good idea, Python’s copy semantics turn out to be very slow: this version ran about 5 times slower than the simple bytearray implementation. Python also offers memoryview:
memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.
The idea of access to the internal data without unnecessary copying sounds great.
def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
And it does run almost twice as fast as the preallocated bytearray implementation, but it is still about 2.5 times slower than the simple bytearray implementation.
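To make the "without copying" part concrete, here is a small sketch (not part of the benchmark) that contrasts slicing a bytearray with slicing a memoryview over it. The variable names are only for illustration.

```python
buf = bytearray(b"abcdefgh")

copied = buf[0:4]            # slicing a bytearray makes an independent copy
view = memoryview(buf)[0:4]  # slicing a memoryview shares buf's memory

view[:] = b"WXYZ"            # writing through the view modifies buf itself

print(buf)     # bytearray(b'WXYZefgh')
print(copied)  # bytearray(b'abcd')
```

This is the same mechanism k() relies on: each slice assignment writes straight into the underlying bytearray.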
I ran the benchmark using the timeit module, taking the best run out of five for each. The CPU was an Intel i7-8550U.
import timeit
for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))
for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))
The simple bytearray implementation was the fastest method, and it is also as simple as the naive implementation. Preallocating doesn’t help, because it looks like Python can’t copy into an existing buffer efficiently.
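Applied to the file-reading scenario that motivated the benchmark, the bytearray pattern could look roughly like the sketch below. The `read_all` helper and its `chunk_size` parameter are hypothetical names, and an in-memory io.BytesIO stands in for a real file so the example is self-contained.

```python
import io


def read_all(stream, chunk_size=1024):
    """Read a binary stream to the end, accumulating chunks in a bytearray."""
    ret = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        ret += chunk
    return ret


# Stand-in for a 1MB file opened in binary mode.
source = io.BytesIO(b"a" * 2**20)
data = read_all(source)
print(len(data))  # 1048576
```

Returning the bytearray itself avoids one final copy; wrap it in bytes() only if the caller needs an immutable object.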
I recently had to work with some data that came in a huge Microsoft Access database. Because I like SQLite (and despise Access), I decided to export the data to an SQLite file. The first thing I needed to do was to somehow get all the data out of the db. Being a Linux user complicates things a bit, but thanks to mdb-tools it’s possible to process .mdb files without resorting to Windows and buying Access. Using mdb-tools directly can be tedious if you want to export a large db with multiple tables, so when I looked for a way to automate it, I came across Liberating data from Microsoft Access “.mdb” files. This post shows a nice script that dumps every table in a .mdb file to a separate CSV file.
While useful, I wanted something that I could easily import into SQLite. So I modified their script to generate an SQL dump of the db. Given a db file, it writes to stdout SQL statements describing the schema of the db, followed by INSERTs for each table. Actually, because mdb-tools doesn’t support SQLite as a backend, the dump uses a MySQL dialect, but it should be fine with SQLite as well (SQLite will mostly ignore the parts it can’t process, such as COMMENTs). The easiest way to use the script is
If the original db contains non-ASCII characters and isn’t encoded in UTF-8, you should set the MDB_JET3_CHARSET environment variable to the correct charset. The dump itself will be UTF-8 encoded.
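If you drive mdb-tools from Python rather than from a shell, the charset can be passed through the child process environment. The following is only a sketch, not the script described above: `mdb_env` and `export_table` are hypothetical helpers, it assumes the `mdb-export` binary from mdb-tools is on your PATH and is called as `mdb-export <file> <table>`, and the charset in the usage comment is just an example value.

```python
import os
import subprocess


def mdb_env(charset, base_env=None):
    """Return a copy of the environment with MDB_JET3_CHARSET set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MDB_JET3_CHARSET"] = charset
    return env


def export_table(mdb_path, table, charset):
    """Run mdb-export for one table and return its output as text.

    Assumes mdb-export is installed; its output is decoded as UTF-8.
    """
    result = subprocess.run(
        ["mdb-export", mdb_path, table],
        env=mdb_env(charset),
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8")


# Example (hypothetical file, table and charset):
# print(export_table("data.mdb", "Customers", "cp1252"))
```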
After you upgrade your Python or your distribution (specifically, this happened to me after upgrading from Ubuntu 11.10 to 12.04), your existing virtualenv environments may stop working. This manifests itself as errors about missing modules. For example, when I tried to open a Django shell, it complained that urandom was missing from the os module. I guess almost any module will be broken.
Apparently, the solution is dead simple. Just re-create the virtualenv environment:
(depending on how you created it in the same place). All the modules you’ve already installed should keep working as before (at least it was that way for me).
I have less and less time to blog and write stuff lately, so it’s a good opportunity to catch up on old things I did. Back in the happy days when I used Gentoo, one of the irritating issues I faced was messed-up file type associations. The MIME type of some files was recognized incorrectly and, as a result, KDE offered to open files with unsuitable applications. To debug this, I wrote a small Python script that shows how KDE applications are associated with MIME types and what MIME type is inferred from each file.
The script works by querying KMimeType and KMimeTypeTrader. It does three things:
Given a MIME type, show its hierarchy and a list of applications associated with it.
Given an application, list all the MIME types it’s associated with.
Given a file, show its MIME type (and also the accuracy, which allows one to know why that MIME type was selected, although I admit that in the two years since I wrote it, I forgot how it works :))
Firefox 3 started to store its cookies in an SQLite database instead of the old plain-text cookies.txt. While Python’s cookielib module could read the old cookies.txt file, it doesn’t handle the new format. The following Python snippet takes a CookieJar object and the path to Firefox’s cookies.sqlite (or a copy of it) and fills the CookieJar with the cookies from cookies.sqlite.
import sqlite3
import cookielib

def get_cookies(cj, ff_cookies):
    con = sqlite3.connect(ff_cookies)
    cur = con.cursor()
    cur.execute("SELECT host, path, isSecure, expiry, name, value FROM moz_cookies")
    for item in cur.fetchall():
        # Positional arguments follow the order of cookielib.Cookie's constructor.
        c = cookielib.Cookie(0, item[4], item[5],
            None, False,
            item[0], item[0].startswith('.'), item[0].startswith('.'),
            item[1], False,
            item[2],  # secure flag, from the isSecure column
            item[3], item[3]=="",
            None, None, {})
        print(c)
        cj.set_cookie(c)  # store the cookie in the jar
    con.close()
It works well for me, except that apparently Firefox doesn’t save session cookies to the disk at all.
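On Python 3, cookielib was renamed to http.cookiejar. A rough port of the same idea could look like the sketch below. It assumes the moz_cookies columns used in the query above; `load_firefox_cookies` is a hypothetical name, and treating a missing or zero expiry as "discard at end of session" is my own guess, not something taken from Firefox’s documentation.

```python
import sqlite3
from http.cookiejar import Cookie, CookieJar


def load_firefox_cookies(cj, ff_cookies):
    """Fill the CookieJar `cj` from a Firefox cookies.sqlite file."""
    con = sqlite3.connect(ff_cookies)
    try:
        rows = con.execute(
            "SELECT host, path, isSecure, expiry, name, value FROM moz_cookies"
        ).fetchall()
    finally:
        con.close()
    for host, path, is_secure, expiry, name, value in rows:
        cj.set_cookie(Cookie(
            0, name, value,                       # version, name, value
            None, False,                          # port, port_specified
            host, host.startswith('.'), host.startswith('.'),
            path, False,                          # path, path_specified
            bool(is_secure),                      # secure
            expiry, not expiry,                   # expires, discard (assumption)
            None, None, {}))                      # comment, comment_url, rest
    return cj


# Usage (the path is a placeholder for a copy of your profile's file):
# jar = load_firefox_cookies(CookieJar(), "cookies.sqlite")
```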
