So what is the fastest way to concatenate bytes in Python? I decided to benchmark a few common patterns and see how they hold up. The scenario I tested is iterative concatenation of 1024-byte blocks until 1MB of data is accumulated. This is very similar to what one might do when reading a large file into memory, so the test is fairly realistic.
The first implementation is the naive one.
def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
The naive implementation is known to be very slow: bytes in Python are an immutable type, so every concatenation has to reallocate the bytes and copy them. Just how slow is it? About 330 times slower than the append-and-join pattern, which was a popular (and efficient) way to concatenate strings in old Python versions:
def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)
It relies on the fact that appending to a list is efficient, and that b''.join can preallocate all the needed memory up front and perform a single copy. As you can see below, it is much more efficient than the naive implementation.
Python 2.6 introduced bytearray as an efficient mutable byte sequence. Being mutable allows one to "naively" concatenate to the bytearray and achieve great performance, more than 30% faster than the join pattern above.
def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
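The difference in behavior can be illustrated with a small snippet (my illustration, not part of the benchmark): += on bytes rebinds the name to a brand-new object, while += on a bytearray mutates the same buffer in place.

```python
# bytes are immutable: += creates a new object and rebinds the name.
a = b'x'
alias_a = a
a += b'y'                # a now points to a new bytes object
print(alias_a)           # b'x' -- the original object is untouched

# bytearray is mutable: += extends the same buffer in place.
ba = bytearray(b'x')
alias_ba = ba
ba += b'y'
print(alias_ba)          # bytearray(b'xy') -- the alias sees the change
print(alias_ba is ba)    # True -- still the same object
```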
What about preallocating the memory?
def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
While this sounds like a good idea, Python's copy semantics turn out to be very slow: this version ran 5 times slower. Python also offers memoryview:
memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.
The idea of access to the internal data without unnecessary copying sounds great.
def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
And it does run almost twice as fast as the preallocated bytearray implementation, but is still about 2.5 times slower than the simple bytearray implementation.
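To see the no-copy behavior that memoryview provides, here is a small illustration (my example, separate from the benchmark): assigning through a memoryview slice writes directly into the underlying bytearray.

```python
# A memoryview slice is a view into the same buffer, not a copy,
# so writes through the view mutate the underlying bytearray.
buf = bytearray(8)
view = memoryview(buf)
view[0:4] = b'abcd'          # writes straight into buf
print(bytes(buf))            # b'abcd\x00\x00\x00\x00'
```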
I ran the benchmark using the timeit module, taking the best of five runs for each method. The CPU was an Intel i7-8550U.
import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))
for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))
Conclusion
The simple bytearray implementation was the fastest method, and it is as simple as the naive implementation. Preallocating doesn't help, because it looks like Python can't perform the slice-assignment copy efficiently.
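As a final sanity check (my addition, not part of the original benchmark), the naive and the winning bytearray implementations can be verified to produce identical 1MB payloads:

```python
# Both implementations should build the same 1 MB of b'a' bytes.
def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

assert f() == bytes(h()) == b'a' * 2**20
print(len(f()))  # 1048576
```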