Few days ago I decided to check what is really the fastest method to iterate over strings in C++. As a string class I chose string class from STL as it is very popular and provides a couple of ways to iterate it. So how can one iterate over an std::string
?
- By using indexes. E.g.
str[i]
and runningi
from zero to the length of the string. - By using the
at
method.string::at(size_t pos)
provides similar interface to indexes with the exceptions that it checks whether the given position is past the end of the string and then throws an exception. One may see it as the safe version of the regular index. - Treating the string as a sequence of characters and and iterate over it using iterators.
- Using
string::c_str()
to get a pointer to a regular C string representation of the string stored in thestd::string
and treating it as array, e.g. using indexes to go over it. - The last way to iterate over the string is to get a pointer to a C string representation using
string::c_str()
and advancing the pointer itself to iterate over the string.
The third method is the native method of iterating over objects in STL, and like the last two it can’t be used if the iteration changes the string itself (e.g. inserting or deleting characters). The first and second method are similar to the fourth (treating the pointer to the C string as an array), except that they aren’t so problematic as the latter when changing the string. The second method is the safest as it’s the only one that does range checks and throws exception if trying to access positions which are outside the string.
To benchmark and find out which method is the fastest method to iterate over a string I’ve created a huge string of random characters ranging from ‘a’ to ‘z’ and five executables, each one implementing one of the above iteration methods to do a simple task
(count the number of occurrences of each letter). The string is fifty million characters long which, as the longer the string the less important the overhead becomes.
The executables for the benchmark of every version were compiled with the default setting of g++
(without optimization as the compiler might change the iteration methods when optimizing). The benchmark executables where timed by using the time
command and redirecting the executables output to /dev/null
. The tests were run both on 64bit Gentoo (with 1 GB RAM) and on 32bit Kubuntu (with 512 MB RAM), to make sure the overall results (which method it better not the runtime itself) isn’t system depended.
Now to the result itself. After running the benchmark couple of times I came up with the following conclusions: Don’t use iterators for string iteration. Iterators came last on every test and usually times up to three times more than the slowest method besides it. Iterators may be the native STL way to iterate over STL containers but it provides very slow way to so. While iterators can be called STL pointers as the work and behave much the same way pointers do, they didn’t preform even close to pointers. Iterating using pointers came up as the fastest way to iterate over a string with a small margin over string::at() and indexes (both over usual C strings representation and std::string). When inspecting the rest of the methods one may find that string::at() and the indexes came up with very close timings, but on most tests the indexes over std::string where faster. It is obvious why they should preform better than string::at(), as they don’t do range checks as the latter does. For some reason I got that iterating using indexes over the std::string directly is faster than iterating over the the C string representation (by very minor margin) and that the C string is timing roughly the same as the string::at().
To conclude the benchmark pointers are the fastest way to iterate by a small margin, but using the std::string indexes and string::at is preferable in my opinion as the performance difference isn’t that big and the latter methods provide safer way (that can handle string manipulation that may cause the string data to be copied to another place in the memory) than the pointers. The indexes over C string representation suffer the same disadvantages as the pointers but don’t operate as fast. Stay away from iterators at all costs! Iterators suffer similar disadvantages as pointers do (regarding string manipulation) and don’t give anything in return except for horrible run time.
My timings
Here is the output of one of the runs of the benchmark. The results where typical to almost all of the test. To try the benchmark on your computer, see instructions bellow.
guy@Guy_Computer ~/temp/stringbenchmark $ ./benchmark iteration using indexes real 0m0.690s user 0m0.644s sys 0m0.048s iteration using indexes over C string real 0m0.752s user 0m0.720s sys 0m0.028s iteration using string::at() real 0m0.762s user 0m0.708s sys 0m0.036s iteration using pointers over C string real 0m0.642s user 0m0.604s sys 0m0.036s iteration using iterators real 0m2.323s user 0m2.272s sys 0m0.052s
Running the Tests on Your System
If you want you can run these test on your system to see the exact results. To do so download the tar archive and go to the directory you downloaded it into. Open the command line on this directory and execute:
tar -zxvf stringbenchmark.tar.gz
cd stringbenchmark
make
./benchmark
You will see the test results printed on your screen. While exact timings might differ from run to run, the overall trends should be clear.
The results are right, without optimization.
However if I turn on optimizations, with -O2 I will get the following result:
jancsi@dew:~/$ ./benchmark
iteration using indexes
real 0m0.325s
user 0m0.280s
sys 0m0.040s
iteration using indexes over C string
real 0m0.327s
user 0m0.260s
sys 0m0.060s
iteration using string::at()
real 0m0.425s
user 0m0.348s
sys 0m0.068s
iteration using pointers over C string
real 0m0.703s
user 0m0.612s
sys 0m0.072s
iteration using iterators
real 0m0.355s
user 0m0.304s
sys 0m0.048s
J.
One method that was left out is using the .data member of std::string. This accesses the underlying data where the c_str() does not and cannot (std::string must be able to have embedded ”’s).
That method ought to be the fastest (hopefully tied with iterators):
char * cur = s.data();
char * end = cur + s.size();
for( ; cur!=end; ++cur )
{
f(*cur);
}
Your iterator code is flawed, move the .begin() and .end() code outside the for loop, otherwise the functions (which are not inline) will be run EVERY loop iteration!
Also you need to change the iter ‘tmp’ from tmp++ to ++tmp. The former is postfix, meaning the compiler will create a temporary object to store the contents of ‘tmp’, then increment it. Using prefix notation will elliminate the temporary.
Heres a better example:
std::string::iterator tmp(hugestring.begin());
std::string::iterator const tmp_end(hugestring.end());
for(; tmp != tmp_end; ++tmp)
{
//->Code Here
}
Note how I use the constructor call for storing the begin() and end() values. If not done this way, the compiler will create a temporary for both those lines!
@RichardK, Thanks for your comments and optimizations.
^ ^
Like RichardK said, not exactly apples to apples. Also, how often do you know the size of your string at compile time?
Very interesting article and comments