tarsum-0.2 – A read only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, a read-only tool was my original intention. When I decided I wanted the script in bash for simplicity, I gave up on the idea and settled for extracting the files and then going over them to calculate their checksums.

So when Jon Flowers asked in the comments of the original tarsum post about the possibility of getting the checksums of the files in a tar file without extracting the whole archive, I decided to re-tackle the problem.

This time I chose Python, and by using the tarfile and hashlib modules I came up with a solution that goes over tar files and calculates the checksums without extracting the files to disk. However, some sacrifices were made in the backward compatibility of the output. I’ve tried to make the interface similar to the old one, and have kept all the command-line options. Instead of passing the name of a program that calculates the checksums (such as sha1sum) as the argument to --checksum, you pass the name of the checksum algorithm, such as md5, sha1, sha256 or sha512 (or any other supported by hashlib).

Other changes were made so tar files can be piped directly into tarsum (which also works transparently with bzip2 and gzip compression).
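To illustrate how the algorithm names map onto hashlib, here is a minimal sketch. The helper name digest_chunks is made up for this example and is not part of tarsum; it only shows that hashlib.new() takes the algorithm name as a string, and that feeding the data in pieces gives the same digest as hashing it in one go.

```python
import hashlib


def digest_chunks(chunks, algorithm="md5"):
    """Hash an iterable of byte chunks with a hashlib algorithm chosen by name.

    Hypothetical helper for illustration; not part of tarsum itself.
    """
    h = hashlib.new(algorithm)  # raises ValueError for an unknown name
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    pieces = [b"hello ", b"world"]
    for name in ("md5", "sha1", "sha256", "sha512"):
        print(name, digest_chunks(pieces, name))
    # The names accepted on this particular Python build:
    print(sorted(hashlib.algorithms_available))
```

Which names are accepted depends on the Python build and the OpenSSL library behind it, so checking hashlib.algorithms_available is probably the safest way to see what a given machine supports.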

tarsum < sometarfile.tar.gz > sometarfile.tar.gz.md5
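What makes this kind of piping possible is, as far as I can tell, tarfile's stream mode ("r|*"), which only ever reads forward and detects the compression by itself. The sketch below is an illustration under assumptions: NonSeekable, make_tar_gz and stream_digests are names invented for the example, and an in-memory archive wrapped in a stream that refuses to seek stands in for a real pipe.

```python
import hashlib
import io
import tarfile


class NonSeekable(io.RawIOBase):
    """Expose bytes as a read-only stream that cannot seek, much like a pipe."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        return self._inner.readinto(b)


def make_tar_gz(files):
    """Build a gzip-compressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(mode="w:gz", fileobj=buf) as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_digests(stream, algorithm="md5"):
    """Return {member name: hex digest} by reading the archive front to back."""
    digests = {}
    with tarfile.open(mode="r|*", fileobj=stream) as tar:
        for member in tar:
            if not member.isfile():
                continue
            h = hashlib.new(algorithm)
            f = tar.extractfile(member)
            # In stream mode a member has to be read before moving to the next.
            for chunk in iter(lambda: f.read(64 * 1024), b""):
                h.update(chunk)
            digests[member.name] = h.hexdigest()
    return digests


if __name__ == "__main__":
    archive = make_tar_gz({"a.txt": b"alpha", "b.txt": b"beta"})
    for name, digest in stream_digests(NonSeekable(archive), "sha256").items():
        print(digest, name)
```

Nothing is written to disk here: each member is hashed as its bytes pass through the stream, which is the same idea the script relies on.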

Performance-wise, according to some tests I’ve carried out, the new version is faster than the old one with big tar files, but slower with small archives (which I find less important).


#! /usr/bin/env python
# Copyright (C) 2008-2009 by Guy Rutenberg
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the
# Free Software Foundation, Inc.,
# 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
import hashlib
import sys
import tarfile
from optparse import OptionParser


def tarsum(input_file, hash, output_file):
    """
    input_file -- A FILE object, opened in binary mode, to read the tar file from.
    hash -- The name of the hash to use. Must be supported by hashlib.
    output_file -- A FILE to write the computed signatures to.
    """
    # "r|*" reads the archive as a forward-only stream with transparent
    # compression detection, so pipes work as input.
    tar = tarfile.open(mode="r|*", fileobj=input_file)
    chunk_size = 100*1024
    store_digests = {}
    for member in tar:
        if not member.isfile():
            continue
        f = tar.extractfile(member)
        h = hashlib.new(hash)
        # Read in chunks so big members are never held in memory at once.
        data = f.read(chunk_size)
        while data:
            h.update(data)
            data = f.read(chunk_size)
        output_file.write("%s %s\n" % (h.hexdigest(), member.name))


def main():
    version = ("%prog 0.2.1\n"
               "Copyright (C) 2008-2009 Guy Rutenberg <http://www.guyrutenberg.com/contact-me>")
    usage = ("%prog [options] TARFILE\n"
             "Print a checksum signature for every file in TARFILE.\n"
             "With no TARFILE, or when TARFILE is -, read standard input.")
    parser = OptionParser(usage=usage, version=version)
    parser.add_option("-c", "--checksum", dest="checksum", type="string",
        help="use HASH for calculating the checksums. [default: %default]", metavar="HASH",
        default="md5")
    parser.add_option("-o", "--output", dest="output", type="string",
        help="save signatures to FILE.", metavar="FILE")
    (option, args) = parser.parse_args()

    output_file = sys.stdout
    if option.output:
        output_file = open(option.output, "w")

    # Tar data is binary: on Python 3 use the raw buffer behind stdin.
    input_file = getattr(sys.stdin, "buffer", sys.stdin)
    if len(args)==1 and args[0]!="-":
        input_file = open(args[0], "rb")

    tarsum(input_file, option.checksum, output_file)


if __name__ == "__main__":
    main()


Update 2009-08-12: Removed excess argument to tarsum() and switched the filemode to r|* (from r:*). Bumped version string.
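The script only writes a listing; checking an archive against a previously saved listing is a separate step. One possible sketch follows, assuming the listing uses the "DIGEST NAME" layout that tarsum() writes. The helpers parse_listing and verify are hypothetical and not part of tarsum.

```python
import hashlib
import io
import tarfile


def parse_listing(text):
    """Map member name -> hex digest from lines of the form 'DIGEST NAME'."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace only, so names may contain spaces.
        digest, name = line.split(None, 1)
        expected[name] = digest
    return expected


def verify(stream, listing_text, algorithm="md5"):
    """Return {name: True/False} for every file named in the listing.

    stream is a binary file object holding the (possibly compressed) tar
    archive. Files in the archive that the listing does not mention are ignored.
    """
    expected = parse_listing(listing_text)
    report = {}
    with tarfile.open(mode="r|*", fileobj=stream) as tar:
        for member in tar:
            if not member.isfile() or member.name not in expected:
                continue
            h = hashlib.new(algorithm)
            f = tar.extractfile(member)
            for chunk in iter(lambda: f.read(100 * 1024), b""):
                h.update(chunk)
            report[member.name] = (h.hexdigest() == expected[member.name])
    # Anything listed but never seen in the archive counts as a failure.
    for name in expected:
        report.setdefault(name, False)
    return report
```

A call such as verify(open("sometarfile.tar.gz", "rb"), open("sometarfile.tar.gz.md5").read()) would then go over the archive once, without extracting anything, and report which listed files still match.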

7 thoughts on “tarsum-0.2 – A read only version of tarsum”

  1. Hi.

    Tried out your program using Ubuntu 9.04, but I encountered two problems.

    First, in the last line of main() you called tarsum() with 4 parameters while tarsum() is defined to accept only 3. Python aborts the program because of this. I’m not a Python programmer, but when I took out the 4th parameter, the program ran.

    Then I tried to use the program on a 21GB bzipped tarball that contains a 100GB file. While the program runs, it runs for less than 1 second. It also prints out a checksum that is different. The file is ok when tested using “bzip2 -t”.

    After some research, I changed the filemode from “r:*” to “r|*” to use stream IO. After this change, “tarsum-0.2 file.tar.bz2” now aborts, but “bunzip2 -c file.tar.bz2 | tarsum-0.2” now seems to work.

  2. @Mike: Thanks for pointing out the tarsum() signature error. I guess that when I cleaned up the script before the release, I missed that I had also changed the signature.

    I admit that I’ve never tested the script with a file as big as yours. What surprised me is that setting the filemode to r|* slowed the script down a bit (at least for my ~300MB tar). I assumed that giving up random access would make things faster, but it didn’t.

    Anyway I’ve fixed both issues. Thanks again.

  3. I had to do a workaround due to disk space limitations since I cannot extract my archives which contain extremely large log files so this is what I came up with …

    IFS=$'\n'

    for line in $(cat ${md5file})
    do
    md5=$(echo ${line}|awk '{print $1}')
    filename=$(echo ${line}|awk '{print $2}')
    md5archivefile=$(tar -zxOvf myfile.tgz ${filename} 2>/dev/null | md5sum - | awk '{print $1}')

    if [ ! "${md5archivefile}" == "${md5}" ]; then
    echo "NOT OK: $filename,$md5,$md5archivefile"
    else
    echo "OK: $filename,$md5,$md5archivefile"
    fi
    done

  4. Hi,

    Thanks for the script !

    I have built a script that hashes the files in a directory and its subdirectories when run for the first time, then upon subsequent runs it will compute the new hash only for files whose size and/or last change time have changed, thus allowing me to update the hashes for really huge directories (several hundred gigabytes) in very little time.

    I’ll use your script to also hash the contents of archives (I use my script to detect duplicate files, so it’ll be nice for it to also find files which have a duplicate within an archive).

    By looking at your code, I think the “store_digests = {}” line in the tarsum function is useless, since that variable is never read from.

    Cheers,
    Georges Dupéron
