tarsum – Calculate Checksum for Files inside Tar Archive

Update: I’ve released tarsum-0.2, a new version of tarsum.

Some time ago, I got a hard disk back from data recovery. One of the annoying issues I encountered with the recovered data was corrupted files: some files looked as if they had been recovered successfully, but their content was corrupted. Corrupted configuration files were usually easy to detect, because they raised errors in the programs that tried to use them. But when such corruption occurs in a general text file (or inside the data of an SQL dump), the file may seem perfectly fine unless closely inspected.

I have a habit of storing old backups on CDs (they are initially made to online storage) in order to reduce backup costs. But the corrupted recovered data raised some concerns about my ability to recover using these discs. Assuming that I have a disk failure and can’t recover from my online backups for some reason, how can I check the integrity of my CD backups?

Storing and comparing only a hash signature for the whole archive is almost useless. It lets you validate whether all the files are probably fine, but it can’t tell one corrupted file in the archive apart from a completely corrupted archive. My idea was to calculate a checksum (hash) for each file in the archive and store the signatures in a way that would allow me to see which individual files are corrupted.

This is where tarsum comes to the rescue. As its name implies, it calculates a checksum for each file in the archive. You can download tarsum from here.

Using tarsum is pretty straightforward.

tarsum backup.tar > backup.tar.md5

This calculates the MD5 checksums of the files. You can use other hashes as well by passing a tool that calculates them (it must work like md5sum).

tarsum --checksum=sha256sum backup.tar > backup.tar.sha256
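
To illustrate the idea of per-file checksums with a selectable hash, here is a minimal Python sketch. It is not how tarsum itself is implemented; it uses Python’s standard tarfile and hashlib modules, and the names member_checksums, format_manifest and build_demo_archive are made up for this example. The “digest, two spaces, name” output layout follows md5sum, and assuming that tarsum prints the same layout is a guess on my part.

```python
import hashlib
import io
import tarfile


def member_checksums(archive, algorithm="md5", chunk_size=64 * 1024):
    """Return (name, hex digest) pairs for the regular files in an open
    tarfile.TarFile, sorted by member name."""
    results = []
    for member in archive:
        if not member.isfile():
            continue  # skip directories, symlinks, devices, ...
        digest = hashlib.new(algorithm)
        stream = archive.extractfile(member)
        # Read in chunks so a large member is never held in memory at once.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
        results.append((member.name, digest.hexdigest()))
    return sorted(results)


def format_manifest(checksums):
    """Render the pairs as 'digest  name' lines, the layout md5sum prints
    (assumed here, not confirmed, to match tarsum's output)."""
    return "".join(f"{digest}  {name}\n" for name, digest in checksums)


def build_demo_archive(files):
    """Hypothetical helper: pack a {name: bytes} dict into an in-memory tar."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_archive({"b.txt": b"second\n", "a.txt": b"first\n"})
    # Mode "r:*" lets tarfile detect gzip/bz2/xz compression on its own.
    with tarfile.open(fileobj=demo, mode="r:*") as archive:
        print(format_manifest(member_checksums(archive, "sha256")), end="")
```

For a real archive, replace the demo buffer with tarfile.open("backup.tar", mode="r:*"). Sorting by member name is what keeps the listing stable between runs.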

To verify the integrity of the files inside the archive we use the diff command:

tarsum backup.tar | diff backup.tar.md5 -

where backup.tar.md5 is the original signature file we created. This is possible because the signatures are sorted alphabetically by the file name inside the archive, so the order of the files is always the same.
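
If you would rather get a grouped report than read raw diff output, the comparison can be sketched in a few lines of Python. This is only an illustration and not part of tarsum: the function names are hypothetical, and it assumes each signature line has the md5sum layout (digest, a space, a one-character mode marker, then the file name).

```python
def parse_manifest(text):
    """Map file name -> digest from md5sum-style lines.

    Assumed layout: '<digest> <marker><name>', where the marker is a
    space (text mode) or '*' (binary mode).
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        digest, _, rest = line.partition(" ")
        entries[rest[1:]] = digest  # drop the one-character mode marker
    return entries


def compare_manifests(original, current):
    """Report which files are missing, added, or changed between two
    signature listings."""
    old = parse_manifest(original)
    new = parse_manifest(current)
    return {
        "missing": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


if __name__ == "__main__":
    original = "aaa  a.txt\nbbb  b.txt\nccc  c.txt\n"
    current = "aaa  a.txt\nbXb  b.txt\nddd  d.txt\n"
    for kind, names in compare_manifests(original, current).items():
        print(kind, names)
```

A file listed under “changed” has a different checksum than when the signature file was created, which is exactly the per-file corruption that a single whole-archive hash cannot pinpoint.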

Note that if you use an updated version of GNU tar, tarsum can also operate directly on compressed archives (e.g. tar.bz2, tar.gz).

6 thoughts on “tarsum – Calculate Checksum for Files inside Tar Archive”

  1. This is perfect for what I need. However I need to verify checksums of archives stored on magnetic tapes. Is there a way of using tarsum in this way?

    e.g. tarsum /dev/nst0

  2. Hi Swami,

    I have little knowledge of tape drives. As far as I understand, “tarsum /dev/nst0” should work, provided that /dev/nst0 can be treated as a regular tar file.

    Otherwise, you should copy the tar from the tape drive to a temporary location and then use tarsum as usual.

  3. This looks like it could solve an issue I have, but is there a way to get the md5sum without actually writing the file back to disk? The reason is that the actual files can get very large, which could cause disk space problems…

    I also get an error, “mktemp: invalid option -- -”. This is on RHEL 4.6.

    Thanks.

  4. Hi Jon,

    The current version of tarsum cannot calculate the hashes without writing the files back to the disk. It’s a known limitation I accepted when I decided to write it in Bash. I’m planning to rewrite it in Python, and then there won’t be any need to actually extract the files back to the disk. So if you’re interested, I suggest checking the blog every once in a while.

    About the mktemp error, it looks to me like you’re using an old version of mktemp. I checked the script with mktemp 6.10.

  5. Yeah, that’s Red Hat for you. I’m working on a program for a production machine, so updating things is not really an option. We tend to have to stick to what our major software provider uses, and right now that’s RHEL4. So the mktemp version is 1.5. Just a bit older.

    Will be interested in checking out the version that doesn’t write to disk. Thanks for the help.
