Update: I’ve released tarsum-0.2, a new version of tarsum.
Some time ago, I got a hard disk back from data recovery. One of the annoying issues I encountered with the recovered data was corrupted files: some files looked like they were recovered successfully, but their content was corrupted. The ones that were configuration files were usually easy to detect, as they raised errors in the programs that tried to use them. But when such an error occurs in some general text file (or inside the data of an SQL dump), the file may seem perfectly fine unless closely inspected.
I have a habit of storing old backups on CDs (they are initially made to online storage); I do it in order to reduce backup costs. But the recovered/corrupted data issue raised some concerns about my ability to recover using these disks. Assuming that I have a disk failure and can’t recover from my online backups for some reason, how can I check the integrity of my CD backups?
Storing and comparing a hash signature only for the whole archive is almost useless. It lets you validate whether all the files are probably fine, but it can’t tell one corrupted file in the archive apart from a completely corrupted archive. My idea was to calculate a checksum (hash) for each file in the data and store the signatures in a way that would allow me to see which individual files are corrupted.
This is where tarsum comes to the rescue. As its name implies, it calculates a checksum for each file in the archive. You can download tarsum from here.
Using tarsum is pretty straightforward.
tarsum backup.tar > backup.tar.md5
This calculates the MD5 checksums of the files. You can specify other hashes as well by passing a tool that calculates them (it must work like md5sum).
tarsum --checksum=sha256sum backup.tar > backup.tar.sha256
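What “work like md5sum” means in practice is not spelled out here. A reasonable reading, and it is only an assumption, is that the tool prints the hex digest first, followed by the name of what it hashed. The small sketch below compares the output format of md5sum and sha256sum on the same input; the variable names are made up for the illustration.

```shell
#!/bin/sh
# Illustration only: shows the shared "<hex digest>  <name>" output format.
# When reading from stdin, both tools print "-" as the name.
md5_line=$(printf 'abc' | md5sum)
sha_line=$(printf 'abc' | sha256sum)

printf '%s\n' "$md5_line"
printf '%s\n' "$sha_line"
```

Any tool that follows the same convention should be a plausible candidate for the --checksum option.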
To verify the integrity of the files inside the archive, we use the diff command:
tarsum backup.tar | diff backup.tar.md5 -
where backup.tar.md5 is the original signature file we created. This is possible because the signatures are sorted alphabetically by file name inside the archive, so the order of the files is always the same.
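To see why the sorted, per-file signatures make diff useful, here is a minimal sketch that does not need tarsum installed. The function per_file_sums is a hypothetical stand-in written for this example (it is not tarsum’s code): it unpacks the archive into a scratch directory and prints one checksum line per file, sorted by path. The file names and contents are made up.

```shell
#!/bin/sh
# Hypothetical stand-in for tarsum, for illustration only: unpack the archive
# into a scratch directory and print one "checksum  path" line per file,
# sorted by path so that two runs can be compared with diff.
per_file_sums() {
    pfs_scratch=$(mktemp -d) || return 1
    tar -xf "$1" -C "$pfs_scratch" &&
        (cd "$pfs_scratch" && find . -type f -exec md5sum {} + | LC_ALL=C sort -k 2)
    pfs_status=$?
    rm -rf "$pfs_scratch"
    return "$pfs_status"
}

# Build a small archive and record its per-file signatures.
demo=$(mktemp -d)
mkdir "$demo/data"
printf 'alpha\n' > "$demo/data/a.txt"
printf 'beta\n'  > "$demo/data/b.txt"
tar -cf "$demo/backup.tar" -C "$demo/data" a.txt b.txt
per_file_sums "$demo/backup.tar" > "$demo/backup.tar.md5"

# Simulate a bad recovery: b.txt comes back with different content.
printf 'bet4\n' > "$demo/data/b.txt"
tar -cf "$demo/recovered.tar" -C "$demo/data" a.txt b.txt

# diff exits non-zero when the inputs differ, hence the "|| true".
changed=$(per_file_sums "$demo/recovered.tar" | diff "$demo/backup.tar.md5" -) || true
printf '%s\n' "$changed"
rm -rf "$demo"
```

Only the line for b.txt shows up in the diff output, while a.txt is not mentioned. A single hash over the whole archive would have reported a mismatch without saying where.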
Note that if you use an up-to-date version of GNU tar, tarsum can also operate directly on compressed archives (e.g. tar.bz2, tar.gz).
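Presumably this works because the archive is handed to tar, and a recent GNU tar recognizes the compression format on its own when reading. That is an assumption about tarsum’s internals, but the tar behaviour itself can be checked with a short sketch (the file names are made up):

```shell
#!/bin/sh
# Sketch: assumes a tar that detects compression on read (recent GNU tar does).
gz_demo=$(mktemp -d)
printf 'hello\n' > "$gz_demo/note.txt"
tar -czf "$gz_demo/backup.tar.gz" -C "$gz_demo" note.txt

# No -z here: tar inspects the file and picks the decompressor itself.
listing=$(tar -tf "$gz_demo/backup.tar.gz")
printf '%s\n' "$listing"
rm -rf "$gz_demo"
```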
This is perfect for what I need. However, I need to verify checksums of archives stored on magnetic tapes. Is there a way of using tarsum in this way?
e.g. tarsum /dev/nst0
Hi Swami,
I have little knowledge of tape drives. As far as I understand, “tarsum /dev/nst0” should work, provided /dev/nst0 can be treated as a regular tar file (which is my understanding).
Otherwise, you should copy the tar off the tape drive to a temporary location and then use tarsum as usual.
This looks like it could solve an issue I have, but is there a way to get the md5sum without actually writing the files back to disk? The reason is that the actual files can get very large and cause potential disk space problems…
Also, I get an error “mktemp: invalid option -- -”; this is on RHEL4.6
Thanks.
Hi Jon,
The current version of tarsum cannot calculate the hashes without writing the files back to the disk. It’s a known limitation I settled for when I decided to write it in Bash. I’m planning to rewrite it in Python, and then there won’t be any need to actually extract the files back to the disk. So if you’re interested, I suggest checking the blog every once in a while.
About the mktemp error, it looks to me like you’re using an old version of mktemp. I checked the script with mktemp 6.10.
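In the meantime, one possible way to hash archive members without extracting them is GNU tar’s --to-command option, which pipes each file’s data to a command instead of writing it to disk. This is a different technique from the one tarsum uses, the option may be missing from older tar releases, and stream_sums is a hypothetical helper written for this sketch:

```shell
#!/bin/sh
# Sketch only: stream_sums is a hypothetical helper, not part of tarsum.
# Assumes GNU tar (for --to-command) and md5sum.
stream_sums() {
    # GNU tar runs the command once per regular file, with the file's data on
    # stdin and its archive path in $TAR_FILENAME; nothing is written to disk.
    tar -xf "$1" --to-command='printf "%s  %s\n" "$(md5sum | cut -d" " -f1)" "$TAR_FILENAME"' \
        | LC_ALL=C sort -k 2
}

# Small demonstration archive (the archive itself still lives on disk;
# only its members are never extracted).
work=$(mktemp -d)
printf 'abc' > "$work/b.txt"
printf 'hello\n' > "$work/a.txt"
tar -cf "$work/demo.tar" -C "$work" b.txt a.txt

stream_out=$(stream_sums "$work/demo.tar")
printf '%s\n' "$stream_out"
rm -rf "$work"
```

The output has the same “checksum, then file name” shape, sorted by name, so it can be compared with diff in the same way.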
Yeah, that’s Red Hat for you. I’m working on a program for a production machine, so updating things is not really an option. We tend to have to stick to what our major software provider uses, and right now that’s RHEL4. So the mktemp version is 1.5. Just a bit older.
I’ll be interested in checking out the version that doesn’t write to disk. Thanks for the help.
Hi Jon,
I’ve released the new version of tarsum (0.2) in
http://www.guyrutenberg.com/2009/04/29/tarsum-02-a-read-only-version-of-tarsum/
I hope it will suit your needs too.