Update: I’ve released tarsum-0.2, a new version of tarsum.
Some time ago, I got a hard disk back from data recovery. One of the annoying issues I encountered with the recovered data was corrupted files. Some files looked like they had been recovered successfully, but their content was corrupted. The ones that were configuration files were usually easy to detect, as they raised errors in the programs that tried to use them. But when such an error occurs in some general text file (or inside the data of an SQL dump), the file may seem perfectly fine unless closely inspected.
I have a habit of storing old backups on CDs (they are initially made to online storage); I do it in order to reduce backup costs. But the recovered/corrupted data issue raised some concerns about my ability to recover using these disks. Assuming that I have a disk failure, and I can’t recover from my online backups for some reason, how can I check the integrity of my CD backups?
Storing and comparing only a hash signature for the whole archive is almost useless. It allows you to validate whether all the files are probably fine, but it can’t tell apart one corrupted file in the archive from a completely corrupted archive. My idea was to calculate a checksum (hash) for each file in the data and store the signatures in a way that would allow me to see which individual files are corrupted.
This is where tarsum comes to the rescue. As its name implies, it calculates a checksum for each file in the archive. You can download tarsum from here.
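To make the per-file idea concrete, here is a minimal Python sketch that uses only the standard library. It is not tarsum’s actual implementation: the function names tar_checksums and build_demo_archive, and the md5sum-style output lines, are assumptions made for illustration.

```python
import hashlib
import io
import tarfile


def tar_checksums(fileobj, algorithm="md5"):
    """Return sorted (member name, hex digest) pairs for regular files in a tar archive."""
    results = []
    # Mode "r:*" lets tarfile detect gzip/bzip2/xz compression transparently.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, devices, ...
            digest = hashlib.new(algorithm)
            data = archive.extractfile(member)
            # Hash in chunks so large members do not need to fit in memory.
            for chunk in iter(lambda: data.read(64 * 1024), b""):
                digest.update(chunk)
            results.append((member.name, digest.hexdigest()))
    # Sort by name so two runs over the same archive are directly comparable.
    return sorted(results)


def build_demo_archive(files):
    """Build a small in-memory tar archive from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_archive({"b.txt": b"world\n", "a.txt": b"hello\n"})
    for name, digest in tar_checksums(demo):
        print(f"{digest}  {name}")
```

Each file gets its own digest, so a single damaged file changes exactly one line of the output instead of invalidating one checksum for the whole archive.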
Using tarsum is pretty straightforward.
tarsum backup.tar > backup.tar.md5
This calculates the MD5 checksums of the files. You can specify other hashes as well by passing a tool that calculates them (it must work like md5sum).
tarsum --checksum=sha256sum backup.tar > backup.tar.sha256
To verify the integrity of the files inside the archive, we use the diff command:
tarsum backup.tar | diff backup.tar.md5 -
where backup.tar.md5 is the original signature file we created. This is possible because the signatures are sorted alphabetically by the file name inside the archive, so the order of the files is always the same.
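If you would rather get a list of affected files than read raw diff output, the comparison can be sketched in a few lines of Python. This assumes the signature files use md5sum-style lines (a digest, whitespace, then the file name). The helper names parse_manifest and compare_manifests are hypothetical, and the digests in the demo are made-up placeholders rather than real checksums.

```python
def parse_manifest(text):
    """Parse md5sum-style lines ("<digest>  <name>") into a {name: digest} dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first run of whitespace only, so names may contain spaces.
        digest, name = line.split(None, 1)
        entries[name] = digest
    return entries


def compare_manifests(original, current):
    """Report which files changed, disappeared, or appeared between two manifests."""
    old, new = parse_manifest(original), parse_manifest(current)
    return {
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
        "missing": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }


if __name__ == "__main__":
    # Placeholder digests for illustration only.
    original = "aaaa  etc/app.conf\nbbbb  data/dump.sql\ncccc  notes.txt\n"
    current = "aaaa  etc/app.conf\nffff  data/dump.sql\n"
    report = compare_manifests(original, current)
    for kind in ("changed", "missing", "added"):
        for name in report[kind]:
            print(f"{kind}: {name}")
```

Because the report is keyed by file name, it does not depend on the order of the lines in the signature files, unlike a plain diff.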
Note that if you use a recent version of GNU tar, tarsum can also operate directly on compressed archives (e.g. tar.bz2, tar.gz).