Scanning Lecture Notes – Compression

A new semester is about to begin, so once again I set out to organize and scan my lecture notes. This time I intend to invest more time in investigating and perfecting the process. Hopefully, I’ll present my conclusions in a few posts, each focusing on a different aspect.

In the first post, I’ll discuss the various ways to compress the scanned lecture notes. Because lecture notes (at least mine) aren’t especially colorful, as I only use one pen at a time, I want the result to be black and white (line art). This keeps the lecture notes readable while preserving a small size per page (as you can see in Some Tips on Scanning Lecture Notes).

Generating the Benchmarks

When scanning documents, the PDF acts as a container and the scanned images are stored inside it as binary streams. Given a PDF, you can find out which compression was used with strings and grep, looking for strings like /F /SomethingDecode or /Filter /SomethingDecode. For example:

$ strings lectures.pdf | grep Decode
  /F /FlateDecode

(the last line gets repeated for every page). This PDF was created directly by XSane. A bit of searching reveals that this means the document uses Deflate (or Zip) as its compression algorithm. However, there are several other compression algorithms that can be used, and each has a different filter. Below (taken from the ImageMagick documentation) is a concise list of the possible compression algorithms; each mapping can be double-checked with the same strings trick, as sketched right after the list:

"-compress none"	'/ASCII85Decode'

"-compress zip"		'/FlateDecode'

"-compress jpeg"	'/DCTDecode'

"-compress lzw"		'/LZWDecode'

"-compress fax"		'/CCITTFaxDecode'

"+compress"
"-compress rle"
any thing else		'/RunLengthDecode'
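
To verify any of these mappings yourself, you can round-trip a single page and grep the result, just like above (page.pbm here is a placeholder for one of your scanned pages):

$ convert page.pbm -compress fax page_fax.pdf
$ strings page_fax.pdf | grep Decode   # should show /CCITTFaxDecode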

So, I’ve generated benchmark PDFs with LZW, Zip (what XSane uses), and Group 4 (which turned out to be identical to fax):

$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress lzw XXX_lzw.pdf
$ pdftk *_lzw.pdf cat output lzw.pdf

$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress zip XXX_zip.pdf
$ pdftk *_zip.pdf cat output deflate.pdf

$ ls *.pbm | xargs -I XXX convert XXX -density 600 -compress group4 XXX_gp4.pdf
$ pdftk *_gp4.pdf cat output group4.pdf

The -density 600 parameter indicates the DPI for the scans, and allows us to retain the correct physical dimensions in the PDF (which is important for documents).
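
As a sanity check for those dimensions, pdfinfo (from poppler-utils) prints the page size in points. For example, assuming letter-size pages scanned at 600 DPI (5100×6600 pixels), the math works out to 5100/600 = 8.5 in = 612 pt by 6600/600 = 11 in = 792 pt:

$ pdfinfo lzw.pdf | grep 'Page size'
# Expect something like: Page size: 612 x 792 pts (letter)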

ImageMagick has several other compression algorithms it supports, as can be seen via convert -list compress. The most interesting one that appears there is JBIG2. JBIG2 is a modern bi-tonal compression algorithm introduced in 2000, and it has been part of the PDF standard since version 1.4. As I understand it, it surpasses the Group 4 compression (which is also bi-tonal, and was designed for faxes), and it should work very similarly to the bi-tonal JB2 compression in DjVu.

Unfortunately, it seems that ImageMagick doesn’t support encoding JBIG2 images into PDFs (only decoding them), as trying to encode with it resulted in /RunLengthDecode streams. Since I’d heard that JBIG2 should give DjVu fair competition in terms of compression, I looked for other means, and found jbig2enc. jbig2enc provides two useful programs: jbig2, which takes several images and builds the compression index for them, and pdf.py, which takes those indexes and embeds them inside a PDF. jbig2enc isn’t (yet) in Ubuntu’s repository, but it’s very easy to compile (few dependencies, automake).
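
For the record, the build boils down to the usual automake routine; this sketch assumes leptonica (jbig2enc’s main dependency) is already installed, and that the snapshot ships a configure script:

$ cd agl-jbig2enc-d5cb3d5
$ ./configure && make   # builds src/jbig2; pdf.py is a plain script in the root

So I’ve used it to create a PDF with JBIG2 compression: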

$ ./agl-jbig2enc-d5cb3d5/src/jbig2 -s --pdf *.pbm
# This creates a bunch of output.* files, which can be discarded afterwards
$ python agl-jbig2enc-d5cb3d5/pdf.py output > jbig2_pbm.pdf
$ rm output.*

(I’ve also had to patch pdf.py to use the right DPI, as jbig2 can’t extract it from the PBM files.)

For the sake of comparison, I’ve also used minidjvu to create both lossless and lossy DjVu files:

$ minidjvu --dpi 600 -a 0 *.pbm lossless.djvu
$ minidjvu --dpi 600 --lossy *.pbm lossy.djvu

Results

All the benchmarks were run on 38 pages of handwritten notes. The raw results (sizes in KiB, as reported by du) are:

$ du deflate.pdf lzw.pdf group4.pdf jbig2_pbm.pdf lossless.djvu lossy.djvu 
10036   deflate.pdf
9680    lzw.pdf
2716    group4.pdf
2136    jbig2_pbm.pdf
2164    lossless.djvu
1828    lossy.djvu
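
To put those numbers in perspective, dividing by the 38 pages gives the per-page cost; a quick awk one-liner over the du output does it:

$ du deflate.pdf jbig2_pbm.pdf | awk '{ printf "%s: %.0f KB/page\n", $2, $1/38 }'
deflate.pdf: 264 KB/page
jbig2_pbm.pdf: 56 KB/page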

Both Zip (Deflate) and LZW are general-purpose compression algorithms, so it’s not surprising that they perform worst. The old Group 4 performs considerably well, but is left behind by the modern options of JBIG2 and DjVu. The JBIG2 encoding I did was lossy (I had some problems with the lossless one), and it is comparable to the lossless DjVu file. The lossy DjVu encoding surpassed JBIG2, but not by far.

Conclusion

Using XSane’s (or probably any other scanning software’s) built-in PDF creation is a waste of bits. They just use the wrong tools for the job, and the PDF produced is way too large. DjVu is the best choice for lossless compression, and the difference between lossy and lossless is perceptually negligible. If you must use PDF, I suggest going for JBIG2. If you’re fine with DjVu, go for the lossless variant (as the difference in bytes isn’t big).

Further Work

So far I’ve assumed that I have black-and-white images. I’ve relied on a simple conversion from gray-scale, which doesn’t always perform as well as it should. It looks like better color separation can be achieved using ImageMagick and some smart filtering, especially when the notes are written with a pen whose ink isn’t black.
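
For reference, the simple conversion I’m referring to amounts to a plain global threshold, something along these lines (the file names and the 50% cutoff are just example values):

$ convert scan_gray.png -threshold 50% page.pbm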

Update: I’ve posted my results in Scanning Lecture Notes – Separating Colors.

The other, less significant, issue is automatically deskewing, i.e. aligning, the scanned images. This should remove small tilts introduced while scanning (and while writing).
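
ImageMagick already offers a candidate here in its -deskew operator, which estimates and undoes small rotations; a first attempt might look like this (40% is the commonly suggested threshold, and the file names are placeholders):

$ convert page.pbm -deskew 40% +repage page_deskewed.pbm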

I plan to cover both topics in the following posts.

Update 2012-10-12: The lossless DjVu creation command was missing -a 0. Fixing that resulted in only minor file size changes for lossless.djvu, so I took the liberty of not updating the figures.
