PDF Mail-Merge in Haskell

August 27, 2007, at 09:51 PM

So, I've been taking a break from my other projects by working on a component that I'll eventually need, although the system it will be part of is nowhere near ready yet. This is essentially mail-merge for PDF, except that I'll need to support both text and images. I'll also need to support multiple fonts.

At first I looked at Perl, which has several libraries that deal with PDFs. PDF::Reuse is almost exactly what I want, and nicely documented. Unfortunately, I didn't really understand how I was supposed to use it until I'd already read up on the format, and by now I've reimplemented enough of its functionality that it would no longer be a time-saver.

Mostly, when I was looking at existing work, I ran into many other Perl modules which were not PDF::Reuse, and which were all very poorly documented, so finally I had to turn to the official PDF reference to understand what they were supposed to be doing. And then I realized that the reason it's so confusing is that most of these libraries each handle just one piece of the problem - dealing only with the document's internal index, or only with generating content streams (the PostScript-like portion of the file), or only with parsing the syntax.

So I have a syntax parser/lexer, which I put together in Parsec. I'm starting to really dislike Parsec, but I have to concede that I seriously doubt I could have supported the full syntax in just a couple of hours without it. I also wrote Show instances for everything, so that my PDF.Syntax module can both generate and consume PDF syntax. I even handled all the weird little escape sequences in the string syntax; sweet, huh? The biggest holdup with this part was that, ouch! Different strings can be in different text encodings! You can even mix and match line endings within the same file! So I had to poke deeper than I wanted to into Haskell's IO stuff in order to come up with something 8-bit-clean.
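To give a flavor of the string escapes, here is a minimal sketch of the generating side. It is not the actual PDF.Syntax code; the name escapePdfString is made up for illustration, and it assumes every Char in the input stands for one byte (0 to 255). It escapes parentheses and backslashes, uses the named escapes for common control characters, and falls back to three-digit octal for anything outside printable ASCII.

```haskell
import Data.Char (ord)
import Numeric (showOct)

-- Hypothetical helper: render a byte string (one Char per byte, 0..255)
-- as a PDF literal string, e.g. (like this).
escapePdfString :: String -> String
escapePdfString s = "(" ++ concatMap esc s ++ ")"
  where
    esc '\\' = "\\\\"
    esc '('  = "\\("
    esc ')'  = "\\)"
    esc '\n' = "\\n"
    esc '\r' = "\\r"
    esc '\t' = "\\t"
    esc '\b' = "\\b"
    esc '\f' = "\\f"
    esc c
      | n < 32 || n > 126 = '\\' : octal3 n  -- non-printable: \ddd octal
      | otherwise         = [c]
      where n = ord c
    -- Always pad to three digits so a following digit is not misread
    -- as part of the escape.
    octal3 n = let o = showOct n "" in replicate (3 - length o) '0' ++ o
```

Balanced parentheses do not strictly need escaping in PDF, but escaping all of them is simpler and always valid.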

Syntax is only half the issue. A PDF file, it turns out, is not a linear stream. Well, figures, right? Anyway, it has an index at the end of the file, the xref table, which maps object IDs to byte offsets. I was thinking I would have to deal with space allocation as I deleted and replaced objects, and then regenerate the xref table from scratch, but it turns out that the design is nicer than that: you could do that, but you can also leave the existing document untouched and just append the new or replaced objects, followed by another xref table which contains entries only for the changed objects, plus a pointer back to the old table. Since all the internal data structures are built in terms of object IDs anyway, the ability to remap object IDs means you can do anything in append-only mode that you could do by writing an entire new file. Sweet, huh?
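Here is a rough sketch of what the appended index looks like, assuming every changed object has generation number 0 and giving each object its own one-entry subsection (a real updater would probably group consecutive object numbers). The function names are invented for this example. Each entry must be exactly 20 bytes: a 10-digit offset, a 5-digit generation, the letter n for "in use", and a two-byte line ending. The /Prev key in the trailer is the pointer back to the old table.

```haskell
import Text.Printf (printf)

-- One in-use xref entry; always exactly 20 bytes including " \n".
xrefEntry :: Int -> Int -> String
xrefEntry offset gen = printf "%010d %05d n \n" offset gen

-- An xref section listing only the changed objects.
-- Input: (object number, byte offset of the new copy in the file).
xrefSection :: [(Int, Int)] -> String
xrefSection objs = "xref\n" ++ concatMap sub objs
  where
    -- Subsection header "first count", then one entry (generation 0 assumed).
    sub (num, off) = printf "%d 1\n" num ++ xrefEntry off 0

-- Trailer for an incremental update.
--   size     : one more than the highest object number in the file
--   rootObj  : object number of the document catalog
--   prevXref : byte offset of the previous xref table
--   newXref  : byte offset of the xref section being appended
updateTrailer :: Int -> Int -> Int -> Int -> String
updateTrailer size rootObj prevXref newXref =
  printf "trailer\n<< /Size %d /Root %d 0 R /Prev %d >>\nstartxref\n%d\n%%%%EOF\n"
         size rootObj prevXref newXref
```

An update is then the original bytes, followed by the new objects, followed by xrefSection and updateTrailer, with the offsets measured from the start of the whole file.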

It took me a while to get that working, because a PDF viewer gives you either total success or total failure, so it was hard to see where in my generated output the problem was. So to do it in pieces, first I tried making a PDF entirely by hand. Ouch! Painful! Had to use a hex editor to find offsets into what was otherwise more or less a text file! So I wrote some code to generate a trivial PDF, and refactored that until it turned into a PDF updater. Which now works.

Right now, after basically eight hours of coding, what I have it doing is opening up a file and drawing a predefined graphic either on top of or below every page. I took the example code for the graphic from Adobe's reference.
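One way to get the above/below behavior, sketched here under my own assumptions rather than copied from the actual code, relies on the fact that a page's /Contents entry may be an array of content streams that are drawn in order. Putting the new stream first draws it underneath the page; putting it last draws it on top. The types Layer and Ref and both function names are made up for this example.

```haskell
-- Where the new graphic goes relative to the existing page content.
data Layer = Below | OnTop deriving (Eq, Show)

-- An indirect object reference: object number and generation.
data Ref = Ref Int Int deriving (Eq, Show)

-- Add a new content stream to a page's list of content streams.
addLayer :: Layer -> Ref -> [Ref] -> [Ref]
addLayer Below new existing = new : existing      -- drawn first, ends up underneath
addLayer OnTop new existing = existing ++ [new]   -- drawn last, ends up on top

-- Render the /Contents entry for the replacement page dictionary.
renderContents :: [Ref] -> String
renderContents refs = "/Contents [ " ++ concatMap render refs ++ "]"
  where render (Ref n g) = show n ++ " " ++ show g ++ " R "
```

For example, renderContents (addLayer OnTop (Ref 12 0) [Ref 4 0]) gives the entry for a page whose original stream is object 4 and whose overlay is object 12. In practice the new stream should probably save and restore the graphics state (the q and Q operators) so the layers don't disturb each other.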

Making it generate text will be easy; it just has to output a content stream with the appropriate text operators. Making it do so in any given font won't be much harder; the document contains copious metadata, so it just has to look up which fonts the document already has embedded, and then decide which to use.
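For reference, a minimal sketch of such a content stream. The function names are hypothetical, the font name (something like F1) is assumed to already exist in the page's font resources, the string is assumed to be already escaped for PDF literal syntax, and every Char is assumed to be a single byte so that length gives the right /Length.

```haskell
-- Operators to show one string: BT/ET bracket a text object,
-- Tf selects font and size, Td positions, Tj shows the string.
textOps :: String -> Int -> Int -> Int -> String -> String
textOps fontName size x y str = unlines
  [ "BT"
  , "/" ++ fontName ++ " " ++ show size ++ " Tf"
  , show x ++ " " ++ show y ++ " Td"
  , "(" ++ str ++ ") Tj"
  , "ET"
  ]

-- Wrap content-stream data in an indirect stream object (generation 0).
-- /Length counts only the bytes between "stream\n" and the line ending
-- that precedes "endstream".
streamObject :: Int -> String -> String
streamObject objNum body =
  show objNum ++ " 0 obj\n"
    ++ "<< /Length " ++ show (length body) ++ " >>\n"
    ++ "stream\n" ++ body ++ "\nendstream\nendobj\n"
```

So streamObject 12 (textOps "F1" 12 72 720 "Hello") would be an object that writes Hello in 12-point type, one inch in from the left and ten inches up from the bottom of a page in default coordinates.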

Making it embed images will be a bit harder. Apparently PDF supports some of the same codecs that popular image formats are based on, particularly JPEG. So I have a choice of whether to slice-and-dice the input image and just embed its existing compressed chunks into the PDF, or whether to uncompress the image, convert it to a different format, and reencode it. There are existing tools for both. The former apparently saves a lot of space, but is a lot more fiddly and there are fewer examples for me to work from. So, I'm reading up on that.
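For the JPEG case, the embed-as-is route might look roughly like the sketch below. PDF's /DCTDecode filter takes JPEG data directly, so the main thing the code has to dig out of the file is the pixel size, which lives in the start-of-frame segment. This is only an illustration with made-up function names: it takes the file as a list of byte values, assumes every segment before the frame header carries a length field (true of the usual APPn, DQT, DHT and COM segments), and hard-codes a three-component /DeviceRGB image at 8 bits per component.

```haskell
-- Find (width, height) in a JPEG given as byte values 0..255.
jpegSize :: [Int] -> Maybe (Int, Int)
jpegSize (0xFF : 0xD8 : rest) = go rest          -- must begin with the SOI marker
  where
    go (0xFF : m : hi : lo : body)
      | isSOF m = case body of
          -- precision, then height and width as big-endian 16-bit values
          (_precision : h1 : h2 : w1 : w2 : _) ->
            Just (w1 * 256 + w2, h1 * 256 + h2)
          _ -> Nothing
        -- The segment length includes its own two bytes; skip the rest.
      | otherwise = go (drop (hi * 256 + lo - 2) body)
    go _ = Nothing
    -- SOF markers are 0xC0..0xCF, except DHT (C4), JPG (C8) and DAC (CC).
    isSOF m = m >= 0xC0 && m <= 0xCF && m `notElem` [0xC4, 0xC8, 0xCC]
jpegSize _ = Nothing

-- Dictionary for an image XObject whose stream data is the JPEG file itself.
-- ASSUMPTION: RGB, 8 bits per component; len is the JPEG file size in bytes.
imageDict :: Int -> Int -> Int -> String
imageDict w h len =
  "<< /Type /XObject /Subtype /Image /Width " ++ show w
    ++ " /Height " ++ show h
    ++ " /ColorSpace /DeviceRGB /BitsPerComponent 8"
    ++ " /Filter /DCTDecode /Length " ++ show len ++ " >>"
```

A grayscale or CMYK JPEG would need a different /ColorSpace, which the component count in the same frame header can tell you.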

I also haven't decided yet whether I want to store image files as blobs within the database (which means I can take advantage of the version control I'll be implementing for the rest of the database, and users won't have to explicitly think about moving them around), or as separate files (which would make the database smaller). Well, plenty of time to think about that. The PDF code will be designed so it doesn't care where its data is coming from, anyhow.
