Wednesday, August 02, 2023

pdf find and replace

There was a request to find/replace some boilerplate text in a few thousand pdfs. One option was to try https://pdfreplacer.com/ and see if the watermarked output does the job.

(You may need to find/replace one line at a time.)

Just for interest, I looked at scripting options and found it was quite complicated. Some notes below.

Learning more about pdf

Visual Studio: File > Open With > Binary Editor

But searching there does not find the text, because the content streams are zlib compressed.
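To search inside the compressed data from a script, the streams have to be inflated first. A minimal sketch, assuming the streams use plain zlib/Flate compression with no predictor and no encryption; the function names are made up for illustration, and a real parser should use the /Length entry rather than a regex to find the end of a stream:

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Return the decompressed payload of every stream that inflates cleanly."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            results.append(zlib.decompress(raw))
        except zlib.error:
            # Not zlib data (e.g. a JPEG image or an uncompressed stream): skip it.
            continue
    return results

def find_in_pdf(pdf_bytes, needle):
    """Return the indexes of the inflated streams that contain `needle`."""
    return [i for i, data in enumerate(inflate_streams(pdf_bytes)) if needle in data]

if __name__ == "__main__":
    # Tiny hand-made fragment, not a complete valid pdf.
    sample = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
              + zlib.compress(b"BT (Hello) Tj ET")
              + b"\nendstream\nendobj\n")
    print(find_in_pdf(sample, b"Hello"))
```

Even after inflating, the search only works when the text is stored as plain literal strings.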




($30 for no watermarks)





The above gist does not work in our case, since the text is not stored as ANSI/ASCII text but as hex strings in the font's own encoding.
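What the text looks like in an inflated content stream can be sketched as follows. This assumes hex string operands such as <00470048> made of 2-byte codes (which is what a font with a 2-byte encoding uses); the sample stream and the function name are invented for illustration:

```python
import re

# A hex string operand looks like <0047...>; "<<" opens a dictionary instead,
# so the lookarounds reject doubled angle brackets.
HEX_STRING_RE = re.compile(rb"(?<!<)<([0-9A-Fa-f\s]+)>(?!>)")

def hex_string_codes(content_stream):
    """Return, for each hex string found, the list of 16-bit codes it contains."""
    out = []
    for match in HEX_STRING_RE.finditer(content_stream):
        digits = b"".join(match.group(1).split()).decode("ascii")
        if len(digits) % 4:
            continue  # not a whole number of 2-byte codes; ignored in this sketch
        out.append([int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)])
    return out

if __name__ == "__main__":
    stream = b"BT /F1 12 Tf <00470048> Tj ET"
    print(hex_string_codes(stream))
```

The codes that come out are indexes into the font, not characters, so a plain text search has nothing to match against.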

font subsetting


 
Neither is useful.
Method 1 changes the formatting.
Method 2 turns the pdf into an image.



Since xpdf's pdftotext seems to work, checking out its code:
PDFDoc.cc
Page.cc


Although not a library in the traditional sense, Pdfedit has scriptable editing capabilities, but it requires Qt. PoDoFo probably fits your requirements best. There's also PdfHummus.

Pdfedit (unmaintained)

PoDoFo (maintained, but unstable)

PdfHummus (Apache license)





page 16 says
The Font encoding object specifies how to translate this value into a character (and these are defined in Appendix D of the PDF Reference specification)

But these seem to be similar to Unicode/ASCII, with d being 0x64, not 0x47 as seen in our document at an offset of 18020 or so.





Using RUPS

From there, we find that the font's encoding is Identity-H.

So, looking for a ToUnicode CMap, found the stream:

to understand bfchar and bfrange in cmap

We see a pair
<0047> <0064>
So, 0x47 in the pdf is mapped to Unicode 0x64 (d)?! Yes!
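That lookup can be scripted. A minimal sketch of a ToUnicode CMap parser, assuming the stream has already been inflated to text and only uses the simple bfchar form and the three-value bfrange form (the array form of bfrange and targets outside the 16-bit range are not handled). Only the pair <0047> <0064> comes from our document; the other entries in the sample are made up:

```python
import re

BFCHAR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
BFRANGE_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def section_bodies(cmap_text, keyword):
    """Return the text between each 'begin<keyword>' and 'end<keyword>'."""
    pattern = r"begin%s(.*?)end%s" % (keyword, keyword)
    return re.findall(pattern, cmap_text, re.DOTALL)

def parse_to_unicode(cmap_text):
    """Build a dict {font code: unicode string} from bfchar/bfrange sections."""
    mapping = {}
    for body in section_bodies(cmap_text, "bfchar"):
        for src, dst in BFCHAR_RE.findall(body):
            # The destination is UTF-16BE text written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for body in section_bodies(cmap_text, "bfrange"):
        for lo, hi, start in BFRANGE_RE.findall(body):
            first = int(start, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(first + offset)
    return mapping

def decode(codes, mapping):
    """Translate font codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

if __name__ == "__main__":
    sample = """
    2 beginbfchar
    <0047> <0064>
    <0048> <0065>
    endbfchar
    1 beginbfrange
    <0003> <0005> <0041>
    endbfrange
    """
    table = parse_to_unicode(sample)
    print(decode([0x47, 0x48], table))
```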




We can use
instead of RUPS to browse the PDF (but RUPS can also edit/change the PDF.)


Unfortunately, there seems to be no mapping for the space character (0x0020) in the CMap. So we can't just replace the characters with spaces or something similar.
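Whether a replacement string can be written with the codes the font provides can be checked by inverting the ToUnicode table. A minimal sketch, assuming a dict of font code to character has already been read from the CMap; the function name and the two-entry table are illustrative only:

```python
def encode_replacement(text, to_unicode):
    """Return (hex_string, missing).

    hex_string is the text written as 2-byte font codes, or None when some
    character has no code; missing lists the characters that have none.
    """
    reverse = {}
    for code, chars in to_unicode.items():
        reverse.setdefault(chars, code)  # keep the first code seen per character
    codes, missing = [], []
    for ch in text:
        if ch in reverse:
            codes.append(reverse[ch])
        else:
            missing.append(ch)
    if missing:
        return None, missing
    return "<" + "".join("%04X" % c for c in codes) + ">", missing

if __name__ == "__main__":
    table = {0x47: "d", 0x48: "e"}  # illustrative, not the full table of our pdf
    print(encode_replacement("de", table))
    print(encode_replacement("d e", table))
```

With a table like ours, any replacement containing a space comes back with the space listed as missing.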

The "correct" way to do it might be to use the "stamp" function to overwrite the offending text with what we want.

The "hacky" way might be to replace the characters with something else and make them not visible, if that is feasible.
