There was a request to find and replace some boilerplate text in a few thousand PDFs. One option was to use https://pdfreplacer.com/
and see if it does the job with a watermark.
(You may need to find/replace one line at a time.)
Just for interest, I looked at scripting options and found it was quite complicated. Some notes below.
Learning more about pdf
Visual Studio: File > Open With > Binary Editor
But searching there does not find the text, because the content is zlib compressed.
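To look inside the compressed parts without a binary editor, something like the following rough sketch may help. It assumes the streams use plain FlateDecode (zlib) with no predictor or encryption, and it finds them with a regex rather than a real PDF parser, so treat it as a diagnostic aid only. The function name `inflate_streams` is made up for this note.

```python
import re
import zlib

def inflate_streams(pdf_bytes):
    """Yield the decompressed body of every stream that zlib can inflate."""
    # Stream data sits between the "stream" keyword (plus EOL) and "endstream".
    pattern = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
    for match in pattern.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the EOL bytes left before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not zlib data (some other filter, or an uncompressed stream).
            continue
```

Running this over the raw bytes of a file (read with `open(path, "rb").read()`) and printing each result should show the page content operators such as Tf and Tj in readable form.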
($30 for no watermarks)
google search github for pdf tools
Hosted, docker-based (but no find-replace text in the API)
The above gist does not work in our case, since the text is not encoded as ANSI/ASCII text but in the font's own encoding, written in hex.
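For illustration, here is a small sketch of how the hex strings could be pulled out of a decompressed content stream. It assumes the text is shown with the Tj operator or inside TJ arrays, and that the hex strings contain no embedded whitespace. The function name `shown_hex_strings` is hypothetical, and the regexes are nowhere near a full content-stream parser.

```python
import re

def shown_hex_strings(content):
    """Return hex strings used with Tj first, then those inside TJ arrays."""
    # Simple form: <00470048> Tj
    found = re.findall(rb"<([0-9A-Fa-f]+)>\s*Tj", content)
    # Array form: [<0047> -20 <0048>] TJ (the numbers are kerning adjustments)
    for array in re.findall(rb"\[(.*?)\]\s*TJ", content, re.DOTALL):
        found += re.findall(rb"<([0-9A-Fa-f]+)>", array)
    return [h.decode("ascii") for h in found]
```

The values that come back are glyph codes in the font's encoding, not characters, which is why a plain text search for the boilerplate finds nothing.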
font subsetting
pdftotext seems to be from xpdf and not pdftk
list of pdf tools and what they can do
and qpdf too
to see how we can convert the Tj Tf etc to text
The <0065> does not just mean 0x65 in ASCII; what it means depends on the font's encoding.
python possibilities
Neither is useful.
Method 1 changes the formatting.
Method 2 turns the PDF into an image.
Since xpdf's pdftotext seems to work, I checked out its code:
PDFDoc.cc
Page.cc
page 16 says
The Font encoding object specifies how to translate this value into a character (and these are defined in Appendix D of the PDF Reference specification)
But these seem to be similar to Unicode / ASCII, with d being 0x64 - not 0x47 as seen in our document at an offset of 18020 or so.
pg 27
RUPS free tool for diagnostics of PDF
Using RUPS
Thence to find the encoding Identity-H
So, looking for a ToUnicode CMap
found the stream,
to understand bfchar and bfrange in cmap
we see a pair
<0047> <0064>
So, 0x0047 in the PDF is mapped to Unicode 0x0064, i.e. "d"?! Yes!
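The lookup can be sketched in a few lines of Python. This is only a sketch under assumptions: it reads bfchar sections only (bfrange is left out), it assumes 2-byte codes as with Identity-H, and the function names `parse_bfchar` and `decode_hex_string` are made up for this note.

```python
import re

def parse_bfchar(cmap_text):
    """Return {code: unicode_string} from the bfchar sections of a ToUnicode CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE and may hold more than one character.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_hex_string(hex_string, mapping):
    """Decode a PDF hex string such as <00470047>, assuming 2-byte codes."""
    digits = hex_string.strip("<>")
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    # Codes with no mapping become U+FFFD so gaps stay visible.
    return "".join(mapping.get(code, "\ufffd") for code in codes)
```

With the pair from our document, `decode_hex_string("<0047>", parse_bfchar("1 beginbfchar\n<0047> <0064>\nendbfchar"))` gives "d".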
We can use
instead of RUPS to browse (but RUPS can also edit/change the PDF).
Unfortunately, there is no space 0x0020 mapped? So we can't just replace the characters with 0x0020 or something similar.
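To check up front whether a replacement string can be written with the characters the subset font actually has, the mapping can be reversed. A rough sketch follows, assuming a one-to-one mapping held in a dict of code to character (the two entries in the usage note are illustrative, and only the 0x0047 one comes from our document). The name `encode_text` is hypothetical.

```python
def encode_text(text, to_unicode):
    """Encode text as a PDF hex string using the reverse of a ToUnicode mapping."""
    reverse = {char: code for code, char in to_unicode.items()}
    missing = sorted({ch for ch in text if ch not in reverse})
    if missing:
        # The subset font has no glyph for these, e.g. a space.
        raise ValueError(f"no glyph codes for: {missing!r}")
    return "<" + "".join(f"{reverse[ch]:04X}" for ch in text) + ">"
```

For example, `encode_text("d", {0x0047: "d", 0x0048: "e"})` gives `<0047>`, while any text containing a space raises ValueError with that mapping. Note that ToUnicode is meant for text extraction, so reversing it is a shortcut that only holds when each code maps to exactly one character.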
The "correct" way to do it might be to use the "stamp" function to overwrite the offending text with what we want.
The "hacky" way might be to replace the characters with something else and make them not visible, if that is feasible.