Changing PDF Metadata with Python

While updating a PDF recently, I noticed some metadata I wanted to change and a few annotations that were hidden from view but still in the file. The "Get Info" pane in Preview on OS X doesn't provide a metadata editor, and neither does its Export function, so it seemed like a good opportunity to learn a bit more about the PDF standard and the Python packages for getting the job done. Adobe Acrobat or another GUI would've been much faster, but I'll likely need to do this programmatically again at some point, like those of you who might've found this post by searching your favorite privacy-preserving search engine for "change pdf metadata in python". So here we go.

Before starting, I hopped into a new folder and created a git repository with a first commit of my original PDF in case anything went wrong. Then I ran conda create --name pdf python=3.8.1 and conda activate pdf to set up an Anaconda virtual environment named pdf to keep my work isolated. I first tried Anaconda because it made launching Jupyter notebooks with various preinstalled packages a breeze, and I've since found it to be a simple way of managing Python packages and dependencies.

After browsing in a few places for PDF editing libraries, I first tried pdfrw. Editing metadata based on the code in this example in its GitHub repo was easy, but the library had no support for dealing with annotations. Next up was PyPDF2, but its removeLinks() method removed annotations (as expected) along with other metadata like the title and author info I wanted to preserve (unexpected).

PyMuPDF turned out to have exactly what I needed. I found the deleteAnnot function in the documentation right away and got the annotation stripping working pretty quickly. While updating the modification date, I noticed a date string format I hadn't seen before. Thankfully, there's StackOverflow to the rescue with this entry showing how the elements of the date-time string break down, with links to the references. Then it was just a matter of formatting the datetime to the string I needed. The basics of my script are below:
import datetime
import fitz  # PyMuPDF; newer releases rename the camelCase calls below
             # (delete_annot, set_metadata, del_xml_metadata)

doc = fitz.open("existing_document.pdf")

for idx, page in enumerate(doc):
  # page.annots() is a generator, so ask a fresh one for the first
  # remaining annotation after every deletion until none are left
  while True:
    annot = next(page.annots(), None)
    if annot is None:
      break
    page.deleteAnnot(annot)
  print(f'annotations from page {idx + 1} removed')

# edit a copy of the metadata dictionary, then write it back in one go
metadata = dict(doc.metadata)

# use more recent date format (no trailing apostrophe)
# https://stackoverflow.com/questions/41661477/what-is-the-correct-format-of-a-date-string
creation_date = metadata.get('creationDate') or ''
if creation_date.endswith("'"):
  metadata['creationDate'] = creation_date[:-1]

# funky ISO 32000-1:2008 date format: "D:" prefix, timestamp, then UTC offset
# see https://wwwimages2.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
now_utc = datetime.datetime.now(datetime.timezone.utc)
metadata['modDate'] = now_utc.strftime("D:%Y%m%d%H%M%SZ00'00")

doc.setMetadata(metadata)

# drop the XML (XMP) metadata so it can't contradict the values set above
# see https://pymupdf.readthedocs.io/en/latest/document/#setmetadata-example
doc._delXmlMetadata()
doc.save("spiffy_new_document.pdf", garbage=4)
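
The date handling is the fiddliest part of the script, so here is a small standalone sketch of it using only the standard library. The helper name to_pdf_date is my own invention for illustration, not part of PyMuPDF, and it assumes the layout described in the Stack Overflow entry: a D: prefix, the timestamp, then either Z00'00 for UTC or a signed HH'mm offset with no trailing apostrophe.

```python
import datetime


def to_pdf_date(moment):
    """Format a timezone-aware datetime as a PDF date string.

    Hypothetical helper: produces D:YYYYMMDDHHmmSS followed by Z00'00 for
    UTC or a signed HH'mm offset otherwise.
    """
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Whole minutes east (+) or west (-) of UTC
    total_minutes = int(moment.utcoffset().total_seconds() // 60)
    stamp = moment.strftime("D:%Y%m%d%H%M%S")
    if total_minutes == 0:
        return stamp + "Z00'00"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{stamp}{sign}{hours:02d}'{minutes:02d}"


if __name__ == "__main__":
    print(to_pdf_date(datetime.datetime.now(datetime.timezone.utc)))
```

Requiring an aware datetime is a deliberate choice here: a naive one would silently be stamped with whatever offset the formatter guessed.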
Maybe there are better libraries or more Pythonic ways of doing the above; please leave a comment on the GitHub gist with any suggestions. Note that you might still find surprises in the file itself. I thought the removals above were good enough until my curiosity got the better of me and I opened the file in a text editor. It was mostly garbled and not meant to be viewed as plain text, yet I still found my name in what looked like a popup annotation object, despite thinking they were all gone! No comments or annotations showed up in any GUI, though. I suspected some corrupt data or a hidden object was left over, since the file had been around for years and had passed through many editors. Who knows.
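
Eyeballing the file in a text editor can be automated, at least roughly. The sketch below is a standard-library approximation rather than a real PDF parser: the function name find_leftovers is hypothetical, it assumes the string is stored as plain ASCII/UTF-8 or UTF-16BE, and it only tries to inflate Flate-compressed streams, skipping anything else.

```python
import re
import zlib

# Naive match for the body of every stream object in the file
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_leftovers(pdf_bytes, needle):
    """Return (location, encoding) pairs for each place the needle shows up."""
    encodings = {
        "utf-8": needle.encode("utf-8"),
        "utf-16-be": needle.encode("utf-16-be"),
    }
    chunks = [("raw file", pdf_bytes)]
    for number, match in enumerate(STREAM_RE.finditer(pdf_bytes)):
        try:
            # decompressobj tolerates trailing bytes after the compressed data
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream, or a filter this sketch ignores
        chunks.append((f"stream {number}", inflated))
    return [
        (where, name)
        for where, data in chunks
        for name, pattern in encodings.items()
        if pattern in data
    ]
```

Calling find_leftovers(open("spiffy_new_document.pdf", "rb").read(), "My Name") gives a quick list of where to look more closely; an empty list is not proof the string is gone, only that this simple scan didn't see it.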

This answer led me to the qpdf library, and I was able to quickly decompress the file, change the string I noticed (preserving the byte length as the answer hints, otherwise the file breaks), and recompress it. This final bit of the exercise served as a good reminder of just how much info gets packed into the innards of files or put in places you're unlikely to notice. As documents age, get updated, or get passed around organizations, it's important to remember to occasionally take a look at what's lurking in the metadata.
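
For the byte-length-preserving edit, here is a minimal sketch of the idea in plain Python. The function name replace_same_length is hypothetical, and it assumes the file has already been decompressed so the string is visible as literal bytes, and that padding the shorter replacement with spaces is acceptable for the field being changed.

```python
def replace_same_length(data, old, new, pad=b" "):
    """Swap old for new in data without changing the total length.

    The replacement is padded out to the length of the original so that
    every byte offset recorded elsewhere in the file still points at the
    right place.
    """
    if len(pad) != 1:
        raise ValueError("pad must be a single byte")
    if len(new) > len(old):
        raise ValueError("replacement is longer than the original string")
    padded = new.ljust(len(old), pad)
    return data.replace(old, padded)


if __name__ == "__main__":
    before = b"/T (Jane Doe)"
    after = replace_same_length(before, b"Jane Doe", b"Anon")
    print(after, len(before) == len(after))
```

Refusing a longer replacement outright is the cautious option; making it fit would mean rewriting the file's offsets, which is exactly the breakage the answer warns about.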
