PDF Redacting

I am trying to build a workflow that programmatically redacts PDFs based on a large key (as an example, I have about 6000 text blocks I want removed). I have been trying to use Stirling PDF, but nothing seems to work; I can't even get it to run. I have it hosted locally, but there doesn't seem to be a way to get it to work or recognize anything.

@RamseyHanna welcome to the KNIME forum. Here are two approaches to search for content in a PDF file using R. Could you elaborate on what you mean by "large key" and what the result should look like, ideally with a sample file or workflow?

The large key is basically just a list of strings that I want redacted. Everything I want redacted is in running text, not in tables.

In my case, I need CASRNs and chemical names redacted, and I would be pulling those from an Excel file; that is the "key". The final result should be the same PDF, but with every string from the key redacted. You can do this manually in Adobe, but I have thousands of documents that I need to do this to.
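For readers with the same task, here is a minimal sketch in Python of what "same PDF, with every key string redacted" could look like. It is not a KNIME or Stirling PDF feature. It assumes the third-party PyMuPDF library (imported as `fitz`) is installed and that the key has already been read from the Excel file into a list of strings; `clean_key` and `redact_pdf` are hypothetical helper names.

```python
def clean_key(raw_terms):
    """Strip whitespace and drop blank or duplicate key entries (case-insensitive)."""
    seen = set()
    terms = []
    for term in raw_terms:
        term = str(term).strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms


def redact_pdf(src_path, dst_path, terms):
    """Black out every occurrence of each term; return the number of hits."""
    import fitz  # PyMuPDF, assumed installed (pip install pymupdf)

    hits = 0
    doc = fitz.open(src_path)
    for page in doc:
        for term in terms:
            # search_for returns the rectangles where the term appears on the page.
            for rect in page.search_for(term):
                page.add_redact_annot(rect, fill=(0, 0, 0))
                hits += 1
        # Removes the text under the annotations instead of only covering it.
        page.apply_redactions()
    doc.save(dst_path)
    doc.close()
    return hits
```

Usage would be along the lines of `redact_pdf("in.pdf", "out.pdf", clean_key(key_column))`. Two caveats: `search_for` is case-insensitive and, as far as I know, matches substrings, so short key entries may over-redact; and scanned PDFs without a text layer will produce no hits.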

@RamseyHanna I take "redacted" to mean that you want the PDF to be flagged and not delivered, or something similar. So basically you want to identify which PDFs contain these strings. I think this can be done either with PDF Parser or with the R package shown in the example. Maybe you can provide us with an example that demonstrates your use case, perhaps using dummy data.
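The "which PDFs contain these strings" check can be sketched in Python, independently of how the text is extracted (PDF Parser, R, or anything else). The sketch assumes the page text is already available as a plain string; `build_matcher` and `find_key_hits` are hypothetical names, and the sample strings are dummy data.

```python
import re


def build_matcher(terms):
    """Compile all key entries into one case-insensitive pattern."""
    cleaned = {t.strip() for t in terms if t.strip()}
    if not cleaned:
        raise ValueError("the key contains no usable entries")
    # Longest first, so a longer name wins over a shorter entry it starts with.
    ordered = sorted(cleaned, key=len, reverse=True)
    # re.escape keeps hyphens, brackets etc. in chemical names literal.
    return re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)


def find_key_hits(text, matcher):
    """Return the distinct key strings found in the text, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in matcher.finditer(text)})


matcher = build_matcher(["Benzene", "50-00-0", "Sodium chloride", "Sodium"])
hits = find_key_hits("Contains sodium chloride and BENZENE.", matcher)
```

A document with a non-empty hit list is one that needs attention. Compiling the whole key into a single pattern means each text is scanned once rather than once per key entry, which matters with several thousand entries.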

You can then build a loop to process your PDFs one file at a time.
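Outside KNIME, such a loop could look like the following Python sketch. It assumes all PDFs sit in one folder; `process_folder` is a hypothetical helper, and `handle_pdf` stands for whatever per-file step you use (searching, flagging, or redacting).

```python
from pathlib import Path


def process_folder(in_dir, out_dir, handle_pdf):
    """Call handle_pdf(src, dst) for every PDF in in_dir; collect results per file."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        try:
            results[src.name] = handle_pdf(src, dst)
        except Exception as exc:
            # One broken file should not stop a run over thousands of documents.
            results[src.name] = f"failed: {exc}"
    return results
```

Writing to a separate output folder keeps the originals untouched, and the returned dictionary gives a per-file log you can review afterwards.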
