Pull data from PDF Parser using a Variable Loop to access multiple files

phick · December 14, 2021, 4:13pm

I am trying to create a workflow to pull a set of dates and batches from multiple PDF files. Each file contains a date and a list of batches (and much more information that I don’t need at this time). I have used the PDF Parser to pull the list of batches and the date from a file and append the date to each batch in a table, and this works fine when there is a single PDF in the input folder. However, I need to do this with a large set of files, and if there are multiple files the PDF Parser loads them all at once and outputs the batches and dates from all the files without keeping each file’s date and batches together.

I had hoped to use List Files and Table Row to Variable Loop to work through each file individually, so it would find the date in the file and add this to all the batches in that file before moving on to the next file. The problem is that the PDF parser will only take a folder as a variable and not a single file. Can anyone suggest an alternative?

izaychik63 · December 14, 2021, 4:29pm

You can use the Tika Parser as an alternative. It returns a Filepath column that you can use as a unique file identifier.

phick · December 14, 2021, 4:41pm

I’ve been having a look at the Tika Parser. It seems to be able to generate a table with the filename in one column and the content in another, which looks great, but I can’t get it to feed into the rest of the workflow. I get the error: No Column in spec compatible to “DocumentValue”.

The rest of the workflow runs Punctuation Erasure, Case Converter, Bag of Words Creator and then Term to String, which generates a list of all the words in the document. Some Rule-Based Row Filters then let me pull out exactly what I want (the batches use a consistent format). This is probably very messy, but it’s my first attempt at reading PDF files and the inputs are in a far from ideal format.
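
For anyone following along outside KNIME, the chain above can be roughly sketched in plain Python. This is only an illustration of the idea, not what the KNIME nodes do internally: the batch format (a "b" followed by six digits), the sample text and the function name `extract_batches` are all made up for the example.

```python
import re
import string

# Hypothetical batch format: "b" followed by six digits (after lower-casing).
BATCH_PATTERN = re.compile(r"b\d{6}")

# Map every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def extract_batches(text):
    """Roughly mimic punctuation erasure, case conversion, bag of words and a rule-based filter."""
    no_punct = text.translate(PUNCT_TO_SPACE)      # punctuation erasure
    lowered = no_punct.lower()                     # case conversion
    terms = list(dict.fromkeys(lowered.split()))   # unique terms, in first-seen order
    # Rule-based filter: keep only terms that match the batch pattern exactly.
    return [term for term in terms if BATCH_PATTERN.fullmatch(term)]


sample = "Report: Batch B123456, Batch B654321; retest of B123456."
print(extract_batches(sample))
```

Matching the whole term (rather than searching inside it) is a deliberate choice here, so that longer strings that merely contain something batch-like are not picked up.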

I’ll have a think/play and see if there is a better way to approach the problem.

Daniel_Weikert · December 14, 2021, 5:01pm

Can you use the Strings to Document node to convert the output into documents, which are then compatible with the document nodes you use for cleaning?
By the way, I guess you can’t share a workflow sample? It’s easier to help, even if it’s dummy data.
br

phick · December 14, 2021, 5:06pm

That looks very promising, I’ll look into that.

Unfortunately I can’t share the data, but it is essentially a file with a title including the date and a series of graphs for analysis of each batch. There is a lot of superfluous data in there, but for this process I’m looking to extract the batch numbers and tag them with the date of testing.

phick · December 15, 2021, 9:42am

Ok, after a little bit of tweaking the Tika Parser did the job. It meant I was able to pull the information out of each individual file separately while keeping track of the Filepath, so the association to the original file stayed intact, and then join everything together into the desired output table. The next problem is that the supplier of this data hasn’t always been consistent with their formatting.
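
In case it helps anyone else, the logic of keeping each date with its own file’s batches looks roughly like this in plain Python. It is a sketch under assumptions, not the actual workflow: the rows stand in for the parser’s Filepath and content columns, and the date format (dd/mm/yyyy), the batch format ("B" plus six digits), the file names and the function name `tag_batches_with_dates` are all hypothetical.

```python
import re

# Hypothetical formats, chosen only for illustration.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")
BATCH_PATTERN = re.compile(r"B\d{6}")


def tag_batches_with_dates(rows):
    """rows: iterable of (filepath, content) pairs, one pair per parsed file."""
    output = []
    for filepath, content in rows:
        # Take the first date found in this file; None flags a file with no recognisable date.
        date_match = DATE_PATTERN.search(content)
        date = date_match.group(0) if date_match else None
        # De-duplicate batches within the file, keeping first-seen order.
        for batch in dict.fromkeys(BATCH_PATTERN.findall(content)):
            output.append((filepath, batch, date))
    return output


rows = [
    ("reports/a.pdf", "Analysis 01/12/2021 B000001 B000002 B000001"),
    ("reports/b.pdf", "Analysis 08/12/2021 B000003"),
]
for row in tag_batches_with_dates(rows):
    print(row)
```

Because everything is keyed on the file path, the date is only ever attached to batches from the same file, however many files are loaded at once.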

Thanks for all the suggestions

andrejz · December 15, 2021, 1:25pm

Hi, @phick

To extract data from PDFs I use this as a starting point: PDF data extract.knwf (16.0 KB)

Put all the PDFs in the same directory and you can continue with the group loop node to loop over all documents by document ID.

Hope this helps.

Regards
Andrej

Daniel_Weikert · December 15, 2021, 5:35pm

Nice one, thanks for sharing @andrejz
