Text Extraction from PDF

BenJones13 · March 25, 2024, 1:57pm

Hello all!

I’m currently trying to extract a certain number and date out of a list of PDFs and then rename the PDF with that number and date. To be specific, I’m extracting the Employee Number from a payslip as well as the month and year.

Here lies the issue. Employee Number is under the address of the payslip and some addresses have more lines than others. The Employee Number also isn’t structured information, so I’m finding it tricky to extract the four digits when the position changes from payslip to payslip.

I’ve started with the Tika Parser and tried different combinations of the Sentence Extractor node and String Manipulation node, but no luck. Any help would be much appreciated!

Will attach two dummy payslips with multiple address lines.

The Employee Number of interest is the four-digit number preceding the name at the top of the “table” after the address.

Commanders Payslips Feb 24_Part8.pdf (121.7 KB)
LA Chargers Payslips Feb 24_Part8 2.pdf (116.5 KB)

lelloba · March 25, 2024, 2:46pm

Hi @BenJones13,

welcome to the forum!
I tried with regex, hope it works for you.

According to the two examples provided, it works.

Have a nice day,
Raffaello Barri

BenJones13 · March 25, 2024, 3:06pm

Hi @lelloba,

Just had a run through and it looks amazing!

From what I can gather, does your logic work by looking for the first four-digit number in the PDF and assuming that it is the Employee Number?

Do you know how I would get around that if I had a case like below?

Ravens Payslips Feb 24_Part8.pdf (128.2 KB)

Secondly, would you be able to point me in the direction of the second part of my problem - I’m trying to export/save/write this PDF and rename it with the extracted Employee Number with the Month and Year that was also extracted.

Really appreciate the help so far.

lelloba · March 25, 2024, 3:38pm

Hi @BenJones13 ,

As for the employee code, I’ve updated the workflow so that it now handles the third case as well. You’ll find the updated workflow at the link above.
As for the PDF renaming, I’m on it, I’ll be back soon.

Raffaello

lelloba · March 25, 2024, 3:54pm

Hi @BenJones13 ,

The wf has been updated once again to introduce the renaming part.
Renaming consists of creating a new file with the new name, which means I had to add a part that finds all the old files and deletes them.
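
For readers who want to see the "rename = create a new file, then delete the old one" idea outside KNIME, here is a minimal Python sketch. It is not the workflow itself: the function name rename_by_copy and the delete_original flag are made up for illustration.

```python
import shutil
from pathlib import Path


def rename_by_copy(source, new_name, delete_original=True):
    """Copy `source` to `new_name` in the same folder, then optionally delete the original."""
    source = Path(source)
    target = source.with_name(new_name)
    if target == source:
        # Nothing to do: the file already has the requested name.
        return target
    # Step 1: create the new file under the new name.
    shutil.copy2(source, target)
    # Step 2: remove the old file, unless the caller wants to keep it.
    if delete_original:
        source.unlink()
    return target
```

Passing delete_original=False keeps the original payslip next to the renamed copy, which corresponds to dropping the deletion step.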

Tell me if it works.

Raffaello Barri

BenJones13 · March 25, 2024, 4:51pm

Hi @lelloba,

Firstly, I’m blown away by the community, you’ve been incredibly helpful today.

The wf works perfectly on my end. I didn’t need to delete the old files, so I removed that loop at the end. Just a few questions from my side.

Is there a way to choose the output destination? Let’s say I wanted it in a separate folder?

And is there a way to add the date to the file name so that it follows this format: “Employee Number_Payslip_Month_Year”?

The date is also available on the payslip if need be.

Sorry to be a pain but once again I appreciate all the help so far.

lelloba · March 25, 2024, 5:10pm

Thank you

As for the name, refer to the existing link, the workflow has been updated; regarding the new folder, you need to replace the path with the new one inside the Column Expr. (a simple substitution with a replace formula should do the job).

Raffaello

BenJones13 · March 26, 2024, 9:08am

@lelloba I cannot thank you enough, it works perfectly and is exactly what I was looking to do

Would be very interested in how you were able to extract the snippets from the texts using the Regex!

\d{4}\sM[r|rs|iss]\s - this is for the digits but it doesn’t make sense to me!

\d{1,2}\s\w{3}\s20\d{2} - this is for the dates and I have the same confusion!

And if you could just point to where in the Column Expressions node I would use a Replace expression for the new path my life would be made a whole lot easier!

Once again, really appreciate all the effort you put into this.

lelloba · March 26, 2024, 10:08am

Hi @BenJones13,

Regex can be hard to learn, but if you remember the basics it can save you in many situations, trust me. Have a look on YouTube if you want to learn it, and test your code with the website regex101.
The website also helps you read the code, explaining every single step.

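So with this code, for example, I ask to match four digits, followed by a space, followed by a capital M, which is meant to be followed by r/rs/iss, and then a final space. Strictly speaking, the square brackets in [r|rs|iss] define a character class that matches only a single character (r, s, i or |), so the expression as written matches “Mr ” but not “Mrs ” or “Miss ”; a group such as M(r|rs|iss) expresses the intended alternatives.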
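
To try the employee-number pattern outside KNIME, here is a small Python sketch. It compares the pattern as quoted in the thread with a hypothetical grouped variant; the names QUOTED, GROUPED and employee_number, and the sample text, are invented for illustration and are not taken from the workflow or the attached payslips.

```python
import re

# Pattern as quoted in the thread. The square brackets form a character class,
# so exactly one of the characters r, |, s, i may follow the capital M.
QUOTED = re.compile(r"\d{4}\sM[r|rs|iss]\s")

# Hypothetical variant: a non-capturing group lists the titles Mr, Mrs and Miss,
# and a capture group isolates the four digits.
GROUPED = re.compile(r"\b(\d{4})\sM(?:r|rs|iss)\s")


def employee_number(text):
    """Return the four digits that precede a title, or None if there is no match."""
    match = GROUPED.search(text)
    return match.group(1) if match else None


# Made-up text with a four-digit house number and a year as distractors.
sample = "Flat 3\n1200 Pennsylvania Ave\n4821 Mrs J Smith\nPay Date 28 Feb 2024"
print(employee_number(sample))  # 4821
```

Anchoring on the title is what lets the match skip other four-digit numbers, such as a house number in the address or the year in the date.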

Similarly, for the date, I look for one or two digits (day), followed by a whitespace character, followed by three word characters (the abbreviated month), followed by another whitespace character, followed by 20 and two more digits (year 20XX).
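
The date pattern can be tested the same way. The sketch below assumes the month appears as an English three-letter abbreviation (for example “Feb”) and that Python runs with its default locale; the function name payslip_month_year is hypothetical.

```python
import re
from datetime import datetime

# Same pattern as in the thread: day, three word characters, year 20XX.
DATE_PATTERN = re.compile(r"\d{1,2}\s\w{3}\s20\d{2}")


def payslip_month_year(text):
    """Return (zero-padded month, year) for the first date found, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    try:
        # %b parses an abbreviated month name such as "Feb".
        parsed = datetime.strptime(match.group(0), "%d %b %Y")
    except ValueError:
        # \w{3} also accepts things that are not month names, e.g. "28 abc 2024".
        return None
    return f"{parsed.month:02d}", str(parsed.year)
```

Zero-padding the month here plays the same role as padLeft(..., 2, "0") in the Column Expressions code.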

For the column expression, here’s my local path:

say you want to save to downloads:

Here’s what I would place in the column expressions:

//creating name
name = join(column("code"), "_Payslip_", padLeft(getMonthOfYear(column("Full Match")), 2, "0"), "_", getYear(column("Full Match")));

//creating new path
prov = join(getParent(column("Path")), getSeparator(column("Path")), name, getFileExtension(column("Path")));

//NEW PART HERE!!! substitution
prov = replace(prov, "Documenti\\KNIME\\Personale\\Assistenza forum\\BenJones13\\Text Extraction from PDF\\data", "Downloads");

//substituting repetitions
prov = replace(replace(prov,"(LOCAL, ", ""), ")", "");

//substituting path with new one
replacePath(column("Path"), prov)

As you can see, in the new part I simply replace the old part of the path with the new one.
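
The same naming logic, sketched in Python for comparison. This is only an illustration of the “Employee Number_Payslip_Month_Year” format with a different output folder: build_target_path and its parameters are hypothetical, and the employee number in the usage line is made up.

```python
from pathlib import Path


def build_target_path(source_path, code, month, year, target_dir):
    """Build '<target_dir>/<code>_Payslip_<MM>_<YYYY><ext>' for a payslip file."""
    source = Path(source_path)
    # Zero-pad the month to two digits, as padLeft does in the expression above.
    name = f"{code}_Payslip_{int(month):02d}_{year}"
    # Keep the original extension and swap the folder for the target one.
    return Path(target_dir) / f"{name}{source.suffix}"


# Example call with a made-up employee number.
target = build_target_path("data/Ravens Payslips Feb 24_Part8.pdf", "4821", 2, 2024, "Downloads")
```

Building the target from the folder and the new name directly avoids the string replacement on the old path, at the cost of passing the target folder in explicitly.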

Have a nice day,
Raffaello Barri
