PDF - Extract Text plus export to Excel

Hi friends.

I would like some help with what I think is a simple task.
I want to read some PDFs and extract only two pieces of information to Excel.

I’ll explain with screenshots.

First, I have two PDFs (in reality, I have 800), and I want only those two pieces of information concatenated into a single line, separated by an underscore, as below:

CNPJ: 43.708.379/0003-63_Referência: 11/2022 (then replace / with another character so the result can be used as the file name)

I’m using the Tika Parser node, but all the text of each PDF ends up in a single row.
So far, I have only done this:
[image]

How can I extract that information?

To complicate things further, my real goal is:

  1. Extract those two pieces of information
  2. Concatenate them into one row
  3. Keep the original name of the file (already done)
  4. Copy the original file into another folder under the new, concatenated name
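
Steps 1 and 2, plus the slash replacement, could be sketched in Python roughly as follows. This is only an illustration, not the workflow attached to this thread: the sample text, the regular expressions, the helper name build_new_name and the choice of "-" as the replacement character are all assumptions. It also assumes the labels appear in the extracted text exactly as "CNPJ:" and "Referência:".

```python
import re

def build_new_name(content):
    """Return 'CNPJ: ..._Referência: ...' with '/' replaced by '-'.

    Hypothetical helper: returns None if either field is not found.
    """
    cnpj = re.search(r"CNPJ:\s*([\d./-]+)", content)
    ref = re.search(r"Referência:\s*(\d{2}/\d{4})", content)
    if cnpj is None or ref is None:
        return None
    # Step 2: concatenate both fields, separated by an underscore
    joined = f"CNPJ: {cnpj.group(1)}_Referência: {ref.group(1)}"
    # '/' cannot appear in a file name, so swap it for another character
    return joined.replace("/", "-")

# Made-up stand-in for the text extracted from one PDF
sample = "Some header\nCNPJ: 43.708.379/0003-63\nReferência: 11/2022\nMore text\n"
new_name = build_new_name(sample)
print(new_name)
```

Note that ":" is not allowed in Windows file names either, so on Windows it would probably need to be replaced or dropped as well.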

The original name is:

My goal is:

[image]

If I get as far as item 2, I will export the results to an Excel file and then use Power Automate to rename the files. With only two columns, the original name and the new name, I could do it (I found a YouTube video).

Cheers from Brazil
ReadingPDF.knwf (66.8 KB)

@Felipereis50 I might be able to take a look at your example later. In the meantime I could point you to this solution using Rstats to extract text from a PDF and search for a special term:

Hi friend.
I’ll try Rstats as you said.

I’ll get back to you later if I get it working.

I’m not familiar with R.
But I’ll try to study cases from challenge 15.

Hello @Felipereis50
Following the KNIME Hub link, you can find a possible solution to your challenge. I tried to build the whole workflow without resorting to scripting nodes. Transferring the files with the Transfer Files node was easy to configure within a loop; however, I couldn’t complete the file rename task.

As you can see, coding the transfer + rename in R is very simple, so you can avoid the extra Power Automate steps.

Let us know if further support is needed.
BR

P.S.: The PDF files aren’t included in the workflow. The workflow is saved with data. To re-run it, you will have to point the source folders to locations on your system and edit the reference folders in the R Snippet code.

Hi friend, very nice to meet you.

Wow, great to see people committed to helping.
I’m still new to KNIME, but I hope I can help too.

It’s great when you send me the code, so I can analyze it.

About the R language, I took a small course on Google Analytics, but I found the language difficult. I didn’t adapt.

I will download your code and study it step by step.
I managed to do it using Power Query and Power Automate.

In Power Query I read a folder and did the ETL (easy), then I used Power Automate Desktop, following a YouTube video (also very easy).
But I want to learn how to do it in KNIME. It is a magnificent tool. The best I’ve ever come across. Incredible.

I’ll be right back with the analysis.

Hello friend.
I studied your code. I understand almost everything.

What I didn’t understand at all was your formula: “regexReplace($Content$, “.?(CNPJ:.)?\n”, “$1”)”
I have no idea what that “regex” rule means.

And since I don’t know R, I didn’t understand the R Snippet either. What I do know is that the <- sign assigns a value to a variable.

Anyway, what you did is exactly what I needed.
Many, many, many thanks.

I’ll ask on the forum if anyone can help me with Transfer Files + loop.

(I have never used a loop. It must be a bit complex.)

Cheers from Brazil.

Hello @Felipereis50

I’ll try to update the workflow by the end of the day, and it will include this function based on the Transfer Files node (embedded in a loop). I expect it won’t be as efficient as the R version.

Regex is a powerful tool for working with text. Like any code, it can solve many problems. The explanation of the code is as follows:

Therefore, the pattern matches the whole multiline text within the cell. The first capturing group is the target row, and it is referenced as “$1”. We then replace the whole text with only the capturing group.
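
The same idea can be tried outside KNIME. The sketch below uses Python's re module rather than KNIME's regexReplace. The sample text and the pattern are assumptions for illustration, not the exact expression from the workflow. The pattern matches the whole multiline text, captures the CNPJ row in group 1, and replaces everything with that group.

```python
import re

# Hypothetical multiline cell content, as a PDF parser might return it
content = "Some header\nCNPJ: 43.708.379/0003-63\nReferência: 11/2022\nMore text\n"

# (?s) makes '.' match newlines too, so the pattern covers the whole text.
# Group 1 captures from 'CNPJ:' up to the end of that row.
pattern = r"(?s).*?(CNPJ:[^\n]*)\n.*"

# Replace the whole match with group 1 only (\1 in Python, "$1" in KNIME)
cnpj_row = re.sub(pattern, r"\1", content)
print(cnpj_row)
```

Because the pattern consumes the entire text, only the captured row is left after the replacement.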

Thank you for validating the solution. BR

Wow, thank you very much.
If you can build the loop so I can understand it, it will be a great learning experience.

I don’t want to disturb you.
Thanks for explaining the rules. Yesterday I was watching videos to learn regex. I still think it’s difficult. It has many rules. Regex really is powerful. I didn’t know.

I had thought of using some “IF” formula and “search” and then left and right to capture the numbers after CNPJ.

I looked at your history on the forum and saw that you know a lot.

Hello @Felipereis50
I’ve just updated the workflow on KNIME Hub referenced in my previous post. As you can see, the ‘Transfer Files’ node can only copy or move (via the delete-source-files option) files to the target folder…

In my experience, the best way to copy_and_rename() is to code it in R. It would be interesting to know whether the rename function can be achieved with KNIME base nodes.

BR

@gonhaddock what I built was to copy the file to a new location with the new name and then delete the old file. Not especially elegant …

Installing R can pose some challenges. If an external tool is to be used, Python might be the better choice, since it can be installed simply through a KNIME extension.
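
For comparison, the copy-then-delete idea could be sketched in a few lines of Python. This is only an illustration using the standard library, not the KNIME workflow itself. The function name copy_and_rename, its parameters and the demo file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def copy_and_rename(source, target_dir, new_name, delete_source=False):
    """Copy source into target_dir under new_name, keeping the original extension."""
    source = Path(source)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (new_name + source.suffix)
    shutil.copy2(source, target)  # copies the content and file metadata
    if delete_source:
        source.unlink()  # optional "move" behaviour: remove the old file
    return target

# Self-contained demo in a temporary folder with a dummy file
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.pdf"
    original.write_bytes(b"%PDF-1.4 dummy")
    copied = copy_and_rename(original, Path(tmp) / "renamed", "43.708.379-0003-63_11-2022")
    copied_exists = copied.exists()
    original_still_there = original.exists()
    print(copied.name)
```

In a real run, the source path and new name would come from the two-column table (original name, new name) built earlier in the workflow.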

Thanks @mlauber71
So ‘Transfer Files (Table)’ can do the trick 🖖

I will try to upgrade the workflow by including a copy_and_rename() ‘option’ built with KNIME base nodes, inspired by @mlauber71, and maybe add a Python option as well, to close the gap…

BR

Hi @gonhaddock and @mlauber71

As usual, thanks a lot for the support.

I’m studying your code.
And @mlauber71, that is a very creative way of renaming.

It was the first time I saw a loop example. Very interesting.
Well, there could be a single Transfer node that does it all (move, copy and rename). Until then, I think it takes creativity with other nodes to rename files.

As I don’t know anything about R or Python, it would be difficult for me to reach the “GOAL” alone.

In any case, I am very grateful for what has already been achieved.

I found some other threads about renaming files. I haven’t been able to check yet. But I will. Who knows, maybe I can help you too.

But I managed to rename the 800 files in 3 minutes.

Amazing.

Cheers from Brazil.

Hello @Felipereis50 @mlauber71

The workflow now achieves the Copy_Rename function using only KNIME base nodes. It relies on the ‘Transfer Files (Table)’ node, incorporating @mlauber71’s suggestions.

The current workflow shows how to complete the ‘Copy and Rename files’ function in three ways, depending on preference:

  1. KNIME base nodes, using Transfer Files (Table)
  2. R Snippet (code)
  3. Python Script (code)

I consider the challenge complete now.
BR

Wonderful job.

Thank you very much.

It will be very helpful.

I’m analyzing the workflow.

How well organized it is.
I’m impressed.
Even down to the colored boxes separating each stage.

This work should go into the course. 🙏 😅
