Extract pdf information - Tika Parse (Help)

Felipereis50 · June 27, 2024, 10:55pm

Hi friend,
I would like some help extracting all the values from a PDF so that I can export them to Excel afterwards.

Hint
The only different rule is for the last value: the text “ICMS-ST INTEREST DE SP PARA RJ 102016” could be anything, or empty.
The code will need to find that position and return a value that could be of any length.

Could anyone help me?

I don’t know anything about regex, but I understand that regex is the best option for this.

PDF.knwf (31.9 KB)

Felipereis50 · June 28, 2024, 1:57am

I have tried these two expressions:

1) regexReplace($Content$, ".*?(Código\\s+de\\s+Barras.*)?\\n", "$1")
2) regexReplace($Content$, ".*?Código\\s+de\\s+Barras:(.*)?\\n", "$1")

The first one gives almost exactly what I want, but it still includes the label “Código de Barras”. I only want the values after the “:”.

So I tried the second expression, creating a group with (.*) and referencing it with “$1”. But why does the result show all the values?
I couldn’t find the right rule.

I studied a great explanation from @takbb and copied some of its rules.
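A minimal sketch of why the second expression appears to return everything, written in Python rather than as a KNIME expression. It assumes that regexReplace substitutes only the text the pattern matched and leaves the rest of the string in place, as Python's `re.sub` does. The sample text is hypothetical, since the real PDF content is not shown in this thread, and “$1” is written as `\1` for Python.

```python
import re

# Hypothetical sample standing in for the Tika-parsed Content column;
# the real PDF text is not shown in this thread.
content = (
    "Descrição: 102016 ICMS-ST\n"
    "Código de Barras: 11111 22222\n"
    "Valor: 100,00\n"
)

# Second expression from the post, with "$1" written as r"\1" for Python.
pattern = r".*?Código\s+de\s+Barras:(.*)?\n"

# Without (?s), '.' stops at '\n', so the match covers only the label line.
match = re.search(pattern, content)
matched_text = match.group(0)

# The substitution rewrites the matched line and leaves every other line untouched.
result = re.sub(pattern, r"\1", content)

print(repr(matched_text))
print(repr(result))
```

Under these assumptions the group itself works as intended. The other lines remain in the output only because the pattern never matched them, so they were never replaced.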

mlauber71 · June 28, 2024, 4:17am

@Felipereis50 just to give you two hints. I have these examples about how to extract tables but also text from a PDF

You could also try feeding the PDF, or an extract of it, to an LLM and instructing it to give back the results. You could try working with vector stores.

As a side note, I also have a Streamlit app where you can feed in the PDF. It is not yet an article or a KNIME workflow:

Felipereis50 · June 28, 2024, 12:22pm

@mlauber71

Hi,

First…
I’m trying the first article, “Extract Text and Tables from PDF Files with Python in a Low-Code Environment”.

And I’m getting an error.

Do I need to install a particular extension?

I think the conda installation is missing, but I don’t have permission to install anything on my company laptop.

Even so, your article looks amazing, but I think it is “too much” for my IQ. I have bookmarked it to learn from later.

mlauber71 · June 28, 2024, 12:57pm

@Felipereis50 at the end of the above article there is a YML file with the necessary configurations in Python to use the packages. You can also use the Conda Environment Propagation if you have the KNIME Python extension installed.

Maybe check out this text:

Felipereis50 · June 28, 2024, 1:56pm

I’m trying to extract the value below and I’m very close to finishing.
(see the green selection)

But how do I extend the expression so that it selects all the lines, like the code below?
I ask because regexReplace needs that.

I have tried to use:

(Descrição:\s+.{6})*?\n

but without success.

I know @lelloba is very good at regex. I think it is a simple expression to finish (if you have time).

mlauber71 · June 28, 2024, 2:23pm

@Felipereis50 what will be the ‘rule’ that defines this part of the text? I have had good results giving ChatGPT examples and rules and asking it for a regex.

You might have to double-escape backslashes in KNIME though. Also, you might want to test several edge cases and think about what could go wrong.
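A small sketch of the double-escaping point, in Python for illustration only. It assumes that a quoted KNIME expression goes through one level of unescaping before the regex engine sees it, which is why the expressions earlier in the thread are written with `\\s`. The sample input is hypothetical.

```python
import re

# Pattern as typed inside the quoted KNIME expressions in this thread
# (doubled backslashes), kept verbatim here with a raw string.
typed = r"Código\\s+de\\s+Barras:"

# Assumption: one level of unescaping happens before the regex engine
# receives the pattern, turning each "\\" into a single "\".
seen_by_engine = typed.replace("\\\\", "\\")

sample = "Código   de Barras: 11111"  # hypothetical input

# After unescaping, \s+ means "one or more whitespace characters".
matches_unescaped = re.search(seen_by_engine, sample) is not None

# Used directly as a regex, the doubled form means a literal backslash
# followed by one or more 's' characters, which the sample does not contain.
matches_doubled = re.search(typed, sample) is not None

print(seen_by_engine, matches_unescaped, matches_doubled)
```

This also matters when copying a pattern from ChatGPT or a regex tester, which usually shows single backslashes.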

Felipereis50 · June 28, 2024, 2:36pm

I have tried asking ChatGPT in various ways.

I have spent 4 hours trying to show only that content in a KNIME column, with no result.

I managed to extract some values when capturing the entire line.
But since I need to capture only part of the line, I can’t finish it in KNIME.

I need to extend the regex so that only what I want is highlighted in green and the rest in blue, just like the example above. But I think that because the middle of the line doesn’t contain a newline character, the function doesn’t work.
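A hedged sketch of capturing only part of a line, in Python rather than KNIME. In the earlier attempt `(Descrição:\s+.{6})*?\n`, the lazy `*?` applies to the whole group, so the group can repeat zero times and the pattern matches just the newline. Keeping the label outside the group and the six wanted characters inside it isolates the value without needing a newline in the middle of the line. The sample text is hypothetical, and `re.findall` is a Python function, not a KNIME one.

```python
import re

# Hypothetical text; the real PDF content is not shown in this thread.
text = "Descrição: 102016 ICMS-ST\nOutra linha\n"

# Pattern from the earlier post: the lazy "*?" lets the group repeat
# zero times, so each match is only the newline character.
attempt = [m.group(0) for m in re.finditer(r"(Descrição:\s+.{6})*?\n", text)]

# Label outside the group, the six wanted characters inside it.
codes = re.findall(r"Descrição:\s+(.{6})", text)

print(attempt)
print(codes)
```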

I urgently need a regex course.

Felipereis50 · July 5, 2024, 11:26pm

I came back to report that I found a solution for the regex.
After buying a book to learn regex, I found the “(?s)” flag.
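A minimal sketch of what “(?s)” changes, in Python for illustration. With this flag, `.` also matches newline characters, so one pattern can consume the whole text and keep only the captured group. The exact expression used in the final workflow is not shown in the thread, so the pattern and sample text below are assumptions.

```python
import re

# Hypothetical sample standing in for the Tika-parsed Content column.
content = (
    "Descrição: 102016 ICMS-ST\n"
    "Código de Barras: 11111 22222\n"
    "Valor: 100,00\n"
)

# (?s) lets '.' match '\n'. The leading ".*?" and trailing ".*" swallow
# everything around the value; "[^\n]*" keeps the group on the label's line.
pattern = r"(?s).*?Código\s+de\s+Barras:[ \t]*([^\n]*).*"

value = re.sub(pattern, r"\1", content)

# If the label is absent, nothing matches and the text comes back unchanged.
unchanged = re.sub(pattern, r"\1", "no label here\n")

print(repr(value))
```

Because the text is returned unchanged when the label is missing, that case is worth checking separately before exporting.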
