Help me Extract "11 04 22" from a PDF

Kinetic_Bullet · May 2, 2025, 2:27pm

Hi all,
This is my first post. I have a PDF of construction specifications. The PDF lists the spec numbers like this: “11 04 22”. I’d like KNIME to read the PDF and extract similar strings.

So far, this is what I have tried:
[Screenshot: KNIME workflow]

The top approach in the screenshot above gives me only one string.
The bottom approach splits the string “11 04 22” into three rows:
11
04
22
Which is not what I want.
Can anyone please help me!
Thank you.
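For reference, the kind of match being asked for can be sketched outside KNIME with a regular expression. This is only a minimal Python illustration, not the workflow discussed in this thread: the function name `extract_spec_numbers` and the sample text are made up, and the pattern assumes spec numbers are always three two-digit groups separated by single spaces.

```python
import re

# Three two-digit groups separated by single spaces, e.g. "11 04 22".
# The lookarounds stop the pattern from matching inside longer digit runs.
# Assumption: the parsed PDF text uses plain single spaces between groups.
SPEC_NUMBER = re.compile(r"(?<!\d)\d{2} \d{2} \d{2}(?!\d)")


def extract_spec_numbers(text: str) -> list[str]:
    """Return every spec-number-like string found in text, in order."""
    return SPEC_NUMBER.findall(text)


# Hypothetical sample standing in for text parsed out of the PDF.
sample = "11 04 22 Example Section A\n11 05 13 Example Section B\nPage 12 of 340"
print(extract_spec_numbers(sample))
```

Matching the whole "11 04 22" pattern in one expression keeps the three groups together, rather than splitting on spaces and getting one row per group. The same expression may be usable in a regex-capable KNIME node, but that is untested here.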

rfeigel · May 2, 2025, 4:58pm

It would be easier for someone to help if you could share your workflow. Make sure to include the pdf(s) with the workflow. In the meantime, try using a chunk loop after the Tika Parser.

Kinetic_Bullet · May 6, 2025, 1:41pm

Hi rfeigel,
Adding a chunk loop after Tika gave the same results. I found that the best solution is to convert the pdf to an Excel sheet, but this means that I have to do additional cleanup before I import into KNIME.

Not sure if you can see the screenshots above but is there another workflow you want me to share?

rfeigel · May 6, 2025, 1:58pm

Please share an actual workflow, not a screenshot. I can’t help without a sample pdf.

Kinetic_Bullet · May 6, 2025, 3:44pm

Here is the pdf file and the workflow as requested.

Thank you for your help.
PDF PARSER.knwf (88.4 KB)
Sepcs TOC.pdf (731.1 KB)

rfeigel · May 7, 2025, 3:01am

Try this. The pdf you posted doesn’t seem like it’s exactly what you described in your post, but the structure seems the same.

Kinetic_Bullet · May 7, 2025, 12:09pm

When I open the workflow, I get this warning message.

[Screenshot: Warning During Load]
I used a different pdf, but String Splitter (Regex) now cannot find the Data String column! Why is that?

Thank you for your help.

rfeigel · May 7, 2025, 2:17pm

Check the Table Manipulator node and make sure it’s configured as shown below. Row0 may be named differently. If you post the second pdf, I’ll check it.

You’ll also need to reset the Tika Parser directory location to your local location.

rfeigel · May 8, 2025, 1:10am

A “like” isn’t very informative. If it’s working, please mark it solved. If not, provide some more information.

Kinetic_Bullet · May 12, 2025, 12:18pm

This is strange: for Row0, there is no option to change the type to String. I deleted the Table Manipulator node and inserted a new one, but it still refuses to show me String. Any idea why?

rfeigel · May 12, 2025, 1:52pm

Are you trying to parse a different pdf than the one you originally posted? If so, please post it. Did you make changes to my workflow? Here’s a workflow using the Column Auto Type Cast node, which “automatically” converts non-native data formats.

Kinetic_Bullet · May 12, 2025, 2:45pm

Hi,
Thank you for sticking with me. I will give the above a try.

The pdf I gave you was an extracted version of a huge pdf. The format is the same. I don’t understand why this issue occurred to begin with.

Why am I getting a non-native data type when using a different pdf that has the same layout/format? Please explain.

rfeigel · May 13, 2025, 12:32am

I don’t have a good explanation. The problem happens in the Table Transposer node. I have seen this before. The node doesn’t always seem “smart enough” to assign a data type to the transposed column(s). Adding the Column Auto Type Cast node downstream has always fixed the problem for me. So don’t worry about it.

Kinetic_Bullet · May 14, 2025, 12:52pm

Ok. Thank you very much for sticking with me to the end and finding a solution. I did a test and converted a Word document to pdf, and your workflow worked flawlessly!!

I will mark the previous post as a solution!
