Hi all,
This is my first post. I have a PDF of construction specifications. The PDF lists the spec numbers like this: “11 04 22”. I’d like KNIME to read the PDF and extract similar strings.
So far, this is what I have tried:

The top approach in the screenshot above gives me only one string.
The bottom approach splits the string “11 04 22” into three rows:
11
04
22
Which is not what I want.
Can anyone please help me?
Thank you.
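For reference, the target pattern (three two-digit groups separated by single spaces) can be sketched as a regular expression outside KNIME. This is only an illustration in Python, not part of any workflow in this thread. The function name `find_spec_numbers` and the sample text are made up, and real spec numbers may need a looser pattern.

```python
import re

# Three two-digit groups separated by single spaces, e.g. "11 04 22".
# The lookarounds keep the match from starting or ending inside a longer digit run.
SPEC_NUMBER = re.compile(r"(?<!\d)\d{2} \d{2} \d{2}(?!\d)")


def find_spec_numbers(text):
    """Return every spec-number-like string in the text, each kept whole."""
    return SPEC_NUMBER.findall(text)


# Hypothetical text, standing in for what a PDF parser might return.
sample = "11 04 22 Sample Title A\n11 05 13 Sample Title B"
print(find_spec_numbers(sample))  # ['11 04 22', '11 05 13']
```

Matching the whole pattern at once, rather than splitting on whitespace first, is what keeps “11 04 22” together as one value.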
It would be easier for someone to help if you could share your workflow. Make sure to include the PDF(s) with the workflow. In the meantime, try using a chunk loop after the Tika Parser.
Hi rfeigel,
Adding a chunk loop after the Tika Parser gave the same results. The best solution I have found is to convert the PDF to an Excel sheet, but this means I have to do additional cleanup before importing it into KNIME.
Not sure if you can see the screenshots above, but is there another workflow you want me to share?
Please share an actual workflow, not a screenshot. I can’t help without a sample PDF.
Here is the pdf file and the workflow as requested.
Thank you for your help.
PDF PARSER.knwf (88.4 KB)
Sepcs TOC.pdf (731.1 KB)
Try this. The PDF you posted doesn’t seem to be exactly what you described in your post, but the structure seems to be the same.
Check the Table Manipulator node and make sure it’s configured as shown below. Row0 may be named differently. If you post the second PDF, I’ll check it.
You’ll also need to reset the Tika Parser directory location to your local location.
A “like” isn’t very informative. If it’s working, please mark it as solved. If not, please provide some more information.
This is strange: for Row0, there is no option to change the type to String. I deleted the Table Manipulator node and inserted a new one, but it still refuses to show me String. Any idea why?
Are you trying to parse a different PDF than the one you originally posted? If so, please post it. Did you make changes to my workflow? Here’s a workflow using the Column Auto Type Cast node, which “automatically” converts non-native data formats.
Hi,
Thank you for sticking with me. I will give the above a try.
The PDF I gave you was an extracted version of a huge PDF. The format is the same. I don’t understand why this issue occurred to begin with.
Why am I getting a non-native data type when using a different PDF that has the same layout/format? Please explain.
I don’t have a good explanation. The problem happens in the Table Transposer node. I have seen this before. The node doesn’t always seem “smart enough” to assign a data type to the transposed column(s). Adding the Column Auto Type Cast node downstream has always fixed the problem for me. So don’t worry about it.
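To illustrate the general idea of automatic type casting, here is a rough Python sketch that picks the narrowest type fitting every value in a column and then converts the column. This is not KNIME’s implementation of Column Auto Type Cast; the function names `infer_type` and `auto_cast` and the choice of candidate types are assumptions made for illustration only.

```python
def infer_type(values):
    """Return the narrowest of int, float, str that fits every non-missing value."""
    present = [v for v in values if v is not None and v != ""]
    for caster in (int, float):
        try:
            for v in present:
                caster(v)  # raises ValueError if the value does not fit
            return caster
        except (TypeError, ValueError):
            continue
    return str  # fallback: everything can be kept as a string


def auto_cast(column):
    """Convert a column of raw values to its inferred type, keeping missing cells as None."""
    target = infer_type(column)
    return [None if v is None or v == "" else target(v) for v in column]


# Spec numbers such as "11 04 22" contain spaces, so they stay strings.
print(auto_cast(["11 04 22", "01 10 00"]))  # ['11 04 22', '01 10 00']
print(auto_cast(["1", "", "3"]))            # [1, None, 3]
```

The point of the sketch is that a column left with a generic type can be given a concrete one by inspecting its values, which is the role the downstream node plays in the workflow above.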
OK. Thank you very much for sticking with me to the end and finding a solution. I did a test and converted a Word document to PDF, and your workflow worked flawlessly!
I will mark the previous post as a solution!