PDF Parsing and Extraction

Hi, I have a sample PDF file from which I need to extract tables into an Excel spreadsheet. I’ve started by extracting the first table using the sample workflows I’ve seen here in the forum, but I’m running into a problem: the “Date” values end up combined with text such as “Gen, G4 and 3, Intel”, and I don’t know how to split them into separate columns.

I would appreciate any help or guidance on how to correctly extract the table.

Here’s the sample pdf file:
LENOVO.pdf (75.5 KB)

This is my current workflow:
PDF Extracting.knwf (87.9 KB)

PDF Table:

KNIME:

Hi, could you maybe post a “what you expect” vs. “what you get” comparison?
In your KNIME table, all the dates are in single cells, so I don’t understand what you want to be different.

Hi, in your output they already appear to be in separate columns after the Cell Splitter.
Given the text, I think a regex would be more helpful for getting the information into the same columns.
br

Hi @Gasperto I want them to look like this:

So that when I use the column merger and string manipulation, this will be the final result:

I have nodes below for fixing the product description column, and I want their output concatenated with the above:

It will look like this:

The final output of the table after concatenation should be this:
(before)


(after)

Hi @Daniel_Weikert, sorry if it’s not clear :sweat_smile:. Yes, they are in separate columns, but the Date values and the text values should not share a column, because the text is part of the product description. It should look like this:

Once separated, I can combine the text with the first 2 columns, and it should look like this:

I don’t know how to do this, that is, how to move the text values out into separate columns :sweat_smile:
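For what it's worth, the regex idea suggested earlier in the thread could be sketched roughly like this in Python. This is only an illustration, not something taken from the attached workflow: the date formats in `DATE_PATTERN`, the function name `split_dates_from_text`, and the sample row are all assumptions and would need to be adjusted to what the real PDF contains.

```python
import re

# Assumption: dates look like 01/15/2024 or 2024-01-15. Adjust to the real PDF.
DATE_PATTERN = re.compile(
    r"^(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}|\d{4}-\d{2}-\d{2})$"
)


def split_dates_from_text(cells):
    """Separate date-like cells from text cells in one row.

    Returns (dates, text): a list of the date-like cells, and the remaining
    non-empty cells joined by a space so they can be appended to the
    product description.
    """
    dates, text_parts = [], []
    for cell in cells:
        value = (cell or "").strip()  # treat missing cells as empty
        if not value:
            continue
        if DATE_PATTERN.match(value):
            dates.append(value)
        else:
            text_parts.append(value)
    return dates, " ".join(text_parts)


# Hypothetical row, as it might look after a Cell Splitter step.
row = ["Gen", "01/15/2024", "G4 and 3", None, "Intel"]
dates, description = split_dates_from_text(row)
print(dates)
print(description)
```

In KNIME itself the same pattern could probably be used in a rule-based or regex node instead of Python; the point is only that date-like cells are recognised by their shape and everything else is treated as description text.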

@Heldyyyyy here is one approach using Python and pdfplumber, followed by Cell Splitter. Some column types, such as dates, might have to be converted.
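As a rough idea of what the pdfplumber step can look like, here is a minimal sketch. It is not the code from the linked workflow: the function names `extract_tables` and `table_to_records` are made up for illustration, and it assumes the first row of each extracted table is the header.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    Assumption: the first row holds the column headers.
    """
    header = [(h or "").strip() for h in table[0]]
    records = []
    for row in table[1:]:
        # pdfplumber returns None for empty cells and keeps line breaks
        # inside a cell, so normalise both before building the record.
        cells = [(c or "").replace("\n", " ").strip() for c in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_tables(pdf_path):
    """Collect every table pdfplumber detects, page by page."""
    import pdfplumber  # imported here so the helper above works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables
```

A call such as `extract_tables("LENOVO.pdf")` would return a list of tables, each a list of rows, which can then be passed through `table_to_records` and handed back to KNIME as a table.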

More on extracting tables from PDF:

I also experimented with the use of LLMs and JSON formats but this is not done yet for such a case.

Hi @mlauber71, thank you for this. I’m trying your workflow, but I can’t seem to set the conda directory path in KNIME. I already have conda installed, and I’m using Windows 11.

Update:

This is the one that works:

I can’t seem to get the py39_knime environment.

@Heldyyyyy you should use the yml configuration file at the end of the article. More on how KNIME and Python work together here:

Thank you @mlauber71, the blog and files really help! I just want to ask: what does this pdfplumber package do?

Sorry for the question; I don’t code and have limited knowledge of this type of approach :sweat_smile:. I will look into it more in the future as my workflow knowledge advances :grinning:

Should I edit my py39_yaml file and paste this in so that pdfplumber is added to the Python library? Or should I just enter these in the prompt one by one (starting with the 3rd line)? I only followed the activation of py39 and did not include this.

You can add “pdfplumber” to the list of packages in the yml file.

And then use the conda env update command to add the package to your environment. This way you keep the information about which packages you have.

conda env update --name py39_knime --file "C:\\Users\\x123456\\knime-workspace\\py39_knime.yml"

Make sure you tell the Python node which environment to use.
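A quick way to check which environment the Python node is really using, and whether pdfplumber is visible in it, is a small snippet like the one below. It only uses the standard library; the helper name `package_available` is made up for this example.

```python
import importlib.util
import sys


def package_available(name):
    """Return True if the package can be found by the running interpreter."""
    return importlib.util.find_spec(name) is not None


# Path of the interpreter in use; it should point into the py39_knime
# environment if the node is configured as intended.
print(sys.executable)
print(package_available("pdfplumber"))
```

If the printed path points somewhere other than the environment you updated, the node is most likely still configured to use a different Python installation.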

Sorry again, it seems something is missing from my packages: I can’t find imagemagick in the setup :sweat_smile:

Update: It works now, and I have already used the workflow. I got imagemagick from https://anaconda.org/conda-forge/imagemagick/files, installed it, and created an environment based on your blog. Thank you again for your help!

Hi @mlauber71, do you have a guide on how you create your path here? I can’t seem to grasp how you create the /data | ./ path:


I’m trying to explore the workflow and change the files, but it’s not working: the List Files/Folders node only reads the 3 sample files you imported.

I want to change the v_pdf_folder path so that the List Files/Folders node reads from it. I’m just confused about how to configure Collect Local Metadata so that it connects correctly to the Create File/Folder node. This is my output for Collect Local Metadata:


@Heldyyyyy in short, paths can be created like this. You could remove unwanted files from the folder. The List Files/Folders node also has an option to filter by name, or you could remove unwanted rows later.
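If the filtering is done on the Python side instead, the same idea (list the PDFs in a folder and keep only the ones whose name matches) could look roughly like this. The function name `list_pdfs` and its `name_contains` parameter are made up for illustration and are not part of the workflow.

```python
from pathlib import Path


def list_pdfs(folder, name_contains=""):
    """Return the .pdf files directly inside `folder`, sorted by path.

    `name_contains` is an optional case-insensitive filter on the file name.
    """
    folder = Path(folder)
    return sorted(
        p for p in folder.glob("*.pdf")
        if name_contains.lower() in p.name.lower()
    )
```

For example, `list_pdfs(v_pdf_folder, "lenovo")` would keep only files with "lenovo" in the name, assuming `v_pdf_folder` holds the folder path as a string.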

If you want the full story on KNIME paths, there is the lengthy but very useful File Handling Guide:

https://docs.knime.com/latest/analytics_platform_file_handling_guide/index.html#introduction
