PDF Parsing and Extraction

Heldyyyyy · June 13, 2024, 6:13am

Hi, I have a sample PDF file from which I need to extract tables into an Excel spreadsheet. I’ve started by extracting the first table using the sample workflows I’ve seen here in the forum, but I’m experiencing some problems. I don’t know how to separate the “Date” values from “Gen, G4 and 3, Intel” into separate columns. They should be in separate columns and not combined.

I would appreciate any help or guidance on how to correctly extract the table.

Here’s the sample pdf file:
LENOVO.pdf (75.5 KB)

This is my current workflow:
PDF Extracting.knwf (87.9 KB)

PDF Table:

KNIME:

Gasperto · June 13, 2024, 9:48am

Hi, can you maybe provide a “what you expect” vs. “what you get” comparison?
In your KNIME table all the dates are in single cells, so I don’t understand what you want to be different.

Daniel_Weikert · June 13, 2024, 3:33pm

Hi, your output looks like the values are already in separate columns after the Cell Splitter.
Based on the text, I think a regex would probably be more helpful for getting the related information into the same columns.
Best regards

Heldyyyyy · June 14, 2024, 1:06am

Hi @Gasperto, I want them to look like this:

Then, when I use the Column Merger and String Manipulation nodes, this will be the final result:

I have nodes below that fix the product description column, and I want their output concatenated with the table above:

It will look like this:

The final output of the table after concatenation should be this:
(before)

(after)

Heldyyyyy · June 14, 2024, 1:28am

Hi @Daniel_Weikert, sorry if it’s not clear. Yes, they are in separate columns, but the date values and the text values should not be in the same column, because the text is part of the product description. It should look like this:

Once they are separated, I can combine the text with the first 2 columns, and it should look like this:

I don’t know how to do this, that is, how to move the text values out of the date columns into their own column.
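
The separation described above (dates stay in their columns, leftover text moves into the product description) can be sketched with a regular expression. This is a minimal illustration, not the workflow from this thread: the date formats, the sample row, and the function name `split_dates_and_text` are assumptions, since the thread does not show the actual cell contents.

```python
import re

# Assumption: dates look like 06/13/2024 or 2024-06-13.
# Adjust the pattern to whatever format the real PDF uses.
DATE_PATTERN = re.compile(r"^(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})$")


def split_dates_and_text(cells):
    """Split one row of cells into (list of date cells, joined text cells)."""
    dates, text = [], []
    for cell in cells:
        value = (cell or "").strip()
        if not value:
            continue  # skip missing or empty cells
        if DATE_PATTERN.match(value):
            dates.append(value)
        else:
            text.append(value)
    return dates, " ".join(text)


# Hypothetical row in which description fragments landed in date columns.
row = ["06/13/2024", "Gen", "07/01/2024", "Intel", None, ""]
dates, description_tail = split_dates_and_text(row)
print(dates)
print(description_tail)
```

The joined text could then be appended to the product description column. Inside KNIME, the same idea would likely be expressed with a regex-based node per column rather than with Python.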

mlauber71 · June 14, 2024, 7:50am

@Heldyyyyy here is one approach using Python and pdfplumber, followed by a Cell Splitter. Some column types, such as dates, might have to be converted afterwards.
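
For readers who want a rough idea of what the pdfplumber step does: it opens the PDF, walks through the pages, and returns each detected table as rows of cell strings. The following is a minimal sketch and not the code from the linked workflow; the function names `extract_tables` and `clean_table` are hypothetical, and pdfplumber must be installed in the environment the Python node uses.

```python
def extract_tables(pdf_path):
    """Return every table pdfplumber detects, each as a list of rows."""
    # Third-party package; imported here so clean_table works without it.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of str or None.
            tables.extend(page.extract_tables())
    return tables


def clean_table(table):
    """Replace missing cells with "" and collapse line breaks inside cells."""
    return [[" ".join((cell or "").split()) for cell in row] for row in table]


# Usage, assuming the sample file from the first post is in the working folder:
# for table in extract_tables("LENOVO.pdf"):
#     for row in clean_table(table):
#         print(row)
```

Cells that span several lines in the PDF usually come back with embedded line breaks, which is why the cleaning step is kept separate from the extraction.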

More on extracting tables from PDF:

I also experimented with the use of LLMs and JSON formats but this is not done yet for such a case.

Heldyyyyy · June 18, 2024, 1:32am

Hi @mlauber71, thank you for this. I’m actually trying your workflow, but I can’t seem to set the path to the conda directory in KNIME. I already have conda installed. I’m using Windows 11.

Heldyyyyy · June 18, 2024, 2:14am

Update:

This is the one that works:

I can’t seem to get the py39_knime environment.

mlauber71 · June 18, 2024, 4:52am

@Heldyyyyy you should use the yml configuration file at the end of the article. More on how KNIME and Python work together here:

Heldyyyyy · June 18, 2024, 7:37am

Thank you @mlauber71, the blog and files really help! I just want to ask: what does this plumber do?

Sorry for the question. I don’t code and have limited knowledge of this type of approach. I will look into it more in the future as my workflow knowledge advances.

Heldyyyyy · June 18, 2024, 8:54am

Should I edit my py39_yaml file and paste this in so that I get pdfplumber in the Python library? Or should I just enter these in the prompt one by one (starting with the 3rd line)? I ask because I only followed the activation of py39 and did not include this.

mlauber71 · June 18, 2024, 9:02am

You can add “pdfplumber” to the list of packages.

And then use the conda env update command to add the package to your environment. This way you keep the information about which packages you have.

conda env update --name py39_knime --file "C:\\Users\\x123456\\knime-workspace\\py39_knime.yml"

Make sure you tell the Python node which environment to use.
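
To confirm that the Python node really runs in the updated environment, a short check like the one below can be pasted into the node or run from the activated environment. It is only a sketch: the helper name `is_installed` is hypothetical, and the package list is an assumption based on what this thread mentions.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if a top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


# The interpreter path should point into the py39_knime environment.
print("Interpreter:", sys.executable)
for name in ("pdfplumber", "pandas"):
    print(name, "available:", is_installed(name))
```

If the interpreter path points somewhere other than the intended environment, the node is probably still configured to use a different one.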

Heldyyyyy · June 19, 2024, 2:08am

Sorry again, it seems that something is missing from my packages. I can’t find imagemagick in the setup.

Heldyyyyy · June 19, 2024, 4:59am

Update: It works now, and I have already used the workflow. I got imagemagick from here: https://anaconda.org/conda-forge/imagemagick/files, installed it, and created an environment based on your blog. Thank you again for your help!

Heldyyyyy · June 20, 2024, 2:00am

Hi @mlauber71, do you have a guide on how you create your path here? I can’t seem to grasp how you create the /data | ./ path:

I’m trying to explore the workflow and change the files, but it’s not working. It only works with the samples you imported, where the List Files/Folders node reads 3 files.

I want to change the v_pdf_folder path so that the List Files/Folders node reads from it. I’m just confused about how to configure the collect local metadata step so that it connects correctly to the Create File/Folder node. This is my output for collect local metadata:

mlauber71 · June 20, 2024, 4:48am

@Heldyyyyy in short, paths can be created like this. You could remove unwanted files from the folder. The List Files/Folders node also has an option to filter by name, or you could remove unwanted rows later.

If you want the full story on KNIME paths, there is the lengthy but very useful guide:
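
As a plain Python analogy for listing a folder and filtering by name (this is not how the KNIME nodes are implemented, and the function name `list_pdf_files` and its parameters are hypothetical), the idea looks roughly like this:

```python
from pathlib import Path


def list_pdf_files(folder, name_contains=""):
    """List the .pdf files directly inside folder, optionally filtered by name."""
    folder = Path(folder)
    return sorted(
        path
        for path in folder.glob("*.pdf")
        if name_contains.lower() in path.name.lower()
    )


# Example: list PDFs in the current working directory.
# Replace "." with the folder you would otherwise put into v_pdf_folder.
for pdf_path in list_pdf_files("."):
    print(pdf_path.name)
```

Filtering at listing time keeps unwanted files out of the later steps, which mirrors the name filter option mentioned above.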

https://docs.knime.com/latest/analytics_platform_file_handling_guide/index.html#introduction
