Read PDF data and process it

Hi KNIME Team,

I need to read a PDF file and update the data as shown in the first screenshot. I tried the PDF Parser and Tika Parser nodes, but without success. I was able to read the data with R and convert it to a text file. I then read that text file with the File Reader node and used some manipulation nodes to filter the required data. I need only 13 columns, but this process created around 149. Some contain blanks and some contain question marks. Could someone help me remove these unwanted columns? The screenshot below shows a few of the columns with blank/empty/?? values. Also, which node should I use to run the R script, and is there a node through which I can extract the data from a PDF directly?

Your help on this issue is much appreciated.

Hi,

  1. To remove the columns containing only missing values you can use the Missing Value Column Filter node.
  2. To remove empty columns, I would first use a Transpose node, then a Row Filter to remove all rows that are empty, and then a second Transpose node.
  3. To use your R snippet inside of KNIME Analytics Platform you can use the R Source node.
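
For readers who want to see the column-dropping logic from points 1 and 2 spelled out, here is a minimal sketch in plain Python. It is not a KNIME API and not the original poster's workflow. The sample table, the function names `is_blank` and `drop_blank_columns`, and the rule that a cell made only of question marks counts as blank are all assumptions made for illustration.

```python
def is_blank(cell):
    """True for None, empty or whitespace-only strings, and cells made only of '?'."""
    if cell is None:
        return True
    text = str(cell).strip()
    return text == "" or set(text) == {"?"}


def drop_blank_columns(header, rows):
    """Keep only the columns that have at least one non-blank cell."""
    keep = [i for i in range(len(header))
            if any(not is_blank(row[i]) for row in rows)]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Hypothetical sample data, not taken from the original PDF.
header = ["Invoice", "Col1", "Date", "Col3"]
rows = [
    ["A-1", "", "2019-01-02", "??"],
    ["A-2", " ", "2019-01-03", None],
]

new_header, new_rows = drop_blank_columns(header, rows)
print(new_header)  # ['Invoice', 'Date']
print(new_rows)    # [['A-1', '2019-01-02'], ['A-2', '2019-01-03']]
```

A column is dropped only if every cell in it is blank, so a column with even one real value is kept.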

Best,
Kathrin

Hi Kathrin,

I successfully removed the columns with missing values using the Missing Value Column Filter node.
I have also transposed the data, but I couldn’t get any further with the Row Filter. Below are screenshots of how the data looks and of the Row Filter node configuration.


I need help with this.

Regards,
Pavan.

Hi Pavan,

Is it correct to describe your task as: delete all columns where Row 41 either has a missing value or is empty?

If yes, you can change the column to test in the configuration window to Row 41. Next, we have to find out what is actually in your empty cell: it could be truly empty, or it could contain a space. To check, I would copy one of the cells and paste it into the pattern matching input field.
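
If the exported text file is available outside KNIME, one way to check what an "empty" cell really contains is to list its characters. The following is a small Python sketch of that idea. The helper name `describe_cell` and the sample cell values are assumptions for illustration and do not come from the original data.

```python
import unicodedata


def describe_cell(cell):
    """Return (code point, Unicode name) for each character in a cell."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in cell]


# Hypothetical cell values: truly empty, a space, a no-break space, a question mark.
for cell in ["", " ", "\u00a0", "?"]:
    print(repr(cell), len(cell), describe_cell(cell))
```

A truly empty cell gives an empty list, while a cell that only looks empty shows up as, for example, SPACE or NO-BREAK SPACE.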

Cheers,
Kathrin

Hi Kathrin,

I copied the cell and pasted it into the pattern matching input field, but it made no difference, so I believe the cell is really empty. Could you please suggest how to proceed?

Thank you.

Regards,
Pavan.

Hi Pavan,

Sorry that this idea of mine didn’t work. Here are two other things that you can try:

  1. If the empty columns are also constant columns AND none of the columns you still need are constant, you could use the Constant Value Column Filter node.

  2. You can use the String Manipulation node after the Transpose node to replace the empty strings with a placeholder, for example 1. To do this, use the following function in the String Manipulation node: regexReplace($test$, "^$", "1"). Then filter out all rows with the value 1.
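
To illustrate the second suggestion, here is a rough Python sketch of the mark-then-filter idea. It uses Python's `re` module rather than KNIME's `regexReplace`, so treat it as an analogy, not the node's exact behaviour. The row data, the helper `mark_empty`, and the looser pattern `^\s*$` are assumptions added for illustration.

```python
import re

MARKER = "1"  # same placeholder value as suggested in the post


def mark_empty(value, pattern=r"^$"):
    """Replace the value with MARKER if the pattern matches it."""
    return re.sub(pattern, MARKER, value)


# Hypothetical transposed rows: (row id, value in the tested column).
rows = [("Col0", "Invoice"), ("Col1", ""), ("Col2", " "), ("Col3", "Date")]

# "^$" only matches truly empty strings, so the single-space row survives.
kept_strict = [rid for rid, v in rows if mark_empty(v) != MARKER]

# "^\s*$" (an assumed variation) also catches whitespace-only values.
kept_loose = [rid for rid, v in rows if mark_empty(v, r"^\s*$") != MARKER]

print(kept_strict)  # ['Col0', 'Col2', 'Col3']
print(kept_loose)   # ['Col0', 'Col3']
```

One caveat with any placeholder: a row whose real value is already 1 would be filtered out too, so it may be safer to pick a marker that cannot occur in the data.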

I hope one of the two works for you. If not, please let me know and maybe share part of your workflow, so I can look into it.

Best,

Kathrin

Hi Pawanmtm,

I have a similar requirement where I need to read a PDF file and extract details such as Invoice No., Date, Price, etc.

However, I’m unable to read the data, and you mentioned that you used R code to achieve this.

Do you know of any alternative methods to read the data (I have no background in R)? I haven’t been able to get past the first step yet.

Thanks in advance

There is no precise and stable solution, but you can experiment with nodes as I did in the workflow below.

Hi Poojitlingam,

Welcome to the KNIME community.

As @izaychik63 mentioned, this can be done in several ways, depending on what is most convenient. If you are still unable to do it, please share some sample or dummy data so that the community can help you.

Regards,
Pavan.

Hey Poojitlingam,

Did you find a solution for reading the data from your PDF file as described?

I have a similar issue and also want to extract details from the PDF, but haven’t found any solution for the problem yet.

Cheers,
Laura

Hi Laura,

Yes, I was able to read the data using the Tika Parser. After that, most of my time was spent manipulating the data. See the attached workflow. Let me know if that helps you.
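
As a rough illustration of the kind of manipulation that follows once a parser has returned plain text, here is a short Python sketch that pulls fields out with regular expressions. It is not the attached workflow. The sample text, the field labels, and the patterns in `PATTERNS` are all hypothetical and would need to be adapted to the real document layout.

```python
import re

# Hypothetical text as a PDF parser might return it; real layouts will differ.
text = """Invoice No.: INV-1042
Date: 2019-03-15
Price: 1,250.00
"""

# Hypothetical patterns; each must be adapted to the actual document.
PATTERNS = {
    "invoice_no": r"Invoice No\.?:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "price": r"Price:\s*([\d,]+\.\d{2})",
}


def extract_fields(text):
    """Return the first match for each pattern, or None if it is not found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(text)
print(fields)
# {'invoice_no': 'INV-1042', 'date': '2019-03-15', 'price': '1,250.00'}
```

Fields that cannot be found come back as None, which makes it easy to spot PDFs whose layout does not match the patterns.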
