Hi KNIME Team,
I need to read a PDF file and update the data as shown in the first screenshot. I tried the PDF Parser and Tika Parser nodes, but without success. I was able to read the data with R and convert it to a text file. I then read that text file with the File Reader node and used some manipulation nodes to filter the required data. I need only 13 columns, but this process created around 149 columns. Some contain blanks and some contain question marks. Could someone help me remove those unwanted columns? Below is a screenshot of a few of the columns containing blanks or "??". Also, which node should I use to run the R script, or is there a node through which I can extract the data from a PDF?
Your help on this issue is much appreciated.
I was able to remove the missing values with the Missing Value Column Filter node.
I have also transposed the data, but I couldn’t get any further with the Row Filter node. Below are screenshots of how the data looks and of the Row Filter node configuration.
Need help on this.
Is it correct that I could describe your task as: delete all columns where Row 41 has either a missing value or is empty?
If yes, you can change the column to test in the configuration window to Row 41. Next, we have to find out what is actually in your empty cell: it could be truly empty, or it could contain a space. To check, I would copy one of the cells and paste it into the pattern matching input field.
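Outside KNIME, the same check (is the cell truly empty, or does it hold an invisible character?) could be sketched in Python as below. This is only an illustration, not a KNIME node configuration: the function name `describe_cell` and the sample values are made up for the example.

```python
def describe_cell(value):
    """Classify a cell as missing, empty, whitespace-only, or having content."""
    if value is None:
        return "missing"
    if value == "":
        return "empty"
    if value.strip() == "":
        # repr() makes hidden characters visible, e.g. ' ' or '\xa0'
        return "whitespace-only: " + repr(value)
    return "content"


# Hypothetical sample cells, including a non-breaking space (U+00A0)
for cell in [None, "", " ", "\u00a0", "?", "Invoice"]:
    print(repr(cell), "->", describe_cell(cell))
```

A cell reported as "whitespace-only" would not match an empty pattern, which is one possible reason a filter on empty values appears to do nothing.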
I copied the cell and pasted it into the pattern matching input field, but saw no change, so I believe the cell really is empty. Could you please suggest how to proceed?
Sorry that this idea of mine didn’t work. Here are two other things you can try:
If the empty columns are also constant columns AND none of the columns you still need are constant, you could use the Constant Value Column Filter node.
You can use the String Manipulation node after the Transpose node to replace the empty strings with a marker value, for example 1. To do so, use the following function in the String Manipulation node: regexReplace($test$, "^$", "1"). Then filter out all rows with the value 1.
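To show the logic of the second suggestion outside KNIME, here is a minimal Python sketch. It assumes a transposed table in which each row is a former column and the tested value comes from Row 41. The row names, sample values, and the broader pattern `BLANK_OR_QUESTION` are assumptions for illustration only.

```python
import re

# Hypothetical transposed table: (former column name, value taken from Row 41)
rows = [("Col0", "Invoice No."), ("Col1", ""), ("Col2", "Date"),
        ("Col3", ""), ("Col4", "??")]

MARKER = "1"  # must not occur as a real value, or that row would be dropped too


def mark_empty(value):
    # "^$" matches only a string that contains no characters at all
    return re.sub(r"^$", MARKER, value)


# Step 1: replace empty strings with the marker. Step 2: filter out marked rows.
marked = [(name, mark_empty(value)) for name, value in rows]
kept = [(name, value) for name, value in marked if value != MARKER]
print(kept)

# Assumed variant: also treat whitespace-only and question-mark-only cells as blank
BLANK_OR_QUESTION = re.compile(r"^[\s?]*$")
kept_broad = [(name, value) for name, value in rows
              if not BLANK_OR_QUESTION.match(value)]
print(kept_broad)
```

Note that the plain `^$` pattern keeps the "??" cell, so columns filled with question marks would need a broader pattern such as the assumed variant above.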
I hope one of the two works for you. If not, please let me know and maybe share part of your workflow, so I can look into it.
I have a similar requirement where I need to read a PDF file and extract details such as Invoice No., Date, Price, etc.
However, I’m unable to read the data, and you mentioned that you used R code to achieve this.
Do you know of any alternative methods to read the data (I have no background in R)? I haven’t been able to get past the 1st step yet.
Thanks in advance
There is no precise and stable solution, but you can experiment with nodes as I did in the workflow below.
Welcome to the KNIME community.
As mentioned by @izaychik63, this can be done in multiple ways, depending on what is convenient. If you are still unable to do it, please share some sample or dummy data so that the community can lend a helping hand.
Did you find any solution for reading the data out of your PDF file as described?
I have a similar issue and also want to extract details from the PDF, but haven’t found any solution for the problem yet.
Yes, I was able to read the data using Tika Parser. After that, most of my time was spent manipulating the data. See the attached workflow. Let me know if that helps you.
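For readers who want to see what the manipulation step after parsing could look like, here is a rough Python sketch that pulls the fields mentioned earlier in the thread (Invoice No., Date, Price) out of already extracted text. It is not the attached workflow. The sample text, the "label: value" layout, and the names `FIELD_PATTERNS` and `extract_fields` are all assumptions; real PDFs will need patterns adapted to their layout.

```python
import re

# Hypothetical text as it might come out of a PDF parser
sample_text = """Invoice No.: INV-001
Date: 2020-01-31
Price: 149.00
"""

# Assumed "label: value" layout, one pattern per field
FIELD_PATTERNS = {
    "invoice_no": r"Invoice No\.?\s*:\s*(\S+)",
    "date": r"Date\s*:\s*(\S+)",
    "price": r"Price\s*:\s*([\d.,]+)",
}


def extract_fields(text):
    """Return the first match for each field, or None if the label is absent."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


print(extract_fields(sample_text))
```

Fields whose label is not found come back as None, which makes it easy to spot documents that do not follow the assumed layout.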