How to read multiple lines from PDF File

Subramanyam · August 29, 2022, 1:55am

Hi Team,

I want to read multiple lines from a PDF file. I have used the PDF Parser node as well as the Tika Parser node, but I am unable to fetch the details. Please help me resolve the issue.
We already have a workflow that extracts the PDF details when the file contains only one line of details.
I am uploading the PDF file in doc format, along with the workflow, as I am unable to upload it in PDF format.
SG 4503089325.pdf.docx (63.2 KB)
Automation-IDOC–Singapore.knwf (126.9 KB)

Please let me know if you need further details.

Thanks,
Subramanyam Kinthada

ScottF · August 29, 2022, 1:47pm

Hi @Subramanyam -

Please don’t post new topics in the KNIME Development forum unless they are specifically related to node development, the KNIME SDK, or similar. I have moved your topic to the main Analytics Platform forum.

Is this question related to your existing topic, "unable to read all the details from PDF file using PDF parser node", or is this new?

Subramanyam · August 29, 2022, 2:54pm

Hi @ScottF,

Yes, it is related to the same topic, but here the input file consists of multiple lines, and hence we are unable to fetch the details using the PDF Parser or Tika Parser.

From the input PDF file we need to generate two text files: a header file and a detail file.

The header file consists of the details below from the PDF. As the PDF has multiple lines, we need to generate that many header lines.
PO Number: 4503089325
Shipment code: 2003085667
PO Date: 09.05.2022
Delivery date: 11.05.2022
Quantity
Amount

The output files have to be in the format below:
edrpod_2000894913_20220617_1.txt (20.0 KB)
edrpoh_2000894913_20220617_1.txt (299 Bytes)

Please let me know if any further details are required.

Thanks,
Subramanyam Kinthada.

Subramanyam · August 30, 2022, 2:17pm

Hi Team,

I have attached all the required files. Please help me resolve the issue.
Let me know if you need further details.

Thanks,
Subramanyam Kinthada.

ScottF · August 30, 2022, 3:19pm

I took a look at what you posted. As you know, parsing text from PDFs is a non-trivial and oftentimes very frustrating problem, so I sympathize. But you should realize that without focused details it is unlikely that anyone is going to be able to help you. I say this because:

You have posted a large workflow without context on specifically where you are having a problem. Your workflow is largely unannotated and without any descriptive text. Basically, no one can understand what you’ve already tried.
It is difficult to understand at a glance what your output represents, as there are no headers in the output datasets.
Parts of your workflow just don’t execute (e.g. Cross Joiner, String Manipulation)

I would suggest breaking your workflow into smaller chunks, and focusing on one problem at a time. Remember that people here in the forum are volunteering their time to help you; this isn’t a consulting firm. So give them as many details as possible in a concise way, to help them help you.

mlauber71 · August 30, 2022, 4:02pm

@Subramanyam you could try using the R package pdftools to extract the text. It will also extract the content of the table, but there might be better ways to deal with tables, as we have already discussed (Solutions to "Just KNIME It!" Challenge 15, Extract Table from PDF with the help of R "tabulizer" and KNIME – KNIME Hub).

Every page will be in a table row so you might take it from there and start splitting and manipulating the header in order to get your information.
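
To give a rough idea of what that splitting step could look like, here is a minimal sketch in Python (not R, and not taken from the attached workflow). It assumes the text of one page has already been extracted into a string. The sample text, the field labels, the line-item layout and the function names parse_page and header_rows are all hypothetical, since the real layout of the PDF is not shown in this thread, so the patterns would need adjusting to the actual text.

```python
import re

# Hypothetical page text standing in for one extracted page (one table row).
# The real PDF layout is not shown in the thread; labels and the line-item
# format below are assumptions for illustration only.
SAMPLE_PAGE = """\
PO Number: 4503089325
Shipment code: 2003085667
PO Date: 09.05.2022
Delivery date: 11.05.2022
10 ITEM-A 5 100.00
20 ITEM-B 2 40.50
"""

# One pattern per header field; each captures the value after the label.
HEADER_PATTERNS = {
    "po_number": re.compile(r"PO Number\W*(\d+)", re.IGNORECASE),
    "shipment_code": re.compile(r"Shipment code\W*(\d+)", re.IGNORECASE),
    "po_date": re.compile(r"PO Date\W*(\d{2}\.\d{2}\.\d{4})", re.IGNORECASE),
    "delivery_date": re.compile(
        r"Delivery\s*date\W*(\d{2}\.\d{2}\.\d{4})", re.IGNORECASE
    ),
}

# Assumed line-item layout: position, material, quantity, amount.
ITEM_PATTERN = re.compile(r"^(\d+)\s+(\S+)\s+(\d+)\s+(\d+\.\d{2})$")


def parse_page(text):
    """Return (header dict, list of line-item dicts) for one page of text."""
    header = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        header[name] = match.group(1) if match else None

    items = []
    for line in text.splitlines():
        match = ITEM_PATTERN.match(line.strip())
        if match:
            position, material, quantity, amount = match.groups()
            items.append({
                "position": position,
                "material": material,
                "quantity": int(quantity),
                "amount": float(amount),
            })
    return header, items


def header_rows(header, items):
    """Repeat the header once per line item, adding quantity and amount."""
    return [
        {**header, "quantity": item["quantity"], "amount": item["amount"]}
        for item in items
    ]


header, items = parse_page(SAMPLE_PAGE)
rows = header_rows(header, items)
```

The idea is that the header fields are read once per page, and then one header row is produced for every line item found, which is one way to handle a PDF with multiple lines. Whether a regular expression per field is robust enough depends on how consistent the extracted text turns out to be.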

kn_example_r_pdf_read_text.knwf (95.6 KB)

Then I would agree with what @ScottF says. It is difficult to follow what you want to do and at which point the community might be able to help. BTW you can enclose data within the node (preferably in the subfolder /data/) so you can have a complete example that other people might be able to run.

Subramanyam · September 2, 2022, 3:44am

Hi @mlauber71,

I have tried the R source table extract node but I am facing the below issue.

Do we have any other node that will extract the PDF details into table format, so that I can proceed with the next formatting steps?

Thanks in advance.
Subramanyam Kinthada.

mlauber71 · September 2, 2022, 4:51am

@Subramanyam you will have to install R and the package mentioned. Best to start with the official guide:

https://docs.knime.com/latest/r_installation_guide/index.html

Or you can use Conda Environment Propagation and Python to install R as well.

badger101 · September 2, 2022, 5:41am

I notice this isn’t the first time this has happened, @ScottF, since you also mentioned this to them a while ago:

In that thread, your comment (and @Daniel_Weikert's comment too) wasn’t acknowledged at all, as is the case in this one. I’m curious to see what happens in the future.

Subramanyam · September 12, 2022, 10:00am

Hi @mlauber71,

Thanks for your suggestion.

As the process recommended above is not accepted on our servers, is there any other way to extract the details from the table in the attached PDF file?

If it's possible, please help me resolve the issue.

I am attaching the PDF file with a .docx extension. This is dummy data.
SG 4503089325.pdf.docx (63.2 KB)

Thanks in advance,
Subramanyam Kinthada.

Subramanyam · September 14, 2022, 6:11pm

Hi Team,

While installing Rserve with the commands below, using Rtools42, I am facing the error below. Please help me resolve the issue. Please refer to the screenshot below.

Thanks,
Subramanyam Kinthada.

mlauber71 · September 14, 2022, 9:00pm

@Subramanyam installing R and making it work with KNIME can still pose some challenges. You might want to start by reading the official guide:

https://docs.knime.com/latest/r_installation_guide/index.html#_introduction

For the installation of Rserve (which you will need), this entry might help:

Then there is the collection, which also describes the option to use Conda Environment Propagation to install R and the necessary dependencies with the help of conda.

Here is the conda and R installation with an example (maybe use Miniforge as your conda environment installer https://github.com/conda-forge/miniforg)
