Extract table from PDF

Hi everyone,

I used a Python Source node to extract tables from a PDF with the Python packages Tabula and Pandas.
The extraction does not work as expected: the output of the Python Source node is unstructured, and I cannot get the table exactly as it appears in the PDF.
The code I used in the Python Source node is as follows:

from tabula import read_pdf
import pandas as pd

# Read every table in the PDF; read_pdf returns a list of DataFrames
tables = read_pdf("examples.pdf", pages="all")  # path to the PDF file
# Pass the first extracted table to the node's output port
output_table = pd.DataFrame(tables[0])

Text that wraps onto a second line inside a table cell in the PDF comes out as a separate row in the KNIME output of the Python Source node.
Please find the example input PDF and the KNIME output in the attachments.

Hi SA20276736,

Could you provide a minimal workflow (including the example.pdf) so we can reproduce the issue directly?

Thanks
Steffen

@SA20276736 you could check out these links

https://hub.knime.com/search?q=pdf%20table&type=Workflow&tag=justknimeit-15&sort=best

https://hub.knime.com/search?q=pdf%20table&type=Workflow&sort=best
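
For the wrapped-cell problem described in the first post, one possible workaround is to fold the continuation rows back into the row above them after extraction. The sketch below is only an illustration under assumptions: it assumes that a continuation row has an empty first cell (None, NaN, or blank), which may not hold for every PDF. The helper names `is_empty` and `merge_wrapped_rows` and the sample rows are made up for this example and are not part of Tabula or KNIME. If the table in the PDF has ruled cell borders, it may also be worth trying `lattice=True` in `read_pdf` first.

```python
import math


def is_empty(cell):
    """Return True for None, NaN, or blank strings (typical empty cells)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return str(cell).strip() == ""


def merge_wrapped_rows(rows, key_index=0):
    """Fold rows whose key cell is empty into the preceding row.

    Assumption: a row with an empty cell at key_index is the wrapped
    continuation of the row above it, not a genuine record.
    """
    merged = []
    for row in rows:
        if merged and is_empty(row[key_index]):
            previous = merged[-1]
            for i, cell in enumerate(row):
                if is_empty(cell):
                    continue
                text = str(cell).strip()
                # Append the wrapped text to whatever the cell already holds
                if is_empty(previous[i]):
                    previous[i] = text
                else:
                    previous[i] = f"{previous[i]} {text}"
        else:
            # Copy the row so the input list is left untouched
            merged.append(list(row))
    return merged


# Hypothetical rows, shaped like an extracted table with one wrapped cell
rows = [
    ["1", "Widget", "A long description that"],
    [None, None, "wraps onto a second line"],
    ["2", "Gadget", "Short text"],
]
merged = merge_wrapped_rows(rows)
```

In the Python Source node, the rows could come from the extracted DataFrame via `tables[0].values.tolist()`, and the result could be handed back with `output_table = pd.DataFrame(merged, columns=tables[0].columns)`, where `tables` is the list returned by `read_pdf`.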
