Extract Tables from PDF

Haystack · October 20, 2018, 5:59pm

Hello all,

I am trying to extract a table from a PDF file, similar to the way the ‘tabulizer’ package works in R. Is this possible in KNIME?

After looking through the forum and the sample workflows, I’ve only found examples extracting text or metadata from PDFs. I want to get the table directly. Has anyone tried to do this?

Many thanks.

Geo · October 20, 2018, 8:30pm

In the worst case, you can use an R snippet or an R table reader to call the said package and then continue in KNIME from there…

Haystack · October 20, 2018, 9:04pm

Thanks @Geo for the suggestion. However, I’m not very fluent with R, and I am hoping it would be easier to do in KNIME.

Extracting tabular data from PDFs seems to be a common requirement. Is there a way to do it in KNIME?

Thanks

JAGBI · October 23, 2018, 1:38am

Hi,

Any luck? I’m also trying to extract a table from a PDF file. Since I couldn’t upload a sample PDF file, here is a PDF link. Could someone provide the steps to extract the table on page 108?

Thanks
Jag

Haystack · October 23, 2018, 2:42pm

I have not. Can a KNIME Team member tell us if this is possible?

Thank you.

oole · October 23, 2018, 3:58pm

Hello everyone,

It is possible if the PDF allows it, meaning that the string we get from the PDF represents the table consistently enough that we can manipulate the string to extract the table.

I took @JAGBI’s example and parsed the first three pages as an example. In order to extract the whole table/document, some more string manipulation would have to be done. The magic happens in the Extract Table metanode, where the string is parsed to an actual table. The workflow would have to be adapted to other PDFs/tables, but it worked pretty well on the given PDF.
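Outside KNIME, the same idea (turning the extracted string into rows and cells) can be sketched in a few lines of Python. This is only an illustration and not the content of the metanode: the function name `text_to_table`, the sample text, and the assumption that columns are separated by two or more whitespace characters are all hypothetical and would have to be adapted to the actual PDF.

```python
import re

def text_to_table(text, delimiter=r"\s{2,}"):
    """Split extracted PDF text into rows (one per line) and cells (per delimiter)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(delimiter, line))
    return rows

# Hypothetical text as it might come out of a PDF parser
sample = "Name   Qty   Price\nApple   3   1.20\n\nPear   10   0.85\n"
table = text_to_table(sample)
for row in table:
    print(row)
```

The delimiter is a regular expression, so a different PDF might need a tab, a semicolon, or a fixed number of spaces instead.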

Here is the workflow: table_from_pdf.knwf (320.6 KB)

I hope it helps.

qqilihq · October 23, 2018, 5:40pm

Besides oole’s suggestions, there’s a Java lib called tabula-pdf which is optimized for table extraction from PDF files.

So, who’s willing to build a node around it?

izaychik63 · October 23, 2018, 5:46pm

It is not clear how the header splitter works. It would really be better to have a specialized node for this task.

JAGBI · October 24, 2018, 12:46am

This is awesome. Could you confirm whether Extract Table is a node or a macro? If it’s a macro, could you please share a link to a guide on how to create macros in KNIME?

Thanks
Jag

oole · October 24, 2018, 5:55am

@qqilihq That sure does look interesting, and might be a better/easier way to extract the table. I just played with it a little and it was not working all too well with the example provided by @JAGBI. Maybe I missed some configuration. We need some volunteers!

@izaychik63 I admit that it is not that straightforward: the parsed PDF has to be inspected to find out which delimiter the string extracted from the PDF should be split on.
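As a rough illustration of that inspection step, a small Python helper can check which candidate delimiter occurs the same number of times on every non-empty line. The function name `guess_delimiter` and the candidate list are assumptions made for this sketch, and real PDF text is often less regular than this check expects.

```python
def guess_delimiter(text, candidates=("\t", ";", "|", ",")):
    """Return the first candidate that appears equally often (and at least once) on every line."""
    lines = [line for line in text.splitlines() if line.strip()]
    for candidate in candidates:
        counts = {line.count(candidate) for line in lines}
        # One single, nonzero count means every line has the same number of columns
        if len(counts) == 1 and 0 not in counts:
            return candidate
    return None  # nothing consistent found; inspect the text manually

# Hypothetical parsed text; repr() makes tabs and other hidden whitespace visible
sample = "Date\tText\tAmount\n2018-10-01\tRent\t-700.00\n"
print(repr(sample))
print(repr(guess_delimiter(sample)))
```

If the helper returns `None`, looking at the `repr()` output line by line is usually the next step.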

@JAGBI Extract Table is simply an encapsulated group of nodes (a metanode). If you double-click the Extract Table metanode, you can inspect the nodes that I used.

muthmann · October 30, 2018, 8:42am

If the PDF is not too crappy, I used to use a combination of the PDF Parser and the Document Data Extractor to get the data from my bank account statements.

Veys · April 11, 2020, 11:26am

Hi all,

I followed Philipp’s suggestion and implemented a solution with Tabula, which I wanted to share with the community.

Basically, I am building a command string and executing it with the bash node.

Build the command with the String Manipulation node:
string("java -jar \"C:\\Users\\Veys\\Desktop\\Tabula\\jar\\tabula-1.0.3-jar-with-dependencies.jar\" \"") + string($PdfFilePath$) + string("\" --no-spreadsheet --stream --pages all --area 338.19,18.997,645.501,573.346 --outfile \"") + string($CsvFilePath$) + string("\"")

Result:
java -jar "C:\Users\Veys\Desktop\Tabula\jar\tabula-1.0.3-jar-with-dependencies.jar" "C:\Users\Veys\Desktop\Tabula\data\pdf\0795602.pdf" --no-spreadsheet --stream --pages all --area 338.19,18.997,645.501,573.346 --outfile "C:\Users\Veys\Desktop\Tabula\data\csv\0795602.csv"
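For anyone who wants to try the same call outside KNIME first, here is a minimal Python sketch that assembles the command with the flags from the line above. The function names `build_tabula_command` and `run_tabula` are made up for this example; the paths and the area values are copied from the command shown, with the area assumed to be in Tabula's top,left,bottom,right order.

```python
import subprocess

def build_tabula_command(jar_path, pdf_path, csv_path, area, pages="all"):
    """Return the Tabula call as an argument list (no manual quoting needed)."""
    area_arg = ",".join(str(value) for value in area)
    return [
        "java", "-jar", jar_path,
        pdf_path,
        "--no-spreadsheet", "--stream",
        "--pages", pages,
        "--area", area_arg,
        "--outfile", csv_path,
    ]

def run_tabula(command):
    """Run the command; raises CalledProcessError if Java/Tabula exits with an error."""
    subprocess.run(command, check=True)

command = build_tabula_command(
    r"C:\Users\Veys\Desktop\Tabula\jar\tabula-1.0.3-jar-with-dependencies.jar",
    r"C:\Users\Veys\Desktop\Tabula\data\pdf\0795602.pdf",
    r"C:\Users\Veys\Desktop\Tabula\data\csv\0795602.csv",
    area=(338.19, 18.997, 645.501, 573.346),
)
print(command)
# run_tabula(command)  # uncomment on a machine where Java and the jar are available
```

Passing the arguments as a list sidesteps the escaped quotes that the string version needs for paths containing spaces.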

Hope it helps

Haystack · April 11, 2020, 3:43pm

Hi Veys,

This is great! Does it work with all PDF Tables?

Thank you,
Haystack

Veys · April 12, 2020, 1:08am

Hi mate,
I reckon that if the table areas are the same, you should be able to extract them.
If you have headers with multiple lines, I’d suggest selecting the data area only, as Tabula does not seem to support multi-line headers.
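When only the data area is selected, the resulting CSV has no header row, so the column names have to be attached afterwards. A small Python sketch of that step follows; the function name `add_header`, the column names, and the sample rows are hypothetical and not taken from the actual PDF.

```python
import csv
import io

def add_header(csv_text, header):
    """Pair each headerless CSV row with the given column names."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore empty lines
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}: {row}")
        records.append(dict(zip(header, row)))
    return records

# Hypothetical headerless output of a data-area-only extraction
csv_text = "1,Apple,1.20\n2,Pear,0.85\n"
records = add_header(csv_text, ["id", "item", "price"])
for record in records:
    print(record)
```

The column-count check is there because a slightly shifted area tends to show up as rows with too few or too many cells.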

Cheers
Veys

JinnyLe · May 11, 2020, 2:31am

hi @Veys,
this is fantastic. I’m still quite new to KNIME, so there are some parts I’m not very clear on. Would you mind sharing the .knwf file with us too? Thank you so much!