PDF Table to Excel

Hi Knime Community, I have this type of PDF table where the value from one cell has 2 or more rows (it’s like a value that is in a merged cell). I did try the workflows that I found here (Thank you to the members who are kind enough to share their initial workflows) like the use of tika parser node to read the pdf file then use the cell splitter to split the contents and divide by columns, etc… Here’s my scenario:

Sample Data header and cell value that uses 2 or more rows but combine as one (merge):
data table

Scenario 1 that I try:
After using tika parser, cell splitter and transpose node; the value in the Product description wherein it uses many rows (merge) picks up by knime by row. I’m thinking of combining the column for Product Description + Price Start Date into one column, but how can I move the description that is in left columns?

Scenario 2 that I try:
Tika parser - cell splitter content - ungroup then group by; it reads all the line in one row, How can I split the specific values per column (I suppose will use splitter and some type of pattern/numbering matching but I don’t know how will it pick up the other’s, it was my guess :sweat_smile:)

Im thinking of combining the columns with a specific row iteration because the pattern is for every 1-9 rows or for every missing column ( like group them) Im just not sure how to execute this one.

Hi @Heldyyyyy,

converting PDF to well structured tables is always a gamble. Despite PDFs being some sort of XML, the data written can be vastly different. Do you mind sharing some sample data so the community / I can have try providing you a solution or at least a starting point please?

Best
Mike

2 Likes

Hi @mwiegand thank you for replying, this is my current workflow after 2 days of working on it :sweat_smile:, what I did is to get the table first and try to fix it to convert into a excel table. What I currently did is seperate the 1st row of Line item number and Product code then transpose to columns then merge the columns and I added loop variable so that it can read the row transformation from top do bottom. I’m still unsure on what to do next, and I’m trying to figure it out.

PDF to Excel.knwf (73.1 KB)

This is the sample raw file for reference I get this from the website of lenovo and the given data here is dummy

LENOVO.pdf (183.2 KB)

@Heldyyyyy without having had a chance to look at your workflow one option could be to split the text with Regex by comma when this comma is not enclosed in brackets.

Then if not all cells do have the same order of (future) columns you could try and have a hierarchy of key words to identify columns like Item, BT, Wifi and so on.

1 Like

Hi @Heldyyyyy,

thanks for sharing and thankfully the PDF was created in a quite proper way. What I did was:

  1. Read the Excel using the Tikka Parser
  2. Copied the text to an editor like Sublime to inspect the structure
  3. Injected markers for start and end of the desired data
  4. Injected markers for line breaks
    Note: This is very important as extraction of data which contains line breaks in each cell can be tricky!
  5. Structured and sanitized the data
  6. Sequentially extracted and formatted the data

Let me know if that suits your needs.

PS: I could have used loops but to illustrate the process I felt the less abstraction there is, the easier it is to understand, play with and learn from it.

Best
Mike

4 Likes

Thank you for this, I somehow get the idea in order to break it down it should have like a marker, I’m just a little confuse with what does “Deli starts” do. I mean I check the node but I did not see the function of the deli start?

I’ll try this process. Thank You!

Hi @Heldyyyyy,

please read the node description right below it. The $DELI START$ marker indicated the start of the data you want to extract.

Best
Mike

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.