Read Word-document, including tables

Hi guys!
So, I’ve got a bunch of Word documents that I need to process.
The documents contain job descriptions that I need to fetch, transform, and push out as an import file for the target system.

The documents are formatted somewhat like this:
1.1 Job Title 1
Responsibilities
Bla bla bla bla…
[table]
Requirements
Bla bla bla…
1.2 Job Title 2
Responsibilities
Bla bla bla bla…
Requirements
Bla bla bla…

Some of the job descriptions have a table.

So far in KNIME, I’ve been able to set up some rules to identify the paragraphs and separate the job positions. I end up with a nice table containing all the paragraphs, plus extra columns that carry the “1.1” number forward until the next job title appears. All this is working fine, but what would be the best way for me to also preserve the tables?

The end result will be a text file to be imported into the target system, which supports HTML.

Thanks for any information that can point me in the right direction!

@Stianbl you can try using an R package to extract tables from an MS Word document. Don’t mind the title, it is about .DOCX:

More examples can be found on the hub:
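
Separately from the linked examples, if you would rather not use R, here is a minimal Python sketch (for instance inside a Python Script node) of the same idea: a .docx file is a zip archive, and word/document.xml lists paragraphs and tables in document order, so each table can be emitted as HTML at the position where it occurs. The function names (docx_blocks and the two helpers) are made up for illustration, and the sketch assumes simple tables only: merged cells, nested tables, and styling are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used by word/document.xml inside a .docx
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element):
    """Concatenate all text runs (w:t) found below an element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def _table_to_html(tbl):
    """Render a w:tbl element as a plain HTML table (hypothetical helper)."""
    rows = []
    for tr in tbl.findall(W + "tr"):
        cells = []
        for tc in tr.findall(W + "tc"):
            # A cell may hold several paragraphs; join them with <br>.
            # Only direct child paragraphs are read, so nested tables are skipped.
            paras = [escape(_text(p)) for p in tc.findall(W + "p")]
            cells.append("<td>" + "<br>".join(paras) + "</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def docx_blocks(source):
    """Return (kind, html) tuples for body paragraphs and tables, in document order.

    `source` is a path or a binary file-like object pointing at a .docx file.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    blocks = []
    for child in body:
        if child.tag == W + "p":
            text = _text(child)
            if text.strip():  # skip empty paragraphs
                blocks.append(("paragraph", escape(text)))
        elif child.tag == W + "tbl":
            blocks.append(("table", _table_to_html(child)))
    return blocks
```

Because each table comes back as one row in sequence with the paragraphs, the existing rules that carry the “1.1” number forward should also assign the table to the right job title, and the HTML string can go straight into the import file.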


Thanks! Will have a look and test the nodes mentioned :slight_smile:
