Data Transformation of unstructured excel data

I’m relatively new with data transformations and KNIME and was wondering how you’d go about transforming an excel spreadsheet like the attached into a more usable format for power BI.

Each drawing comes under a heading like Mechanical schematics (lines with a blue or green fill). I’d like to obtain information on document status which can be filtered out by discipline, location, is it overdue, whilst keeping the raw information as well if more detail is required on the dashboard. I can’t change the excel trackers format as these are locked by the company and there will be more drawing records attached to this.

I’ve removed sensitive data like company name and location.

Tracker.xlsx (119.8 KB)

@lahiru_ten I came up with this workflow. I extract the color of the cells of the first column with the help of the Python integration and OpenPyxl and use that to identify the block forming the sub-tables.

Then I create a new column that would assign the “Drawing Number” to all lines - with the identified blocks.

A few things to note if you are new to KNIME:

The Path and the name of the sheet is being captured in Flow Variables to be used later. You can adapt this settings and also use this in Loops.

Column Expression node is used to set the Drawing Number when the right (family of) colours is being found. You could also work with a reference table here matching all Drawing Numbers.

The missing value node uses a special setting of bringing the previous entry to the missings so as to define the blocks of numbers

3 Likes

Hi @mlauber71

Much appreciated for your help. That is quite amazing.

I just realised you gave me your workflow.

Thanks

1 Like

@mlauber71

would it be possible to use a column expression node instead? so I use the location column and every time it says section, in a new column return the value in the first column and then use a missing value node to fill the values?

@lahiru_ten yes you could use a column expression to identify the start of a block of data. Which column would contain this text?

What I did was use the ‘Location’ column and every time it said ‘Section’ I’ve told knime to display the value in the first column.

On the tracker I sent, I removed that information as it may be considered sensitive information on the project. The same method could be used for the ‘latest’ ‘discipline’ or ‘service’ column.

I then used a set of string manipulation nodes to remove the word ‘Section’ as I found the missing value node seemed to copy the word section to other cells that should have dates in it.

I used a column renamer as the original tracker had merged cells and a number to string manipulator to change the content based on a table I created.

I did however end up getting these two errors, but since I got the desired results, is it correct for me to ignore these?


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.