Table Normalization: Mapping (Index) Table for one key and many values

dobo · October 17, 2019, 6:50am

I would like to bring a table into 4th normal form, using a normalization node. This would yield great speedup for any type of workflow.

After loading my data into KNIME, a lot of that data is redundant (not normalized). Mostly there are always Key-Value combinations. But also cases where several values (hierarchies) can be mapped to a key (e.g. Company->Country->Region, each also being a Key-Value combination which results in 7 values for one key which can be associated 1:1, i.e. a mapping should take O(1)). So I should be able to extract the values and only pass the keys through the KNIME flow. This would save a lot of computation time and also memory. At the end of the flow I would then add the values back to the data structure.

The only solution I see so far would be using a Joiner node. However this seems to yield very bad performance (definitely not O(1)).

Anybody able to help me with this?

mlauber71 · October 17, 2019, 9:43am

Not exactly sure what your question is. If you want to do joins you could try and use a local database and KNIME’s generic database nodes. H2 offers the possibility of in memory DBs so given you have enough power on your machine that may speed up joins.

Otherwise you might want to upload an example of what kind of database question you have. In the end from my experience you will have to test which solution is best for you. An advantage of KNIME DB nodes would be you could reuse your code with a whole range of other Databases later if you need more serious power.

dobo · October 29, 2019, 6:20am

In my example, KNIME is working as ETL tool between different systems, which means I need to transport all meta data corresponding to the records. The ETL part contains really a lot of logic (especially many Joins and GroupBy’s), which results in too much memory required, than actually needed. To ensure the proper outcome of the resulting values, I would only need the keys to be pulled through the ETL part. Here is a small extract of the data set which displays many keys and their corresponding value right next to them (e.g. key=EUR, val=Euro).

Sometimes one key has many values because there is a hierarchy behind (e.g. key=BSL, val0=Basel, val1=Switzerland, val2=Europe).

I tried to strip the values from the keys, then only process the keys through the ETL (which resulted in significantly better performance) and then at the end map the values back to the keys using a Join in order to write the proper dataset into another system. But this join to map the values back to the keys was very inefficient due to the large dataset. Since this mapping of key to value would be a O(1) operation, this task should actually be possible in only a few seconds, even for the given large dataset.

So my question is whether a node exists that does such a remapping of value lookups to existing keys?

mlauber71 · October 29, 2019, 11:10am

I think you could try and use the

Although I do not have that much experience with that. But it seems to adress a problem you mentioned.

In general I think it is easier to match numbers than strings so you could try to create artificial keys by numeric rowIDs and try to match them.

The other (general) way to deal with such problems is to do it in steps and store intermediate results.

dobo · October 29, 2019, 12:55pm

Thank you for your suggesgtions sadly, they do not address my issue.

The table Indexer creates an index of a table, which allows fast querying for records based on the indexed columns. This does not map keys to values and therefore doesn’t allow to reduce the complexity of the table.
yes I agree, in terms of computational complexity it is simpler to match numbers, leaving out a hashing step for the index. However, the task of stripping out mapping tables still remains. Your suggestion would mean that during such an index table creation, not even what I refer to as keys (e.g. “BSL”) would remain, but only numbers. That would then indeed, as a successive step, yield even more speedup. I would then not need an ID on a per row basis, but per to be indexed column.
Exactly storing intermediate results would lead to a large volume of (unneccessary) data, as all the meta data columns are stored, even though they would only need to be added at the end of the workflow in order to produce the complete output and store it in a database.

mlauber71 · October 29, 2019, 1:18pm

If your data is of such a complexity that KNIME alone cannot handel it question is if you would need a Big Data system or a powerful SQL-DB. As a sidenote: Big Data systems are not necessarily better at combining very complex relational datasets but they would not mind the intermediate steps.

Luckily KNIME can provide a comfortable GUI for all kinds of database access.

The only other possibility I could think of is separate the IDs and then create a meta-id table in an in memory H2 database and later use that tabel to rejoin the other tables.

In the end joins are costly and I fear there is no eays way around besids indexing or using artificial numbers to make the join task easier.

izaychik63 · October 29, 2019, 1:31pm

You may try

dobo · October 29, 2019, 2:12pm

Well then I might just develop my own node that does exactly what I want
Basically what I am referring to is normalization of tables into 4th normal form.

I need to keep the architecture as simple as possible and running on a laptop. I know there is much more juice to be squeezed out of my workflow if I find a way to normalize the tables in the workflow, which would allow to only send efficient data types such as numbers through the number crunching of my workflow.

mlauber71 · October 29, 2019, 3:07pm

Of course you could always develop your own node

Question is which of the suggested approarches did you test and what are the results, especially using a SQL-DB in memory where you could perform several (or more coplex) joins at once, or maybe even using a local Big Data Hive DB, although it might not be as powerful as the real thing, hive tends to faithfully execute tasks on the hard drive. Not sure if a local big data environment would be sufficient but it might be worth a try.

Also @izaychik63 hint about the JoinedSort could help. You for sure have tried to toy around with the individual settings to write nodes to disk or store only small files in memory.

The first part shows hot to load data into (local) hive:
https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_h2o_sparkling_water

system · April 29, 2020, 3:10am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.