Dealing with legacy data structure

Clock1932 · April 7, 2020, 10:56pm

Hi guys,

Have been using KNIME for some months now and helped me out big time. My skills are limited to basic data manipulation and loops, but eager to expand on that every day! Came across a use case today that exceeds my current capabilities and need a push in the right direction. Much appriciated!

Data structure #1 - products in catalog

CatalogID	ProductID	ProductName
XBA01A	1	ANDRELON HAIR MASK DELUXE CURL&SHINE
XBA01A	2	RAVENSBURGER NIJNTJE 65 JAAR
XBA01B	3	ANDRELON HAIR MASK DELUXE REPAIR&SHINE
XBA01B	4	ANDRELON HAIRSPRAY TEMPTINGLY SHORT EXTR
XBA01C	6	AXE DEOSPRAY 150ML NEW DESIGN YOU

Data structure #2 - catalogs per store

StoreID	CatalogID
10003	XBA01A
10004	XBA01B
10005	XBA01C

Based on these two data structures I want to create a full list of all StoreID / ProductID combinations; basically a full assortiment list per store. Normally I would use a full outer join linking the CatalogID’s and that would be it.

Problem is the legacy data structure, where the last symbol in the CatalogID actually represents the catalog size (A is smallest, G is largest). And a catalog also includes all products in the smaller catalog sizes (i.e. each catalog size only includes the incremental products). For example: catalog XBA01C also includes products from catalog XBA01B and XBA01A (because A < B < C).

Desired output for store 1005 (single example store)

StoreID	CatalogID	ProductID	ProductName
10005	XBA01C	1	ANDRELON HAIR MASK DELUXE CURL&SHINE
10005	XBA01C	2	RAVENSBURGER NIJNTJE 65 JAAR
10005	XBA01C	3	ANDRELON HAIR MASK DELUXE REPAIR&SHINE
10005	XBA01C	4	ANDRELON HAIRSPRAY TEMPTINGLY SHORT EXTR
10005	XBA01C	6	AXE DEOSPRAY 150ML NEW DESIGN YOU

Therefore a full outer join doesn’t cut it, because it will not include the products from the smaller catalogs to the store assortment. My proposed solution in pseudocode:

Loop row-by-row through data structure #2 (catalogs per store)
For every row, isolate the CatalogID and look at the last symbol representing the CatalogSize
Loop from A to Catalogsize, and insert all products from data structure #1 (products in catalog) associated into a new table (StoreID, CatalogID, ProductID).

Help is needed to validate this approach and convert it into a KNIME model. Some guidance is appreciated! Thnx!

Clock1932 · April 8, 2020, 11:44pm

Made it work using the same approach as mentioned above in pseudocode. It takes about 30 min to turn through the 1.7M rows. It’s not the most elegant solution, but it works!

Model setup

mlauber71 · April 9, 2020, 6:01am

@Clock1932 welcome to the KNIME forum.

One idea could be to start with the catalogs and build complete ones. So the new A catalog would included all its sub catalogs.

From there it would just be one join.

Maybe you could share a sample workflow representing your problem so the community might have a look.

ipazin · April 9, 2020, 3:07pm

Hi there @Clock1932,

for these kind of use cases you should utilize scripting possibilities KNIME has (Java, Python, JavaScript…). For example I would use Column Expressions node to create additional collection (list) column holding all CatalogIDs less or equal than the one from CatalogID column.

StoreID	CatalogID	CatalogIDCollection
10003	XBA01A	[XBA01A]
10004	XBA01B	[XBA01A, XBA01B]
10005	XBA01C	[XBA01A, XBA01B, XBA01C]

Then use Ungroup node and simply join based on new column.

Only “tricky” part can be writing proper case statement with array and string functions in Column Expressions node. Drop me a message if you need example

Br,
Ivan

system · October 9, 2020, 3:07am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.