Problem with a Joiner node

Hello, colleagues,

I have a strange situation with my workflow. I executed an import metanode (58,189 rows and 32 columns), then performed the ABC-XYZ-RFM-Churn analysis (at the final joiner node there are 21,214 rows and 12 columns).

Then I tried to join the two parts, the import metanode and the last joiner of the ABC-XYZ-RFM-Churn analysis, using a left outer join with the keys Customer Key and Order Date (if I use only Customer Key, the row count increases). I include all 32 columns from the import metanode (unit profit, margin unit, order quantity, unit price, profit, age, birthdate, total product cost, sales amount, full name, gender, yearly income, education, occupation, date first purchase, customer city, customer country, order date, sales order number, subcategory, category, order date (year+quarter), order date (year+month), order date (year+week), customer key, product key, commute distance, product line, order date (year, count), order date (year, number), order date (year, string), product name), and from the last joiner of the ABC-XYZ-RFM-Churn analysis only the newly generated columns (ABC, XYZ, ABC-XYZ, Recency, R, F, M, RFM, Churn), which are not inside the import metanode.

I also enabled "Append custom suffix" in case of duplicate column names and "Create new" in case of RowIDs, then pressed OK. The result has 58,189 rows and 41 columns, but with missing values in ABC, XYZ, ABC-XYZ, Recency, R, F, M, RFM, Churn. I changed the matching criterion and re-checked the row and column counts at every join I have upstream, but nothing changed… I have no idea, HELP!

Hi @Felis90 that’s quite a lot to take in and not easy to visualise without seeing a sample of the data.

I’m actually not entirely sure what you are saying the problem is. What result are you expecting and how does the result differ from what you are expecting?

Have you tried reducing your result sets temporarily to, say, just a single customer and observing the results? That might make it easier to discover what is happening than looking at 58k rows.

Are you saying that this is a problem, or is it simply an observation? An increase in resultant row count, if you only join with a partial key, is to be expected since this will cause duplication of data in the output.
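The row-count increase from joining on a partial key can be seen in a tiny pandas sketch (the table contents and column names here are invented for illustration, not taken from the actual workflow):

```python
import pandas as pd

# Hypothetical miniature of the two tables being joined.
orders = pd.DataFrame({
    "Customer Key": [1, 1, 2],
    "Order Date": ["2016-01-01", "2016-02-01", "2016-01-15"],
    "Sales Amount": [100, 200, 50],
})
scores = pd.DataFrame({
    "Customer Key": [1, 1, 2],
    "Order Date": ["2016-01-01", "2016-02-01", "2016-01-15"],
    "RFM": ["111", "121", "333"],
})

# Full key (Customer Key + Order Date): one match per row, count preserved.
full = orders.merge(scores, on=["Customer Key", "Order Date"], how="left")

# Partial key (Customer Key only): customer 1 appears twice on the right,
# so each of that customer's left rows matches twice -> duplicated rows.
partial = orders.merge(scores[["Customer Key", "RFM"]],
                       on="Customer Key", how="left")

print(len(full))     # 3
print(len(partial))  # 5  (2 left rows x 2 right rows for customer 1, + 1)
```

This is exactly why the row count grows when only Customer Key is used as the join key.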

Perhaps you could upload some examples of data that is being joined, the result of the join, and indicate what is wrong with the output, and the output you are expecting.

Can you also upload a screenshot of your Joiner node configuration?


@Felis90 most likely you will have to check your combination of join columns and see where you might have duplicates and if you expect or want them.

Also you might note that if you choose a join other than inner join knime will create new RowIDs. You can try to preserve yours if a left or right join would not produce duplicates.

More on the subject here.

Otherwise you might also want to follow @takbb's advice.
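One quick way to see where a left join fails to find partners is to ask the join itself which keys matched. A minimal pandas sketch (sample keys invented for illustration) using the `indicator` option of `merge`:

```python
import pandas as pd

# Invented sample keys for illustration only.
left = pd.DataFrame({"Customer Key": [1, 2, 3]})
right = pd.DataFrame({"Customer Key": [1, 1, 4], "RFM": ["111", "121", "444"]})

# indicator=True adds a "_merge" column telling, per output row, whether
# the key was found in both tables or only on the left.
out = left.merge(right, on="Customer Key", how="left", indicator=True)
print(out["_merge"].value_counts())

# The left-only rows are exactly the ones whose joined columns come
# back as missing values.
unmatched = out[out["_merge"] == "left_only"]
print(unmatched["Customer Key"].tolist())  # [2, 3]
```

In KNIME the analogous check is to route the Joiner's unmatched rows to a separate output (or do an inner join and compare row counts) and inspect which key combinations never match.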


Hello, I have found the reason for my problem… As I understand it, the cause was a Row Filter node that limited the number of rows, keeping only data for the year 2016. So when I tried to join the import metanode (58,189 rows) with the ABC-XYZ-RFM-Churn output (21,214 rows) using different join approaches, the new ABC-XYZ-RFM-Churn columns could not be matched for all 58,189 rows, and the unmatched places were filled with missing values. When I instead chained the two metanodes sequentially, step by step (not using a Joiner node after ABC-XYZ-RFM-Churn to merge the import with the analysis), the data appeared in the columns. Thanks to everyone for trying to help me :)
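That kind of failure mode can be mirrored in a small pandas sketch (the table contents are invented): if the right-hand analysis table has been filtered down to a single year, a left outer join still keeps every left row, but the analysis columns come back missing for all rows outside that year.

```python
import pandas as pd

# Invented miniature: the left table spans two years, while the right-hand
# analysis table was filtered to 2016 only (mirroring a Row Filter node).
orders = pd.DataFrame({
    "Customer Key": [1, 1, 2],
    "Order Date": pd.to_datetime(["2016-03-01", "2017-03-01", "2017-06-01"]),
})
analysis = pd.DataFrame({
    "Customer Key": [1],
    "Order Date": pd.to_datetime(["2016-03-01"]),
    "Churn": ["no"],
})

joined = orders.merge(analysis, on=["Customer Key", "Order Date"], how="left")

# Every left row survives (left outer join), but rows outside 2016 have
# no partner on the right, so Churn comes back as a missing value (NaN).
print(joined["Churn"].isna().sum())  # 2
```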


Hello @mlauber71, I'm really sorry, but I did not solve my problem after all… I'm lost. I tried to play with the rows as you suggested, but I don't clearly understand what I need to write in the Java Snippet (simple) node and the RowID to get the correct rows for joining the big import metanode and the ABC-XYZ-RFM metanode. Can you help me?

@Felis90 you will have to think about what your unique key is at every step and whether you want that.

This is not so much a technical question but a question of what your data is and what you want to do with it.

The syntax in the Java Snippet (simple) node is

return $ROWID$;

but that will only give you the current RowID in a column.

@mlauber71 I put

return $ROWID$;

into the method body box and press Apply, but nothing happens…

What I have done and what I want to achieve is the following:
I imported data, performed the ABC-XYZ-RFM analysis, joined the results, and then tried to join all this analysis data back with the import node, to update the data taking the new analysis columns into account…

Hi @Felis90, I found your original post a little difficult to understand, mostly because of formatting and because it is difficult to visualise your data, so I hope you don't mind, but I gave it to chatGPT to reword and restructure it for me. Below is the output. Hopefully others will also benefit from this rewording and be more able to assist, so would you please confirm that this captures the meaning correctly:

I’m facing a puzzling issue with my workflow. Here’s what I’ve done so far:

  1. Imported data: I loaded a metanode that has 58,189 rows and 32 columns.
  2. Performed analyses: I ran the ABC-XYZ-RFM-Churn analysis, which left me with a final output of 21,214 rows and 12 columns.

Now, I’m trying to join the original import metanode with the final results from the ABC-XYZ-RFM-Churn analysis. I’m using a left outer join based on two keys: Customer Key and Order Date. (If I join by Customer Key alone, I end up with more rows than expected.)

  1. Columns to include: I’m keeping all 32 columns from the import metanode (things like unit profit, order quantity, unit price, sales amount, etc.) and adding only the newly generated columns from the ABC-XYZ-RFM-Churn analysis (such as ABC, XYZ, Recency, RFM, and Churn). To handle any duplicate column names, I selected the “Append custom suffix” option.
  2. After the join: I end up with the full 58,189 rows as expected, but I also have 41 columns. However, I’m getting missing values in all the new columns from the ABC-XYZ-RFM-Churn analysis (ABC, XYZ, Recency, RFM, and Churn). I’ve tried adjusting the join settings and checked row and column counts along the way, but nothing changes.

I’m out of ideas—any suggestions?

As @mlauber71 rightly says, your question is not one we can answer definitively from a technical perspective since your problem is quite possibly down to the analysis of your data, and how it should be joined. The configuration of your joiner may also be problematic but we can’t tell from what you have written so far. Without being able to see your data, all we can do is point you at how things work technically which probably won’t find a solution to your problem very quickly.

This brings me back to my earlier suggestions:

Ideally put in some row filters to reduce the incoming data for the joiner to just one item that you know should be joined. Then see what the results are for the joiner. It is always easier to work with a small data sample when trying to work out what is going wrong, than to work with a massive number of rows which just masks the problems and confuses everything. Test it small and then scale it up, and test again.
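The "test it small" idea can be sketched in pandas (table contents invented for illustration): filter both tables down to one customer that you know should join, then eyeball the merge result.

```python
import pandas as pd

# Invented sample tables; note the deliberate date mismatch for key 2.
left = pd.DataFrame({
    "Customer Key": [1, 1, 2, 3],
    "Order Date": ["2016-01-01", "2016-02-01", "2016-01-15", "2016-03-01"],
})
right = pd.DataFrame({
    "Customer Key": [1, 2],
    "Order Date": ["2016-01-01", "2016-01-20"],
    "RFM": ["111", "222"],
})

key = 1  # the one customer under the microscope
small = left[left["Customer Key"] == key].merge(
    right[right["Customer Key"] == key],
    on=["Customer Key", "Order Date"], how="left")
print(small)
```

With only a handful of rows it is immediately obvious which dates matched and which came back with a missing RFM value; in KNIME the equivalent is a Row Filter on one Customer Key in front of each Joiner input.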

I would suggest you upload that sample data to the forum (with anything sensitive anonymized or redacted).

If your data looks like it ought to join, but you are seeing missing values in the columns, as I believe you have suggested, then showing us how you have configured the joiner, or better still uploading a small workflow containing just that input data and the joiner is likely to be your best chance for finding a successful outcome.
