Hi everyone, I would really appreciate some help because I’m struggling terribly with the Group Loop Start node.
I’m building a model to understand which variables drive the number of orders in a given month on my e-shop. I have prepared data where each row contains the month (a date in year-month-day format) and the resulting number of orders (the target of the equation). I’ve read that the best way to figure out why a certain number of orders occurred in individual months is to run this dataset through the Group Loop Start node. However, when I do this, the output collapses into a single row instead of the original 24 rows (I have 24 consecutive months), and after that the Linear Regression no longer works with it. Do you have any idea how to prevent this, so that I get a separate row for each month and can answer the question “what influences sales the most in October/November/December?”
It is a bit unclear what you want to achieve - can you provide some:
example data (anonymised)
state of your current workflow
Output you want to see in each iteration inside the loop
In general, when you configure the Group Loop Start node you select the columns that “make up unique combinations”. For example, if you have a column “ABC” that contains only “A”, “B” or “C” in each of 20 rows, you can select only that column in Group Loop Start (i.e. keep only it in the green box on the right) and the loop will then iterate three times: first filtering the data set for “ABC” = “A”, then in the next iteration “ABC” = “B”, and so on.
So if, as you describe, your date column has 24 consecutive months in it and you select this column in the Group Loop Start node, it will by design iterate 24 times and output only one row at a time.
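To make the grouping behaviour concrete, here is a minimal plain-Python sketch of the idea. It only imitates the behaviour described above and is not how KNIME implements the node; the `group_loop` helper, the toy tables and their column names are all made up for illustration.

```python
from collections import defaultdict

def group_loop(rows, group_columns):
    """Yield (key, rows_in_group) once per unique combination of group_columns.

    Hypothetical helper that mimics the described behaviour of a group loop.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_columns)
        groups[key].append(row)
    for key in sorted(groups):
        yield key, groups[key]

# Toy table (made up): 20 rows whose "ABC" column holds only "A", "B" or "C".
rows = [{"ABC": "ABC"[i % 3], "value": i} for i in range(20)]
iterations = list(group_loop(rows, ["ABC"]))
for key, chunk in iterations:
    print(key, len(chunk))  # three iterations, each with several rows

# Grouping on a column that is unique per row (like one row per month)
# gives one iteration per row, and each iteration sees a single row.
monthly = [{"month": f"2023-{m:02d}-01", "orders": 100 + m} for m in range(1, 13)]
monthly_iterations = list(group_loop(monthly, ["month"]))
print(len(monthly_iterations), "iterations of", len(monthly_iterations[0][1]), "row")
```

The second call is the situation in this thread: grouping on a column with one distinct value per row leaves nothing to fit a model on inside each iteration.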
I’m trying to analyze the factors which impact historical sales, and through that I want to decide what I should focus on to improve the results in the future.
Of course, no problem. All of the data is my own, so I don’t need to anonymize it. Maybe you should tell me exactly what you want to see; my apologies if I bring you something different, because I guess you don’t need to see the CSV file (or do you?) :))
My first screenshot is my workflow. Maybe I’m doing it like an idiot, I don’t know, but it worked up to the “normalizer”, as you can see: there are always 24 rows.
My second and third screenshots show the output and my settings in Group Loop Start. It is probably not a failure, but I can’t figure out how to reach something like: “in September 24 there were X orders, because the weight of ‘marketing spend’ is Y, compared to October 23, where the weight of marketing spend is 2Y”. I know this exact sentence is not going to appear, but you know what I mean. And I’m not sure how to find out from the attached output, for every single month, the reasons for its historic result, because the rows are just the date with the number of orders, as in the very first CSV file, and I need to find out why the result looks the way it does in a particular month. And it is exactly the same even if I include all the columns.
Is that a better explanation, or is it still not understandable?
Does my procedure (csv → clean data → join → normalize → group loop start → linear regression learner) lead to the result I want, or am I completely wrong?
I don’t think you do it like an idiot :-). Idiots wouldn’t dare to ask valid questions!
From what I understand you are trying to apply some sort of regression in order to try and predict the future based on past values.
From what I can see on the screenshots the group loop works as expected - it identifies the unique values in your creation time column and iterates over them one-by-one. As such you only see one single row inside the loop, unless you have the exact time stamp more than once in your creation time column.
As to applying regression to your data: typically, to “fit a model” you’d want a lot of data. Single rows probably won’t cut it, so it might be better to do this outside of the loop.
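A small sketch of why a single row is not enough, assuming an ordinary least-squares fit with just one predictor (the numbers and the `fit_line` helper are made up; a real model over 17 columns needs even more rows):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # With one row (or identical x values) there is no spread in x,
        # so no slope can be estimated.
        raise ValueError("need at least two distinct x values to fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up example values, not taken from the thread.
marketing_spend = [1.0, 2.0, 3.0, 4.0]
orders = [12.0, 14.0, 16.0, 18.0]

# Fitting on all rows works.
slope, intercept = fit_line(marketing_spend, orders)
print(slope, intercept)

# Fitting on one row, as happens inside a per-month loop, cannot work.
try:
    fit_line(marketing_spend[:1], orders[:1])
    single_row_ok = True
except ValueError as err:
    single_row_ok = False
    print("single row:", err)
```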
Have you explored your data for correlations already?
You could e.g. use this node:
To explore quickly if a linear regression model is a good idea or not.
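For intuition, the kind of quick correlation check meant here can be sketched in plain Python: compute the Pearson correlation of every candidate column with “orders” over all rows at once and rank the columns by strength. The table below is made up, “discount_pct” and “newsletter_sends” are hypothetical column names, and this is only an illustration, not what any KNIME node does internally.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Made-up table: one list per column, one entry per month.
table = {
    "orders": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "marketing_spend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "discount_pct": [5.0, 3.0, 6.0, 2.0, 4.0, 1.0],
    "newsletter_sends": [2.0, 1.0, 2.0, 3.0, 2.0, 3.0],
}

target = table["orders"]
correlations = {
    name: pearson(values, target)
    for name, values in table.items()
    if name != "orders"
}

# Strongest linear relationship first (by absolute value of r).
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

Note that this uses all months together. A correlation, like a regression, describes how columns move together across rows, so it cannot be computed for one month in isolation.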
And yes, you understand it pretty well. And again, yes, that was exactly how I knew that I had failed and not KNIME :)) My problem with understanding what I see is this: if I understand you correctly, let’s imagine that all my 24 rows were “squeezed” into only one. But for me this is no longer readable. I have no clue how it helps solve my problem and, to be honest, I have no clue what having all the data in one row is good for at all :)) You know, there was advice from a PDF reader (I fed this reader all the KNIME user guides I found) that this could help me solve my equation, because it told me that when I want to study each of the 24 months separately, Group Loop Start and then Linear Regression is necessary.
Maybe Group Loop Start is just the wrong idea when I have only 24 rows with 17 columns. I don’t know. Put a little differently, I need to know these correlations for each row separately (because each row describes a different month/year), with all the other columns correlated against the column named “orders”, to see which column influences it the most. At the very first I expected that I would be able to ask KNIME: “the column ‘orders’ has 24 different results, each of them with different values of its 17 variables; from these 24 equations, tell me the overall importance of each variable for the final result”. After that I started to hope that I would be able to find the importance of these variables for every single row, to reveal differences between months, because every month probably has a different importance for different factors, if you know what I mean.
Anyway, yes, I tried to use linear correlation, but to be honest I was not sure whether this is the right way to solve my 24 equations. I was a little confused and thought the result was a failure :)) Maybe that is because I don’t know how to read the output (again, I put it in a PDF, gave it to the reader and asked it how I should understand it). Does linear correlation even do what I need, i.e. “I have 1 result and 17 variables, and I need to know how they interact to produce that result”?
No problem, I would love to do that if it helps me learn how to do what I need to do :)) Do you want me to share my CSV files, or something else?