@daniel_yi it looks like you have already walked the path I am taking. I’m trying to do the nanodegree projects with KNIME. I’m stuck on the AB Test Project. How did you generate the Gross Margin trend and seasonality?
Data
Store Characteristics (StoreID, total sales, gross margins, sqft, control/treatment store, etc.)
Transaction data with StoreID
The Exercise:
10 out of 120 stores are undergoing an update to their services, to see whether their increase in Gross Margin is greater than that of their paired control stores (stores that are similar overall and have a similar Gross Margin trend and seasonality)
Filter the data to the proper date range
- typical data engineering stuff: split the data into the range before the test period and the range during the test period
Aggregate the data to get the weekly gross margin and count of unique invoices
- Is this done as 2 parallel paths using the Aggregate Granularity component?
- or should I use the GroupBy node after extracting Year and Week from the transaction date? (see the aggregation sketch below this list)
Calculate Trend and Seasonality
- tried to use the Decompose Signal component for time series but couldn’t get the output I wanted (see the decomposition sketch below this list)
- stuck on this
Label the data as treatment and control stores
- this was straightforward using a cross join and the table of Treatment (test) stores
Calculate correlation between other numeric measures and gross margin
- used the Linear Correlation node to select which columns (for the distance measure?) to include for Control/Treatment pairing
Match Treatment to Control stores per region
- I suspect I need to use a distance measure here to help me find which store is most similar to each of the Treatment stores to be its control (see the pairing sketch below this list)
Calculate lift from control to treatment store
- here I think I just need to sum up the weekly data by store during the test period to compare the treatment and control stores’ gross margin… do I need to do some sort of normalizing of the gross margin numbers here? (see the lift sketch below this list)
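To make the aggregation step concrete, here is a minimal sketch of what I have in mind, written in Python/pandas rather than KNIME nodes. The column names (TransactionDate, GrossMargin, InvoiceID) are my assumptions; the real file may use different ones.

```python
import pandas as pd

# Assumed column names: TransactionDate, StoreID, GrossMargin, InvoiceID.
tx = pd.read_csv("transactions.csv", parse_dates=["TransactionDate"])

# Extract ISO year and week so every store lands on the same weekly grid.
iso = tx["TransactionDate"].dt.isocalendar()
tx["Year"], tx["Week"] = iso["year"], iso["week"]

# Weekly gross margin and count of unique invoices per store,
# i.e. the GroupBy-after-extracting-Year-and-Week option.
weekly = (
    tx.groupby(["StoreID", "Year", "Week"])
      .agg(GrossMargin=("GrossMargin", "sum"),
           Invoices=("InvoiceID", "nunique"))
      .reset_index()
)
```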
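For the trend/seasonality step where I am stuck, this is roughly what I understand a decomposition to do, using statsmodels’ seasonal_decompose as a stand-in for the Decompose Signal component. The StoreID 1001 and period=52 (yearly seasonality in weekly data) are assumptions.

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# One store's weekly gross margin in week order. period=52 assumes a yearly
# cycle in weekly data; seasonal_decompose needs at least two full cycles.
store = weekly[weekly["StoreID"] == 1001].sort_values(["Year", "Week"])
result = seasonal_decompose(store["GrossMargin"].to_numpy(),
                            model="additive", period=52)

trend = result.trend        # one trend value per week (NaN at the edges)
seasonal = result.seasonal  # one seasonal value per week
```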
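For the pairing step, this is the kind of distance-based matching I am picturing: standardize the store attributes, compute pairwise distances, and pick the nearest non-treatment store in the same region. The feature names in the commented call are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

def match_controls(stores, treatment_ids, features):
    """For each treatment store, pick the nearest non-treatment store
    in the same region based on standardized numeric attributes."""
    stores = stores.reset_index(drop=True)
    # Standardize so no single attribute (e.g. total sales) dominates.
    X = StandardScaler().fit_transform(stores[features])
    dist = pairwise_distances(X)  # Euclidean by default

    matches = {}
    for t in treatment_ids:
        t_idx = stores.index[stores["StoreID"] == t][0]
        same_region = stores["Region"] == stores.loc[t_idx, "Region"]
        candidates = stores.index[same_region & ~stores["StoreID"].isin(treatment_ids)]
        best = candidates[np.argmin(dist[t_idx, candidates])]
        matches[t] = stores.loc[best, "StoreID"]
    return matches

# Hypothetical attribute names; swap in whatever the store file actually has.
# matches = match_controls(store_table, treatment_ids,
#                          ["SqFt", "AvgWeeklySales", "TrendSlope", "SeasonalStrength"])
```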
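And for the lift step, one possible answer to my own normalization question (an assumption on my part, not necessarily what the course expects): express each store’s test-period gross margin as a change relative to its own pre-period, then compare treatment vs. matched control, so that store size drops out.

```python
def relative_change(pre, test, store_id):
    """Test-period gross margin vs. the store's own pre-period margin,
    assuming the two periods are of comparable length."""
    pre_gm = pre.loc[pre["StoreID"] == store_id, "GrossMargin"].sum()
    test_gm = test.loc[test["StoreID"] == store_id, "GrossMargin"].sum()
    return (test_gm - pre_gm) / pre_gm

# `matches` comes from the pairing sketch above; pre_weekly / test_weekly are
# the weekly aggregates filtered to the pre-period and test-period date ranges.
# lift = {t: relative_change(pre_weekly, test_weekly, t)
#            - relative_change(pre_weekly, test_weekly, c)
#         for t, c in matches.items()}
```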
The other post you were helping me on (Matching Test and Control Units) is related to this larger project I have documented here.
I’m not sure how to get the overall trend and seasonality of the entire dataset (by StoreID), since the Decompose Signal component generates the trend and seasonality between data points. Do I just take the mean?
I know this isn’t your job and I am very grateful for all of your help.
I’m not sure if I understand your question correctly: If you have the trend and seasonality for, let’s say, March and April, how do you get the trend and seasonality for May?
The trend is a regression model with the row index as the predictor column. To get the trend value at any point in your data, you need to know the row index of this point, and apply the model using the row index as the predictor column. For example, if you have daily data for March and April in the training set, then the row indices in the training set are 0 to 60 (for 61 days), and the row index of the first row in the test set (the first day in May) is 61. If you want to know the trend value at the end of May, you need to apply the trend model to the row index value 91.
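A tiny numeric illustration of that, assuming a plain linear regression on the row index (the margin values here are made up, not the project data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 61 training days (March + April) with row indices 0..60 as the predictor.
rng = np.random.default_rng(0)
row_idx = np.arange(61).reshape(-1, 1)
margin = 100 + 0.5 * row_idx.ravel() + rng.normal(0, 2, size=61)  # fake data

trend_model = LinearRegression().fit(row_idx, margin)

# Trend value at the last day of May = row index 91.
print(trend_model.predict([[91]]))   # roughly 100 + 0.5 * 91
```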
Seasonality is a repeating fluctuation in the data. To detect it for all data in your training set, your training set must contain at least one whole seasonal cycle. For example, if you have yearly seasonality, the training set must contain at least one year.
I hope this answers your question and if not, please let me know!
I’ve outlined the steps I’m trying to follow for my analysis at the top of this post. I’m stuck on Step 3. Using your example… with the Decompose node I am able to get the trend and seasonality for March and April and every month separately… is there a way to get the trend as one number for all 12 months (or whatever period I choose)? And then the same for seasonality.
Basically, I am trying to generate 2 more columns that describe the trend and the seasonality for each store. That way, if I have these for each store, I can use the overall trend and seasonality together with the other store attributes to compare distances between stores and pair them.
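To show what I mean by “2 more columns”, here is one possible way (an assumption on my part, not the only option) to collapse each store’s decomposition into a single trend number and a single seasonality number:

```python
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def summarize_store(series, period=52):
    """Return (trend_slope, seasonal_amplitude) for one store's weekly series.
    period=52 assumes yearly seasonality in weekly data."""
    res = seasonal_decompose(series, model="additive", period=period)
    trend = res.trend[~np.isnan(res.trend)]
    # One number for trend: slope of a straight line fitted through the trend.
    trend_slope = np.polyfit(np.arange(len(trend)), trend, 1)[0]
    # One number for seasonality: the size of the seasonal swing.
    seasonal_amplitude = res.seasonal.std()
    return trend_slope, seasonal_amplitude
```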
Ah, I see - you could find the control store by comparing the trend and seasonality columns of two stores for the same time period with the Numeric Scorer node, then select the control store candidate with the greatest R-squared value. What do you think?
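In script form, the idea would look something like this (a rough stand-in for the Numeric Scorer, with made-up names): compare the treatment store’s trend series against each candidate’s trend series over the same weeks and keep the candidate with the highest R-squared.

```python
from sklearn.metrics import r2_score

def best_control_by_trend(treatment_trend, candidate_trends):
    """candidate_trends: dict mapping StoreID -> trend values over the same weeks.
    Returns the StoreID whose trend has the highest R-squared vs. the treatment store."""
    return max(candidate_trends,
               key=lambda sid: r2_score(treatment_trend, candidate_trends[sid]))
```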
This sounds like a good approach if I understand you correctly… let’s assume each store has N number of rows of trends and seasonality data. I should pivot these into columns for a total of 2N. Then calculate numeric distances and then use the Similarity search (for controls units)? Will adding 2N columns of data skew the similarity too much to trend and seasonality? as there are other characteristics being used like store size, sales, etc.
I wouldn’t use the trend and seasonality columns in the similarity search. I would create a new numeric column that indicates the similarity of the trend (or seasonality) to the trend (or seasonality) of the test store. This column can be used in the similarity search together with other numeric columns, such as the “sq_feet” column. I created an example workflow for that, using only 5 stores from the data you shared and only the trend column:
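In plain Python the same idea would look roughly like this (only a sketch, not the attached workflow; “sq_feet” comes from above, everything else is assumed):

```python
import numpy as np
from sklearn.metrics import r2_score, pairwise_distances
from sklearn.preprocessing import StandardScaler

def find_controls(stores, trends, test_store_id, n_candidates=3):
    """stores: one row per store with numeric columns such as 'sq_feet'.
    trends: dict mapping StoreID -> trend values over the same weeks."""
    stores = stores.reset_index(drop=True)

    # 1) New numeric column: R-squared between each store's trend and the
    #    test store's trend, used as the trend-similarity measure.
    ref = trends[test_store_id]
    stores["trend_similarity"] = [r2_score(ref, trends[s]) for s in stores["StoreID"]]

    # 2) Similarity search over that column together with other numeric
    #    columns, standardized so all columns carry comparable weight.
    X = StandardScaler().fit_transform(stores[["trend_similarity", "sq_feet"]])
    dist = pairwise_distances(X)

    test_pos = int(stores.index[stores["StoreID"] == test_store_id][0])
    order = np.argsort(dist[test_pos])
    return stores.iloc[order[1 : n_candidates + 1]]  # skip the store itself
```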