All of this data is in a single table. There are three sets of column types: 1) those I wish to retain with every iteration, 2) those that designate whether a particular feature is present for a unique identifier (ListingId), and 3) columns that each match a type 2 column but hold a calculated value (SHAP). Directly below are examples of column types 2 and 3, which we wish to pair during each iteration. Below them are the type 1 columns, which I wish to retain with every iteration.
To this point we pair up the example columns of types 2 and 3 along with the four columns we retain for every iteration.
However, I wish to split the data in these columns by the row values in the type 2 column, which contain only a 1 or 0 in each row.
We would split the data into two sets: our type 2 and 3 columns along with the permanent type 1 columns we always retain, but also by those row values of 1 or 0.
I have tried all kinds of loops (group loop, column list loop, and others) with various permutations of row filters and math formulas in between, and I often end up with my workflow correctly parsing out one column, but never all of them.
Hi @ActionAndi , after keeping at this for another day, I think I can simplify my problem. As long as we keep the same RowIDs across all processes, I can join everything at the end.
In the current example, there are 60 columns (this will change with every analysis I do, as the data changes, but for today let’s stick with that).
The first four columns here are essentially static but I can join them back after processes are run on the remaining columns using RowID.
What I wish to do is “pair” multiple sets of two related columns and run basic math calculations.
For example, below are the first four columns. Looking at just the first column, “Covered Patio Deck Porch SSAI Token”, the cells contain either a 1 or a 0 (1 means the feature is present for that house/ListingId, 0 means it is not).
Each of these columns has a “twin” holding a calculated SHAP value. The name of each twin column is the same, except with “SHAP” prefixed to the column header: “SHAP_Covered Patio Deck Porch SSAI Token”.
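Outside of KNIME, the column-pairing rule I described can be sketched in a few lines of plain Python (the short column list here is just a stand-in for the real ~60 columns):

```python
# Hypothetical column list; the real table has ~60 columns.
columns = [
    "RowID", "ListingId",
    "Covered Patio Deck Porch SSAI Token",
    "SHAP_Covered Patio Deck Porch SSAI Token",
]

# Pair each feature column with its "SHAP_"-prefixed twin,
# skipping the SHAP columns themselves and any column without a twin.
pairs = [
    (col, "SHAP_" + col)
    for col in columns
    if not col.startswith("SHAP_") and ("SHAP_" + col) in columns
]

print(pairs)
# → [('Covered Patio Deck Porch SSAI Token', 'SHAP_Covered Patio Deck Porch SSAI Token')]
```

Because the pairing is purely name-based, it adapts automatically when the number of columns changes between analyses.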
Obviously, the SHAP values for each ListingId are the feature’s contribution, either when the feature is present (1) or when it is not (0).
What I am trying to do is create a series of new tables and column values, with one set as 1 x SHAP and a separate set of columns as 0 x SHAP. In a simple scenario of only two columns in a table I might use the Rule Engine. But when I try various kinds of loops to do this with paired columns, everything blows up on me. Even when the loop does perform as expected, it breaks after the first paired-column run, because the row counts keep changing after every column series (every feature column has a different number of 1s and 0s).
My goal is to have one table of only 1xSHAP and another of only 0xSHAP.
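To make the goal concrete, here is a pandas-free Python sketch of the logic (the row data and column names are made up for illustration). The key idea is to key everything by RowID rather than forcing equal row counts, which sidesteps the changing-row-count problem entirely:

```python
# Toy rows standing in for the real table (values are assumptions).
rows = [
    {"RowID": "Row0", "ListingId": 101, "Covered Patio": 1, "SHAP_Covered Patio": 0.12},
    {"RowID": "Row1", "ListingId": 102, "Covered Patio": 0, "SHAP_Covered Patio": -0.05},
    {"RowID": "Row2", "ListingId": 103, "Covered Patio": 1, "SHAP_Covered Patio": 0.07},
]

feature_cols = ["Covered Patio"]  # the real data has ~28 paired columns

# Two output "tables": SHAP values where the flag is 1, and where it is 0.
present, absent = {}, {}  # {RowID: {feature: SHAP value}}
for row in rows:
    for col in feature_cols:
        shap = row["SHAP_" + col]
        target = present if row[col] == 1 else absent
        target.setdefault(row["RowID"], {})[col] = shap

print(present)  # SHAP values for rows where the feature is present (1)
print(absent)   # SHAP values for rows where the feature is absent (0)
```

Each feature column splits the rows differently, so the two result tables are sparse, but because every value carries its RowID, everything can be joined back to the original table (or to ListingId) at the end.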
@ActionAndi Let me expand on that a little bit. When I attempt to run a Row Filter or Row Splitter to separate the 1s from the 0s within a Column List Loop, things blow up, because the number of rows keeps changing while the loop expects a fixed number of rows.
@ActionAndi When I do get my current mess to run, I can only get one column to separate out 1s from 0s (below). This is when I run a Numeric Row Splitter inside of a Column List Loop. But I never get this to work across all columns.
Once I have a series for all columns, I can join them with the affiliated SHAP columns using Row ID and do the multiplication. I’m sure there’s a better solution, though.
@ActionAndi Thanks so much for your workflow. I simply do not have flow variables skills to pull off what you did here.
One last question: this is great data for “global” analysis by aggregate feature (thanks again), but I was hoping to also re-associate the final outcome “locally” with each individual property, either by keeping the same row numbers as the original data table or by using the property’s “ListingId”.
Ah one more thing:
It seems that the data set is a toy data set to learn ML methods. In one of your first threads about this you analyzed regressions with R² of less than 0.3… Which is quite low. Have you ever tried to draw random numbers and calculate R² of these? It can be in the same order of magnitude…
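ActionAndi’s sanity check is easy to run in plain Python (a sketch, not KNIME): draw two unrelated random samples and compute their R². With small samples, chance alone routinely produces values well above zero.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Two completely unrelated random samples of modest size:
x = [random.random() for _ in range(10)]
y = [random.random() for _ in range(10)]
print(round(r_squared(x, y), 3))  # nonzero purely by chance
```

Repeating the draw many times shows the spread of R² values pure noise can produce, which is a useful baseline before reading anything into an R² below 0.3.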
@ActionAndi, yes, the R² is low. Let me explain. The data here is almost certain to make the least total impact on the value of homes. I call this group “qualitative” data. It will be combined with the physical characteristics of homes (square footage, etc.), which I refer to as “quantitative” data and which is likely to have the greatest impact on value. Then I’ll add in “location” data (two otherwise identical homes in substantially different locations have different values). Based on the combined R² of these three groups, I may attempt to bring in a fourth group, “market effects”: things like interest rates.
So, at this point I am really just testing for the effects on value of features home owners have some control over, and seeking out the highest-impact ones. And yes, owners can add square footage and such, but that is a very big investment. In most cases they have very little functional control over the quantitative, location, or market-effect groups.