Convert IDs to Derived Values in a Table

In one dataset I have several hundred rows, each with a unique identifier, and several hundred columns with the column header identifying a specific potential quality within each row. The field of each row/column intersection contains a number also identifying the column quality when present within a specific row, or the “?” symbol if that row does not contain that particular column quality (graphic below; dataset #1 attached).

In a second dataset I have solved for the likely dollar value of each quality identified within the first dataset, obviously only when present at the row/column intersection. (graphic below; dataset #2 attached).

temp2

My goal is to trade out the field identifier when present from dataset #1 with the dollar value from dataset #2. For example: dataset #1 row identifier 5118126 has a quality identifier match of 1023 within the “stone siding” column. I would like to trade this 1023 for the derived dollar value from dataset #2 of $14,035.

The difficult part is needing to do this using thousands of datasets, each with completely different data–it’s fully customized every time. So creating a standard dictionary and cell replacer just will not work.

As always, any advice is very appreciated.

DataSet_1.xlsx (138.1 KB)
DataSet_2.xlsx (5.8 KB)

PS: I have DataSet #1 in a binary format instead of unique identifiers at the row/column field intersection if that helps in any way.

Hi @creedsmith , I thought of asking if these numbers from your dataset #1 mattered, like 1023, or 1017, etc. Then I saw your PS note about the binary format, which indicates that they don’t matter, and that we’re only interested if there’s a value there or not.

This also answers your question about your PS note, we don’t need the binary format as we can check for missing vs non-missing :slight_smile:

I’ve come up with a way to do this - there are probably other ways to do this.

FYI: Not all columns from Dataset 1 matched to a Home Qualities from Dataset 2 (you have 228 columns, including ListingId, so 227 columns to be matched from Dataset 1, vs 111 rows only in Dataset 2), so those who do not match do not get replaced/traded off. Of course this can be modified without any problem if you want to replace them by 0.

Here’s a sample of the results after I ran my workflow:

As you an see, first column is Meadow View, which does not exist as Home Qualities, and therefore the value 5004 did not change. Mountain View exists, and the rows that have a value in that column were changed to 5613. Similarly for Updated Remodeled, changed for 9101, etc.

Workflow looks like this:

The logic that is used was:

  1. Read 1 column at a time
  2. Look for the column name as a Home Qualities record (Row Filter Node 4)
  3. Replace value of Dataset #1 with the filtered value from Dataset #2 if value from Dataset #1 is not empty and was matched with Dataset #2 (Column Expressions Node 7)

That’s basically it. I added a Run Garbage Collector because these types of operations can use up memory and do not necessarily release them until your system’s Garbage collector runs.

Here’s the workflow: Convert IDs to Derived Values in a Table.knwf (140.2 KB)

EDIT: One more thing I forgot to add, your Dataset #2 is not “clean”. Like the “Stone Siding” that you took as an example, I could not originally match it. It turns out there are extra spaces in the file. That’s why I have a String Manipulation (Node 9) to do some clean ups.

1 Like

Hi Bruno, thank you so much for the workflow :heart_eyes: . I will definitely see if i can get it fully implemented in my stuff.

BTW, I did not know a node called Garbage Collector even existed…

Yes, the second data set is smaller than the first data set. The first set is every potential quality a home can have; the second set are just the qualities meeting particular value requirements. So if there is not a match between set #1 and set #2, only the data is set #2 is important.

The field values in dataset #1 do relate to exact word matches for a dictionary in the system, but equal the text value of each column header.

Again, thank you

I am still experimenting, but this I have this partially integrated and working. One question: each time the Row Filter throws a “node created an empty table” warning–that is just for one field/cell that is either not a match to the second table and/or has a “?” instead of a value it can match?

again, thank you so much.

Hi @creedsmith , yes these warnings are happening when there is no match. The Row Filter returns nothing, and therefore returns an empty table.

You can also see where the warnings are happening:
WARN Row Filter 0:4 Node created an empty data table.

You see the node name “Row Filter” and also it says node 4 in current workflow 0:4

They’re just warnings, and they don’t affect the process.

Hi Bruno, one last question if you have the time: I am attaching a very reduced dataset #2 here. I am not familiar at all with programming the Column Expressions node, but is it possible to have only the matched results between dataset #1 and dataset #2 be the results–dropping all other columns.

In this example, dataset #1 has about 230 columns being analyzed, but dataset #2 has only 30 of those 230 qualities after the reduction analyses. Is there a means to set up what you did previously so only the 30 qualities of dataset #2 override the field values from dataset #1, creating just the 30 columns in the final outcome?

Sorry to bother you on this, but I just just haven’t been able to accomplish this. thanks

forgot this…
dataset2.xlsx (10.9 KB)

Hi @creedsmith , please refer to a user by using the @ sign and the user name, or the user will not get notified, just like I did not get notified that you had a message for me.

Regarding the Column Expressions, I can explain what it says:

if(!isMissing(column("Generic_Col")) && variable("Home Qualities") != "missing") {
    toInt(variable("Level Adjusted HAVES Minus Have NOTS Dollar Value"))
} else {
    column("Generic_Col")
}

It looks impressive because of the long names - they’re the names that your file had :smiley:

Basically whatver is in column() represents a column name, and whatever is in variable() represents a variable name

Let’s just use some generic names for the explanation:

if(!isMissing(column("A")) && variable("x") != "missing") {
    toInt(variable("y"))
} else {
    column("A")
}

It’s a bit better now, isn’t it?

To summarize the code, it’s doing:

if(conditions) {
  apply this part if conditions are true
} else {
  apply this part if conditions are NOT true
}

And I have 2 conditions there:
!isMissing(column("Generic_Col")) : This means the condition that the value from your dataset #1 is not missing.

variable("Home Qualities") != "missing" : This is the condition that the column header from the dataset #1 matched as a Home Qualities record

The “&&” is an AND operator, which means that both conditions have to be true for the if statement to be true. You can use other operator, such as “||” for OR, which would mean at least 1 of the conditions to be true is enough for the if statement to be true. In our case, we want both conditions to be true.

The toInt() function simply enforces that the type of the value is integer.

I have modified the workflow so that it only includes the matched results between dataset #1 and dataset #2:

I highlighted the changes. Basically, I just filter in only those that match, and then continue the same process as before.

Here’s the modified workflow: Convert IDs to Derived Values in a Table.knwf (154.5 KB)

Note, I did not use your new dataset2. You can just use it on your side in the new workflow. The workflow does not care if the data change. It will process whatever you give it the same way, so just change the data file and run it.

Hi @ bruno29a. sorry, i did not know the reply button did not directly notify you.

Thank you so much for all of your efforts and assistance. I see you are “self-taught”. Me also. But I think you have a better teacher than I do :flushed:

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.