Compare and replace strings for one column

TS22 · September 19, 2022, 9:03pm

I am very new to KNIME and only familiar with a few basic Column, Row, and Math manipulation nodes. So far I know how to filter, split, and add basic math calculations. I am given this project with 2 files of movies names. File 1(Popular Movies) only has unique strings which are movie names and their codes. File 2(Movies On My List) has duplicates strings(duplicate rows) and inconsistent movie names. I would like to compare the strings within col2 in File 2 first to get rid of inconsistent movie names.
Examples of a few records. My files have a few thousand records.
File1: col1(movie code) col2(movie name) File2: col1(actor name) col2(movie name)
HR1 Don’t Breathe John Doe Don’t Breathe 2020
FA1 Encanto Jane Smith Don’t Breathe
AC2 Mission Impossible Tom Cruise Mission Impossible
CO1 American Pie Jen Gen Don’t Breath &
SCI Outbreak John Doe Don’t Breathe 2020
SCI Stranger Things Billie Millie American Pie
CO2 End of The World Dakota Fanning American Pied
FA2 Shrek Joe Smith Mission ImpoSSible
ABC James Alphabet
123 John Counting
I can filter out the duplicate records in File 2 to keep only the unique values but it still includes inconsistent strings like:
Don’t Breathe, Don’t Breathe 2020, and Don’t Breathe & should be Don’t Breathe
American Pie and American Pied should be American Pie
Mission Impossible and Mission ImpoSSible should be Mission Impossible

First, I would like to clean up these strings and count total actors in File 2 after duplicate records are removed.
Then merge Cross Join File 1 at movie name but append movie code to the matching movie names in File1, and finally count the unmatched movie name in File 2.
Any help will be appreciated as I am stuck at this point. I saw the option of The String Matcher Nodes but it is no where to be found in my Node Repository.

r_jain · September 20, 2022, 4:18am

A joiner node on the movie name condition will work in this. You can match on movie name and you will find all the movie code, movie name, actor name. You might get duplicate in this on the only condition that there are multiple actors in your second file lets say
John Krasinski Don’t Breathe
Emily Blunt Don’t Breathe

In this case the joiner node will create two entries

HR1 Don’t Breathe John Krasinski Don’t Breathe
HR1 Don’t Breathe Emily Blunt Don’t Breathe
which is totally fine.

You can use groupby Node on the file2 to identify unique actors, and in the eventuality that there are same actors who might have worked on multiple movies, and you want movie name as well, then you can groupby on actor, movie name

TS22 · September 20, 2022, 6:29pm

Thanks. It worked. I used the matching rows at movie name and that got rid of the multiple entries but then if I want to know the unique count of movie on file 2 how do I do that when each incorrect movie name is counted as one entry. I can do that in Excel before reading the file but that is not what I want.

r_jain · September 21, 2022, 4:20am

Groupby Node can help you with the unique count. So Don’t breathe will be counted as 1, Don’t breathe 2020 will be counted as 1 in that case.
Do not do this count after doing the inner join as the ones which are not getting matched will get removed from the result.
Take this as a separate chain.

TS22 · September 22, 2022, 12:26pm

But this will also count Don’t Breathe, Don’t Breathe 2020, and Don’t Breathe & as 3 unique values when it should be just one movie.

r_jain · September 22, 2022, 12:46pm

In that case you would need to write a regular expression to make them into same field.
But that will also just work for one type.
In the case that there are 2-3 different movie names which again have similar issue it won’t be possible.

If you only have one movie that is Don’t breathe… you can count the letters and similarily do the same length extraction in your file 2.

If its Don’t 2 breathe, it won’t work
If its Don’t breath also it won’t work.

Your goal is somewhat between fuzzy match and count of movie name by changing the erroneous name to its correct name.
This change needs to follow a pattern or you won’t be able to do this according to me.
Rest i think the other community members may have a better solution for this.

TS22 · September 22, 2022, 1:09pm

Thank you. I will be much easier to clean the movie names using Excel first since I have almost 5000 rows and atleast 30 unmatched movie names. I tried the String Replacer(dictionary). That didn’t work either or maybe I didn’t do it correctly.

r_jain · September 22, 2022, 1:30pm

You can maintain a dictionary but again that is a tedious thing.
You have to like make a fuzzy logic first on the first join to create that dictionary

so you get like 3 rows

Don’t Breathe Don’t breathe 2020
Don’t Breathe Don’t breathe &
Don’t Breathe Dont Breathe

Then you can match this back again with the original data set to get the accurate count.

TS22 · September 22, 2022, 2:23pm

I agree. Maybe this is not just what one of the tasks for KNIME. What so you think about String Matcher Nodes? I can’t find it but the forum mentioned it.

Daniel_Weikert · September 22, 2022, 3:24pm

You can try string similarity node for fuzzy matching your data
br

TS22 · September 22, 2022, 3:57pm

Is it called String Similarity Node because I do not see it. What do you do mean by fuzzy matching? Will it replace the incorrect ones?

Daniel_Weikert · September 22, 2022, 4:17pm

Yes the node is called that way and it can match values which are similar based on string distance (e.g. slightly misspelled words)
br

system · December 21, 2022, 4:18pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.