Genres in IMDb data

hans3400 · February 9, 2022, 3:31pm

Hi,

I’m new to KNIME, so I hope someone can help with a rather basic question.

I’m working with the IMDb database. The Genre column sometimes contains multiple genres. Using the Cell Splitter, I now have a separate column for every genre that a movie might have. Now I have to find out which genre is more common. How do I do that?

Thank you

Thyme · February 9, 2022, 4:04pm

Can you share with us what you have achieved so far? Your workflow or example data would help a lot in answering your question. How do you define “most common genre”? This link about minimal reproducible workflows might help:
Reproducible (Minimal) Workflow Example - KNIME Resources / Knowledge sharing - KNIME Community Forum

I’ll take a speculative guess, though:
Set the Cell Splitter to “List” or “Set” instead
Ungroup that List
GroupBy (genre as group column, aggregate any column with count)
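
For readers who think in code, the same three steps can be sketched outside KNIME. This is only an illustration in plain Python, not part of the workflow: the sample rows, the comma delimiter, and the helper name `count_genres` are all assumptions, since the actual IMDb file was not shared.

```python
from collections import Counter

# Hypothetical sample rows; the real IMDb file and its delimiter may differ.
rows = [
    "Drama, Romance",
    "Comedy, Drama",
    "Action, Drama, Thriller",
    "Comedy",
]

def count_genres(genre_cells, delimiter=","):
    """Split each cell into genres, flatten to one genre per entry, and count."""
    counts = Counter()
    for cell in genre_cells:
        # A set per row mirrors the "Set" output: a film counts once per genre.
        genres = {g.strip() for g in cell.split(delimiter) if g.strip()}
        # Updating the counter plays the role of Ungroup followed by GroupBy/count.
        counts.update(genres)
    return counts

genre_counts = count_genres(rows)
print(genre_counts.most_common())
```

`most_common()` returns the genres ordered by descending count, which corresponds to sorting the GroupBy output on the count column.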

Daniel_Weikert · February 9, 2022, 4:48pm

If you have the genres as a list in one column, you could use the Ungroup node and then a GroupBy node to count how many times each genre appears.
br

hans3400 · February 9, 2022, 5:29pm

Thank you so much for your reply. I’ll try to explain a bit better this time.

This is my assignment:

Split the contents of the genre column, so that each genre name for each film goes in a separate column. [Cell splitter].
Bonus exercise: Which genre is more common?

And this is what I’ve done so far: KNIME - Album on Imgur

I’ve put my CSV file in a File Reader, used the Cell Splitter to break up the genre column (on IMDb a movie can have multiple genres), and now I have to see which genre is more common.

As I wrote before, I’m just getting into it, so it’s rather basic.

Thyme · February 9, 2022, 5:46pm

Hmm, for finding the most common genre it’s actually more efficient to not split the entries into multiple columns, but rather create a set (or list, doesn’t matter since there shouldn’t be any duplicates).

This set can then be “ungrouped” into multiple rows.

Finally, we aggregate the column containing the split results. If you want the most common genre, the “Mode” aggregation gives exactly that; otherwise you could use “Unique concatenate with count” to count all occurring genres.
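
To make the difference between the two aggregations concrete, here is a rough Python equivalent. It is a sketch under assumptions: the genre values are made up, and the `name(count)` layout of the concatenated string is inferred from how the output is described in this thread, not taken from the node's documentation.

```python
from collections import Counter

# Hypothetical ungrouped genre column: one genre per row.
ungrouped = ["Drama", "Romance", "Comedy", "Drama", "Action", "Drama", "Comedy"]

counts = Counter(ungrouped)

# "Mode"-style result: only the single most frequent value.
mode_genre = counts.most_common(1)[0][0]

# "Unique concatenate with count"-style result, in order of first appearance.
# The "name(count)" layout is an assumption about the node's output format.
concatenated = ", ".join(f"{genre}({n})" for genre, n in counts.items())

print(mode_genre)
print(concatenated)
```

The first result answers "which genre is most common" directly; the second keeps every genre with its count, but as one string that still has to be taken apart before it can be sorted.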

hans3400 · February 9, 2022, 6:19pm

Thank you so much! I think it’s so cool that there is a community like this that is ready to help people like me.

This is where it took me. Drama is the most common genre.

Last question: how can I sort it so it starts with Drama, then the next most common, and so on? Right now it’s just random, I think.

Thyme · February 9, 2022, 6:56pm

It’s order of appearance. To sort the results, we’d have to add another Cell Splitter and Ungroup, then some string manipulation to extract the frequency so we can sort by that.

This gets you the genre name (everything up to the first round bracket):

substr($ungroupedCol$, 0, indexOfChars($ungroupedCol$, "("))

and this gets you the frequency (everything after the first round bracket, with the closing bracket removed):

replace(substr($ungroupedCol$, indexOfChars($ungroupedCol$, "(")+1), ")", "")
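
The same parsing logic, written as a small Python sketch for comparison. The input values and the helper name `split_name_and_count` are hypothetical; the only assumption carried over from above is that each ungrouped value looks like `Drama(3)`.

```python
# Hypothetical values after splitting the concatenated string and ungrouping.
ungrouped_col = ["Drama(3)", "Romance(1)", "Comedy(2)", "Action(1)"]

def split_name_and_count(value):
    """Return (genre, frequency) from a string such as 'Drama(3)'."""
    open_idx = value.index("(")      # position of the first round bracket
    name = value[:open_idx]          # everything before the bracket
    # Everything after the bracket, minus ")", cast to int so sorting is numeric.
    freq = int(value[open_idx + 1:].replace(")", ""))
    return name, freq

pairs = [split_name_and_count(v) for v in ungrouped_col]

# Sort by frequency, highest first; ties keep their order of appearance.
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Note the conversion to a number: if the frequency stays a string, sorting is alphabetical, so "9" would rank above "10". The same caveat likely applies in the workflow, where the extracted frequency would need to be converted to a number before sorting.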
