Topic Extractor Node

Hey Victor,

Thanks for your answer. I have attached two example lists, one with single words and the other with inspection data.
Inspection_A.xlsx (10.4 KB)
Possible Findings.xlsx (8.9 KB)

As you can see in the inspection data, there is a column containing texts from component inspections. What I’d like to do now is to preprocess the texts (which I’m already able to do) and then match each “Finding” to one or more words from the “Possible Findings” table.

I already thought about using the Levenshtein distance or LDA, but it feels like transforming the “Findings” into a binary type and then performing a similarity search (like you explained in the “Spanish notes Analysis”) would be more helpful. But I’m completely open to new ideas and advice.
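
For reference, the Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one word into another. A minimal Python sketch of the idea (not a KNIME node; the function name `levenshtein` is only for illustration) might look like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

With this, `levenshtein("inspekted", "inspected")` is 1, so a small distance threshold could tolerate that kind of misspelling.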

My biggest problem, as I already mentioned, is to create a working similarity search and to understand what nodes such as the Joiner, Bit Vector Creator and Similarity Search do specifically.

Thanks for your support so far, I really appreciate your help.

Br, Martin

Great, could you also attach the .knar file (the workflow)? That way I can inspect what you’ve done and see what output you expect, along with your assumptions, settings, etc. Thanks.

Hey Victor,

I’ll upload my workflow as soon as I have Internet access on my computer again. I’m currently on the highway, so it might take a couple of hours. I will get in touch with you asap!

Component_Workflow.knwf (45.1 KB)

Good Morning Victor,

Attached is the workflow I have created so far. It took me a while to export it yesterday, sorry for that.
To summarize again:

  • I still don’t have a good way to assign the “Finding” texts to the appropriate “Possible Findings” (but there is an idea for a similar solution that you explained in the “Spanish notes Analysis”)
  • I used the Stanford Lemmatizer to shorten the words, but it doesn’t work if a word is spelled wrong (inspekted instead of inspected)

I’d be beyond grateful if you could help me out with that problem.

Best regards, Martin

You can actually do this with just a few nodes. Note that this method will label every row, regardless of whether there is a true match or not.
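
To illustrate why every row gets a label: a nearest-match search always returns the closest candidate, even when nothing is really close. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure (an assumption, not what the Similarity Search node computes internally). The names `nearest_label` and `label_or_none` and the 0.5 cut-off are hypothetical.

```python
from difflib import SequenceMatcher


def nearest_label(text, candidates):
    """Return (label, score) for the candidate most similar to text."""
    scored = [
        (SequenceMatcher(None, text.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, label = max(scored)  # always picks something, even at score 0.0
    return label, score


def label_or_none(text, candidates, min_score=0.5):
    """Same search, but reject matches below a (hypothetical) cut-off."""
    label, score = nearest_label(text, candidates)
    return label if score >= min_score else None
```

A similarity cut-off like the one in `label_or_none` is one possible way to separate true matches from forced ones; the right value would have to be tuned on the real data.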

Similarity Search 1.knar.knwf (32.6 KB)

@badger101, could you also share your workflow that you mentioned as well?

@Martin_23 , I can also show you a way to extract the words instead of doing a similarity search, but this requires a little bit of Regex.

First we lowercase the text to get the best possible matches, since uppercase and lowercase letters are different characters to a computer.

Then we use a little regex joining all the search terms together with | (which means “OR” in regex).

Search for the word apple or dog ==> apple|dog
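
The same idea can be sketched in Python with the standard `re` module. This is only an illustration of the lowercase-then-OR approach, not the inside of the Regex Find All component; `find_terms` is a hypothetical helper name.

```python
import re


def find_terms(text, terms):
    """Return every search term found in text, in order of appearance."""
    if not terms:
        return []
    # Escape each term so characters like "." are taken literally, and try
    # longer terms first so "cracks" is preferred over "crack".
    ordered = sorted((t.lower() for t in terms), key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)  # e.g. "apple|dog"
    return re.findall(pattern, text.lower())
```

For example, `find_terms("Apple and DOG found", ["apple", "dog"])` gives `["apple", "dog"]`, and a text without any of the terms gives an empty list. Note that this sketch has no word boundaries, so "dog" would also be found inside "dogma".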

Similarity Search 2.knar.knwf (52.0 KB)

This solution actually works better on this sample of data, but if you have a lot of misspellings, this won’t work that well unless you account for those misspellings.

If there are common misspellings, just add them to your list of words to search for. If not, you may have to accept some loss of information in exchange for automation.

Note that this method is more convenient because you can filter out non-matches using the missing values in the Split Value 1 column.

Hi, it seems like the workflow you shared here is the advanced version of what I had in mind. I’d choose yours over mine. Plus, I didn’t take into account the possibility of finding more than one word from the reference source.

Hey Victor,

Big thanks for your two solutions. I tried the first one on my real data and it worked most of the time. Sometimes the result was not correct or didn’t match the findings as expected.

The second one looks more suitable for what I want to do with the data. Did you build this node? I didn’t know that there are nodes which contain even more nodes. Could I, for example, build my own ones? As far as I can see, the only option is to collapse nodes into metanodes.

How did you come up with the idea to put all words in one regex? I see that it works, but I would never have thought that this is necessary.

Thanks for your help so far and have a great start into the week.

Br, Martin

Hey Victor,

Unfortunately, I’m not able to use my preprocessed data as input for the “Regex Find all” node.

[Screenshot: Available Input selection]

All columns consist of strings, so there are no different data types. How can I get my column names to show up in the drop-down list? Do I need to change the flow variables in the “Regex Find all” node?

Br, Martin

I don’t think I’ve seen this issue before. Could you send me a list of your column names and types so I can attempt to replicate this (use the Extract Table Spec node), and then cut and paste the names giving you problems directly from the output?

And yes, you can create “components” or “metanodes” which will let you wrap up several nodes. Then you can share those components with the community. The Regex Find All is a community component which I use regularly. Whenever I run into an issue, I then open the component (ctrl+double click or command+double click) to look inside the component and see where it is failing.

Hey Victor,

I’m glad to say that I was able to fix the problem on my own. Everything has worked fine since I restarted KNIME this morning.

After manipulating the results (converting them into strings, etc.) I now want to match the rows to the appropriate findings. Right now I have a bunch of rule-based row filters which output the rows with the corresponding finding. Is there a way to make it more “simple”?

It will look pretty “bulky” when I add about 7 or 8 more filters. Later on I’d like to show how many texts belong to which finding (e.g. in a bar chart or pie chart). Do you have any advice for that?

Br, Martin

You can try to pack your rules into a dictionary.

Use the GroupBy node to get statistics.
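
As a rough sketch of both suggestions in Python (an illustration of the idea, not KNIME configuration): the rules live in one lookup table instead of many separate filters, and the counting step plays the role of a group-by with a count aggregation. The `rules` entries, the sample texts and the function names are made up for the example.

```python
from collections import Counter

# Hypothetical rule dictionary: keyword to look for -> category to assign
rules = {"crack": "Cracks", "dent": "Dents", "nick": "Nicks"}


def categorize(text, rules):
    """Return the category of the first rule whose keyword occurs in text."""
    lowered = text.lower()
    for keyword, category in rules.items():
        if keyword in lowered:
            return category
    return None  # no rule matched


def count_categories(texts, rules):
    """Count how many texts fall into each category (unmatched are skipped)."""
    categories = (categorize(t, rules) for t in texts)
    return Counter(c for c in categories if c is not None)
```

In this sketch the first matching rule wins, so a text that mentions two findings is only counted once. Adding a new finding then means adding one dictionary entry rather than one more filter.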

@izaychik63’s suggestions would be my first approach. @izaychik63 , thanks!

The other thing, @Martin_23, is to post your workflow (.knar file) so we can inspect your output and produce something similar with fewer nodes. Thanks.

Hey Victor,

Below you will find the workflow and the two Excel lists I used for it. Again, it uses example data.

Findings_to_categories.knwf (85.8 KB)
Inspection_A.xlsx (10.6 KB)
Possible Findings.xlsx (8.6 KB)

Thanks for your support. Br, Martin

Hi @Martin_23 , if the end game is a pie chart or bar chart, then the Unpivoting node will be sufficient for this task:

Findings_to_categories_2.knar.knwf (96.0 KB)

I then used a row filter to remove missing values, but some charts won’t count missing values anyway, so this may be unnecessary.

Hey Victor,

The unpivoting worked pretty well, yet it didn’t produce the output I expected. As you can see in the example workflow I provided, one text/component can have two findings. What I wanted to do is to assign the texts to the corresponding finding.

  • If the text includes only one finding it counts for one category.
  • If the text includes two findings, e.g. row 1 (cracks and nicks) or row 2 (cracks and dents), the workflow creates a new category made of these two findings joined together. I expected that in this case cracks and dents/nicks would each be counted in their own category. It seems that I need to split the string first so that the two (or more) findings are separated, right?
    When you use the Value Counter node instead of the bar or pie chart, you can see the additional categories. How can I prevent this and make the workflow count “correctly”?
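
The counting described in the second bullet can be sketched in Python as follows. It assumes the combined findings sit in one string per row, separated by a comma; the separator, the sample values and the name `count_findings` are assumptions for illustration, not taken from the workflow.

```python
from collections import Counter


def count_findings(cells, sep=","):
    """Split each combined cell and count every finding on its own."""
    counts = Counter()
    for cell in cells:
        if not cell:  # skip missing or empty cells
            continue
        for part in cell.split(sep):
            finding = part.strip()
            if finding:
                counts[finding] += 1
    return counts
```

With the cells `["cracks,nicks", "cracks,dents", "dents"]` this counts cracks twice, dents twice and nicks once, instead of creating the extra categories "cracks,nicks" and "cracks,dents". In workflow terms, this would roughly correspond to splitting the cell and producing one row per finding before the counting step.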

Br, Martin

Could you create a doodle of the output you expect (on paper or via some image)? I don’t think I understand what you mean.

Sorry for the confusion. I’ll try my best to make it clear:

The table “Possible Findings” which I provided contains findings/words that can be found in the column “Finding” in the Excel sheet “Inspection_A”. This “Finding” column also includes cells which don’t contain any of the words from “Possible Findings”. We already filtered out these rows with the “Regex Find All” node.

After unpivoting (which you explained) I got a new column that shows which of the “Possible Findings” words were found in the “Finding” cells. If two words were found, we got a string consisting of two words.

What I want to do now is to have an interactive filter, where you can select one, two, three or many more words to show the findings in a bar-chart or pie-chart. The result should look like this:

Right now I use the Nominal Value Row Filter to filter the words that should be shown:

I also tried to use the Interactive Value Filter Widget and managed to get the drop-down menu, but was not able to connect it with the Pie-Chart or Histogram:

Is there a way to connect the Interactive Value Filter Widget with the visualization like a Histogram or Pie-Chart? I thought about a “fusion” of those two nodes as one node but so far it seems a little bit unrealistic to me.

I hope I could describe my problem more clearly than before. I’d really appreciate your help with that.

Best regards

Great, please attach your knar file (knime file) where you used the Interactive Value Filter Widget and I think we can make the interactivity work by putting the nodes in a component. You can see a discussion of how to make use of widgets here.

Hey Victor,

Please find attached the example workflow where I use the Interactive Value Filter Widget prior to the pie-chart node. I have also attached the new example data, which I had to edit to make it more similar to the real data:

Inspection_A.xlsx (10.6 KB)
Possible Findings.xlsx (8.6 KB)
Interactive Pie-Chart example.knwf (105.6 KB)

Here are a few things I was wondering about:

  • Reading your article (Widget vs. Configuration Nodes) made me question whether I need widgets at all, because I will use KNIME Analytics Platform exclusively. Right now I am not completely sure whether I will still need the widget nodes or can switch to the configuration nodes.

  • When executing the interactive value filter widget all checkboxes or the list (for multiple values) are enabled by default. Using the real data there will be over 40 different checkboxes for different findings. If you only want to have the pie chart for e.g. 10 findings, you have to disable all the others by hand which takes a lot of time. Is there a way to disable them by default?

  • After putting the widget and the pie chart in one component, I expected that these two elements could “communicate” with each other. But checking or unchecking the boxes does not change the pie chart. What did I miss here?

  • In addition to the pie chart, a histogram would be helpful. As far as I understand, I need the Value Counter node first to provide a double-type value to the histogram. Is that also possible in the newly built component, or do I have to count the words beforehand?

Hope that’s not too confusing.

Thanks in advance and br