Question specific to the blog "How to Evaluate CX with Stars and Reviews"

I’m trying to better understand analyzing reviews using KNIME and text processing. I found this blog under ML101, and have found it quite helpful:

How to Evaluate CX with Stars and Reviews | KNIME

However, there are certain decisions in the blog that aren’t self-evident to me.

  1. I understand that using n-grams provides more information. I expanded all of the components to look into each one, and I converted the bi-gram into a tri-gram, which was extremely helpful in seeing how many reviews were associated with customer service.
  2. I also see that these n-grams are used downstream to filter to the top 20 (an arbitrary number per their notes), and that text in the reviews is replaced with the hyphenated bi-gram. But where is this being used? Does the topic modeler see these bi-grams as a single entity for topic discovery, vs. looking for singular words?

I think I know the answer, but just wanted to confirm that the point of extracting the bi- or tri-gram is that the topic modeler sees those words together, which gives a more meaningful topic representation (i.e., distinguishing “terrible-customer-service” from “excellent-customer-service” would yield better topic identification than the single terms “terrible”, “excellent”, “customer”, “service”).

  1. In one of the nodes, the filter removes terms matching the following regex:
    (staff friend|friend helpful|staff helpful|clean comfortable|park garage). I don’t understand why these are being removed. They are also removing words related to Philadelphia, philly, etc. I assume that is because we are comparing only two hotels in the same area, so those terms don’t provide useful information. But why remove the friendly or helpful staff?

I have reviews that call out specific reps positively. It would make sense to leave those terms in, because if I have a customer service crew, I would want to see how many reviews refer to friendly staff, or even to specific names. We could then replicate their style with others.

Anyway, there isn’t a place to comment on the blog, so I’m just looking for some clarification on why these terms would be filtered out.

Thank you!


Hey @ebarr,

I took a look at the blog post and the workflow associated with it. Although I did not write the blog post, hopefully a fresh pair of eyes can offer some insight into the questions you had.

On your first point, I also believe that the purpose of the hyphenated bi-gram is to have the topic modeler treat it as a single term, so that it captures more meaningful information than the individual words would.
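
To make the idea concrete, here is a minimal Python sketch, not the workflow's actual implementation. It counts bigrams, keeps the most frequent ones, and rewrites each review so those pairs become one hyphenated token. The function names `top_bigrams` and `merge_bigrams` and the sample reviews are made up for illustration.

```python
from collections import Counter


def top_bigrams(docs, n=20):
    """Return the n most frequent adjacent word pairs across all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]


def merge_bigrams(tokens, keep):
    """Replace each kept bigram with a single hyphenated token."""
    keep = set(keep)
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in keep:
            merged.append("-".join(pair))
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical, already lower-cased and tokenized reviews.
docs = [
    "terrible customer service at check in".split(),
    "excellent customer service and clean room".split(),
    "customer service was slow".split(),
]

frequent = top_bigrams(docs, n=1)
merged_docs = [merge_bigrams(tokens, frequent) for tokens in docs]
```

After merging, a bag-of-words topic model would count `customer-service` as one term instead of two unrelated ones.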

For the second question, I am assuming you are referring to the highlighted node below as it has the exclude parameters you mention:

I believe they are filtered out because they are common, generic positive phrases, and the author is looking for more specific terms that offer more insight. I think the purpose of leaving them out is to keep the topics as separated as possible, because bigrams such as “clean comfortable” or “friend helpful” can apply to many topics at once and so may reduce the specificity of the topics that show up in the graph.
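
As a rough illustration of what that exclusion does, here is a small Python sketch that applies the same pattern to a table of bigram counts. The counts are invented for the example, and I am assuming the pattern has to match the whole term, which may differ from how the node in the workflow is configured.

```python
import re

# Pattern quoted from the question; the bigrams are stemmed, hence "friend".
EXCLUDE = re.compile(
    r"(staff friend|friend helpful|staff helpful|clean comfortable|park garage)"
)

# Hypothetical bigram counts, for illustration only.
bigram_counts = {
    "staff friend": 120,
    "clean comfortable": 85,
    "front desk": 60,
    "room service": 40,
}

# Keep only the rows whose term does NOT match the exclusion pattern.
kept = {
    term: count
    for term, count in bigram_counts.items()
    if not EXCLUDE.fullmatch(term)
}
```

In this toy table the two generic phrases would otherwise dominate the ranking, and dropping them leaves the more specific terms to carry the topics.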

This can be tested by deleting the Row Filter and seeing how the graphs change. Doing so changes the graph view significantly: the topics for hotel_1 now look very similar to one another, compared with the version that excludes the bigrams that can apply across several different topics.

However, like you mention, it is also perfectly reasonable to leave the excluded phrases in if you want to consider all possibilities of positive/negative sentiment.

To answer your question in short, I think the ‘Row Filter’ is there to prevent dilution of the topics, so that certain topics stand out in the analysis that would not be visible without the exclusion.

Hope this helps,
TL
