Webinar: Text Mining Techniques, June 25

Analyzing textual data comes with its own set of unique challenges. Whether you’re evaluating consumer sentiment about a product, clustering terms in medical literature, or extracting topics from social media posts, you need a tool that’s broadly applicable. KNIME Analytics Platform will help you tokenize, enrich, process, and visualize your text data, with an eye on building predictive models you can easily put into production. What’s more, with KNIME you can build a workflow using a graphical user interface, without writing any code.

At the text mining webinar on June 25, @ScottF and Dr. Dursun Delen from Oklahoma State University walked you through a demo of some common text analytics techniques using KNIME Analytics Platform. They touched on use cases like sentiment analysis and topic modeling, and presented methods for both exploratory visualization and model creation.

Further resources:

  • Recording of the webinar is available on YouTube here
  • The slides from the presentation are here

You’ll find the questions and answers that were discussed during the Q&A session below.

What is the best approach to interpret the extracted topics?

Generally, you can look at the words (and weights) associated with each topic to see if there is an obvious commonality - for example, the terms [firm, market, value, portfolio] in a single topic probably indicate these documents have something to do with investments. You may not always be able to interpret the topics in an obvious way, so don’t be surprised if that’s the case. Still, even if topics are not immediately interpretable, it can be useful to know how documents are clustered by the LDA algorithm.
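As a minimal sketch of that inspection step outside KNIME, the Python snippet below ranks the terms of each topic by weight so the dominant words stand out. The topic names and weights are made up for illustration and are not output from the webinar workflow.

```python
# Hypothetical topic-term weights, for illustration only.
# They are not results from the webinar workflow.
topic_term_weights = {
    "topic_0": {"firm": 0.12, "market": 0.10, "value": 0.08, "portfolio": 0.07, "the": 0.01},
    "topic_1": {"patient": 0.15, "dose": 0.09, "trial": 0.08, "firm": 0.01},
}


def top_terms(weights, n=3):
    """Return the n highest-weighted (term, weight) pairs, highest first."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n]


for topic, weights in topic_term_weights.items():
    terms = ", ".join(f"{term} ({weight:.2f})" for term, weight in top_terms(weights))
    print(f"{topic}: {terms}")
```

Reading the top few terms per topic side by side is usually enough to decide whether a topic has an obvious label or should simply be treated as an unnamed cluster.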

If the text is in Spanish, how can I do it? Do I have to translate it first, and can that be done directly in KNIME?

Translating beforehand isn’t necessary, and in fact, you probably wouldn’t want to do that anyway for fear of losing nuance in the original documents. What you can do instead is download one of the existing language packs available in KNIME. This will make available tokenizers (for example, in the Strings To Document node) native to the language you’re interested in. Once the documents are tokenized, the rest of the analysis is largely the same regardless of what language you’re working with.

With regard to these hidden topics - are they necessarily words that already exist within the sentences being analyzed?

The topics themselves are abstract representations of how documents are clustered together based on the LDA algorithm. The weighted terms by which the topics are organized do of course appear in the documents, but the topics themselves are just abstract groups. Any meaning that you may assign to them will be based on your own domain knowledge and intuition.

Is there a way to look up existing corpora to try and find what I need once approval arrives?

If you’re looking to tag words according to sentiment, named entities, chemical structures, and so forth, you can use the Dictionary Tagger node in combination with lists from different academic sources. For example, the MPQA Opinion Corpus is commonly used for assignment of sentiment.

How large does the document collection have to be to apply LDA?

The answer here is “it depends”. You want to use as much data as you can, but LDA can be a resource-intensive algorithm, so you will have to find a balance between your available CPU cores and RAM, the number of documents you want to process, and how large those documents are. For small documents like tweets, you can process many more records than you could when looking at, say, large journal articles. So you will have to use your best judgment.

Is it easier to predict critic success as opposed to box office success, in your experience?

That is a good question. We all know that financial success and critics' ratings do not usually move in the same direction. Our results are aimed at predicting the financial success of a movie before its production cycle, so that the model can be used as a decision tool to select movie projects that are more likely to produce a profit. We have not tried to predict critics' ratings, but I think we can (perhaps with a slightly different input variable list), if we have enough motivation behind it.

Could movie genre be predicted using LDA?

Theoretically, I would say yes, at least to a large extent. It may not be perfect. Using only plot summaries may not be sufficient, and hence the full script may need to be used to identify genre/themes with LDA. It would be an interesting exploratory study.

How stable is the movie model over time? (related to changes in tastes in movies.)

Better than expected. Some special events may cause significant shifts in viewership at times, but these seem to be rather random and are not easy to include in the predictive model.

On the movie project, why not treat it as a regression task?

We did that in our earlier attempts. The assessment with MSE, MAPE, and MAD produced results that were deceptively off target (due to a few large errors in the point estimates). After consulting with domain experts (i.e., Hollywood decision makers), we converted the target variable into nine success classes. This structure provided better intuition, prediction, and trust in the predicted results.
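To make the effect of a few large errors concrete, here is a small Python sketch. All numbers and the class boundaries are hypothetical and chosen only for illustration; they are not the data or the class definitions used in the study.

```python
from bisect import bisect_right

# Illustrative values only (think of revenue in $ millions), not study data.
actual = [5.0, 15.0, 30.0, 400.0]
predicted = [7.0, 13.0, 33.0, 250.0]

errors = [p - a for a, p in zip(actual, predicted)]
mse = sum(e * e for e in errors) / len(errors)            # mean squared error
mad = sum(abs(e) for e in errors) / len(errors)           # mean absolute deviation
mape = 100 * sum(abs(e) / a for a, e in zip(actual, errors)) / len(errors)

# Hypothetical cut points that split the target into nine ordered classes.
CUT_POINTS = [1, 10, 20, 40, 65, 100, 150, 200]


def success_class(value):
    """Map a numeric value to a class label from 1 (lowest) to 9 (highest)."""
    return bisect_right(CUT_POINTS, value) + 1


actual_classes = [success_class(v) for v in actual]
predicted_classes = [success_class(v) for v in predicted]
accuracy = sum(a == p for a, p in zip(actual_classes, predicted_classes)) / len(actual)

print(f"MSE={mse:.2f}  MAD={mad:.2f}  MAPE={mape:.1f}%  class accuracy={accuracy:.0%}")
```

In this toy case a single large miss dominates the regression metrics, while the class view still places every prediction in the same class as the actual value. Real data would rarely be this clean.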

Can the workflow demonstrated in the webinar be used to analyse results from a qualitative survey?

Yes, it can. Unstructured textual input obtained from surveys can be analyzed in a semi-automated fashion with text mining. The results can be used both for predictive modeling (predicting sentiment or other prediction-worthy labels/classes) and for explanatory modeling like LDA, to discover novel and actionable patterns and relationships.

What does “tokenization” refer to?

Breaking a document into its sub-components: sentences, terms, words, etc.
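As a rough illustration of the idea, the following Python sketch splits a text into sentences and then into word tokens using simple regular expressions. It is a deliberately naive simplification and is not how the tokenizers shipped with KNIME work.

```python
import re

text = "KNIME reads documents. It splits them into sentences, then into terms!"

# Naive sentence split: break after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word tokenization: runs of ASCII letters, digits, or apostrophes.
tokens = [re.findall(r"[A-Za-z0-9']+", sentence) for sentence in sentences]

for sentence, words in zip(sentences, tokens):
    print(sentence, "->", words)
```

Language-specific tokenizers exist precisely because rules like these break down quickly, for example on abbreviations or on languages that do not separate words with spaces.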

Is it possible to export workflows from KNIME as Python scripts?

This is not possible. However, you can call KNIME from Jupyter notebooks and the other way round: https://www.knime.com/blog/knime-and-jupyter

What are advantages/disadvantages of these three methods - positive/negative words, machine learning, deep learning - for foreign languages?

The dictionary-based approach has the big advantage that you don’t need a labeled dataset. The machine learning approach often leads to better performance, but you can only apply it if you have a labeled dataset. And the deep learning approach can improve your model even more, but you need a lot of labeled data.
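To show why the dictionary-based approach needs no labeled data, here is a minimal Python sketch that scores a text by counting words from two lists. The word lists are tiny, hypothetical stand-ins for a real lexicon; for another language you would swap in lists for that language.

```python
# Tiny hypothetical word lists; a real sentiment lexicon is far larger.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def lexicon_score(text):
    """Return (#positive words - #negative words) for a whitespace-split text."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


print(lexicon_score("Great plot, terrible acting, but I love it!"))
```

A score above zero suggests a positive text and below zero a negative one. The trade-off is that such a scorer ignores context like negation, which is where trained models tend to do better.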

Why does the negative tagger follow the positive tagger, rather than running the two in parallel?

If you use them in parallel, you have one document with the positive words tagged and one document with the negative words tagged. By using them in a sequence you get one document, where the positive words are tagged as positive and the negative words as negative.
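The difference can be sketched in Python with a hypothetical `tag_terms` helper that stands in for a tagger node. It is an illustration of the data flow only, not KNIME's implementation.

```python
def tag_terms(tagged_doc, lexicon, tag):
    """Return a copy of the (term, tag) list with `tag` set on terms in lexicon."""
    return [(term, tag if term.lower() in lexicon else old) for term, old in tagged_doc]


# An untagged document: every term starts with no tag.
doc = [(term, None) for term in "great cast but terrible script".split()]

# In parallel: two separate outputs, each carrying only one kind of tag.
only_positive = tag_terms(doc, {"great"}, "POSITIVE")
only_negative = tag_terms(doc, {"terrible"}, "NEGATIVE")

# In sequence: the second tagger receives the first tagger's output,
# so a single document ends up with both kinds of tags.
both = tag_terms(tag_terms(doc, {"great"}, "POSITIVE"), {"terrible"}, "NEGATIVE")

print(both)
```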

How do you create the library (group the words)?

You are probably referring to how to create a bag of words from a set of documents. See https://kni.me/n/_z-NeUjJ73PmXkhP
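For readers unfamiliar with the term, a bag of words records which terms occur in each document and how often, ignoring word order. A minimal Python sketch with two made-up documents:

```python
from collections import Counter

# Two made-up example documents.
docs = ["the market value of the firm", "the firm manages a portfolio"]

# One bag (term -> count) per document.
bags = [Counter(doc.split()) for doc in docs]

# Shared vocabulary across all documents, in alphabetical order.
vocabulary = sorted(set().union(*bags))

# Document-term matrix: one row per document, one column per vocabulary term.
matrix = [[bag[term] for term in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```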

Can I connect KNIME to an S3 bucket from the KNIME desktop?

Yes, have a look at this link on the KNIME Hub to the Amazon S3 Connection extension: https://kni.me/n/0KZX9OWGLEgXxGhA

Scientific papers, in pdf format, usually contain different columns. When I read pdf files, the words from different columns will be joined in a single row. How can I avoid that?

You could check whether the Tika Parser node does a better job of reading the files. If the resulting string contains a delimiter that separates the different columns, you can then split it with the Cell Splitter node.

Is there a node that can remove all punctuation? Although I applied several nodes for that in my workflow, I still end up with some words that have some punctuation characters

In theory, the Punctuation Erasure node should do this for you (https://kni.me/n/MHwpGtMX1Fgfz31v). However, you could also remove the punctuation before you create the documents, using the String Manipulation node, or try out another tokenizer when creating the documents.
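One possible reason for leftover punctuation, offered here as an assumption rather than a diagnosis of any particular workflow, is that the text contains non-ASCII punctuation such as curly quotes or ellipsis characters. The Python sketch below contrasts an ASCII-only removal with one based on Unicode character categories; it illustrates the pre-cleaning idea and is not how the KNIME node works.

```python
import string
import unicodedata


def strip_ascii_punct(text):
    """Remove only the ASCII punctuation characters listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def strip_unicode_punct(text):
    """Remove every character whose Unicode category is punctuation (P*).

    Symbols such as '$' or '+' are in category S*, so they are kept.
    """
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


sample = "“Hello,” she said… it’s fine!"

print(strip_ascii_punct(sample))    # curly quotes and the ellipsis survive
print(strip_unicode_punct(sample))
```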

I am waiting for approval from UNESCO to develop a "grooming language" detection model to detect such language or process. I lack a grooming language corpus in Spanish.

You can tag terms in KNIME AP using the whitespace tokenizer or the simple tokenizer. The Stop Word Filter node allows you to filter terms in Spanish (there is an embedded list for that).

What's the process to go about getting KNIME beginner certification?

Our L1 and L2 courses prepare new users for the beginner certification (L1 and L2). The next certification takes place online on June 30: https://www.knime.com/about/events/knime-certification-online-june-30-2020