Find out about text analytics use cases at this free webinar:
October 1, 2020 at 5 - 6 PM UTC+2 (Berlin)!
Whether you are dealing with social media data or long, complex documents, KNIME Analytics Platform can help you extract knowledge with a number of techniques. Over the years, Paolo Tamagnini (KNIME) and his evangelist colleagues have worked on a number of case studies.
In this talk, Paolo will present a subset of those case studies. After a short intro to KNIME Software, there will be examples of scraping social media activity, sentiment analysis, network analysis of user activity, named entity recognition, topic detection, and more.
Don’t forget to bring your questions to the webinar. Paolo (@paolotamag) will be answering questions in a Q&A session.
About the presenter:
Paolo Tamagnini, a data scientist at KNIME, holds a master’s degree in data science from Sapienza University of Rome and has research experience from NYU in data visualization techniques for machine learning interpretability.
Q&A Session from the Webinar
What kinds of different tags are available? Is there a list?
There are several different types of tags available - parts of speech and sentiment, for example, but also specialized tags for chemistry and pharma applications. In addition, there are special tag sets for languages other than English.
Is XML considered a suitable input for Text Processing?
We have an XML-Processing Extension (https://kni.me/e/Uj1RwEYxAESCvnlT) that can be used to preprocess XML input before text processing.
Is it possible to download the presentation?
Slides for today’s presentation are here: tinyurl.com/FromWordstoWisdom
What is the name of the interactive node that was shown in the presentation?
It’s a component that was built for the COVID19 viz. You can download the workflow to play around with it here: https://kni.me/w/eKanRrgP51ARSrT5
The example in the webinar presentation focused on the Twitter set of nodes. Is there something similar for Facebook, Instagram, or Google?
Currently no, although you could probably build a component to access those services through their REST APIs.
Are the pos/neg lists built into the app, or should they be user generated?
We use external lists (for example, the MPQA opinion corpus) for assigning pos/neg sentiment. They are not built in. (That said, the Stop Words list IS built into its node.)
Is there a statement, for instance in the Rule Engine node, that will look for a character string but search in all columns?
Perhaps the String Manipulation (Multi Column) node? Example here: https://kni.me/w/BX714o2vw7uXGObl
What is the best way to perform a complex task of identifying personal data in data sets if we want to guarantee full compliance with the GDPR? With a Dictionary Tagger for nouns?
For this use case you might try out the Redfield Privacy Nodes extension, built by one of our partners: https://kni.me/e/sP3v-Esc58q3tLFw
Sentiment = (positive words - negative words) / total words. Does "total words" here include stop words or not?
If the stop words are filtered out beforehand, they are not taken into account.
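As a rough illustration of how stop-word filtering changes the denominator, here is a minimal Python sketch of the formula from the question. It is not how the KNIME nodes are implemented; the function name and the tiny word lists are made up for this example.

```python
def sentiment_score(tokens, positive, negative, stop_words=None):
    """Return (positive - negative) / total over the tokens that remain
    after optional stop-word filtering."""
    if stop_words is not None:
        # Filtering first means stop words no longer count toward the total.
        tokens = [t for t in tokens if t.lower() not in stop_words]
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t.lower() in positive)
    neg = sum(1 for t in tokens if t.lower() in negative)
    return (pos - neg) / len(tokens)


# Hypothetical word lists, for illustration only.
POSITIVE = {"great", "good"}
NEGATIVE = {"bad"}
STOP_WORDS = {"the", "is", "and", "but"}

tokens = "the webinar is great and good but the audio is bad".split()

unfiltered = sentiment_score(tokens, POSITIVE, NEGATIVE)            # total = 11 tokens
filtered = sentiment_score(tokens, POSITIVE, NEGATIVE, STOP_WORDS)  # total = 5 tokens
print(unfiltered, filtered)
```

With the same two positive words and one negative word, the score is 1/11 when stop words are counted and 1/5 when they are removed first, so the order of the preprocessing steps matters.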
Is KNIME Python-based?
No, KNIME is written in Java and uses the Eclipse UI. That said, we have nodes that allow you to execute Python code (or R, or Java) if you need to.
When performing LDA, I was having trouble with topics not being neatly delineated: there's overlap in the word clusters associated with different topics (too much correlation). Is there a way of reducing redundant or overlapping topics by regularizing the distance between the topic vectors (or otherwise)? Also, can I assign all possible topics to a record (i.e., have records with multiple topics)?
Maybe you can try to reduce the number of words per topic to avoid overlap. You can have a look at our Hub to find some examples of how to use the LDA node, for example https://kni.me/w/IvbjJjYIIRMwBvnx
Does the StanfordNLP NE Learner work only with English, or with other languages too (German, Italian)?
You can use it with other languages, depending on what you provide for its input dictionary. We have an example on the Hub that identifies Roman (Latin) names in English text: https://kni.me/w/WcH-8Te16DbeBWdN
What is the semantic difference between a document and a string (as in the Clinton emails)?
A document contains not only the text string of interest (in this case, the email text) but also metadata like the author, date, source, tagged information, and so on.
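To make the distinction concrete, here is a small Python sketch of a document as "text plus metadata". This is only an analogy, not KNIME's actual Document data type; the class, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class Document:
    """Illustrative container: the raw text plus the metadata around it."""
    text: str
    author: Optional[str] = None
    published: Optional[date] = None
    source: Optional[str] = None
    # Tagged information, stored here as term -> tag value (e.g. a named entity type).
    tags: Dict[str, str] = field(default_factory=dict)


# A plain string carries only the characters themselves.
email_text = "Meeting moved to Rome next week."

# A document wraps the same string with metadata (made-up sample values).
doc = Document(
    text=email_text,
    author="example.sender",
    published=date(2020, 10, 1),
    source="example-mail-archive",
    tags={"Rome": "LOCATION"},
)
print(doc.text == email_text, doc.tags)
```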
Is it possible to get the Twitter Cloud of Top 20 frequent words per hour?
Yes, this should be possible so long as you have an appropriate timestamp.
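Outside of KNIME, the underlying idea (truncate each timestamp to the hour, then count words per hour) can be sketched in a few lines of Python. The function name and the sample tweets are invented for this example, and the tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict
from datetime import datetime


def top_words_per_hour(records, n=20):
    """records: iterable of (timestamp, text) pairs.
    Returns {hour_start: [(word, count), ...]} with the n most frequent words."""
    counts = defaultdict(Counter)
    for ts, text in records:
        # Truncate the timestamp to the start of its hour to form the group key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[hour].update(text.lower().split())
    return {hour: c.most_common(n) for hour, c in counts.items()}


# Made-up sample data, for illustration only.
tweets = [
    (datetime(2020, 10, 1, 17, 5), "knime text mining"),
    (datetime(2020, 10, 1, 17, 40), "text mining webinar"),
    (datetime(2020, 10, 1, 18, 10), "knime webinar"),
]

result = top_words_per_hour(tweets, n=2)
for hour, words in sorted(result.items()):
    print(hour, words)
```

The per-hour word counts are what a tag cloud would then visualize; in practice you would also remove stop words before counting.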
Do the KNIME text processing tools support the Russian language?
We don’t have a native Russian text pack like we have for some other languages, but you can still use a simple or whitespace tokenizer to do some Russian text analysis. Here’s an example blog post that blends several languages, one of which is Russian: https://www.knime.com/blog/around_the_world_in_8_languages
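As a language-agnostic illustration of what whitespace tokenization gives you, here is a minimal Python sketch that splits Russian text on whitespace, strips surrounding punctuation, and counts terms. It is not the KNIME tokenizer; the function name and the sample sentence are assumptions made for this example.

```python
import string
from collections import Counter


def whitespace_tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    tokens = []
    for raw in text.split():
        # Also strip Russian-style quotation marks, which string.punctuation lacks.
        token = raw.strip(string.punctuation + "«»").lower()
        if token:
            tokens.append(token)
    return tokens


sample = "Привет, мир! Это простой тест, просто тест."
tokens = whitespace_tokenize(sample)
frequencies = Counter(tokens)
print(tokens)
print(frequencies.most_common(1))
```

Because no language-specific model is involved, there is no stemming or lemmatization here: "простой" and "просто" stay separate terms.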
Any KNIME-specific book recommendations for text analytics that are case-study-centric and use a relatively recent KNIME version?
Check out our From Words to Wisdom book, available here: https://www.knime.com/knimepress/from-words-to-wisdom
Can you tag a document set using a multilevel taxonomy?
You might check out this workflow that uses a specific ontology for pizzas: https://kni.me/w/XYhcwbBj7In9hb70
Is there a node to detect language?
There is an Amazon node that will do this - Amazon Comprehend (Dominant Language). But it requires an Amazon account and has a small charge.
In how many languages can you do Text Mining with KNIME?
KNIME has language packs for Arabic, Chinese, French, German, Spanish, Turkish, and English. We also support whitespace tokenization for general text analytics.
Is there a node to extract data from PDF?
Yes, you can use the Tika Parser or the PDF Parser.