Categorization of a big data sample

Hey,

I am currently writing my master's thesis and using KNIME for the first time. My task is to aggregate and group roughly 200,000 survey responses into a number of overarching topics. The topics are not defined beforehand but should be derived from the responses. Each response consists of keywords or at most one sentence. So far I have all the responses in one column and have reduced them by removing duplicate entries.

Now my problem: I still have a lot of similar words or responses that mean nearly the same thing (e.g. culture, cultural offerings, theatres, etc.), which I would like to group. I tried several things, such as the Topic Extractor, which did not give me reasonable topics. I also tried the dictionary nodes and added txt files so that KNIME could recognize synonyms, but that did not give reasonable results either.

Could someone please give me some hints on how to do this? My survey responses are in German, so I cannot use WordNet, which I thought would be a good fit. Is it possible to handle this problem within KNIME? What about the term co-occurrence, clustering, and naive Bayes predictor nodes? I thought about using them but did not manage to achieve what I want with them.

For my research it is not necessary to know how often terms were used, and I should not formulate topics beforehand. Is it possible to do this grouping without entering topics first? Do I have to actively integrate dictionary files and synonym lists into KNIME, or does KNIME have its own way to access dictionaries? Can I integrate a URL link to a dictionary website, or do I have to download a word list and upload it to KNIME? How can I set it up so that KNIME works with this word list and recognizes the synonyms and word groups in my responses? Is it necessary to build my own tables with synonyms/word groups so that KNIME learns them? In that case KNIME would not be that helpful, as my dataset is too large to build my own synonym tables. Lastly, I would like to ask how I can fix spelling errors in the responses so that they can be linked to the right categories/topics.

It would be a big help if you could answer my questions! Thank you in advance!

It would probably help to give an example of your actual data to get a better understanding of your concrete problem.

From what I've understood, you have short text snippets and want to group semantically similar snippets? How do you determine whether results are "reasonable"? Do you have a set of ground truth data which tells you for selected items whether they are actually similar or not? Or is it a completely unsupervised problem?

If you're looking for a semantic net approach, there are some German counterparts of the English WordNet, e.g. GermaNet or Wortschatz of the University of Leipzig.

Word vector models/embeddings could probably also help (look for word2vec, GloVe, … there are some pre-trained German models available afaik). In a nutshell, they can tell you that 'BMW' and 'Mercedes' are more related than 'Strawberry' and 'BMW'.

And last but not least: If you haven't tried yet, I would definitely start with a primitive token/n-gram-based syntactic similarity measure (Jaccard, Cosine, …) as a baseline.
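To make the baseline idea concrete, here is a minimal sketch in plain Python (not a KNIME workflow). It computes Jaccard and cosine similarity over character n-grams. The function names and the default n-gram length of 3 are my own choices for illustration, not anything prescribed by KNIME.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return the list of character n-grams of a lower-cased string."""
    text = text.lower().strip()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(a, b, n=3):
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    set_a, set_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    count_a, count_b = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(count_a[g] * count_b[g] for g in count_a.keys() & count_b.keys())
    norm_a = sqrt(sum(v * v for v in count_a.values()))
    norm_b = sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both functions return a value between 0 and 1, so you can compare any two snippets and experiment with different n-gram lengths.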

Note: My answers are not KNIME-specific. If you want to use the semantic nets or word vectors mentioned above, you'll most likely have to rely on external scripting/coding.

-- Philipp

Hey Philipp,

thank you for your fast response! My data looks as follows: I have short text snippets or single words within one column. I've added an extract of my data in the Excel sheet. The words are in German, but I added it to show what the text layout looks like. In my version the column contains about 200,000 text snippets or words. Every text snippet/word was given independently by one survey respondent, so they are not connected.

Within that column there are a lot of similar or related words, e.g. "Bevölkerungswachstum" = "demographic growth" and "Bevölkerungsentwicklung" = "demographic development". I need to group them under an overarching topic, which in this case might be demography. My problem is completely unsupervised. My task is to build the overarching topics from the given words with the help of text processing. I should not enter topics first and assign matching words to those topics; instead, the topics should be generated from the words themselves.

Do you think it is possible to solve that problem without programming/Java skills? I am afraid that I would need those skills to reach that goal. I am not able to use Java and hope that I can manage it with the help of the nodes.

How would I have to integrate Wortschatz, for example, so that KNIME recognizes typos in the words and corrects them? Is that possible at all? I had the impression that I need to build my own dictionary table with the typos from my input table in one column and the correct spelling in the other column so that KNIME replaces the typos. That would be an enormous effort. I thought it would be possible to include e.g. a text document with all German words in it, so that KNIME pulls the information out of that document and corrects the wrong words.

You said that I would need another program for the net approach and the word vectors. Which program would I have to use for that? Can I use the word vectors in KNIME, and if so, what do I have to do with them? And concerning your hint to start with syntactic similarity measures: How can I manage that in KNIME? Where can I find Jaccard and Cosine? Is it possible to compare each text snippet/word with every other one? I tried out NGram in KNIME, and there only the words within one row were compared to each other. But I would need to compare each word with all the other ones.

Sorry for the large number of questions, but I am really unsure how to manage it.

Thank you!!!

Hi Amelie,

attached is a very simple example using syntactic similarities on your example data. It uses a Dice (in effect similar to the Jaccard measure mentioned above) n-gram overlap and then builds a similarity matrix by matching the snippets pairwise. The last table shows you the most similar pairs. This still has lots of potential for optimization and tuning: try different n-gram lengths, or modify the preprocessing. I currently simply remove 'entwicklung' and 'wachstum' from the strings, as these parts seem very common in your dataset and obviously give no valuable information for matching the snippets.
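For readers without the attached workflow, the same idea can be sketched in plain Python. This is not the workflow itself, just a minimal approximation of the steps described above: strip the common parts, build character n-gram sets, score every pair with the Dice coefficient, and list the best matches. The function names, the n-gram length of 3, and the `top` parameter are assumptions made for illustration.

```python
from itertools import combinations

# Word parts stripped before matching, as described above.
COMMON_PARTS = ("entwicklung", "wachstum")


def preprocess(snippet):
    """Lower-case a snippet and strip the very common word parts."""
    text = snippet.lower().strip()
    for part in COMMON_PARTS:
        text = text.replace(part, "")
    return text


def ngram_set(text, n=3):
    """Set of character n-grams; short strings count as a single n-gram."""
    if len(text) <= n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) over n-gram sets."""
    set_a, set_b = ngram_set(preprocess(a), n), ngram_set(preprocess(b), n)
    if not set_a or not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def most_similar_pairs(snippets, n=3, top=10):
    """Score every pair of distinct snippets and return the best matches."""
    unique = sorted(set(snippets))
    scored = [(dice(a, b, n), a, b) for a, b in combinations(unique, 2)]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top]
```

Note that the pairwise comparison grows quadratically with the number of distinct snippets, so for a large dataset you would want to deduplicate first and probably compare only within smaller candidate groups.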

Without knowing your entire dataset I would still claim that this purely syntactic approach is a relatively strong baseline. You can certainly improve these results much further using standard KNIME functionality. It always depends on how far you want to push those things. (When doing e.g. Kaggle challenges I always find myself writing plenty of custom code after initial KNIME prototypes in order to achieve marginal improvements on the leaderboard ... but this is obviously not your focus here :) )

One idea for the "Wortschatz" approach: If you look up a word, e.g. "Bevölkerungswachstum", they give you a list with categories, called "Sachgebiet". For "Bevölkerungswachstum" e.g. they provide the terms "Bevölkerung, Sozialstruktur, Soziale Situation, Soziale Bewegung" -- probably you can use those terms to group your snippets? The "Wortschatz" site obviously provides a REST API which you could easily access using KNIME with the Palladian or REST nodes.

The word vector thing mentioned above would imo be especially interesting if KNIME supported using pre-existing models. But this is currently not yet the case. On the other hand, I tried measuring text similarities through word vectors in one of my projects recently and the results were not any better than a "stupid" but highly optimized n-gram matching ... this obviously always depends on the kind of text and domain.

What's still a bit unclear to me: How specialized vs. broad should the matching be? Would you only want to group syntactically very "close" snippets such as "Bevölkerungswachstum" and "Bevölkerungsentwicklung" together, or should this group also contain "Migration"? The latter would obviously require jumping onto the semantic train, but a clarification would be helpful to further understand your focus.

Still, I hope this gives you some new sparks!

-- Philipp

Thanks a lot Philipp! The syntactic similarities gave me a good impression of how to do it, and the approach fits my task.
Your idea with Wortschatz also sounds quite good. Unfortunately, I have problems integrating that web page into my workflow. At which point should it be integrated? Do I only have to enter the general URL in the GET Request node? And how could I get from the words in my input table to the "Sachgebiete"? Is it also possible to get rid of typos with the help of Wortschatz?

Yes, you are right: I do not only have to group syntactically "close" snippets but also have to look at it from a semantic perspective. The example you came up with fits perfectly. How could I deal with that problem?

After getting the syntactic/semantic similarities, how can I generate an output that groups the words with the smallest distance? Something like the tag cloud, but I think that the tag cloud works with frequencies, which I do not want. Is there something that could represent the similarities in a structured way (table/graphic)?

I have also attached a larger extract of my dataset so that you can get a better picture.

Thank you!!! It's really kind of you to help me with this :)

Hey Amelie,

I just had a closer look at the Wortschatz REST API. Unfortunately, they do not expose the "Sachgebiete" information via their API, so that idea is basically ruled out.

With regard to the "semantic" requirements, you will obviously need to exploit some semantic net. The above-mentioned GermaNet would probably make sense, but there is no support for accessing this data directly via KNIME. So you would basically need to obtain the dataset and then write your own code using their Java API (through a Java snippet node or a custom node implementation). Not sure whether this is within your focus.

An additional idea on how you might address the problem:

1) If you look up "Bevölkerungswachstum" on the German Wikipedia, you'll be redirected to "Bevölkerungsentwicklung", so these terms are very obviously synonyms.

2) When you scroll down on the page for "Bevölkerungsentwicklung", you'll find that the page has two categories: "Demografie" and "Sozialer Wandel"

3) When you look up "Migration" you'll be presented with a so-called "disambiguation page", as there are multiple meanings for the term. For the first match "Migration (Mensch)" you'll again get a page with the categories "Bevölkerungsgeographie", "Anthropologie", and "Demografie"

My feeling is that these Wikipedia categories might be good candidates for the labels you're trying to assign to those snippets. Obviously the three words above make for a quite idealized example, and I'm not sure whether the categories would be a good fit for your entire data.

To mine this information from Wikipedia you would have to use their REST API (with the REST or Palladian HttpRetriever nodes in KNIME). Accessing their API can unfortunately be a little frustrating, as the documentation is quite scattered. In the end, this will definitely become a moderately complex workflow for getting and extracting the required information.
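As a rough illustration outside of KNIME, the lookup could be sketched in Python against the MediaWiki action API of the German Wikipedia. The query follows redirects and asks for the non-hidden categories of the resolved page. The function names and the User-Agent string are made up for this sketch, and the parsing assumes the default JSON response layout (`query` → `pages` → `categories`), so please verify it against a real response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://de.wikipedia.org/w/api.php"


def build_query_url(term):
    """Build a query URL that follows redirects and lists page categories."""
    params = {
        "action": "query",
        "format": "json",
        "titles": term,
        "redirects": 1,        # follow redirects, e.g. from a synonym
        "prop": "categories",
        "clshow": "!hidden",   # skip hidden maintenance categories
        "cllimit": "max",
    }
    return API_URL + "?" + urlencode(params)


def parse_categories(response):
    """Extract (resolved title, category names) from a parsed JSON response."""
    pages = response.get("query", {}).get("pages", {})
    results = []
    for page in pages.values():
        # Category titles carry a namespace prefix such as "Kategorie:".
        names = [c["title"].split(":", 1)[-1] for c in page.get("categories", [])]
        results.append((page.get("title"), names))
    return results


def lookup(term):
    """Fetch and parse the categories for one term (needs network access)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(build_query_url(term),
                      headers={"User-Agent": "survey-grouping-sketch/0.1"})
    with urlopen(request, timeout=10) as reply:
        return parse_categories(json.load(reply))
```

Calling `lookup("Bevölkerungswachstum")` should then return the resolved page title together with its categories. Disambiguation pages such as "Migration" would still need separate handling.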

Concerning the spell checking: To my knowledge, there are no spell checking nodes for KNIME. However, there are some REST APIs available which support spell checking and correcting German text, e.g. this one -- but: looking at your data, there are some entries which seem quite "dirty" (e.g. "Berufsmünsteran  denken erfahrungsgemWohnmünsteran  Kosten Auto Arbeitsweg sparen Münster Namen Region"), probably caused by OCR? I'm not sure whether a spell checker will be of great help in this case and whether it's worth the effort at all.
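If you want to try the word list idea without any web service, a very rough sketch in Python could use `difflib.get_close_matches` from the standard library to replace each token with its closest entry in a list of correctly spelled words. The `vocabulary` argument stands for a German word list that you would have to supply yourself, and the cutoff of 0.8 is an arbitrary starting value, not a recommendation.

```python
from difflib import get_close_matches


def correct_token(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in vocabulary:
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_snippet(snippet, vocabulary, cutoff=0.8):
    """Correct every whitespace-separated token of a snippet."""
    return " ".join(correct_token(t, vocabulary, cutoff) for t in snippet.split())
```

The matching is case-sensitive and scans the whole word list for every token, so it is slow for large lists. It also will not repair entries where several words have been glued together, like the example above.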

Finally, if you're going for a small number of very broad categories (e.g. max. 10 high level labels such as "Demographie", "Verkehr", etc.), there might also be a totally different approach:

Address the labelling as a classification problem: hand-label a selected subset of your data and see how well a text classifier performs. You can do text classification with the Text Processing nodes, or using the TextClassifier nodes from the Palladian extension (the latter is built by me and colleagues and is quite simplistic, but has shown to be a quite strong baseline and does not require too much parameter tuning). An example for the Palladian text classifier is available here.
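To illustrate the classification idea in general terms, here is a minimal sketch of a multinomial naive Bayes classifier over character n-grams in plain Python. It is not the Palladian implementation and not a KNIME node, just a small stand-in for the "hand-label a subset, train, predict the rest" procedure. The class name, the n-gram length, and the add-one smoothing are assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from math import log


def char_ngrams(text, n=3):
    """Return the list of character n-grams of a lower-cased string."""
    text = text.lower().strip()
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.label_counts = Counter()             # training snippets per label
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocabulary = set()                   # all n-grams seen in training

    def fit(self, snippets, labels):
        for snippet, label in zip(snippets, labels):
            grams = char_ngrams(snippet, self.n)
            self.label_counts[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocabulary.update(grams)
        return self

    def predict(self, snippet):
        """Return the label with the highest log-probability (None if untrained)."""
        grams = char_ngrams(snippet, self.n)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            score = log(doc_count / total_docs)  # log prior of the label
            for gram in grams:
                score += log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

You would call `fit` with your hand-labelled snippets and their labels (e.g. "Demographie", "Verkehr") and then `predict` on the remaining snippets. Keep part of the labelled data aside to check how often the predictions are right.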

Best,
Philipp