Find tokens that discriminate between classes

I’m working on a text mining workflow in KNIME. The corpus is a set of documents that belong to various classes. The content of the documents within a class is very diverse, and each document is about half a page to a full A4 page of text. The workflow I use is based on the document classification example. Unfortunately the classification is not working well. I tried principal component analysis but this doesn’t work either. So I’m looking for an approach to find a set of keywords that discriminate between the classes. What approach can I use to find this set of keywords, and which nodes can be useful?

Hi Taita,

Unfortunately the Document Classification example is not the best example for classifying text documents. Have you tried the sentiment analysis example? The preprocessing is better than in the other example.

Cheers, Kilian

Hi Kilian,

I checked the sentiment analysis example using the link you sent me, but the preprocessing is exactly the same as in the classification example:

  • Punctuation erasure
  • Number filter
  • N chars filter
  • Stop word filter
  • Case converter
  • Snowball Stemmer

The file reading and document creation are different from mine, but that is because I use XML files instead of CSV files.

I see some differences in the feature extraction and vector creation part. Both the classification example and the sentiment analysis example in KNIME use the nodes 'Extract Table Dimension' and 'Java Edit Variable' to compute the minimum document frequency. I can imagine these nodes provide a filter that depends on the number of documents, but I already found out that with the small numbers I have now this doesn't work for me. So I bypassed it with a filter that has fixed lower and upper bounds, which works the same way as the one described in the example description you linked. Why is there a difference between the description and the workflow in KNIME? Can you clarify the Java algorithm and how it is used in the Row Filter?

The current word vectors (I used different versions for the upper and lower bound in the number filter) lead to bad classification. I think this is caused by noise from the large number of words that have little or no meaning. Besides, I think the current rough classes have to be divided into sub-classes. I'm thinking of using annotations or dictionaries to select the words that carry meaning. For annotations I see some support for biochemistry, but I need annotations for the IT sector. Are these available?

Hi Taita,

The trick happens in the second meta node: keep only those terms that occur in at least n% (e.g. n=2) of all documents. This is basically what the Java Edit Variable and the Row Filter nodes do. The first calculates the minimum occurrence count based on the number of documents. The second does the actual filtering.
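Outside KNIME, the same idea can be sketched in a few lines of Python. This is only an illustration, not the code inside the Java Edit Variable node: the function names are made up for this example, and rounding the threshold down is an assumption.

```python
import math
from collections import Counter


def min_document_count(num_documents, percent):
    """Minimum number of documents a term must occur in (rounded down; an assumption)."""
    return math.floor(num_documents * percent / 100)


def document_frequency(docs):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def filter_terms(docs, percent):
    """Keep only the terms that occur in at least `percent` % of all documents."""
    threshold = min_document_count(len(docs), percent)
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= threshold}


# Hypothetical toy corpus: each document is a list of preprocessed tokens.
docs = [
    ["server", "crash"],
    ["server", "login"],
    ["server", "crash", "disk"],
    ["printer"],
]
kept = filter_terms(docs, percent=50)
```

With four documents and a 50% threshold, a term has to appear in at least two documents to survive the filter.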

To count the occurrences of words per class of documents, you can use a TF node after creating a bag of words (BoW) and then use the Pivoting node to group by terms, with the document classes as pivots. As aggregation you can sum over the TF values. This could also help to see how the terms are distributed over the classes.
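As a rough sketch of what this pivot produces, here is the same grouping in plain Python. It assumes relative term frequency (term count divided by document length); the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def term_class_pivot(docs):
    """docs: list of (class_label, tokens) pairs.

    Returns {term: {class_label: summed relative TF}}, i.e. terms as rows
    and classes as pivot columns, aggregated by sum.
    """
    pivot = defaultdict(lambda: defaultdict(float))
    for label, tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        for term, count in counts.items():
            pivot[term][label] += count / total  # relative TF (an assumption)
    # Convert to plain dicts so missing cells stay missing.
    return {term: dict(per_class) for term, per_class in pivot.items()}


# Hypothetical toy corpus with two classes.
docs = [
    ("hardware", ["disk", "disk", "server", "crash"]),
    ("software", ["login", "server"]),
    ("hardware", ["disk", "printer"]),
]
pivot = term_class_pivot(docs)
```

Terms that never occur in a class simply have no entry for that class, which corresponds to the missing cells the Pivoting node leaves in the table.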

Cheers, Kilian

Hi Kilian,

I use 440 documents/rows and the Row Filter uses 4 as the lower bound, so apparently n=1%. What's the reason for this value, why not another n? Can I change n?

You suggest using the Pivoting node, but the Document Vector node (in the examples) does the same thing, doesn't it?

I repeat my question regarding annotations: are these available for the IT industry?

Kind regards, Taita

 

Hi Taita,

Of course you can change the value of n. It is an arbitrary threshold, so change it according to your needs. To change it, see e.g. the Java Edit Variable node or the Quick Form node before it.

Yes, you can create document vectors with the Document Vector node or with the Pivoting node. If you use the Pivoting node, make sure to use the Missing Value node afterwards to replace missing values with 0.

I recommended the Pivoting node for another reason. You can group by terms and use the values of the class column as pivots. This will result in a table showing the distribution of all terms over the classes. You can take this as a starting point to figure out which terms are discriminative among classes.
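One possible way to read such a table, sketched in Python: for each term, compute the share of its summed TF that falls into its strongest class, and sort by that share. This is only a simple heuristic chosen for illustration, not something KNIME provides as a node; the function name and the numbers are hypothetical.

```python
def rank_discriminative_terms(pivot, classes):
    """pivot: {term: {class_label: summed TF}}; classes: list of class labels.

    Returns (term, dominant_class, share) tuples, most concentrated first.
    """
    ranked = []
    for term, per_class in pivot.items():
        # Missing cells count as 0, like the Missing Value node would set them.
        values = [per_class.get(label, 0.0) for label in classes]
        total = sum(values)
        if total == 0:
            continue  # term does not occur in any of the given classes
        top = max(values)
        dominant = classes[values.index(top)]
        ranked.append((term, dominant, top / total))
    # Highest share first; ties are broken alphabetically by term.
    ranked.sort(key=lambda row: (-row[2], row[0]))
    return ranked


# Hypothetical term-by-class table.
pivot = {
    "server": {"hardware": 0.5, "software": 0.5},
    "disk": {"hardware": 0.4},
    "login": {"software": 0.3, "hardware": 0.1},
}
ranking = rank_discriminative_terms(pivot, ["hardware", "software"])
```

A share close to 1 means the term occurs almost only in one class. Very rare terms reach a high share easily, so it makes sense to apply the minimum document frequency filter first.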

There are no NER models provided for the IT domain. You need to bring your own dictionary.

Cheers, Kilian

 

Thanks for your reply, this is really helpful.