How to visualize the impact of dictionary taggers on document classification

Hi

I am using the NE dictionary tagger with the unmodifiable option for document classification. The "case sensitive" and "exact match" options are not checked.
I manually classified 50 documents with class A and 50 with class B.

I created four tests:
1. Class A documents without a dictionary / Class B documents without a dictionary
2. Class A documents tagged with class A dictionary / Class B documents without a dictionary
3. Class A documents without a dictionary / Class B documents tagged with class B dictionary
4. Class A documents tagged with class A dictionary / Class B documents tagged with class B dictionary

The tests show that the automatic classification accuracy improves the more dictionaries I use. But when I look at the tagged terms in the Term Grouper, I only see [] and [UNKNOWN].

I am looking for a way to make the impact of the dictionaries visible. I would expect that [] and [UNKNOWN] do not affect the results, but judging by the results they do ;-) So is there a way to investigate what is happening? I use the following algorithms: Decision Tree, SVM, KNN, Naive Bayes. I assume that they are responsible for the differences in the results, but I still don't see the value of the dictionaries with the tags mentioned.

thanks
Holger

Hi Holger,

That is an interesting use case / test you are doing. What the dictionary tagger nodes basically do is find terms in the documents that are contained in the dictionary and tag (mark) these terms. Additionally, these terms are flagged as unmodifiable by default. This means that (by default) preprocessing nodes such as filters and stemmers do not change or filter out these tagged (unmodifiable) terms, so the feature space will differ with and without tagging. The feature space basically consists of all distinct terms of all documents in your data set. Filtering or not filtering certain terms affects the feature space, and this change in the feature space in turn affects the classification.
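To illustrate the effect on the feature space, here is a minimal sketch in plain Python. It is a toy model and not KNIME code: the names DICTIONARY, STOP_WORDS and build_feature_space are made up for this example, and a simple stop word filter stands in for the preprocessing nodes.

```python
# Toy model of the behaviour described above; this is NOT the KNIME API.
# DICTIONARY, STOP_WORDS and build_feature_space are hypothetical names.

DICTIONARY = {"it", "who"}  # assumed dictionary entries (e.g. abbreviations)
STOP_WORDS = {"the", "it", "who", "is", "a"}


def build_feature_space(documents, use_dictionary):
    """Return the set of distinct terms that survive stop word filtering."""
    features = set()
    for doc in documents:
        for term in doc.lower().split():
            # A term found in the dictionary is tagged and, in this model,
            # treated as unmodifiable by the filter below.
            unmodifiable = use_dictionary and term in DICTIONARY
            if term in STOP_WORDS and not unmodifiable:
                continue  # ordinary stop words are filtered out
            features.add(term)
    return features


docs = ["The WHO is a health agency", "IT budget report"]
without_dict = build_feature_space(docs, use_dictionary=False)
with_dict = build_feature_space(docs, use_dictionary=True)

# Terms that exist in the feature space only because they were tagged.
print(sorted(with_dict - without_dict))
```

In this toy data, the terms that the filter would normally remove stay in the feature space once they are tagged, so the classifiers get different input columns even though no tag is visible later on.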

There is no other effect of the taggers on the classification, only the effect on the feature space that is created. However, this effect can have a strong impact on the classification.

I recommend not using the Term Grouper node. This node groups terms that have identical words but different tags and deletes the tags, so the tags you assigned beforehand can no longer be seen. Simply delete the node from your workflow and try again. In the bag of words you should then see the tagged terms and the assigned tags.

Attached is a workflow for document classification. Standard filtering is applied. The Term Grouper node is not used. In the bag of words you can see the terms and the assigned tags (POS tags in this case). Note that the POS tagger is the only tagger node that does not flag tagged terms as unmodifiable, since all terms are tagged.

To compare the impact of using dictionaries for tagging beforehand, you need to compare the feature sets that are created.
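Such a comparison could look roughly like the following Python sketch. It assumes you have the distinct terms of each run available as plain lists (for example exported from the bag of words); the function name compare_feature_sets and the example terms are invented for illustration.

```python
# Hypothetical helper for comparing two feature sets; not part of KNIME.

def compare_feature_sets(terms_without_dict, terms_with_dict):
    """Split the terms of two runs into run-specific and shared terms."""
    without, with_ = set(terms_without_dict), set(terms_with_dict)
    return {
        "only_without_dict": sorted(without - with_),
        "only_with_dict": sorted(with_ - without),
        "shared": sorted(without & with_),
    }


# Made-up example terms: the dictionary run keeps a multi-word entry intact,
# while the run without a dictionary splits and stems it.
run_without = ["neural", "network", "learn"]
run_with = ["neural networks", "learn"]

diff = compare_feature_sets(run_without, run_with)
for group, terms in diff.items():
    print(group, terms)
```

The terms that appear in only one of the two runs are the ones through which the dictionary can influence the classifiers.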

Cheers, Kilian