Functionalities of KNIME text processing


#1

Hello,
I am starting a new project and I am looking for the most suitable solution for it. I have started to build some small workflows with KNIME and the Text Processing extension.
Now I would like to know whether the following functionalities are possible with this solution:

  1. Associate and count more than two words: I used the “Co-occurrence Counter” node, which counts how many times two associated words appear together. Is there a node that can count associations of more than two words, such as “KNIME Text Processing”?

  2. Is there a node to work with synonyms?

  3. A dictionary node, to detect specified words and exclude specific ones?

  4. A node which recognizes date formats?

  5. Navigating in the document by clicking on a word in the result set

  6. Finally, is it possible to have an end-user interface? I mean, can it be built in such a way that non-IT people could use it to navigate in the document?

Thanks a lot !

Sabine


#2

Hello Sabine,

Counting associations of more than two words can be done with the NGram Creator node. It creates all possible n-grams for your given n and counts their occurrences at document and sentence level.
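
To make the idea of n-gram counting concrete, here is a minimal sketch in plain Python. It is not how the NGram Creator node is implemented; the function name `count_ngrams` and the whitespace tokenization are assumptions made only for illustration.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count word n-grams sentence by sentence (illustrative helper, not a KNIME API)."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase and split on whitespace.
        words = sentence.lower().split()
        # Slide a window of n words; n-grams never cross a sentence boundary.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

sentences = [
    "KNIME Text Processing is an extension",
    "I use KNIME Text Processing daily",
]
trigrams = count_ngrams(sentences, 3)
print(trigrams["knime text processing"])  # prints 2
```

With n = 3 the phrase “KNIME Text Processing” is counted as one unit, which is what the co-occurrence counter cannot do for more than two words.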

Currently, there is no node for working with synonyms. If you want to get synonyms for terms within your documents, for example from WordNet, you could try the Palladian nodes and query the words for which you want synonyms. There has already been a question about this in our previous forum.
If you want to mine similar words within your documents, you could have a look at this blog post.

To detect words coming from a dictionary, you could use the Dictionary Tagger node. It tags terms that have been found within your documents with a specific tag that you can set in the node dialog. Afterwards, you can create a bag of words to see which words are included in your documents.
To exclude words, you could use the Stop Word Filter node. It also uses a dictionary and removes all words contained in the dictionary from your documents. There are also some predefined stop word lists that you can use.
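
As a rough illustration of the two ideas (tagging terms found in a dictionary and dropping terms found in a stop word list), here is a small Python sketch. The helper names, the tag value `MY_TAG`, and the word lists are all hypothetical; the actual nodes work on KNIME documents, not on Python lists.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

def tag_terms(tokens, dictionary, tag="MY_TAG"):
    """Attach `tag` to tokens found in the dictionary, None to all others."""
    return [(tok, tag if tok.lower() in dictionary else None) for tok in tokens]

tokens = "The NGram Creator node counts the terms".split()
dictionary = {"ngram", "node"}   # words to detect (assumed example list)
stop_words = {"the"}             # words to exclude (assumed example list)

filtered = remove_stop_words(tokens, stop_words)
tagged = tag_terms(filtered, dictionary)
print(tagged)
```

Only the terms that carry a tag afterwards came from the dictionary, so filtering on the tag gives you the detected words.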

To recognize dates, one solution would be tagging. For example, you could use the OpenNLP NE Tagger or StanfordNLP NE Tagger node. Both nodes provide models that recognize dates (choose the ‘date’ model for the OpenNLP NE Tagger node or any 7-class model for the StanfordNLP NE Tagger node). These taggers try to identify dates, but the results may not be satisfactory, because they rarely detect date formats like 13.05.2018. They are more useful for month names and constructs like ‘May, 2018’.
Another option would be the Wildcard Tagger node. You can provide a data table containing regular expressions that specify the date formats you are looking for.
A combination of both solutions could also be worth a try.
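
To give an idea of what such regular expressions could look like, here is a short Python sketch. The three patterns are only assumed examples of numeric date formats, and KNIME uses Java regular expressions, so the syntax may need small adjustments before you put them into the data table.

```python
import re

# Assumed example patterns for purely numeric date formats.
DATE_PATTERNS = [
    r"\b\d{1,2}\.\d{1,2}\.\d{4}\b",  # e.g. 13.05.2018
    r"\b\d{4}-\d{2}-\d{2}\b",        # e.g. 2018-05-13
    r"\b\d{1,2}/\d{1,2}/\d{4}\b",    # e.g. 13/05/2018
]

def find_dates(text):
    """Return all substrings that match one of the date patterns, grouped by pattern."""
    matches = []
    for pattern in DATE_PATTERNS:
        matches.extend(m.group(0) for m in re.finditer(pattern, text))
    return matches

text = "The meeting on 13.05.2018 was moved to 2018-06-01."
print(find_dates(text))  # prints ['13.05.2018', '2018-06-01']
```

Patterns like these cover the numeric formats, while the NE tagger models handle written month names.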

For navigating within the documents, I’m not quite sure, but you could have a look at the Document Viewer node. Maybe that’s what you are looking for.

I hope this helps. If you have any questions about how to use the nodes I mentioned, or if you encounter any other problems, feel free to ask. :slight_smile:

Best,

Julian


#3

Hello Julian,

Thank you for all these answers; I will go through them one by one.
I hope you will be able to keep answering in this topic.

About the dictionary node:

> To detect words coming from a dictionary, you could use the Dictionary Tagger node. It tags terms that have been found within your documents with a specific tag that you can set in the node dialog. Afterwards, you can create a bag of words to see which words are included in your documents.
> To exclude words, you could use the Stop Word Filter node. It also uses a dictionary and removes all words contained in the dictionary from your documents. There are also some predefined stop word lists that you can use.

I am having trouble using it correctly.
I start with a PDF Parser, then apply all my filters (Number Filter, Punctuation Erasure, Case Converter, Snowball Stemmer).
Then I add my BoW Creator, followed by a Column Filter in which I keep only my term column.
Then comes the Dictionary Tagger, and I get this error when I try to configure the node: "The dialog cannot be opened for the following reasons: No column in spec compatible to “DocumentValue”".

Regards,
Sabine


#4

I am facing the same error message for point 4, the date format.
I wanted to use the “Wildcard Tagger” node as you advised, but I get the same error message: "The dialog cannot be opened for the following reason: No column in spec compatible to document value."
I put the BoW node just before it.


#5

Hey Sabine,

All tagging nodes have to be applied to a document column.
So first apply your filters, then tag the words you want with the Dictionary Tagger or Wildcard Tagger node, and only afterwards apply the Bag Of Words node. It creates a column with the terms occurring in the documents.
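
The order matters because tagging needs the whole document, while the bag of words only contains single terms. A small Python analogy of the two steps (the function names and the tag `DICT` are hypothetical and have nothing to do with the KNIME API):

```python
from collections import Counter

def tag_document(text, dictionary, tag="DICT"):
    """Step 1: tagging runs on the full document and yields (term, tag) pairs."""
    return [(w, tag if w.lower() in dictionary else None) for w in text.split()]

def bag_of_words(tagged_terms):
    """Step 2: collapse the tagged document into distinct terms with their counts."""
    return Counter(tagged_terms)

document = "dates and more dates"
bow = bag_of_words(tag_document(document, {"dates"}))
print(bow[("dates", "DICT")])  # prints 2
```

If the bag of words is created first, only a term column is left, and a tagger has no document to work on.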

Regards,

Julian


#6

Hi Julian

I am facing the same error message, “No column in spec compatible to document value”,
and I do not understand what I should do to solve this problem. Would you please help me?


#7

Hey @honarjooyan,

I’m sorry about the late response. Can you provide more information about your problem? Which nodes are you using and how did you order them?

Regards,

Julian


#8

Hi Julian,
Thank you for your response. I found the solution: using the "String to Doc." node solved the problem.

Thanks very much!