How to find misspellings and anglicisms to avoid redundancy

Hello everyone,

I have a list of words, mainly German, with anglicisms or different spellings. Here is an example:

  • music
  • musik
  • ■■■■
  • ■■■■■ 
  • pornographisch

I don't want to see all of those words in my tag cloud, only "musik" and "■■■■". So the resulting table should look like this:

music musik
musik musik
■■■■ ■■■■■
■■■■■ ■■■■■
pornographisch ■■■■■

Any ideas?

Thanks in advance :)

Hi, 

maybe the Replacer or the Dict Replacer nodes are helpful for you.
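
For readers outside KNIME, the idea behind a dictionary-based replacement can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Dict Replacer node is implemented; the names `replacements` and `normalize` are made up for this example, and the single mapping is taken from the table in the question.

```python
# Hypothetical dictionary: variant spelling -> preferred spelling.
# Only the pair visible in the question is used here.
replacements = {"music": "musik"}


def normalize(words, mapping):
    """Return a new list where every known variant is replaced.

    Words that are not in the mapping are passed through unchanged.
    """
    return [mapping.get(word, word) for word in words]


words = ["music", "musik", "pornographisch"]
normalized = normalize(words, replacements)
print(normalized)
```

The drawback is that the dictionary has to be maintained by hand for every new variant.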

Greetings

Jasmin

Hi Jasmin,

thanks for your response :)

I am trying to solve the problem without using external data, because I want to analyze different articles with very different content. Therefore it would be more efficient to let KNIME replace the words by itself.

Hi Ralph

you could solve this using a density based clustering.

To do so, use the String Distances node (I used the Levenshtein distance, with weight 0 on insertion) followed by the DBSCAN node, which gave me the following results:

music   Cluster_0
musik   Cluster_0
porn   Cluster_1
porno   Cluster_1
pornographisch   Cluster_1

After identifying the clusters you would need to map each one to a single name, e.g. by taking one of the values in the cluster.
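
Picking one name per cluster could look like the following Python sketch. The rule used here (take the most frequent member, ties broken alphabetically) is just one possible choice and an assumption of this example; the `tokens` list is made-up illustration data, while the cluster assignment is the one shown in the results above.

```python
from collections import Counter, defaultdict

# Cluster assignment as listed in the results.
cluster_of = {
    "music": "Cluster_0",
    "musik": "Cluster_0",
    "porn": "Cluster_1",
    "porno": "Cluster_1",
    "pornographisch": "Cluster_1",
}

# Hypothetical token stream, only used to get word frequencies.
tokens = ["musik", "musik", "music", "porn", "porn", "porno", "pornographisch"]
counts = Counter(tokens)

# Group the words by cluster.
members = defaultdict(list)
for word, cluster in cluster_of.items():
    members[cluster].append(word)

# Representative = most frequent member; sorting first makes ties
# resolve alphabetically because max() keeps the first maximum.
representative = {
    cluster: max(sorted(cluster_words), key=lambda w: counts[w])
    for cluster, cluster_words in members.items()
}

# Final lookup table: original word -> name to show in the tag cloud.
canonical = {word: representative[cluster] for word, cluster in cluster_of.items()}
for word, name in canonical.items():
    print(word, name)
```

Any other rule (shortest member, first occurrence, a manual pick) can be swapped in by changing the `key` function.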

Cheers, Iris

Hi Iris,

that sounds like the perfect solution I was searching for. The only problem I have is that I have no fucking clue how to build the proper workflow because I never worked with the DBSCAN. Is it possible to attach your workflow?

Thank you so much  :)

Hi Ralph, sure, the workflow is attached.

Hi Iris,

the workflow is almost perfect, but it is not 100% accurate. I attached the workflow with additional data so you can see the wrong results.

Any idea how to fix it?

Thanks :)

No, I am sorry, I played with the parameters but did not find a good solution.

Did you check out the following Blog Post? https://www.knime.org/blog/address-deduplication

Maybe this provides a solution.

Iris
