Zemberek TurkishSentenceNormalizer

Hi everyone,

Although KNIME supports a limited subset of the Zemberek NLP tool's features, I need to use much more of that library. I have tried several approaches, but I am still struggling.

I would like to apply TurkishSentenceNormalizer to each of my table rows:

First, I downloaded the Zemberek source jar file.

I added the files under the workflow where I would like to use it. Then I added a Java Snippet node, added the jar file in the Additional Libraries section, and wrote the code below:

// system imports
import org.knime.base.node.jsnippet.expression.AbstractJSnippet;
import org.knime.base.node.jsnippet.expression.Abort;
import org.knime.base.node.jsnippet.expression.Cell;
import org.knime.base.node.jsnippet.expression.ColumnException;
import org.knime.base.node.jsnippet.expression.TypeException;
import static org.knime.base.node.jsnippet.expression.Type.*;
import java.util.Date;
import java.util.Calendar;
import org.w3c.dom.Document;


// Your custom imports:
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.stream.Collectors;
import zemberek.core.collections.Histogram;
import zemberek.core.collections.UIntSet;
import zemberek.core.turkish.Turkish;
import zemberek.morphology.TurkishMorphology;
import zemberek.morphology.analysis.SentenceAnalysis;
import zemberek.morphology.analysis.SentenceWordAnalysis;
import zemberek.morphology.analysis.SingleAnalysis;
import zemberek.morphology.analysis.WordAnalysis;
import zemberek.morphology.lexicon.DictionaryItem;
import zemberek.morphology.lexicon.RootLexicon;
import zemberek.morphology.lexicon.tr.TurkishDictionaryLoader;
import zemberek.morphology.morphotactics.Morpheme;
import zemberek.morphology.morphotactics.TurkishMorphotactics;
import zemberek.normalization.TextCleaner;
import zemberek.normalization.TurkishSpellChecker;
// system variables
public class JSnippet extends AbstractJSnippet {
  // Fields for input columns
  /** Input column: "column1" */
  public String c_column1;

  // Fields for output columns
  /** Output column: "Checked" */
  public String out_Checked;

// Your custom variables:

// expression start
    public void snippet() throws TypeException, ColumnException, Abort {
// Enter your code here:

TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
TurkishSentenceNormalizer normalizer = new
    TurkishSentenceNormalizer(morphology, lookupRoot, lmFile);

//String[] words= s
//for (String word : words) {
//    System.out.println(word + " = " + spellChecker.suggestForWord(word));
//} 
out_Checked = normalizer.normalize(c_column1);


    Path lookupRoot = Paths.get("/home/aaa/zemberek-data/normalization");
    Path lmFile = Paths.get("/home/aaa/zemberek-data/lm/lm.2gram.slm");
    TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
    TurkishSentenceNormalizer normalizer = new;
    TurkishSentenceNormalizer(morphology, lookupRoot, lmFile);


    // expression end
    }
    }

I am having trouble with the instantiation step. How can I proceed?

Please get the knwf file from the cloud: TurkishNLP

Thank you for your help!

Edo

Can anyone help? Or maybe KNIME could extend its library with a Zemberek / spell-check node. It would be great to have further features for the Turkish language in text preprocessing.

Hey @asenkron,

I will have a look. However, it isn't possible to use Documents in the Java Snippet node. It will only use the String value of the Document, so tokenization will be lost, but you can still turn the Strings into Documents again afterwards using the Strings To Document node.

Cheers,
Julian


Hey,

there were some issues in your code in the Java Snippet.
I made some changes and it seems to work. I don't speak Turkish, though.
There are some comments in the code (see (1),(2) and (3)).

// Enter your code here:
// Note: this needs "import zemberek.normalization.TurkishSentenceNormalizer;" under "Your custom imports".
// (1) it seems that KNIME relative paths don't work, so I'd recommend using the absolute path
Path lookupRoot = Paths.get("PUT_ABSOLUTE_PATH_HERE/normalization"); // where the data is stored on your computer
Path lmFile = Paths.get("PUT_ABSOLUTE_PATH_HERE/lm/lm.2gram.slm"); // same as above

final TurkishMorphology morphology = TurkishMorphology.createWithDefaults(); // create the morphology object
TurkishSentenceNormalizer normalizer = null; // the normalizer object is created inside the try block
// (2) you need to catch the possible IOException.
// (3) I added a new output column (out_Successful) containing a boolean value showing whether the normalization was successful. If not, the input string is copied to the output.
try {
	normalizer = new TurkishSentenceNormalizer(morphology, lookupRoot, lmFile);
	out_Checked = normalizer.normalize(c_column1);
	out_Successful = true;
} catch (final Exception e) {
	// fall back to the unmodified input so the row is not lost
	out_Checked = c_column1;
	out_Successful = false;
}
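
One possible refinement, since the snippet body runs once per row: build the expensive object a single time and reuse it, keeping the same fallback as above. The sketch below is stand-alone Java and does not call Zemberek. CachedNormalizer, processRow, outChecked and outSuccessful are hypothetical names made up for illustration, and the factory is only a placeholder for the "new TurkishSentenceNormalizer(morphology, lookupRoot, lmFile)" call. It also assumes that fields declared under "Your custom variables" keep their value between rows, which I have not verified here.

```java
import java.util.concurrent.Callable;
import java.util.function.UnaryOperator;

// Hypothetical helper: creates the normalizer once, reuses it for every row,
// and copies the input to the output when anything goes wrong.
class CachedNormalizer {
    // Placeholder for the expensive constructor call (assumption, not Zemberek API).
    private final Callable<UnaryOperator<String>> factory;
    private UnaryOperator<String> normalizer; // stays null until the first successful build
    private boolean initFailed;               // remembers a failed build so it is not retried per row

    // Mirror the two output columns of the snippet above.
    String outChecked;
    boolean outSuccessful;

    CachedNormalizer(Callable<UnaryOperator<String>> factory) {
        this.factory = factory;
    }

    void processRow(String input) {
        try {
            if (normalizer == null) {
                if (initFailed) {
                    throw new IllegalStateException("normalizer could not be created");
                }
                normalizer = factory.call(); // runs only for the first row
            }
            outChecked = normalizer.apply(input);
            outSuccessful = true;
        } catch (Exception e) {
            initFailed = (normalizer == null);
            outChecked = input; // fall back to the unmodified input
            outSuccessful = false;
        }
    }

    public static void main(String[] args) {
        // Stand-in "normalizer" that only trims whitespace.
        CachedNormalizer rows = new CachedNormalizer(() -> s -> s.trim());
        rows.processRow("  merhaba  ");
        System.out.println(rows.outChecked + " / " + rows.outSuccessful);
    }
}
```

In the Java Snippet node, the field would go under "Your custom variables" and the body of processRow into the expression part. Whether this helps noticeably depends on how long the normalizer takes to load.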

Cheers,

Julian


Wow! It seems to work, amazing! Thank you for your help. Apparently speaking Java is more than enough, even without speaking Turkish :slight_smile:

Cheers,

Erdinc
