Database - non-English big text field pre-processing

Hi all!

I have a database with a field I want to analyze, with the aim of predicting relevance to another word (the target).

The target (dependent variable) would be an article title, and the features (independent variables) would be the terms in the body. That means I would need to separate sentences and words, but the language is Bulgarian, not English or Spanish, for which there are preprocessing nodes in KNIME.

Everything I find online either relates to English texts or needs a Document input, which is not my case.

What would you suggest to do?

Hi deicide_bg,

Which kind of nodes would you use? I think a number of nodes in the Text Processing category of KNIME Analytics Platform are language agnostic (e.g. the Snowball Stemmer node). Other nodes are English-specific. Of course, if you want to use a specific Bulgarian dictionary, that is not available in KNIME.


I am trying to find my way around the whole text mining topic and eventually get some (any) output.

The nodes don't have rich interfaces, nor do the help windows say much about the algorithms, and I didn't want to dig into the documentation of 20-30 nodes.

So far I managed to connect and execute these nodes after the database readers: 

Strings to Document,

Document Data Extractor,

Sentence Extractor,

Bag of Words. 

After that I think I am lost for now.
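For orientation, the steps those nodes perform (sentence splitting, then a bag of words) can be sketched outside KNIME. A minimal Python version using only the standard library — the regex-based splitting here is a rough assumption for illustration, not what the KNIME nodes do internally:

```python
import re
from collections import Counter

# A tiny Bulgarian sample body (hypothetical); \w+ is Unicode-aware in
# Python 3, so Cyrillic words tokenize the same way as Latin ones.
body = "Това е първото изречение. Това е второто изречение!"

# Sentence Extractor equivalent: naive split on sentence-ending punctuation
sentences = [s.strip() for s in re.split(r"[.!?]+", body) if s.strip()]

# Bag of Words equivalent: count lowercased word tokens
words = re.findall(r"\w+", body.lower())
bag = Counter(words)

print(sentences)
print(bag.most_common(3))
```

This only reproduces the data flow, not the node implementations, but it can help check what each intermediate table should look like.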

Hi Deicide,

Building a predictor for the Bulgarian language should be doable. The Snowball Stemmer node can stem Bulgarian. The Stop Word Filter node does not provide a built-in Bulgarian stop word list, but I think you can find one on the web.
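Once you have such a list, the filtering logic itself is trivial — the Stop Word Filter node just drops matching terms. A Python sketch with a tiny hand-picked Bulgarian stop-word list (illustrative only, far from complete):

```python
# A few common Bulgarian stop words -- an illustrative, incomplete list;
# a real workflow would load a full list found on the web.
STOP_WORDS = {"и", "в", "на", "с", "за", "не", "да", "е", "се", "от"}

def filter_stop_words(tokens):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# "Recipe for pancakes with milk and flour"
tokens = ["Рецепта", "за", "палачинки", "с", "мляко", "и", "брашно"]
print(filter_stop_words(tokens))  # ['Рецепта', 'палачинки', 'мляко', 'брашно']
```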

POS tagging is not necessarily needed for predictive modeling on textual data. To get an idea of how to build a predictive model for textual data, see e.g.

I hope this helps.

Cheers, Kilian


I am trying to build a relevancy pool of words (a semantic core). For example, if there is a recipe for pancakes, I should recognize milk and powder in most (or all) recipes.

So the target may consist of one or more words, but the predictors must be as many as the words in a short article. And I want to be able to exclude conjunctions and prepositions. So no sentiment analysis would be involved here.

I don't fully understand the use case yet. However, I have a few ideas about what I think I understand so far.

To identify and extract words that are shared across a subset of documents, you could use e.g. frequent item set mining. Use documents as transactions and terms as items. If you use the subset of documents that describe pancake recipes, you can extract the most common words as core words.
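The transaction/item mapping can be sketched in plain Python. Full frequent item set mining also finds multi-term sets, but for single core words a simple support count is enough — here with a hypothetical three-recipe mini-corpus:

```python
from collections import Counter

# Hypothetical mini-corpus of pancake recipes, already tokenizable text
recipes = [
    "mix milk eggs flour and baking powder",
    "whisk milk flour and a pinch of salt",
    "combine flour milk butter and sugar",
]

# Each document is one transaction; its distinct terms are the items
transactions = [set(doc.split()) for doc in recipes]

# Support of a term = number of documents (transactions) it occurs in
support = Counter(term for t in transactions for term in t)

min_support = 3  # keep only terms present in all three recipes
core_words = {term for term, count in support.items() if count >= min_support}
print(core_words)  # {'milk', 'flour', 'and'} -- 'and' shows why stop words matter
```

Note how the connective "and" survives with full support: stop-word filtering before the mining step keeps such words out of the semantic core.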

Does that help?

Cheers, Kilian

I'll try what you suggested and see how it goes, if I, in turn, have understood it correctly. ;)

I'm sorta trying to get this thing done: 

Target: a word or a short set of words (say a title or a product name)

Inputs / features: a long-text field (all the words in the body text; product description and storytelling)


Have the title as "Black smartphone"

Have the body as "A black smartphone is a mobile device allowing you to make phone calls, use low-resource computing functions and geo-positioning, and take pictures. Also, this device comes in black."

Then have the same or similar title, say "Dark smartphone"

and some other body as "A mobile device which is in dark colors and has functions, and brings you internet access where there is network coverage."


I want to be able to get the common words and phrases in the two body (input) texts that lead to "(some adjective) smartphone", and eventually enrich this input content.
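For single shared words, that step is just a set intersection over the two example bodies above (shared *phrases* would need n-grams on top of this); a minimal sketch:

```python
import re

body1 = ("A black smartphone is a mobile device allowing you to make phone "
         "calls, use low-resource computing functions and geo-positioning, "
         "and take pictures. Also, this device comes in black.")
body2 = ("A mobile device which is in dark colors and has functions, and "
         "brings you internet access where there is network coverage.")

def terms(text):
    """Distinct lowercased word tokens (hyphenated words kept whole)."""
    return set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))

shared = terms(body1) & terms(body2)
print(sorted(shared))
# ['a', 'and', 'device', 'functions', 'in', 'is', 'mobile', 'you']
```

Content words like "mobile", "device" and "functions" come out alongside stop words such as "a", "and" and "is" — again an argument for filtering stop words before building the relevancy pool.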

I still don't quite get the idea of the different storage types in KNIME, but I'll get there eventually. I hope this example makes things clearer; I think we're discussing almost the same thing.
