Question about handling syllables

Hi,

How can I display the single syllables generated by the Hyphenator Node as terms?

Which nodes are essential?

I tried the following procedures:

1) Hyphenator, followed by: Column Filter => BoW Creator

I filter out the term column so that only the "Document" column (which is generated as a "bag of words" by entering a blank in the "Separator" field of the Hyphenator node) is passed to the BoW Creator. Unfortunately, the BoW Creator doesn't split the generated syllables into terms, even though the input is a document column. Maybe the BoW Creator treats the blank as an ordinary character. When I load documents into the BoW Creator directly from the Parser nodes at the start of the workflow, the BoW Creator works correctly.

2) Hyphenator, followed by: Term to String => Column Filter => Strings to Document => Column Filter => BoW Creator


I convert the hyphenated terms into strings and then convert these strings back into documents, to avoid the errors raised when executing the BoW Creator. (I didn't find any node that converts terms into documents directly.)

This works (after entering a blank in the "Separator" field of the Hyphenator node beforehand), and the syllables are listed separately as terms. However, the document category of each document/term seems to be lost, and all terms get the document class "undefined". (Carrying over the "Orig. Document" column doesn't work, because the BoW Creator accepts only one document column.)

Previously I thought that the Keyword extractor might automatically recognize the generated syllables as individual keywords. I no longer think so, because the keywords only appear hyphenated and the original term/keyword doesn't change. For example, terms (i.e. syllables of terms) used in a tree classification model will not be recognized when I later want to classify new documents that contain the same syllables.

Could you tell me how to handle this situation, and especially which nodes are needed to process the generated syllables further as terms?

Many thanks!

Regards,

Werner

Hi Werner,

the Hyphenator node simply inserts the specified character ("-" by default) into the terms at the positions where hyphenation is valid. The syllables are not converted into or represented as terms. The whole term is still one term.
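
To illustrate the point in plain Python (a conceptual sketch only, not KNIME's API; the `SYLLABLES` table and the functions `hyphenate` and `split_syllables` are hypothetical stand-ins for the node's hyphenation patterns):

```python
from typing import List

# Hypothetical syllable table standing in for real hyphenation patterns.
SYLLABLES = {
    "classification": ["clas", "si", "fi", "ca", "tion"],
    "document": ["doc", "u", "ment"],
}


def hyphenate(term: str, separator: str = "-") -> str:
    """Insert the separator between syllables; the result is still ONE term."""
    return separator.join(SYLLABLES.get(term, [term]))


def split_syllables(hyphenated: str, separator: str = "-") -> List[str]:
    """Explicit extra step: turn one hyphenated term into separate syllable terms."""
    return [s for s in hyphenated.split(separator) if s]


term = hyphenate("document")            # one string containing separators
syllable_terms = split_syllables(term)  # one list entry per syllable
```

Hyphenation alone only changes the spelling of the term. Getting one term per syllable requires a separate splitting step, which is what re-tokenizing the text in a new document does in the second approach.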

To be honest, I do not exactly understand what you want to do and why you would like to convert the syllables into terms.

In order to hyphenate the terms (of the original documents) and create a single term for each syllable, the second approach you described should work (BoW->Hyphenator->Term to String->Column Filter->Strings to Document->Column Filter->BoW). I see your point about the categories. Unfortunately, the Strings to Document node does not let you select a column whose values are used as categories. This would clearly be a "nice to have" option and is now on my feature request list :-).

What you can do to keep your categories is to use the Document Data Extractor instead of the Term to String node:

BoW->Hyphenator->Document Data Extractor->Column Filter->Strings to Document->Column Filter->BoW

The Document Data Extractor extracts the texts and the categories (if specified). Afterwards, you filter by category and handle each subset of documents (those having the same category) separately. To create documents again, you use the Strings to Document node and enter the category manually in the dialog. So it is possible to do what I guess you want to do, but it is a bit tricky.
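
The data flow can be sketched in plain Python as follows. This is only an illustration, not KNIME code: the row layout `(text, category)` and the function name `syllable_bow` are assumptions made for the example, and the texts are assumed to be already hyphenated with a blank as separator.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def syllable_bow(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (syllable, category) pairs from (hyphenated_text, category) rows."""
    # Step 1: group the extracted texts by category
    # (the "filter by category" step, one subset per category).
    by_category: Dict[str, List[str]] = defaultdict(list)
    for text, category in rows:
        by_category[category].append(text)

    # Step 2: rebuild a bag of words per subset. The category is assigned
    # once for the whole subset, mirroring the manual entry in the dialog.
    bow: List[Tuple[str, str]] = []
    for category, texts in by_category.items():
        for text in texts:
            for syllable in text.split():
                bow.append((syllable, category))
    return bow


# Made-up example rows: text with blank-separated syllables, plus category.
rows = [("doc u ment", "sports"), ("clas si fy", "politics")]
pairs = syllable_bow(rows)
```

Each syllable ends up as its own term while keeping the category of the document it came from, which is what the per-category branches of the workflow achieve.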

Cheers,

Kilian
