From Words to Wisdom material - chapter 4 example 2

I have some discussion notes about the workflow in the From Words to Wisdom material - chapter 4, example 2, especially regarding the later topic detection scenario (decision tree/clustering).

1 - What is recommended: create a BoW and then partition it, or partition the corpus first and then create a BoW for each partition?

2 - Why not create the document vector model/representation over the whole BoW and then partition it for another model later? Is it just to clarify the concept of partitioning, training set, and test set?

3 - Why not create a document vector model/representation directly over the test set instead of depending on a model, even though the data rule is known?

4 - Is the Document Vector Applier node required during the topic detection workflow?

5 - Why use the relative frequency rather than the absolute frequency?

Dear Ahmed,

Please find my answers in line:

1 - What is recommended: create a BoW and then partition it, or partition the corpus first and then create a BoW for each partition?
If you first create the BoW and then partition, you might end up with some occurrences of words belonging to a document in the training set and others in the test set. The recommendation is to first partition the corpus and then create the BoW for each partition, as in the sketch below.
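As a rough, non-KNIME analogue, here is a minimal Python/scikit-learn sketch of that order of operations (the documents and names are illustrative assumptions, not part of the book's workflow):

```python
# Illustrative sketch: partition the corpus first, then build the BoW
# from the training partition only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = [
    "topic detection with decision trees",
    "clustering documents by topic",
    "bag of words and document vectors",
    "pre-processing a text corpus",
]
train_docs, test_docs = train_test_split(docs, test_size=0.25, random_state=42)

vectorizer = CountVectorizer()
bow_train = vectorizer.fit_transform(train_docs)  # vocabulary comes from the training partition only
print(vectorizer.get_feature_names_out())
```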

2 - Why not create the document vector model/representation over the whole BoW and then partition it for another model later? Is it just to clarify the concept of partitioning, training set, and test set?
How would you partition the vector space? In general I would say that the workflow illustrates the concept of partitioning, training set, and test set. However, it depends on the use case you want to solve. For instance, the workflow available here shows the Document Vector Adapter node used to adjust the feature space of a second set of documents so that it becomes identical to the feature space of a first, reference set of documents.

3 - Why not create a document vector model/representation directly over the test set instead of depending on a model, even though the data rule is known?
4 - Is the Document Vector Applier node required during the topic detection workflow?
Could you please clarify these two questions and give some more details?

5 - Why use the relative frequency rather than the absolute frequency?
Term occurrences in long texts cannot have the same weight as in short texts. In this case, we need to normalize the absolute term frequency by the total number of words in the text; this gives the relative frequency. Relative frequency works well for text corpora with variable text lengths, which is why we used it in the example workflow. The small example below illustrates the difference.
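A minimal sketch in plain Python (the word counts are made-up assumptions, just to show the effect of normalizing by document length):

```python
# "topic" occurs more often in the long text in absolute terms, but the
# short text actually emphasises it more once counts are normalized by length.
short_text = "topic detection topic".split()                      # 3 words, "topic" x2
long_text = ("topic " + "filler " * 97 + "topic topic").split()   # 100 words, "topic" x3

for name, words in [("short", short_text), ("long", long_text)]:
    abs_freq = words.count("topic")
    rel_freq = abs_freq / len(words)
    print(f"{name}: absolute = {abs_freq}, relative = {rel_freq:.3f}")
```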

Hope that helps.
Best,
Vincenzo


3 - For the partitioned test set, why not create a document vector for it directly? In other words, why depend on the document vector model generated from the training set? The first scenario would guarantee that all words are vectorized; the second scenario may miss words that don't exist in the training set.

4 - Is the Document Vector Applier node required during the topic detection workflow?

Hi @ahmed_gomaa,

3 - That is because we want to evaluate how the document vector model performs on unseen documents. Words that appear only in the test set are intentionally left out of the feature space, just as they would be for new documents at deployment time (see the sketch below).
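As an illustrative, non-KNIME analogue (scikit-learn; the documents and names are assumptions, not the node's actual implementation), applying a vector space fitted on the training set to an unseen document simply drops the words that never occurred in training:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["word vectors built from the training corpus"]
vectorizer = CountVectorizer().fit(train_docs)

test_vec = vectorizer.transform(["the training corpus plus an unseen neologism"])
print(vectorizer.get_feature_names_out())  # columns come from the training set only
print(test_vec.toarray())                  # "unseen" and "neologism" get no column at all
```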

4 - If you want to apply an algorithm to extract topics, such as the Topic Extractor (Parallel LDA) node, it is not required to use the document vector nodes beforehand. In this case you just need the corpus of pre-processed documents, as sketched below.
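For illustration only, here is a minimal LDA sketch in Python using gensim as a stand-in for the KNIME node (the tokenised documents and parameters are assumptions): topics are extracted straight from the pre-processed, tokenised documents, with no document vectors built first.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Pre-processed (tokenised, filtered) documents are the only input needed.
tokenised_docs = [
    ["topic", "detection", "clustering"],
    ["decision", "tree", "topic"],
    ["document", "clustering", "tree"],
]
dictionary = Dictionary(tokenised_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]  # (term id, count) pairs

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.print_topics())
```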

Best,
Vincenzo
