Please find my answers inline:
1 - What is recommended: create a BoW and then partition it, or partition the corpus first and then create a BoW for each part?
If you first create the BoW and then partition, you might end up with some occurrences of words from one document in the training set and others in the test set. The recommendation is to first partition the corpus and then create a BoW for each partition.
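As a minimal sketch of this order of operations (plain Python rather than KNIME nodes, with a small made-up corpus), partitioning comes first and the vocabulary is then built from the training documents only, so nothing from the test set leaks into the feature space:

```python
import random

# Toy labeled corpus (hypothetical data for illustration only).
corpus = [
    ("the cat sat on the mat", "animals"),
    ("dogs chase cats in the yard", "animals"),
    ("stocks fell sharply on monday", "finance"),
    ("the market rallied after the news", "finance"),
]

# 1. Partition the corpus FIRST.
random.seed(42)
shuffled = corpus[:]
random.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train_docs, test_docs = shuffled[:split], shuffled[split:]

# 2. THEN build the BoW vocabulary from the training set only.
vocabulary = sorted({word for text, _ in train_docs for word in text.split()})

def to_bow(text, vocab):
    """Absolute term frequencies over a fixed vocabulary.
    Words unseen during training are simply ignored."""
    counts = {w: 0 for w in vocab}
    for word in text.split():
        if word in counts:
            counts[word] += 1
    return [counts[w] for w in vocab]

train_vectors = [to_bow(text, vocabulary) for text, _ in train_docs]
test_vectors = [to_bow(text, vocabulary) for text, _ in test_docs]

print(len(vocabulary), len(train_vectors), len(test_vectors))
```

If you reversed the two steps, the vocabulary (and any document-frequency statistics) would be influenced by the test documents, which is exactly the leakage described above.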
2 - Why not create the document vector model/representation over the whole BoW and then partition it for another model later? Is it just to illustrate the concepts of partitioning, training set, and test set?
How would you partition the vector space? In general, I would say that the workflow illustrates the concepts of partitioning, training set, and test set. However, it depends on the use case you want to solve. For instance, the workflow available here shows the Document Vector Adapter node adjusting the feature space of a second set of documents so that it becomes identical to the feature space of a first, reference set of documents.
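Conceptually, adapting one document set to the feature space of another amounts to re-indexing each document's term counts against a fixed reference vocabulary. The sketch below is an assumption about what that adaptation looks like in plain Python, not the node's actual implementation:

```python
def adapt_feature_space(doc_counts, reference_vocab):
    """Re-express a document's term counts in a reference feature space:
    terms absent from the document get 0, and terms unknown to the
    reference vocabulary are dropped. This mirrors, conceptually, the
    effect of KNIME's Document Vector Adapter node."""
    return [doc_counts.get(term, 0) for term in reference_vocab]

# Reference feature space learned from a first set of documents.
reference_vocab = ["cat", "dog", "market", "news"]

# A document from a second set with a partially different vocabulary.
second_doc_counts = {"dog": 2, "news": 1, "weather": 3}

print(adapt_feature_space(second_doc_counts, reference_vocab))
# -> [0, 2, 0, 1]: "weather" is dropped, "cat" and "market" become 0
```

The key point is that the second set never changes the reference vocabulary; it is only projected onto it, so both sets end up with identical feature spaces.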
3 - Why not create a document vector model/representation directly over the test set, instead of depending on a model, given that the data rule is known?
4 - Is the Document Vector Applier node required during the detection journey?
Could you please clarify these two questions and give more details?
5 - Why use the relative frequency and not the absolute frequency?
Term occurrences in long texts cannot carry the same weight as in short texts. In this case, we need to normalize the absolute term frequency by the total number of words in the text; this gives the relative frequency. Relative frequency works well for text corpora with variable text lengths, which is why we used it in the example workflow.
Hope that helps.