Pre-processing?

Hello,

This may be a layman's query; I am new to data analytics.

In the snapshot below, after creating the Bag of Words, filtering the terms, and applying the Reference Row Filter, how does KNIME compute the TFs? Does the underlying document's term structure also get filtered, so that TF is computed on the altered document? I cannot find any option in the configuration dialogs between BoW and Reference Row Filter that asks whether to append or replace the preprocessed document (as is available in the preprocessing nodes).
image

Thank you and sorry for the trouble!

Regards.

Hi @LakshmiK and welcome to the KNIME forum,

If you want to filter terms inside documents, you can use the Dictionary Filter node. The Reference Row Filter node has no effect on documents.
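To make the distinction concrete, here is a rough Python sketch (not KNIME code). It assumes a document can be modelled as a plain list of terms and a Bag of Words table as a list of (term, document) rows; `filter_rows` and `filter_document` are hypothetical helpers that only imitate what a row filter and a dictionary-style filter do conceptually.

```python
# Hypothetical stand-in for a KNIME document: a plain list of terms.
document = ["data", "mining", "with", "knime", "data", "analytics"]

# Hypothetical dictionary of terms to remove.
dictionary = {"with", "analytics"}

# Bag of Words stand-in: one (term, document) row per distinct term.
rows = [(term, document) for term in dict.fromkeys(document)]

def filter_rows(rows, dictionary):
    """Row-filter analogue: drops table rows, leaves each document untouched."""
    return [(term, doc) for term, doc in rows if term not in dictionary]

def filter_document(doc, dictionary):
    """Dictionary-filter analogue: removes the terms from the document itself."""
    return [term for term in doc if term not in dictionary]

rows_after = filter_rows(rows, dictionary)
doc_after = filter_document(document, dictionary)

print(len(rows), len(rows_after))      # number of table rows before and after
print(len(document), len(doc_after))   # number of terms in the document before and after
```

In this sketch the row filter shrinks the table while every remaining row still carries the full original document; only the dictionary-style filter changes the document content.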

:blush:

Thank you so very much for the reply. My actual question is this: to compute TF, only the document column needs to be configured in the node (if I am not wrong). If the underlying document is not being modified, shouldn't the TF node return the same results whether it is executed after the 'Bag of Words Creator' node or after the 'Reference Row Filter' node in this part of the workflow (irrespective of whether I have done some filtering on the Terms column)?

I am not sure whether my query is valid or not. Please help. Thank you.

image
image

Regards,
Lakshmi

Yes, it should. Are you sure that both TF nodes are using the same document column? It seems you have two document columns.

:blush:

Hello,

Thank you again for replying and being very patient!
They are using the same document. This is the workflow I was referring to (https://hub.knime.com/knime/spaces/Examples/latest/08_Other_Analytics_Types/01_Text_Processing/22_Hierarchical_Clustering_Visualization), in the Preprocessing II component (I added a TF node after BOW to check).

I was also building a similar workflow and found similar results. Why would that be, or am I looking at it wrongly?

Thank you again,
Lakshmi

The original TF node is using the “Preprocessed Document” column but the one you have added (Node 111 in the screenshots) is using the “Document” column.

:blush:

Thank you again. I had tried checking with both document columns (which is when the screenshot was taken), but it's still showing a disparity between the two. (Another strange thing is that both "Document" and "Preprocessed Document" give the same results.) Could you try checking, please? Thank you for all the time you are giving this issue.

Regards,
Lakshmi

I just checked the workflow. Using the same document column, the TF node returns exactly the same values for the same terms of the same document, whether it is placed after the Reference Row Filter or right after the Bag Of Words.

The only difference, obviously, is the filtered terms. So the second TF output has fewer rows, since some were filtered out by the Reference Row Filter node.
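A small Python sketch of this behaviour (an illustration under assumptions, not the KNIME implementation): TF is computed per (term, document) row by counting the term inside that row's document, so dropping rows changes which terms appear in the output but not their values. The list-of-terms document model, the `bag_of_words` and `tf` helpers, and the relative-frequency definition (count divided by document length) are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stand-in for a KNIME document: a plain list of terms.
document = ["data", "mining", "with", "knime", "data", "analytics"]

def bag_of_words(doc):
    """One (term, document) row per distinct term, like a BoW table."""
    return [(term, doc) for term in dict.fromkeys(doc)]

def tf(rows, relative=True):
    """Compute TF for each row by counting the term in the row's own document."""
    result = {}
    for term, doc in rows:
        count = Counter(doc)[term]
        result[term] = count / len(doc) if relative else count
    return result

bow = bag_of_words(document)

# Row-filter analogue: keep only rows whose term is in a reference set.
keep = {"data", "knime"}
filtered = [(term, doc) for term, doc in bow if term in keep]

tf_all = tf(bow)            # TF right after the Bag of Words
tf_filtered = tf(filtered)  # TF after the row filter

# Same values for the surviving terms, just fewer rows.
for term in tf_filtered:
    print(term, tf_all[term], tf_filtered[term])
```

The filtered output contains fewer terms, but each surviving term has the same TF in both results because the document each row points to was never altered.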

:blush:

Thank you.

So filtering the Terms column affects the output of the TF node, even though the node takes only the underlying document as input? Ok…

Sorry about all the convoluted queries. Really appreciate your time on this.
Thank you!

No problem at all. Feel free to ask questions.

:blush: