Can anyone help me understand the configuration of the Chi-Square Keyword Extractor node? I do not understand what I am setting when I specify these two parameters: the pointwise mutual information threshold and the normalized L1 norm threshold.
Thanks in advance.
The following are the steps required to configure the Chi-Square Keyword Extractor node:
1- Set the number of keywords to extract;
2- Set the percentage of unique terms in the document to use for the chi-square measures. By default, the top 30% of the document's terms are selected;
3- Clustering frequent terms. Terms are clustered using two measures: distribution similarity and mutual information:
a) For distribution similarity, the Chi-Square Keyword Extractor node computes the L1 norm, normalized to [0,1]. All terms whose normalized L1 norm score is greater than or equal to the set threshold are considered similar.
b) For mutual information, the pointwise mutual information is used to cluster terms that co-occur frequently. Terms whose pointwise mutual information is greater than or equal to the set threshold are considered similar, and are thus clustered together.
4- Calculation of the expected probability. Here the algorithm counts the number of co-occurring terms and computes the expected probability.
5- Computation of the χ² value.
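To make the two thresholds from step 3 more concrete, here is a minimal Python sketch of both measures on a toy document. This is not the node's actual code: the toy sentences, the function names and the two threshold values are made up for illustration. It also assumes that the L1 score is 1 minus half the L1 distance between the co-occurrence distributions of two terms, so that a higher score means "more similar", which fits the "greater than or equal to" rule above.

```python
import math
from collections import Counter
from itertools import combinations

# Toy document (hypothetical): each inner list is one tokenized sentence.
sentences = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "food"],
    ["cat", "pet", "food"],
    ["dog", "pet", "vet"],
]


def cooccurrence_counts(sentences):
    """Count the sentences containing each term and each unordered term pair."""
    term_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        unique = sorted(set(sentence))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return term_counts, pair_counts


def pmi(a, b, term_counts, pair_counts, n_sentences):
    """Pointwise mutual information: log(P(a, b) / (P(a) * P(b)))."""
    joint = pair_counts[tuple(sorted((a, b)))]
    if joint == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log(joint * n_sentences / (term_counts[a] * term_counts[b]))


def l1_similarity(a, b, pair_counts, vocabulary):
    """Assumed score: 1 - (L1 distance / 2), which always lies in [0, 1]."""
    others = [t for t in vocabulary if t not in (a, b)]

    def distribution(term):
        counts = [pair_counts[tuple(sorted((term, o)))] for o in others]
        total = sum(counts)
        return [c / total for c in counts] if total else [0.0] * len(others)

    pa, pb = distribution(a), distribution(b)
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))


term_counts, pair_counts = cooccurrence_counts(sentences)
vocabulary = sorted(term_counts)

# Hypothetical thresholds, standing in for the two node settings.
pmi_threshold = 0.2
l1_threshold = 0.95

for a, b in combinations(vocabulary, 2):
    by_pmi = pmi(a, b, term_counts, pair_counts, len(sentences)) >= pmi_threshold
    by_l1 = l1_similarity(a, b, pair_counts, vocabulary) >= l1_threshold
    if by_pmi or by_l1:
        print(f"{a} / {b}: clustered (pmi={by_pmi}, l1={by_l1})")
```

So raising either threshold makes the node stricter about which terms it treats as similar, and you end up with more, smaller clusters.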
More details are available in the paper “Keyword extraction from a single document using word co-occurrence statistical information” by Y. Matsuo and M. Ishizuka.
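For steps 4 and 5, the χ² value of a term w is, as far as I understand the paper, the sum over the frequent terms g of (freq(w, g) - n_w * p_g)² / (n_w * p_g). Here is a simplified Python sketch of that formula, not the node's implementation. The toy sentences and the list of frequent terms are hypothetical, each frequent term is treated as its own cluster, and n_w and p_g are computed from sentence lengths, which is my reading of the paper.

```python
# Toy document (hypothetical): each inner list is one tokenized sentence.
sentences = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "food"],
    ["cat", "pet", "food"],
    ["dog", "pet", "vet"],
]

# Hypothetical set of frequent terms, each treated as its own cluster.
frequent_terms = ["cat", "dog"]


def chi_square(term, sentences, frequent_terms):
    """Sum over g of (observed - expected)^2 / expected for one term."""
    total_terms = sum(len(s) for s in sentences)
    # n_w: total number of terms in the sentences where `term` appears.
    n_w = sum(len(s) for s in sentences if term in s)
    score = 0.0
    for g in frequent_terms:
        if g == term:
            continue
        # p_g: expected probability, i.e. the share of all terms that
        # sit in sentences where g appears.
        p_g = sum(len(s) for s in sentences if g in s) / total_terms
        # freq(w, g): number of sentences in which both terms occur.
        observed = sum(1 for s in sentences if term in s and g in s)
        expected = n_w * p_g
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score


# Rank all terms by their chi-square value, highest first.
vocabulary = sorted({t for s in sentences for t in s})
ranking = sorted(
    vocabulary,
    key=lambda t: chi_square(t, sentences, frequent_terms),
    reverse=True,
)
print(ranking)
```

Terms whose co-occurrence with the frequent terms deviates most from what is expected get the highest χ² value, and the top ones are returned as keywords.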
Hope that helps,
Sorry for the late response … you helped me a lot … Thanks.