Strange chars in the bag of words

Dear Community,

Hello from France !

I extract a document from a URL, then apply a case converter, a stop word filter, a punctuation filter, and BOW.

Now I see a strange row in the bag of words: []

The problem is that this is the term with the most occurrences and the highest TF value, so all the data are wrong.

What is this [], and how can I replace or delete it?

(Sorry for my english...)

David


Hi David,

The "[]" at the end of a term indicates that no tags have been assigned. If a term, say "house", had the POS tag NN (for noun, singular) assigned, the term cell would look like "house [NN (POS)]". If the term has no tags assigned, the brackets are empty and the term cell looks like "house []". These brackets are part of the term information but not part of the term text. To see the actual word in a term you can use the Term to String node, which extracts the word(s) as strings and removes the assigned tags.
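To make the cell format concrete, here is a minimal Python sketch that splits the displayed text of a term cell into its word part and its tag part. It only models the display string described above; `split_term_cell` is a hypothetical helper, not a KNIME API, and inside KNIME the Term to String node does this job.

```python
import re

# Matches "<word(s)>[<tags>]" as displayed in a term cell, e.g. "house [NN (POS)]".
# Assumption: this pattern describes the display string only, not KNIME's internal model.
_TERM_CELL = re.compile(r"^(?P<word>.*?)\s*\[(?P<tags>[^\[\]]*)\]$")


def split_term_cell(cell: str) -> tuple[str, str]:
    """Return (word, tags) for the display string of a term cell."""
    match = _TERM_CELL.match(cell.strip())
    if match is None:
        # No trailing brackets: treat the whole string as the word.
        return cell.strip(), ""
    return match.group("word"), match.group("tags")


if __name__ == "__main__":
    for cell in ["house [NN (POS)]", "house []", "[]"]:
        print(repr(cell), "->", split_term_cell(cell))
```

A cell that shows only "[]" yields an empty word and no tags, which is the case where the term text itself is empty rather than merely untagged.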

I hope that helps.

Cheers, Kilian

Dear Kilian

Thanks for your help.

The problem is that I see this [] alone in some rows, with no associated term.

This lone [] has a TF value, so I think all the data are wrong.

It would be better if this lone [] (with no associated word) did not appear in the bag of words at all.

In my workflow, I use:

Document Data Extractor > Case Converter > Replacer (to strip French punctuation) > Table Creator + Dictionary Tagger + Tag Filter (to exclude French stop words) > Bag of Words > TF

In the TF output table I see:

Row ID  Term          Document    TF rel
1       révolution[]  Document x  0.044
2       []            Document x  0.044
3       française[]   Document x  0.044

This [] is a problem because the TF values of the document are then not really correct.

Thanks for your help,

David


You could filter the terms using the N Chars Filter node...

Hi,

You can also use a simple Rule-based Row Filter node.

Try and let us know.

Thanks and regards,

Narmadha.

Hi David,

As Geo mentioned, you can use the N Chars Filter node to filter out words consisting of only a few (n) characters. Apply the filter before computing the TF values.
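A minimal Python sketch of why the filter has to run before the TF step, assuming relative TF is a term's count divided by the total number of terms in the document: an empty term then enlarges the denominator and takes a share of its own. The term list and the helpers `relative_tf` and `drop_short_terms` are made up for illustration; they are not KNIME functions, and in the workflow the N Chars Filter node plays the role of `drop_short_terms`.

```python
from collections import Counter


def relative_tf(terms: list[str]) -> dict[str, float]:
    """Relative term frequency: count of a term / total number of terms."""
    total = len(terms)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(terms).items()}


def drop_short_terms(terms: list[str], min_chars: int = 1) -> list[str]:
    """Keep only terms with at least min_chars characters (whitespace ignored)."""
    return [t for t in terms if len(t.strip()) >= min_chars]


if __name__ == "__main__":
    # Made-up document: two real words plus two empty terms.
    terms = ["révolution", "", "française", ""]
    print(relative_tf(terms))                    # empty term included
    print(relative_tf(drop_short_terms(terms)))  # empty term removed first
```

With this made-up list, the empty term gets 0.5 and each word 0.25; once the empty terms are dropped, each word gets 0.5.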

Cheers, Kilian