Filters not working properly

Hi,

I tried to do a text analysis of a few PDF reports.

I used the following nodes in the same order with appropriate settings:

PDF Parser

POS tagger

BOW creator

Filters (Number, Punctuation, Stop word, etc.)

IDF

TagCloud.

However, the outputs have some issues.

1. The file path is read as a 'term', as seen in the first attached image. Why is that, and how can I fix it?

2. Although I use the Number Filter, the numbers are not being filtered out, as the second attached image shows.

3. After creating a tag cloud, is it possible to remove certain unimportant or uninteresting words such as 'Chapter', 'VI', 'III', etc. (second image)?

Would be glad if somebody could help me.

Thanks.

 

Hi,

1. If no title can be extracted from the PDF, the file path is used as the title. The title is treated like any other text in the document, so it shows up as terms.

2. The Number Filter removes only terms that are purely numeric, e.g. 1234 will be filtered out, whereas abc123 will not.

3. You need to filter the data set before creating the tag cloud, e.g. using the Row Filter node. The Tag Cloud node itself does not offer any filtering.
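
To illustrate points 2 and 3, here is a rough Python sketch of the filtering logic. It is not KNIME code and not how the nodes are implemented. It assumes that a "number" term is one made up of digits only, and the rows, weights, and the unwanted-word list are made-up examples.

```python
import re

# Assumption: a "number" term consists of digits only (e.g. "1234").
DIGITS_ONLY = re.compile(r"\d+")


def is_number_term(term):
    """Return True only if the whole term is numeric; 'abc123' is not."""
    return DIGITS_ONLY.fullmatch(term) is not None


def filter_rows(rows, unwanted):
    """Keep (term, weight) rows whose term is neither purely numeric
    nor listed (case-insensitively) in the unwanted set."""
    return [
        (term, weight)
        for term, weight in rows
        if not is_number_term(term) and term.lower() not in unwanted
    ]


# Hypothetical term/weight rows, as they might look before the tag cloud.
rows = [
    ("1234", 0.10),
    ("abc123", 0.20),
    ("Chapter", 0.12),
    ("VI", 0.05),
    ("III", 0.04),
    ("budget", 0.40),
]

# Hypothetical list of words to keep out of the tag cloud.
UNWANTED = {"chapter", "vi", "iii"}

filtered = filter_rows(rows, UNWANTED)
```

Here "1234" is dropped as a pure number, "abc123" survives because it is a mixed term, and the listed words are removed before anything reaches the tag cloud.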

Tip: use the Bag of Words creator after the filtering and preprocessing nodes, so that these operations are applied directly to the documents instead of to the BoW. This will increase speed.
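
The ordering in this tip can be sketched in Python as follows. Again, this is only an illustration under assumptions, not KNIME's implementation: the tiny stop-word list, the document contents, and the function names are all made up. The point is that each document is filtered once, and the bag of words is built only from what is left.

```python
from collections import Counter
import string

# Assumptions: a tiny illustrative stop-word list and single-character punctuation tokens.
STOP_WORDS = {"the", "of", "in", "a"}
PUNCTUATION = set(string.punctuation)


def preprocess(tokens):
    """Filter one document's tokens: drop punctuation, pure-digit tokens, and stop words."""
    return [
        t
        for t in tokens
        if t not in PUNCTUATION
        and not t.isdigit()
        and t.lower() not in STOP_WORDS
    ]


def bag_of_words(documents):
    """Count (document id, term) pairs from documents that were filtered first."""
    bow = Counter()
    for doc_id, tokens in documents.items():
        for term in preprocess(tokens):
            bow[(doc_id, term)] += 1
    return bow


# Hypothetical tokenized documents.
documents = {
    "report1": ["The", "budget", "of", "2010", ",", "budget"],
    "report2": ["audit", "in", "Chapter", "3", "."],
}

bow = bag_of_words(documents)
```

Because the filters run before the bag of words is created, the resulting table only ever contains the terms that survive preprocessing.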

Cheers, Kilian

Actually, having the tag cloud filter words on a separate port based on a term weight would be a nice and intuitive extension, though completely redundant given the Row Filter. Still... :-)

Thanks a ton for the inputs.

Narmadha