Filter the results of the NGram Creator node

Hi,

I have the following workflow:

PDF Parser --> Preprocessing Nodes --> NGram Creator --> Document Data Extractor --> Java Snippet (simple) [I use this to extract the File_Name of the different documents] --> Column Filter

The result of the Column Filter node is:

Row ID  Ngram              Document frequency  File_Name
0       word1 word2 word3  1                   EP0000001A1
1       word4 word5 word6  4                   EP0000002A1
2       word4 word5 word6  2                   EP0000002A2

I want to filter the table above using another table (TABLE1) that contains a list of words: every row whose ngram contains one of these words should be removed.

TABLE1:

word1
word2

The result should be:

Row ID  Ngram              Document frequency  File_Name
1       word4 word5 word6  4                   EP0000002A1
2       word4 word5 word6  2                   EP0000002A2

So I tried to add the following nodes to the workflow above:

[...] Column Filter --> Strings to Document --> Dictionary Tagger + TABLE1 (see above) --> General Tag Filter --> Reference Row Filter + Table (with space "")

The result of the Reference Row Filter is a filtered table, but the "File_Name" column is missing:

Row ID  Ngram              Document frequency
1       word4 word5 word6  4
2       word4 word5 word6  2

Is there any way to add this column back? Or maybe another way to filter those rows? The Document Data Extractor doesn't work here; I guess that's because of the "Strings to Document" node.

Many thanks in advance!

Best
Simon

Hi Simon,

what about joining the file name from the first table to the last table by RowID?
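
To illustrate the idea outside KNIME, here is a minimal Python sketch of a join on the row ID. The table contents are copied from the question; the dictionary layout and the helper name join_file_name are just assumptions for this example, and it only works if the row IDs pass through the intermediate nodes unchanged.

```python
# First table: output of the Column Filter node (values taken from the question).
first_table = {
    0: {"ngram": "word1 word2 word3", "doc_freq": 1, "file_name": "EP0000001A1"},
    1: {"ngram": "word4 word5 word6", "doc_freq": 4, "file_name": "EP0000002A1"},
    2: {"ngram": "word4 word5 word6", "doc_freq": 2, "file_name": "EP0000002A2"},
}

# Last table: output of the Reference Row Filter, where File_Name is missing.
last_table = {
    1: {"ngram": "word4 word5 word6", "doc_freq": 4},
    2: {"ngram": "word4 word5 word6", "doc_freq": 2},
}


def join_file_name(filtered, source):
    """Inner join on the row ID (hypothetical helper, not a KNIME API).

    Copies file_name from the source table into every filtered row
    whose row ID also exists in the source table.
    """
    joined = {}
    for row_id, row in filtered.items():
        if row_id in source:
            joined[row_id] = {**row, "file_name": source[row_id]["file_name"]}
    return joined


joined = join_file_name(last_table, first_table)
for row_id, row in joined.items():
    print(row_id, row["ngram"], row["doc_freq"], row["file_name"])
```

Because the key is the row ID and not the ngram text, the two rows with the identical ngram "word4 word5 word6" still get their own file names.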

I also attached an example workflow, showing two other filtering variations.

1) Create a set from each ngram, ungroup the set, and apply a Reference Row Filter to the ungrouped table (filtering out rows that are contained in the dictionary). Then group again and compare the set lengths (filtered vs. original); if the lengths are not equal, filter the ngram out.

2) Create a set from each ngram and a set from the dictionary, compare the sets with the Subset Matcher, then apply a Reference Row Filter.
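
The logic behind both variations can be sketched roughly in plain Python. This is only an illustration of the set comparisons, not the attached workflow itself: the function names are made up, the data comes from the question, and for variation 2 it assumes that each dictionary word is matched as its own one-element set.

```python
# Rows as they leave the Column Filter node (values taken from the question).
rows = [
    {"row_id": 0, "ngram": "word1 word2 word3", "doc_freq": 1, "file_name": "EP0000001A1"},
    {"row_id": 1, "ngram": "word4 word5 word6", "doc_freq": 4, "file_name": "EP0000002A1"},
    {"row_id": 2, "ngram": "word4 word5 word6", "doc_freq": 2, "file_name": "EP0000002A2"},
]
dictionary = {"word1", "word2"}  # TABLE1


def filter_by_set_length(rows, dictionary):
    """Variation 1: remove dictionary words, then compare set sizes."""
    kept = []
    for row in rows:
        original = set(row["ngram"].split())
        # Mirrors ungroup -> reference row filter -> group again.
        remaining = original - dictionary
        # Equal sizes mean no word of the ngram was in the dictionary.
        if len(remaining) == len(original):
            kept.append(row)
    return kept


def filter_by_subset(rows, dictionary):
    """Variation 2: drop a row if any dictionary entry is a subset of its ngram set."""
    # Assumption: every dictionary word forms its own one-element set.
    entry_sets = [{word} for word in dictionary]
    kept = []
    for row in rows:
        ngram_set = set(row["ngram"].split())
        if not any(entry <= ngram_set for entry in entry_sets):
            kept.append(row)
    return kept


result_1 = filter_by_set_length(rows, dictionary)
result_2 = filter_by_subset(rows, dictionary)
print([r["row_id"] for r in result_1])
print([r["row_id"] for r in result_2])
```

Both functions keep the whole row, so File_Name is never lost in this sketch; in the workflow that depends on which columns the intermediate nodes pass through.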

Cheers, Kilian

Hi Kilian,

many thanks for your answer!

"what about joining the file name from the first table to the last table by RowID?"

It works! It's that simple! ;-)

Many thanks!

Best
Simon
