Parsing PDFs and searching for specific keywords

chamallow · April 2, 2020, 4:43pm

Hello, I am trying to build a simple PDF parsing workflow that would read multiple PDFs and score them based on the occurrence of certain keywords (which can be entered manually in a dictionary or table beforehand). The output would be a table with the name of each file and its score.

It sounds conceptually simple, but I’m a bit lost as to which nodes to use as a beginner.
Thanks in advance

izaychik63 · April 2, 2020, 4:56pm

Look through this example

To read the PDFs, use the Tika Parser node.

chamallow · April 2, 2020, 5:13pm

Thanks, this is useful, especially the Tika Parser. However, how do I actually score my files based on the occurrence of a given word list? The provided example workflow seems to identify the most frequent words, which is not exactly my use case.
Thanks

izaychik63 · April 2, 2020, 5:19pm

Use the Rule-based Row Filter or the Joiner node to keep only the specific words you care about, then use GroupBy to count them per file.
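
For anyone who finds it easier to read the logic as code, here is a rough Python sketch of the same filter-then-count idea. It is not KNIME code and not the example workflow: the function name `score_documents` and the sample file names are made up for illustration, and it assumes the text has already been extracted from each PDF (for example by the Tika Parser) and that the score is simply the total number of keyword hits.

```python
import re
from collections import Counter


def score_documents(texts, keywords):
    """Score each document by how often the given keywords occur.

    texts    -- dict mapping a file name to its already-extracted text
                (assumption: PDF parsing happened in an earlier step)
    keywords -- iterable of single words to look for (case-insensitive)
    Returns a list of (file name, total score, per-keyword counts) rows.
    """
    wanted = {k.lower() for k in keywords}
    rows = []
    for name, text in texts.items():
        # Split the text into lowercase word tokens.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        # Filter step: keep only tokens that are in the keyword list.
        hits = [t for t in tokens if t in wanted]
        # Group-and-count step: occurrences per keyword for this file.
        counts = Counter(hits)
        rows.append((name, sum(counts.values()), dict(counts)))
    return rows


# Hypothetical sample input, standing in for parsed PDF text.
sample = {
    "report_a.pdf": "Risk and compliance. Risk again.",
    "report_b.pdf": "Nothing relevant here.",
}
for row in score_documents(sample, ["risk", "compliance"]):
    print(row)
```

This only matches whole single-word keywords; phrases or weighted keywords would need a different scoring rule.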

chamallow · April 3, 2020, 6:31pm

Thanks, I got it working! The Rule-based Row Filter was the way to go.
