Keyword analysis from PDF abstracts

mpieper · October 9, 2020, 5:58am

Hey everyone,

I just started working with KNIME and am a little bit overwhelmed by the many possibilities and nodes.
I want to analyse a given set of keywords across several hundred PDFs, but only in the abstracts. So far I can only analyse frequencies or extract the most-used keywords, but I can't define my own keywords to search for (which node would do something like that?).

Does anyone know of an existing workflow for that? Or would you recommend first transferring the abstracts to Excel?

Thanks and best wishes
Marianne

ScottF · October 9, 2020, 4:00pm

Hi @mpieper and welcome to the forum.

Are you planning to use the KNIME Textprocessing extension for this work?

It might be tricky to isolate the abstracts from the remainder of the text, depending on how the PDFs are formatted. Generally I like to use the Tika Parser to ingest text from PDFs.
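
To give an idea of what "isolating the abstract" could look like once the text has been extracted, here is a rough Python sketch. It is only a heuristic, not a KNIME node or an existing workflow: it assumes the plain text contains a line starting with "Abstract" and that the abstract ends at the next "Keywords", "Index Terms" or "Introduction" heading. The function name `extract_abstract` and the heading list are made up for illustration, and PDFs with other layouts will need different rules.

```python
import re

# Heuristic (an assumption, not a general rule): the abstract begins on a line
# that starts with "Abstract" and ends just before the next typical heading.
ABSTRACT_RE = re.compile(
    r"^[ \t]*abstract\b[\s:.\-]*"          # heading, optional punctuation
    r"(?P<body>.*?)"                       # abstract text, as short as possible
    r"(?=\n\s*(?:keywords?\b|index terms\b|(?:1\.?\s*)?introduction\b)|\Z)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def extract_abstract(text):
    """Return the abstract found in plain text, or None if no heading matches."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    body = " ".join(match.group("body").split())
    return body or None


sample = (
    "Some Paper Title\nA. Author\n\n"
    "Abstract\nWe study soil erosion\nin alpine regions.\n\n"
    "Keywords: soil, erosion\n\n1. Introduction\nMore text follows."
)
print(extract_abstract(sample))
```

Documents where this returns `None` would have to be checked by hand or handled with a different rule.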

When searching for keywords, you can use the Document Viewer node for exploring text, or some combination of filtering and/or tagging nodes to isolate keywords manually.
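
If it helps to see the logic of searching for a fixed keyword list outside of the nodes, the following Python sketch counts how often each predefined keyword occurs in each abstract. It assumes the abstracts are already available as plain strings keyed by file name; `count_keywords` and the sample data are hypothetical. Matching is case-insensitive and on whole words only, so "soils" does not count as "soil" (stemming would be needed for that).

```python
import re


def count_keywords(abstracts, keywords):
    """Count whole-word, case-insensitive hits of each keyword per document.

    abstracts: dict mapping a document id (e.g. file name) to its abstract text
    keywords:  list of words or multi-word phrases to search for
    """
    # Compile one pattern per keyword; re.escape keeps punctuation literal.
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    return {
        doc_id: {kw: len(pat.findall(text)) for kw, pat in patterns.items()}
        for doc_id, text in abstracts.items()
    }


# Made-up example data, for illustration only.
abstracts = {
    "a.pdf": "Soil erosion and soil loss in alpine soils.",
    "b.pdf": "Machine learning for erosion.",
}
counts = count_keywords(abstracts, ["soil", "erosion", "machine learning"])
for doc_id, hits in counts.items():
    print(doc_id, hits)
```

The result is one row of counts per document, which maps naturally onto a table with one column per keyword.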

More details about your use case would be good… otherwise we’re just guessing about how best to help you.

badger101 · October 9, 2020, 11:48pm

Hi Marianne,

As Scott was saying, it’d be best if you provide more details, like what’s the goal you’re aiming for etc.

I specifically use the Parallel LDA node to model topics from whole PDF documents. If your corpus is large and you want to summarize its content in terms of themes, then that's the node you're looking for.
