Performing text mining through PDF files

Nico1990 · October 30, 2012, 11:03am

Hello KNIME community!

I have dozens of PDF files that I'd like to read in order to count occurrences of certain words. These files are scientific articles. I thought the PDF Parser node would load my files and extract the text, but it does not. In the best case, it returns the title of the article with the name of the journal, but I can't get the full text.

I know that the PDF format is pretty restrictive, but judging from the description of the PDF Parser node, there should be a way to do it.

Nico

kilian.thiel · October 31, 2012, 11:32am

Hello Nico,

You can use the PDF Reader node of the textprocessing feature to read all PDF files in a given directory. Once the PDF files have been parsed, the documents are stored in KNIME as document cells. The output table of the PDF Reader node shows one row per document. Note that only the titles of the documents are displayed in the table cells. If you want to look at the complete document, use the Document Viewer node. After parsing, the documents need to be transformed into a bag of words. To do so, use the BoW creator node. Once you have a bag of words, you can use the Term co-occurrence counter node to compute the term co-occurrence values.

Attached you will find an example workflow that reads PDF files, creates a bag of words, filters the BoW, and finally computes term co-occurrence values.

I hope this helps.
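For readers who want to see the idea outside KNIME, here is a minimal Python sketch of the two steps after parsing: building a bag of words and counting term co-occurrences. It is not how the KNIME nodes are implemented. It assumes the text has already been extracted from the PDFs, counts co-occurrence at document level only, and the function names and sample documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (document title, term) pairs, one per distinct term in a document."""
    return [
        (title, term)
        for title, text in documents.items()
        for term in sorted(set(tokenize(text)))
    ]


def cooccurrence_counts(documents):
    """Count, for each unordered term pair, how many documents contain both terms."""
    counts = Counter()
    for text in documents.values():
        terms = sorted(set(tokenize(text)))
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical stand-ins for text already extracted from PDF files.
documents = {
    "paper_a": "Protein folding and protein stability",
    "paper_b": "Protein stability under heat",
}

bow = bag_of_words(documents)
pairs = cooccurrence_counts(documents)
print(pairs[("protein", "stability")])  # prints 2
```

Both sample documents contain "protein" and "stability", so that pair is counted twice. A filtering step, like the one in the workflow, would drop words such as "and" from the bag of words before the pairs are counted.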

Cheers, Kilian

pdfreader-termcoocurrence-example.zip

Nico1990 · November 6, 2012, 11:51am

Hello Kilian,

Sorry for the delay. Thank you for this explanation and for the example workflow. There are tools I didn't know existed. It's quite interesting.

Regards,

Nico

Rinjez · October 17, 2013, 11:13am

Hi KNIME users,

I have been trying to text mine 2 PDF documents and would like to see the frequency of a list of words that I need (I do not want the frequency of all words, just some keywords that I have selected). Secondly, I would like to view this in a visual format, maybe a chart or a tag cloud.

I have followed the examples given, but the tag cloud shows all the words in the document.

How do I do this?

Secondly, I would like to do the same with an Excel sheet containing text. Is there a way to have the rows in Excel treated as documents so that I can perform text mining on them?

I would appreciate an example workflow for this.

Thank you

Grace

Ergonomist · October 18, 2013, 12:46pm

Hi Grace,

You should be able to filter the terms with standard KNIME tooling after stripping out the square brackets ([]) with a String Manipulation node. A Reference Row Filter node seems to be in order for the actual filtering.
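To show the logic rather than the nodes, here is a rough Python sketch of the same two steps: strip the bracket suffix from each term string, then keep only the terms that appear in your keyword list. It assumes the terms look like "energy[]" once they are converted to strings. The sample terms, the keyword list, and the function names are invented for illustration.

```python
import re
from collections import Counter


def strip_tag_suffix(term_string):
    """Remove a trailing bracketed tag list, e.g. 'energy[]' -> 'energy'."""
    return re.sub(r"\[.*\]$", "", term_string).strip()


def keyword_frequencies(term_strings, keywords):
    """Count only the terms that appear in the keyword list (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    cleaned = (strip_tag_suffix(t).lower() for t in term_strings)
    return Counter(t for t in cleaned if t in wanted)


# Hypothetical term column as it might look after converting terms to strings.
terms = ["energy[]", "policy[]", "Energy[]", "the[]", "solar[NN(POS)]"]
keywords = ["energy", "solar"]

freq = keyword_frequencies(terms, keywords)
print(dict(freq))  # prints {'energy': 2, 'solar': 1}
```

The keyword list plays the role of the reference table: anything not in it is dropped before counting, so a chart or tag cloud built from the result only shows the selected words.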

Re Excel, the "String to documents" node applied to the parse columns of interest will help you do that.

Cheers,
E

Rinjez · November 24, 2013, 11:13am

Thanks E.

I managed to do it.

Another question: the text mining analyses each document separately. Is it possible to combine these PDF documents into one and then perform text mining on the result?

To make myself clear: these documents are all from one country, and I would like to see the frequency of certain words (I have a dictionary for this). When I run my workflow, it returns the frequencies per document, so I am unable to find the total frequency of these words.

Or is there an option to add up the frequencies from each document, so that each word has a single value rather than multiple values?
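To illustrate the second option, this is the arithmetic I have in mind, sketched in Python rather than KNIME. The file names and counts are made up, and `total_frequencies` is just a name chosen for this example.

```python
from collections import Counter


def total_frequencies(per_document):
    """Add up per-document term counts into one count per term."""
    totals = Counter()
    for counts in per_document.values():
        # Counter.update adds the counts instead of overwriting them.
        totals.update(counts)
    return totals


# Hypothetical per-document keyword frequencies.
per_document = {
    "report_1.pdf": {"water": 3, "health": 1},
    "report_2.pdf": {"water": 2, "education": 4},
}

totals = total_frequencies(per_document)
print(totals["water"])  # prints 5
```

So instead of one row per word and document, I would end up with one row per word holding the sum over all documents.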

I hope my question is clear. Please assist.

Thank you

Grace

Patrik · April 27, 2015, 10:24pm

Thank you for the sample text mining workflow that was uploaded above: www.knime.com/files/reducedenergydata.zip.

It's quite helpful! As I keep using it, there is one more thing I would like to change:

When generating the tag cloud, I find terms that I would like to remove from it. These are usually specific terms. So I am looking for a node similar to the Stop Word Filter, but with an option for customisable terms.

So my question is: is there a node that allows me to identify and remove a custom group of words of my choosing from the tag cloud?

Thnx in advance,

Patrik
