I have dozens of PDF files I'd like to read in order to count occurrences of certain words. These files are scientific articles. I thought the PDF Parser node would load my files and extract the text, but it does not. In the best case, it returns the title of the article along with the name of the journal, but I can't get the full text.
I know that the PDF format is pretty restrictive, but from the description of the PDF Parser node, I think there should be a way to do it.
You can use the PDF Reader node of the Textprocessing feature to read all PDF files in a given directory. Once the PDF files have been parsed, the documents are stored in KNIME as document cells. The output table of the PDF Reader node shows one row per document. Note that only the titles of the documents are displayed in the table cells. If you want to look at a complete document, use the Document Viewer node. After parsing, the documents need to be transformed into a bag of words; use the BoW creator node for that. Once you have a bag of words, you can use the Term co-occurrence counter node to compute the term co-occurrence values.
Attached you will find an example workflow that reads PDF files, creates a bag of words, filters the bag of words, and finally computes term co-occurrence values.
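For readers who want to see what the bag-of-words and co-occurrence steps amount to outside KNIME, here is a minimal Python sketch. It is not what the KNIME nodes do internally: the tokenizer, the function names, and the two sample texts are all made up for illustration, and co-occurrence is counted at document level, which is only one of several possible definitions. The text is assumed to have been extracted from the PDFs already.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return one (title, term) pair per distinct term in each document."""
    return [
        (title, term)
        for title, text in documents.items()
        for term in sorted(set(tokenize(text)))
    ]


def cooccurrence_counts(documents):
    """Count, for each unordered term pair, how many documents contain both."""
    counts = Counter()
    for text in documents.values():
        terms = sorted(set(tokenize(text)))
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical stand-ins for text already extracted from two PDFs.
documents = {
    "paper_a": "Protein folding and protein binding",
    "paper_b": "Binding of the protein",
}

bow = bag_of_words(documents)
pairs = cooccurrence_counts(documents)
```

Each pair is stored with its two terms in alphabetical order, so `pairs[("binding", "protein")]` gives the number of documents in which both terms appear.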
Sorry for the delay. Thank you for this explanation and for the workflow demonstration. There are tools here I didn't know existed. It's quite interesting.
I have been trying to text mine two PDF documents and would like to see the frequency of a list of words that I need (I do not want the frequency of all words, just some keywords that I have selected). Secondly, I would like to view this in a visual format, maybe a chart or a tag cloud.
I have followed the examples given, but the tag cloud shows all the words in the document.
How do I do this?
I would also like to do the same with an Excel sheet containing text. Is there a way to have each row in Excel treated as a document so that I can perform text mining on it?
You should be able to filter the terms with standard KNIME tooling after stripping the square brackets ([]) from the terms with a String Manipulation operation. A Reference Row Filter node seems to be the right choice for the actual filtering.
Re Excel: the "String to documents" node, applied to the columns you want to parse, will help you do that.
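To make the filtering idea concrete, here is a small Python sketch of the same two steps: strip the bracketed part from each term, then keep only the rows whose term is in your keyword list. The `term[tags]` row format, the function names, and the sample counts are assumptions for illustration only; in KNIME you would do this with the nodes mentioned above.

```python
import re


def strip_tags(term):
    """Drop a trailing bracketed part, e.g. 'protein[]' becomes 'protein'."""
    return re.sub(r"\[.*\]$", "", term).strip()


def filter_by_keywords(rows, keywords):
    """Keep only (term, count) rows whose cleaned term is a wanted keyword."""
    wanted = {k.lower() for k in keywords}
    return [
        (strip_tags(term), count)
        for term, count in rows
        if strip_tags(term).lower() in wanted
    ]


# Hypothetical term-frequency rows, as they might look before cleaning.
rows = [("protein[]", 12), ("the[]", 40), ("binding[NN(POS)]", 5)]

selected = filter_by_keywords(rows, ["protein", "binding"])
```

Only the selected rows would then go on to the chart or tag cloud.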
Another question: the text mining analyses each document separately. Is it possible to combine these PDF documents into one and then perform text mining on the result?
To make myself clear: these documents are all from one country, and I would like to see the frequency of certain words (I have a dictionary for this). When I run my workflow, it returns the frequencies for each document separately, so I am unable to find the total frequency of these words.
Or is there an option to add up the frequencies from each document so that I get a single value per word rather than multiple values?
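The aggregation asked about here comes down to grouping by term and summing the per-document counts. In KNIME that would presumably be a grouping step such as a GroupBy node with a sum aggregation, though that is a suggestion and not something confirmed in this thread. A minimal Python sketch of the idea, with hypothetical counts and dictionary words:

```python
from collections import Counter


def total_frequencies(per_document_counts, dictionary):
    """Sum the counts of dictionary words over all documents."""
    wanted = {word.lower() for word in dictionary}
    totals = Counter()
    for counts in per_document_counts.values():
        for term, n in counts.items():
            if term.lower() in wanted:
                totals[term.lower()] += n
    return totals


# Hypothetical per-document term counts.
per_document_counts = {
    "doc1": {"health": 3, "water": 1, "the": 20},
    "doc2": {"health": 2, "education": 4},
}

totals = total_frequencies(per_document_counts, ["health", "education", "water"])
```

This gives one total per dictionary word without having to merge the PDFs into a single document first.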
It's quite helpful! Using it further, I need to change one more thing:
When generating the tag cloud, I find terms that I would like to remove from it. These are usually specific terms, so I am looking for a node similar to Stop Word Filter but with an option for custom terms.
So my question is: is there a node that allows me to identify and remove a custom group of words of my choosing from the tag cloud?
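For reference, the behaviour being asked for is an exclusion filter applied before the tag cloud is built. This Python sketch only illustrates that logic and does not refer to any particular KNIME node; the function name, the counts, and the excluded terms are made up.

```python
def remove_custom_terms(term_counts, excluded):
    """Return the term counts without any term in the custom exclusion list."""
    blocked = {term.lower() for term in excluded}
    return {
        term: count
        for term, count in term_counts.items()
        if term.lower() not in blocked
    }


# Hypothetical term counts feeding a tag cloud.
term_counts = {"protein": 12, "Figure": 9, "binding": 5, "et": 7}

cloud_terms = remove_custom_terms(term_counts, ["figure", "et"])
```

The comparison ignores case, so "Figure" is removed even though the exclusion list has it in lowercase.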