I am new to KNIME and I'm looking to create what I think should be a very simple workflow.
I have ten .txt files, each containing an annual report of a business (so, a decade of annual reports). Very simply, I would like to know how many times a particular word or phrase occurs in these reports, by year. I am not interested in any other words, though I would like some basic statistics about each report (total number of words, etc.). Example words/phrases are: "water" and "climate change".
Secondly, how can I export the outcome of this analysis as a file that can be read in Excel?
The text processing nodes are what you want; you will find them under KNIME Labs / Text Processing.
First, convert your .txt files into document format using the Strings to Document node.
Then tag the words you want using the Dictionary Tagger node. To supply the words, enter them in a Table Creator node and connect it to the Dictionary Tagger.
Then use the BoW Creator node to pull all the words out of the documents.
Now use the General Tag Filter node to keep only the terms you tagged earlier.
Now use the TF node to calculate the frequency of each term in each document.
To output this to Excel, connect a Column Filter node to keep only the columns you want, followed by an XLS Writer node.
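If it helps to see what the steps above compute, here is a minimal Python sketch of the same idea outside KNIME. It is not what the nodes do internally; the function names (count_phrase, summarize, write_csv) and the column names are made up for illustration, and it assumes whole-word, case-insensitive matching is what you want.

```python
import csv
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count whole-word, case-insensitive occurrences of a word or phrase."""
    # \b keeps "water" from matching inside "waterfall"; \s+ lets a phrase
    # such as "climate change" span a line break or repeated spaces.
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def summarize(reports: dict[str, str], phrases: list[str]) -> list[dict]:
    """Build one row per report: label, total word count, one count per phrase."""
    rows = []
    for label, text in reports.items():
        # Splitting on whitespace is a rough word count (assumption).
        row = {"report": label, "total_words": len(text.split())}
        for phrase in phrases:
            row[phrase] = count_phrase(text, phrase)
        rows.append(row)
    return rows


def write_csv(rows: list[dict], path: str) -> None:
    """Write the rows to a CSV file, which Excel can open."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

You would call summarize with a mapping of report label (for example the year) to the full text of that report, plus the list ["water", "climate change"], and then pass the result to write_csv.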
Thanks for getting back to me and for all the tips. However, I'm still confused about how to actually start the workflow.
Some basic questions:
- Where on my computer do I place the .txt files and how do I tell KNIME where they are?
- It seems to me that Strings to Document has to follow another node. If so, what type of node is this (File Reader? Parser?), and is this the first one?
- How should I configure each node? Specifically, do I have to do anything under 'Flow Variables', or simply leave those settings as they are (blank)?
You can place the txt files anywhere on the computer.
To read the files in, just use the File Reader node and select the location of the .txt file in the node dialog. You can use several File Reader nodes to load several .txt files, and then join them all together using the Concatenate node.
If you are feeling more adventurous, you can load them all in one go using the List Files node, where you specify the directory containing the files. Follow it with a TableRow To Variable Loop Start node, and then a File Reader node connected via the red flow variable line (right-click a node and choose Show Flow Variable Ports). In the File Reader dialog, click the small button next to Browse and select the variable containing the file locations, probably called URL. Then close the loop after the File Reader with a Loop End node. This loads all the .txt files from one location.
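For comparison, the "list the files, loop over them, collect the results" pattern looks roughly like this in Python. This is only a sketch of the idea, not what KNIME runs; load_reports is a made-up name, and it assumes the files are UTF-8 text and that the file name (for example the year) is a good label for each report.

```python
from pathlib import Path


def load_reports(directory: str) -> dict[str, str]:
    """Read every .txt file in a directory into a {file name: full text} mapping."""
    reports = {}
    # sorted() gives a stable order, e.g. by year if the file names contain it.
    for path in sorted(Path(directory).glob("*.txt")):
        # Each file becomes one entry holding its whole text.
        # errors="replace" is an assumption to survive odd characters.
        reports[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return reports
```

Files with other extensions in the same directory are ignored by the "*.txt" pattern.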
The Flow Variables tab is an advanced setting and does not need to be changed in most cases, especially if you are new to KNIME.
Great, thanks for this. However, each time I load an individual text file into the File Reader, the file is read as two columns with many rows. Shouldn't I just have one (very long) row per document? If so, how should I adjust the Basic Settings in File Reader?
You are right; in hindsight, the File Reader may not be the best input node.
Switch the File Reader to the Flat File Document Parser. This should in fact be easier: all you need to do is specify the directory where the .txt files are located. Make sure the directory contains no files other than these .txt files. This will save you a node, because the files are loaded in document format straight away, so you no longer need the Strings to Document node or the looping nodes described in my last comment.
How can I filter some terms after this? Some of the terms appear in different forms. For example, I have the terms PM 6301, P 6301, and 6301. The problem is that not everyone writes a term the same way.
To filter terms / words in documents, you can use some of the many filtering nodes available in the Preprocessing category of the Textprocessing extension. Handling different spellings of words is a different topic. For that, you could do a dictionary-based replacement with the Dict. Replacer node.
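To make the idea of a dictionary-based replacement concrete, here is a small Python sketch. It is not how the Dict. Replacer node is implemented; the VARIANTS table, the canonical form "PM_6301", and the normalize function are all made up for illustration.

```python
import re

# Hypothetical dictionary: each variant spelling maps to one canonical form.
VARIANTS = {
    "PM 6301": "PM_6301",
    "P 6301": "PM_6301",
    "6301": "PM_6301",
}


def normalize(text: str, variants: dict[str, str]) -> str:
    """Replace every variant with its canonical form in a single pass."""
    # Longest variants first, so "PM 6301" wins over the bare "6301" inside it.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
    )
    # A single sub() call means already-replaced text is not scanned again.
    return pattern.sub(lambda match: variants[match.group(0)], text)
```

Once all variants are mapped to one form, counting or filtering that term becomes a single lookup instead of three.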