Basic getting started question from newbie

Hi All.
I’m just trying out KNIME for the first time and would like to ask a couple of questions to get me started. I do quite a bit of text manipulation but can’t really see how to go about it in KNIME. Here is a basic thing I would like to do.

Let’s say I have the following text in a file. It is just two questions from the FAQ section. Each has a question (identified by a ?) and an answer part.

“Can I modify, publish, transmit, transfer or sell, reproduce, create derivative works from, distribute, perform, display, or in any way exploit any of the content, in whole or in part?
You may not modify, publish, transmit, transfer or sell, reproduce, create derivative works from, distribute, perform, display, or in any way exploit any of the content, in whole or in part, except as otherwise expressly permitted in writing by the copyright owner.
KNIME is available under a dual licensing model. An open source license is available for non-profit use. For commercial usage of KNIME, please contact us. Please refer to the license for more information.
How much data can I process with KNIME?
Basically, there are no limits, since the data is buffered in an intelligent way. Nevertheless, some algorithms may require too much time and memory for very huge datasets.”

I can read this in using a text reader, and I see that it creates a table with one column and 5 rows, one for each line in the file that ends with a newline.

What I want to do now is manipulate this text so that I get a table with two columns and two rows. The first column should be called Question and the second column should be called Answer.

So my approach would be to identify a row which has a question in it using the ? and then concatenate all following rows until a new question is reached, moving the concatenated text to the “answer” column associated with the question.
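For anyone who does end up scripting this, the approach above can be sketched in a few lines of Python. This is only a rough sketch outside KNIME, not a KNIME node: the function name `group_faq` and the dictionary keys are made up for illustration, and it assumes every question sits on a single line ending in `?`.

```python
def group_faq(lines):
    """Pair each question line with the answer lines that follow it.

    Assumption: a line is a question if (and only if) it ends with '?'.
    """
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if text.endswith("?"):
            # A question starts a new row with an empty answer.
            rows.append({"Question": text, "Answer": []})
        elif rows:
            # Any other line belongs to the most recent question.
            rows[-1]["Answer"].append(text)
        # Lines before the first question are ignored.
    # Concatenate the collected answer lines into one string per row.
    return [
        {"Question": r["Question"], "Answer": " ".join(r["Answer"])}
        for r in rows
    ]


# Hypothetical usage with a shortened version of the FAQ text:
sample = [
    "Can I modify the content?",
    "You may not modify the content.",
    "Please refer to the license.",
    "How much data can I process with KNIME?",
    "Basically, there are no limits.",
]
for row in group_faq(sample):
    print(row["Question"], "->", row["Answer"])
```

With the five-line file from this post, the same logic would give two rows, one per question.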

So my questions are

  1. Is there an easy way of doing this whilst reading the file in the first place?
  2. If not, how would you go about doing this?
  3. Do I have to code this in Java or Python? I can’t do that.

Cheers
Pash

Hi Pash,

at http://labs.knime.org/textprocessing you can find a text processing plugin for KNIME.
It is still in an experimental stage but usable. See the examples and documentation
to get an idea of what the plugin can do for you and how.
The plugin provides a new data cell, a document cell. This cell wraps a complete document,
and the nodes provided by the plugin can handle this type of cell.
You can preprocess documents in certain ways, like part-of-speech tagging,
named entity recognition, filtering, document vector creation and so on.
For your problem there is unfortunately no node available that can split text documents
into question and answer parts. But in about three weeks a new version of the text processing
plugin will be released, based on KNIME 2.1.0. The new version will provide a node that
extracts all sentences of a document and outputs them as string cells, one row for each sentence.
A Java Snippet node can then be used to check the last character of that string, which is
the punctuation mark, and then flag the rows accordingly. A Row Filter can filter based on the
flag set by the Java Snippet node. This would be a way to separate questions and answers.
Afterwards the separated tables could be concatenated or joined again.
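
To make the flag, filter and join idea a bit more concrete, here is a rough sketch in plain Python. It does not use any KNIME API. The function names and the running group number used as a join key are assumptions added for illustration, since the separated tables need some key to be joined on.

```python
def flag_sentences(sentences):
    """Flag each sentence by its last character.

    Returns (group, is_question, sentence) tuples. The group number is an
    assumed join key: it increases by one at every question.
    """
    flagged = []
    group = 0
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue  # skip empty rows
        is_question = text.endswith("?")
        if is_question:
            group += 1
        flagged.append((group, is_question, text))
    return flagged


def split_and_join(flagged):
    """Filter into a question table and an answer table, then join on group."""
    # "Row filter" on the flag: questions in one table, answers in the other.
    questions = {g: s for g, is_q, s in flagged if is_q}
    answers = {}
    for g, is_q, s in flagged:
        if not is_q:
            answers.setdefault(g, []).append(s)
    # Join the two tables on the group key; sentences before the first
    # question (group 0) have no matching question and are dropped.
    return [(questions[g], " ".join(answers.get(g, []))) for g in sorted(questions)]
```

Calling `split_and_join(flag_sentences(rows))` on a list of sentence strings gives one (question, answer) pair per question.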
Hope this helps.

  • Kilian