Question and Answer extraction from forums and discussion groups

Hi All,

I am relatively new to the KNIME platform. I have learned how to extract data from forums using the example that comes with KNIME. My question is whether KNIME is a suitable tool for extracting questions and their relevant answers from forums and discussion groups. I can tag the extracted data as questions and answers to train the system.

If KNIME can be used for this purpose, I would like to know which KNIME nodes are suitable for training the system and extracting questions and answers. I have gone through some research papers on the subject, but all of them are theoretical and make no reference to a practical implementation.

Any pointers in the right direction would be very helpful.

Thanks & Regards


Hi Kichenin,

Extracting questions and answers from free text is not an easy task. What you could do is crawl a forum using the Palladian nodes and extract the initial posts and answer posts using the XPath node of the XML extension. Of course, you would not get exactly the questions and answers from these posts. However, you could refine the approach, e.g. by scanning the initial posts for question marks to figure out whether a post contains a question. Extracting the answers from the reply posts would definitely be trickier. Since you already know the question, you could search the reply posts for terms that occur in the question to figure out which sentences are essential.
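
To make the two heuristics concrete, here is a minimal Python sketch of the idea, independent of any KNIME node. The function names, the tiny stopword list, and the naive sentence splitter are all assumptions made for illustration; real forum text would need more robust sentence splitting and term matching.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "is", "are", "to", "of", "in", "on", "and", "or",
    "i", "you", "it", "how", "what", "can", "do", "for", "with", "that", "this",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_questions(post):
    """Return the sentences of an initial post that end with a question mark."""
    return [s for s in split_sentences(post) if s.endswith("?")]


def content_terms(text):
    """Lowercased word set of the text with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def rank_answer_sentences(question, reply):
    """Score each reply sentence by how many question terms it shares.

    Returns (overlap, sentence) pairs with overlap > 0, highest overlap first.
    """
    q_terms = content_terms(question)
    scored = []
    for sentence in split_sentences(reply):
        overlap = len(q_terms & content_terms(sentence))
        if overlap > 0:
            scored.append((overlap, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Calling `find_questions` on each initial post and then `rank_answer_sentences` on each reply post gives a rough list of candidate answer sentences per question. Raw term overlap will miss answers that use different wording from the question, so treat the result as a first filter only.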

Cheers, Kilian

Hi Kilian,

Thank you for the reply. I understand the task is not easy. I am doing this because the subject fascinates me and I am willing to spend time to learn more. As you suggested, I used the KNIME forum extraction example as the base and was able to extract all the questions and answers from a particular forum topic (about 2000 questions and answers). After cleaning the data, I read each sentence and manually tagged the extracted text as question or answer, grouped each question with its answer, and created a table.

How can I use this table to train the system, so that I don't have to manually tag the text each time we extract the data? Are there any QA corpora available to train the system?

Thanks & Regards