Question about a project that requires text processing


I am new to KNIME, and I am working on an analytics project for a fundraising institution. The goal of this project is to find factors and correlations that can predict which individuals are most likely to receive an award.

I was considering logistic regression for this, but when I received the data (a large Excel file), I realized that the most relevant information was unstructured data in the form of long paragraphs of text contained in cells (i.e. the educational background and the summary of the professional experience).

Now I think that in order to get the information I need, it would be necessary to do text mining (I do not have experience with this). I would really appreciate it if somebody could help me find the best way to approach this project. Do you think that assigning a score based on the recurrence of terms from a dictionary could help to get a numeric value that can be used later for a logistic regression? In that case, what would be the best way to analyze the text contained in a specific number of cells from an Excel file?

Thank you very much for your help!




Hi Miguel,

Do you have the educational background and summary of experience in some structured form, e.g. one field containing the university and another containing the skills? Or is it all free, unstructured text?

If it is all unstructured, this sounds like a two-step task: first CV information extraction and normalization, then predictive modeling on the extracted, normalized data. However, CV information extraction and normalization is not an easy task, and to be honest I don't have experience in this field. It takes intelligent parsing techniques to get all the names and skills extracted and recognized as such. As a first attempt, I would not try to extract the CV information into a structured form, but would instead build a bag of words (BoW) for these summaries and use that for prediction.

The (pre)processing could be:
POS tagger, stop word filtering, case converter (to lower case), POS filter (keep only nouns and maybe verbs), N chars filter (filter out all words with fewer than 4 characters), number filter
Then see what terms are left in your BoW and try to build a prediction model. For that, of course, you need a class label indicating whether each individual has received an award or not, which makes this a binary classification problem. Try different models such as decision trees (or ensembles of trees), naive Bayes, and SVM to see whether any model performs well and whether the award recipients can be predicted from these BoW features. If so, you can fine-tune the preprocessing chain and the model building.
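
To make the preprocessing chain above more concrete, here is a rough sketch in plain Python (outside of KNIME) of what the filtering steps do to the text. It is only an illustration under some assumptions: the stop word list is a tiny made-up one, the function names are hypothetical, and the POS tagging and POS filtering steps are left out because they need a trained tagger.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "with", "from", "that", "this", "have", "for", "was", "were"}


def bag_of_words(text, min_chars=4):
    """Lower-case the text, drop numbers, stop words and short tokens, then count terms."""
    # Keeping only runs of letters also removes numbers and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if len(t) >= min_chars and t not in STOP_WORDS]
    return Counter(kept)


def document_term_matrix(texts):
    """Return the sorted vocabulary and one row of term counts per document."""
    bags = [bag_of_words(t) for t in texts]
    vocabulary = sorted(set().union(*bags))
    rows = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, rows


if __name__ == "__main__":
    # Made-up example texts, standing in for two cells of the Excel file.
    summaries = [
        "PhD in Biology from the University of Vienna, 2010",
        "Managed 12 research projects with the university",
    ]
    vocabulary, rows = document_term_matrix(summaries)
    print(vocabulary)
    for row in rows:
        print(row)
```

Each row of counts, together with the award / no award label, is the kind of table a classifier would then be trained on.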

If you have a list of skills, schools, universities and so on, you can use this list in combination with the dictionary tagger to extract those terms from the free text fields and filter out the rest of the terms.
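
As a rough idea of what the dictionary approach boils down to, here is a small plain Python sketch (not KNIME code). The dictionary entries and function names are made up for illustration, and it only handles single-word terms; multi-word names such as full university names would need proper phrase matching. The total count is one possible version of the dictionary score asked about in the question.

```python
import re

# Hypothetical dictionary of terms of interest (an assumption for this example).
DICTIONARY = {"harvard", "stanford", "python", "fundraising"}


def dictionary_features(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs; all other words are ignored."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {term: 0 for term in sorted(dictionary)}
    for token in tokens:
        if token in counts:
            counts[token] += 1
    return counts


def dictionary_score(text, dictionary=DICTIONARY):
    """Total number of dictionary hits, usable as a single numeric feature."""
    return sum(dictionary_features(text, dictionary).values())


if __name__ == "__main__":
    # Made-up example text standing in for one cell of the Excel file.
    summary = "Studied at Stanford; led fundraising and more fundraising."
    print(dictionary_features(summary))
    print(dictionary_score(summary))
```

The per-term counts could be used as separate numeric columns, or the single score could go into a logistic regression, depending on which works better on the data.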

I hope this helps.
Cheers, Kilian