For example, a resume might list Java, .NET and a few other skills. How can I extract them from each document in a folder with KNIME?
You can use the Tika Parser node to extract the textual content of the files in a folder, which will be output as a string. You can then use the String Manipulation node or the newer Column Expressions node to extract a substring of length x after certain keywords (e.g., “Skill set”) using the indexOf and substr functions.
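To make the indexOf/substr idea concrete, here is a minimal sketch in plain Python rather than KNIME expression syntax. The function name `text_after_keyword` and the sample resume text are made up for illustration; inside KNIME you would express the same logic with the node's own indexOf and substr functions.

```python
def text_after_keyword(text, keyword, length):
    """Return up to `length` characters that follow the first `keyword`."""
    start = text.find(keyword)  # plays the role of indexOf: -1 if absent
    if start == -1:
        return ""
    begin = start + len(keyword)
    # Plays the role of substr(text, begin, length), then trims whitespace.
    return text[begin:begin + length].strip()


# Hypothetical resume content, only for illustration.
resume = "Name: A. Sample\nSkill set: Java, .NET, SQL\nEducation: BSc"
print(text_after_keyword(resume, "Skill set:", 16))  # Java, .NET, SQL
```

A fixed length x is fragile because skill lists vary in length, so in practice you may prefer to cut at the next line break instead of after a fixed number of characters.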
I was able to read the documents successfully, and I have done the pre-processing too. It is the extraction of skill sets that I am having trouble with. From every parsed document, I want to extract the top two (or any two) skill sets. Could someone please tell me how that can be done?
As I suggested earlier, you can use the unprocessed strings (before converting them to documents) and extract the words that follow a certain sequence of words.
Another option would be to do the preprocessing, use the Term Neighborhood Extractor node to extract the N neighbouring terms, and then keep only the N terms that follow the “Skills” term.
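The "N terms following Skills" idea can be sketched outside KNIME like this. It is plain Python, not the Term Neighborhood Extractor itself; the tokenisation rule, the function name `terms_after` and the sample text are assumptions made for the example.

```python
import re


def terms_after(text, anchor, n):
    """Return the n terms that follow the first `anchor` (case-insensitive)."""
    # Split on whitespace, commas, semicolons and colons so that tokens
    # such as ".NET" or "C++" stay intact.
    tokens = [t for t in re.split(r"[\s,;:]+", text) if t]
    lowered = [t.lower() for t in tokens]
    if anchor.lower() not in lowered:
        return []
    i = lowered.index(anchor.lower())
    return tokens[i + 1:i + 1 + n]


# Hypothetical resume content, only for illustration.
resume = "Summary: developer. Skills: Java, .NET, SQL, Python"
print(terms_after(resume, "Skills", 2))  # ['Java', '.NET']
```

With n = 2 this gives "any two" skills per document, provided every resume actually introduces its skills with the same anchor word.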
It won’t be that easy, I guess. Provided you’ve successfully imported the files as Documents, you can indeed preprocess them, in particular by removing stop words and normalizing case. Think carefully before applying any generic correction for punctuation, numbers and accents, since names such as .NET or C++ contain punctuation that a generic filter would strip. Then, using TF-IDF, you can assess which words are most informative; this compares the terms across all parsed documents. However, those words won’t necessarily be skills.
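To illustrate what TF-IDF does here, a small plain-Python sketch follows. It uses the textbook formula (relative term frequency times log(N / document frequency)); KNIME's TF and IDF nodes may use a different variant, and the three tiny "resumes" are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already lower-cased and tokenised resumes.
docs = [
    "experienced java developer java spring".split(),
    "experienced dotnet developer sql".split(),
    "experienced python developer".split(),
]
for doc_scores in tf_idf(docs):
    top_two = sorted(doc_scores, key=doc_scores.get, reverse=True)[:2]
    print(top_two)
```

Words that occur in every document ("experienced", "developer") score zero, so the distinctive terms rise to the top. The third document shows the caveat: it has only one distinctive term, so its second "top" word is a zero-score filler rather than a skill.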
If there are particular skills you’re looking for, create a table of them and tag them in the documents using the Dictionary Tagger. Now you can filter out all untagged words. Voilà …
Final note: some applicants fool the process by inserting desired skill sets in a transparent or background font color. The parser will still pick these words up, even though you would not detect them when reading the résumés manually.
Sir/Madam, how can I filter out all the untagged words?
In the Dictionary Tagger, tick the box “Set named entities unmodifiable”. Then you can apply a very narrow Tag Filter, RegEx Filter or N Chars Filter, making sure the filter does not ignore the unmodifiable tag. That way the tagged terms survive and everything else is removed.
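The net effect of tagging with a dictionary and then dropping everything untagged can be sketched as follows. This is plain Python, not the Dictionary Tagger node; the skill list, the function name `keep_dictionary_terms` and the sample sentence are assumptions for illustration, and only single-word skills are handled.

```python
import re


def keep_dictionary_terms(text, skills):
    """Return the dictionary skills found in `text`, in order of first appearance."""
    # Map lower-cased skill -> canonical spelling so matching ignores case.
    canonical = {s.lower(): s for s in skills}
    found = []
    for token in re.split(r"[\s,;:()]+", text):
        # Drop a trailing full stop so ".net." still matches ".NET".
        skill = canonical.get(token.lower().rstrip("."))
        if skill and skill not in found:
            found.append(skill)
    return found


# Hypothetical skill dictionary and resume sentence.
skills = ["Java", ".NET", "SQL", "Python"]
resume = "Worked five years with java and SQL; some exposure to .net."
print(keep_dictionary_terms(resume, skills)[:2])  # ['Java', 'SQL']
```

Slicing the result with `[:2]` gives two skills per document, which is what the original question asked for.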