How to parse PDF and Word resumes to extract two skills from each

#1

For example, suppose a resume lists Java, .NET and a few other skills. How can I extract them from each document in a folder with KNIME?

#2

Hi @Knimeforum123,

You can use the Tika Parser node to extract the textual content of the files in a folder, which will be output as a string. You can then use the String Manipulation node or the new Column Expressions node to extract a string of length x after certain keywords (e.g., “Skill set”) using the indexOf and substr functions.
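Outside KNIME, the same indexOf/substr idea can be sketched in plain Python. This only illustrates the logic and is not the syntax of either node; the function name `text_after_keyword`, the sample resume text and the length of 12 are made up for the example.

```python
def text_after_keyword(text: str, keyword: str, length: int) -> str:
    """Return up to `length` characters that follow the first `keyword`.

    Returns an empty string when the keyword does not occur.
    """
    # Case-insensitive search, the analogue of indexOf.
    # Assumes lower-casing does not change the string length (true for ASCII text).
    start = text.lower().find(keyword.lower())
    if start == -1:
        return ""
    begin = start + len(keyword)
    # Slicing is the analogue of substr(begin, length).
    return text[begin:begin + length].strip(" :\n\t")


# Hypothetical resume text, as it might come out of a parser.
resume = "Name: Jane Doe\nSkill set: Java, .NET, SQL\nEducation: BSc"

snippet = text_after_keyword(resume, "Skill set", 12)
# Split the snippet on commas and keep the first two non-empty entries.
skills = [s.strip() for s in snippet.split(",") if s.strip()][:2]
print(skills)
```

A fixed length x is fragile: if it is too short a skill gets cut in half, and if it is too long the next section leaks in, so some tuning per resume layout is likely needed.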

Best,
Anna

#4

I was able to read the documents successfully, and I have done the pre-processing too. It is the extraction of skills that I am having trouble with. From every parsed document, I want to extract the top two, or any two, skills. Could someone please tell me how that can be done?

#5

Hi @Knimeforum123,

As I suggested earlier, you can use the unprocessed strings (before converting them to documents) and extract the words that follow a certain sequence of words.

Another option would be to preprocess the documents, use the Term Neighborhood Extractor node to extract the N neighbouring terms of each term, and then keep only the N terms that follow the “Skills” term.
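To make the "N terms after Skills" idea concrete, here is a rough Python sketch of the logic. It does not reproduce the Term Neighborhood Extractor node; the function `terms_after`, the simple tokenizer and the sample sentence are assumptions for illustration only.

```python
from __future__ import annotations

import re


def terms_after(text: str, anchor: str, n: int) -> list[str]:
    """Return the n terms that follow the first occurrence of `anchor`."""
    # The pattern keeps '+', '#' and '.' so that "C++", "C#" and ".NET" stay intact.
    raw = re.findall(r"[A-Za-z0-9+#.]+", text)
    # Drop sentence-final periods and any tokens that become empty.
    tokens = [t.rstrip(".") for t in raw]
    tokens = [t for t in tokens if t]
    for i, tok in enumerate(tokens):
        if tok.lower() == anchor.lower():
            return tokens[i + 1:i + 1 + n]
    return []


# Hypothetical resume line.
line = "Profile. Skills: Java, C#, .NET and SQL."
print(terms_after(line, "skills", 2))
```

Note that this simply takes whatever follows the anchor term, so a layout like "Skills and interests" would return "and" and "interests" unless stop words are removed first.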

Best,
Anna

#7

It won’t be that easy, I guess. Provided you’ve successfully imported the files as Documents, you can indeed preprocess them, in particular by removing stop words and normalizing case. Think carefully before applying any generic correction for punctuation, numbers and accents, though, because skill names such as “C++” or “.NET” depend on them. Then, using TF-IDF, you can assess which words are most informative; this weighs how often a term occurs in one document against how many of the parsed documents contain it. However, the words won’t necessarily be skills.
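A small Python sketch of the TF-IDF idea shows why the top-scoring words need not be skills. It uses one common variant (relative term frequency times the natural log of N divided by document frequency), which may differ from what the KNIME nodes compute; the function `tf_idf` and the three toy documents are made up.

```python
from __future__ import annotations

import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term of each tokenized document with tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already preprocessed resumes.
docs = [
    ["java", "sql", "team", "player"],
    ["java", "python", "hiking"],
    ["java", "excel", "team"],
]

scores = tf_idf(docs)
for doc_scores in scores:
    # Two highest-scoring terms per document.
    print(sorted(doc_scores, key=doc_scores.get, reverse=True)[:2])
```

In this toy corpus "java" appears in every document and therefore scores zero, while "hiking" scores exactly as high as "python". That is the limitation mentioned above: TF-IDF rewards rare words, not skills.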

If there are particular skills you’re looking for, create a table of them and tag them in the documents using the Dictionary Tagger. Now you can filter out all untagged words. Voilà …
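The dictionary approach boils down to the following logic, sketched here in Python rather than with the Dictionary Tagger itself. The skill list, the function `tagged_skills` and the sample text are hypothetical, and the sketch only handles single-word skills.

```python
from __future__ import annotations

import re

# Hypothetical dictionary of skills of interest, stored in lower case.
SKILLS = {"java", ".net", "sql", "python", "c#"}


def tagged_skills(text: str, dictionary: set[str], limit: int = 2) -> list[str]:
    """Return the first `limit` distinct dictionary terms found in `text`."""
    # Keep '+', '#' and '.' so "C++", "C#" and ".NET" are not split apart.
    tokens = [t.rstrip(".") for t in re.findall(r"[A-Za-z0-9+#.]+", text)]
    found: list[str] = []
    seen: set[str] = set()
    for tok in tokens:
        if len(found) >= limit:
            break
        key = tok.lower()
        # Only dictionary terms are kept; every other word is dropped.
        if key in dictionary and key not in seen:
            seen.add(key)
            found.append(tok)
    return found


resume = "Worked with Java and .NET; some SQL. Java again."
print(tagged_skills(resume, SKILLS))
```

Taking the first two matches is just one reading of "any two skills"; ranking the matches by frequency would be an alternative.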

Final note: some applicants fool the process by inserting desired skills in a transparent or background-colored font. The parser picks these words up even though you could not detect them when reading the résumés manually.

#8

Sir/Madam, how do I filter out all untagged words?

#9

In the Dictionary Tagger, tick the box “Set named entities unmodifiable”. Then you can apply a very narrow Tag Filter, RegEx Filter or N Chars Filter; be sure the filter is not set to ignore the unmodifiable tag, so that the tagged terms are protected while the untagged words are removed.
