How to parse PDF and Word resumes to extract two skill sets from each

Knimeforum123 · July 19, 2018, 11:28am

For example, a resume might list Java, .NET and a few other skills. How can I extract them from each document in a folder with KNIME?

amartin · July 23, 2018, 10:02am

Hi @Knimeforum123,

You can use the Tika Parser node to extract the textual content of the files in a folder; the content is output as a string. You can then use the String Manipulation node or the new Column Expressions node to extract a string of length x after certain keywords (e.g., “Skill set”) using the indexOf and substr functions.
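To illustrate the idea outside KNIME, here is a minimal sketch in plain Python (not KNIME expression syntax). The helper name `text_after_keyword` and the sample resume text are made up for illustration; `str.find` plays the role of indexOf and slicing plays the role of substr.

```python
def text_after_keyword(text: str, keyword: str, length: int) -> str:
    """Return up to `length` characters that follow the first `keyword`."""
    # Analogue of indexOf: position of the keyword (case-sensitive here)
    start = text.find(keyword)
    if start == -1:
        return ""  # keyword not present in this resume
    begin = start + len(keyword)
    # Analogue of substr: fixed-length slice right after the keyword,
    # with separators such as ": " trimmed from both ends
    return text[begin:begin + length].strip(" :\n\t")


# Hypothetical resume text, as it might come out of a parser
sample = "Name: Jane Doe\nSkill set: Java, .NET, SQL\nEducation: BSc"
snippet = text_after_keyword(sample, "Skill set", 17)
print(snippet)
```

Note that a fixed length x is fragile: too small and the skill list is cut off, too large and text from the next section leaks in.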

Best,
Anna

Knimeforum123 · July 23, 2018, 11:14am

I was able to read the documents successfully, and I have done the pre-processing too. It is the extraction of skill sets that I am having trouble with. From every parsed document, I want to extract the top two (or any two) skill sets. Could someone please tell me how that can be done?

amartin · July 30, 2018, 12:41pm

Hi @Knimeforum123,

as I suggested earlier, you can use the unprocessed strings (before converting them to documents) to extract the words that follow a certain sequence of words.

Another option would be to preprocess the documents, use the Term Neighborhood Extractor node to extract the N neighbouring terms, and then keep only the N terms that follow the “Skills” term.
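As a rough sketch of the "N terms after Skills" idea, here is the logic in plain Python. It does not reproduce how the Term Neighborhood Extractor node works internally; the function name `terms_after`, the tokenisation rule and the sample text are assumptions for illustration only.

```python
import re


def terms_after(text: str, anchor: str, n: int) -> list[str]:
    """Return the n terms that directly follow the first `anchor` term."""
    # Split on whitespace, commas, semicolons and colons only, so that
    # terms such as ".NET" or "C#" are not broken apart
    tokens = [t for t in re.split(r"[\s,;:]+", text) if t]
    lowered = [t.lower() for t in tokens]
    if anchor.lower() not in lowered:
        return []  # the anchor term does not occur in this document
    i = lowered.index(anchor.lower())
    return tokens[i + 1 : i + 1 + n]


# Hypothetical resume text
resume = "Experience: 5 years. Skills: Java, .NET, SQL, Docker"
top_two = terms_after(resume, "Skills", 2)
print(top_two)
```

This only works when the resume actually has a "Skills" heading followed directly by the skills themselves.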

Best,
Anna

Geo · June 30, 2019, 11:51am

It won’t be that easy, I guess. Provided you’ve successfully imported the files as Documents, you can indeed preprocess them, in particular by removing stop words and converting case; think carefully before applying any generic correction for punctuation, numbers and accents. Then, using TF-IDF, you can assess which words are most informative; this compares the terms across all parsed documents. However, those words won’t necessarily be skills.
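To make the TF-IDF step concrete, here is a small sketch in plain Python. It uses one common variant (relative term frequency times the natural log of N / document frequency), which may differ from what the KNIME nodes compute; the function name `tf_idf` and the three toy "resumes" are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score every term of every tokenised document with TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # relative term frequency * inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy, already preprocessed documents (hypothetical data)
docs = [
    ["java", "spring", "team", "player"],
    ["python", "pandas", "team", "player"],
    ["java", "sql", "team", "leader"],
]
scores = tf_idf(docs)
# Most informative term of the first document under this variant
best_first = max(scores[0], key=scores[0].get)
print(best_first)
```

A term that occurs in every document ("team" here) scores zero, while a rare non-skill word such as "leader" scores as high as a rare skill, which is exactly the limitation mentioned above.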

If there are particular skills you’re looking for, create a table of them and tag them in the documents using the Dictionary Tagger. Now you can filter out all untagged words. Voilà …
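The dictionary approach boils down to the following logic, sketched here in plain Python rather than with KNIME nodes. The skill list `SKILLS`, the function name `first_skills` and the sample sentence are hypothetical; the sketch handles single-word skills only.

```python
import re

# Hypothetical skill dictionary, stored in lower case
SKILLS = {"java", ".net", "python", "sql", "docker"}


def first_skills(text: str, dictionary: set[str], limit: int = 2) -> list[str]:
    """Return the first `limit` distinct dictionary terms found in `text`."""
    found: list[str] = []
    # Split on whitespace and a few separators; leading dots are kept
    # so that ".NET" still matches the dictionary entry ".net"
    for token in re.split(r"[\s,;:()]+", text):
        term = token.lower().rstrip(".")  # drop sentence-final periods
        if term in dictionary and term not in found:
            found.append(term)
            if len(found) == limit:
                break
    return found


# Hypothetical resume sentence
resume_text = "Worked with Java and .NET; some SQL and Docker."
two_skills = first_skills(resume_text, SKILLS)
print(two_skills)
```

Every word that is not in the dictionary is simply dropped, which corresponds to filtering out the untagged terms.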

Final note: some applicants fool the process by inserting desired skill sets in a transparent or background-coloured font, so the parser picks them up even though you would not see them when reading the resumés manually.

SIDD · June 30, 2019, 5:21pm

Sir/Madam, how do I filter out all untagged words?

Geo · June 30, 2019, 6:22pm

In the Dictionary Tagger, tick the box “Set named entities unmodifiable”. Then you can apply a very narrow Tag Filter, RegEx Filter or N Chars Filter; just be sure that the filter node does not ignore the unmodifiable tag.
