I have tagged my dataset based on a pre-defined dictionary. Now I would like to extract all surrounding words (+/- 5) around each tag. I have tried to do so with the Term Neighborhood Extractor, yet this node does this for every word and not only for the ones that are tagged. Technically I could use a row filter after the Term Neighborhood node, but this will require a lot of computational power since my data set is very large.
So here is an example:
“I’ve been a customer[Tag] at this shop[Tag] for about three years now. Even though I currently live in Los Angeles[Tag], I find myself visiting the shop[Tag] at least once a month.”
For each word that has a [Tag], I’d like to extract the surrounding words (+/- 5), plus the tagged word itself.
I do not think it is possible to get the neighbours of only selected words using the Term Neighborhood Extractor. Since you’re dealing with a large dataset, here are a couple of approaches that might help:
Parallel Processing: To handle large datasets more efficiently, consider splitting your text into chunks and running the extraction process in parallel. This can help speed up processing by leveraging multiple CPU cores. You can find more information on how to set up parallel processing in KNIME here.
Regex Extraction: For extracting specific contexts around your tagged words, you can use regular expressions. You can get details on how to use regex in KNIME with an example here.
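To illustrate the regex idea outside of KNIME, here is a minimal Python sketch. It assumes the tags appear inline as a literal "[Tag]" suffix, exactly as in the example sentence above; the names TOKEN and tagged_contexts are made up for this illustration and are not part of any KNIME node. Note that for a multi-word tag such as "Los Angeles[Tag]" only the last word is treated as tagged, and punctuation is dropped during tokenisation.

```python
import re

# Assumption: a tagged word is written as word[Tag], as in the example post.
# A word is a run of letters/digits, optionally with inner apostrophes (I've).
TOKEN = re.compile(r"(?P<word>[^\W_]+(?:['’][^\W_]+)*)(?P<tag>\[Tag\])?")

def tagged_contexts(text, window=5):
    """Return (words_before, tagged_word, words_after) for every tagged word."""
    tokens = [(m.group("word"), m.group("tag") is not None)
              for m in TOKEN.finditer(text)]
    results = []
    for i, (word, is_tagged) in enumerate(tokens):
        if not is_tagged:
            continue  # only tagged words get a context window
        before = [w for w, _ in tokens[max(0, i - window):i]]
        after = [w for w, _ in tokens[i + 1:i + 1 + window]]
        results.append((before, word, after))
    return results

sample = "I've been a customer[Tag] at this shop[Tag] for about three years now."
for before, word, after in tagged_contexts(sample):
    print(before, word, after)
```

Because the text is tokenised once and windows are taken by index, the work grows linearly with the text length, which avoids building neighbourhoods for every untagged word.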
Hello @annikawagner
Did you get any progress with this challenge?
Because of some recent challenges in the forum, I’ve been experimenting quite a lot with regex extraction, and I can put some nodes together to sketch a solution for your challenge.
You can consider this workflow a draft, as I made some assumptions about how the logical rules are handled:
a ‘tag’ word is not counted as part of the surrounding group, so two tag words close to each other will reduce the word window.
punctuation is ignored…
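For readers who want to see these two rules in code form, here is a small Python sketch of one possible reading of them. It is not the workflow itself: the function name neighbours_excluding_tags and the way the dictionary is passed in as a plain set of words are assumptions for illustration. The window covers +/- 5 word positions, and any other tag word falling inside it is dropped, so the returned group gets shorter.

```python
import string

def neighbours_excluding_tags(text, tag_words, window=5):
    """For each dictionary word in text, return (word, before, after)."""
    # Rule 2: punctuation is ignored. Strip it from both ends of each
    # whitespace-separated token and drop tokens that become empty.
    strip_chars = string.punctuation + "“”‘’"
    words = [w for w in (t.strip(strip_chars) for t in text.split()) if w]
    tags = {t.lower() for t in tag_words}
    rows = []
    for i, word in enumerate(words):
        if word.lower() not in tags:
            continue
        lo, hi = max(0, i - window), i + 1 + window
        # Rule 1: other tag words inside the window are not neighbours,
        # so two tag words close together shrink the group.
        before = [w for w in words[lo:i] if w.lower() not in tags]
        after = [w for w in words[i + 1:hi] if w.lower() not in tags]
        rows.append((word, before, after))
    return rows

text = "I have been a customer at this shop for about three years now."
for row in neighbours_excluding_tags(text, {"customer", "shop"}):
    print(row)
```

If the window should instead be refilled to a full five non-tag words on each side, the slicing would need to be replaced by a loop that keeps walking until enough words are collected.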
Once the functional requirements are clarified, we can wrap everything into a component. A component can act like a function, where you define the tag word [1] and the word window [2] as inputs…
This is the output of the workflow for the use case you described:
Thank you so much for your effort - looks nice! I have actually solved my problem by performing the task in Python and then adding the resulting file as an input to KNIME.
Let me try it with your workflow - this will for sure make the process easier (I will need some time to test it out).