Term Neighborhood extractor

Hello
I’m trying to get the N words surrounding a specific term in a text. The Term Neighborhood Extractor node works fine, but it forces me to use a post-filter to keep only the neighbourhoods I need. Also, when the texts to be searched are very long, the number of records produced is excessively large, and it takes a long time.
Is there a node like the Term Neighborhood Extractor that only extracts the neighbourhoods of specific terms? Or is there a way to improve that node?
Thanks

Hi @lsandinop

I’m not aware of a node similar to the -Term Neighborhood Extractor- node that could do the same job with the constraint you mention, i.e. extracting only the neighbourhoods of specific terms.

I do not think that implementing this by hand (which is possible in KNIME using, for instance, the -Lag Column- node) would be more efficient, so I am not suggesting that approach here because it would not be of much help.

However, I have tried the -Term Neighborhood Extractor- node and noticed that it does not take advantage of several CPU cores: it does not run multithreaded. Given this, and if your text is made of several sentences, I would suggest splitting it into sentences, or at least into big chunks of text, as many as the cores (or threads) your computer has, and then running as many -Term Neighborhood Extractor- nodes in parallel. For instance, if your computer has 4 threads, you could parallelize the term neighborhood extraction as follows:

In this example, the term neighborhood extraction is done in parallel by 4 extraction nodes and is hence up to 4 times faster than doing it all with a single -Term Neighborhood Extractor- node. You can adapt this solution to your own number of CPU threads to take maximum advantage of them.
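Outside KNIME, the same split-then-run-in-parallel idea can be sketched in Python. This only illustrates the partitioning logic and says nothing about how the node works internally: `extract_neighborhoods`, `split_into_chunks` and `parallel_extract` are hypothetical helper names, tokenization is assumed to be a plain whitespace split, and the chunks are lists of whole sentences so that no neighbourhood is cut at a chunk boundary.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_neighborhoods(sentences, term, n):
    """Return (left words, term, right words) for every hit of `term`."""
    hits = []
    for sentence in sentences:
        words = sentence.split()  # assumption: whitespace tokenization
        for i, word in enumerate(words):
            if word.lower() == term.lower():
                hits.append((words[max(0, i - n):i], word, words[i + 1:i + 1 + n]))
    return hits


def split_into_chunks(items, k):
    """Split `items` into at most k contiguous, roughly equal chunks."""
    k = max(1, min(k, len(items)))
    size, extra = divmod(len(items), k)
    chunks, start = [], 0
    for j in range(k):
        end = start + size + (1 if j < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_extract(sentences, term, n, workers=None):
    """Run one extraction per chunk and merge the results in order."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(sentences, workers)
    # Threads keep the sketch simple; for CPU-bound work in CPython a
    # ProcessPoolExecutor would be needed to really use several cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: extract_neighborhoods(c, term, n), chunks))
    return [hit for chunk_hits in results for hit in chunk_hits]
```

Because the chunks are contiguous and the results are merged in chunk order, the output matches what a single sequential pass would produce.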

Hope this is clear enough and helps. Otherwise, please reach out again for extra help.

Best

Ael

Hi @lsandinop

Just wondering whether the solution I posted here helped you solve your efficiency problem with the -Term Neighborhood Extractor- node?

Your feedback would be very much appreciated :slight_smile:

Best
Ael

Hello
Sorry for the delay in replying. I appreciate your input, but I really do need the words surrounding a specific one, not the whole sentence. Anyway, your solution helped me to implement a loop that iterates over all the records.
Thank you very much
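For readers outside KNIME, a loop of this kind could look roughly like the Python sketch below. It is not the workflow described above, only an illustration under stated assumptions: `neighborhoods_for_terms` is a hypothetical helper, records are plain strings, and words are tokenized with a simple `\w+` regular expression. It keeps only the neighbourhoods of the requested terms, so no post-filter is needed.

```python
import re


def neighborhoods_for_terms(records, terms, n):
    """Yield (record index, term, left words, right words) for wanted terms only."""
    wanted = {t.lower() for t in terms}
    for row, text in enumerate(records):
        words = re.findall(r"\w+", text)  # assumption: simple word tokenization
        for i, word in enumerate(words):
            if word.lower() in wanted:
                # Slices are clipped at the start and end of the record.
                yield row, word, words[max(0, i - n):i], words[i + 1:i + 1 + n]
```

Writing it as a generator means that long texts never produce a large intermediate table of unwanted neighbourhoods.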

Thank you @lsandinop for your feedback. If you think this question is solved, please feel free to mark :white_check_mark: the post containing the solution as solved so that other people can find it more easily when searching for this topic.

Best
Ael

@aworker Do you know whether this is the case by default? E.g. if I break my flow into 4 partitions, does KNIME automatically leverage 4 cores? Is that documented somewhere?
Thanks and br

Hi @Daniel_Weikert

Normally yes. I do not know whether this is documented, but from experience I can say that this is the case. So much so that if you do not want parallel branches of the same workflow to use several cores, it is better to force them to run sequentially by adding flow variable connections between them as dependencies, which establishes the order in which they run.

Besides this, I often check how many cores each node uses while it runs. For that I use the Resource Monitor utility in Windows, as in the example below, where you can see that the -RDKit Substructure Filter- node is using all the cores:

This allows me to optimize the use of nodes and the performance of the whole workflow based on whether they can implicitly handle parallelism or not. This is partially explained in the following post:

Hope this helps.

Best,
Ael

Yes, that helps, thanks a lot. I always like to speed things up, so anything that can be parallelized: go for it.
Thanks for your detailed explanation post (as always), highly appreciated!
br
