Could the TF node incorporate parallel processing?

AngusVeitch · January 20, 2021, 2:32am

As anyone who processes large text collections will know, the TF node can be painfully slow to execute, to the point that it becomes a major bottleneck in my workflow, and I sometimes resort to using DF instead. (As I write this, the TF node has been chewing through a collection of 11,000 news articles for several minutes and is not even half done.) I assume the slowness results from the documents being processed one after another rather than in parallel. I make this assumption because the processing can be sped up considerably by wrapping the node in a parallel chunk loop.

Given that many other text processing nodes have inbuilt parallel processing options, is there any reason why the TF node does not? And parallel processing aside, might there be any room for other efficiency improvements in this node?
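For readers outside KNIME, the chunk-loop workaround described above can be sketched in plain Python. This is only an illustration of the general idea (split the documents into chunks, process the chunks concurrently, reassemble the results in order), not how the TF node or the parallel chunk loop is actually implemented. The function names, the whitespace tokenizer, and the choice of relative term frequency are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def term_frequencies(doc):
    """Relative term frequency for one document (assumed: lowercased whitespace tokens)."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()} if total else {}


def process_chunk(chunk):
    """Compute term frequencies for every document in one chunk."""
    return [term_frequencies(doc) for doc in chunk]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_term_frequencies(docs, chunk_size=2, workers=4,
                              executor_cls=ThreadPoolExecutor):
    """Process chunks concurrently; results come back in the original document order."""
    # ThreadPoolExecutor keeps this sketch portable. For CPU-bound tokenizing in
    # CPython, ProcessPoolExecutor (same interface) is the usual choice.
    with executor_cls(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunked(docs, chunk_size))
        return [tf for chunk_result in results for tf in chunk_result]


# Hypothetical toy corpus standing in for the news articles.
docs = ["the cat sat", "the dog and the cat", "a bird"]
tf_per_doc = parallel_term_frequencies(docs)
```

Because each document's term frequencies depend only on that document, the chunks share no state, which is why this kind of work usually parallelizes well.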

AngusVeitch · January 20, 2021, 3:01am

On a related note, I wonder if the Unique Term Extractor node could be modified to account for term tags, as this functionality would negate my need to use the TF node in many cases (specifically, when I want to count the overall TF of specific NE types), and the Unique Term Extractor is much, MUCH faster than the TF node.
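As a rough illustration of the tag-aware counting requested here (again plain Python, not KNIME code), the sketch below counts unique (term, tag) pairs across a corpus and optionally restricts the count to one tag type. The data layout, the tag names, and the `count_term_tag_pairs` helper are hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical tagged corpus: each document is a list of (term, tag) pairs,
# with None for terms that carry no named-entity tag.
tagged_docs = [
    [("Merkel", "PERSON"), ("Berlin", "LOCATION"), ("visited", None)],
    [("Berlin", "LOCATION"), ("Merkel", "PERSON"), ("Paris", "LOCATION")],
]


def count_term_tag_pairs(docs, tag=None):
    """Count how often each tagged (term, tag) pair occurs over all documents.

    If `tag` is given, only pairs with that tag are counted.
    """
    counts = Counter()
    for doc in docs:
        for term, term_tag in doc:
            if term_tag is not None and (tag is None or term_tag == tag):
                counts[(term, term_tag)] += 1
    return counts


location_counts = count_term_tag_pairs(tagged_docs, tag="LOCATION")
all_counts = count_term_tag_pairs(tagged_docs)
```

A single pass like this yields the overall frequency of each entity of a given type without computing per-document term frequencies first.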

julian.bunzel · February 11, 2021, 7:19am

Hi @AngusVeitch,

Sorry about the late response.
Some text processing nodes have not been implemented to run in parallel, but it’s definitely something we can change. I’ll create tickets for that, plus one for adding an option to the Unique Term Extractor to allow extracting unique term-tag pairs.

Best,
Julian

system · June 2, 2023, 9:40pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.