Could the TF node incorporate parallel processing?

As anyone who processes large text collections would be aware, the TF node can be painfully slow to execute, to the extent that it can become a major bottleneck in my workflow, and I sometimes revert to using DF instead. (As I write this, the TF node has been chewing through a collection of 11,000 news articles for several minutes, and is not even half done.) I assume that the slowness results from every document being processed in succession rather than in parallel. I make this assumption because the processing can be sped up considerably by containing the node in a parallel chunk loop.
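
To illustrate why chunking helps, here is a rough Python sketch of the idea, not KNIME's actual implementation. It assumes that each document's term frequencies can be computed independently of every other document, and all function names (`term_frequencies`, `chunked`, `parallel_tf`) are hypothetical. Tokenization is a naive whitespace split.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def term_frequencies(doc):
    """Count how often each (lowercased, whitespace-split) term occurs in one document."""
    return Counter(doc.lower().split())


def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def tf_for_chunk(chunk):
    """Compute term frequencies for every document in one chunk."""
    return [term_frequencies(doc) for doc in chunk]


def parallel_tf(docs, chunk_size=2, workers=4):
    """Split `docs` into chunks and hand each chunk to a worker.

    Documents do not depend on one another, so the chunks can be
    processed concurrently and the results concatenated in order.
    """
    chunks = list(chunked(docs, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the output lines up with `docs`.
        results = list(pool.map(tf_for_chunk, chunks))
    return [tf for chunk_result in results for tf in chunk_result]


docs = ["a b a", "b c", "a"]
tf_parallel = parallel_tf(docs)
tf_sequential = [term_frequencies(doc) for doc in docs]
```

The chunked version returns the same counts as the sequential one. A thread pool is used here only to show the structure; in CPython, CPU-bound counting like this would need a process pool to see a real speedup.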

Given that many other text processing nodes have inbuilt parallel processing options, is there any reason why the TF node does not? And parallel processing aside, might there be any room for other efficiency improvements in this node?

On a related note, I wonder if the Unique Term Extractor node could be modified to account for term tags. This would remove my need to use the TF node in many cases (specifically, when I want to count the overall TF of specific NE types), and the Unique Term Extractor is much, MUCH faster than the TF node.

Hi @AngusVeitch1,

Sorry about the late response.
Some text processing nodes have not been implemented to run in parallel, but it’s definitely something we can change. I’ll create tickets for that, plus one for adding an option to the Unique Term Extractor to allow extracting unique term-tag pairs.

Best,
Julian
