Extraction of tagged words from a tagged document

Hello everyone,

My goal is to tag hundreds of documents with tags from a predefined tag list, so that I can later assign the tags in a literature database.

Everything up to the tagging of the document works fine. I have a list of tags, each with multiple keywords, and a tag is assigned wherever one of its keywords appears in the text.

I then plan to count how often the keywords of each tag occur, and assign the tags based on those counts.

However, I cannot find a suitable method for counting the number of keywords in a tagged document. From my searching, Tag Filter → Bag Of Words Creator → GroupBy seems to be the way, but I cannot find any references that work with a custom set of tags.
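To make the intended counting step concrete, here is a minimal Python sketch of the idea outside KNIME. It is not the workflow itself: the tag names, keywords, sample text, and the `min_count` threshold are all hypothetical, and it only handles single-word keywords matched as whole, case-insensitive tokens.

```python
import re
from collections import Counter

# Hypothetical tag list: tag -> keywords (not taken from the attached files).
TAG_KEYWORDS = {
    "materials": ["alloy", "lattice"],
    "mechanics": ["energy", "compression"],
}


def count_tag_hits(text, tag_keywords):
    """Count whole-word, case-insensitive keyword hits and sum them per tag."""
    # Tokenize once, then look each keyword up in the token counts.
    tokens = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {
        tag: sum(tokens[kw.lower()] for kw in keywords)
        for tag, keywords in tag_keywords.items()
    }


def assign_tags(counts, min_count=2):
    """Keep the tags whose summed keyword count reaches a hypothetical threshold."""
    return sorted(tag for tag, n in counts.items() if n >= min_count)


text = (
    "The alloy lattice absorbs energy. "
    "The lattice recovers after compression of the alloy."
)
counts = count_tag_hits(text, TAG_KEYWORDS)
print(counts)
print(assign_tags(counts, min_count=3))
```

The per-tag sum corresponds to what a group-by on the tag column would produce from a table of term counts; the threshold is just one possible assignment rule.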

Attached are my current workflow and an example of a tagged document.

Is there any method I can use to extract and count the occurrences?

Thanks a lot in advance!

Hi @M_FL and welcome to the forum.

Any chance you can upload the workflow itself with some sample input files (assuming they are not proprietary)? Then maybe community members would be able to provide better assistance.

Hello @M_FL and welcome to the KNIME Community!

You can take a look at the workflow referenced in this topic for ideas. It’s a very rough draft, but it lets you count tagged words from the output.

You can loop this workflow, adapted to your requirements, stepping through a reference list of tagged items.

As @ScottF suggested, if you have some sample data, we can take a look at it.

BR

Hello Everyone,

First of all, thanks a lot for your replies! And sorry for my late reply, I was struggling with the white-screen/workspace issue from the latest patch.

Attached are the workflow, one example paper, and an Excel file with the tags in row 1 and the keywords listed below their tag. To get it running I reduced it to 3 tags plus keywords. They do not necessarily make sense, but they appear frequently in the paper.
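For readers without the file, the layout described above (tags in the first row, keywords in the cells below) can be turned into a tag-to-keywords mapping roughly like this. It is only a sketch: the rows are made-up stand-ins for the sheet contents, and `columns_to_tag_keywords` is a hypothetical helper name.

```python
def columns_to_tag_keywords(rows):
    """Turn a column-wise sheet (first row = tags) into {tag: [keywords]}."""
    header, *body = rows
    result = {str(tag).strip(): [] for tag in header if tag}
    for row in body:
        for tag, cell in zip(header, row):
            # Columns can have different lengths, so skip empty cells.
            if tag and cell is not None and str(cell).strip():
                result[str(tag).strip()].append(str(cell).strip())
    return result


# Hypothetical sheet contents; None marks an empty cell in a shorter column.
rows = [
    ["Tag A", "Tag B", "Tag C"],
    ["alloy", "energy", "lattice"],
    ["NiTi", "compression", None],
    ["shape memory", None, None],
]
tag_keywords = columns_to_tag_keywords(rows)
print(tag_keywords)
```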

automatisiertes_tagging_v3.knwf (185.3 KB)

keywords_transposed_grouped_Example.xlsx (8.6 KB)

NiTi alloy helical lattice structure with high reusable energy-compressed.pdf (863.4 KB)

Please excuse the description and maybe some random comments, it’s all a bit mangled together from different examples I tried to build on.

@gonhaddock I will take a deep look at the example workflow, thanks!

BR and thanks a lot in advance!


Hello @M_FL
I’m taking a look at it.
Be aware that some functions in the example workflow are not necessary for your use case. That workflow also extracts the #heading and #trailing words, which your challenge does not need, since your target is just the tagged word…

I will test different approaches besides regex coding.

BR

Hello @M_FL
I’ve tested counting the tags with two approaches, which give slightly different results. The source text is a complex PDF document, so for simplicity I would rely on the ‘Unique Term Extractor’ node.
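The reason for the differences is not stated here. As an illustration only, and as an assumption on my side rather than a diagnosis of the attached workflow, two counting methods can disagree on PDF text when one matches substrings and the other matches whole tokens, or when a word is hyphenated across a line break. The sample text and function names below are hypothetical.

```python
import re


def count_substring(text, keyword):
    """Count every occurrence of the keyword, including inside longer words."""
    return text.lower().count(keyword.lower())


def count_whole_word(text, keyword):
    """Count only occurrences bounded by non-word characters on both sides."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


# Hypothetical text with a plural, a compound, and a hyphenated line break.
text = (
    "Lattice structures: the lattices absorb energy; "
    "a sublattice and a lat-\ntice fragment."
)
print(count_substring(text, "lattice"))
print(count_whole_word(text, "lattice"))
```

Neither function finds the hyphenated "lat-\ntice", which is one more way counts from extracted PDF text can drift from what a reader would count by eye.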


20250724_counting_tag_occurrences_v00.knwf (1.0 MB)

BR

