Extract all instances of term from string

RIchardC · August 5, 2022, 1:39pm

I have a string that contains the content of a book. I want to output every instance of a certain term in that book (plus ~100 characters before and after).

I started with a column expression: indexOf(column("Content") , variable("search_word"))

That gets me the first instance, but I suspect an entirely different approach might be preferable to get all instances.

I’ve toyed with Dictionary Tagger and Document Data Extractor, but I’d like to know if I’m jumping into the right rabbit hole.

All input will be appreciated.

Cheers, Richard

ArjenEX · August 5, 2022, 3:08pm

Hi @RIchardC

Could you please enrich your post with some screenshots, sample data, expected output, your workflow (excluding sensitive data), etc.? The more you provide, the more accurately people can help you.

How important is the look-behind/look-ahead of about 100 chars? I’m assuming you want to include the context in which the term is used. Would the sentence where the term occurs, plus some of its “neighbours”, also be sufficient? In that case you could opt for a more straightforward approach: split the string on periods and then evaluate each sentence.

For example, if I take Lorem Ipsum and search for the term reprehenderit, it is found in rowID 2.

With some basic operations I’m able to define that I want to output rowID 2 and the previous one (rowID 1) to get more context of the preceding text.

Let me know if this would be a feasible solution for your use case, and I will elaborate on the details. Otherwise, we’ll continue to cook up another solution.
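
As a rough illustration of the sentence-based idea, here is a minimal Python sketch (not KNIME nodes). The function name `sentences_with_term`, the `before` parameter, and the sample text are made up for this example, and the split on `.`, `!` or `?` followed by whitespace is a naive assumption that will trip over abbreviations such as "e.g.".

```python
import re

def sentences_with_term(text, term, before=1):
    """Return (sentence_index, context) for each sentence containing `term`.

    `context` is the matching sentence plus up to `before` preceding sentences.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Whole-word, case-insensitive match; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            context = sentences[max(0, i - before): i + 1]
            hits.append((i, " ".join(context)))
    return hits

# Hypothetical sample text, loosely based on Lorem Ipsum.
lorem = (
    "Lorem ipsum dolor sit amet. Ut enim ad minim veniam. "
    "Duis aute irure dolor in reprehenderit in voluptate. "
    "Excepteur sint occaecat."
)

for index, context in sentences_with_term(lorem, "reprehenderit"):
    print(index, context)
```

Indices here are zero-based, so they will not necessarily line up with KNIME rowIDs.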

Daniel_Weikert · August 5, 2022, 4:47pm

With the text of books, which are documents, jumping into the KNIME document nodes sounds like the right rabbit hole. “Follow the white rabbit, Neo.” (You have to be a little bit older to understand that joke.)

RIchardC · August 5, 2022, 7:28pm

Sorry I wasn’t more explicit. I’m using Tika Parser to read in the book, so the contents of the book are in a string column (named “contents”) and there is only one row.

I pipe that string into a column expressions node that looks like this:

This gives me the output I want, but only for the first instance of my search_word. Yet my search word could exist many times inside the book.

I’m wondering what is the best approach to extracting the same word from different parts of the same column/row.

Thanks.

P.S. The idea that Matrix references are for old people makes me feel really old. I’m still quoting Caddyshack.

victor_palacios · August 5, 2022, 7:38pm

Here’s a solution for your problem:

I use regex (it is my go-to strategy for most text-related problems like this).

Like you, I have a blob of text in a single row. I want to find all the context related to the word “in”. Context is defined as 10 characters before and after “in”. You can modify this.

Here is the Regex used:
.{10}\sin\s.{10}

In English:
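.{10} – Find any 10 characters (spaces are technically characters; by default . does not match line breaks)
\s – one whitespace character (space, tab, or newline)
in – target word
\s and .{10} – the same again on the other side of the target word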
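
One possible way to apply this to every occurrence, sketched in Python's `re` module rather than in a KNIME node. This is a variation on the pattern above, not the same regex: it matches only the term and then slices the context around each hit, so that hits closer together than the context width are not swallowed by an earlier match and hits near the start or end of the text are still reported. The names `find_with_context` and `width` and the sample sentence are assumptions made for illustration.

```python
import re

def find_with_context(text, term, width=100):
    """Return one snippet per occurrence of `term`, with up to `width`
    characters of context on each side."""
    # \b keeps "in" from matching inside "rain"; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)      # clamp at start of text
        end = min(len(text), match.end() + width)  # clamp at end of text
        snippets.append(text[start:end])
    return snippets

# Hypothetical sample text.
sample = "The cat sat in the hat. Later, in the rain, it went in again."

for snippet in find_with_context(sample, "in", width=10):
    print(repr(snippet))
```

Setting `width=100` would give roughly the 100 characters before and after that the original question asks for.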

P.S. The idea that Matrix references are for old people makes me feel really old. → Yes, please don’t make me feel so old haha. @Daniel_Weikert

For more on Regex, see:

victor_palacios · August 5, 2022, 7:41pm

Shameless self promotion:

I’ll also be hosting a free webinar for PDF text extraction using regex, knime, and python if you’re interested:

RIchardC · August 6, 2022, 1:33am

@victor_palacios, Thanks for the regex suggestions. I seem to have a black hole in my brain where my understanding of regex should be.

In many ways, I consider regex to be the antithesis of knime, which is visual, intuitive, and friendly while regex is cryptic, unfathomable, and vengeful.

badger101 · August 7, 2022, 7:46pm

Hi @RIchardC ,

May I know how many words are there altogether in the book/pdf?

Reason for asking: To suggest an efficient workflow that suits your data.

Additionally, do you need a search box that allows you to query for any word (plus ~100 characters before and after the queried word)?

Reason for asking: To determine whether a widget is needed for the workflow.

Going above and beyond, what would you like to see or do, once you’ve obtained the resultant table?

Reason for asking: To provide a complete workflow that meets your end goal.

Daniel_Weikert · August 8, 2022, 5:07pm

Not intended. You have great movie taste @victor_palacios
br

RIchardC · August 9, 2022, 9:11pm

@badger101
I’d estimate maybe 70,000 words in the average non-fiction book. I can create a search box, and I only need to see the resultant table, but I appreciate you going above and beyond.

Cheers, Richard

badger101 · August 10, 2022, 3:23am

Hi @RIchardC , thank you for the info regarding the size of the input.

I have tested a few things using a PDF file of a book containing around 150k words from cover to cover, which is double the size you’re aiming for.

For the issue you’re solving, there’s already a built-in node in KNIME that caters specifically to it: the Term Neighborhood Extractor node.

I have used this node quite a few times ever since I started with KNIME. I asked about the size because I know this node has performance issues when dealing with cells that contain too many sentences. In our case it’s even worse, since a PDF of a book parsed by Tika results in, as you noticed yourself, all sentences compiled into one cell.

For the PDF file I mentioned earlier (with 150k words), the node ran for 20 minutes without producing any results, so I stopped it and decided to find my own workaround.

Basically, the workaround starts with converting that one all-inclusive cell into a column where each row holds one word, using a space as the delimiter. This can be done with the Cell Splitter node, which in my case took literally only 1 second. In the configuration window of the Cell Splitter node, choose the ‘List’ output option instead of ‘new columns’. You’ll end up with one row holding a collection cell, which you then pass to the Ungroup node. If you’re familiar with the Bag of Words node, the resulting table looks similar. The only difference is that the BoW node lists all unique words, while the Cell Splitter lists every word in the order in which it appears in the book. (Caution: proper data cleaning must be done before all of this, but that is a topic in itself.)
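
To make the distinction concrete, here is a small Python sketch (an analogy only, not what the KNIME nodes do internally). The sample sentence is made up, and Python's `split()` breaks on any run of whitespace, whereas a Cell Splitter configured with a single space as delimiter would not treat tabs or line breaks the same way.

```python
# Hypothetical sample text standing in for the parsed book.
text = "the quick fox jumps over the lazy dog and the fox sleeps"

# Ordered word list: one entry per word, duplicates kept, original order
# preserved (roughly what Cell Splitter + Ungroup produce).
words = text.split()

# Unique words only: position in the book is lost, so this form cannot be
# used to rebuild the context around a term (roughly the Bag of Words view).
unique_words = sorted(set(words))

print(len(words), words[:6])
print(len(unique_words), unique_words)
```

Only the ordered list keeps enough information to look up the words before and after a given hit.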

From this list, you can start thinking of how to proceed. I have looked at your past threads in the KNIME forum, and I think you’ve had enough experience on how to proceed from here. The complete solution is doable, but unfortunately I won’t be available for the next 2 weeks. The other day when I wrote a reply to this thread, I had some free time.

In summary, you should now have a column of words ordered in sequence as they originally appeared in the book. That is all you need as a starter to create a workaround from.

I wish you all the best. If you still haven’t found a solution after 2 weeks from today, let me know by tagging my name here.

RIchardC · August 10, 2022, 12:36pm

Thanks, @badger101. That’s a great answer. I’ve used BoW before, but your explanation of BoW vs Cell Splitter is enlightening. I’m out myself right now, but when I get back to my Knime machine I’ll sort this out and see if I can contribute something to the documentation. Cheers.

RIchardC · August 13, 2022, 2:12am

@badger101 I’m following your recommendations and I now have each word in a separate row of a column. To read it in context, I could calculate plus and minus, say, 20 words based on row numbers. Then I could create a column that concatenates (with spaces) those 40 words.

Any suggestions on which nodes I could use to grab the previous 20 and next 20 words would be appreciated.
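
The windowing idea itself can be sketched in a few lines of Python (again as an illustration rather than a KNIME workflow). The names `word_windows` and `radius` are made up for this example, and stripping a fixed set of punctuation characters before comparing is a simplifying assumption.

```python
def word_windows(words, term, radius=20):
    """Return (row_index, context) for every word equal to `term`.

    `context` joins up to `radius` words before and after the hit with spaces.
    """
    term = term.lower()
    windows = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "fox," still matches "fox".
        if word.strip(".,;:!?\"'()").lower() == term:
            lo = max(0, i - radius)               # clamp at the first row
            hi = min(len(words), i + radius + 1)  # clamp at the last row
            windows.append((i, " ".join(words[lo:hi])))
    return windows

# Hypothetical word column, one word per row in book order.
words = "the quick fox jumps over the lazy dog and the fox sleeps".split()

for index, context in word_windows(words, "fox", radius=2):
    print(index, context)
```

With `radius=20` each context holds up to 41 words: the term itself plus 20 on either side, fewer near the start or end of the book.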

badger101 · August 17, 2022, 4:05pm

@RIchardC I can only go through the forum casually until next week. I’m busy with preparation of a few things in real life at the moment. I wish I could help, but I can’t spend time on projects at the moment or else I’ll be drawn in. When I come back later, and if at the time, you still haven’t posted or found a solution, I’ll try to create a complete workflow for you.

Best wishes.

RIchardC · August 18, 2022, 8:35pm

Thanks @badger101. I made it work. Cheers.

system · August 25, 2022, 8:35pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.