What workflow is best for reading several files and extract similar text?

TorOle · November 22, 2023, 10:11am

Hey and sorry that this has likely been asked before, but hopefully you can bear with me!

The problem:
As part of doing risk analyzis (an excel report) you have a range of systems with their own anlyzis. While each report is a standalone file, they follow a commot format where one collumn is particular interesting: Suggestions for actions on how to handle the risks. The issue is that since this is a free text field, the wording can change from time to time while the general suggestions are somewhat the same. This makes it challenging to see which suggestions are repeatable and gives most value to be fixed as it solves several system’s risks through the same action.

Therefore there is a need to use the power of tools such as this to help structure the madnes!

The logic:

Reads a folder where all these reports are and all the excel files within
Ignores anything other than the actual file name and this one column, which we can call “suggested actions”
Analyze the text to see which suggestions are roughly the same
Aggrigate / collect / combine (not sure what to use sorry) the results and produce a new list which can be exported as a excel or CSV file. Ideally the suggestions should be consise, like a sentence or two for consistency, maybe bulletpoints could work too.
This list should contain the various suggestions sorted from high to low depending on their occurences. Ideally there should be an extra column to explain / count how many times this type of suggestion has been asked as well as a reference to which document it was found it. Nothing too fancy, just that it gives the names of the reports where this suggestion is present

Again, sorry for language (not mother tongue) and strange questioning. I have somewhat an understanding of the logic I want, but how to use the tool to archieve this is out of my knowledge.

AlexanderFillbrunn · November 22, 2023, 10:53am

Hi @TorOle,
Welcome to the KNIME Forum! Let me give you a few hints for your individual points:

You can use the Excel Reader node. It has a “Files in folder” option to read multiple files. In the advanced tab of its configuration window, you can also set it to add the file path as an additional column.
In the “Transformation” tab of the Excel Reader’s config you can exclude all the columns you do not need from the dataset. Alternatively, you can use a Column Filter node afterwards.
This is the difficult part. You could use the BERT Embedder node to turn the text into a numeric vector representation, where similar concepts are close in vector space. Here is an example workflow where the vectors are created for movie recommendations. Once you have a vector for each text, you need to cluster them for similarity. If you know how many suggestions there should be in total, you can use k-means or k-median clustering (might not work well due to too many dimensions, though). You can also try the DBSCAN algorithm, but there you have to find a good Epsilon parameter that specifies up to which “vector distance” you consider two rows the same. You also pass it a distance measure, which could be euclidean, but also cosine distance could work well. Once you have a cluster assigned to each row, you know which ones are similar.
Creating one suggestion out of multiple similar ones is pretty hard. You could use our OpenAI integration to let ChatGPT do that, but you will need an account. For that, you could engineer a prompt for ChatGPT by first grouping by found cluster using a GroupBy node with concatenate aggregation on the suggestions in that cluster, then using a String Manipulation node to create a prompt like:

Summarize the following suggestions for in one sentence: <concatenated_suggestions>

The sorting you can do beforehand. In the GroupBy, just add a count aggregation followed by a Sorter node.

I hope these pointers help. If you need additional input for something specific, feel free to ask! I think point 3 is the most difficult one to get right.
Kind regards,
Alexander

TorOle · November 22, 2023, 10:55am

Thanks a lot for your insights! I will take them with me and use it in the future research for (good enough) workflow

system · February 20, 2024, 10:55am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.