TEXT MINNING - workflow by processing text

26AngelG · September 17, 2024, 3:09pm

Hello everyone again,

For this project, I have been asked to conduct a study for an insurance company in their AUTO division. The task consists of analyzing the reports provided by clients on how their vehicle accidents occurred. An example Excel table is provided (attached) with the reports of all accidents, their occurrence, repair costs, make, model…

My idea for analyzing these reports is:

Convert everything to lowercase
Remove accents and punctuation marks
Eliminate prepositions and determiners
Analyze pronouns (to understand culpability if possible)

Once all these operations are performed, extract the most repeated words into a table where we can visualize the top 50 words with the highest total accumulation for these reports.

Therefore, I need to create a workflow with KNIME to perform this operation.

I have been analyzing several options, and this is what I came up with, let me know what you think.

Step 1: Data Preparation

Import Data: Use the Excel Reader node to import your Excel table into KNIME.
Data Cleaning: Use nodes like String Manipulation to remove prepositions and determiners from the reports. You can create a list of common words to remove and apply a replacement function.

Step 2: Word Frequency Analysis

Tokenization: Use the Strings to Document node followed by Document Data Extractor to convert the reports into documents and extract the words.
Word Filtering: Apply the Stop Word Filter node to remove common words that do not add value.
Word Counting: Use the TF (Term Frequency) node to count the frequency of each word in the reports.

Step 3: Visualization

Frequency Charts: Use nodes like Bar Chart or Word Cloud to visualize the most frequent words.

izaychik63 · September 17, 2024, 4:43pm

You can look for some examples for ideas

26AngelG · September 18, 2024, 9:39am

Thanks @izaychik63 i will review it!!