TEXT MINING - workflow for processing text

Hello everyone again,

For this project, I have been asked to conduct a study for an insurance company in their AUTO division. The task consists of analyzing the reports provided by clients on how their vehicle accidents occurred. An example Excel table is provided (attached) with the reports of all accidents, their occurrence, repair costs, make, model…

My idea for analyzing these reports is:

  • Convert everything to lowercase
  • Remove accents and punctuation marks
  • Eliminate prepositions and determiners
  • Analyze pronouns (to understand culpability if possible)
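
To make the four steps above concrete, here is a minimal Python sketch of the same logic. It is only meant as a reference for what the KNIME nodes should produce. The `STOP_WORDS` and `PRONOUNS` lists are hypothetical and deliberately tiny, and setting pronouns aside in a separate list is just one possible way to prepare them for a later culpability check.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny word lists; a real study would need fuller
# lists in the language of the reports.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "with", "from"}
PRONOUNS = {"i", "me", "my", "he", "she", "they", "we", "you", "him", "her"}


def normalize(text: str) -> str:
    """Lowercase, strip accents, and replace ASCII punctuation with spaces."""
    text = text.lower()
    # NFD splits an accented letter into base letter + combining mark;
    # dropping the combining marks leaves the unaccented letter.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def preprocess(text: str) -> tuple[list[str], list[str]]:
    """Return (content words, pronouns) for one accident report."""
    tokens = normalize(text).split()
    pronouns = [t for t in tokens if t in PRONOUNS]
    content = [t for t in tokens if t not in STOP_WORDS and t not in PRONOUNS]
    return content, pronouns
```

For example, `preprocess("The other driver hit my car at the Café!")` returns `(['other', 'driver', 'hit', 'car', 'cafe'], ['my'])`. Note that `string.punctuation` only covers ASCII punctuation, so marks such as `¿` or `¡` would need to be added separately.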

Once all these operations are performed, I want to extract the most frequent words into a table that shows the top 50 words by total count across all reports.
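
As a rough illustration of the top-50 table, the sketch below counts words across all reports and keeps the most frequent ones. It assumes each report has already been cleaned and split into a list of words. The function name `top_words` and the sample reports are made up for the example.

```python
from collections import Counter


def top_words(token_lists, n=50):
    """Return the n most frequent words across all reports as (word, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # add this report's word counts to the running total
    return counts.most_common(n)


# Made-up sample data: one list of cleaned words per report.
reports = [
    ["car", "hit", "rear", "car"],
    ["car", "skidded", "rain"],
    ["rear", "bumper", "damaged"],
]
for word, count in top_words(reports, n=2):
    print(word, count)
```

With the sample data this prints `car 3` and then `rear 2`. Words with equal counts come out in the order they were first seen.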

Therefore, I need to create a workflow with KNIME to perform this operation.

I have been analyzing several options, and this is what I came up with. Let me know what you think.

Step 1: Data Preparation

  1. Import Data: Use the Excel Reader node to import your Excel table into KNIME.
  2. Data Cleaning: Use nodes like String Manipulation to remove prepositions and determiners from the reports. You can create a list of common words to remove and apply a replacement function.

Step 2: Word Frequency Analysis

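  1. Tokenization: Use the Strings To Document node to convert the reports into documents (this also tokenizes the text), followed by the Bag Of Words Creator node to extract the words. Document Data Extractor returns document fields such as title and text, not the individual words.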
  2. Word Filtering: Apply the Stop Word Filter node to remove common words that do not add value.
  3. Word Counting: Use the TF (Term Frequency) node to count the frequency of each word in each report, then aggregate by word (for example with a GroupBy node) to get the totals across all reports.
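
To check what the word counting step should return for a single report, a small Python sketch can help. It assumes that relative term frequency means the count of a word divided by the number of words in that report, and the function name `term_frequencies` is hypothetical.

```python
from collections import Counter


def term_frequencies(tokens, relative=True):
    """Term frequency of each word in one report.

    relative=True  -> count divided by the number of words in the report
    relative=False -> raw count
    """
    counts = Counter(tokens)
    if not relative:
        return dict(counts)
    total = len(tokens)
    if total == 0:
        return {}  # empty report: nothing to divide by
    return {term: count / total for term, count in counts.items()}


# Made-up example report, already cleaned and tokenized.
print(term_frequencies(["car", "hit", "car", "rear"]))
```

Absolute counts are the ones to sum across reports for a top-50 ranking. Relative frequencies are more useful when comparing reports of very different lengths.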

Step 3: Visualization

  1. Frequency Charts: Use nodes like Bar Chart or Tag Cloud (KNIME's word cloud view) to visualize the most frequent words.
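
For a quick look at the ranking before building the chart, a plain-text bar chart is enough. This is only a sketch: `text_bar_chart` is a hypothetical helper, and it assumes the input is a list of (word, count) pairs with positive counts, sorted however you want them displayed.

```python
def text_bar_chart(pairs, width=40):
    """Render (word, count) pairs as lines of '#' bars scaled to the largest count."""
    if not pairs:
        return []
    max_count = max(count for _, count in pairs)
    label_width = max(len(word) for word, _ in pairs)
    lines = []
    for word, count in pairs:
        # Scale the bar to the largest count; always draw at least one '#'.
        bar = "#" * max(1, round(count / max_count * width))
        lines.append(f"{word.ljust(label_width)} {bar} {count}")
    return lines


# Made-up counts for illustration.
for line in text_bar_chart([("car", 3), ("rear", 2)], width=6):
    print(line)
```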

You can also look for some examples for ideas.

Thanks @izaychik63, I will review it!!