Hello everyone again,
For this project, I have been asked to conduct a study for an insurance company in their AUTO division. The task consists of analyzing the reports provided by clients on how their vehicle accidents occurred. An example Excel table is provided (attached) with the reports of all accidents, their occurrence, repair costs, make, model…
My idea for analyzing these reports is:
- Convert everything to lowercase
- Remove accents and punctuation marks
- Eliminate prepositions and determiners
- Analyze pronouns (to understand culpability if possible)
Once all these operations are performed, extract the most repeated words into a table where we can visualize the top 50 words with the highest total accumulation for these reports.
Therefore, I need to create a workflow with KNIME to perform this operation.
I have been analyzing several options, and this is what I came up with, let me know what you think.
Step 1: Data Preparation
- Import Data: Use the
Excel Readernode to import your Excel table into KNIME. - Data Cleaning: Use nodes like
String Manipulationto remove prepositions and determiners from the reports. You can create a list of common words to remove and apply a replacement function.
Step 2: Word Frequency Analysis
- Tokenization: Use the
Strings to Documentnode followed byDocument Data Extractorto convert the reports into documents and extract the words. - Word Filtering: Apply the
Stop Word Filternode to remove common words that do not add value. - Word Counting: Use the
TF(Term Frequency) node to count the frequency of each word in the reports.
Step 3: Visualization
- Frequency Charts: Use nodes like
Bar ChartorWord Cloudto visualize the most frequent words.