Hello everyone again,
For this project, I have been asked to conduct a study for an insurance company in their AUTO division. The task consists of analyzing the reports provided by clients on how their vehicle accidents occurred. An example Excel table is provided (attached) with the reports of all accidents, their occurrence, repair costs, make, model…
My idea for analyzing these reports is:
- Convert everything to lowercase
- Remove accents and punctuation marks
- Eliminate prepositions and determiners
- Analyze pronouns (to understand culpability if possible)
Once all these operations are performed, extract the most repeated words into a table where we can visualize the top 50 words with the highest total accumulation for these reports.
Therefore, I need to create a workflow with KNIME to perform this operation.
I have been analyzing several options, and this is what I came up with, let me know what you think.
Step 1: Data Preparation
- Import Data: Use the
Excel Reader
node to import your Excel table into KNIME. - Data Cleaning: Use nodes like
String Manipulation
to remove prepositions and determiners from the reports. You can create a list of common words to remove and apply a replacement function.
Step 2: Word Frequency Analysis
- Tokenization: Use the
Strings to Document
node followed byDocument Data Extractor
to convert the reports into documents and extract the words. - Word Filtering: Apply the
Stop Word Filter
node to remove common words that do not add value. - Word Counting: Use the
TF
(Term Frequency) node to count the frequency of each word in the reports.
Step 3: Visualization
- Frequency Charts: Use nodes like
Bar Chart
orWord Cloud
to visualize the most frequent words.