WhatsApp Chat Analysis

prashantk · June 21, 2019, 6:19am

Hi,

While processing WhatsApp chats, we are facing a problem. Every new chat message starts with date, time - user: “ChatText”. If “ChatText” contains a line break (from pressing “Enter” while typing in WhatsApp), the text after it continues on a new line.

When we read the file with the “File Reader” node, the first row contains the start of the message (date, time - user), but the text after each “Enter” goes into the next row. The number of rows increases with the number of “Enter” presses.

We want all the text of a chat message as a single record in a single row instead of spread over multiple rows.

A sample input text file is shown below:

11/30/18, 10:09 AM - xxxxxxxxxxx: Hi all ,
Could anybody please help with engineering management universities
I’m give TOEFL tomorrow
Need to add score recipients
11/30/18, 12:20 PM - xxxxxxxxxxx: Manhattan premium account for sale at a low price. If anyone interested PM.
11/30/18, 12:51 PM - xxxxxxxxxxx: its fine for Germany???
11/30/18, 12:52 PM - xxxxxxxxxxx: Can anybody review my profile for rwth and Mannheim data science
TOEFL - 103
3+ years of experience in data field as a software engineer
BTech - 8.12 CGPA
12th- 74
11/30/18, 12:53 PM - xxxxxxxxxxx: Woaajh

Expected output

11/30/18, 10:09 AM - xxxxxxxxxxx: Hi all ,Could anybody please help with engineering management universities I’m give TOEFL tomorrow Need to add score recipients
11/30/18, 12:20 PM - xxxxxxxxxxx: Manhattan premium account for sale at a low price. If anyone interested PM.
11/30/18, 12:51 PM - xxxxxxxxxxx: its fine for Germany???
11/30/18, 12:52 PM - xxxxxxxxxxx: Can anybody review my profile for rwth and Mannheim data science TOEFL - 103 3+ years of experience in data field as a software engineer BTech - 8.12 CGPA 12th- 74
11/30/18, 12:53 PM - xxxxxxxxxxx: Woaajh

Every row should start with date, time - user:.
Any text on the following rows, up to the next chat message, should be appended to that row.
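
For readers who want to prototype the merge outside KNIME, here is a minimal Python sketch of the rule described above. It assumes that every new message starts with a timestamp shaped like the sample ("M/D/YY, H:MM AM/PM - "); the `MESSAGE_START` pattern and the `merge_messages` helper are illustrative names, not part of any workflow in this thread.

```python
import re

# Assumption: a new message starts with "M/D/YY, H:MM AM/PM - ", as in the sample.
MESSAGE_START = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} [AP]M - ")

def merge_messages(lines):
    """Join continuation lines onto the message line that precedes them."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if MESSAGE_START.match(line) or not records:
            records.append(line)          # a new message begins here
        else:
            records[-1] += " " + line     # continuation of the previous message
    return records

sample = [
    "11/30/18, 12:51 PM - xxxxxxxxxxx: its fine for Germany???",
    "11/30/18, 12:52 PM - xxxxxxxxxxx: Can anybody review my profile for rwth and Mannheim data science",
    "TOEFL - 103",
    "12th- 74",
    "11/30/18, 12:53 PM - xxxxxxxxxxx: Woaajh",
]
merged = merge_messages(sample)
for record in merged:
    print(record)
```

Exports from other phones or locales use different timestamp layouts, so the pattern would need adjusting for those.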

prashantk · June 21, 2019, 7:32am

I mean to say that every chat message in the discussion should be read as a new record so it can be processed further for the analysis.

ipazin · June 21, 2019, 10:49am

Hi there,

Depending on the format of your file, you can either play with the many File Reader options and try to get the expected output directly from the File Reader node, or read the file as shown above and then apply logic in KNIME to get your output.

The logic (mine at least) would be something like:

Add a unique indicator (a number, for example) to the rows belonging to the same message.
Use the Group Loop Start node on that indicator, and in each iteration transpose the rows and then combine the columns.
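
The two steps above can be sketched in Python as a rough analogue. This is not the KNIME workflow itself: `tag_rows` and `combine` are hypothetical helpers, and the timestamp pattern is an assumption based on the sample posted earlier.

```python
import re
from itertools import groupby

# Assumption: message rows start with "M/D/YY, H:MM AM/PM - ".
MESSAGE_START = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} [AP]M - ")

def tag_rows(rows):
    """Step 1: pair each row with the running number of the message it belongs to."""
    message_id = 0
    tagged = []
    for row in rows:
        if MESSAGE_START.match(row):
            message_id += 1  # a new message starts, so move to the next indicator
        tagged.append((message_id, row))
    return tagged

def combine(tagged):
    """Step 2: group rows by indicator and join each group into one record."""
    return [
        " ".join(row for _, row in group)
        for _, group in groupby(tagged, key=lambda pair: pair[0])
    ]

rows = [
    "11/30/18, 10:09 AM - xxxxxxxxxxx: Hi all ,",
    "Could anybody please help with engineering management universities",
    "Need to add score recipients",
    "11/30/18, 12:20 PM - xxxxxxxxxxx: Manhattan premium account for sale at a low price.",
]
tagged = tag_rows(rows)
combined = combine(tagged)
```

Here the grouping plays the role of the Group Loop Start node, and the join stands in for the transpose-and-combine step.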

Anyway, sharing a sample file could help.

Br,
Ivan

prashantk · June 21, 2019, 11:15am

Thank you for your reply.

Please find the attached file: ForProcessing.txt (3.1 KB)

prashantk · June 21, 2019, 11:53am

The output file would look like this: ForProcessingOutput.txt (3.0 KB)

ipazin · June 25, 2019, 2:06pm

Hi there,

The logic used in the attached workflow is a bit unusual, but there are some comments, and the idea is pretty much what I wrote above. It just takes some funny methods to get there.

If you have any questions, feel free to ask.
2019_06_25_WhatsUp_Analysis.knwf (19.4 KB)

Br,
Ivan

prashantk · June 26, 2019, 5:09am

Hi ipazin,

Thanks for your reply. I have used the nodes in the workflow below.

Please find it attached: WhatsApp.knwf (12.4 KB)

ipazin · June 26, 2019, 2:24pm

Hi @prashantk,

I see now. That is fine as well. However, two different messages can have the same time, I guess, so the grouping might be wrong. That is why I would rather go with an identifier that is unique. Additionally, you can change the delimiter in the GroupBy node. The default is a comma, but you can use a space.

Actually, your solution gave me an idea that simplifies the workflow a lot! Instead of rowindex() - 1, a missing value is inserted. This way you can use the Missing Value node immediately. Check it out. The regex is also a bit improved.
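
As a rough illustration of why a unique identifier is safer than the timestamp, here is a Python sketch of one way to read this idea. It is an assumption about the approach, not a copy of the attached workflow: rows that start a message keep their row index, continuation rows get a missing value (`None`), the missing values are filled with the previous value, and rows are then concatenated per identifier with a space as delimiter. The sender names `user_a` and `user_b` are made up.

```python
import re

# Assumption: message rows start with "M/D/YY, H:MM AM/PM - ".
MESSAGE_START = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} [AP]M - ")

def forward_fill_ids(rows):
    """Message-start rows keep their row index, other rows inherit the previous one."""
    ids = [i if MESSAGE_START.match(row) else None for i, row in enumerate(rows)]
    last = None
    filled = []
    for value in ids:
        if value is not None:
            last = value
        filled.append(last)  # missing values take the previous identifier
    return filled

def group_concat(rows, ids, delimiter=" "):
    """Concatenate rows that share an identifier, in order of first appearance."""
    grouped = {}
    for key, row in zip(ids, rows):
        grouped.setdefault(key, []).append(row)
    return [delimiter.join(parts) for parts in grouped.values()]

# Hypothetical rows: two different messages sent within the same minute.
rows = [
    "11/30/18, 12:52 PM - user_a: first message",
    "second line of first message",
    "11/30/18, 12:52 PM - user_b: same minute, different message",
]
ids = forward_fill_ids(rows)
records = group_concat(rows, ids)
```

Grouping these rows on the "11/30/18, 12:52 PM" timestamp alone would merge the two messages, while the row-index identifier keeps them apart.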

2019_06_25_WhatsUp_Analysis.knwf (14.0 KB)

Br,
Ivan

prashantk · June 27, 2019, 11:05am

Hi @ipazin,

It's a great solution…

Thank You.

prashantk · July 1, 2019, 6:16am

Hi,

Now we have the date and time. How can we build a report of the number of discussions in each hour of the day?

From this report we can see at what times the most discussion happens and at what times the least.

I have split the date and the time but am unable to operate on the time part. It is in HH:mm a format but stored as a string.

Thank You.

prashantk · July 1, 2019, 7:15am

Hi,

Solved by using the “h:mm a” format in the “String to Date&Time” node.

Thank You.
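
For comparison, the same per-hour report can be sketched in plain Python. The format string "%I:%M %p" is the `strptime` counterpart of “h:mm a”; the `messages_per_hour` helper and the list of times are illustrative, with the times taken from the sample earlier in the thread.

```python
from collections import Counter
from datetime import datetime

def messages_per_hour(time_strings):
    """Count messages per hour of day (0-23) from strings such as '10:09 AM'."""
    hours = Counter()
    for text in time_strings:
        # "%I:%M %p" parses a 12-hour clock with an AM/PM marker.
        parsed = datetime.strptime(text.strip(), "%I:%M %p")
        hours[parsed.hour] += 1
    return hours

times = ["10:09 AM", "12:20 PM", "12:51 PM", "12:52 PM", "12:53 PM"]
counts = messages_per_hour(times)
for hour in sorted(counts):
    print(f"{hour:02d}:00  {counts[hour]}")
```

Note that `%p` depends on the locale, so an export with localized AM/PM markers may need a different parsing step.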

ricknime · December 1, 2021, 1:08pm

Based on your example, I started testing multiple chat exports. When they come from different sources, the chat structure changes.

(He said, She said Group Chats)

I’ve been testing how to split each chat line into Date - Sender - Message.

This is the idea.

Regex Split.
[\[]?((?:\d{1,4}.+[ap]\..m\.)|(?:\d{1,4}.+\d{2}:\d{2})|(?:.+))(?:\s-\s|\]\s)(.+?)[:](.*$)

It requires a definition file to load the chats, with these columns:
chatCode,encode,dateFormat,path

Not finished.
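
To see what the regex above captures, here is a small Python sketch that applies the pattern exactly as posted. The `split_line` helper and the bracketed example line with the sender "Alice" are hypothetical. Python's `re` module is used here instead of KNIME's regex handling, which is an assumption that the two behave the same for this pattern.

```python
import re

# Pattern copied from the post; groups: 1 = date/time, 2 = sender, 3 = message.
CHAT_LINE = re.compile(
    r"[\[]?((?:\d{1,4}.+[ap]\..m\.)|(?:\d{1,4}.+\d{2}:\d{2})|(?:.+))"
    r"(?:\s-\s|\]\s)(.+?)[:](.*$)"
)

def split_line(line):
    """Return (date, sender, message), or None if the line does not match."""
    match = CHAT_LINE.match(line)
    if match is None:
        return None
    date, sender, message = match.groups()
    return date, sender, message.strip()

dash_style = split_line("11/30/18, 12:53 PM - xxxxxxxxxxx: Woaajh")
bracket_style = split_line("[30/11/18 12:20:05] Alice: Hello")  # hypothetical export format
continuation = split_line("TOEFL - 103")  # continuation line from the sample
```

Lines that return `None` can be treated as continuations of the previous message.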

nxfxcom · January 16, 2022, 5:12pm

Hello,

Did you ever figure this out?