I am working on a topic for my dissertation. The topic is about analyzing the body text of 500 emails to get customer service insights. Basically, I would like to do a Latent Semantic Analysis. Alternatively, is it possible to link the text of each email to a particular topic? Is it possible to know how many times a specific keyword is repeated? Is it possible to create groups of words to categorize emails?
Is it also possible to read emails from mbox files to get just the body text? Or should I import a txt/csv file?
Do you know if there is a tool that allows me to export just the body text from my batch of emails?
I have tried to work with a txt file, but the software does not put the text of a message in a single row. Instead, the body text is spread across several rows. Is it possible to put each message in just one row?
Use the Tika Parser node to extract the metadata and body content from the emails.
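For anyone who wants to try the same step outside KNIME, here is a minimal sketch using only the Python standard library. It assumes the emails are in a single mbox file and that each message has a text/plain part; the function names `plain_body` and `mbox_bodies` are made up for this example. It also collapses line breaks so that each message ends up in one row.

```python
import mailbox
import re
from email.message import Message


def plain_body(msg: Message) -> str:
    """Return the first text/plain part of a message as one line of text."""
    for part in msg.walk():
        # Skip attachments; they carry a file name.
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace")
            # Collapse line breaks and repeated spaces so each email fits in one row.
            return re.sub(r"\s+", " ", text).strip()
    return ""


def mbox_bodies(path: str) -> list[str]:
    """Read every message in an mbox file and return the body texts."""
    box = mailbox.mbox(path)
    try:
        return [plain_body(msg) for msg in box]
    finally:
        box.close()
```

The resulting list has one string per email, so it can be written to a one-column CSV and loaded into a table. Messages with an unusual or misspelled charset label may need extra handling.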
Thank you so much for your quick response.
I would like to know if it is possible to eliminate all the multiple existing text blocks in forwarded or replied emails.
I’m sorry … the structure of the body, regarding forwarding or replying etc, depends on the email programs used by the individual users.
The body is string content, so you can eliminate every character after the first substring that marks a forwarded or replied message.
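A rough sketch of that idea in Python, in case it is useful: cut the body at the first line that looks like a reply or forward marker. The patterns in `REPLY_MARKERS` are only guesses at common mail-client conventions, not a complete list, and `newest_message` is a hypothetical helper name. You would need to adapt the patterns to what actually appears in your emails.

```python
import re

# Hypothetical marker patterns; real mail clients vary, so extend this list
# after inspecting your own data.
REPLY_MARKERS = [
    r"^-{2,}\s*Original Message\s*-{2,}",   # separator line before a quoted message
    r"^-{2,}\s*Forwarded message\s*-{2,}",  # separator line before a forward
    r"^On .+ wrote:\s*$",                   # "On <date>, <name> wrote:" reply header
    r"^From: .+",                           # start of a quoted header block
]
MARKER_RE = re.compile("|".join(REPLY_MARKERS), re.IGNORECASE | re.MULTILINE)


def newest_message(body: str) -> str:
    """Keep only the text before the first reply or forward marker."""
    match = MARKER_RE.search(body)
    return body[: match.start()].rstrip() if match else body.strip()
```

Because the markers depend on the email program, it is worth checking a handful of results by eye before trusting the output for all 500 emails.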
Thank you for your help @Andrew_Steel.
I would like to know if it is possible to split a string into multiple strings. Basically, I used the Tika Parser to get the content of each email. The problem is that sometimes one cell contains a whole conversation, and therefore multiple emails.
I would like to split each email from the same cell. Is it possible to do that? I will attach an example.
How can I split those different emails?
that looks bad … I think there is no continuous usable structure in the content cell …
Which mimetype does the raw email have? Which extension? All of the same format?
In a forensic approach that I use, I separate the emails, which mostly come from an Outlook mailbox, at the boundary entries. However, I analyze the emails in raw format (eml).
All the files are in the same format: .eml
The problem is that most of the emails contain whole conversations and not just one email.
I do not really know how to split a conversation into single emails. Do you know how to do that in Outlook? I could try to transfer all these emails to Outlook if I could split them into single emails.
This problem is caused by the reply function.
I am using Thunderbird, so I can also provide csv and txt files.
I will upload a sample in eml format: Nuovo Archivio WinRAR ZIP.zip (581.1 KB)
that’s a little bit tricky … in my opinion you should use the html code inside the email …
For your sample email:
It’s a nested multipart email. You will find the first multipart boundary,
----=_NextPart_000_0072_01D3CC03.A5E060B0, in the email header. The first part is the body part; the following four parts are attachments.
The content type of the body part is multipart/alternative, with the boundary
----=_NextPart_001_0073_01D3CC03.A5E060B0. The first part of this nested multipart is a raw text body (Content-Type: text/plain). The second part is the body in HTML form (Content-Type: text/html). The raw text has no structure, but the HTML text does.
Load the raw text (e.g. with Load text-based file; Vernalis KNIME Nodes), extract the HTML content and select your body text …
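As an illustration of the same steps outside KNIME, the sketch below uses Python's standard `email` package to walk the nested multipart structure and pull out the text/html body, then collects the visible text with `html.parser`. The names `_TextCollector` and `html_body_text` are invented for this example, and it assumes the message has an HTML body part; if only a text/plain part exists, that text is returned instead.

```python
from email import policy
from email.parser import BytesParser
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <style> and <script> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_body_text(raw: bytes) -> list[str]:
    """Return the text chunks of the HTML body of a raw .eml message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    # get_body looks through nested multipart parts and ignores attachments.
    part = msg.get_body(preferencelist=("html", "plain"))
    if part is None:
        return []
    collector = _TextCollector()
    collector.feed(part.get_content())
    collector.close()
    return collector.chunks
```

Each chunk corresponds to one run of text between tags, so quoted replies that the mail client wrapped in their own HTML elements tend to end up in separate chunks. That is what makes the HTML part easier to work with than the unstructured raw text.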
I hope this helps at least a little bit
Thank you so much for your help @Andrew_Steel.
Should I extract the body text for each email?
In that case it would be really slow to get my data.
I think yes, but I don’t know your requirement for the result.
Parallelise your workflow with a Parallel Chunk Loop to increase the performance.
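The Parallel Chunk Loop is a KNIME construct, but the underlying idea (each email can be handled independently, so several can be handled at once) can be sketched in Python too. This is only an assumption-laden illustration: `read_eml` and `load_all` are hypothetical helpers, the folder is assumed to contain `.eml` files, and a thread pool mainly helps with reading files from disk rather than with heavy parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_eml(path: Path) -> bytes:
    """Read one raw .eml file from disk."""
    return path.read_bytes()


def load_all(folder: str, workers: int = 4) -> dict[str, bytes]:
    """Read every .eml file in a folder concurrently, keyed by file name."""
    paths = sorted(Path(folder).glob("*.eml"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves the input order, so names and contents stay aligned.
        contents = list(pool.map(read_eml, paths))
    return {p.name: data for p, data in zip(paths, contents)}
```

For 500 emails the sequential version is probably fast enough anyway, so it is worth measuring before adding parallelism.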
I am going to do keyword extraction and clustering I think.
I could download the emails in txt format and edit them all by hand, one by one.
This is going to take time too, but at least I will get my data sorted for sure.