I am working with the Enron Email dataset for an NLP project in KNIME.
Could you help me convert the raw emails in this dataset into a cleaned form containing only words, and then apply supervised and unsupervised learning to it in KNIME?
You’ll need to do some parsing if you want to get nice text data out of this. The mbox nodes I built a while ago might help with the first steps of parsing the raw mail data.
The “TIKA” node might also help, as it can supposedly digest eml files.
Don’t expect a “one click” solution though; this will still require quite some effort.
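To give an idea of what the parsing step involves, here is a minimal sketch in plain Python using only the standard library `email` module. It is not what the mbox or TIKA nodes do internally. The function name `clean_email`, the sample message, and the "lowercase letters only" tokenization are my own assumptions for illustration, and real Enron messages will need more care (quoted replies, signatures, forwarded headers).

```python
import re
from email import message_from_string
from email.policy import default


def clean_email(raw: str) -> dict:
    """Parse one raw RFC 822 message and return its headers plus a word list.

    Hypothetical helper for illustration; adapt the tokenization to your needs.
    """
    msg = message_from_string(raw, policy=default)

    # Prefer the plain-text body; messages without one yield an empty string.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Keep only runs of letters, lowercased. Digits and punctuation are dropped.
    words = re.findall(r"[a-z]+", body.lower())

    return {
        "from": str(msg["from"] or ""),
        "subject": str(msg["subject"] or ""),
        "words": words,
    }


# Made-up message in the same header/body layout as a raw mail file.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Re: Forecast\n"
    "\n"
    "Please find the report attached.\n"
    "Thanks, Alice\n"
)

result = clean_email(sample)
print(result["subject"])
print(result["words"])
```

The resulting word list (or the words joined back into one string per message) is the kind of cleaned text column you would then feed into the text processing and learning steps.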