I'm doing some text processing and analysis on support tickets (emails).
I see stemming recommended in all examples and presentations; however, in my experience the stemming seems way too aggressive. It might work for analytical purposes, but it renders 'nonsense', so to speak, so I can't use it to present a Tag Cloud in a report.
The Porter stemmer, for instance, seems to just cut off any trailing 'e' ('please' becomes 'pleas').
The Snowball stemmer seems a bit better, but even this one cuts off the trailing 'e' from verbs ('provide' becomes 'provid').
Is there anything I can do to influence this or am I grossly overlooking something here?
It is possible to set terms to "unmodifiable" when tagging. That way, the stemmer will not alter the tagged terms. Apart from that, there is not much you can do to influence the behavior of the stemmer; the results you get are perfectly normal. E.g. "provide" becomes "provid", but so do "providing", "provides", etc. Terms are reduced to their common stem, and this stem is what you get as a result. One other thing you could do is create your Tag Cloud using terms that have not been stemmed. Would this help in your case?
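To illustrate the idea of protecting terms from the stemmer, here is a minimal Python sketch. It is not the tool's own "unmodifiable" mechanism, only a generic analogue. The names `toy_stem` and `stem_unless_protected` are hypothetical, and `toy_stem` is a crude stand-in that strips a few suffixes; it is not the Porter or Snowball algorithm.

```python
from typing import Callable, Iterable, List, Set


def toy_stem(word: str) -> str:
    """Stand-in stemmer for illustration only (NOT Porter or Snowball)."""
    for suffix in ("ing", "es", "e", "s"):
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_unless_protected(
    tokens: Iterable[str],
    stem: Callable[[str], str],
    protected: Set[str],
) -> List[str]:
    """Stem every token except those listed in `protected`."""
    return [t if t in protected else stem(t) for t in tokens]


tokens = ["please", "provide", "providing", "provides"]
result = stem_unless_protected(tokens, toy_stem, protected={"please"})
print(result)  # ['please', 'provid', 'provid', 'provid']
```

Any real stemmer exposed as a function from string to string could be passed in place of `toy_stem`.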
Thanks for your response, and apologies for my late reply. Somehow I missed your answer(?)
I guess it indeed makes sense from the analysis point of view, and I can work around it in some way to have sensible ‘reporting’ based on this (turning 'provid' back into 'provide' in the visualization).
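One way this workaround might look, sketched in Python under some assumptions: counts are aggregated per stem, but each stem is displayed using the most frequent original form seen for it. The function names are hypothetical, and `toy_stem` is only a crude stand-in for a real stemmer such as Porter or Snowball.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable


def toy_stem(word: str) -> str:
    """Stand-in stemmer for illustration only (NOT Porter or Snowball)."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tag_cloud_counts(
    tokens: Iterable[str], stem: Callable[[str], str]
) -> Dict[str, int]:
    """Count terms per stem, labelled by the most frequent original form."""
    forms_by_stem = defaultdict(Counter)
    for token in tokens:
        forms_by_stem[stem(token)][token] += 1

    cloud = {}
    for forms in forms_by_stem.values():
        # Ties go to the form that was encountered first.
        label = forms.most_common(1)[0][0]
        cloud[label] = sum(forms.values())
    return cloud


tokens = ["provide", "provides", "provide", "providing",
          "please", "please", "ticket", "tickets"]
cloud = tag_cloud_counts(tokens, toy_stem)
print(cloud)  # {'provide': 4, 'please': 2, 'ticket': 2}
```

The analysis still benefits from the grouping that stemming provides, while the report shows readable words instead of raw stems.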