Text processing and comparison of two different sets of documents


Dear Knime community,

first of all it’s really amazing to have this very vivid and supportive community and I hope you can help me with my case.

For my Master Thesis, I want to research the differences in company values between family and non-family firms. To do so, I would like to run a text mining task on the companies’ annual reports (in PDF format), which usually state the company’s values and beliefs.
As an outcome, I would like to generate a list of words/terms that represent the most frequently stated values for each group, family firms and non-family firms.

I’ve already started working on a workflow which you can find attached. However, I do need your help for a couple of points to make it as good as possible:

  1. I didn’t find a way to separate the analysis for family firms vs. non-family firms. Should I feed in the annual reports via 2 separate PDF Parsers, or is there a way to split up the documents depending on family firm / non-family firm? Right now, I have all the documents in one column, so I cannot distinguish between the two types of firm.

  2. Based upon my first question: Is there any tool to analyze/compare the outcomes of family firms/ non-family firms?

  3. Is there any tool to additionally filter the terms so that only company values remain, such as diversity, team-orientation, respect, collaboration and so on?

  4. Do you have any other analyses in mind how to draw some more findings/insights out of it? For example, would it make sense to do a sentiment analysis?

Sorry for so many questions at once but I’m new to Knime and not aware of the endless opportunities this program offers.
Your support would be highly appreciated!!




Hi @cpauly93 and welcome to the KNIME community,

You have a lot of questions, but also already a workflow, so that’s a good start. Regarding your questions:
It is highly recommended to label your input data in 2 groups: the family firms and the non-family firms. You will need this label to compare both groups in your analysis. If it is not possible to extract it from your initial input, I suggest reading both groups of PDFs separately from each other, see this example.
It may be some extra work. But hey, data preparation takes a lot of time, and doing it well will pay off for sure!
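
To make the labelling idea concrete outside of KNIME, here is a minimal Python sketch. It assumes the reports are stored in one subfolder per group; the folder names `family` and `non_family` and the file names are hypothetical. Inside KNIME, one way to get the same result would probably be to add a constant label column after each PDF Parser and then concatenate both tables.

```python
from pathlib import PurePosixPath

# Assumption: one subfolder per group. These folder names are hypothetical.
GROUP_FOLDERS = {"family": "family", "non_family": "non-family"}


def label_report(path: str) -> str:
    """Return the group label derived from the report's parent folder."""
    folder = PurePosixPath(path).parent.name
    if folder not in GROUP_FOLDERS:
        raise ValueError(f"unexpected folder for {path!r}: {folder!r}")
    return GROUP_FOLDERS[folder]


def label_reports(paths):
    """Pair every report path with its group label."""
    return [(p, label_report(p)) for p in paths]


# Hypothetical file names, for illustration only.
labelled = label_reports([
    "reports/family/firm_a_2019.pdf",
    "reports/non_family/firm_b_2019.pdf",
])
```

Every document then carries its group with it, so all later steps can group or split on that label instead of relying on the order of the rows.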

You can use LDA topic extraction to find which words lead to the different topics for family and non-family firms. For example this workflow. Or if you already have a list of words, the String Matcher node may be of some help to you. A TF-IDF analysis can also be helpful.
It’s important to have your research question well formulated, then the choice of how to analyze the data is more straightforward.
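
If you go the word-list route, the comparison between the two groups can be sketched in a few lines of Python. This is not TF-IDF or LDA, just a plain relative-frequency count of dictionary terms per group, and it assumes the text has already been extracted from the PDFs and labelled. The value list and the two sample texts are made up for illustration; hyphenated terms such as team-orientation would need a different tokenizer.

```python
import re
from collections import Counter

# Assumption: a hand-made value dictionary (single words only).
VALUE_TERMS = {"diversity", "respect", "collaboration"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def value_shares(labelled_docs):
    """Return, per group, each value term's share of all tokens in that group."""
    hits = {}
    totals = Counter()
    for group, text in labelled_docs:
        tokens = tokenize(text)
        totals[group] += len(tokens)
        hits.setdefault(group, Counter()).update(
            t for t in tokens if t in VALUE_TERMS
        )
    return {
        group: {term: count / totals[group] for term, count in counter.items()}
        for group, counter in hits.items()
        if totals[group]
    }


# Made-up sample texts standing in for the extracted report text.
docs = [
    ("family", "Respect and collaboration guide us. Respect matters."),
    ("non-family", "Diversity drives growth and diversity drives respect."),
]
shares = value_shares(docs)
```

Dividing by the total number of tokens per group keeps the comparison fair when one group has more or longer reports than the other.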

In general I would advise you to take a look at the KNIME Hub, which has lots of examples and workflows.

Hope this helps a little,
gr Hans