First of all, I want to apologize if a similar problem has been answered before. I am not familiar with the correct terminology, which makes searching for it harder.
I am a newbie to Knime/data science. Four years ago I had a few days of training in the general concepts and methods, and in that course we used Knime. I have no further education or practical experience in this area, so I hope that, with some guidance, I can get more insight into my data with Knime.
My problem is as follows:
I have a list of projects which led to negative margins in past years. I have different category columns such as
- project volume - binned in two columns (one column with 3 bins, one column with 9)
- (business) sector (three columns representing the levels of detail, as a hierarchy with 5, 25 and 70 classes),
- customer type,
- percentage of completion (at each year's end), in my opinion a weak indicator
- country
- region (as a subcategory of country; some countries are not split into regions => 1:1)
- segment (top hierarchy 11 classes)
- business unit (lower level 56 classes)
- margin categories (binned: < -1%, -1 to -5%, -5 to -10%, …)
Furthermore I have columns for:
- the year (as integer, five years),
- the total project volume (latest plan, decimal; the same value is repeated in every year the project occurs, to avoid the problem of the volume changing over the years)
- the relevant turnover (in a year, decimal)
- absolute value margin (in a year, decimal)
- and the relative margin per year (decimal).
Each project is represented with one line per year in which it had a negative margin. This means positive years are already excluded from the data set, and each project may be represented with 1-5 lines (depending on the number of years in which it showed a negative margin). Some of the projects might also have been finished in a single year (and are therefore represented only once in the data set, with 100% completion in that year).
Small projects are overrepresented: they account for 60% of the more than 5,000 lines (size class 1 in the first category (ranging 1-3), or 1-2 in the second category (ranging 1-9)). So I am thinking of splitting the data analysis into two subsets.
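To illustrate the split I have in mind, here is a minimal Python sketch (outside of Knime). The column names `size_class_3` and `size_class_9` and the sample rows are made up for illustration only; they are not my real data.

```python
# Hypothetical project-year lines; column names and values are invented.
rows = [
    {"project": "A", "year": 2019, "size_class_3": 1, "size_class_9": 2},
    {"project": "B", "year": 2019, "size_class_3": 2, "size_class_9": 5},
    {"project": "C", "year": 2020, "size_class_3": 1, "size_class_9": 1},
    {"project": "D", "year": 2021, "size_class_3": 3, "size_class_9": 9},
]


def is_small(row):
    # "Small" as described above: class 1 of 3, or class 1-2 of 9.
    return row["size_class_3"] == 1 or row["size_class_9"] <= 2


# Two subsets that could then be analysed separately.
small = [r for r in rows if is_small(r)]
large = [r for r in rows if not is_small(r)]
print(len(small), len(large))
```

In Knime I assume a Row Splitter node with the same condition would produce the two subsets.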
I am not planning to do a time series analysis over the years (at least not as a first step).
I would like to find significant/relevant clusters or combinations of categories which show the strongest association with a project's negative margin.
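To make the goal more concrete, here is a rough Python sketch of what I imagine: group the lines by every pair of category columns and rank the groups by their turnover-weighted relative margin. All column names (`sector`, `country`, `customer_type`, `turnover`, `margin`) and values are hypothetical, and I am not sure this is the right method.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical project-year lines; column names and values are invented.
rows = [
    {"sector": "Energy", "country": "DE", "customer_type": "Public",
     "turnover": 100.0, "margin": -10.0},
    {"sector": "Energy", "country": "DE", "customer_type": "Private",
     "turnover": 200.0, "margin": -4.0},
    {"sector": "Rail", "country": "AT", "customer_type": "Public",
     "turnover": 50.0, "margin": -1.0},
    {"sector": "Energy", "country": "AT", "customer_type": "Public",
     "turnover": 100.0, "margin": -20.0},
]
CATEGORY_COLUMNS = ["sector", "country", "customer_type"]


def summarize(rows, columns):
    """Aggregate line count, turnover and margin per combination of values."""
    groups = defaultdict(lambda: {"lines": 0, "turnover": 0.0, "margin": 0.0})
    for row in rows:
        key = tuple(row[c] for c in columns)
        group = groups[key]
        group["lines"] += 1
        group["turnover"] += row["turnover"]
        group["margin"] += row["margin"]
    for group in groups.values():
        # Turnover-weighted relative margin; assumes group turnover is > 0.
        group["relative_margin"] = group["margin"] / group["turnover"]
    return dict(groups)


# Evaluate every pair of category columns and collect all groups.
results = []
for pair in combinations(CATEGORY_COLUMNS, 2):
    for key, stats in summarize(rows, pair).items():
        results.append((stats["relative_margin"], pair, key, stats["lines"]))
results.sort(key=lambda item: item[0])  # most negative relative margin first

for relative_margin, pair, key, lines in results[:3]:
    print(pair, key, lines, round(relative_margin, 3))
```

Since my data set only contains years with a negative margin, a ranking like this would show where losses are most severe, not which combinations are more likely to produce a loss in the first place. The line count per group is kept so that very small groups can be ignored.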
At the moment I have not carried out any normalisation of data.
I hope I was specific enough, and I am happy to provide any further details that might be required.
Please bear in mind my lack of in-depth knowledge and experience with data analytics, and provide some guidance on how to visualise/represent the results.