Suitable data model to identify significant categories for a numeric result

FlorianSpring · April 17, 2024, 9:06am

First of all I want to apologize if a similar problem has been answered before. I am not familiar with the correct terminology which makes researching more complicated.

I am a newbie to Knime/data science and had a few days of training in the general concepts and methods 4 years ago. In that course we used Knime. I have no further education or practical experience in this area, so I hope that with some guidance I can get more insight into my data with Knime.

My problem is as follows:
I have a list of projects which led to negative margins in the past years. I have different category columns such as:

project volume - binned in two columns (one column with 3 bins, one column with 9),
(business) sector (3 columns representing the levels of segregation, like a hierarchy with 5, 25 and 70 classes),
customer type,
percent of completion (at each year's end), in my opinion weak
country
region (as a subcategory of country; some countries are not split into regions => 1:1)
segment (top hierarchy level, 11 classes)
business unit (lower level, 56 classes)
margin categories (binned < -1%, -1 to -5%, -5 to -10,…)

Furthermore I have columns for:

the year (as integer, five years),
the total project volume (latest plan, decimal => the same value in every year of occurrence, to overcome the problem of changes over the years)
the relevant turnover (in a year, decimal)
absolute value margin (in a year, decimal)
and the relative margin per year (decimal).

Each project is represented with one line per year in which it had a negative margin. This means positive years are already excluded from the data set, and each project may be represented with 1-5 lines (depending on the number of years in which it showed a negative margin). Some of the projects might also have been finished in a single year (and are therefore represented only once in the data set, with 100% completion in that year).

Small projects are overrepresented: they make up 60% of the more than 5,000 lines (size class 1 in the first category (ranging 1-3) or 1-2 in the second category (ranging 1-9)). So I am thinking of splitting the data analysis into two subsets.

I am not planning to do a time series analysis over the years (in the first step).

I would like to find significant/relevant clusters/combinations of categories that correlate most strongly with the negative margin of a project.
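One way to make this goal concrete is the correlation ratio (eta squared): for a single category column, it is the share of the variation in a numeric result that is explained by the differences between the category means. The Python sketch below is only an illustration under stated assumptions, not a KNIME workflow and not a recommendation from this thread: the column names (`size_class`, `sector`, `customer_type`, `rel_margin`) and all rows are made up, and the helper names `correlation_ratio` and `rank_columns` are hypothetical.

```python
from collections import defaultdict


def correlation_ratio(categories, values):
    """Eta squared: share of the variance in `values` explained by the category means."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)

    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # the numeric result does not vary at all

    # Between-group sum of squares: group size times squared distance
    # of the group mean from the overall mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


def rank_columns(rows, category_columns, target):
    """Score every category column against the target and sort, strongest first."""
    values = [row[target] for row in rows]
    scores = {
        col: correlation_ratio([row[col] for row in rows], values)
        for col in category_columns
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up example rows: one row per project and year, relative margin as a decimal.
rows = [
    {"size_class": 1, "sector": "A", "customer_type": "public", "rel_margin": -0.02},
    {"size_class": 1, "sector": "A", "customer_type": "private", "rel_margin": -0.02},
    {"size_class": 2, "sector": "A", "customer_type": "public", "rel_margin": -0.06},
    {"size_class": 2, "sector": "B", "customer_type": "private", "rel_margin": -0.06},
    {"size_class": 3, "sector": "B", "customer_type": "public", "rel_margin": -0.10},
    {"size_class": 3, "sector": "B", "customer_type": "private", "rel_margin": -0.10},
]

ranking = rank_columns(rows, ["size_class", "sector", "customer_type"], "rel_margin")
for column, eta_squared in ranking:
    print(f"{column}: eta^2 = {eta_squared:.2f}")
```

A value near 1 means the column separates the margins almost completely, a value near 0 means it carries little information on its own. Combinations of categories could be scored the same way by treating the tuple of two columns as one category, although with 56 or 70 classes many combinations would contain very few rows. The sketch also treats every row as independent, which does not hold when one project contributes up to five yearly rows.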

At the moment I have not carried out any normalisation of data.

I hope I was specific enough, and I am glad to give you more details that might be required.

Please consider my lack of in-depth knowledge and experience with data analytics and provide some guidance on how to visualise/represent the results.

ScottF · April 23, 2024, 7:44pm

Hi @FlorianSpring and welcome to the forum.

Do you by chance have a sample dataset you could upload? Folks are more likely to explore options when you hand them the data.