Help with my school work with Knime

So I have this school assignment where we have to predict how many of the currently attending students will graduate and how many will resign. I have been trying to solve this, but I haven’t managed to. Can anyone help with what I should do in KNIME?

We know the students’

  • entrance examination points
  • last grade
  • number of days they have been away from school

There is also information about students who have already graduated or resigned.
I have also calculated the correlation between last grades and absence days.

I was thinking of using the Linear Regression Learner, but now I’m not sure if it’s the right way to go.

Please help me :slight_smile:

Hi @suetus,

can you provide the data for testing please? Without it, trying to give support is quite a challenge.

Best
Mike

Hi @suetus,

sounds like a classification problem :slight_smile: Have you tried to look for examples on the hub?

Have a nice day,
Raffaello Barri

Kokeilijan projektin data.xlsx (14.3 KB)

Hi. Yes, here is the provided data. Sorry, it’s partially in Finnish. I am familiar with using the decision tree that was recommended, but I don’t get how to use it in this situation. The goal is to predict whether the attending students are going to resign or graduate.
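One way to frame the question as a classification task: train on the students whose outcome is already known (graduated or resigned), then apply the trained model to those still attending and count the predicted classes. The sketch below is plain Python rather than a KNIME workflow. The rows are made up (not the attached data), the English column names are assumptions, and a one-split "decision stump" stands in for a full decision tree.

```python
from collections import Counter

# Hypothetical rows: (status, entrance_exam, grade, absences). Not the real data.
students = [
    ("graduated", 30, 4, 5),
    ("graduated", 28, 3, 12),
    ("graduated", 25, 4, 8),
    ("resigned", 18, 2, 40),
    ("resigned", 22, 2, 35),
    ("resigned", 20, 3, 50),
    ("attending", 27, 4, 10),
    ("attending", 19, 2, 45),
]
FEATURES = ["entrance_exam", "grade", "absences"]

# Step 1: rows with a known outcome are training data, the rest get predicted.
known = [s for s in students if s[0] != "attending"]
attending = [s for s in students if s[0] == "attending"]


def train_stump(rows):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for f in range(len(FEATURES)):
        values = sorted({r[f + 1] for r in rows})
        # Candidate thresholds: midpoints between neighbouring observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for below, above in (("graduated", "resigned"), ("resigned", "graduated")):
                errors = sum(
                    (below if r[f + 1] <= thr else above) != r[0] for r in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, below, above)
    return best[1:]


def predict(stump, row):
    """Apply the learned split to one row."""
    f, thr, below, above = stump
    return below if row[f + 1] <= thr else above


# Step 2: learn from finished students, then predict the attending ones.
stump = train_stump(known)
predictions = [predict(stump, s) for s in attending]
print("split on", FEATURES[stump[0]], "at", stump[1])
print(Counter(predictions))
```

In KNIME terms this would roughly correspond to a Row Filter on the status column, a learner node fed with the finished students, and a predictor node fed with the attending ones.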

Hi @suetus,

While I’m not a professional data analyst and lack formal training to decide which correlation type and sided variant to use based on the data distribution (so take what I provide you with a fair grain of salt!), here is one approach.

For simplicity, we ignore the group “Ryhmä” and municipality “Kotikunta” for now and accept that the data is valid, has no missing values, each name equals one student, etc. However, I see that some zero values are present. They likely must be removed, as I assume a zero means “not taken” (e.g. because of illness).

The columns translate to:

  1. Tilanne - Status
  2. Ryhmä - Group: This could refer to the student’s class, cohort, or group designation within the institution.
  3. Kotikunta - Municipality/Home Town
  4. Pääsykoe - Entrance Exam: The scores or results from an entrance examination.
  5. Arvosana - Grade: This likely represents the students’ grades or marks as an average / median across all
  6. Poissaolot - Absences: Number of absences the

First, I remove the zeros to get a more accurate picture. The problem is that they make up 46 out of 183 data points, leaving only 137. This drastically reduces the significance of the results, so add a few tablespoons of salt again. Removing the rows with the status “attending” reduces this further to 51 data points.

Using the Auto Binner with the quantiles “0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0” helps us identify the lowest and highest one or five percent, which can then be removed as outliers. Using the 5 % threshold, the data set is further reduced to 41 data points, filtering out 45.
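The quantile-based trimming can be sketched outside KNIME as well. This is a minimal illustration using Python's standard library on made-up absence counts. The Auto Binner may interpolate quantiles differently, so the exact cut-offs are an assumption.

```python
from statistics import quantiles

# Made-up absence counts with one extreme value at each end.
absences = [1, 8, 9, 10, 11, 12, 12, 13, 14, 15,
            15, 16, 17, 18, 19, 20, 21, 22, 23, 90]

# n=100 yields 99 cut points: index 4 is the 5th percentile, index 94 the 95th.
cuts = quantiles(absences, n=100, method="inclusive")
low, high = cuts[4], cuts[94]

# Keep only values inside the central 90 % band.
kept = [a for a in absences if low <= a <= high]
dropped = [a for a in absences if a < low or a > high]
print(f"5% = {low:.2f}, 95% = {high:.2f}, kept {len(kept)}, dropped {dropped}")
```

Trimming each column separately, as here, removes a row whenever any one of its values is extreme, which is one reason the data set can shrink quickly.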

Note to myself: is what I am doing valid? It seems to make little sense to filter out so much data. Unfortunately, doing a statistical power analysis to determine whether the data set is large enough goes a little beyond the scope here, and there is no node in KNIME handling that.
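For a rough feel of the sample sizes involved, a common textbook approximation based on the Fisher z-transformation can be computed in a few lines. This is only a sketch: the 5 % significance level and 80 % power are conventional defaults chosen here for illustration, not values taken from the workflow, and `required_n` is a hypothetical helper name.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n(r, alpha=0.05, power=0.80):
    """Approximate number of pairs needed to detect a correlation of size r
    with a two-sided test, via the Fisher z-transformation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(abs(r))) ** 2 + 3)


for r in (0.1, 0.3):
    print(f"r = {r}: about {required_n(r)} pairs")
```

Under these assumptions, even a medium correlation needs more pairs per group than remain after filtering, which supports the doubt raised in the note above.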

Checking the data distribution using a Density Plot node (KNIME Community Hub), we can confirm a normal distribution for resigned and graduated. Interestingly, those still attending tend to form two bulges.

Plotting absences vs. the entrance exam score and coloring by status, there also seems to be a correlation between lower entrance exam scores and absences amongst those who resigned vs. those who graduated. There is more interesting stuff to see, but let’s not get distracted.

Knowing we have a normal distribution, we can safely choose the Pearson correlation coefficient, using a Rank Correlation node (KNIME Community Hub).

About the p-value calculation, let’s play dumb for now and say we don’t know the direction and are interested in any correlation between Pääsykoe, Arvosana or Poissaolot and the likelihood of resigning or graduating. Hence, we choose a two-sided p-value calculation.
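To make the correlation and p-value columns in the tables less of a black box, here is a minimal sketch of the usual t-test for a Pearson coefficient, with t = r·sqrt(df / (1 − r²)) and df = n − 2. It assumes the KNIME node uses this standard test, which is not verified here. The sample data is made up and the helper names are hypothetical. If the assumption holds, 27 and 10 degrees of freedom correspond to 29 and 12 students, which adds up to the 41 data points mentioned.

```python
from math import exp, lgamma, pi, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for a Pearson r computed from n pairs (df = n - 2)."""
    df = n - 2
    if abs(r) >= 1:
        return 0.0
    t = abs(r) * sqrt(df / (1 - r * r))
    h = t / steps  # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3  # probability mass between 0 and t
    return max(0.0, 1 - 2 * area)


# Made-up grades and absences for six students.
grades = [4, 3, 4, 2, 5, 3]
absences = [5, 12, 8, 30, 2, 15]
r = pearson_r(grades, absences)
print(f"r = {r:.2f}, p = {two_sided_p(r, len(grades)):.3f}")

# A rounded coefficient from the first table: r = 0.12 with 27 degrees of freedom.
print(f"p = {two_sided_p(0.12, 29):.2f}")
```

Two-sided means that both unusually positive and unusually negative coefficients count as evidence, which is why the tail area is doubled.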

Furthermore, using a regular Statistics node (KNIME Community Hub) in combination with a Group Loop Start, you can get a quick, low-effort overview.
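The idea of looping over groups and computing statistics per group can be illustrated like this. It is a small stand-in for the Group Loop Start plus Statistics combination, using made-up absence counts per status.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Made-up rows: (status, absences). Not the real data.
rows = [
    ("graduated", 5), ("graduated", 12), ("graduated", 8), ("graduated", 7),
    ("resigned", 40), ("resigned", 35), ("resigned", 50), ("resigned", 45),
    ("attending", 10), ("attending", 45), ("attending", 20),
]

# Group the values by status, as a group loop would do.
groups = defaultdict(list)
for status, absences in rows:
    groups[status].append(absences)

# One summary per group.
summary = {
    status: {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }
    for status, values in groups.items()
}

for status, stats in summary.items():
    print(status, stats)
```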

Using the quite reduced data set of just 41 data points, it is quite difficult to read anything from the stats. That seems to be confirmed by the low correlation values, none of which comes close to significance (p ≥ 0.32).

| First column name | Second column name | Correlation value | p value | Degrees of freedom | Situation |
|---|---|---|---|---|---|
| Entrance exam | Grade | 0.12 | 0.55 | 27 | %graduated% |
| Entrance exam | Absences | -0.14 | 0.48 | 27 | %graduated% |
| Grade | Absences | -0.14 | 0.47 | 27 | %graduated% |
| Entrance exam | Grade | 0.21 | 0.51 | 10 | %resigned% |
| Entrance exam | Absences | -0.14 | 0.68 | 10 | %resigned% |
| Grade | Absences | -0.32 | 0.32 | 10 | %resigned% |

If that is correct, it means the data provided is insufficient to give a good estimate of whether a student graduates or resigns.

Factoring in the outliers again, the histogram seems to indicate a correlation (not a causality!) between the “Pääsykoe” score (roughly translated to “Entrance exam”) and whether students graduated or resigned. But, again with no causality implied, the distribution amongst those still attending is skewed to the left (higher count of Pääsykoe).

Checking the correlation data again, what was concluded before still seems valid: either there is no correlation, or the data set (statistical power) is insufficient.

| First column name | Second column name | Correlation value | p value | Degrees of freedom | Situation |
|---|---|---|---|---|---|
| Entrance exam | Grade | -0.02 | 0.86 | 56 | %graduated% |
| Entrance exam | Absences | 0.06 | 0.68 | 56 | %graduated% |
| Grade | Absences | -0.04 | 0.74 | 56 | %graduated% |
| Entrance exam | Grade | 0.27 | 0.17 | 26 | %resigned% |
| Entrance exam | Absences | 0.09 | 0.64 | 26 | %resigned% |
| Grade | Absences | -0.05 | 0.79 | 26 | %resigned% |

Here is the workflow:

Last but not least, I want to introduce you to an AI approach explained in this article containing a sample workflow. However, for more sophisticated approaches a properly sized data set is absolutely necessary. You could use data augmentation but … let’s not get too deep into the rabbit hole :wink:

Final note - Please “roast” me
I am happy to learn more and challenge myself each day. If there is someone reading this and checking my workflow, please feel free to “roast” me … constructively :wink:

Happy “kniming”
Mike

PS: I noticed, while enjoying a strong double espresso, that I made a few minor mistakes:

  1. Filtering absences of zero: fixing that increased the data set size a little bit
  2. Filtering only the bins 2 and 6 instead of 1-2 and 6-7: fixing that decreased the sample size again slightly

The correlations for the strictly filtered data set are as follows:

| First column name | Second column name | Correlation value | p value | Degrees of freedom | Situation |
|---|---|---|---|---|---|
| Entrance exam | Grade | 0.0 | 1.0 | 25 | %graduated% |
| Entrance exam | Absences | -0.08 | 0.7 | 25 | %graduated% |
| Entrance exam | Grade | -0.38 | 0.4 | 5 | %resigned% |
| Entrance exam | Absences | 0.0 | 1.0 | 5 | %resigned% |
| Grade | Absences | -0.1 | 0.61 | 25 | %graduated% |
| Grade | Absences | 0.42 | 0.35 | 5 | %resigned% |

There is now a moderate correlation involving absences among those who resigned (Grade vs. Absences: 0.42), but with low confidence (p = 0.35, 5 degrees of freedom).

Including the outliers again only dilutes the results:

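| First column name | Second column name | Correlation value | p value | Degrees of freedom | Situation |
|---|---|---|---|---|---|
| Entrance exam | Grade | 0.0 | 1.0 | 76 | %graduated% |
| Entrance exam | Absences | 0.13 | 0.24 | 76 | %graduated% |
| Entrance exam | Grade | 0.27 | 0.17 | 26 | %resigned% |
| Entrance exam | Absences | 0.09 | 0.64 | 26 | %resigned% |
| Grade | Absences | -0.02 | 0.88 | 76 | %graduated% |
| Grade | Absences | -0.05 | 0.79 | 26 | %resigned% |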

It seems, though, that the initial, albeit slightly wrong, results weren’t that far off, which brings me back to the fact that the data set is too small to give any estimate.

I updated my workflow and also enriched the data set with geo data and a loop to analyze all possible correlations. But, as mentioned before, the data set is too small.
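A loop over all column pairs per status group can be sketched as follows. The values are made up, the English column names are assumptions, and this is not a reproduction of the updated workflow.

```python
from itertools import combinations
from math import sqrt

# Made-up numeric columns per status group. Not the real data.
data = {
    "graduated": {
        "entrance_exam": [30, 28, 25, 27],
        "grade": [4, 3, 4, 5],
        "absences": [5, 12, 8, 7],
    },
    "resigned": {
        "entrance_exam": [18, 22, 20, 21],
        "grade": [2, 2, 3, 1],
        "absences": [40, 35, 50, 45],
    },
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Every unordered pair of columns, once per group.
results = []
for status, columns in data.items():
    for a, b in combinations(columns, 2):
        results.append((status, a, b, round(pearson_r(columns[a], columns[b]), 2)))

for row in results:
    print(*row)
```

With k numeric columns this produces k·(k−1)/2 pairs per group, so adding columns such as geo data increases the number of tests quickly while the sample size stays the same.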