Assignment 5: kNN, k-means clustering

Q1: There are 7,043 observations (rows) and 7 variables (columns) in the raw data.

Q2: The variables and their types

  • Gender (Categorical)
  • Senior Citizen (Categorical)
  • Tenure (Numeric)
  • Service (Categorical)
  • Monthly Charges (Numeric)
  • Total Charges (Numeric)
  • Churn (Categorical)

Q3: The numeric variable that should be treated as categorical is Senior Citizen, which is stored as 0/1. Tenure, Monthly Charges, and Total Charges are genuinely numeric.

Q4: The table of statistics for the numeric variables

Variable | Missing Values | Min | Max | Median | Mean | Standard Deviation | Skewness | Kurtosis
Tenure | 0 | 29 | 32.37 | 34.81 | 37.9 | 24.48 | 0.43 | -1.51
Monthly Charges | 0 | 116.8 | 70.35 | 64.76 | 72.5 | 30.09 | 0.19 | -1.38
Total Charges | 11 | 8694 | 4504.7 | 2283.3 | 537.90 | 2266.81 | 0.88 | -0.11
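The columns of this table can be recomputed outside KNIME with a few lines of Python. This is only a minimal sketch: it assumes population (divide-by-n) moments for skewness and excess kurtosis (normal distribution = 0), which may differ from the convention used by the tool that produced the table. The function name `describe_numeric` and the sample values are hypothetical.

```python
import statistics


def describe_numeric(values):
    """Summarise a numeric column; None entries count as missing."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    # Population central moments (dividing by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    return {
        "missing": len(values) - n,
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "mean": mean,
        "std": statistics.stdev(present),  # sample standard deviation
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis
    }


# Hypothetical values, not taken from the churn data.
summary = describe_numeric([1, 2, 3, 4, 5, None])
print(summary)
```

A symmetric sample like this one has zero skewness, which is a quick sanity check before running the function on Tenure, Monthly Charges, and Total Charges.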

Q5: The table of outlier counts for the numeric variables is as follows:

Variable | Outliers
Tenure | 0
Monthly Charges | 0
Total Charges | 3
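The post does not say which rule was used to count outliers. As one common possibility, the sketch below assumes the 1.5 x IQR fence rule; `count_iqr_outliers` and the sample values are hypothetical, and a different rule (for example, a z-score threshold) would give different counts.

```python
import statistics


def count_iqr_outliers(values, whisker=1.5):
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return sum(1 for v in values if v < lower or v > upper)


# Hypothetical charges: one value sits far above the rest.
sample = [20, 22, 25, 27, 30, 31, 33, 35, 400]
n_outliers = count_iqr_outliers(sample)
print(n_outliers)
```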

Q6: The table of unique values and counts for the categorical variables is as follows:

Variable | Unique Values | Counts
Gender | 2 | 7043
Senior Citizen | 2 | 7043
Service | 3 | 7043
Churn | 2 | 7043

Q7: The missing values are imputed using the KNIME Missing Value node. This node replaces each missing value according to the strategy configured for the column or data type, for example the mean or median for numeric columns or the most frequent value for categorical ones.

Q8: Histograms for the numeric variables

import pandas as pd
import matplotlib.pyplot as plt

# Split the rows by churn status (df is the loaded churn DataFrame)
churnCount_yes = df[df['Churn'] == 'Yes']
churnCount_no = df[df['Churn'] == 'No']

# Overlay the tenure histograms of churners and non-churners
plt.hist(data=churnCount_yes, x='tenure', label='Yes', color='orange', edgecolor='black')
plt.hist(data=churnCount_no, x='tenure', label='No', alpha=.4, color='blue', edgecolor='black')
plt.legend()

churnCount_no.describe()

Q9: A bar chart for each of the categorical variables

# SeniorCitizen is coded 0/1, so summing it counts the senior citizens
Citizens = df['SeniorCitizen'].sum()
male_citizens = df.loc[df['gender'] == 'Male', 'SeniorCitizen'].sum()
female_citizens = df.loc[df['gender'] == 'Female', 'SeniorCitizen'].sum()

print('Total senior citizen--> ', Citizens)
print('Total male senior citizen--> ', male_citizens)
print('Total female senior citizen--> ', female_citizens)

Total senior citizen--> 1142

Total male senior citizen--> 574

Total female senior citizen--> 568

citizen_data = [male_citizens, female_citizens]
data_labels = ['Male', 'Female']
explode = (0, 0.01)

# Pie chart of senior citizens by gender, with the second slice slightly offset
plt.pie(citizen_data, labels=data_labels, explode=explode, autopct='%1.2f%%')
plt.title('Senior Citizen')

Q10(1)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load your preprocessed data into a DataFrame
data = pd.read_csv("path/to/your/preprocessed_data.csv")

# Separate features (X) and target label (y)
X = data.drop("churn", axis=1)
y = data["churn"]

# Split data into training (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a kNN classifier (you can choose the number of neighbors as you prefer)
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn_model.fit(X_train, y_train)

Q10(2) Score the validation data (predict) using the model. Make a conditional boxplot that shows the prediction probability distributions for the true positive and true negative populations for comparison.
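One way to prepare such a conditional boxplot is to split the predicted churn probabilities by the true class and draw one box per group. The sketch below reads "true positive" and "true negative" as the actually-churned and actually-not-churned rows, which is an assumption about the assignment's wording. It also assumes `proba` holds the predicted probability of the positive class (for a scikit-learn classifier, the second column of `predict_proba`). The function names and sample values are hypothetical.

```python
def split_by_true_class(y_true, proba, positive=1):
    """Return (probabilities of actual positives, probabilities of actual negatives)."""
    pos = [p for y, p in zip(y_true, proba) if y == positive]
    neg = [p for y, p in zip(y_true, proba) if y != positive]
    return pos, neg


def plot_conditional_boxplot(y_true, proba, positive=1):
    """Draw one box per true class; needs matplotlib, imported only when called."""
    import matplotlib.pyplot as plt

    pos, neg = split_by_true_class(y_true, proba, positive)
    fig, ax = plt.subplots()
    ax.boxplot([pos, neg])
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Actual churn", "Actual non-churn"])
    ax.set_ylabel("Predicted probability of churn")
    return fig


# Hypothetical validation labels and scores.
y_val_example = [1, 0, 1, 0, 0]
proba_example = [0.8, 0.2, 0.6, 0.4, 0.0]
pos, neg = split_by_true_class(y_val_example, proba_example)
print(pos, neg)
```

A well-separated model shows the "Actual churn" box sitting clearly above the "Actual non-churn" box.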

Q11(3) The k-means clustering algorithm produces two clusters of data (true negative).
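For readers who want to see what a two-cluster k-means run does, here is a dependency-free sketch of Lloyd's algorithm. It is not the KNIME k-Means node's implementation: the deterministic farthest-first initialisation, the function name `kmeans`, and the sample points are all assumptions made for illustration.

```python
import math


def kmeans(points, k=2, n_iter=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Deterministic farthest-first start: the first row, then repeatedly
    # the row farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical two-feature rows forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On real churn data the features should be scaled first, since Euclidean distance is dominated by whichever column has the largest range.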

Q12(1) Cluster_0 Model

The number of neighbors (k) that minimizes the error rate is 10, as shown in the figure. The error rate of the kNN model drops sharply at this value, as shown in the output.
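Choosing k this way amounts to scoring the validation set for each candidate k and keeping the one with the lowest error rate. The sketch below illustrates the idea with a small, dependency-free kNN. The helper names and the toy data are hypothetical, and it is not the procedure used to produce the figure.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda row: math.dist(row[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate_by_k(train_X, train_y, val_X, val_y, k_values):
    """Validation misclassification rate for each candidate k."""
    rates = {}
    for k in k_values:
        wrong = sum(
            knn_predict(train_X, train_y, x, k) != y for x, y in zip(val_X, val_y)
        )
        rates[k] = wrong / len(val_y)
    return rates


# Hypothetical one-feature data; the row at 2.4 is a deliberately noisy label.
train_X = [(1.0,), (2.0,), (2.4,), (3.0,), (7.0,), (8.0,), (9.0,)]
train_y = [0, 0, 1, 0, 1, 1, 1]
val_X = [(2.5,), (7.5,)]
val_y = [0, 1]

rates = error_rate_by_k(train_X, train_y, val_X, val_y, k_values=[1, 3, 5])
best_k = min(rates, key=rates.get)  # first k with the lowest error rate
print(rates, best_k)
```

With k = 1 the noisy neighbor at 2.4 causes a misclassification, while larger k values outvote it. This is the kind of drop in error rate the figure describes.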

Q12(2) Score the validation data (predict) using the model. Make a conditional boxplot that shows the prediction probability distributions for the true positive and true negative populations for comparison.

Q12(3) Confusion table

ROC curve

Precision = TP / (TP + FP) = 30 / (30 + 30) = 0.5

Misclassification rate

Misclassification rate = (false positives + false negatives) / (total predictions) = 0.02, or 2%
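Both metrics follow directly from the four cells of the confusion table. The sketch below recomputes the precision from the counts given above (30 true positives, 30 false positives). The counts passed to `misclassification_rate` are hypothetical, because the full confusion table is not reproduced in this post, so that result is not meant to match the 2% figure.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def misclassification_rate(tp, tn, fp, fn):
    """Share of all predictions that are wrong."""
    return (fp + fn) / (tp + tn + fp + fn)


# Counts from the precision calculation above.
p = precision(tp=30, fp=30)

# Hypothetical confusion-table counts, for illustration only.
m = misclassification_rate(tp=30, tn=130, fp=30, fn=10)
print(p, m)
```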


# Import NumPy and KNeighborsClassifier
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the target variable
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# New observations to score (defined before they are used)
X_new = np.array([[30.0, 17.5], [107.0, 24.1], [213.0, 10.9]])

# Predict the labels for X_new
y_pred = knn.predict(X_new)

# Print the predictions for X_new
print("Predictions: {}".format(y_pred))

Q13(2) Predictions: [0 1 0]. The model predicts that the first and third customers in the new array will not churn, and that the second customer will churn.



plt.boxplot(churnCount_no['MonthlyCharges'])
plt.title('NonChurn')
plt.show()


The comparison of the metrics across all the models shows that the model built on all the data before clustering has the highest accuracy and ROC AUC score. The kNN models built on the individual clusters do not perform better than the kNN model built on all the data.

I’m… not sure what is happening here. :sweat_smile:

You’ve pasted a homework assignment, I gather? Do you have a particular question you need help with?

(Is this specifically related to KNIME at all?)

