**Assignment 5: kNN, k-means clustering**

Student

Course

Institution

Professor

Date

**Q1:** There are 7,043 observations (rows) and 7 variables (columns) in the raw data.

**Q2:** The variables and their types

- Gender (Categorical)
- Senior Citizen (Categorical)
- Tenure (Numeric)
- Service (Categorical)
- Monthly Charges (Numeric)
- Total Charges (Numeric)
- Churn (Categorical)

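**Q3:** The numeric variable that should be treated as categorical is Senior Citizen, because it is stored as a 0/1 indicator. Tenure, Monthly Charges, and Total Charges are true numeric measurements and stay numeric.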

**Q4:** The table of statistics for the numeric variables

Variable | Missing Values | Min | Max | Median | Mean | Standard Deviation | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|---|
Tenure | 0 | 29 | 32.37 | 34.81 | 37.9 | 24.48 | 0.43 | -1.51 |
Monthly Charges | 0 | 116.8 | 70.35 | 64.76 | 72.5 | 30.09 | 0.19 | -1.38 |
Total Charges | 11 | 8694 | 4504.7 | 2283.3 | 537.90 | 2266.81 | 0.88 | -0.11 |

**Q5:** The table of outlier counts for the numeric variables is as follows:

Variable | Outliers |
---|---|
Tenure | 0 |
Monthly Charges | 0 |
Total Charges | 3 |
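The report does not say which rule produced these outlier counts. One common choice is the 1.5 x IQR rule, sketched below on made-up numbers. The function name `count_outliers`, the `whisker` parameter, and the sample values are illustrative assumptions, not the method used in the assignment.

```python
from statistics import quantiles

def count_outliers(values, whisker=1.5):
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    # quantiles() sorts the data and returns the three quartile cut points
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return sum(1 for v in values if v < lower or v > upper)

# Made-up charges with one extreme value (assumption, not the real data)
sample = [10, 12, 11, 13, 12, 11, 10, 12, 500]
print(count_outliers(sample))
```

A different whisker length or quartile method would change the counts, so the numbers in the table should only be compared with results from the same rule.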

**Q6:** The table of unique values and counts for the categorical variables is as follows:

Variable | Unique Values | Counts |
---|---|---|
Gender | 2 | 7043 |
Senior Citizen | 2 | 7043 |
Service | 3 | 7043 |
Churn | 2 | 7043 |

**Q7:** The missing values are imputed using the KNIME Missing Value node. For each column, this node replaces missing entries according to a configured strategy, such as the column mean, median, or most frequent value, so that every incomplete row receives a plausible value.
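Outside KNIME, the same step can be approximated in Python. The sketch below assumes median imputation, which is only one possible strategy and is not confirmed by the report. The helper `impute_median` and the sample column are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Made-up Total Charges column with two missing entries (assumption)
total_charges = [29.85, None, 108.15, 1840.75, None, 151.65]
print(impute_median(total_charges))
```

The median is a reasonable default for a right-skewed column such as Total Charges, because a few very large values pull the mean upward.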

**Q8:** Histograms for the numeric variables

    import matplotlib.pyplot as plt

    # Split the customers by churn outcome (df is the churn DataFrame loaded earlier)
    churnCount_yes = df[df['Churn'] == 'Yes']
    churnCount_no = df[df['Churn'] == 'No']

    # Overlay the tenure histograms of the two groups
    plt.hist(churnCount_yes['tenure'], label='Yes', color='orange', edgecolor='black')
    plt.hist(churnCount_no['tenure'], label='No', alpha=.4, color='blue', edgecolor='black')
    plt.legend()
    plt.show()

    churnCount_no.describe()

**Q9:** A bar chart for each of the categorical variables

    # Count senior citizens overall and by gender (SeniorCitizen is coded 0/1)
    Citizens = df['SeniorCitizen'].sum()
    male_citizens = df.loc[df['gender'] == 'Male', 'SeniorCitizen'].sum()
    female_citizens = df.loc[df['gender'] == 'Female', 'SeniorCitizen'].sum()

    print('Total senior citizen--> ', Citizens)
    print('Total male senior citizen--> ', male_citizens)
    print('Total female senior citizen--> ', female_citizens)

Total senior citizen--> 1142

Total male senior citizen--> 574

Total female senior citizen--> 568

    # Pie chart of senior citizens by gender
    citizen_data = [male_citizens, female_citizens]
    data_labels = ['Male', 'Female']
    explode = (0, 0.01)

    plt.pie(citizen_data, labels=data_labels, explode=explode, autopct='%1.2f%%')
    plt.title('Senior Citizen')
    plt.show()

**Q10(1)**

**Model**

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Load your preprocessed data into a DataFrame
    data = pd.read_csv("path/to/your/preprocessed_data.csv")

    # Separate features (X) and target label (y)
    X = data.drop("churn", axis=1)
    y = data["churn"]

    # Split data into training (70%) and validation (30%) sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create a kNN classifier; the number of neighbors can be tuned
    knn_model = KNeighborsClassifier(n_neighbors=5)

    # Train the model on the training data
    knn_model.fit(X_train, y_train)

**Q10(2) Score the validation data (predict) using the model. Make a conditional boxplot that shows the prediction probability distributions for the true positive and true negative populations for comparison.**
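To make the scoring step concrete, here is a minimal dependency-free sketch of what a kNN prediction probability is: the share of the k nearest training rows that belong to the positive class. The data, the single `tenure` feature, and the helper `knn_positive_probability` are made up for illustration. With uniform weights, scikit-learn's `predict_proba` reports the same kind of vote share.

```python
from statistics import median

def knn_positive_probability(train, query, k):
    """Share of the k nearest training points (1-D feature) labelled 1."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    return sum(label for _, label in nearest) / k

# Made-up (tenure, churn) pairs: short tenures churn (1), long tenures stay (0)
train = [(1, 1), (2, 1), (3, 1), (5, 0), (40, 0), (45, 0), (50, 0), (60, 0)]
validation = [(2, 1), (4, 1), (42, 0), (55, 0)]

# Group the predicted churn probability by the true label
by_true_label = {0: [], 1: []}
for tenure, label in validation:
    by_true_label[label].append(knn_positive_probability(train, tenure, k=3))

# Summarise each group the way a boxplot would (min, median, max)
for label, probs in by_true_label.items():
    print(label, min(probs), median(probs), max(probs))
```

With matplotlib available, passing `[by_true_label[0], by_true_label[1]]` to `plt.boxplot` would draw the two distributions side by side, which is the conditional boxplot the question asks for.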

**Q11(3) The k-means clustering algorithm produces two clusters of data (true negative)**
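As a rough illustration of how k-means arrives at two clusters, the sketch below runs Lloyd's algorithm on a single made-up feature. The function `kmeans_1d`, the starting centroids, and the values are assumptions for demonstration. The assignment itself may have used a different tool and more features.

```python
def kmeans_1d(values, start_centroids, iterations=10):
    """Lloyd's algorithm on 1-D data with user-supplied starting centroids."""
    centroids = list(start_centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each value
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

# Made-up monthly charges forming a low group and a high group
monthly_charges = [20, 22, 25, 80, 85, 90]
labels, centroids = kmeans_1d(monthly_charges, start_centroids=[20, 90])
print(labels, centroids)
```

Each row ends up with a cluster label (0 or 1), and a separate kNN model can then be trained on the rows of each cluster.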

**Q12(1) Cluster_0 Model**

The number of neighbors (k) that minimizes the error rate is 10, as shown in the figure. The output shows the kNN error rate dropping sharply to its minimum at this value.
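The search for the best k can be sketched as a loop that scores the validation data for several candidate values and keeps the one with the lowest error rate. The example below is a minimal pure-Python illustration on made-up data. The helpers `knn_predict` and `error_rate` and the candidate values are assumptions, not the procedure used to obtain k = 10.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k nearest training points (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, validation, k):
    """Fraction of validation rows whose predicted label is wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in validation)
    return wrong / len(validation)

# Made-up data with two noisy training labels, (3, 0) and (12, 1)
train = [(1, 1), (2, 1), (3, 0), (4, 1), (10, 0), (11, 0), (12, 1), (13, 0)]
validation = [(3, 1), (12, 0), (2, 1), (11, 0)]

# Odd k values avoid tied votes with two classes
errors = {k: error_rate(train, validation, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

When several values of k tie for the lowest error, `min` returns the first one tried, so this sketch prefers the smallest k.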

**Q12(2) Score the validation data (predict) using the model. Make a conditional boxplot that shows the prediction probability distributions for the true positive and true negative populations for comparison.**

**Q12(3) Confusion table**

**ROC curve**

**Q12(4)**

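**Accuracy**

Accuracy = 1 - misclassification rate = 1 - (30 + 10) / 1960 ≈ 0.98 or 98%

**Precision**

Precision = true positive / (true positive + false positive) = 30 / (30 + 30) = 0.5

**Misclassification rate**

Misclassification Rate = (false positive + false negative) / (total predictions)

= (30 + 10) / 1960

≈ 0.02 or 2%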
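The same arithmetic can be checked with a few lines of Python. This is a sketch that takes the counts used in the calculations above (30 false positives, 10 false negatives, 1960 predictions) and reads the 30 true positives off the precision calculation. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, total):
    """Accuracy, precision and misclassification rate from confusion counts."""
    tn = total - tp - fp - fn  # true negatives are whatever remains
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "misclassification_rate": (fp + fn) / total,
    }

# Counts as used in the report's calculations
metrics = classification_metrics(tp=30, fp=30, fn=10, total=1960)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and the misclassification rate always sum to 1, which gives a quick consistency check on the confusion table.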

**Q13(1)**

    # Import NumPy and KNeighborsClassifier
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Create arrays for the features and the target variable
    y = churn_df["churn"].values
    X = churn_df[["account_length", "customer_service_calls"]].values

    # Create a KNN classifier with 6 neighbors
    knn = KNeighborsClassifier(n_neighbors=6)

    # Fit the classifier to the data
    knn.fit(X, y)

    # New observations to classify (defined before they are used)
    X_new = np.array([[30.0, 17.5], [107.0, 24.1], [213.0, 10.9]])

    # Predict the labels for X_new
    y_pred = knn.predict(X_new)

    # Print the predictions for X_new
    print("Predictions: {}".format(y_pred))

**Q13(2)** Predictions: [0 1 0]. The model predicts that the first and third customers in the new array will not churn and that the second customer will churn.

    # Boxplot of monthly charges for customers who churned
    plt.boxplot(churnCount_yes['MonthlyCharges'])
    plt.title('Churn')
    plt.show()

    # Boxplot of monthly charges for customers who did not churn
    plt.boxplot(churnCount_no['MonthlyCharges'])
    plt.title('NonChurn')
    plt.show()

**Q14:** The comparison of metrics across all the models shows that the kNN model built on all the data before clustering has the highest accuracy and ROC AUC score. The kNN models built on the individual clusters do not outperform it.