Assignment 5: kNN, k-means clustering
Student
Course
Institution
Professor
Date
Q1: There are 7,043 observations (rows) and 7 variables (columns) in the raw data.
Q2: The variables and their types
- Gender (Categorical)
- Senior Citizen (Categorical)
- Tenure (Numeric)
- Service (Categorical)
- Monthly Charges (Numeric)
- Total Charges (Numeric)
- Churn (Categorical)
Q3: Senior Citizen is recorded as a 0/1 numeric flag but should be treated as categorical; Tenure, Monthly Charges, and Total Charges should remain numeric.
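A minimal pandas sketch of the corresponding type handling (an assumption: the column names match the raw Telco file, where TotalCharges is often read in as text because of blank entries):

import pandas as pd

# Treat the 0/1 senior-citizen flag as categorical rather than numeric
df['SeniorCitizen'] = df['SeniorCitizen'].astype('category')

# Keep the true numeric variables numeric; blanks in TotalCharges become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')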
Q4: The table of statistics for the numeric variables is as follows:
Variable | Missing Values | Min | Max | Median | Mean | Standard Deviation | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|---|
Tenure | 0 | 29 | 32.37 | 34.81 | 37.9 | 24.48 | 0.43 | -1.51 |
Monthly Charges | 0 | 116.8 | 70.35 | 64.76 | 72.5 | 30.09 | 0.19 | -1.38 |
Total Charges | 11 | 8694 | 4504.7 | 2283.3 | 537.90 | 2266.81 | 0.88 | -0.11 |
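For reference, a sketch of how such a summary table can be computed in pandas (column names are assumed to match the raw file, e.g. 'tenure', 'MonthlyCharges', 'TotalCharges'; pandas' skewness/kurtosis defaults may differ slightly from the tool that produced the table above):

import pandas as pd

num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
summary = pd.DataFrame({
    'Missing Values': df[num_cols].isna().sum(),
    'Min': df[num_cols].min(),
    'Max': df[num_cols].max(),
    'Median': df[num_cols].median(),
    'Mean': df[num_cols].mean(),
    'Standard Deviation': df[num_cols].std(),
    'Skewness': df[num_cols].skew(),
    'Kurtosis': df[num_cols].kurt(),
})
print(summary.round(2))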
Q5: The table of outlier counts for the numeric variables is as follows:
Variable | Outliers |
---|---|
Tenure | 0 |
Monthly Charges | 0 |
Total Charges | 3 |
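The outlier counts above were presumably produced by the tool's default rule; a common convention is the 1.5×IQR fence, sketched below (assuming the df and column names used earlier):

# Count values more than 1.5 IQRs outside the quartiles
def count_iqr_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

for col in ['tenure', 'MonthlyCharges', 'TotalCharges']:
    print(col, count_iqr_outliers(df[col].dropna()))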
Q6: The table of unique values and counts for the categorical variables is as follows:
Variable | Unique Values | Total Count |
---|---|---|
Gender | 2 | 7043 |
Senior Citizen | 2 | 7043 |
Service | 3 | 7043 |
Churn | 2 | 7043 |
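These counts can be reproduced in pandas with nunique() and count() (column names, including 'Service', are assumed; the count is the number of non-missing rows, which is why every entry is 7,043):

for col in ['gender', 'SeniorCitizen', 'Service', 'Churn']:
    # nunique() = distinct values, count() = non-missing rows
    print(col, df[col].nunique(), df[col].count())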
Q7: The missing values are imputed using the KNIME Missing Value node. The node replaces each missing entry according to a configurable per-column strategy (for example, the mean or median of a numeric column), filling in a plausible value for each affected row.
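A pandas equivalent of one reasonable configuration of that node, assuming median imputation for the numeric column with the 11 missing values:

# Fill the missing TotalCharges entries with the column median
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())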
Q8: Histograms for the numeric variables
import matplotlib.pyplot as plt

# Split the rows by churn status
churnCount_yes = df[df['Churn'] == 'Yes']
churnCount_no = df[df['Churn'] == 'No']

# Overlaid tenure histograms for churned vs. retained customers
plt.hist(data=churnCount_yes, x='tenure', label='Yes', color='orange', edgecolor='black')
plt.hist(data=churnCount_no, x='tenure', label='No', alpha=0.4, color='blue', edgecolor='black')
plt.legend(title='Churn')
plt.show()

churnCount_no.describe()
Q9: A bar chart for each of the categorical variables
# Count senior citizens overall and by gender (SeniorCitizen is coded 0/1)
citizens = df['SeniorCitizen'].sum()
male_citizens = df.loc[df['gender'] == 'Male', 'SeniorCitizen'].sum()
female_citizens = df.loc[df['gender'] == 'Female', 'SeniorCitizen'].sum()
print('Total senior citizens ->', citizens)
print('Total male senior citizens ->', male_citizens)
print('Total female senior citizens ->', female_citizens)

Output:
Total senior citizens -> 1142
Total male senior citizens -> 574
Total female senior citizens -> 568

# Pie chart of the senior-citizen split by gender
citizen_data = [male_citizens, female_citizens]
data_labels = ['Male', 'Female']
explode = (0, 0.01)
plt.pie(citizen_data, labels=data_labels, explode=explode, autopct='%1.2f%%')
plt.title('Senior Citizen')
plt.show()
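The code above summarizes senior citizens with a pie chart; since the question asks for a bar chart per categorical variable, here is a minimal matplotlib sketch (column names assumed to match the raw file):

import matplotlib.pyplot as plt

for col in ['gender', 'SeniorCitizen', 'Service', 'Churn']:
    # One bar per category, sized by its frequency
    df[col].value_counts().plot(kind='bar', edgecolor='black')
    plt.title(col)
    plt.ylabel('Count')
    plt.show()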
Q10(1) Model
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the preprocessed data into a DataFrame
data = pd.read_csv("path/to/your/preprocessed_data.csv")

# Separate the features (X) from the target label (y)
X = data.drop("churn", axis=1)
y = data["churn"]

# Split the data into training (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a kNN classifier (k = 5 here; the number of neighbors is tunable)
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn_model.fit(X_train, y_train)
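One caveat worth noting: kNN is distance-based, so feature scaling usually matters. A hedged variant of the same model with standardization, assuming the preprocessed features are all numeric (the pipeline wrapper is an addition, not part of the original code):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features before computing neighbor distances
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
print("Validation accuracy:", knn_scaled.score(X_val, y_val))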
Q10(2) Score the validation data (predict) using the model. Make a conditional boxplot that shows the prediction probability distributions for the true positive and true negative populations for comparison.
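A sketch of one way to produce this plot, assuming the knn_model, X_val, and y_val from Q10(1) and that the positive class is labeled 'Yes':

import matplotlib.pyplot as plt

# Predicted labels and P(churn = 'Yes') on the validation set
y_pred = knn_model.predict(X_val)
pos_idx = list(knn_model.classes_).index('Yes')
proba = knn_model.predict_proba(X_val)[:, pos_idx]
y_true = y_val.to_numpy()

# True positives: predicted 'Yes' and actually 'Yes'; true negatives: 'No' on both
tp_probs = proba[(y_pred == 'Yes') & (y_true == 'Yes')]
tn_probs = proba[(y_pred == 'No') & (y_true == 'No')]

plt.boxplot([tp_probs, tn_probs])
plt.xticks([1, 2], ['True positive', 'True negative'])
plt.ylabel("P(churn = 'Yes')")
plt.show()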
Q11: The k-means clustering algorithm produces two clusters of the data.
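A minimal scikit-learn sketch of this clustering step, assuming the numeric feature matrix X and DataFrame data from Q10 (standardizing before k-means is a reasonable choice, not something the assignment specifies):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize, then partition the rows into two clusters
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Split the data by cluster for the per-cluster models in Q12
cluster_0 = data[cluster_labels == 0]
cluster_1 = data[cluster_labels == 1]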
Q12(1) Cluster_0 Model
The number of neighbors (k) that minimizes the error rate is 10, as shown in the figure: the validation error rate drops sharply at small k and reaches its minimum at k = 10.
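A sketch of the k-sweep that produces such an error-rate curve (the variable names reuse the cluster-0 train/validation split, which is an assumption about the setup):

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

ks = range(1, 26)
error_rates = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Error rate = 1 - accuracy on the validation set
    error_rates.append(1 - model.score(X_val, y_val))

plt.plot(ks, error_rates, marker='o')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Validation error rate')
plt.show()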
Q12(2) Score the validation data (predict) using the model. Make a conditional boxplot that shows the prediction probability distributions for the true positive and true negative populations for comparison.
Q12(3) Confusion table
ROC curve
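Both outputs can be reproduced with scikit-learn, assuming the fitted cluster-0 model and its validation split:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, RocCurveDisplay

# Confusion table: rows = true classes, columns = predicted classes
print(confusion_matrix(y_val, knn_model.predict(X_val)))

# ROC curve from the model's validation-set probabilities
RocCurveDisplay.from_estimator(knn_model, X_val, y_val)
plt.show()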
Q12(4)
Accuracy
Accuracy = 1 - misclassification rate = (1960 - 40) / 1960 ≈ 0.98 or 98%
Precision
Precision = TP / (TP + FP) = 30 / (30 + 30) = 0.5
Misclassification rate
Misclassification Rate = (false positives + false negatives) / (total predictions) = (30 + 10) / 1960 ≈ 0.02 or 2%
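As a quick sanity check, the same arithmetic in Python, using the counts quoted above (TP = 30, FP = 30, FN = 10, 1,960 total validation predictions):

tp, fp, fn, total = 30, 30, 10, 1960  # counts read off the confusion table above

precision = tp / (tp + fp)              # 0.5
misclassification = (fp + fn) / total   # ~0.02
accuracy = 1 - misclassification        # ~0.98
print(precision, misclassification, accuracy)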
Q13(1)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the target variable
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

# Create a kNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# New observations to score (X_new must be defined before predicting)
X_new = np.array([[30.0, 17.5], [107.0, 24.1], [213.0, 10.9]])

# Predict and print the labels for X_new
y_pred = knn.predict(X_new)
print("Predictions: {}".format(y_pred))
Q13(2) Predictions: [0 1 0]. As we can see from the predictions, the model predicts that the first and third customers in the new array will not churn, while the second customer will churn.
plt.boxplot(churnCount_yes['MonthlyCharges'])
plt.title('Churn')
plt.show()

plt.boxplot(churnCount_no['MonthlyCharges'])
plt.title('NonChurn')
plt.show()
Q14
Comparing the metrics across all of the models shows that the kNN model built on all of the data before clustering has the highest accuracy and ROC AUC score. The kNN models trained on the individual clusters do not perform better than the kNN model trained on all of the data.