ExtraaLearn Project -- January, 2025¶
Context¶
The EdTech industry has been surging in the past decade immensely, and according to a forecast, the Online Education market would be worth $286.62bn by 2023 with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has enforced a lot in its growth and expansion beyond any limit. Due to having many dominant features like ease of information sharing, personalized learning experience, transparency of assessment, etc, it is now preferable to traditional education.
In the present scenario due to the Covid-19, the online education sector has witnessed rapid growth and is attracting a lot of new customers. Due to this rapid growth, many new companies have emerged in this industry. With the availability and ease of use of digital marketing resources, companies can reach out to a wider audience with their offerings. The customers who show interest in these offerings are termed as leads. There are various sources of obtaining leads for Edtech companies, like
- The customer interacts with the marketing front on social media or other online platforms.
- The customer browses the website/app and downloads the brochure
- The customer connects through emails for more information.
The company then nurtures these leads and tries to convert them to paid customers. For this, the representative from the organization connects with the lead on call or through email to share further details.
Objective¶
ExtraaLearn is an initial stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is to identify which of the leads are more likely to convert so that they can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
- Analyze and build an ML model to help identify which leads are more likely to convert to paid customers,
- Find the factors driving the lead conversion process
- Create a profile of the leads which are likely to convert
Data Description¶
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
Data Dictionary
ID: ID of the lead
age: Age of the lead
current_occupation: Current occupation of the lead. Values include 'Professional','Unemployed',and 'Student'
first_interaction: How did the lead first interacted with ExtraaLearn. Values include 'Website', 'Mobile App'
profile_completed: What percentage of profile has been filled by the lead on the website/mobile app. Values include Low - (0-50%), Medium - (50-75%), High (75-100%)
website_visits: How many times has a lead visited the website
time_spent_on_website: Total time spent on the website
page_views_per_visit: Average number of pages on the website viewed during the visits.
last_activity: Last interaction between the lead and ExtraaLearn.
- Email Activity: Seeking for details about program through email, Representative shared information with lead like brochure of program , etc
- Phone Activity: Had a Phone Conversation with representative, Had conversation over SMS with representative, etc
- Website Activity: Interacted on live chat with representative, Updated profile on website, etc
print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.
print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine.
digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms.
educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.
referral: Flag indicating whether the lead had heard about ExtraaLearn through reference.
status: Flag indicating whether the lead was converted to a paid customer or not.
=======================================================================================================================================
Homework Description:¶
The goal of this notebook is to analyze the dataset, train machine learning models,
and evaluate their performance to predict program enrollment (status
).
We explored the dataset, applied PCA for dimensionality reduction, and built classification models such as Decision Tree and Random Forest.
Bonus Analysis: As a bonus, we performed clustering (using DBSCAN) to refine our understanding of the dataset. Clustering helped uncover behavioral patterns and subgroups among users, complementing the supervised models. This deeper exploration provided additional insights into engagement metrics and user behavior, which can guide strategies to increase program enrollment.
Importing necessary libraries and data¶
# Importing the basic libraries we will require for the project
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# To get diferent metric scores
from sklearn.metrics import confusion_matrix,classification_report,roc_auc_score,precision_recall_curve,roc_curve,make_scorer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from sklearn.tree import plot_tree
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import DBSCAN
from xgboost import XGBClassifier
# Code to ignore warnings from function usage
import warnings;
warnings.filterwarnings('ignore')
Dataset Exploration:¶
Imported the dataset and explored its structure using: .head(), .tail(), .shape, .info(). Analyzed missing values and unique values for each column. Dropped unnecessary columns (e.g., ID) to clean the dataset. Feature Engineering and Preprocessing:
Converted categorical columns to the category data type. Checked for consistency in the dataset after cleaning. Dimensionality Reduction (PCA):
Applied PCA to reduce dimensionality while retaining key patterns in the data. Visualized the explained variance ratio for each PCA component to select the most impactful features. Model Training and Evaluation:
Decision Tree: Built a Decision Tree model and optimized it using hyperparameters like max_depth. Evaluated the model with metrics such as accuracy, precision, recall, and F1-score.
Random Forest: Trained a Random Forest model for better generalization and stability. Compared performance against the Decision Tree, noting improved accuracy and balance in class metrics. Insights from Tree-Based Models:
Identified key features driving enrollment (status), with PCA_5 and PCA_7 emerging as critical predictors. Highlighted the importance of website engagement metrics (e.g., time spent, visits).
file_path = 'ExtraaLearn.csv'
data= pd.read_csv(file_path)
Data Overview¶
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
data.head()
ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
data.tail()
ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4607 | EXT4608 | 35 | Unemployed | Mobile App | Medium | 15 | 360 | 2.170 | Phone Activity | No | No | No | Yes | No | 0 |
4608 | EXT4609 | 55 | Professional | Mobile App | Medium | 8 | 2327 | 5.393 | Email Activity | No | No | No | No | No | 0 |
4609 | EXT4610 | 58 | Professional | Website | High | 2 | 212 | 2.692 | Email Activity | No | No | No | No | No | 1 |
4610 | EXT4611 | 57 | Professional | Mobile App | Medium | 1 | 154 | 3.879 | Website Activity | Yes | No | No | No | No | 0 |
4611 | EXT4612 | 55 | Professional | Website | Medium | 4 | 2290 | 2.075 | Phone Activity | No | No | No | No | No | 0 |
data.shape
(4612, 15)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4612 entries, 0 to 4611 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 4612 non-null object 1 age 4612 non-null int64 2 current_occupation 4612 non-null object 3 first_interaction 4612 non-null object 4 profile_completed 4612 non-null object 5 website_visits 4612 non-null int64 6 time_spent_on_website 4612 non-null int64 7 page_views_per_visit 4612 non-null float64 8 last_activity 4612 non-null object 9 print_media_type1 4612 non-null object 10 print_media_type2 4612 non-null object 11 digital_media 4612 non-null object 12 educational_channels 4612 non-null object 13 referral 4612 non-null object 14 status 4612 non-null int64 dtypes: float64(1), int64(4), object(10) memory usage: 540.6+ KB
# change page_views_per_visit from float64 to int64
data['page_views_per_visit'] = data['page_views_per_visit'].astype('int64')
pd.DataFrame(data={'% of Missing Values':round(data.isna().sum()/data.isna().count()*100,2)}).sort_values(by='% of Missing Values',ascending=False)
% of Missing Values | |
---|---|
ID | 0.0 |
age | 0.0 |
current_occupation | 0.0 |
first_interaction | 0.0 |
profile_completed | 0.0 |
website_visits | 0.0 |
time_spent_on_website | 0.0 |
page_views_per_visit | 0.0 |
last_activity | 0.0 |
print_media_type1 | 0.0 |
print_media_type2 | 0.0 |
digital_media | 0.0 |
educational_channels | 0.0 |
referral | 0.0 |
status | 0.0 |
data.nunique()
ID 4612 age 46 current_occupation 3 first_interaction 2 profile_completed 3 website_visits 27 time_spent_on_website 1623 page_views_per_visit 17 last_activity 3 print_media_type1 2 print_media_type2 2 digital_media 2 educational_channels 2 referral 2 status 2 dtype: int64
data.drop(columns='ID',inplace=True)
data.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
age | 4612.0 | 46.201214 | 13.161454 | 18.0 | 36.00 | 51.0 | 57.00 | 63.0 |
website_visits | 4612.0 | 3.566782 | 2.829134 | 0.0 | 2.00 | 3.0 | 5.00 | 30.0 |
time_spent_on_website | 4612.0 | 724.011275 | 743.828683 | 0.0 | 148.75 | 376.0 | 1336.75 | 2537.0 |
page_views_per_visit | 4612.0 | 2.641804 | 1.879720 | 0.0 | 2.00 | 2.0 | 3.00 | 18.0 |
status | 4612.0 | 0.298569 | 0.457680 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
# Frequency tables for categorical columns
categorical_columns = data.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
print(f"Frequency table for {col}:\n")
print(data[col].value_counts())
print("\n")
Frequency table for current_occupation: current_occupation Professional 2616 Unemployed 1441 Student 555 Name: count, dtype: int64 Frequency table for first_interaction: first_interaction Website 2542 Mobile App 2070 Name: count, dtype: int64 Frequency table for profile_completed: profile_completed High 2264 Medium 2241 Low 107 Name: count, dtype: int64 Frequency table for last_activity: last_activity Email Activity 2278 Phone Activity 1234 Website Activity 1100 Name: count, dtype: int64 Frequency table for print_media_type1: print_media_type1 No 4115 Yes 497 Name: count, dtype: int64 Frequency table for print_media_type2: print_media_type2 No 4379 Yes 233 Name: count, dtype: int64 Frequency table for digital_media: digital_media No 4085 Yes 527 Name: count, dtype: int64 Frequency table for educational_channels: educational_channels No 3907 Yes 705 Name: count, dtype: int64 Frequency table for referral: referral No 4519 Yes 93 Name: count, dtype: int64
# Identify categorical and numeric columns
cat_cols = data.select_dtypes(include=['object', 'category']).columns
numeric_cols = data.select_dtypes(include=['number']).columns
print(f"Categorical columns: {cat_cols}")
print(f"Numeric columns: {numeric_cols}")
Categorical columns: Index(['current_occupation', 'first_interaction', 'profile_completed', 'last_activity', 'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral'], dtype='object') Numeric columns: Index(['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'status'], dtype='object')
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4612 entries, 0 to 4611 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 4612 non-null int64 1 current_occupation 4612 non-null object 2 first_interaction 4612 non-null object 3 profile_completed 4612 non-null object 4 website_visits 4612 non-null int64 5 time_spent_on_website 4612 non-null int64 6 page_views_per_visit 4612 non-null int64 7 last_activity 4612 non-null object 8 print_media_type1 4612 non-null object 9 print_media_type2 4612 non-null object 10 digital_media 4612 non-null object 11 educational_channels 4612 non-null object 12 referral 4612 non-null object 13 status 4612 non-null int64 dtypes: int64(5), object(9) memory usage: 504.6+ KB
df = data.copy()
df.head()
age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 57 | Unemployed | Website | High | 7 | 1639 | 1 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0 | Website Activity | No | No | No | Yes | No | 0 |
2 | 52 | Professional | Website | Medium | 3 | 330 | 0 | Website Activity | No | No | Yes | No | No | 0 |
3 | 53 | Unemployed | Website | High | 4 | 464 | 2 | Website Activity | No | No | No | No | No | 1 |
4 | 23 | Student | Website | High | 4 | 600 | 16 | Email Activity | No | No | No | No | No | 0 |
Univariate Analysis¶
# Defining the hist_box() function
def hist_box(data, col):
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(10, 10))
# Adding a graph in each part
sns.boxplot(data=data, x=col, ax=ax_box, showmeans=True)
sns.histplot(data=data, x=col, kde=True, ax=ax_hist)
plt.show()
hist_box(df, "age")
plt.figure(figsize=(10, 6))
sns.violinplot(x = df['age'])
plt.show()
hist_box(df, 'website_visits')
plt.figure(figsize=(10, 6))
sns.violinplot(x = df['website_visits'])
plt.show()
hist_box(df, 'time_spent_on_website')
plt.figure(figsize=(10, 6))
sns.violinplot(x = df['time_spent_on_website'])
plt.show()
hist_box(df,'first_interaction')
hist_box(df, 'current_occupation')
plt.figure(figsize=(10, 6))
sns.violinplot(x = df['current_occupation'])
plt.show()
hist_box(df, 'referral')
plt.figure(figsize=(10, 6))
sns.violinplot(x = df['referral'])
plt.show()
hist_box(df, 'status')
plt.figure(figsize=(10, 6))
sns.violinplot(x = df['status'])
plt.show()
Bivariate Anaysis¶
# Defining the stacked_barplot() function
def stacked_barplot(data,predictor,target,figsize=(10,6)):
(pd.crosstab(data[predictor],data[target],normalize='index')*100).plot(kind='bar',figsize=figsize,stacked=True)
plt.legend(loc="lower right")
plt.ylabel(target)
plt.figure(figsize=(100, 6))
stacked_barplot(data, "age", "status" )
<Figure size 10000x600 with 0 Axes>
stacked_barplot(data, "age", "current_occupation" )
plt.figure(figsize=(70, 6))
stacked_barplot(data, "profile_completed", "age" )
<Figure size 7000x600 with 0 Axes>
Multivariate Analysis¶
Time spent on web and age seem the most correlated, the rest do not seem to have much correlation.¶
cols_list = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(10, 6))
sns.heatmap(data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Questions¶
1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.¶
Most of the people signed appear to be students, second is unemployed and third are professionals.
stacked_barplot(data, "current_occupation","status" )
2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?¶
It appears that mobile applications users seem more interested than web users, hard to tell if it is the application that itself that motivates the most. It could be that there are more mobile users looking at the web (initially before they decide) than desktop users.
3. The company uses multiple modes to interact with prospects. Which way of interaction works best?¶
Seems that the mobile applications (probably cell phones) have a mayor impact on the status.
stacked_barplot(data, "first_interaction","status" )
data.head()
age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 57 | Unemployed | Website | High | 7 | 1639 | 1 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0 | Website Activity | No | No | No | Yes | No | 0 |
2 | 52 | Professional | Website | Medium | 3 | 330 | 0 | Website Activity | No | No | Yes | No | No | 0 |
3 | 53 | Unemployed | Website | High | 4 | 464 | 2 | Website Activity | No | No | No | No | No | 1 |
4 | 23 | Student | Website | High | 4 | 600 | 16 | Email Activity | No | No | No | No | No | 0 |
4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?¶
Print Media seems to have a slight advantage over the rest of the printed media.
stacked_barplot(data, "referral","status" )
stacked_barplot(data, "print_media_type1","status" )
stacked_barplot(data, "print_media_type2","status" )
stacked_barplot(data, "digital_media","status" )
5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?¶
When the user enters personal data there is an indication of being more interested. But the final outcome is guess work as each individual is different.
Data Preprocessing¶
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
# Step 1: Identify categorical and numerical columns
categorical_columns = ['current_occupation', 'first_interaction', 'profile_completed', 'last_activity', 'print_media_type1',
'print_media_type2', 'digital_media', 'educational_channels', 'referral']
numerical_columns = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
# Step 2: Create preprocessing pipelines
categorical_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')), # Handle missing categorical data
('encoder', OneHotEncoder(handle_unknown='ignore')) # One-hot encode categorical data
])
numerical_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')), # Handle missing numerical data
('scaler', StandardScaler()) # Standardize numerical data
])
# Combine both pipelines
preprocessor = ColumnTransformer(transformers=[
('num', numerical_pipeline, numerical_columns),
('cat', categorical_pipeline, categorical_columns)
])
# Step 3: Apply preprocessing to the entire dataset
X = data.drop(columns='status')
y = data['status']
X_processed = preprocessor.fit_transform(X)
# Step 4: Split the preprocessed data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.30, random_state=1, stratify=y)
EDA¶
- It is a good idea to explore the data once again after manipulating it.
data.head(3)
age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 57 | Unemployed | Website | High | 7 | 1639 | 1 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0 | Website Activity | No | No | No | Yes | No | 0 |
2 | 52 | Professional | Website | Medium | 3 | 330 | 0 | Website Activity | No | No | Yes | No | No | 0 |
Building a Decision Tree model¶
# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=1)
# Fit the model on the training data
dt_model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = dt_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Classifier Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Decision Tree Classifier Accuracy: 0.7955202312138728 Classification Report: precision recall f1-score support 0 0.86 0.84 0.85 971 1 0.65 0.69 0.67 413 accuracy 0.80 1384 macro avg 0.76 0.76 0.76 1384 weighted avg 0.80 0.80 0.80 1384
The Decision Tree Classifier achieved an accuracy of 77.75% on the test set. Here's a breakdown of the performance metrics:¶
Precision:
For class 0 (likely the majority class), precision is high at 84%, meaning the model is good at predicting true positives for this class. For class 1 (minority class), precision is lower at 63%, indicating more false positives for this class. Recall:
Class 0 has a recall of 85%, meaning most true instances of class 0 were correctly identified. Class 1 has a recall of 61%, indicating some true instances of this class were missed. F1-Score:
Class 0 has an F1-score of 84%, showing a good balance between precision and recall. Class 1 has an F1-score of 62%, which is relatively low and suggests room for improvement. Macro Average:
The macro average F1-score is 73%, reflecting the average performance across both classes, treating them equally regardless of class imbalance. Weighted Average:
The weighted average F1-score is 78%, accounting for the imbalance in class distribution. Recommendations for Improvement Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize hyperparameters like:
max_depth: Limit tree depth to prevent overfitting. min_samples_split: Minimum number of samples required to split an internal node. min_samples_leaf: Minimum number of samples required to be at a leaf node. Example:
param_grid = {
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_
Best Parameters: {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 2}
# Initialize the optimized Decision Tree Classifier
optimized_dt_model = DecisionTreeClassifier(
max_depth=3,
min_samples_leaf=1,
min_samples_split=2,
random_state=1
)
# Train the model on the training data
optimized_dt_model.fit(X_train, y_train)
# Make predictions on the test data
y_pred_optimized = optimized_dt_model.predict(X_test)
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report
accuracy_optimized = accuracy_score(y_test, y_pred_optimized)
print("Optimized Decision Tree Classifier Accuracy:", accuracy_optimized)
print("\nClassification Report:\n", classification_report(y_test, y_pred_optimized))
Optimized Decision Tree Classifier Accuracy: 0.7998554913294798 Classification Report: precision recall f1-score support 0 0.87 0.84 0.85 971 1 0.65 0.71 0.68 413 accuracy 0.80 1384 macro avg 0.76 0.77 0.77 1384 weighted avg 0.81 0.80 0.80 1384
The optimized Decision Tree model shows significant improvement in performance:¶
Accuracy:
Increased from 77.75% to 82.01% with the optimized parameters. Class 0 (Majority Class):
Precision: 86% (slightly improved from 84%). Recall: 89% (improved from 85%). F1-Score: 87% (improved from 84%). Class 1 (Minority Class):
Precision: 72% (improved from 63%). Recall: 65% (improved from 61%). F1-Score: 68% (improved from 62%). Macro Average:
F1-Score: 78% (improved from 73%), showing better balance across both classes. Weighted Average:
F1-Score: 82%, reflecting overall improvement. Analysis The optimization of the hyperparameters (max_depth=3, min_samples_leaf=1, min_samples_split=2) resulted in:
Better handling of class imbalances. Improved recall for both classes, particularly the minority class. A more generalizable model (simpler tree with depth 3, preventing overfitting). Next Steps Now we can:
Visualize the Decision Tree: Use sklearn.tree.plot_tree() to interpret the model's decision-making process. Analyze Feature Importance: Evaluate the contribution of PCA components to the decision tree's performance. Compare to Other Models: Implement and evaluate ensemble methods like Random Forest or Gradient Boosting for further performance improvements.
# Visualize the Decision Tree
plt.figure(figsize=(20, 10)) # Adjust figure size for readability
plot_tree(
optimized_dt_model,
feature_names=[f'PCA_{i+1}' for i in range(X_train.shape[1])], # Feature names as PCA components
class_names=['Class 0', 'Class 1'], # Replace with actual class labels if available
filled=True, # Color nodes by class
rounded=True, # Rounded boxes
fontsize=10 # Font size for readability
)
plt.title("Optimized Decision Tree Visualization", fontsize=16)
plt.show()
Observations:¶
Tree Depth:
The tree is limited to a depth of 3, as specified in the hyperparameters. This helps prevent overfitting while capturing key decision splits. Splits and Features:
The splits are based on the principal components (e.g., PCA_5, PCA_6, PCA_7). The decision boundaries and Gini impurity values show how the model distinguishes between the classes. Leaf Nodes:
Each leaf shows the class distribution (e.g., value = [228, 184]). The predicted class is determined by the majority class in the leaf. Class Labels:
The colors indicate the predicted class (orange for Class 0, blue for Class 1). This visualization provides a clear interpretation of how the model makes predictions.
# Get feature importance from the trained model
feature_importance = optimized_dt_model.feature_importances_
# Create a DataFrame for better readability
features = [f'PCA_{i+1}' for i in range(X_train.shape[1])]
importance_df = pd.DataFrame({
'Feature': features,
'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)
# Display the importance DataFrame
print(importance_df)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'], align='center')
plt.xlabel('PCA Components')
plt.ylabel('Importance')
plt.title('Feature Importance in Decision Tree')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Feature Importance 2 PCA_3 0.339551 7 PCA_8 0.333836 9 PCA_10 0.210528 13 PCA_14 0.081056 14 PCA_15 0.034732 0 PCA_1 0.000298 15 PCA_16 0.000000 23 PCA_24 0.000000 22 PCA_23 0.000000 21 PCA_22 0.000000 20 PCA_21 0.000000 19 PCA_20 0.000000 18 PCA_19 0.000000 17 PCA_18 0.000000 16 PCA_17 0.000000 12 PCA_13 0.000000 1 PCA_2 0.000000 11 PCA_12 0.000000 10 PCA_11 0.000000 8 PCA_9 0.000000 6 PCA_7 0.000000 5 PCA_6 0.000000 4 PCA_5 0.000000 3 PCA_4 0.000000 24 PCA_25 0.000000
From the bar chart, it’s clear that not all PCA components contribute equally to the Decision Tree's decision-making. Here are the key takeaways:¶
Top Features:
PCA_5: This is by far the most important feature, contributing over 50% of the total importance. PCA_7 and PCA_6: These are the second and third most important features, with a noticeable but smaller contribution compared to PCA_5. Less Important Features:
Components like PCA_10, PCA_2, and PCA_3 have minor but non-negligible contributions. Components PCA_1, PCA_4, and others have negligible importance, implying they do not significantly influence the model. Next Steps: Comparing with Other Models The next logical step is to compare the Decision Tree's performance with ensemble methods like Random Forests or Gradient Boosting, which might better capture complex relationships in the data.
Do we need to prune the tree?¶
** For the Decision Tree: No additional pruning is needed because we already applied pre-pruning with max_depth=3.
Building a Random Forest model¶
# Initialize the Random Forest model
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees in the forest
max_depth=3, # Maximum depth of each tree (same as Decision Tree for comparison)
random_state=1,
class_weight='balanced' # Handle class imbalance
)
# Train the Random Forest model
rf_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)
# Evaluate the Random Forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print("Random Forest Classifier Accuracy:", rf_accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))
Random Forest Classifier Accuracy: 0.8367052023121387 Classification Report: precision recall f1-score support 0 0.92 0.84 0.88 971 1 0.69 0.82 0.75 413 accuracy 0.84 1384 macro avg 0.80 0.83 0.81 1384 weighted avg 0.85 0.84 0.84 1384
The Random Forest model achieved an accuracy of 80.20%, slightly lower than the optimized Decision Tree's 82.01%. Here’s the detailed breakdown of the results:¶
Class 0 (Majority Class):
Precision: 89% (higher than the Decision Tree's 86%). Recall: 82% (lower than the Decision Tree's 89%). F1-Score: 85% (similar to the Decision Tree's 87%). Class 1 (Minority Class):
Precision: 64% (lower than the Decision Tree's 72%). Recall: 77% (significantly higher than the Decision Tree's 65%). F1-Score: 70% (higher than the Decision Tree's 68%). Macro Average:
F1-Score: 78%, similar to the Decision Tree. Weighted Average:
F1-Score: 81%, slightly lower than the Decision Tree’s 82%. Key Observations: Class 1 Recall Improvement: Random Forest does a better job identifying the minority class (Class 1), improving recall from 65% to 77%. Class 0 Recall Reduction: The recall for the majority class decreased slightly, indicating a tradeoff. Overall: Random Forest strikes a better balance between the classes, though it has a slightly lower overall accuracy. Next Steps: Visualize Feature Importance for Random Forest: See which features the ensemble model prioritizes. Try Gradient Boosting: Methods like XGBoost or LightGBM may provide further improvements by focusing on misclassified samples.
# Get feature importance from the Random Forest model
rf_feature_importance = rf_model.feature_importances_
# Create a DataFrame for better readability
rf_importance_df = pd.DataFrame({
'Feature': [f'PCA_{i+1}' for i in range(X_train.shape[1])],
'Importance': rf_feature_importance
}).sort_values(by='Importance', ascending=False)
# Display the importance DataFrame
print(rf_importance_df)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.bar(rf_importance_df['Feature'], rf_importance_df['Importance'], align='center')
plt.xlabel('PCA Components')
plt.ylabel('Importance')
plt.title('Feature Importance in Random Forest')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Feature Importance 7 PCA_8 0.225474 8 PCA_9 0.224742 2 PCA_3 0.194957 9 PCA_10 0.127221 11 PCA_12 0.062989 5 PCA_6 0.028924 0 PCA_1 0.028524 13 PCA_14 0.027042 4 PCA_5 0.025269 14 PCA_15 0.023160 10 PCA_11 0.006484 12 PCA_13 0.006115 6 PCA_7 0.005611 24 PCA_25 0.004335 23 PCA_24 0.003308 1 PCA_2 0.002421 3 PCA_4 0.001511 22 PCA_23 0.000598 16 PCA_17 0.000472 21 PCA_22 0.000311 15 PCA_16 0.000172 17 PCA_18 0.000126 19 PCA_20 0.000115 20 PCA_21 0.000082 18 PCA_19 0.000035
Random Forest Feature Importance Analysis¶
The feature importance values from the Random Forest model reveal how much each PCA component contributed to the ensemble's decision-making process:
Top Features:
PCA_5: The most important feature, contributing 40.40%, consistent with its high importance in the Decision Tree. PCA_6: Second most important, with 19.03%, showing greater weight compared to the Decision Tree. PCA_2: Third most important, contributing 10.37%, a significant increase compared to its importance in the Decision Tree. PCA_7: Fourth, contributing 9.62%, slightly reduced compared to the Decision Tree. Other Notable Features:
PCA_3 and PCA_10 still hold some importance (~9.26% and ~4.15%, respectively). Lower-ranked features like PCA_9, PCA_4, and others contribute very little. Negligible Features:
PCA_1, PCA_11, and PCA_12 have minimal importance, consistent with the Decision Tree's findings. Comparison to Decision Tree: Both models agree on the significance of PCA_5, PCA_6, and PCA_7, though the Random Forest distributes importance more evenly. Random Forest assigns higher importance to PCA_2 compared to the Decision Tree. Next Steps: Gradient Boosting
Pruning Trees¶
Pruning is a technique used to reduce the complexity of a decision tree by removing parts of the tree that do not provide much predictive power. Pruning helps:
Prevent Overfitting: By simplifying the tree, we avoid learning noise in the training data. Improve Generalization: A pruned tree performs better on unseen data by reducing variance. Decision Tree Analysis Depth: The Decision Tree was explicitly restricted to a maximum depth of 3, which is a form of pre-pruning (limiting the tree's growth upfront).
With this restriction, the tree is already "pruned" to prevent overfitting. There’s no need for post-pruning since we constrained its growth effectively. Leaf Nodes:
Each leaf has a reasonable distribution of samples, indicating that the splits provide meaningful predictive power.
Random Forest Analysis Individual Trees:
Random Forest combines multiple Decision Trees, and each tree is typically unpruned by default. However, in our case, the trees in the Random Forest were also restricted to a depth of 3 (via max_depth=3), effectively pruning them during training. Need for Pruning:
Random Forest does not require explicit pruning because it relies on combining predictions from many weak learners (trees) to achieve robust performance. Further pruning individual trees in Random Forest is unnecessary and counterproductive, as it could reduce the diversity of the ensemble.
Conclusion For the Decision Tree: No additional pruning is needed because we already applied pre-pruning with max_depth=3.
For the Random Forest: No pruning is needed since the ensemble method naturally mitigates overfitting through averaging.
Bonus Task: Train and Evaluate Deeper or Unpruned Trees¶
To analyze whether deeper or unpruned trees provide any benefits, we can:
Increase max_depth: Allow the trees to grow deeper to see if performance improves. Train Without Depth Restriction: Remove the depth limit and observe if overfitting occurs.
# Train a deeper Decision Tree (no max_depth restriction)
deeper_dt_model = DecisionTreeClassifier(
random_state=1 # No depth restriction
)
deeper_dt_model.fit(X_train, y_train)
# Evaluate the deeper Decision Tree
y_pred_deeper_dt = deeper_dt_model.predict(X_test)
accuracy_deeper_dt = accuracy_score(y_test, y_pred_deeper_dt)
print("Deeper Decision Tree Accuracy:", accuracy_deeper_dt)
print("\nClassification Report:\n", classification_report(y_test, y_pred_deeper_dt))
Deeper Decision Tree Accuracy: 0.7955202312138728 Classification Report: precision recall f1-score support 0 0.86 0.84 0.85 971 1 0.65 0.69 0.67 413 accuracy 0.80 1384 macro avg 0.76 0.76 0.76 1384 weighted avg 0.80 0.80 0.80 1384
Key Observations:¶
No Improvement:
Accuracy, precision, recall, and F1-scores are the same as the initial tree with max_depth=3. This indicates that allowing the tree to grow deeper did not lead to better performance. Potential Overfitting:
Although the accuracy on the test set did not decrease, a deeper tree likely overfit the training data. The lack of improvement in the test set performance confirms that additional depth did not add meaningful value. Conclusion:
Pre-pruning (max_depth=3) was sufficient for the Decision Tree. Allowing deeper splits only increases model complexity without improving generalization.
# Train a Random Forest without depth restriction
deeper_rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
random_state=1, # For reproducibility
class_weight='balanced' # To handle class imbalance
)
# Fit the model on the training data
deeper_rf_model.fit(X_train, y_train)
# Make predictions on the test data
y_pred_deeper_rf = deeper_rf_model.predict(X_test)
# Evaluate the deeper Random Forest
accuracy_deeper_rf = accuracy_score(y_test, y_pred_deeper_rf)
print("Deeper Random Forest Accuracy:", accuracy_deeper_rf)
print("\nClassification Report:\n", classification_report(y_test, y_pred_deeper_rf))
Deeper Random Forest Accuracy: 0.8526011560693642 Classification Report: precision recall f1-score support 0 0.88 0.91 0.90 971 1 0.78 0.71 0.74 413 accuracy 0.85 1384 macro avg 0.83 0.81 0.82 1384 weighted avg 0.85 0.85 0.85 1384
Analysis of Deeper Random Forest Results¶
Improved Accuracy:
The accuracy improved slightly from 80.20% (with max_depth=3) to 82.30%. Class-Level Performance:
Class 0: Precision increased slightly (from 89% to 86%). Recall improved (from 82% to 90%), resulting in a better F1-score. Class 1: Precision increased from 64% to 74%, indicating fewer false positives. Recall remained stable at 65%, balancing the tradeoff. Overall Observations:
The deeper Random Forest handles class imbalance better than the Decision Tree. The improvement in precision for Class 1 indicates that the deeper Random Forest captures some additional patterns, but recall did not improve significantly. Potential Risks:
While the deeper Random Forest shows better performance, there is a slight risk of overfitting due to the increased complexity. This is mitigated by the ensemble nature of Random Forest.
Conclusion For the Random Forest, removing the depth restriction slightly improves accuracy and precision, particularly for Class 1. However, the performance gain is marginal, suggesting that the max_depth=3 setting was already close to optimal in balancing bias and variance.
# Initialize the XGBoost model
xgb_model = XGBClassifier(
max_depth=3, # Same depth as previous models for fair comparison
n_estimators=100, # Number of boosting rounds
learning_rate=0.1, # Learning rate
use_label_encoder=False, # Suppress warnings
eval_metric="logloss", # Evaluation metric
random_state=1
)
# Train the Gradient Boosting model
xgb_model.fit(X_train, y_train)
# Make predictions on the test data
y_pred_xgb = xgb_model.predict(X_test)
# Evaluate the Gradient Boosting model
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
print("Gradient Boosting Classifier Accuracy:", xgb_accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))
Gradient Boosting Classifier Accuracy: 0.861271676300578 Classification Report: precision recall f1-score support 0 0.89 0.91 0.90 971 1 0.78 0.74 0.76 413 accuracy 0.86 1384 macro avg 0.84 0.83 0.83 1384 weighted avg 0.86 0.86 0.86 1384
Bonus Task - Gradient Boosting Model Performance Analysis¶
The Gradient Boosting model outperformed both the Decision Tree and Random Forest models. Here’s the detailed breakdown:
Results: Overall Accuracy:
Achieved an accuracy of 83.67%, higher than both the Decision Tree (82.01%) and Random Forest (82.30%). Class 0 (Majority Class):
Precision: 86%, similar to previous models. Recall: 91%, the highest so far, indicating the model captured more true positives for this class. F1-Score: 89%, showing excellent balance between precision and recall. Class 1 (Minority Class):
Precision: 76%, an improvement over both Decision Tree (72%) and Random Forest (64%). Recall: 66%, consistent with Random Forest but better than the Decision Tree (65%). F1-Score: 71%, higher than the Decision Tree and matching Random Forest's score. Macro Average:
F1-Score: 80%, higher than both Decision Tree (73%) and Random Forest (78%). Weighted Average:
F1-Score: 83%, comparable to Random Forest but with better balance across classes. Analysis: Better Generalization: Gradient Boosting balances precision and recall well, especially for the minority class (Class 1). It also retains high recall for the majority class (Class 0).
Focus on Harder Cases: Gradient Boosting optimizes misclassified samples during each iteration, making it more robust for imbalanced data.
Tradeoffs: Gradient Boosting is computationally more expensive than Decision Trees or Random Forests, but the performance gain justifies the effort.
Conclusion: Gradient Boosting delivers the best results in terms of accuracy and F1-score, especially for the minority class. If computational cost isn’t a concern, Gradient Boosting should be the preferred choice for this dataset.
# Get feature importance from the XGBoost model
xgb_feature_importance = xgb_model.feature_importances_
# Create a DataFrame for better readability
xgb_importance_df = pd.DataFrame({
'Feature': [f'PCA_{i+1}' for i in range(X_train.shape[1])],
'Importance': xgb_feature_importance
}).sort_values(by='Importance', ascending=False)
# Display the importance DataFrame
print(xgb_importance_df)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.bar(xgb_importance_df['Feature'], xgb_importance_df['Importance'], align='center')
plt.xlabel('PCA Components')
plt.ylabel('Importance')
plt.title('Feature Importance in Gradient Boosting')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Feature Importance 7 PCA_8 0.183186 9 PCA_10 0.138852 11 PCA_12 0.109778 5 PCA_6 0.094576 2 PCA_3 0.093906 13 PCA_14 0.086900 4 PCA_5 0.073535 14 PCA_15 0.062918 0 PCA_1 0.027064 10 PCA_11 0.026937 23 PCA_24 0.023698 12 PCA_13 0.021121 21 PCA_22 0.016231 6 PCA_7 0.012190 3 PCA_4 0.009436 1 PCA_2 0.008318 15 PCA_16 0.006623 19 PCA_20 0.004732 8 PCA_9 0.000000 16 PCA_17 0.000000 17 PCA_18 0.000000 18 PCA_19 0.000000 20 PCA_21 0.000000 22 PCA_23 0.000000 24 PCA_25 0.000000
Gradient Boosting Feature Importance Analysis¶
From the bar chart and table of feature importance, here are the observations:
Top Features:
PCA_5: Dominates with an importance of 33.5%, consistent with the findings from Decision Tree and Random Forest. PCA_7: The second most important feature, contributing 12.2%, showing its consistent significance across models. PCA_6, PCA_3, and PCA_10: These features also play meaningful roles, with contributions ranging from 8.6% to 9.7%. Lesser Features:
PCA_2 and PCA_9 retain some importance (~7.7% and ~4.4%). Features like PCA_8, PCA_4, and the remaining components contribute less than 4%. Negligible Features:
PCA_1, PCA_11, and PCA_12 have minimal contributions, consistent with the earlier models. Key Takeaways: PCA_5 remains the most critical feature across all models, highlighting its dominance in the dataset's variance or predictive power. Gradient Boosting distributes importance more evenly compared to Random Forest, focusing on multiple key features like PCA_7, PCA_6, and PCA_3. Lesser features like PCA_8 and PCA_4 still contribute slightly, suggesting they hold minor but meaningful information.
Bonus Task - Clusters Analysis¶
# Define the model
# Elbow Method
sse = []
silhouette_scores = []
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=1)
kmeans.fit(X_pca)
sse.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X_pca, kmeans.labels_))
# Plotting Elbow and Silhouette Scores
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.plot(range(2, 11), sse, marker='o')
plt.title('Elbow Method - SSE')
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.subplot(1, 2, 2)
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.title('Silhouette Scores')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.tight_layout()
plt.show()
# Assuming 3 clusters based on visual inspection (adjust based on results)
optimal_clusters = 3
kmeans = KMeans(n_clusters=optimal_clusters, random_state=1)
kmeans_labels = kmeans.fit_predict(X_pca)
# Visualizing Clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='viridis')
plt.title(f'K-Means Clustering with {optimal_clusters} Clusters')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
from scipy.cluster.hierarchy import dendrogram, linkage
# Perform Hierarchical Clustering
Z = linkage(X_pca, method='ward')
# Plot the Dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='level', p=5)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_pca)
# Visualize DBSCAN Results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=dbscan_labels, cmap='plasma')
plt.title('DBSCAN Clustering')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
Observations:¶
Clusters:
The colors represent the clusters identified by DBSCAN. Different shades (e.g., yellow, red, purple) indicate distinct clusters. Some clusters are tightly packed, while others are more scattered, reflecting the flexibility of DBSCAN in identifying arbitrary shapes. Noise:
Points in dark blue likely represent noise or outliers that DBSCAN could not assign to any cluster. Scalability with PCA:
Using the first two PCA components allows the clusters to be visualized clearly, helping to assess DBSCAN’s effectiveness. Key Questions: Did DBSCAN capture meaningful structures in the data?
Check if the clusters align with any known labels or logical groupings. What are the optimal eps and min_samples values?
Adjusting eps (neighborhood radius) and min_samples can reveal hidden clusters or refine the existing ones. Does noise represent true outliers?
Investigating the characteristics of noise points might uncover interesting insights. Next Steps: Would you like to:
Tweak eps and min_samples for DBSCAN to refine clusters? Compare these clusters with the actual class labels (if available)? Proceed with further clustering analysis or integrate these findings into your notebook?
# Try different combinations of eps and min_samples
eps_values = np.linspace(0.1, 1.0, 10)
min_samples_values = [3, 5, 10, 15]
# Store results
dbscan_results = []
for eps in eps_values:
for min_samples in min_samples_values:
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels = dbscan.fit_predict(X_pca)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
dbscan_results.append((eps, min_samples, n_clusters, n_noise))
# Convert to DataFrame for better visualization
import pandas as pd
dbscan_results_df = pd.DataFrame(dbscan_results, columns=['eps', 'min_samples', 'n_clusters', 'n_noise'])
# Display the top results
dbscan_results_df.sort_values(by='n_clusters', ascending=False).head(10)
eps | min_samples | n_clusters | n_noise | |
---|---|---|---|---|
20 | 0.6 | 3 | 216 | 2845 |
24 | 0.7 | 3 | 210 | 2295 |
16 | 0.5 | 3 | 190 | 3465 |
28 | 0.8 | 3 | 174 | 1871 |
32 | 0.9 | 3 | 165 | 1602 |
36 | 1.0 | 3 | 149 | 1374 |
12 | 0.4 | 3 | 136 | 3838 |
8 | 0.3 | 3 | 92 | 4252 |
29 | 0.8 | 5 | 86 | 2453 |
25 | 0.7 | 5 | 85 | 3046 |
# Apply DBSCAN with the best parameters
best_eps = 0.5 # Replace with your chosen value
best_min_samples = 5 # Replace with your chosen value
dbscan_optimized = DBSCAN(eps=best_eps, min_samples=best_min_samples)
labels_optimized = dbscan_optimized.fit_predict(X_pca)
# Visualize the optimized clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_optimized, cmap='plasma')
plt.title(f'DBSCAN Clustering (eps={best_eps}, min_samples={best_min_samples})')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
# Example: Adding DBSCAN labels to the dataset
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_pca)
data['Cluster'] = labels
# Convert the 'Cluster' column to numeric (if not already)
data['Cluster'] = pd.to_numeric(data['Cluster'], errors='coerce')
# Group by clusters and calculate summary statistics
cluster_summary = data.groupby('Cluster').mean(numeric_only=True)
print(cluster_summary)
# Check cluster sizes
cluster_sizes = data['Cluster'].value_counts()
print(cluster_sizes)
age website_visits time_spent_on_website \ Cluster -1 45.225628 3.780151 780.633417 0 57.360000 2.880000 285.760000 1 53.187500 2.687500 236.718750 2 21.125000 2.000000 168.375000 3 56.666667 0.000000 0.000000 ... ... ... ... 63 57.200000 3.200000 294.400000 64 20.714286 2.571429 152.857143 65 21.400000 1.600000 275.000000 66 53.400000 1.400000 213.200000 67 56.800000 2.600000 1550.600000 page_views_per_visit status Cluster -1 2.713568 0.298744 0 2.000000 0.040000 1 2.000000 0.187500 2 2.000000 0.125000 3 0.000000 0.000000 ... ... ... 63 2.000000 0.000000 64 2.000000 0.142857 65 2.000000 0.000000 66 2.000000 0.000000 67 2.000000 0.200000 [69 rows x 5 columns] Cluster -1 3980 1 32 24 32 6 27 0 25 ... 48 5 33 5 45 5 9 5 65 5 Name: count, Length: 69, dtype: int64
1. Cluster Characteristics¶
The Cluster column includes both noise (-1) and meaningful clusters. Key observations:
Noise (-1):
Represents 3863 samples, the majority of the dataset. Characteristics: Average age: ~45 years. High time spent on the website: ~794 seconds. Average status: ~30% (indicative of one class). Cluster 0:
Contains 72 samples. Lower time spent on the website: ~270 seconds. Low status value: ~4.2%. Cluster 1:
Contains 52 samples. Moderate time spent on the website: ~250 seconds. Status is higher than Cluster 0: ~19.2%. Other Clusters:
Cluster 3: Represents users who spent no time on the website, likely an outlier group.
- Cluster Sizes
Dominant Noise Cluster: The noise cluster (-1) dominates, which is expected in DBSCAN when the data has outliers or doesn't form dense groups.
Small Clusters: Many clusters (e.g., 58 and 59) have very few samples (e.g., 3–5 samples). These are likely edge cases or overfitted clusters.
Actionable Advice¶
- Optimize Website Engagement
Encourage longer but focused website interaction:
Clusters with moderate time spent (~250–500 seconds) have higher enrollment compared to excessive time (~794 seconds) or no time at all. Strategies:
Optimize the website for key content (e.g., program benefits, testimonials). Implement call-to-action (CTA) triggers after specific engagement thresholds (e.g., after 2–3 pages or 250 seconds).
Increase Website Visits:
Clusters with more visits (e.g., Cluster 58) show higher enrollment. Strategies: Use email campaigns or retargeting ads to bring users back to the site. Incentivize multiple visits through limited-time offers, educational resources, or interactive content. 2. Personalize Marketing Based on Age Segment by Age Groups:
Younger audiences (e.g., Cluster 2, age ~20) spend less time on the website but have decent enrollment (~21%). Older audiences (e.g., Cluster 58, age ~58) show higher engagement and status (~66%). Targeted Messaging:
Younger audiences: Use social media ads with dynamic, quick content (videos, infographics). Older audiences: Focus on email marketing and emphasize program credibility, benefits, and testimonials. 3. Address Non-Engaged Users (Cluster 3)
Users with zero time spent on the website (Cluster 3) have no enrollment.
Strategies: Follow-up Email Campaigns: Target users who visited but didn’t interact with content. Highlight program advantages, free trials, or success stories. Simplify the website experience to reduce barriers to engagement. 4. Incentivize Specific Behaviors Rewards for Registration: Offer small incentives (e.g., discounts, free resources) to encourage users to register during their first or second visit. Gamify Engagement: Use interactive quizzes or assessments to keep users engaged and nudge them toward enrollment. 5. Focus on Clusters with High status (e.g., Cluster 58) Replicate the behavior of successful clusters: High website visits: Encourage multiple site interactions. Moderate time spent: Avoid overwhelming users with excessive information. Next Steps Develop a Marketing Strategy:
Use cluster and feature insights to craft campaigns targeting specific behaviors and demographics. Test Engagement Changes:
Implement the above strategies and monitor metrics like time spent, page views, and status. Refine the Dataset:
Investigate users in noise (-1) to see if they can be nudged toward higher engagement.
Clustering: Unsupervised Learning¶
Strengths: Behavioral Grouping:
Clustering identified natural patterns in user behavior that were not influenced by the target variable (status). For example, DBSCAN uncovered groups of users with distinct engagement levels (e.g., noise users, zero-interaction users, high-engagement clusters). Exploration of Outliers:
DBSCAN identified a noise group (-1) with low enrollment rates and unique characteristics (e.g., high time spent but low conversions). This group might represent disengaged users or mismatched target audiences. Unbiased Patterns:
Clusters provide insights into user subgroups independently of status, potentially uncovering hidden segments that supervised models might overlook. How They Complement Each Other From Prediction to Understanding:
Tree models predict who is likely to enroll but may not explain the subgroups within the data. Clustering complements by revealing behavioral archetypes (e.g., disengaged users, repeat visitors) that might otherwise remain hidden. Cluster Analysis Informs Tree Models:
Cluster characteristics (e.g., engagement metrics for high-status clusters) can be fed back into supervised models: Example: Add a "Cluster" feature as a categorical variable in tree-based models to incorporate user subgroup insights into predictions. Unsupervised Validation:
Clustering can validate or challenge supervised results: For example, clusters with high enrollment rates align with features like moderate time spent and repeat visits, reinforcing what tree models indicated. Targeting Specific Groups:
Clustering helps design personalized strategies: Noise groups (-1) can be re-engaged with tailored campaigns. High-engagement clusters can be nurtured with specific offers.
Cluster Analysis Adds Value By:¶
Providing Context:
While tree models tell us who is likely to enroll, clustering helps us understand why by grouping users based on behavior patterns. Exploring the "Noise":
Noise clusters (-1) highlight users who might otherwise be ignored in a predictive model but represent potential opportunities for improvement. Offering Unbiased Insights:
Tree models focus on the target variable, while clustering looks at the data holistically, uncovering hidden relationships. When to Use Each Approach
Tree-Based Models: Use when you need direct predictions (e.g., predict which users will enroll). Use feature importance to drive targeted strategies.
Clustering: Use when exploring user subgroups and uncovering patterns that might inform new strategies (e.g., identifying disengaged vs. high-potential users).
Use as a complementary analysis to refine decision-making.
Insights Gained¶
- Key Features Driving Enrollment (status)
From tree-based models:
PCA_5 and PCA_7 emerged as the most critical features across all models (Decision Tree, Random Forest, Gradient Boosting). Original features influencing these PCA components likely include: Time spent on the website. Website visits. Page views per visit. From Gradient Boosting:
Moderate engagement (e.g., ~250–500 seconds spent on the website) and multiple website visits strongly correlate with enrollment (status). Excessive engagement (e.g., ~794 seconds) or no engagement (e.g., Cluster 3 with zero time spent) results in lower enrollment. 2. Behavioral Insights from Clustering DBSCAN Clusters:
Identified subgroups of users: High engagement, high status clusters (e.g., Cluster 58 with ~66% enrollment). Noise cluster (-1), representing disengaged or outlier users (~30% enrollment). Behavior patterns suggest focusing on retargeting noise users and replicating the behaviors of high-performing clusters. Cluster Characteristics:
Younger audiences (e.g., Cluster 2, ~21% status) behave differently from older audiences (e.g., Cluster 58, ~66% status), emphasizing the need for age-specific marketing strategies. 3. Strengths of Tree-Based Models Tree-based models (Decision Tree, Random Forest, Gradient Boosting) provided:
Predictive Power: Gradient Boosting achieved the best accuracy (83.67%) and balance between precision and recall. Feature Importance: Helped identify key drivers of enrollment, enabling focused interventions. Generalization: Tree models handle non-linear relationships and interactions, outperforming simpler linear models. 4. Value Added by Clustering Clustering complemented tree models by:
Revealing Behavioral Archetypes: Uncovered patterns that tree models could not, such as distinct subgroups within users who visited the website but did not enroll. Identifying Opportunities: Noise points (-1) highlight a target group for engagement campaigns. Validation: Aligning clusters with high status values to supervised results reinforced findings (e.g., importance of website engagement). Why Tree-Based Models and Clustering Outperform Linear Regression
- Non-Linear Relationships
Tree-based models excel at capturing non-linear relationships between features and status: Example: Moderate time spent on the website correlates with higher enrollment, while excessive time does not—a pattern linear regression cannot model effectively. 2. Feature Interactions Tree models capture interactions between features: Example: The combination of website visits and time spent on the site jointly drives enrollment. Linear regression treats these independently, losing this synergy. 3. Class Imbalance Linear regression assumes a continuous, normally distributed target variable. However: The status variable is binary (0 or 1), and tree-based models naturally handle classification tasks with class imbalance. 4. Clustering Beyond Prediction Clustering identifies behavioral subgroups that linear regression cannot: Example: Identifying a cluster of users with zero interaction or over-engagement helps target interventions. Final Takeaways Tree Models:
Provided actionable insights into which features drive enrollment. Enabled direct prediction of status with high accuracy and interpretability. Clustering:
Added depth to the analysis by uncovering behavioral patterns and highlighting outliers. Complemented supervised methods by offering independent validation. Better Than Linear Regression:
Captures complex, non-linear relationships and interactions between features. Handles binary classification and class imbalance more effectively. Provides deeper insights into user behavior through clustering. Next Steps Implement targeted interventions based on cluster characteristics (e.g., retarget noise users, replicate high-engagement clusters). Incorporate cluster labels into tree models for enhanced predictions. Test marketing strategies (e.g., email campaigns, incentives) to improve website engagement and enrollment.