Jacky's Blog

Sparkify is a digital music service similar to Spotify. Sparkify users are either on a free tier, which plays ads between songs, or on a premium subscription, paid monthly, which has no ads. Users can upgrade, downgrade, or cancel their service at any time.

Our task in this project is to analyze the customer data and come up with a model that predicts customer churn. Here are the steps:

Clean data: fill in missing values, correct the data types, drop the outliers.
EDA: explore the data to look at features, distributions, and correlations between columns.
Feature engineering: extract and build customer features and customer-behavior features; apply a standard scaler to the numerical features.
Train and measure models: logistic regression, linear SVM classifier, decision tree, and random forest classifier are trained as baseline models, and the best of them is then tuned. The data is unbalanced because churned customers are the minority, so we use the F1 score as the metric to measure the models' performance.
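To make the metric choice concrete, here is a minimal sketch of how an F1 score is computed and why plain accuracy is misleading on unbalanced data. The labels are made up for illustration, and `f1_score` is a hand-written helper for the binary case; the evaluator used in the project may average the score differently.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 churners (1) among 8 users.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_stay = [0] * len(y_true)  # a "model" that never predicts churn

accuracy = sum(t == p for t, p in zip(y_true, always_stay)) / len(y_true)
f1 = f1_score(y_true, always_stay)
print(accuracy, f1)
```

The model that never predicts churn still looks decent on accuracy, while its F1 score for the churn class exposes that it finds no churners.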

Quick Facts

A mini subset (125 MB) of the original 12 GB customer log JSON file is used to create the prediction model. This small dataset has 286,500 log entries with 18 unique columns.

The schema and info of the dataset are given below:

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)

Column Name	Description
artist	The artist being listened to
auth	Whether or not the user is logged in
firstName/lastName	Name of the user
gender	Gender of the user
itemInSession	Item number in the session
length	Length of the song played in the current row
level	Free or paid user
location	Physical location of the user, including city and state
method	HTTP method of the request (GET or PUT)
page	Which page the user is on in the current row
registration	Registration timestamp of the user
sessionId	Session ID
song	Song currently being played
status	HTTP status code
ts	Timestamp of the current row
userAgent	User-agent string of the user's browser
userId	User ID

Exploratory Data Analysis

We use the Cancellation Confirmation events in the page column to define customer churn, and perform some exploratory data analysis to compare the behavior of users who stayed with that of users who churned.
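The churn definition above can be sketched in a few lines. This is plain Python rather than Spark so that it runs standalone; the helper names `churned_users` and `label_events` and the sample events are made up for illustration.

```python
def churned_users(events):
    """Set of userIds with at least one Cancellation Confirmation event."""
    return {e["userId"] for e in events if e["page"] == "Cancellation Confirmation"}


def label_events(events):
    """Copy every event and add churn=1 if its user ever cancelled, else 0."""
    churned = churned_users(events)
    return [dict(e, churn=int(e["userId"] in churned)) for e in events]


# Made-up log rows with only the two columns needed here.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Thumbs Up"},
]
labeled = label_events(events)
```

Note that the flag is attached to all of a churned user's events, not only the cancellation row, so earlier behavior can be compared between the two groups.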

churn

52 users in the dataset have a churn event, which is a churn rate of about 23.1%. The ratio of churned to non-churned users is roughly 1:3, so this is an unbalanced dataset.

gender

Can we say that gender has an effect on churn? We calculate the p-value and the result is 0.20, which is above 0.05, so we cannot conclude that it does.
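The post does not record which test produced this p-value. One common choice for two categorical variables is a chi-square test of independence, sketched below for a 2x2 table. The counts are hypothetical and are NOT taken from the Sparkify data, and the helper `chi2_independence_2x2` is written by hand for illustration.

```python
import math


def chi2_independence_2x2(table):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square tail equals a two-sided normal tail.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = gender (F, M), columns = (churned, stayed).
table = [[20, 80], [30, 70]]
stat, p = chi2_independence_2x2(table)
print(stat, p)
```

A p-value above the chosen threshold (0.05 here) means the data do not give enough evidence to reject independence; it does not prove that the two variables are unrelated.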

page

We count each item in the page column for each group and normalize the counts.
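A minimal sketch of this counting and normalization step, assuming each event already carries a `churn` flag (0 or 1). The function name `page_shares` and the sample rows are made up for illustration.

```python
from collections import Counter, defaultdict


def page_shares(events):
    """For each churn group, the share of that group's events on each page."""
    counts = defaultdict(Counter)
    for e in events:
        counts[e["churn"]][e["page"]] += 1
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {page: n / total for page, n in counter.items()}
    return shares


# Made-up events: 2 from churned users, 4 from users who stayed.
events = [
    {"churn": 1, "page": "NextSong"},
    {"churn": 1, "page": "Thumbs Down"},
    {"churn": 0, "page": "NextSong"},
    {"churn": 0, "page": "NextSong"},
    {"churn": 0, "page": "NextSong"},
    {"churn": 0, "page": "Thumbs Up"},
]
shares = page_shares(events)
```

Normalizing within each group matters because the groups differ in size, so raw counts would mostly reflect that there are more non-churned users.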

NextSong clearly accounts for most of the customers' events. Thumbs Up, Thumbs Down, Home, and Add to Playlist also appear to be associated with churn.

userAgent

We extract the customers' browser and platform from the userAgent column.
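The exact extraction rules are not recorded in this post. As a rough sketch, a handful of substring checks is enough to separate the common cases; the function `parse_user_agent`, its categories, and the sample strings below are assumptions for illustration. The order of the checks matters, because Chrome user agents also contain "Safari" and iPad user agents also contain "Mac OS X".

```python
def parse_user_agent(ua):
    """Return a coarse (browser, platform) pair from a user-agent string."""
    if "Firefox" in ua:
        browser = "Firefox"
    elif "Chrome" in ua:  # must come before Safari: Chrome UAs mention Safari too
        browser = "Chrome"
    elif "Safari" in ua:
        browser = "Safari"
    elif "Trident" in ua or "MSIE" in ua:
        browser = "IE"
    else:
        browser = "Other"

    if "iPad" in ua:  # must come before Macintosh-style checks
        platform = "iPad"
    elif "iPhone" in ua:
        platform = "iPhone"
    elif "Windows" in ua:
        platform = "Windows"
    elif "Macintosh" in ua:
        platform = "Macintosh"
    elif "Linux" in ua or "X11" in ua:
        platform = "Linux"
    else:
        platform = "Other"
    return browser, platform


ipad_safari = (
    "Mozilla/5.0 (iPad; CPU OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 "
    "(KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53"
)
print(parse_user_agent(ipad_safari))
```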

Customers using Safari and iPad account for a larger proportion of the churn group.

time

We extract the day of month, day of week, and hour from the ts column.
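A minimal sketch of this extraction, under two assumptions that are not stated in the post: that `ts` is a Unix timestamp in milliseconds, and that UTC is an acceptable timezone. The function name `time_features` and the two derived flags are made up for illustration.

```python
from datetime import datetime, timezone


def time_features(ts_ms):
    """Split a millisecond Unix timestamp into calendar features (UTC assumed)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "day_of_month": dt.day,
        "day_of_week": dt.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
        "after_15th": dt.day > 15,
        "is_weekend": dt.weekday() >= 5,
    }


example = time_features(1538352117000)
print(example)
```

Per-user percentages, such as the share of events after the 15th, can then be taken as the mean of the boolean flags over each user's events.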

Customers in the churn group have more events after the 15th of the month and fewer events on weekends.

Feature Engineering

On the basis of the above EDA, we can create features as follows:

Categorical Features
- gender
- level
- browser
- platform
Numerical Features
- mean, max, min, and std of length per user
- count of each item in page per user (ThumbsUp …
- number of unique songs and total songs per user
- number of unique artists per user
- percentage of operations after the 15th of the month
- percentage of operations on workdays
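Several of the numerical features above are per-user aggregates over the event log. Here is a rough sketch in plain Python; the function name `user_features`, the output keys, the choice of sample standard deviation, and the sample rows are all assumptions for illustration.

```python
import statistics
from collections import defaultdict


def user_features(events):
    """Aggregate log rows into one feature dict per userId."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["userId"]].append(e)

    features = {}
    for user, rows in by_user.items():
        songs = [r for r in rows if r["page"] == "NextSong"]
        lengths = [r["length"] for r in songs]
        features[user] = {
            "length_mean": statistics.mean(lengths) if lengths else 0.0,
            "length_max": max(lengths, default=0.0),
            "length_min": min(lengths, default=0.0),
            # Sample standard deviation needs at least two values.
            "length_std": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
            "total_songs": len(songs),
            "unique_songs": len({r["song"] for r in songs}),
            "unique_artists": len({r["artist"] for r in songs}),
            "thumbs_up": sum(r["page"] == "Thumbs Up" for r in rows),
        }
    return features


# Made-up rows for a single user.
events = [
    {"userId": "1", "page": "NextSong", "length": 200.0, "song": "A", "artist": "X"},
    {"userId": "1", "page": "NextSong", "length": 300.0, "song": "B", "artist": "X"},
    {"userId": "1", "page": "NextSong", "length": 200.0, "song": "A", "artist": "X"},
    {"userId": "1", "page": "Thumbs Up", "length": None, "song": None, "artist": None},
]
feats = user_features(events)["1"]
```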

We apply label encoding to the categorical features and a standard scaler to the numerical features.
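For readers unfamiliar with the two transforms, here is a small hand-written sketch of what they do. It is not the library code used in the project: the helper names are made up, the encoder assigns codes in sorted order, and the scaler centers the data and divides by the population standard deviation, which may differ from a given library's defaults.

```python
import statistics


def label_encode(values):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mean) / std for v in values]


codes, mapping = label_encode(["paid", "free", "paid"])
scaled = standard_scale([1.0, 2.0, 3.0])
```

Scaling mainly helps the linear models (logistic regression and linear SVC); tree-based models are largely insensitive to it.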

Modeling

We split the full dataset into train and test sets and test a baseline for four machine learning methods: Logistic Regression, Linear SVC, Decision Tree Classifier, and Random Forest Classifier.
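The split itself can be sketched as follows. This version is stratified by the churn label and takes an explicit seed, both of which are assumptions rather than a record of what was done in the project; the function name `stratified_split`, the 20% test fraction, and the toy user list are made up for illustration.

```python
import random


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows into (train, test), keeping the label ratio in both parts."""
    rng = random.Random(seed)  # local generator so the split is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy data: 10 churned users and 30 users who stayed.
users = [{"userId": str(i), "churn": int(i < 10)} for i in range(40)]
train, test = stratified_split(users, "churn")
```

With so few churned users, a purely random split can leave the test set with very few positives, which makes the F1 score noisy. Fixing the seed also makes runs comparable with each other.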

As we can see, Random Forest has the highest F1 score, so I choose it for tuning in the next section. The result is as follows:

Random Forest Training time:

F-1 Score:

Conclusion

After optimization, the F1 score actually drops. My guess is that the tuned run did not use the same seed as the baseline run (it takes too long to run again). I therefore keep the original model, and I am happy to have 0.7 as my F1 score.

Reflection

In this project I set out to predict customer churn using the dataset of a music streaming service named Sparkify. This is a binary classification problem, so I chose four supervised learning algorithms to build a model. After evaluating and tuning them, I found that random forest is the most suitable model for this project because it combines a high F1 score (0.7) with a reasonable training time.

Improvement

There are only about 76 samples in the mini dataset above, so the model could be improved by training it on a bigger dataset and tuning the hyperparameters on that data.

Another improvement could be to try out more features or deep learning models.
