Telecom Churn Prediction in South Asia

In the Telecom Industry, Customers are able to choose from multiple service providers and actively switch from one Operator to another. Technical progress and the increasing number of Operators have raised the level of competition. Hence, proactively predicting the Customers who are at High Risk of Churn has become important for Telecom Companies.

Telecom Companies follow three main strategies to Generate More Revenues:

    • Acquire New Customers
    • Upsell the Existing Customers
    • Increase the Retention Period of Customers

However, comparing the above Strategies by taking the Return on Investment (RoI) of each into account shows that Increasing the Retention Period of Customers is the most Profitable Strategy. In this highly competitive market, the Telecom Industry experiences an average annual Churn rate of 15-25%. Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, Customer Retention has now become even more important than Customer Acquisition for most Telecom Operators.

Customer Churn Rate:

Churn Rate is the percentage of subscribers to a service who discontinued the Service in a given time period.

Importance:

    • Churn Rate indicates the Strength of a Company’s Customer Service and its overall Growth.
    • The lower the Churn Rate, the better the Company’s Competitive Position.
    • It is always more Difficult and Expensive for a Company to Acquire a New Customer than to retain a Current Paying Customer.

If we Predict the Customers who are at High Risk of Churning, the Company still has time to take the Necessary Actions to reduce the Churn Rate.

So, the Task here is to Predict the Customers at High Risk of Churn Using Machine Learning.

Objectives and Constraints:

    • To Predict as many customers as possible with the best possible Score.
    • No Low Latency Constraint.

Performance Metrics:

    • AUC Score
    • F1 Score

F1 Score:

As the Dataset is Highly Imbalanced, with the positive class making up just 7.32% of the total, a metric such as F1 Score is appropriate for this kind of problem, as it considers both the precision and the recall of the model to compute the score.

Reason for Choosing AUC:

As the Dataset is Imbalanced and this is a Binary Classification problem, AUC handles the Imbalance better, whereas Accuracy fails in this case.
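As a quick illustration, here is a minimal sketch of how both metrics can be computed with sklearn; the labels and probabilities below are toy values, not from the actual dataset.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Toy example: y_true are actual churn labels, y_prob are predicted churn probabilities
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_prob = [0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.1, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("F1 Score :", f1_score(y_true, y_pred))       # balances precision and recall
print("AUC Score:", roc_auc_score(y_true, y_prob))  # threshold-independent ranking quality
```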

Data:

The Data set can be downloaded from https://www.kaggle.com/vijaysrikanth/telecom-churn-data-set-for-the-south-asian-market

The Data set is Based on South Asian Users.

    • File format: .csv
    • Size: 75.4 MB
    • Number of Data Points: 99999
    • Number of Columns: 226

Abbreviations in Columns:

Understanding Column Names:

The suffixes 6, 7, 8 and 9 at the end of each Column name denote that the Data in the corresponding Columns was recorded in the months of June, July, August and September respectively.

The Data is collected for four months i.e. June, July, August and September. Out of these four months, the first three months' Data is used to Predict the Customers who are at High Risk of Getting into the Churn Category, and the September month Data is used to Label the Data set.

Handling Missing Values:

It was observed that there are 166 Columns with Missing (NaN) Values in the data, so we need to handle the missing values in the Data. Let’s find the Percentage of Missing Values in Each Column and plot it, as in the sketch below.
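A minimal sketch of this step is shown below; the file name telecom_churn_data.csv is an assumption, so adjust it to wherever you saved the Kaggle file.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("telecom_churn_data.csv")   # file name assumed

# Percentage of missing (NaN) values in each column
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print("Columns with missing values:", (missing_pct > 0).sum())

# Distribution of missing-value percentages across columns
missing_pct.plot(kind="bar", figsize=(16, 4))
plt.ylabel("% of missing values")
plt.show()
```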

We can observe that

    • About 80% of Columns have less than 10% Missing Data, i.e. under ~10,000 Data Points.
    • About 10% of Columns have more than 70% Missing Data, i.e. over ~70,000 Data Points.

It’s better to drop the columns which have a high percentage of Missing Values. The threshold here is taken as 30%, so all the columns with more than 30% missing values are dropped.
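Continuing from the missing_pct series computed above, the drop can be sketched as:

```python
# Drop every column with more than 30% missing values
threshold = 30
cols_to_drop = missing_pct[missing_pct > threshold].index
df = df.drop(columns=cols_to_drop)
print(f"Dropped {len(cols_to_drop)} columns, {df.shape[1]} remaining")
```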

Now, the remaining columns are to be handled based on the data in the column or based on domain knowledge. We use both techniques here to impute the NaN values.

Outgoing Calls:

It was observed that there are No Missing Values in the column total_og_mou, so this column is used as the reference to handle the missing values in the other Outgoing Minutes of Usage Columns. If the Total Outgoing Minutes of Usage, i.e. total_og_mou, is zero, it means that there were no outgoing calls of any category.

So, all the category-wise Outgoing features depend on total_og_mou: if total_og_mou is zero, no outgoing calls of any kind were made. All these features with missing values can therefore be imputed with zero if the corresponding total_og_mou is zero. This is handled using the function below.
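A sketch of such a function is given below. The category-wise column names (e.g. loc_og_t2t_mou_6) are assumptions based on the dataset's naming convention, and the month suffixes follow the scheme described earlier.

```python
def impute_og_with_zero(df, og_prefixes, months=("6", "7", "8", "9")):
    """Impute category-wise outgoing MOU columns with 0 for customers whose
    total outgoing MOU in that month is 0 (i.e. they made no outgoing calls)."""
    for month in months:
        no_og_calls = df[f"total_og_mou_{month}"] == 0
        for prefix in og_prefixes:
            col = f"{prefix}_{month}"
            df.loc[no_og_calls, col] = df.loc[no_og_calls, col].fillna(0)
    return df

# Example category-wise outgoing columns (prefixes assumed)
og_prefixes = ["loc_og_t2t_mou", "loc_og_t2m_mou", "loc_og_t2f_mou",
               "std_og_t2t_mou", "std_og_t2m_mou", "isd_og_mou", "spl_og_mou"]
df = impute_og_with_zero(df, og_prefixes)
```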

Now the cumulative Local and STD Outgoing columns are handled with the help of the features imputed above, as they depend on them: the cumulative Local (or STD) Outgoing MOU is the sum of its category-wise components.

Local and STD Minutes of Usage are handled using this relationship, via the snippet of code below.
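A sketch of that snippet is given below, assuming the cumulative column is the sum of its category-wise components (e.g. loc_og_mou = loc_og_t2t_mou + loc_og_t2m_mou + loc_og_t2f_mou + loc_og_t2c_mou); the column names are assumed from the dataset's naming convention.

```python
def impute_cumulative(df, total_prefix, component_prefixes, months=("6", "7", "8", "9")):
    """Fill a missing cumulative MOU column with the sum of its component columns."""
    for month in months:
        total_col = f"{total_prefix}_{month}"
        component_cols = [f"{p}_{month}" for p in component_prefixes]
        missing = df[total_col].isnull()
        df.loc[missing, total_col] = df.loc[missing, component_cols].sum(axis=1)
    return df

# Cumulative local outgoing MOU from its sub-categories
df = impute_cumulative(df, "loc_og_mou",
                       ["loc_og_t2t_mou", "loc_og_t2m_mou", "loc_og_t2f_mou", "loc_og_t2c_mou"])
# Cumulative STD outgoing MOU from its sub-categories
df = impute_cumulative(df, "std_og_mou",
                       ["std_og_t2t_mou", "std_og_t2m_mou", "std_og_t2f_mou"])
```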

Incoming Calls:

Here too, it was observed that there are No Missing Values in the column total_ic_mou, so this column is used as the reference to handle the missing values in the remaining Incoming Minutes of Usage Columns. If the Total Incoming Minutes of Usage, i.e. total_ic_mou, is zero, it means that there were no incoming calls of any category.

So all these features with missing values can be imputed with zero if the corresponding total_ic_mou is zero, using the same code snippet as for the Outgoing Calls. The cumulative Local and STD Incoming columns are then handled with the help of the features imputed above, as they depend on them, again using the same code snippet as for the Outgoing Calls.

Remaining Columns:

  • On Network Minutes of Usage:
    • This column represents the total minutes of usage of calls, both incoming and outgoing, within the same Network.
  • Off Network Minutes of Usage:
    • This column represents the total minutes of usage of calls, both incoming and outgoing, to Other Networks.

The above columns are also handled using the same snippet of code, based on the corresponding relationships, and all the remaining call-related columns are imputed with zeros.

Date Columns:

These columns are first converted to a proper Date Time format using the predefined function pd.to_datetime, and then the gap between the Recharges is calculated; the remaining columns related to these are dropped. The Missing values in the new columns are filled with a Unique Integer Value.
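A sketch of this step, assuming the last-recharge date columns are named date_of_last_rech_6 ... date_of_last_rech_9 and using 80 as the unique fill value (the same value referred to later in the EDA):

```python
import pandas as pd

# Convert the recharge-date columns to datetime
date_cols = [c for c in df.columns if "date_of_last_rech" in c]
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

# Gap (in days) between the last recharge dates of consecutive months
df["rech_gap_7_6"] = (df["date_of_last_rech_7"] - df["date_of_last_rech_6"]).dt.days
df["rech_gap_8_7"] = (df["date_of_last_rech_8"] - df["date_of_last_rech_7"]).dt.days

# Drop the original date columns and fill the gaps that could not be computed
df = df.drop(columns=date_cols)
df[["rech_gap_7_6", "rech_gap_8_7"]] = df[["rech_gap_7_6", "rech_gap_8_7"]].fillna(80)
```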

Customer Value Segmentation:

To segment Customers by Value, the Standard Approach used is ‘Decile Analysis’. This calculates a Value Measure for each Customer, sorts the Customer Base into Descending Order by Value and then splits the Base into 10 Equal Segments. The First or Top Decile is the Top 10 Percent of the Base, the Second Decile is the next 10 percent, and so on.

  • The segmentation here is based on a Decile Analysis of the Revenue generated by the Customers in the Months of June and July.
  • These are the Customers who contribute the Maximum Revenue to the Company.
  • Identifying these Customers when they are at High Risk of Getting into Churn can help the Company.

We here consider the top 3 Deciles for segregating the high-value Customers, as these Customers contribute more than 70% of the revenue to the Company, as sketched below.
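A minimal sketch of the decile segmentation on the June and July revenue (the arpu_6 and arpu_7 columns) is shown below; the derived column names are illustrative.

```python
# Average revenue per user over the "good" months June and July
df["avg_rev_6_7"] = (df["arpu_6"] + df["arpu_7"]) / 2

# Decile analysis: 10 equal-sized segments, decile 10 being the top 10% by revenue
df["revenue_decile"] = pd.qcut(df["avg_rev_6_7"], q=10, labels=range(1, 11)).astype(int)

# Keep only the high-value customers in the top 3 deciles
high_value_df = df[df["revenue_decile"] >= 8].copy()
print("High-value customers:", high_value_df.shape[0])
```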

Labeling the Data:

We need to assign the Churn Label to the Data by using the September Month Data (as mentioned in the Problem Statement), depending on total_ic_mou, total_og_mou, vol_2g_mb, vol_3g_mb and total_rech_num, i.e.

    • No Incoming and Outgoing Calls.
    • No Data Usage, either 2G or 3G.
    • No Recharges have been Done.

If the above conditions are satisfied, then the Customer is assumed to have Churned and is assigned the Churn Label 1. This labeling of the data is performed using the snippet of code below.
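A sketch of that labeling logic is given below; the September column names (suffix 9) such as vol_3g_mb_9 and total_rech_num_9 are assumed from the dataset's naming convention.

```python
def tag_churn(row):
    """Churn label 1: no incoming/outgoing calls, no 2G/3G data usage
    and no recharges in September (month suffix 9)."""
    no_calls = row["total_ic_mou_9"] == 0 and row["total_og_mou_9"] == 0
    no_data = row["vol_2g_mb_9"] == 0 and row["vol_3g_mb_9"] == 0
    no_recharge = row["total_rech_num_9"] == 0
    return 1 if (no_calls and no_data and no_recharge) else 0

high_value_df["churn"] = high_value_df.apply(tag_churn, axis=1)

# The September columns are only needed for labeling, so they are dropped afterwards
sept_cols = [c for c in high_value_df.columns if c.endswith("_9")]
high_value_df = high_value_df.drop(columns=sept_cols)
```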

Exploratory Data Analysis:

Exploratory Data Analysis refers to performing basic analysis on the data, both univariate and multivariate, to observe the patterns in the Data. This step helps us find valuable insights through which the important and necessary features can be identified, while the remaining, uninformative ones can be dropped. It is also helpful in deriving new features.

Churn Label:

  • It can be inferred that the data is Highly Imbalanced. The Churn Rate is ~7.32%.
  • The Data has to be Up-sampled.

Off Network Minutes of Usage:

Inferences:

  • Most of the Data is Overlapped.
  • The Call Durations of calls made by the Churned Customers to Other Networks in the Month of August Decreased Drastically compared with the Previous Month’s MOU and are Very Low.
  • So, the Difference in usage of Minutes between the Two Months may be a helpful Feature.

Total Outgoing and Incoming Minutes of Usage:

Inferences:

  • Most of the Data is Overlapped.
  • The Outgoing and Incoming MOU of Churned Customers is Very Low in the Month of August.
  • So, the Difference in usage of Minutes between the Two Months may be a helpful Feature.

Total Number of Recharges:

Inferences:

  • Most of the Data is Overlapped.
  • The Higher the Number of Recharges, the Lower the Churn Rate.

Age on Network:

Inferences:

  • As the Age on Network increases, the Churn Rate decreases.
  • The Churn Rate is Lower if the Age on Network is above 1000.

Gap Between Recharges:

Inferences:

  • 80 is the Value filled in if no Recharges were Done.
  • So, Most of the Churned Customers haven’t done a Recharge in the Month of August.

Deriving Features:

Deriving features based on the above Data analysis and Domain Knowledge:
    • Difference in Outgoing Minutes of Usage for the Months July and August.
    • Difference in STD Minutes of Usage for the Months July and August.
    • Difference in Incoming Minutes of Usage for the Months July and August.
    • Average Incoming Minutes of Usage.
    • Average Outgoing Minutes of Usage.
    • Average ARPU.
    • Average onnet and offnet MOU.
    • Average VBC.
    • Difference in Total Revenue for the Months July and August.

These features are derived from the existing features and were found to be Informative, as sketched below.
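A sketch of how a few of these features can be derived is shown below; the source column names are assumed from the dataset's naming convention and the new column names are illustrative.

```python
hv = high_value_df  # shorthand

# Month-over-month differences between July (7) and August (8)
hv["diff_total_og_mou"] = hv["total_og_mou_8"] - hv["total_og_mou_7"]
hv["diff_std_og_mou"]   = hv["std_og_mou_8"]   - hv["std_og_mou_7"]
hv["diff_total_ic_mou"] = hv["total_ic_mou_8"] - hv["total_ic_mou_7"]
hv["diff_arpu"]         = hv["arpu_8"]         - hv["arpu_7"]

# Averages over the months June, July and August
hv["avg_total_ic_mou"] = hv[["total_ic_mou_6", "total_ic_mou_7", "total_ic_mou_8"]].mean(axis=1)
hv["avg_total_og_mou"] = hv[["total_og_mou_6", "total_og_mou_7", "total_og_mou_8"]].mean(axis=1)
hv["avg_arpu"]         = hv[["arpu_6", "arpu_7", "arpu_8"]].mean(axis=1)
hv["avg_onnet_mou"]    = hv[["onnet_mou_6", "onnet_mou_7", "onnet_mou_8"]].mean(axis=1)
hv["avg_offnet_mou"]   = hv[["offnet_mou_6", "offnet_mou_7", "offnet_mou_8"]].mean(axis=1)
```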

Splitting the Data:

The Data has to be split into Train and Test Data, so that the model can be trained on the Train Data and evaluated on the Test Data. Here we split the data randomly in the ratio 70:30, i.e. 70% as Train Data and 30% as Test Data.
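A minimal sketch of the split, assuming churn is the label column created earlier:

```python
from sklearn.model_selection import train_test_split

X = high_value_df.drop(columns=["churn"])
y = high_value_df["churn"]

# 70% train, 30% test, split randomly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```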

Standardizing the Data:

As all the columns in the Data are Numerical and have large differences in range, they have to be standardized. The Standardization of the columns is performed using StandardScaler() from sklearn. This process rescales the values in the columns to a mean of 0 and a standard deviation of 1.
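The scaling step can be sketched as follows; the scaler is fit only on the train data so that no test-set statistics leak into training.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train data only
X_test_scaled = scaler.transform(X_test)         # reuse the train statistics
```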

Up-sampling:

As the data is Imbalanced, Up-sampling of the Minority Class data points has to be done. We up-sample the data here using the SMOTE Technique. SMOTE creates Synthetic New Data Points based on the existing Data.
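A sketch of the up-sampling with imbalanced-learn's SMOTE is shown below; only the train data is up-sampled, so the test set keeps its original class distribution.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train_scaled, y_train)

print("Before SMOTE:", Counter(y_train))
print("After SMOTE :", Counter(y_train_sm))
```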

As it was observed that the correlation between the columns is very high, the number of columns has to be reduced based on the amount of correlation between them.

To reduce the dimensionality, two techniques are used

    • PCA
    • SelectKBest

PCA:

Principal Component Analysis: using this technique, it was found that only 65 components are needed to explain 95% of the cumulative explained variance. So we fit the data with these components and obtain Data with Reduced Dimensions and much lower Correlation. The code snippet and the cumulative explained variance plot are shown below.
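A sketch of the PCA step: asking PCA for 95% of the variance and plotting the cumulative explained variance. The exact number of components retained (65 in the original run) depends on the data passed in.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95, random_state=42)    # keep 95% of the variance
X_train_pca = pca.fit_transform(X_train_sm)      # fit on the up-sampled train data
X_test_pca = pca.transform(X_test_scaled)
print("Components retained:", pca.n_components_)

# Cumulative explained variance plot
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```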

The Correlation Matrix after applying PCA on the Data shows that the correlation between the resulting components is very low.

SelectKBest:

Using this technique, after tuning the number of best features, i.e. K in SelectKBest, it was observed that only 34 columns from the data are the best ones for predicting the Customers at High Risk of Getting into the Churn Category.
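A sketch of the SelectKBest step with K = 34; the ANOVA F-test (f_classif) is an assumption here, since the original scoring function is not stated.

```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=34)       # scoring function assumed
X_train_kbest = selector.fit_transform(X_train_sm, y_train_sm)
X_test_kbest = selector.transform(X_test_scaled)

# Names of the selected columns (column order is preserved from X)
selected_cols = X.columns[selector.get_support()]
print("Selected features:", list(selected_cols))
```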

Modeling:

There are many Machine Learning Algorithms that can be trained on the Data. The above data has been trained on different models (KNN, Logistic Regression, Random Forest, XGBoost), both with Default Parameters and with Tuned Hyperparameters. All these models were trained on the Data with all the columns, on the Data with Reduced Dimensions using PCA, and on the Data with Reduced Dimensions using SelectKBest. The AUC and F1 scores for all the models have been evaluated and tabulated as shown below.

Results for the Models Trained on Entire Data:

Results for the Models Trained on PCA Data:

Results for the Models Trained on SelectKBest Data:

On observing the above results, Logistic Regression with Hyperparameter tuning on the Data with reduced dimensions using SelectKBest performed best, with an AUC Score of 98%. So, we finally fit our Data with this Model. The code snippet for fitting the model is shown below.
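A sketch of fitting and evaluating the final model on the SelectKBest features; the regularization strength C is assumed here, whereas in the original it comes from hyperparameter tuning.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

log_reg = LogisticRegression(C=1.0, max_iter=1000, random_state=42)  # C assumed
log_reg.fit(X_train_kbest, y_train_sm)

y_pred = log_reg.predict(X_test_kbest)
y_prob = log_reg.predict_proba(X_test_kbest)[:, 1]

print("Test AUC:", roc_auc_score(y_test, y_prob))
print("Test F1 :", f1_score(y_test, y_pred))
```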

The Confusion, Precision and Recall Matrices for the above Predicted Data are shown below.

It can be observed that the model has performed well in predicting the Customers at High Risk of Churn. The feature importance plots of the different models are shown below.

Conclusions:

  • It was found that Logistic Regression on the Features selected with SelectKBest performed the best, with an AUC score of 98%. This means that the model ranks a randomly chosen Churned Customer above a randomly chosen Non-Churned Customer about 98% of the time.
  • Most of the Important Features are Minutes of Usage of Outgoing Calls.
  • ARPU is also an Important Feature in Predicting the Risk of Churn.
  • Onnet and Offnet MOU of August Month are also Important Features in Predicting Risk of Churn.
 
Thank you for Reading..!
 
Some of the code snippets are included in the blog; for the full code, you can check out this Jupyter Notebook on GitHub.
 
References:
  • https://link.springer.com/article/10.1057/dbm.2010.21
  • https://aamiraltaf.wordpress.com/2012/08/09/top-3-for-high-value-customers-experience/
  • https://blog.comviva.com/efficient-ways-for-customer-churn-analysis-in-telecom-sector/
  • https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6
  • https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7359318
  • https://clevertap.com/blog/arpu-everything-you-need-to-know-about-average-revenue-per-user/

Kranthi Kumar Valaboju