Tejashree Nawale
10 min read · Mar 6, 2022

Machine Learning Project in the Banking Domain

Predict if the client will subscribe to a term deposit based on the analysis of the marketing campaigns the bank performed.

Photo by Museums Victoria on Unsplash

The main objective of this article is to take you through the entire working pipeline of a machine learning problem: predicting whether a client will subscribe to a term deposit, based on an analysis of the marketing campaigns the bank performed.

Business Use Case

There has been a revenue decline for a Portuguese bank, and they would like to know what actions to take. After investigation, they found that the root cause is that their clients are not depositing as frequently as before. Term deposits allow a bank to hold onto a deposit for a specific amount of time, so the bank can invest in higher-gain financial products to make a profit. In addition, banks have a better chance of persuading term deposit clients to buy other products such as funds or insurance, further increasing their revenue. As a result, the Portuguese bank would like to identify existing clients that have a higher chance of subscribing to a term deposit and focus marketing efforts on such clients.

Potential Business Problems

How can the bank run optimized campaigns to bring in more customers and thereby increase revenue?

How can it increase long-term holdings, which can then be invested in different financial instruments?

Why solve this problem?

Business Impact

Improved prediction → identify common features of subscribing customers → run targeted campaigns

Improved prediction → identify the right target audience → spend the marketing budget efficiently

Improved prediction → identify the right frequency and interval for contacting customers → run an optimal campaign

Data Set Information

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the bank term deposit would be subscribed to or not.

There are two datasets: train.csv, with 32,950 examples and 21 columns including the target feature, ordered by date (from May 2008 to November 2010); and test.csv, which consists of 8,238 observations and 20 features, without the target feature.

Now, let's start with the machine learning task: using classification to predict whether a client will subscribe to a term deposit, based on the analysis of the marketing campaigns the bank performed.

Exploratory Data Analysis

We will use the popular scikit-learn library to develop our machine learning algorithms. I will start by importing all the necessary libraries for this task and loading the dataset.

1) Importing libraries
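The original code cells are not reproduced here, so below is a minimal sketch of the imports used throughout this walkthrough; the exact set of libraries is an assumption on my part.

```python
# Core stack assumed for this walkthrough: pandas/numpy for data handling,
# matplotlib/seaborn for plots, scikit-learn for modelling.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
```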

2) Importing the dataset
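A minimal sketch of loading the two files, assuming train.csv and test.csv sit in the working directory:

```python
# File paths are assumptions; adjust them to wherever the data lives.
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(df.shape)       # expected (32950, 21) per the dataset description
print(test_df.shape)  # expected (8238, 20)
```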

The first thing to do when tackling a data science problem is to get an understanding of the dataset you are working with, noting down key observations and trends in the data. For this you can use df.info(), df.head(), etc.
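For example, a quick first look could be:

```python
df.head()        # first few rows for a sanity check
df.info()        # column names, dtypes and non-null counts
df.describe()    # summary statistics of the numeric columns
```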

In the data analysis, we will try to find out the following:

1. Missing values in the dataset

2. The numerical variables and their distributions

3. Categorical variables

4. Outliers

5. The relationship between the independent features and the dependent feature

Check Numerical and Categorical Features

If you are familiar with machine learning, you will know that a dataset consists of numerical and categorical columns. Looking at the dataset, we think we can identify the categorical and continuous columns in it. Right? But it might also be possible that the numerical values are represented as strings in some feature. Or the categorical values in some features might be represented as some other datatypes instead of strings. Hence it’s good to check for the datatypes of all the features.
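One way to split the columns by datatype with pandas (a sketch, assuming the DataFrame is called df):

```python
# Separate numeric and object (string) columns based on their parsed dtypes.
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```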

Check Missing Data

Missing data means the absence of observations in columns, which can be caused by problems while procuring the data, lack of information, incomplete results, etc. The given dataset is a pretty clean dataset, but this might not always be the case: you will often encounter missing values represented as NaN in the data.
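A quick check for NaN values per column:

```python
# Count NaN values per column; for this dataset the counts come out as zero,
# because its missing values are encoded as the string "unknown" rather than NaN.
print(df.isnull().sum())
```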

Fill null values in continuous features

There are no null values in any of the continuous columns in this dataset. But when null values exist in a continuous column, a good approach would be to impute them.
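For completeness, a hedged sketch of median imputation with scikit-learn's SimpleImputer, in case you do run into NaN values in continuous columns (not actually required for this dataset):

```python
from sklearn.impute import SimpleImputer

# Median is a robust default when the columns are skewed or contain outliers.
imputer = SimpleImputer(strategy="median")
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])
```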

Check for Class Imbalance

Class imbalance occurs when the observations belonging to one class in the target are significantly more numerous than those of the other class or classes. A class distribution of 80:20 or greater is typically considered an imbalance for binary classification. Since most machine learning algorithms assume that data is equally distributed, applying them to imbalanced data often results in bias towards the majority class and poor classification of the minority class. Hence we need to identify and deal with class imbalance. The code below takes the target variable and outputs the distribution of its classes.
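A sketch of that check, assuming the target column is named y (with values "yes"/"no"):

```python
# Absolute counts and percentage share of each class in the target.
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True) * 100)
```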

Detect outliers in the continuous columns

Outliers are observations that lie far away from the majority of observations in the dataset and can be defined mathematically in different ways. We count the number of outliers in every numeric feature based on the IQR rule.
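A sketch of counting outliers per numeric feature with the 1.5 × IQR rule:

```python
for col in numerical_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_outliers} outliers")
```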

Univariate analysis of Categorical columns

Univariate analysis means analysis of a single variable. It mainly describes the characteristics of the variable. If the variable is categorical, we can use either a bar chart or a pie chart to find the distribution of its classes.
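A minimal plotting loop for the categorical columns (bar charts of class frequencies):

```python
for col in categorical_cols:
    df[col].value_counts().plot(kind="bar", title=col)
    plt.tight_layout()
    plt.show()
```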

From the above visuals, we can make the following observations:

  • The top three professions that our customers belong to are administration, blue-collar jobs and technicians.
  • A huge number of the customers are married.
  • The majority of the customers do not have credit in default.
  • Many of our past customers have applied for a housing loan, but very few have applied for personal loans.
  • Cell phones seem to be the most favoured method of reaching out to customers.
  • Many customers were contacted in the month of May.
  • The plot for the target variable shows a heavy class imbalance.
  • The missing values in some columns are represented by the value unknown. We will treat these values in the next task.

Imputing unknown values of categorical columns

Depending on the use case, we can decide how to deal with these values. One method is to directly impute them with the mode value of respective columns.
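A sketch of that mode imputation, replacing the literal string "unknown" in every categorical column:

```python
for col in categorical_cols:
    mode_value = df[col].mode()[0]          # most frequent category in the column
    df[col] = df[col].replace("unknown", mode_value)
```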

Univariate analysis of Continuous columns

Just like for categorical columns, by performing a univariate analysis on the continuous columns, we can get a sense of the distribution of values in every column and of the outliers in the data. Histograms are great for plotting the distribution of the data and boxplots are the best choice for visualizing outliers.
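A sketch of plotting both views side by side for each continuous column:

```python
for col in numerical_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    df[col].plot(kind="hist", bins=30, ax=axes[0], title=f"{col} - histogram")
    df[col].plot(kind="box", ax=axes[1], title=f"{col} - boxplot")
    plt.tight_layout()
    plt.show()
```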

Observations:

  • As we can see from the histogram, the features age, duration and campaign are heavily skewed and this is due to the presence of outliers as seen in the boxplot for these features. We will deal with these outliers in the steps below.
  • Looking at the plot for pdays, we can infer that majority of the customers were being contacted for the first time because as per the feature description for pdays the value 999 indicates that the customer had not been contacted previously.
  • Since the features pdays and previous consist almost entirely of a single value, their variance is very low, so we can drop them, since they will technically be of no help in prediction.

Bivariate Analysis — Categorical Columns

Bivariate analysis involves checking the relationship between two variables simultaneously. In the code below, we plot every categorical feature against the target as a bar chart.
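One way to do this with pandas, plotting the subscription share within each category (again assuming the target column is named y):

```python
for col in categorical_cols:
    if col == "y":
        continue
    # Share of "yes"/"no" within each category of the feature.
    pd.crosstab(df[col], df["y"], normalize="index").plot(kind="bar", stacked=True, title=col)
    plt.tight_layout()
    plt.show()
```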

Treating outliers in the continuous columns

Outliers can be treated in a variety of ways. It depends on the skewness of the feature. To reduce right skewness, we use roots or logarithms or reciprocals (roots are weakest). This is the most common problem in practice. To reduce left skewness, we take squares or cubes or higher powers. But in our data, some of the features have negative values and also the value 0. In such cases, square root transform or logarithmic transformation cannot be used since we cannot take square root of negative values and logarithm of zero is not defined.

Hence, for this data we use a method called winsorization. In this method, we choose a range of, say, the 5th to the 95th percentile, and then replace all the outliers below the 5th percentile with the value at the 5th percentile and all the values above the 95th percentile with the value at the 95th percentile. It is quite useful when the features contain negative values and zeros, which cannot be treated with log transforms or square roots.
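A minimal sketch of winsorizing at the 5th and 95th percentiles using pandas' clip; the column list is taken from the skewed features noted earlier:

```python
# Clip each skewed column to its 5th/95th percentile values.
for col in ["age", "duration", "campaign"]:
    lower, upper = df[col].quantile([0.05, 0.95])
    df[col] = df[col].clip(lower, upper)
```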

Applying vanilla models on the data

Since we have finished preprocessing our data and the EDA part, it is now time to apply vanilla machine learning models to the data and check their performance.

Function to label encode categorical variables: For the given dataset, we are going to label encode all the categorical features, and also the target (since it is categorical).
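A sketch of that label encoding step with scikit-learn; keeping the fitted encoders around is an assumption that makes it easier to transform the test data the same way later:

```python
from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in categorical_cols:            # includes the target "y"
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le                  # reuse on the test data later
```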

Fit vanilla classification models

Since we have label encoded our categorical variables, our data is now ready for applying machine learning algorithms. There are many classification algorithms in machine learning, used for different classification applications. Some of the main ones are as follows:

  • Logistic Regression
  • Decision Tree Classifier
  • Random Forest Classifier
  • XGBClassifier
  • GradientBoostingClassifier

Three vanilla models were assessed, without performing any hyperparameter tuning and without treating the class imbalance of the target: Logistic Regression, Random Forest Classifier and XGBoost Classifier. None of the three models was able to give a ROC-AUC score above 70%. This called for hyperparameter tuning using Grid Search, and treatment of class imbalance using SMOTE, to further improve the ROC-AUC score. A sketch of this step is shown below.
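A minimal sketch, fitting the three models and scoring ROC-AUC on a hold-out split; the split ratio, random_state and default hyperparameters are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X = df.drop("y", axis=1)
y = df["y"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_val)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_val, probs):.4f}")
```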

Feature Selection

Now that we have applied vanilla models to our data, we have a basic understanding of what our predictions look like. Let's now use feature selection methods to identify the best set of features for each model.

1. Using RFE for feature selection

In this task, let's use Recursive Feature Elimination (RFE) for selecting the best features. RFE is a wrapper method that uses the model itself to identify the best features. Here, we have asked it to retain 8 features; you can change this value to the number of features you want to keep for your model.
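A sketch of RFE wrapped around logistic regression, keeping 8 features (the choice of estimator is an assumption):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe.fit(X_train, y_train)
print("Selected features:", X_train.columns[rfe.support_].tolist())
```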

2. Feature Selection using Random Forest

Random forests are often used for feature selection in a data science workflow. This is because the tree-based strategies that random forests use naturally rank features by how much they improve the purity of the nodes: splits that produce the greatest decrease in impurity occur near the top of the trees, while splits with the smallest decrease occur towards the bottom. Hence, by pruning the trees below a desired number of splits, we can create a subset of the most important features.
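In practice this usually means ranking features by the forest's feature_importances_ attribute; a minimal sketch:

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Mean decrease in impurity per feature, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```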

Note: The Feature Selection techniques can differ from problem to problem and the techniques applied for this problem may or may not work for the other problems. In those cases, feel free to try out other methods like PCA, SelectKBest(), SelectPercentile(), tSNE etc.

Grid-Search & Hyperparameter Tuning

Hyperparameters are settings of an algorithm that are not learned from the data and have to be specified by us. We fit a Random Forest model using the best parameters obtained via Grid Search. Since the target is imbalanced, we also apply the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class in the target.

Note that SMOTE should always be applied only to the training data, never to the validation or test data.
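One way to respect that rule is to put SMOTE inside an imbalanced-learn pipeline, so it is applied only to the training folds during cross-validation; the parameter grid below is an illustrative assumption, not the grid used in the article:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),               # applied only to training folds
    ("rf", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [5, 10, None],
}
grid = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```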

Prediction on the test data

Here, we perform a prediction on the test data. We have used Logistic Regression for this prediction; you can use the model of your choice that gives you the best metric score on the validation data. We have to perform the same preprocessing operations on the test data that we performed on the train data. For demonstration purposes, we have preprocessed the test data and saved it in the csv file test_preprocessed.csv. We then make a prediction on the preprocessed test data using the Grid-Search-tuned Logistic Regression model. As the final step, we concatenate this prediction with the Id column and convert the result into a csv file, which becomes submission.csv.
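A sketch of that final step; the file name test_preprocessed.csv and the Id column are taken from the description above, while the choice of tuned model is up to you:

```python
test_pre = pd.read_csv("test_preprocessed.csv")
features = test_pre.drop(columns=["Id"], errors="ignore")

best_model = grid.best_estimator_   # or the Grid-Search-tuned Logistic Regression
test_preds = best_model.predict(features)

submission = pd.DataFrame({"Id": test_pre["Id"], "y": test_preds})
submission.to_csv("submission.csv", index=False)
```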

Models and Approaches:

Logistic Regression: ROC-AUC 67.17% before tuning, 72.40% after tuning

Random Forest Classifier: ROC-AUC 68.01% before tuning, 92.48% after tuning

XGBoost Classifier: ROC-AUC 68.19% before tuning, 93.81% after tuning

Final results: From the above observations, it can be inferred that the best-performing model was XGBoost, with a ROC-AUC score of 93.81%. While XGBoost is used a lot, it is always prudent to start with simpler algorithms and then move on to more complex ones.

Insights & Decisions

Customers to be targeted:

Age: 30–50

Education: University, High School, Professional Courses

Job: Admin, Blue-collar, Technician

Campaign targets:

Customers who were not targeted before

Customers for whom previous campaigns were successful

Plan campaigns from May through August

I hope you liked this article on machine learning. The GitHub link is given, where you'll find all the source code. Feel free to ask your valuable questions in the comments section below :)