Take Banking Personalization to the Next Level with 3 effective ML practices

Source: pinterest.com

You can hear the word “personalization” at every turn nowadays, and everyone seems to have their own definition of the term. In the original sense, it means tailoring a service or product to specific individuals or groups of similar individuals. An informal definition would be trying to match the user (individual) with the item (product or service) that fits their needs as closely as possible. So, whether it is a salesman in a leather goods store offering belts to male visitors and purses to women, a telecommunication company offering tariff packages to different target populations based on age, or Spotify suggesting suitable music to each person based on their preferences, we are talking about different levels of personalization, i.e. more or less personalized depending on the type of business it serves.

Source: tealium.com/blog/customer-centricity

What all of these have in common is that personalization quality can be measured by tracking customer satisfaction (structured or unstructured feedback) through conversion rates, the number of visits, ratings, or even the sentiment of comments on the suggested content. This tells us whether the personalization was justified. Clearly, the better the recommended content, the better these metrics become, and the more engagement (and trust) the business earns from its customers.

A bank is an excellent example of a business where proper interaction with customers is of high importance. Done properly, it makes it possible to form long-lasting relationships with them and establish strong loyalty. This in turn ensures customers take more products and services from the bank and contributes to business growth in general. The bank should therefore recognize its customers’ needs in order to suggest the right actions from which they will benefit.

Traditional approaches, such as bulk campaigns for RFM (recency-frequency-monetary) segments offering Cash Credit once every 6 months, targeting the 20% highest-paid customers (select top * from … order by) with premium products followed by a generic message, or simply showing the spending distribution across categories (TRAVEL, VEHICLES, GROCERIES, …) without any relevant content or call to action, will not capture customer needs at that particular moment or persuade them to make a decision.

People need more than that! They want to be approached with a specific product that is relevant to their current life situation, plans, or wishes. Therefore, a sophisticated and empathetic approach is a must.

Several trends present today influence how the approach towards customers should evolve.

Source: behance.net

These factors are the main drivers transforming the traditional ways of building relationships with the customer. In this article, however, I will focus on the second of them (“timely, precise, and appropriate content”), in one word: personalization!

This is the moment when banks turn to Machine Learning to respond appropriately to these challenges!

Luckily, besides gigantic reservoirs of money, a bank’s greatest asset is data. All executed transactions, products taken, activities performed, call center calls, m-banking logins, website visits, demographic info, and much more are recorded and stored in databases. This is a valuable resource for gaining better insights into customer behavior.

In this article, I will try to analyze how personalization is (or could be) powered by machine learning in the banking industry using 3 different approaches: segmentation, prediction, recommendation.

But first of all, let’s dive into data…

A bank’s data is quite broad and varied, but these are the types of data you are most likely to find there:

Source: nateqnews.com

The idea is to cluster customers based on chosen features. Depending on the business problem being solved and the type of products the bank has, different target groups (clusters) should be detected, such as young people at the beginning of their career, large families with kids, people who enjoy traveling, fitness and new trends, etc. Appropriate features describing these types of people are therefore engineered and clustering is performed. Target groups are then detected based on the most dominant features in each cluster, and after analyzing them manually (using domain knowledge) a label is assigned to each of them.

Distribution of features for 3 different clusters

In this example, the three distributions may seem similar, and between some of them the difference lies in only one feature, yet the meaning is different for each of them. The first cluster has high spending, and we may assume it represents a wealthy family with many members. The second cluster is quite similar to the first one, except that salary and spending power are much smaller, while spending is mainly focused on basic necessities. The last one represents a younger, employed population that has just bought an apartment and prefers enjoying life (expensive traveling, etc.).

For example, if the bank wants to offer products related to traveling, home renovation, fitness services, or technology, the main features can be extracted from transactional data based on customer spending with different merchants. If the bank decides to target customers by their current life situation and monetary power (high spenders, average loyalists, career starters, schooling families, digital boomers), then the features are found primarily in financial and demographic data.

Features that are extracted could be:

In the case of numerical features that change over time (such as income amount or spending at a specific merchant), these are almost always created by applying an aggregate function to the values over some period of time (average monthly income, total spending at a specific merchant over the last 3 months). Categorical variables, on the other hand, are turned into features by applying an aggregate function to each possible value of the variable (pivoting), which may result in anywhere from a few to dozens of new features.
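For illustration, here is a minimal sketch of how such features could be built with pandas, assuming a hypothetical transactions table with customer_id, date, merchant_category, and amount columns (the column names and the 3-month window are assumptions, not taken from the article):

```python
import pandas as pd

# Hypothetical transactions table: one row per transaction
# columns: customer_id, date, merchant_category, amount
tx = pd.read_csv("transactions.csv", parse_dates=["date"])

# Keep only the last 3 months of history
cutoff = tx["date"].max() - pd.DateOffset(months=3)
recent = tx[tx["date"] >= cutoff]

# Numerical features: aggregate functions over the period
numeric = recent.groupby("customer_id")["amount"].agg(
    total_spending="sum",
    avg_transaction="mean",
    n_transactions="count",
)

# Categorical feature: pivot spending per merchant category,
# producing one column per category (TRAVEL, GROCERIES, ...)
per_category = recent.pivot_table(
    index="customer_id",
    columns="merchant_category",
    values="amount",
    aggfunc="sum",
    fill_value=0,
).add_prefix("spent_")

features = numeric.join(per_category, how="left").fillna(0)
```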

After the features are prepared, the next step is to perform clustering.

It is obvious that KMeans applied directly to the raw input won’t lead to satisfying results, because we may have a large number of features (especially from categorical variables) and not all features are equally important. A common remedy is to first compress the input into a lower-dimensional embedding with an autoencoder and then cluster the embeddings instead.

Here are a few more words on its architecture. Autoencoders are special types of neural networks that try to compress (unlabelled) data into a latent-space representation and to reconstruct the output from that representation. In this way, an autoencoder tries to learn the distribution of the input and to store it in a lower number of dimensions. The first part is called the encoder: a simple (deep) feed-forward neural network that produces the embedding (or code). The second part is the decoder (a mirrored encoder), which tries to reconstruct some defined output (the original input, the input without noise, etc.) using only the knowledge stored in the embedding. Depending on the defined loss function, this architecture ends up learning the embedding while minimizing the loss. Since feed-forward neural networks contain non-linearities between layers, the encoder can learn complex relations between input features (which is why autoencoders perform so well on dimensionality reduction problems) and therefore find the best low-dimensional representation that describes the input data.

Autoencoder architecture consisting of Encoder and Decoder

After the embedding for each input row (user) is extracted, KMeans clustering is applied to the embeddings, as presented in the image below.
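A minimal sketch of this autoencoder-plus-KMeans pipeline in Keras and scikit-learn could look as follows (the layer sizes, the embedding dimension of 8, and the choice of 5 clusters are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from tensorflow import keras
from tensorflow.keras import layers

# features: the (hypothetical) customer feature table built in the previous step
X = StandardScaler().fit_transform(features.values)
n_features = X.shape[1]
embedding_dim = 8  # assumed latent size

# Encoder: feed-forward network compressing the input to the embedding
encoder_input = keras.Input(shape=(n_features,))
h = layers.Dense(64, activation="relu")(encoder_input)
h = layers.Dense(32, activation="relu")(h)
embedding = layers.Dense(embedding_dim, name="embedding")(h)

# Decoder: mirrored encoder reconstructing the original input
h = layers.Dense(32, activation="relu")(embedding)
h = layers.Dense(64, activation="relu")(h)
decoded = layers.Dense(n_features)(h)

autoencoder = keras.Model(encoder_input, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, batch_size=256, validation_split=0.1)

# Extract embeddings and cluster them with KMeans
encoder = keras.Model(encoder_input, embedding)
Z = encoder.predict(X)
clusters = KMeans(n_clusters=5, random_state=42).fit_predict(Z)
```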

Here is a list of blogs on this approach applied to different kinds of segmentation:

TSNE

Speaking of non-linearities in compressing information, it is worth mentioning one more method that does the same thing. T-distributed Stochastic Neighbor Embedding, or TSNE, converts the similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint distributions of the low-dimensional embedding and the high-dimensional data. It is therefore capable of compressing the input into a low-dimensional embedding, like the previously mentioned techniques. Since TSNE has a cost function that is not convex, every initialization of the method will produce different results (it is probabilistic), which is why it is not advised for dimensionality reduction. However, it is a perfect tool for getting a visual sense of the data when it is compressed to 2 dimensions. Usually, a large number of features are combined (using the FeatureTools package, for example), resulting in a few hundred new features. This processed data is then fed into TSNE, compressed to 2 dimensions, and plotted. We can interpret this as: the closer two points are in the 2D representation, the more similar they are.

Also, if the bank is interested in discovering only a particular group of users with clearly distinctive traits (such as high spenders who follow new trends, or exclusive clients who spend a lot and have a tendency toward traveling), this is possible by inspecting certain groups in the 2D representation and their most dominant traits. If such a group is unambiguously noticeable in the 2D TSNE representation, all the clients that belong to it are easily labeled. This is very important because the bank can then use these labels to build a classification model (instead of clustering). Again, TSNE is a quite powerful technique for visual representation of the input, but since it is a probabilistic method it is not recommended for extracting embeddings from data!
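As a rough sketch, a 2D TSNE projection could be produced with scikit-learn and matplotlib like this (the perplexity value and the spent_TRAVEL column used for coloring are hypothetical choices for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# X: the (possibly few-hundred-dimensional) customer feature matrix
X_scaled = StandardScaler().fit_transform(X)

# Compress to 2 dimensions purely for visual inspection
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_2d = tsne.fit_transform(X_scaled)

# Color the points by one dominant trait (here an assumed travel-spending
# column) to make distinctive groups visible in the 2D map
plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=features["spent_TRAVEL"], s=5, cmap="viridis")
plt.colorbar(label="travel spending")
plt.title("TSNE projection of customer features")
plt.show()
```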

Here are some of the articles on this topic:

Pros and cons of segmentation approach

Besides the manual work that needs to be done (analyzing clusters and assigning labels) and the lack of metrics during the development phase that could tell how good the model is (this can only be checked after the model is deployed to production, since the model is unsupervised), this approach is also quite inert. Since the features used in clustering are often aggregated (average, count, sum) over the last few months, the segmentation will only change if some “huge” event happens in the customer’s behavior, or if “smaller” unusual events happen frequently over a shorter period of time. In other words, as long as the user stays in their comfort zone, there is no migration between the clusters, and the bank is unable to address the customer in a timely manner for a specific life event. A potential fix would be decreasing the time intervals (aggregating features over smaller time periods) or choosing features related to the time component (e.g. recency).

Even though its real accuracy can only be checked after the model is deployed to production, segmentation based on clustering is still a great place to start, and a bank that decides to introduce personalization into its business should definitely try this one first. It is quite easy to implement, very intuitive and understandable, adaptable to different requirements for segments, and the results will most probably satisfy initial expectations :)

The idea is to build a prediction model which tells whether the customer will take a product or service in the next period. These types of models attack the problem of timely, precise, and appropriate content. In basic terms, this is a classification model that classifies customer behavior over some period into a positive class (product taken after that period) or a negative class (not taken). Depending on the number of products (or services) the bank wants to recommend, the model has an output of that size, with a binary classifier (e.g. sigmoid) for each class. It is advised to use binary classifiers at the end (instead of a softmax, for example), since the distribution across products is not uniform and we don’t want to bias against less frequent products. In this way, each product is considered independently. Labels for this approach are extracted from historical data: for a particular point in time, customer behavior before that moment (over an arbitrary number of months) is used to build the features, while the few months after it serve as the prediction period. If the user took the product during the prediction period, the label for that class is positive, otherwise negative.
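A minimal sketch of such a multi-label model in Keras, with an independent sigmoid output per product, might look like this (the feature dimension, number of products, and layer sizes are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 120   # assumed number of behavioral features
n_products = 10    # assumed number of products to predict

# One independent binary (sigmoid) output per product, instead of a softmax,
# so that rare products are not suppressed by frequent ones.
inputs = keras.Input(shape=(n_features,))
h = layers.Dense(128, activation="relu")(inputs)
h = layers.Dropout(0.3)(h)
h = layers.Dense(64, activation="relu")(h)
outputs = layers.Dense(n_products, activation="sigmoid")(h)

model = keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",   # applied independently to each product
    metrics=[keras.metrics.AUC(multi_label=True)],
)

# X_train: features built from the observation window
# y_train: multi-hot vector, 1 if the product was taken in the prediction window
model.fit(X_train, y_train, epochs=20, batch_size=512, validation_split=0.1)
```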

As we want our model to be sensitive to the dynamics of users’ behavior, so that the offer is sent as soon as possible, special attention should be paid to feature creation. Unlike segmentation, where we wanted a “smoothed” overall picture of the client over a longer period, here we want to give an advantage to the latest events, shorten the time periods for aggregation functions, and focus more on each individual client action (rather than averages). Here are some examples: maximum foreign-currency withdrawal in the last month, latest spending on travel agencies/car dealers/building materials, increase in salary compared to the previous month, number of website visits in the last 24h, etc.

Note that highly imbalanced classes are expected here, since some products are naturally less frequent than others and taking a product is much rarer than not taking it. This module is strongly dependent on data quality, and there must be enough examples per class for successful training. One of the trivial solutions is resampling (up/down sampling). However, this is not the only option.

Predicting whether the customer will take the product can be viewed from a data and time perspective. The described approach of extracting labels and features at a certain point in time can be repeated at many time points. If we slide the reference point through time at regular intervals (e.g. one month), we get similar (though probably not identical) features for the same client and the same product, but from a different time view. In this way, we can augment our dataset and increase the number of examples for the rare positive classes. At each time interval we also get an enormous number of negative examples, so it is advised to use negative sampling to reduce their number.

Shifting the time window to multiply the data
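Here is a rough sketch of how such a sliding-window dataset could be assembled with pandas; build_features, the table and column names, and the window lengths are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical helper: aggregates a slice of transactions into one feature
# row per customer (as in the segmentation section, but over shorter windows).
def build_features(tx_slice):
    return tx_slice.groupby("customer_id")["amount"].agg(
        total_spending="sum", n_transactions="count")

def build_training_set(tx, purchases, products, start, end,
                       obs_months=6, pred_months=3):
    """Slide a monthly reference point through time: the `obs_months` of
    history before it build the features, the `pred_months` after it give
    the labels."""
    frames = []
    for ref in pd.date_range(start, end, freq="MS"):
        obs = tx[(tx["date"] >= ref - pd.DateOffset(months=obs_months)) &
                 (tx["date"] < ref)]
        pred = purchases[(purchases["date"] >= ref) &
                         (purchases["date"] < ref + pd.DateOffset(months=pred_months))]
        feats = build_features(obs)
        # Multi-hot labels: 1 if the customer took the product in the window
        for p in products:
            buyers = set(pred.loc[pred["product"] == p, "customer_id"])
            feats[f"label_{p}"] = feats.index.isin(buyers).astype(int)
        feats["ref_date"] = ref
        frames.append(feats)
    return pd.concat(frames)
```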

Given the class imbalance, the metrics observed for this task are Precision, Recall, F1, and ROC AUC. More on these metrics can be found here.
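Evaluating one product’s classifier with scikit-learn could look like this (the 0.5 threshold is just an illustrative default and can be tuned per product):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score)

# y_true: binary labels for one product, y_prob: predicted probabilities (numpy arrays)
y_pred = (y_prob >= 0.5).astype(int)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))
```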

Prediction models try to answer when and what should be offered to the customer. Some cons of this model are:

Pros:

In cases when the bank doesn’t have clear granularity in product types (for example, all loans fall under Loans, without a split by purpose into car loans, home renovation loans, travel loans, etc.), segmentation can be combined with the prediction model so that specific content of interest is generated for each client (or group of clients). For example, if a person is a student starting undergraduate studies, the Loan content may be related to that particular life event; if a person is just starting a career, the Loan content may be related to a house/car purchase, and so on. This is a combination of ML-based models with manually defined rules.

The next level of personalization would be to suggest products (such as a Loan, Overdraft, Mortgage, or Insurance) with concrete amounts of money, number of installments, and purpose (car purchase, summer vacation planning, house decoration, travel insurance, etc.), generated by machine learning models that consider the client’s financial state and interests. Instead of only predicting whether the person will take a Loan or not, the model could also predict these additional product properties together with the information on taking the product. This could be implemented as a combination of two or more loss functions to optimize (a loss for the binary classifier, a regression loss for continuous values such as the amount and number of installments, and a cross-entropy loss for the multiclass classifier (purpose)). Extra input features could also be added here, such as the results of the segmentation mentioned previously. Moreover, the bank could focus on detecting important life events, such as the birth of a child, starting studies, buying a house, purchasing a car, starting a new job, or a salary increase, simply by combining prediction and segmentation results.
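One possible sketch of such a multi-task model in Keras, combining a binary head, two regression heads, and a purpose classifier (the dimensions and loss weights are assumptions, not values from the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_purposes = 120, 6   # assumed dimensions

inputs = keras.Input(shape=(n_features,))
h = layers.Dense(128, activation="relu")(inputs)
h = layers.Dense(64, activation="relu")(h)

take_loan = layers.Dense(1, activation="sigmoid", name="take_loan")(h)
amount = layers.Dense(1, name="amount")(h)                # regression head
installments = layers.Dense(1, name="installments")(h)    # regression head
purpose = layers.Dense(n_purposes, activation="softmax", name="purpose")(h)

model = keras.Model(inputs, [take_loan, amount, installments, purpose])
model.compile(
    optimizer="adam",
    loss={
        "take_loan": "binary_crossentropy",
        "amount": "mse",
        "installments": "mse",
        "purpose": "sparse_categorical_crossentropy",  # purpose as integer class id
    },
    # the relative weighting of the losses is a tuning decision
    loss_weights={"take_loan": 1.0, "amount": 0.2,
                  "installments": 0.2, "purpose": 0.5},
)
```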

In cases where a bank has an enormous number of products (or items of any kind, in general) and there is potential to target its customers often (mistakes are not heavily penalized through customer dissatisfaction), the use of a recommendation system is a natural choice. Items should preferably be of the same type, repeatable, and easily accessible to users. The question one may ask here is: why would a bank have such a gigantic number of products or items? This depends on the bank’s business strategy and the product portfolio it wants to offer. I will give a few examples:

Bank, Clients and Companies creating an ecosystem

Deep recommendation systems

This is quite an extensive field, as there are many kinds of these systems. Some of the most successful are deep recommendation systems, run nowadays mainly by companies such as Netflix, Spotify, Facebook, etc., which report outstanding results. Traditionally, there are content-based and collaborative-filtering recommendation systems. The first matches the user with items based on features of both, while the second matches the user with items based on the interactions of similar users with similar items. There are also hybrid content-collaborative approaches where both are taken into consideration.

Collaborative filtering

The idea is to represent the user and item inputs as sparse one-hot encodings of dimensions M (number of users) and N (number of items), respectively. After the encoding comes the embedding layer (of dimension K), which learns a latent representation of the user/item. These two representations are then merged (concatenation or scalar dot-product) and fed into a feed-forward neural network (the Neural CF layers). The output is a sigmoid producing the probability that the processed user and item actually interacted. If there is an interaction, the target is the positive class 1, otherwise the negative class 0. Looking at all possible user-item pairs, it is obvious that only a few of them are positive while most are negative. Because of this imbalance, not all negative pairs are used during training, but rather those chosen by the “negative sampling” approach (for each positive sample, randomly choose n negatives, a 1:n ratio). The train and validation datasets are formed from the user perspective. For each user, all interactions with items except the last one are used as positive samples in the train set, while the last one is used for validation. Negative samples are added to both datasets using the above-mentioned approach, with the extra condition that the validation negatives do not overlap with those in the train set. The collaborative filtering approach assumes that all users are present during the training procedure, so that each user can form an embedding. This is a problem for new users, since a “cold start” occurs until they interact with some of the items. Moreover, the model needs to be retrained from scratch because new users don’t have learned embeddings. This is where content-based recommenders come to the rescue :)
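A minimal Keras sketch of this neural collaborative filtering architecture could look as follows (the numbers of users and items and the embedding dimension K are placeholder values):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_users, n_items, k = 100_000, 500, 32   # assumed sizes, k = embedding dimension

user_in = keras.Input(shape=(1,), name="user_id")
item_in = keras.Input(shape=(1,), name="item_id")

# Embedding layers learn a k-dimensional latent vector per user / item
user_vec = layers.Flatten()(layers.Embedding(n_users, k)(user_in))
item_vec = layers.Flatten()(layers.Embedding(n_items, k)(item_in))

# Merge the two representations and pass them through the Neural CF layers
x = layers.Concatenate()([user_vec, item_vec])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid", name="interaction_prob")(x)

ncf = keras.Model([user_in, item_in], out)
ncf.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=[keras.metrics.AUC()])

# Training pairs: observed interactions as positives plus n randomly
# sampled non-interacted items per positive (negative sampling)
ncf.fit([user_ids, item_ids], labels, epochs=10, batch_size=1024)
```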

Content-based recommender

While collaborative filtering represents the user or item as a one-hot encoding and learns its representation from interactions, content-based systems represent them using features that describe them. Depending on the nature of the features, there might be multiple embeddings, each learning the representation for a particular type of feature. These are all merged in some way (cosine similarity, scalar dot-product, concatenation), resulting in one embedding for the user and one for the item. In general, the features that represent the user and the item don’t have to be of the same type. The rest of the architecture is more or less the same as for collaborative filtering.

Embedding is learned from user/item features
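A comparable sketch for the content-based variant, where each side is embedded from its own features (the feature sizes and the dot-product merge are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_user_feats, n_item_feats, k = 40, 25, 32   # assumed feature and embedding sizes

user_in = keras.Input(shape=(n_user_feats,), name="user_features")
item_in = keras.Input(shape=(n_item_feats,), name="item_features")

# Separate towers learn one embedding for the user and one for the item
user_emb = layers.Dense(k, activation="relu")(layers.Dense(64, activation="relu")(user_in))
item_emb = layers.Dense(k, activation="relu")(layers.Dense(64, activation="relu")(item_in))

# Merge the embeddings (dot product here; concatenation or cosine similarity also work)
score = layers.Dot(axes=1)([user_emb, item_emb])
out = layers.Activation("sigmoid")(score)

content_model = keras.Model([user_in, item_in], out)
content_model.compile(optimizer="adam", loss="binary_crossentropy")

# For a new user (cold start) only their feature vector is needed:
# pair it with every item's features to score all existing items.
```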

This recommender is particularly popular since it doesn’t require interactions between the user and the item in order to make a recommendation. As the user is represented with features that are independent of items and other users, the system is able to extract the user embedding and calculate the probability of interacting with every existing item. The “cold start” problem is therefore solved!

Pros:

Cons:

When both approaches are combined and embeddings are formed from features as well as from interactions, we speak of hybrid systems.

Here are some blogs on different types of deep recommendation systems:

The goal of this article was to analyze current approaches to personalization in the banking industry and to list some parts of ML that can be used for it. In practice, there is no general solution to these types of problems; a lot of experimentation and data wrangling is needed to get to the winning formula. Moreover, the results depend heavily on data quality, and sometimes even the best state-of-the-art models are unable to give satisfying results. These are all factors to consider when starting a personalization system in the banking industry.

On the other hand, understanding customer needs and approaching each customer in a unique way, offering services and products that are convenient, relevant, and valuable at a given moment, is a MUST. Even though it requires extra effort, it is worth committing to in the long run. Banks that focus on their customers have a SUCCESSFUL strategy, and Machine Learning is the best tool to help them accomplish it.
