RL in XChannnel
How we used Reinforcement Learning to target offers to customers
XChannel is an in-store marketing platform which targets customers with contextual offers when they are shopping in stores. To this end, we use the customers location within the store to predict their shopping interests and uses store offers to incentivize shopping decisions. Merchants or brands setup various types of campaigns via our dashboard and monitor their performance in near real-time.
XChannel supports 3 types of in-store campaigns:
- Proximity based campaigns
- Social Campaigns
- Proximity based Social Campaigns
In all these campaigns, we are faced with a few key decisions when it comes to engaging customers. Some of these decisions are:
- Should we engage a customer – Though we use the presence of the shopper in a particular shopping aisle as the primary signal for engaging them, we use a few secondary indicators before deciding whether to engage them at all or not. Every customer engagement results in a potential negative customer experience if its not relevant to them.
- When do we engage them? – Once we decide to engage a customer, next decision is when. Sending in-store offers is fraught with risk when it comes to relevance. For example, what if they are just walking by an aisle when they come within the range of a beacon? Or what if they are just cursorily interested in a product?
- Which offer to engage them with? – XChannel allows multiple offers to be associated with a campaign. So the exact offer to use to target the customer is chosen dynamically based on the probability of acceptance.
We use our Machine Learning model to predict the responses before deciding on whether, when and how to engage them.
Traditional learning algorithms use Supervised Learning and use lots of labeled data. In the in-store targeting world, collecting this data is challenging for a few reasons.
- Cost of Learning – The data collection exercise requires us to run through various data combinations so that they cover all possibilities. So a lot of campaign targeting is wasted just to collect the data. This also means that we may end up with a lot of poor customer experience up-front just to improve future customer experience.
- Learning become irrelevant – The products associated with a campaign, and the campaign itself keep changing from time to time. Typical campaigns don’t last more than a few weeks. So by the time we collect the data, it can become less relevant since the campaign itself may have changed.
- Too many variables – The above two problems are exacerbated by the fact that we need to collect data for a campaign across too many variable which includes among others Customer data, campaign data and store data. All this data along with engagement data explodes the various combinations of data required to train the model using Supervised Learning methods.
All the challenges above with data collection made us look at Reinforcement Learning (RL) as an alternative. RL has been gaining a lot of popularity in the last few years ever since the folks at DeepMind trained an RL agent to play the Atari game in 2013. It has shown tremendous potential in its ability to learn online through “trial and error”.
A Short Intro to Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that has its foundation in behavioral psychology. RL algorithms are used to solve problems which are modeled as Markov Decision Process (MDP) which is a Discrete Time Stochastic Control Process. MDP’s are commonly used for solving sequential decision making problems under stochastic conditions. In other words, it provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
The key elements of an MDP are an Environment and an Agent (Fig 1). The Environment may be a game, or a robot, or in our case, an ad delivery system. The agent may be a software system that is learning to interact with the environment through periodic actions, and gains rewards in the process.
When the agent interacts with the environment through its action, the environment transitions to a new state. This transition is sometimes associated with a reward. The agent uses this reward to learn more about the environment. The goal of RL is to train a software agent to interact with the environment in such a way that it can maximize future reward. The series of actions taken by the agent is called a policy and the policy that maximizes future reward is called the “optimal policy”. So the RL algorithms attempt to learn the optimal policy for a stochastic environment, which can be modeled as an MDP, by interacting with it through “trial and error” (a.k.a. exploration and exploitation) and observing the reward obtained in the process.
Figure below shows the key elements of an MDP, and the agent learning process. The agent, at every discrete time interval, gets the state of the environment. The agent then generates an action. The environment responds to the action by transitioning to a new state and returns the new state along with an associated reward to the agent.
Figure 1: A temporal view of agent-environment interaction. Agent learns about the quality of action At taken time ’t’ in the next time step based on the type (positive or negative) and the quantum of reward obtained (Rt+1). Source: Sutton and Barto (2nd ed).
The goal of an RL agent, is to solve the MDP by finding the optimal policy, i.e., finding the sequence of action it can take, to maximize the cumulative future reward. In cases where we don’t know the transition probability function or reward function, we go for model-free RL.
RL problems have been studied extensively in the last few decades and multiple approaches exist for solving them. Two of the popular approaches for model-free RL are:
- Q-Learning belongs to a group of algorithms called Value Based algorithms. It aims to assign a value to each state-action pair based on their potential to maximize future rewards. Intuitively, this is akin to a chess player looking at all possible moves from a given board position, and picking the one that has the highest chance of future victory. Sounds logical, doesn’t it?
- REINFORCE (a.k.a. Stochastic Policy Gradient) belongs to a family of algorithms called Policy Based algorithms. It tries to directly learn a policy by sampling from a probability distribution of possible actions at a given state. Loosely speaking, this is akin to a chess player making a random move and trying to remember (forget) that move based on the positive (negative) reward obtained. These rewards may be obtained at a future point in time after multiple moves are made. The key idea underlying policy gradients is to tweak the action sampling distribution so as to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy. See here for an example of an agent learning to play the game of Pong using Policy Gradient.
Both these approaches have some challenges in terms of convergence, stability, and applicability to certain class of problems. Hence variants of these approaches have evolved which tries to address those limitations. More info on RL and various algorithms may be found here, here, here and here.
One of the key prerequisites for applying RL to a problem is that the problem should be framed as a Markov Decision Process. That does not mean you need to fully describe the MDP model (i.e., all the state transition probabilities and reward distributions), just that you expect an MDP model could be made or discovered.
This paper from Facebook has some good guidelines on how to choose the features so that it will conform to an MDP model. Basically it boils down to verifying that:
- All chosen actions and state features are relevant. In other words, the state transitions are predictable by the chosen action and state features
- The reward obtained is indeed determined by both actions and states in a meaningful way
In the context of MDP, XChannel model consisted of the following:
Environment – The customer who is receiving notifications/offers and acting on them. Since same offer may evoke different responses at different times, the environment is stochastic
Agent – XChannel offer notification system, that at discrete time intervals, decides whether to send an offer notification
Action – YES/NO depending on whether we decide to engage the customer or not at that discrete time interval
Reward – Positive or negative values returned by the environment in response to our actions. A positive reward is obtained if the user clicks on the offer, while a negative reward is generated if the customer ignores the offer. The rewards are not immediate, sometimes it can take a while for the customer to act on an offer.
State – A set of features about the recipient customer, the offer candidate, and the beacon triggering the notification.
The Markov Decision Process (MDP) is based on a sequence of offer candidates for a particular person. The actions here are sending and dropping the notification, and the state describes a set of features about the person and the offer candidate.
With these definitions of Environment, Agent, Action, and Reward, our RL problem boils down to this –
- At every time step, the agent decides whether or not to send an offer to the customer.
- Based on the customer response, a reward is generated. The type (positive or negative) and the quantum of this reward informs the agent if the offer was relevant to the customer.
- This information is used by the agent to learn before taking future actions.
That’s it! Agent takes an action, and learns about the efficacy of that action from the reward it receives at a future time step. The action taken by the agent is sometimes random, and sometimes based on its prior learnings. This way the agent learns new behaviors instead of repeating the same action that may have worked before.
Figure 2: Markov Decision Process formulation for XChannel. State diagram representing one episode of interaction between the agent and the environment. The Markov Decision Process (MDP) is based on a sequence of offers for a particular person. The actions here are sending and dropping the offer notification, and the state describes a set of features about the customer and the offer candidate.
Figure 3 below shows the key components of the system. System consists of three logical components:
Listener – Monitors for new devices and ranges them when they are within the proximity of our beacons. Once they are within range, it collects all relevant details (state) of the customer and their device, and passes it to the Decider. Listener also monitors for user activity and assigns reward based on user engagement.
Decider – The component that encapsulates our RL model. It takes in the user/device state and any associated rewards, and decides whether and when to engage the
Figure 3: Component view of XChannel
customer. It also picks the offer to engage the customer with.
Notifier – Notifies the customer at the based on the input it receives from the Decider.
The RL model employed is a Deep Q-Network (DQN) which is a deep Neural Network which learns the state-action value (Q) function.
As we discussed, RL algorithms are theoretically designed on the Markov Decision Process (MDP) framework where some sort of long-term reward is optimized in a sequential setting.
Online vs Offline Learning – One of the benefits of using RL comes from its ability to learn online – i.e., it can interact with the environment by taking an action, while learning from those interactions. The model is iteratively improved based on the rewards received after each action. Eventually the agent learns what action to take at each state, so as to maximize the expected rewards (optimal policy). But online training can be risky too. Platforms like Horizon provide you the ability to do offline learning and Counterfactual Policy evaluations to assess how a new (incoming) model compares with the existing model, before rolling it out to production. Offline methods and Counterfactual Policy evaluations can make production rollouts less risky. Out current implementation relies on online learning though we may explore alternatives in future.
Hyper-parameter Selection – Learning rate, Discount factor, Initial Q-values, Network topology, exploration-exploitation factor for -greedy etc were all tuned through trial and error. There is a lot of literature out there on how to tune these values using heuristics.
Reward Selection – Traditional reward selection would involve selecting a positive reward for user clicking on the offer, and a negative reward for user ignoring it. We used differentiated rewards for following user flows –
- A reward of +X for user opening the offer notification and acting on it
- A reward of -X for user opening an offer notification and ignoring it
- A reward of -Y for user not opening a notification (Y < X)
However Facebook has attempted a more sophisticated reward function which adds a penalty component for every notification sent. This is to account for the negative user experience associated with receiving a notification. This can be a good way to arrive at a policy that increases the long term value for the customer. They have also attempted reward shaping by using a combination of parameters to calculate the reward.
Stability – Traditional Deep Q-Network suffer from a couple of issues which affect their stability and convergence.
- As it continues to explore new state trajectories, it can “forget” previously learned policies (“experience amnesia”).
- As both the input feature value and the target Q-value are both changing constantly, the solution could become unstable.
Adoption of RL significantly improved our offer adoption rates compared to our pure heuristic based approach of the past. It also reduced the customer dissatisfaction emanating from receiving redundant or irrelevant offer notifications at the most inopportune times. We chose DQN for our model due to its simplicity and numerous enhancements behind it. Though our approach relied on online training, we are exploring and offline model due to the risk of deploying online learning models in production.