February 2016

Volume 31 Number 2

[Microsoft Azure]

Azure Service Fabric, Q-Learning and Tic-Tac-Toe

By Jesus Aguilar

Innovations in cloud computing are lowering the barriers to entry, turning distributed computing and machine learning from niche technologies that require specialized and expensive infrastructure into commodity offerings available to any software developer or solution architect. In this article, I’ll describe the implementation of a reinforcement learning technique that leverages the distributed computing and storage capabilities of Azure Service Fabric, the next iteration of the Azure Platform-as-a-Service offering. To demonstrate the potential of this approach, I’ll show how you can use Service Fabric and its Reliable Actors programming model to create an intelligent back end that can predict the next move in a game of tic-tac-toe. War Games, anyone?

Enter Q-Learning

Today we see innovative data-driven solutions such as recommendations, face recognition and fraud detection everywhere. Software engineering teams use supervised and unsupervised learning techniques to implement these solutions. Despite the extensive capabilities of these approaches, there are cases where they are difficult to apply.

Reinforcement learning takes a different approach. Rather than attempting to generalize patterns by training a model from labeled information (supervised learning) or from unlabeled data (unsupervised learning), it focuses on problems you can model as a sequence of states and transitions.

Say you have a scenario you can represent as a sequence of states that lead to the final state (known as the absorbing state). Think of a robot making decisions to avoid obstacles, or artificial intelligence (AI) in a game designed to beat an opponent. In many cases, the sequence of states that lead to a particular situation is what determines the best next step for the agent/robot/AI character.

Q-learning is a reinforcement learning technique that uses an iterative reward mechanism to find optimal transitional pathways in a state machine model; it works remarkably well when the number of states and transitions is finite. In this article, I’ll show how I used Service Fabric to build an end-to-end Q-learning solution and how you can create an intelligent back end that “learns” how to play tic-tac-toe. (Note that state machine scenarios are also referred to as Markov Decision Processes [MDPs].)

First, some basic theory about Q-learning. Consider the states and transitions depicted in Figure 1. Say you want to find, at any state, which state an agent needs to transition to next to arrive at the gold state—while minimizing the number of transitions. One way to tackle this problem is to assign a reward value to each state. The reward suggests the value of transitioning to a state toward your goal: getting the gold.

A Sequence of States Leading to the Gold State
Figure 1 A Sequence of States Leading to the Gold State

Simple, right? The challenge becomes how to identify the reward for each state. The Q-learning algorithm identifies rewards by recursively iterating and assigning rewards to states that lead to the absorbing (gold) state. The algorithm calculates a state’s reward by discounting the reward value from a subsequent state. If a state has two rewards—which is possible if a state exists in more than one pathway—the highest prevails. The discount has an important effect on the system. By discounting the reward, the algorithm reduces the value of the reward for states that are far from the gold and assigns more weight to the states closest to the gold.

As an example of how the algorithm calculates the rewards, look at the state diagram in Figure 1. As you can see, there are three pathways to the gold state:

1->5->4->G

1->5->3->4->G

1->5->3->2->4->G

Running the algorithm with brute-force transitioning (iterating through all the possible paths in the graph) calculates and assigns the rewards for the valid pathways. With a discount factor of 0.9, the results are:

1(R=72)->5(R=81)->4(R=90)->G (R=100) 

1(R=64)->5(R=72)->3(R=81)->4(R=90)-> G(R=100)

1(R=58)->5(R=64)->3(R=72)->2(R=81)->4(R=90)-> G(R=100)

Because some states have more than one reward, the highest value will prevail. Figure 2 depicts the final reward assignment.

Final Rewards
Figure 2 Final Rewards

With this information, an agent can identify the optimal path to gold from any state by transitioning to the state with the highest reward. For instance, if the agent is in state 5, it can transition to state 3 or state 4, and it chooses 4 because that reward is higher.
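
To make the discounting concrete, here’s a minimal, self-contained C# sketch (my own illustration, not part of the article’s solution) that brute-forces the Figure 1 graph and keeps the highest discounted reward seen for each state. The class and variable names are illustrative only; note that the exact values carry fractions (state 1 works out to 72.9) that Figure 2 shows rounded down.

using System;
using System.Collections.Generic;
using System.Linq;

class DiscountedRewardSketch
{
  static void Main()
  {
    // Adjacency list for the Figure 1 graph; "G" is the absorbing (gold) state.
    var transitions = new Dictionary<string, string[]>
    {
      ["1"] = new[] { "5" },
      ["5"] = new[] { "3", "4" },
      ["3"] = new[] { "2", "4" },
      ["2"] = new[] { "4" },
      ["4"] = new[] { "G" }
    };
    var rewards = new Dictionary<string, double> { ["G"] = 100 };
    const double discount = 0.9;

    // A state's reward is the discounted reward of its best subsequent state;
    // when a state appears in more than one pathway, the highest value prevails.
    double Visit(string state)
    {
      if (state == "G") return rewards["G"];
      var best = transitions[state].Max(next => discount * Visit(next));
      rewards[state] = Math.Max(rewards.TryGetValue(state, out var r) ? r : 0, best);
      return rewards[state];
    }

    Visit("1");
    // Prints the final rewards shown in Figure 2 (fractions appear rounded in the article).
    foreach (var pair in rewards.OrderBy(p => p.Key))
      Console.WriteLine($"State {pair.Key}: reward {pair.Value:F1}");
  }
}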

Azure Service Fabric

Service Fabric, the next iteration of the Azure Platform-as-a-Service offering, empowers developers to create distributed applications using two different top-level programming models: Reliable Actors and Reliable Services. These programming models allow you to maximize the infrastructure resources of a distributed platform. The platform handles the most difficult tasks associated with maintaining and running a distributed application: recovery from failures, distribution of services to ensure efficient resource utilization, rolling updates and side-by-side versioning, to mention a few.

Service Fabric provides you with a cluster, giving you a higher level of abstraction to use, rather than having to worry about the underlying infrastructure. Your code runs on the nodes of a Service Fabric cluster, and you can host multi-node clusters on a single machine for development purposes or on multiple servers (virtual or physical machines) for production workloads. The platform manages the lifecycle of your Actors and Services and the recovery from infrastructure failures.

Service Fabric introduces Reliable Actors and Reliable Services with stateful semantics. This capability translates into a fully integrated developer experience in which you can build applications that persist data in a distributed, and therefore highly available, manner without adding an external storage or caching layer to your architecture.

By implementing the Q-learning algorithm as a service within Service Fabric, you benefit from distributed computing and low-latency state storage, enabling you to execute the algorithm, persist the results and expose the whole thing as reliable endpoints for clients to access. All these capabilities come together in a single solution with a unified programming and management stack. There’s no need to add components to your architecture, such as external storage, a cache or a messaging system. In short, you have a solution in which your compute, data and services reside within the same integrated platform. That’s an elegant solution in my book!

Q-Learning and Reliable Actors

The actor model simplifies the design of massively concurrent applications. In the actor model, actors are the fundamental computing unit. An actor represents a boundary of functionality and state. You can think of an actor as an object entity living in a distributed system. Service Fabric manages the lifecycle of the actor. In the event of failure, Service Fabric re-instantiates the actor in a healthy node automatically. For example, if a stateful actor crashes for some reason or the node (think VM) it’s running on fails, the actor is automatically re-created on another machine with all of its state (data) intact.

Service Fabric also manages how an instance of an actor is accessed. The platform guarantees that, at any point in time, only one method on a particular actor instance is executing. If there are two concurrent calls to the same actor, Service Fabric queues one and lets the other proceed. The implication is that inside an actor your code doesn’t have to worry about race conditions, locks or synchronization.

As I described earlier, the Q-learning algorithm iterates through states with the goal of finding absorbing states and states with rewards. Once the algorithm identifies an absorbing state with a reward, it will calculate rewards for all the states that lead to the absorbing state.

Using the actor model, I can model this functionality as an actor that represents a state in the context of the Q-learning algorithm (think of stages in the overall graph). In my implementation, the actor type that represents these states is QState. Once there’s a transition to a QState actor containing a reward, the QState actor will create another actor instance of a different type (QTrainedState) for each of the QState actors in the pathway. QTrainedState actors maintain the maximum reward value and a list of the subsequent states that yield that reward. The list contains the state tokens (which uniquely identify a state in the graph) of the subsequent states.

In Figure 3, I depict the logic of the algorithm using actors for a very simple scenario in which a state with state token 3 is an absorbing state, contains a reward of 100, and has only one pathway with two previous states (state tokens 1 and 2). Each circle represents an instance of an actor, QStates in blue and QTrainedStates in orange. Once the transition process reaches the QState with state token 3, that QState actor creates two QTrainedStates, one for each of the previous QStates. For the QTrainedState actor that represents state token 2, the suggested transition (for a reward of 90) is to state token 3, and for the QTrainedState actor that represents state token 1, the suggested transition (for a reward of 81) is to state token 2.

Determining and Persisting Rewards
Figure 3 Determining and Persisting Rewards

It’s possible that multiple states will yield the same reward, so the QTrainedState actor persists a collection of state tokens as children states.

The following code shows the implementation of the interfaces for the QState and QTrainedState actors, called IQState and IQTrainedState. QStates have two behaviors: transitioning to other QStates and starting the transition process when no prior transition exists:

public interface IQState : IActor
{
  Task StartTrainingAsync(int initialTransitionValue);
  Task TransitionAsync(int? previousStateToken, int transitionValue);
}
public interface IQTrainedState : IActor
{
  Task AddChildQTrainedStateAsync(int stateToken, double reward);
  Task<List<int>> GetChildrenQTrainedStatesAsync();
}

Notice that the implementation of IQTrainedState surfaces the method GetChildrenQTrainedStatesAsync. This method is how the QTrainedState actor will expose the trained data containing the states with the highest reward value for any state in the system. (Note that all actors in the Service Fabric must implement an interface derived from IActor.)

QState Actor

After defining the interfaces, I can move to the implementation of the actors. I’ll start with the QState actor and the TransitionAsync method, which is the cornerstone of the algorithm and where most of the work resides. TransitionAsync makes the transition to another state by creating a new instance of a QState actor and calling the same method again.

You might wonder whether calling the method recursively would avoid the overhead of invoking it through another actor instance. A recursive method call, however, concentrates the compute-intensive work on a single node. By instantiating another actor instead, you take advantage of the capabilities of Service Fabric and let the platform distribute the processing across horizontal computing resources.

To manage the assignment of the reward, I’ll register a reminder. Reminders are new constructs introduced in the actor programming model that allow you to schedule asynchronous work without blocking the execution of a method.

Reminders are available only for stateful actors. For both stateless and stateful actors, the platform provides timers that enable similar patterns. One important consideration is that using an actor delays its garbage collection; however, the platform doesn’t count timer callbacks as usage, so if the garbage collector kicks in, the timers are stopped. (Actors won’t be garbage collected while a method is executing.) To guarantee recurring execution, use reminders. More information can be found at bit.ly/1RmzKfr.

The goal, as it relates to the algorithm, is to perform the reward assignment without blocking the transition process. Typically, scheduling a work item in the thread pool with a callback will suffice; however, in the actor programming model, this approach is not a good idea as you’ll lose the concurrency benefits of the platform.

The platform guarantees that only one method is executing at any time. This capability allows you to write your code without considering concurrency; that is, without having to worry about thread safety. As you’d expect, there’s a trade-off: You must avoid creating tasks or threads to wrap operations inside the actor methods. Reminders allow you to implement background processing scenarios while keeping the concurrency guarantees of the platform, as shown in Figure 4.

Figure 4 TransitionAsync in the QState Class

public abstract class QState : StatefulActor, IQState, IRemindable
{
  // ...
  public Task TransitionAsync(int? previousStateToken, int transitionValue)
  {
    var rwd = GetReward(previousStateToken, transitionValue);
    var stateToken = transitionValue;
    if (previousStateToken != null)
      stateToken = int.Parse(previousStateToken.Value + stateToken.ToString());
    var ts = new List<Task>();
    if (rwd == null || !rwd.IsAbsorbent)
      ts.AddRange(GetTransitions(stateToken).Select(p =>
        ActorProxy.Create<IQState>(ActorId.NewId(),
          "fabric:/QLearningServiceFab").TransitionAsync(stateToken, p)));
    if (rwd != null)
      ts.Add(RegisterReminderAsync("SetReward",
        Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(rwd)),
        TimeSpan.FromMilliseconds(0),
        TimeSpan.FromMilliseconds(-1), ActorReminderAttributes.Readonly));
    return Task.WhenAll(ts);
  }
  // ...
}

(Note that setting dueTime to TimeSpan.FromMilliseconds(0) indicates immediate execution.)

To complete the implementation of IQState, the following code implements the StartTrainingAsync method, where I use a reminder to avoid a blocking, long-running call:

public Task StartTrainingAsync(int initialTransitionValue)
{
  return RegisterReminderAsync("StartTransition",
    Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(
      new { TransitionValue = initialTransitionValue })),
    TimeSpan.FromMilliseconds(0),
    TimeSpan.FromMilliseconds(-1),
    ActorReminderAttributes.Readonly);
}

To finalize the implementation of the QState class, I’ll describe the SetRewardAsync and ReceiveReminderAsync methods, shown in Figure 5. The SetRewardAsync method creates or updates a stateful actor (the implementation of IQTrainedState). To locate the actor in subsequent calls, I use the state token as the actor ID; actors are addressable.

Figure 5 The SetRewardAsync and the ReceiveReminderAsync Methods

public Task SetRewardAsync(int stateToken, double stateReward, double discount)
{
  var t = new List<Task>();
  var reward = stateReward;
  foreach (var pastState in GetRewardingQStates(stateToken))
  {
    t.Add(ActorProxy
      .Create<IQTrainedState>(new ActorId(pastState.StateToken),
        "fabric:/QLearningServiceFab")
      .AddChildQTrainedStateAsync(pastState.NextStateToken, reward));
    reward = reward * discount;
  }
  return Task.WhenAll(t);
}
public async Task ReceiveReminderAsync(string reminderName,
  byte[] context, TimeSpan dueTime, TimeSpan period)
{
  await UnregisterReminderAsync(GetReminder(reminderName));
  var state = JsonConvert.DeserializeObject<JObject>(
    Encoding.UTF8.GetString(context));
  if (reminderName == "SetReward")
  {
    await SetRewardAsync(state["StateToken"].ToObject<int>(),
      state["Value"].ToObject<double>(),
      state["Discount"].ToObject<double>());
  }
  if (reminderName == "StartTransition")
  {
    await TransitionAsync(null, state["TransitionValue"].ToObject<int>());
  }
}

QTrainedState Actor

The second actor in the solution is QTrainedState. Because the data in the QTrainedState actor must be durable, I implemented it as a stateful actor.

In Service Fabric, you implement a stateful actor by deriving your class from the StatefulActor or StatefulActor<T> base class and implementing an interface derived from IActor. T is the type of the state instance, which must be a serializable reference type. When you call a method of a class derived from StatefulActor<T>, the platform loads the state from the state provider and, once the call completes, saves it automatically. In the case of QTrainedState, I modeled the state (durable data) using the following class:

[DataContract]
public class QTrainedStateState
{
  [DataMember]
  public double MaximumReward { get; set; }
  [DataMember]
  public HashSet<int> ChildrenQTrainedStates { get; set; }
}
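
As a quick aside, here’s a small sketch (my own, not part of the solution) that round-trips QTrainedStateState through DataContractSerializer, one way to sanity-check that the [DataContract]/[DataMember] attributes make the type serializable, as the platform requires. It assumes the QTrainedStateState class above is in the same project.

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

class StateSerializationSketch
{
  static void Main()
  {
    // Build a sample state instance like the ones the QTrainedState actor persists.
    var original = new QTrainedStateState
    {
      MaximumReward = 90,
      ChildrenQTrainedStates = new HashSet<int> { 142 }
    };

    // Serialize and deserialize it to confirm the data contract round-trips cleanly.
    var serializer = new DataContractSerializer(typeof(QTrainedStateState));
    using var stream = new MemoryStream();
    serializer.WriteObject(stream, original);
    stream.Position = 0;
    var copy = (QTrainedStateState)serializer.ReadObject(stream);

    Console.WriteLine(
      $"Reward: {copy.MaximumReward}, children: {string.Join(",", copy.ChildrenQTrainedStates)}");
  }
}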

Figure 6 shows the complete implementation of the QTrainedState class, which implements the two methods of the IQTrainedState interface.

Figure 6 The QTrainedState Class

public class QTrainedState : StatefulActor<QTrainedStateState>, IQTrainedState
{
  protected async override Task OnActivateAsync()
  {
    this.State =
      await ActorService.StateProvider.LoadStateAsync<QTrainedStateState>(
        Id, "qts") ??
      new QTrainedStateState() { ChildrenQTrainedStates = new HashSet<int>() };
    await base.OnActivateAsync();
  }
  protected async override Task OnDeactivateAsync()
  {
    await ActorService.StateProvider.SaveStateAsync(Id, "qts", State);
    await base.OnDeactivateAsync();
  }
  [Readonly]
  public Task AddChildQTrainedStateAsync(int stateToken, double reward)
  {
    if (reward < State.MaximumReward)
    {
      return Task.FromResult(true);
    }
    if (Math.Abs(reward - State.MaximumReward) < 0.10)
    {
      State.ChildrenQTrainedStates.Add(stateToken);
      return Task.FromResult(true);
    }
    State.MaximumReward = reward;
    State.ChildrenQTrainedStates.Clear();
    State.ChildrenQTrainedStates.Add(stateToken);
    return Task.FromResult(true);
  }
  [Readonly]
  public Task<List<int>> GetChildrenQTrainedStatesAsync()
  {
    return Task.FromResult(State.ChildrenQTrainedStates.ToList());
  }
}

Surfacing the Actors

At this point, the solution has everything necessary to start the training process and persist the data. But I haven’t yet discussed how clients will interact with these actors. At a high level, this interaction consists of starting the training process and querying the persisted data. Each of these interactions correlates nicely with an API operation, and a RESTful implementation facilitates the integration with clients.

In addition to the two programming models, Service Fabric is a comprehensive orchestration and process management platform. The failure recovery and resource management that exist for actors and services are also available to other processes. For instance, you can run Node.js or ASP.NET 5 processes, managed by Service Fabric, and benefit from these capabilities without further effort. So I can simply use a standard ASP.NET 5 Web API application and create an API controller that exposes the relevant actor functionality, as shown in Figure 7.

Figure 7 The API Controller

[Route("api/[controller]")]
public class QTrainerController : Controller
{
  [HttpGet()]
  [Route("[action]/{startTrans:int}")]
  public  async Task<IActionResult>  Start(int startTrans)
  {
    var actor = ActorProxy.Create<IQState>(ActorId.NewId(),
      "fabric:/QLearningServiceFab/");
    await actor.StartTrainingAsync(startTrans);
    return Ok(startTrans); 
  }
  [HttpGet()]
  [Route("[action]/{stateToken}")]
  public async Task<int> NextValue(int stateToken)
  {
    var actor = ActorProxy.Create<IQTrainedState>(new ActorId(stateToken),
      "fabric:/QLearningServiceFab");
    var qs = await actor.GetChildrenQTrainedStatesAsync();
    return qs.Count == 0 ? 0 : qs[new Random().Next(0, qs.Count)];
  }
}

And Tic-Tac-Toe?

What’s left now is to make use of the solution with a concrete scenario. For this, I’ll use a simple game: tic-tac-toe.

The goal is to train a set of QTrainedStates that you can query to predict the next move in a game of tic-tac-toe. One way to think about this is that the machine is acting as both players and learning from the outcomes.

Going back to the implementation, notice that QState is an abstract class. The idea is to encapsulate the basic aspects of the algorithm and put the logic of the specific scenario in a derived class. A scenario defines three parts of the algorithm: how the transition between states occurs (the policy); which states are absorbing and carry an initial reward; and which states the algorithm will assign a discounted reward to. For each of these parts, the QState class has a method you can override to implement the semantics of a specific scenario: GetTransitions, GetReward and GetRewardingQStates.

So the question becomes: How can you model a game of tic-tac-toe as a sequence of states and transitions?

Consider the game depicted in Figure 8, where each cell has a number assigned. You can think of each turn as a transition from one state to another in which the transition value is the cell where the player is making a play. Each state token is then a combination of the previous turns (cells) and the transition value. For the example in Figure 8, a transition from 1 to 14, and then to 142, and so on, models the steps of the game in which the player who played the first turn wins. In this case, all the states along that player’s path to 14273 (the winning and absorbing state) must be assigned a reward: 1 and 142.

The Tic-Tac-Toe Scenario
Figure 8 The Tic-Tac-Toe Scenario
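
To illustrate the encoding, here’s a small sketch (a hypothetical helper of my own, not code from the solution) that builds the concatenated state tokens for the sequence of plays in Figure 8, mirroring the int.Parse concatenation used in TransitionAsync.

using System;

class StateTokenSketch
{
  // Hypothetical helper: appends the cell just played to the previous token,
  // mirroring the int.Parse(previousStateToken + transitionValue) pattern
  // in TransitionAsync.
  static int NextToken(int? previousToken, int cellPlayed) =>
    previousToken == null
      ? cellPlayed
      : int.Parse(previousToken.Value.ToString() + cellPlayed);

  static void Main()
  {
    int? token = null;
    // Alternating plays from Figure 8: cells 1, 4, 2, 7 and 3.
    foreach (var cell in new[] { 1, 4, 2, 7, 3 })
    {
      token = NextToken(token, cell);
      Console.Write(token + " ");  // prints: 1 14 142 1427 14273
    }
    Console.WriteLine();
  }
}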

Going back to Q-learning, what I need to provide are all the final (absorbing) states, each with an initial reward. For tic-tac-toe, three types of states will yield a reward: a win, a tie or a block (referring to a point when your opponent is about to win, so you are forced to use your turn to block him). A win and a tie are absorbing, meaning the game ends; a block is not, however, and the game continues. Figure 9 shows the implementation of the GetReward method for the game of tic-tac-toe.

Figure 9 The GetReward Method

internal override IReward GetReward(int? previousStateToken, int transitionValue)
{
  var game = new TicTacToeGame(previousStateToken,transitionValue);
  IReward rwd = null;
  if (game.IsBlock)
  {
    rwd = new TicTacToeReward() { Discount = .5, Value = 95, IsAbsorbent = false,
      StateToken = game.StateToken};
  }
  if (game.IsWin)
  {
    rwd = new TicTacToeReward() { Discount = .9, Value = 100, IsAbsorbent = true,
      StateToken = game.StateToken };
  }
  if (game.IsTie)
  {
    rwd = new TicTacToeReward() { Discount = .9, Value = 50, IsAbsorbent = true,
      StateToken = game.StateToken };
  }
  return rwd;
}

Next, once a state with a reward has been identified, I need to provide the algorithm with the states that led to it, so a discounted reward can be assigned to each of them. For win or block scenarios, these states are all the previous states (plays) of the winning or blocking player. For ties, all the states (plays) of both players must be assigned a reward:

internal override IEnumerable<IPastState> GetRewardingQStates(int stateToken)
{
  var game = new TicTacToeGame(stateToken);
  if (game.IsTie)           
    return game.GetAllStateSequence();           
  return game.GetLastPlayersStateSequence();
}

Finally, I need to implement a transition policy that determines how the algorithm will iterate through the states. For the game, I’ll implement a transition policy where all possible combinations are explored:

internal override IEnumerable<int> GetTransitions(int stateToken)
{
  var game = new TicTacToeGame(stateToken);
  return game.GetPossiblePlays();
}

Playing Against the Machine

At this point, I can publish the solution and start the training by calling the REST API and providing the initial transitions: 1 through 9.

Once the training finishes, you can use the API to create an application that can simply pass a state token and receive the suggested value. The source code for this article contains a Universal Windows Platform app that uses this back end. Figure 10 shows the game.

A Game of Tic-Tac-Toe
Figure 10 A Game of Tic-Tac-Toe
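
To show what that interaction looks like from a client’s perspective, here’s a hedged sketch. The routes follow from the QTrainerController in Figure 7, while the host name and port (localhost:8505) are assumptions for a local deployment; the loop kicks off training from every opening cell, as described earlier.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class TicTacToeApiClient
{
  static async Task Main()
  {
    // Assumed base address for a locally deployed cluster.
    using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8505/") };

    // Start the training from every possible opening cell (1 through 9).
    for (var cell = 1; cell <= 9; cell++)
      await http.GetAsync($"api/QTrainer/Start/{cell}");

    // Once training has finished, ask for the suggested next state after the
    // sequence of plays encoded by state token 14 (see Figure 8).
    var suggestion = await http.GetStringAsync("api/QTrainer/NextValue/14");
    Console.WriteLine($"Suggested next state token: {suggestion}");
  }
}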

Wrapping Up

By using Q-learning and Service Fabric, I was able to create an end-to-end framework that leverages a distributed platform to compute and persist data. To showcase this approach, I used the game of tic-tac-toe to create a back end that learns how to play the game, and does so at an acceptable level by only indicating when a win, a tie, or a block occurs and letting the machine learn by playing the game.


Jesus Aguilar is a senior cloud architect at Microsoft in the Technical Evangelism and Development team where he partners with awesome companies that are born in the cloud, and helps them deliver compelling experiences at scale. He is passionate about software engineering and solution design, and you will catch his attention by using terms such as “Predictive Analytics,” “Scalability,” “Concurrency,” “Design Patterns” and “<Choose any Letter>aaS.” You can follow him on Twitter: @giventocode and check out his blog at giventocode.com.

Thanks to the following Microsoft technical experts for reviewing this article: Rob Bagby, Mike Lanzetta and Matthew Snider
Rob Bagby is a Senior Cloud Architect for Microsoft. In this role, Rob works with influential ISVs, helping them implement their software in Azure. Prior to this role, Rob worked as a consultant building loosely coupled, manageable, scalable systems.

Mike Lanzetta is a developer in the Partner Catalyst team at Microsoft, with a focus on Machine Learning, Big Data and Systems programming. He has an MS in CSE from University of Washington, and has been in the industry for over 20 years at companies ranging from startups to Amazon and Microsoft.

Matt Snider joined Microsoft in 2008, working on a small part of .NET. After .NET 4 shipped, he joined the Service Fabric team as the first technical PM, working on all the different feature areas, and has been with the team since then. These days he primarily works on the Cluster Resource Manager (Orchestrator), the Reliable Collections, and the Failover/Replication portions of the stack. When not working on distributed systems, he enjoys beer, hiking, and music.