# Deep Reinforcement Learning with TensorFlow 2.0

In this tutorial I will showcase the upcoming TensorFlow 2.0 features through the lens of deep reinforcement learning (DRL) by implementing an advantage actor-critic (A2C) agent to solve the classic CartPole-v0 environment. While the goal is to showcase TensorFlow 2.0, I will do my best to make the DRL aspect approachable as well, including a brief overview of the field.

In fact, since the main focus of the 2.0 release is making developers’ lives easier, it’s a great time to get into DRL with TensorFlow - our full agent source is under 150 lines! Code is available as a notebook here and online on Google Colab here.

## Setup

As TensorFlow 2.0 is still in an experimental stage, I recommend installing it in a separate (virtual) environment. I prefer Anaconda, so I’ll illustrate with it:

> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview # tf-nightly-gpu-2.0-preview for GPU version


Let’s quickly verify that everything works as expected:

>>> import tensorflow as tf
>>> print(tf.__version__)
2.0.0-dev20190129
>>> print(tf.executing_eagerly())
True


Note that we’re now in eager mode by default!

>>> print("1 + 2 + 3 + 4 + 5 =", tf.reduce_sum([1, 2, 3, 4, 5]))
1 + 2 + 3 + 4 + 5 = tf.Tensor(15, shape=(), dtype=int32)


If you’re not yet familiar with eager mode, then in essence it means that computation is executed at runtime, rather than through a pre-compiled graph. You can find a good overview in the TensorFlow documentation.

## Deep Reinforcement Learning

Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, receiving rewards as a result. Most RL algorithms work by maximizing the sum of rewards an agent collects in a trajectory, e.g. during one in-game round.
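
To make the observe-act-reward cycle concrete, here is a minimal sketch in plain Python. The `CoinFlipEnv` class, its three-value `step()` return, and the `run_episode` helper are hypothetical names made up for illustration; they are not part of gym or TensorFlow.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # a single dummy observation

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        return 0, reward, done

def run_episode(env, policy):
    # one trajectory: observe, act, collect reward, repeat until done
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward

toy_env = CoinFlipEnv()
# a trivial policy that ignores the observation and always guesses 1
episode_return = run_episode(toy_env, policy=lambda obs: 1)
print(episode_return)
```

The number printed is the sum of rewards for one trajectory, which is the quantity most RL algorithms try to maximize.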

The output of an RL-based algorithm is typically a policy - a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action. A stochastic policy is represented as a conditional probability distribution over actions, given some state.

#### Actor-Critic Methods

RL algorithms are often grouped based on the objective function they are optimized with. Value-based methods, such as DQN, work by reducing the error of the expected state-action values.

Policy gradient methods directly optimize the policy itself by adjusting its parameters, typically via gradient descent. Calculating gradients fully is usually intractable, so instead they are often estimated via Monte Carlo methods.

The most popular approach is a hybrid of the two: actor-critic methods, where the agent’s policy is optimized through policy gradients, while a value-based method is used as a bootstrap for the expected value estimates.
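
As a rough illustration of estimating a policy gradient by Monte Carlo sampling, the sketch below uses a softmax policy on a two-armed bandit. The bandit, its rewards, and the function names are assumptions chosen for the example; the only fact it relies on is that the gradient of log pi(a) with respect to the logits of a softmax policy is one_hot(a) - probs.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_policy_gradient(logits, arm_rewards, num_samples=1000, seed=0):
    """Monte Carlo estimate of the gradient of expected reward w.r.t. the logits."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = arm_rewards[action]
        # gradient of log pi(action) w.r.t. logits is one_hot(action) - probs
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            grad[i] += reward * grad_log_prob / num_samples
    return grad

# hypothetical two-armed bandit: arm 1 always pays 1, arm 0 pays nothing
grad = estimate_policy_gradient(logits=[0.0, 0.0], arm_rewards=[0.0, 1.0])
print(grad)
```

Nudging the logits along this estimate raises the probability of the rewarding arm. Minimizing the negated objective with gradient descent has the same effect.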

#### Deep Actor-Critic Methods

While much of the fundamental RL theory was developed on the tabular cases, modern RL is almost exclusively done with function approximators, such as artificial neural networks. Specifically, an RL algorithm is considered “deep” if the policy and value functions are approximated with deep neural networks. Over the years, a number of improvements have been added to address sample efficiency and stability of the learning process.

First, gradients are weighted with returns: discounted future rewards, which somewhat alleviates the credit assignment problem, and resolves theoretical issues with infinite timesteps.

Second, an advantage function is used instead of raw returns. The advantage is formed as the difference between returns and some baseline (e.g. a state value estimate) and can be thought of as a measure of how good a given action is compared to some average.

Third, an additional entropy maximization term is used in the objective function to ensure the agent sufficiently explores various policies. In essence, entropy measures how random a probability distribution is; it is maximized by the uniform distribution.

Finally, multiple workers are used in parallel to speed up sample gathering while helping decorrelate them during training.
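
The first three ingredients above are small enough to sketch in plain Python. The function names and the numbers below are made up for illustration, and the baseline is simply assumed to be a constant value estimate of 1.5 for every step.

```python
import math

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards through the trajectory
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(returns, baselines):
    # advantage = return - baseline
    return [g - b for g, b in zip(returns, baselines)]

def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0.0)

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
advs = advantages(returns, baselines=[1.5, 1.5, 1.5])
print(returns)  # [1.75, 1.5, 1.0]
print(advs)     # [0.25, 0.0, -0.5]
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))
```

Positive advantages mark actions that did better than the baseline expected, and the uniform distribution yields the larger of the two entropy values.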

Incorporating all of these changes with deep neural networks, we arrive at two of the most popular modern algorithms: (asynchronous) advantage actor-critic, or A3C/A2C for short. The difference between the two is more technical than theoretical: as the name suggests, it boils down to how the parallel workers estimate their gradients and propagate them to the model.

With this I will wrap up our tour of DRL methods, as the focus of the blog post is more on the TensorFlow 2.0 features. Don’t worry if you’re still unsure about the subject; things should become clearer with code examples. If you want to learn more, one good resource to get started is Spinning Up in Deep RL.

## Advantage Actor-Critic with TensorFlow 2.0

Now that we’re more or less on the same page, let’s see what it takes to implement the basis of many modern DRL algorithms: an actor-critic agent, described in the previous section. For simplicity, we won’t implement parallel workers, though most of the code will have support for it. An interested reader could then use this as an exercise opportunity.

As a testbed we will use the CartPole-v0 environment. Somewhat simplistic, it’s still a great option to get started with. I always rely on it as a sanity check when implementing RL algorithms.

#### Policy & Value via Keras Model API

First, let’s create the policy and value estimate NNs under a single model class:

import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl

class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)

class Model(tf.keras.Model):
    def __init__(self, num_actions):
        # pass the model name by keyword
        super().__init__(name='mlp_policy')
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation='relu')
        self.hidden2 = kl.Dense(128, activation='relu')
        self.value = kl.Dense(1, name='value')
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name='policy_logits')
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)


And let’s verify the model works as expected:

import gym

env = gym.make('CartPole-v0')
model = Model(num_actions=env.action_space.n)

obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value) #  [-0.00145713]


Things to note here:

• Model layers and execution path are defined separately
• There is no “input” layer, model will accept raw numpy arrays
• Two computation paths can be defined in one model via functional API
• A model can contain helper methods such as action sampling
• In eager mode everything works from raw numpy arrays
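
For readers unsure what sampling an action from logits involves, here is a rough plain-Python equivalent of that helper: normalize the logits with a softmax, then draw an index with those probabilities. This is a conceptual sketch under that assumption, not how TensorFlow implements `tf.random.categorical` internally, and the logits below are made-up numbers.

```python
import math
import random

def softmax(logits):
    # logits are unnormalized log probabilities; softmax normalizes them
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs)[0]

rng = random.Random(0)
logits = [0.2, -0.1]  # hypothetical policy output for a two-action environment
probs = softmax(logits)
actions = [sample_action(logits, rng) for _ in range(5)]
print(probs, actions)
```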

#### Random Agent

Now we can move on to the fun stuff - the A2CAgent class. First, let’s add a test method that runs through a full episode and returns the sum of rewards.

class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward


Let’s see how much our model scores with randomly initialized weights:

agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum) # 18 out of 200


Not even close to optimal, time to get to the training part!

#### Loss / Objective Function

As I’ve described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function. In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients, maximizing entropy, and minimizing value estimate errors.

import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(learning_rate=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params['value']*kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # sparse categorical CE loss obj that supports sample_weight arg on call()
        # from_logits argument ensures transformation into normalized probabilities
        weighted_sparse_ce = kls.SparseCategoricalCrossentropy(from_logits=True)
        # policy loss is the CE of the taken actions, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        actions = tf.cast(actions, tf.int32)
        policy_loss = weighted_sparse_ce(actions, logits, sample_weight=advantages)
        # entropy loss can be calculated via CE over itself
        # note: CE(p, p) equals entropy only for normalized probabilities, so apply softmax first
        probs = tf.nn.softmax(logits)
        entropy_loss = kls.categorical_crossentropy(probs, probs)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params['entropy']*entropy_loss


And we’re done with the objective functions! Note how compact the code is: there are almost more comment lines than code itself.
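
To see how the three terms combine numerically, here is a single-sample sketch in plain Python. The `a2c_loss` function and its arguments are hypothetical names; the coefficients mirror the `value` and `entropy` entries of `params`. It relies on the fact that a sparse cross-entropy weighted by the advantage is `-advantage * log pi(action)` for one sample.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_total = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_total for l in logits]

def a2c_loss(logits, action, advantage, value, ret,
             value_coef=0.5, entropy_coef=0.0001):
    log_probs = log_softmax(logits)
    # policy term: cross-entropy of the taken action, weighted by its advantage
    policy_loss = -advantage * log_probs[action]
    # entropy of the policy distribution (higher means more exploration)
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)
    # value term: squared error between the return and the value estimate
    value_loss = (ret - value) ** 2
    # entropy is subtracted because the optimizer minimizes the total
    return policy_loss - entropy_coef * entropy + value_coef * value_loss

loss = a2c_loss(logits=[0.0, 0.0], action=0, advantage=1.0, value=0.0, ret=1.0)
print(loss)
```

With uniform logits both the policy term and the entropy equal ln 2, so the total is ln 2 * (1 - 0.0001) + 0.5.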

#### Agent Training Loop

Finally, there’s the train loop itself. It’s relatively long, but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
        # unchanged from previous section
        ...

    # note: the default batch_sz and updates values are illustrative, tune as needed
    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])

                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()

            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params['gamma'] * returns[t+1] * (1-dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
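
The backwards recursion in `_returns_advantages` is easier to follow with concrete numbers. Below is a plain-Python trace of the same recursion on a made-up three-step batch where the episode ends at the second step; the function name and the inputs are illustrative assumptions.

```python
def bootstrapped_returns(rewards, dones, next_value, gamma):
    # the extra last slot holds the critic's estimate for the state after the batch
    returns = [0.0] * len(rewards) + [next_value]
    for t in reversed(range(len(rewards))):
        # (1 - dones[t]) cuts the recursion at episode boundaries
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    return returns[:-1]

returns = bootstrapped_returns(
    rewards=[1.0, 1.0, 1.0], dones=[0, 1, 0], next_value=10.0, gamma=0.5)
print(returns)  # [1.5, 1.0, 6.0]
```

The last step bootstraps from the value estimate (1 + 0.5 * 10), while the step that ended an episode keeps only its own reward.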


#### Training & Results

We’re now all set to train our single-worker A2C agent on CartPole-v0! The training process shouldn’t take longer than a couple of minutes. After training is complete you should see the agent successfully achieve the target score of 200 out of 200.

rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env)) # 200 out of 200

In the source code I include some additional helpers that print out running episode rewards and losses, along with a basic plotter for the rewards_history.

## Static Computational Graph

With all of this eager mode excitement you might wonder if static graph execution is even possible anymore. Of course it is! Moreover, it takes just one additional line to enable it!

with tf.Graph().as_default():
    print(tf.executing_eagerly()) # False

    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)

    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env)) # 200 out of 200


There’s one caveat: during static graph execution we can’t just have Tensors lying around, which is why we needed that trick with the ProbabilityDistribution class during model definition. In fact, while I was looking for a way to execute in static mode, I discovered one interesting low-level detail about models built through the Keras API…

## One More Thing…

Remember when I said TensorFlow runs in eager mode by default, even proving it with a code snippet? Well, I lied! Kind of.

If you use the Keras API to build and manage your models, then it will attempt to compile them as static graphs under the hood. So what you end up getting is the performance of static computational graphs with the flexibility of eager execution.

You can check the status of your model via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time you probably don’t need to - if Keras detects that there’s no way around eager mode, it will back off on its own.

To illustrate that it’s indeed running as a static graph here’s a simple benchmark:

# create a 100000 samples batch
env = gym.make('CartPole-v0')
obs = np.repeat(env.reset()[None, :], 100000, axis=0)


#### Eager Benchmark

%%time

model = Model(env.action_space.n)
model.run_eagerly = True

print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model(obs)

######## Results #######

Eager Execution:   True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s


#### Static Benchmark

%%time

with tf.Graph().as_default():
    model = Model(env.action_space.n)

    print("Eager Execution:  ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)

    _ = model.predict(obs)

######## Results #######

Eager Execution:   False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms


#### Default Benchmark

%%time

model = Model(env.action_space.n)

print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model.predict(obs)

######## Results #######

Eager Execution:   True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s


As you can see, eager mode (1.38 s total CPU time) is behind static mode (873 ms), and by default our model was indeed executing statically (1.02 s), more or less matching explicit static graph execution.
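
The `%%time` magic only works inside a notebook. If you want to repeat a comparison like this from a plain script, one option is a small helper built on `time.perf_counter`. The `time_call` name is hypothetical, and note that it measures wall-clock time rather than the CPU times reported above.

```python
import time

def time_call(fn, *args, repeats=3):
    """Call fn(*args) several times; return the last result and the best wall-clock time."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# stand-in workload; in the benchmarks above it would be model.predict with obs
result, seconds = time_call(sum, range(1_000_000))
print(result, seconds)
```

Taking the best of a few repeats reduces noise from one-off costs such as graph tracing on the first call.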

## Conclusion

Hopefully this has been an illustrative tour of both DRL and the things to come in TensorFlow 2.0. Note that this is still just a nightly preview build, not even a release candidate. Everything is subject to change, and if there’s something about TensorFlow you especially dislike (or like :)), let the developers know!

A lingering question people might have is whether TensorFlow is better than PyTorch. Maybe. Maybe not. Both are great libraries, so it is hard to say one way or the other. If you’re familiar with PyTorch, you probably noticed that TensorFlow 2.0 not only caught up, but also avoided some of the PyTorch API pitfalls.

In either case what is clear is that this competition has resulted in a net-positive outcome for both camps and I am excited to see what will become of the frameworks in the future.