# Issue

This content is from Stack Overflow. Question asked by Stefano Testoni.

There is something about the workings of GradientTape that escapes my understanding.

Suppose we want to train an agent on the classic bandit problem using an actor-critic RL framework. There are two bandits, A and B, and the agent must learn to select A, which yields higher returns on average. The training consists of, say, 1000 epochs, in each of which the agent draws, say, 100 samples from each bandit. The reward is 1 every time the agent selects A, and 0 otherwise.

Let’s see how the agent learns by observing rewards over 10 training simulations. Here is the code defining the agent and the environment (neither needs to be more complicated than below).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import Model

n_sims = 10  # number of simulations
for n in range(n_sims):
    # define actors and optimizers for each simulation
    actor_input = Input(shape=(2,))
    actor_output = Dense(2, activation='softmax')(actor_input)
    vars()[f'actor_{n}'] = Model(inputs=actor_input, outputs=actor_output)

    # define critics and optimizers for each simulation
    critic_input = Input(shape=(2,))
    critic_output = Dense(1, activation='softmax')(critic_input)
    vars()[f'critic_{n}'] = Model(inputs=critic_input, outputs=critic_output)

    vars()[f'mean_rewards_{n}'] = []  # list to store rewards over training epochs for each simulation

A = np.random.normal(loc=10, scale=15, size=int(1e5))  # bandit A
B = np.random.normal(loc=0, scale=1, size=int(1e5))  # bandit B
n_training_epochs = 1000
n_samples = 100
```
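One detail of the critic definition is worth checking in isolation: it applies a softmax to a single output unit. The sketch below uses a plain NumPy softmax (the helper `softmax` is written here for illustration and is not part of the posted code) to show what that activation returns when there is only one logit.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Three inputs, each mapped to a single logit, as a Dense(1) layer would produce.
single_logit = np.array([[-3.2], [0.0], [7.5]])
critic_like_output = softmax(single_logit)
print(critic_like_output.ravel())  # every entry is 1.0, whatever the logit
```

With one unit the softmax normalizes the logit against itself, so the output is constant and carries no gradient with respect to the layer weights. Value heads are often given a linear activation for this reason, although whether that matters for the behaviour described in this question is not established here.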

Let’s consider two alternative implementations of the training loop using GradientTape, both based on a simple ‘vanilla’ loss function.

The first is the slow one and literally involves a for loop over the samples drawn in each epoch. The actor and critic losses are accumulated sample by sample, and their means are then used to update the respective network weights.

```python
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        cum_actor_loss, cum_critic_loss, cum_reward = 0, 0, 0
        for A_sample, B_sample in zip(A_samples, B_samples):
            # a single observation, shaped (1, 2)
            state = tf.reshape([A_sample, B_sample], (1, -1))
            probs = globals()[f'actor_{n}'](state)
            action = np.random.choice(['A', 'B'], p=np.squeeze(probs))
            reward = 1 if action == 'A' else 0
            cum_reward += reward
            # probs has shape (1, 2), so index the batch row first
            action_prob = probs[0, ['A', 'B'].index(action)]
            value = globals()[f'critic_{n}'](state)
        mean_actor_loss = cum_actor_loss / n_samples
        mean_critic_loss = cum_critic_loss / n_samples
        globals()[f'mean_rewards_{n}'].append(cum_reward / n_samples)
```
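The loop as posted does not show where the tape is opened or how the two losses are built up. The following is a minimal sketch of one common way to do it, not the author's actual code. It assumes an advantage-weighted log-probability loss for the actor and a squared-error loss for the critic (the question only says "vanilla" loss), a linear critic output, and Adam optimizers. The names `actor`, `critic`, `actor_opt`, `critic_opt` and `train_epoch_per_sample` are hypothetical.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
rng = np.random.default_rng(0)

# Hypothetical stand-ins for one simulation's networks and optimizers.
actor = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
critic = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(1),  # linear value head (assumption)
])
actor_opt = tf.keras.optimizers.Adam(1e-2)   # assumed optimizer and learning rate
critic_opt = tf.keras.optimizers.Adam(1e-2)

def train_epoch_per_sample(states):
    """Accumulate per-sample losses under one tape, then apply a single update."""
    n = len(states)
    cum_reward = 0.0
    with tf.GradientTape(persistent=True) as tape:
        cum_actor_loss, cum_critic_loss = 0.0, 0.0
        for state in states:
            x = tf.reshape(tf.constant(state, dtype=tf.float32), (1, -1))
            probs = actor(x)  # shape (1, 2)
            p = probs.numpy()[0].astype(np.float64)
            action = int(rng.choice(2, p=p / p.sum()))  # 0 = bandit A, 1 = bandit B
            reward = 1.0 if action == 0 else 0.0
            cum_reward += reward
            value = critic(x)[0, 0]
            # The advantage is treated as a constant in the actor loss.
            advantage = tf.stop_gradient(reward - value)
            cum_actor_loss += -tf.math.log(probs[0, action]) * advantage
            cum_critic_loss += tf.square(reward - value)
        mean_actor_loss = cum_actor_loss / n
        mean_critic_loss = cum_critic_loss / n
    actor_grads = tape.gradient(mean_actor_loss, actor.trainable_variables)
    critic_grads = tape.gradient(mean_critic_loss, critic.trainable_variables)
    del tape  # release the persistent tape
    actor_opt.apply_gradients(zip(actor_grads, actor.trainable_variables))
    critic_opt.apply_gradients(zip(critic_grads, critic.trainable_variables))
    return cum_reward / n
```

In this arrangement the tape records every forward pass inside the inner loop, but the gradient is taken once, from the mean loss, so there is still exactly one weight update per epoch.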

If you plot the average training rewards over each epoch, you will probably get something like this figure.

In the second option, instead of using an explicit for loop over samples in each epoch, we perform operations on arrays. This alternative is much faster in terms of computation time.

```python
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    # all observations of the epoch, shaped (n_samples, 2)
    states = tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples, -1))
    for n in range(n_sims):
        probs = globals()[f'actor_{n}'](states)
        actions = np.array([np.random.choice(['A', 'B'], p=np.squeeze(probs[i])) for i in range(len(probs))]).reshape(n_samples, -1)
        rewards = np.array([1.0 if action == 'A' else 0.0 for action in actions]).reshape(n_samples, -1)
        globals()[f'mean_rewards_{n}'].append(np.mean(rewards))
        values = globals()[f'critic_{n}'](states)
        actions_num = [['A', 'B'].index(action) for action in actions]
        action_probs = tf.reduce_sum(tf.one_hot(actions_num, len(['A', 'B'])) * probs, axis=1)
```
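One thing to check in the vectorized version is the shapes of the arrays that enter the loss. In the code above, `rewards` and `values` have shape `(n_samples, 1)`, while `action_probs`, being the result of `tf.reduce_sum(..., axis=1)`, has shape `(n_samples,)`. The loss itself is not shown, so whether these are ever multiplied together directly is an assumption. The NumPy sketch below, with made-up numbers, shows what such a product would do. TensorFlow elementwise operations follow the same broadcasting rules.

```python
import numpy as np

n_samples = 4
# Shape (4, 1), built the same way as `rewards` in the loop above.
rewards = np.array([1.0, 0.0, 1.0, 1.0]).reshape(n_samples, -1)
# Shape (4,), like the output of reduce_sum(..., axis=1). Values are illustrative.
action_probs = np.array([0.6, 0.4, 0.7, 0.5])

# (4, 1) * (4,) broadcasts to (4, 4): every reward is paired with every probability.
broadcast_product = rewards * action_probs
# Dropping the trailing axis keeps one reward per sample: shape (4,).
aligned_product = rewards[:, 0] * action_probs

print(broadcast_product.shape, aligned_product.shape)
print(broadcast_product.mean(), aligned_product.mean())
```

If a loss were averaged over the broadcast matrix, each action's probability would be weighted by the rewards of all samples rather than its own, which is not the same quantity as the per-sample loop computes.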

Plotting the average reward over epochs gives something like this.

As you can see, the agent tends to learn earlier and more stably in the first case than in the second (where learning may not even happen), although the two training loops are, in theory, mathematically equivalent. Why is that? The reason probably has something to do with the fact that, in the first option, GradientTape is watching the trainable variables several times per epoch before applying the gradient, whereas in the second option it does so only once. Even so, I can’t figure out exactly why this produces the observed results. Can you help me understand?
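A direct way to test the equivalence claim is to fix the sampled actions and compare the gradients that each formulation produces for the same weights. The sketch below does this for the actor only. It assumes a reward-weighted log-probability loss and uses a hypothetical stand-in network `actor` with random inputs, so it checks the principle rather than reproducing the posted training loops.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
rng = np.random.default_rng(0)

# Hypothetical stand-in for one actor network.
actor = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])

states = tf.constant(rng.normal(size=(8, 2)).astype(np.float32))
actions = rng.integers(0, 2, size=8)  # fixed actions: 0 = A, 1 = B
rewards = tf.constant((1.0 - actions).astype(np.float32))  # 1 for A, 0 for B

def looped_gradients():
    """Mean of per-sample losses, built up in a Python loop under one tape."""
    with tf.GradientTape() as tape:
        total = 0.0
        for i in range(len(actions)):
            probs = actor(states[i:i + 1])  # shape (1, 2)
            total += -tf.math.log(probs[0, int(actions[i])]) * rewards[i]
        loss = total / len(actions)
    return tape.gradient(loss, actor.trainable_variables)

def batched_gradients():
    """The same loss computed on the whole batch at once."""
    with tf.GradientTape() as tape:
        probs = actor(states)  # shape (8, 2)
        action_probs = tf.reduce_sum(tf.one_hot(actions, 2) * probs, axis=1)  # shape (8,)
        loss = tf.reduce_mean(-tf.math.log(action_probs) * rewards)  # both factors shape (8,)
    return tape.gradient(loss, actor.trainable_variables)

# Largest absolute difference between the two sets of gradients.
max_diff = max(
    float(tf.reduce_max(tf.abs(g_loop - g_batch)))
    for g_loop, g_batch in zip(looped_gradients(), batched_gradients())
)
print(max_diff)
```

If the two gradients agree up to floating-point error for fixed actions, the number of times the tape sees the variables is not what separates the two loops, and the difference is more likely to come from how the arrays are shaped or combined in the loss.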
