Training one agent among many agents#

Scenario#

This is the same scenario we met in the first tutorial

[1]:
import warnings
from navground import sim

warnings.filterwarnings('ignore')
%config InlineBackend.figure_formats = ['svg']

scenario = sim.load_scenario("""
type: Cross
agent_margin: 0.1
side: 4
target_margin: 0.1
tolerance: 0.5
groups:
  -
    type: thymio
    number: 20
    radius: 0.1
    control_period: 0.1
    speed_tolerance: 0.02
    color: gray
    kinematics:
      type: 2WDiff
      wheel_axis: 0.094
      max_speed: 0.12
    behavior:
      type: HL
      optimal_speed: 0.12
      horizon: 5.0
      tau: 0.25
      eta: 0.5
      safety_margin: 0.05
    state_estimation:
      type: Bounded
      range: 5.0
""")

where 20 agents move back and forth between waypoints, crossing in the middle. Initially they use the HL navigation behavior; later, they will use the policies we are going to learn.

[2]:
from navground.sim.ui.video import display_video

world = scenario.make_world()
display_video(world, time_step=0.1, duration=60, width=600,
              factor=10, display_width=300)
[2]:

Environment#

We also use the same sensor configured in the first tutorial

[3]:
sensor = sim.load_state_estimation("""
type: Discs
number: 5
range: 5.0
max_speed: 0.12
max_radius: 0
""")

Let us create an environment where we control one of the 20 agents, and save it for later use.

[4]:
import gymnasium as gym
from navground import sim

from navground.learning.rewards import SocialReward
from navground.learning import ControlActionConfig, DefaultObservationConfig, io

with open('scenario.yaml') as f:
    scenario = sim.load_scenario(f.read())

with open('sensor.yaml') as f:
    sensor = sim.load_state_estimation(f.read())

action_config = ControlActionConfig(
    max_acceleration=1.0, max_angular_acceleration=10.0,
    use_acceleration_action=True)

observation_config = DefaultObservationConfig(
    include_target_distance=True, include_target_direction=True,
    include_velocity=True, include_angular_speed=True, flat=True)

reward = SocialReward()

duration = 60.0
time_step = 0.1
env = gym.make('navground',
    scenario=scenario,
    sensor=sensor,
    action=action_config,
    observation=observation_config,
    reward=reward,
    time_step=time_step,
    max_duration=duration);

io.save_env(env, 'env.yaml')
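A quick sanity check with the standard Gymnasium API confirms that the environment behaves as configured; a minimal sketch (with flat=True we expect a single flat observation vector):

obs, info = env.reset(seed=0)
print(env.observation_space)  # expected: a flat Box (flat=True)
print(env.action_space)       # expected: a Box of accelerations (use_acceleration_action=True)
print(obs.shape)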

Human-like behavior#

In the first tutorial, we already measured the performance of the HL navigation algorithm in this scenario.

We redo the computation in a different way to estimate the reward over one episode (600 steps):

[15]:
from navground.learning.evaluation import evaluate

mean = {}
stddev = {}
mean['HL'], stddev['HL'] = evaluate(env, n_eval_episodes=100, indices={0})
print(f"HL reward: {mean['HL'] / 600: .2f} ± {stddev['HL'] / 600: .2f}")
HL reward: -0.22 ±  0.04
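Dividing by 600 converts the episodic return into a per-step reward, since each episode lasts 60 s at a time step of 0.1 s:

steps_per_episode = int(duration / time_step)  # 60.0 / 0.1 -> 600
print(steps_per_episode)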

Training#

The environment is ready, the expectations are set … we are ready to train.

If you have installed TensorBoard, you can visualize the logs during training.

[6]:
# %load_ext tensorboard
[7]:
# %tensorboard --logdir logs

We record the current time stamp to avoid overwriting the logs

[8]:
from datetime import datetime as dt

stamp = dt.now().strftime("%Y%m%d_%H%M%S")

Behavior Cloning#

We start with the simplest Imitation Learning algorithm, Behavior Cloning, which learns to imitate the HL “expert”. For all three algorithms, we use the same neural network model with two layers of 256 neurons each.

[10]:
from navground.learning.il import BC, make_vec_from_env, setup_tqdm
import numpy as np
from imitation.util import logger

setup_tqdm()

test_venv = make_vec_from_env(env, num_envs=8, rng=np.random.default_rng(101))
training_venv = make_vec_from_env(env, num_envs=8, rng=np.random.default_rng(0))
bc = BC(training_venv, policy_kwargs=dict(net_arch=[256, 256]),
        bc_kwargs={'l2_weight': 0, 'ent_weight': 1e-2, 'batch_size': 128})
bc.logger = logger.configure(f"logs/BC/{stamp}", ['tensorboard', 'csv'])
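For intuition, the net_arch=[256, 256] argument above corresponds roughly to a multi-layer perceptron with two hidden layers of 256 units, as in the sketch below (the policy classes built by imitation/stable-baselines3 add their own heads and activations, so this is only an approximation):

import torch.nn as nn

obs_dim = env.observation_space.shape[0]   # flat observation vector
act_dim = env.action_space.shape[0]        # acceleration commands

# rough structural equivalent of net_arch=[256, 256]
mlp = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)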

Let us check that the untrained policy performs like the random models we tried in the first notebook (i.e. with an average reward of about -1):

[11]:
from stable_baselines3.common.evaluation import evaluate_policy

m, s = evaluate_policy(bc.policy, env, n_eval_episodes=100)
print(f"Untrained: {m / 600: .2f} ± {s / 600: .2f}")
Untrained: -1.00 ±  0.02
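The same order of magnitude can be reproduced by sampling uniform random actions for one episode; a quick sketch (not exactly the random models of the first notebook, but close in spirit):

obs, info = env.reset(seed=1)
total_reward = 0.0
done = False
while not done:
    # uniform random actions in place of a policy
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total_reward += reward
    done = terminated or truncated
print(f"Random actions: {total_reward / 600: .2f}")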

Let us train the policy for a while (2000 runs correspond to about 30 hours of simulated time)

[12]:
import time

start = time.time()
bc.collect_runs(2000)
print(f'Collecting runs took {time.time() - start: .0f} seconds')

start = time.time()
bc.learn(
    log_rollouts_venv=test_venv,
    log_rollouts_n_episodes=10,
    log_interval=1000,
    n_epochs=1,
    progress_bar=True
)
print(f'Training took {time.time() - start: .0f} seconds')
bc.save("BC/model")
Collecting runs took  194 seconds
Training took  35 seconds

About 400K steps (roughly 11 hours of simulated time) are enough to learn a policy, as the plot below shows.

[13]:
import pandas as pd

df = pd.read_csv(f'{bc.logger.get_dir()}/progress.csv')
df.plot(y='rollout/return_mean', x='bc/samples_so_far', figsize=(8, 3), marker=".");
[figure: BC mean rollout return vs. training samples]
[16]:
mean['BC'], stddev['BC'] = evaluate_policy(bc.policy, env, n_eval_episodes=100)
print(f"BC: {mean['BC'] / 600: .2f} ± {stddev['BC'] / 600: .2f}")
BC: -0.29 ±  0.05

with a performance comparable to HL's.

Next, we try another Imitation Learning algorithm.

DAgger#

[17]:
from navground.learning.il import DAgger

bc_kwargs = {'l2_weight': 1e-6, 'ent_weight': 1e-2, 'batch_size': 128}
dagger = DAgger(env, policy_kwargs=dict(net_arch=[256, 256]), bc_kwargs=bc_kwargs)
dagger.logger = logger.configure(f"logs/DAgger/{stamp}", ['tensorboard', 'csv'])
[18]:
import time

start = time.time()
dagger.learn(
    total_timesteps=100_000,
    rollout_round_min_episodes=5,
    bc_train_kwargs={
        'log_rollouts_venv': test_venv,
        'log_rollouts_n_episodes': 10,
        'log_interval': 1000,
        'n_epochs': 1,
        'progress_bar': False,
    },
    progress_bar=True
)
print(f'Training took {time.time() - start: .0f} seconds')
dagger.save("DAgger/model.zip")
Training took  125 seconds

About 50K steps are enough to learn the policy, significantly fewer than BC required.

[19]:
import pandas as pd

df = pd.read_csv(f'{dagger.logger.get_dir()}/progress.csv')
df.plot(y='rollout/return_mean', x='dagger/total_timesteps', figsize=(8, 3), marker=".");
[figure: DAgger mean rollout return vs. total timesteps]
[20]:
mean['DAgger'], stddev['DAgger'] = evaluate_policy(dagger.policy, env, n_eval_episodes=100)
print(f"DAgger: {mean['DAgger'] / 600: .2f} ± {stddev['DAgger'] / 600: .2f}")
DAgger: -0.35 ±  0.05

We get a performance similar to BC's.

Finally, we switch to a Reinforcement Learning algorithm.

SAC#

[21]:
from stable_baselines3 import SAC
from stable_baselines3.common.logger import configure

sac = SAC("MlpPolicy", env, verbose=0, policy_kwargs=dict(net_arch=[256, 256]))
sac.set_logger(configure(f'logs/SAC/{stamp}', ["csv", "tensorboard"]))
[22]:
start = time.time()
sac.learn(total_timesteps=100_000, progress_bar=True, tb_log_name="SAC");
print(f'Training took {time.time() - start: .0f} seconds')
sac.save("SAC/model.zip")
Training took  499 seconds

It takes a similar number of steps as DAgger, although it requires significantly more wall-clock time:

[23]:
df = pd.read_csv(f'{sac.logger.get_dir()}/progress.csv')
df.plot(y='rollout/ep_rew_mean', x='time/total_timesteps', figsize=(8, 3), marker=".");
[figure: SAC mean episode reward vs. total timesteps]
[24]:
mean['SAC'], stddev['SAC'] = evaluate_policy(sac.policy, env, n_eval_episodes=100)
print(f"SAC: {mean['SAC'] / 600: .2f} ± {stddev['SAC'] / 600: .2f}")
SAC: -0.17 ±  0.08
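The saved model can be reloaded later with the standard stable-baselines3 API, for example to repeat the evaluation:

from stable_baselines3 import SAC

# reload the model saved above and evaluate it again
sac_reloaded = SAC.load("SAC/model.zip", env=env)
m, s = evaluate_policy(sac_reloaded.policy, env, n_eval_episodes=10)
print(f"Reloaded SAC: {m / 600: .2f} ± {s / 600: .2f}")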

Summary#

We successfully trained three policies with three different algorithms, achieving navigation performance similar to or even better than HL's.

[63]:
import pandas as pd

pd.set_option("display.precision", 3)
rewards = pd.DataFrame({"mean": mean, "std dev": stddev})
rewards.index = rewards.index.set_names(['algorithm'])
rewards /= 600
rewards.to_csv("rewards_sa.csv")
rewards
[63]:
algorithm     mean    std dev
HL          -0.224      0.044
BC          -0.293      0.054
DAgger      -0.347      0.054
SAC         -0.169      0.077

In the plot below, we rescale the reward so that 1 is the theoretical upper bound and 0 is the reward of a random policy.

[65]:
from matplotlib import pyplot as plt

(1 + rewards['mean']).plot.bar(figsize=(8, 2));
plt.ylim(0, 1);
[figure: rescaled mean reward per algorithm]

In the next notebook, we will take a deeper look at how the policies impact the rest of the multi-agent system, i.e., the other 19 agents.