Training one agent among many agents#
Scenario#
This is the same scenario we met in the first tutorial:
[1]:
import warnings
from navground import sim
warnings.filterwarnings('ignore')
%config InlineBackend.figure_formats = ['svg']
scenario = sim.load_scenario("""
type: Cross
agent_margin: 0.1
side: 4
target_margin: 0.1
tolerance: 0.5
groups:
  -
    type: thymio
    number: 20
    radius: 0.1
    control_period: 0.1
    speed_tolerance: 0.02
    color: gray
    kinematics:
      type: 2WDiff
      wheel_axis: 0.094
      max_speed: 0.12
    behavior:
      type: HL
      optimal_speed: 0.12
      horizon: 5.0
      tau: 0.25
      eta: 0.5
      safety_margin: 0.05
    state_estimation:
      type: Bounded
      range: 5.0
""")
where 20 agents move back and forth between waypoints, crossing in the middle: they initially use the HL navigation behavior, but will later use the policies we are going to learn.
[2]:
from navground.sim.ui.video import display_video
world = scenario.make_world()
display_video(world, time_step=0.1, duration=60, width=600,
              factor=10, display_width=300)
[2]:
Environment#
We also use the same sensor configured in the first tutorial:
[3]:
sensor = sim.load_state_estimation("""
type: Discs
number: 5
range: 5.0
max_speed: 0.12
max_radius: 0
""")
Let us create an environment where we control one of the 20 agents, and save it for later use.
[4]:
import gymnasium as gym
from navground import sim
from navground.learning.rewards import SocialReward
from navground.learning import ControlActionConfig, DefaultObservationConfig, io
with open('scenario.yaml') as f:
    scenario = sim.load_scenario(f.read())
with open('sensor.yaml') as f:
    sensor = sim.load_state_estimation(f.read())
action_config = ControlActionConfig(
    max_acceleration=1.0, max_angular_acceleration=10.0,
    use_acceleration_action=True)
observation_config = DefaultObservationConfig(
    include_target_distance=True, include_target_direction=True,
    include_velocity=True, include_angular_speed=True, flat=True)
reward = SocialReward()
duration = 60.0
time_step = 0.1
env = gym.make('navground',
               scenario=scenario,
               sensor=sensor,
               action=action_config,
               observation=observation_config,
               reward=reward,
               time_step=time_step,
               max_duration=duration);
io.save_env(env, 'env.yaml')
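Before moving on, it can help to peek at the spaces the environment exposes; this is just a sanity-check sketch using the standard Gymnasium API:
# Inspect the (flattened) observation space and the acceleration action space
print(env.observation_space)
print(env.action_space)
# Reset once to get a sample observation
obs, info = env.reset(seed=0)
print(obs.shape)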
Human-like behavior#
In the first tutorial, we already measured the performance of the HL navigation algorithm in this scenario.
Here we redo the computation in a different way, estimating the reward over one episode (600 steps) and reporting it per step:
[15]:
from navground.learning.evaluation import evaluate
mean = {}
stddev = {}
mean['HL'], stddev['HL'] = evaluate(env, n_eval_episodes=100, indices={0})
print(f"HL reward: {mean['HL'] / 600: .2f} ± {stddev['HL'] / 600: .2f}")
HL reward: -0.22 ± 0.04
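The division by 600 converts the episode return into a per-step reward: one episode lasts 60 s simulated at a time step of 0.1 s. A small sketch, using the duration and time_step defined above:
# Per-step normalization used throughout this notebook
steps_per_episode = int(duration / time_step)  # 60.0 / 0.1 = 600
print(f"HL reward per step: {mean['HL'] / steps_per_episode: .2f}")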
Training#
The environment is ready, the expectations are set … we are ready to train.
If you have installed tensorboard, you can visualize the logs during training.
[6]:
# %load_ext tensorboard
[7]:
# %tensorboard --logdir logs
We record the current timestamp to avoid overwriting previous logs.
[8]:
from datetime import datetime as dt
stamp = dt.now().strftime("%Y%m%d_%H%M%S")
Behavior Cloning#
We start with the simplest Imitation Learning algorithm, Behavior Cloning, to learn to imitate the HL “expert”. For all three algorithms, we use the same neural network model with two layers of 256 neurons each.
[10]:
from navground.learning.il import BC, make_vec_from_env, setup_tqdm
import numpy as np
from imitation.util import logger
setup_tqdm()
test_venv = make_vec_from_env(env, num_envs=8, rng=np.random.default_rng(101))
training_venv = make_vec_from_env(env, num_envs=8, rng=np.random.default_rng(0))
bc = BC(training_venv, policy_kwargs=dict(net_arch=[256, 256]),
        bc_kwargs={'l2_weight': 0, 'ent_weight': 1e-2, 'batch_size': 128})
bc.logger = logger.configure(f"logs/BC/{stamp}", ['tensorboard', 'csv'])
Let us check that the untrained policy performs like the random models we tried in the first notebook (i.e. with an average reward of about -1):
[11]:
from stable_baselines3.common.evaluation import evaluate_policy
m, s = evaluate_policy(bc.policy, env, n_eval_episodes=100)
print(f"Untrained: {m / 600: .2f} ± {s / 600: .2f}")
Untrained: -1.00 ± 0.02
Let us train the policy for a while (2000 runs correspond to about 30 hours of simulated time).
[12]:
import time
start = time.time()
bc.collect_runs(2000)
print(f'Collecting runs took {time.time() - start: .0f} seconds')
start = time.time()
bc.learn(
    log_rollouts_venv=test_venv,
    log_rollouts_n_episodes=10,
    log_interval=1000,
    n_epochs=1,
    progress_bar=True
)
print(f'Training took {time.time() - start: .0f} seconds')
bc.save("BC/model")
Collecting runs took 194 seconds
Training took 35 seconds
About 400K steps (11 hours of simulated time) are enough to learn a policy:
[13]:
import pandas as pd
df = pd.read_csv(f'{bc.logger.get_dir()}/progress.csv')
df.plot(y='rollout/return_mean', x='bc/samples_so_far', figsize=(8, 3), marker=".");
[16]:
mean['BC'], stddev['BC'] = evaluate_policy(bc.policy, env, n_eval_episodes=100)
print(f"BC: {mean['BC'] / 600: .2f} ± {stddev['BC'] / 600: .2f}")
BC: -0.29 ± 0.05
The performance is comparable to HL's.
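Besides evaluate_policy, the learned policy can also be queried directly. Here is a minimal sketch using the standard Stable-Baselines3 predict interface (a deterministic action for a single observation):
# Sketch: query the trained BC policy for one observation
obs, _ = env.reset(seed=1)
action, _state = bc.policy.predict(obs, deterministic=True)
print(action)  # acceleration command, as configured by ControlActionConfig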
Next, we try another Imitation Learning algorithm.
DAgger#
[17]:
from navground.learning.il import DAgger
bc_kwargs = {'l2_weight': 1e-6, 'ent_weight': 1e-2, 'batch_size': 128}
dagger = DAgger(env, policy_kwargs=dict(net_arch=[256, 256]), bc_kwargs=bc_kwargs)
dagger.logger = logger.configure(f"logs/DAgger/{stamp}", ['tensorboard', 'csv'])
[18]:
import time
start = time.time()
dagger.learn(
    total_timesteps=100_000,
    rollout_round_min_episodes=5,
    bc_train_kwargs={
        'log_rollouts_venv': test_venv,
        'log_rollouts_n_episodes': 10,
        'log_interval': 1000,
        'n_epochs': 1,
        'progress_bar': False,
    },
    progress_bar=True
)
print(f'Training took {time.time() - start: .0f} seconds')
dagger.save("DAgger/model.zip")
Training took 125 seconds
About 50K steps are enough to learn the policy, significantly fewer than BC required.
[19]:
import pandas as pd
df = pd.read_csv(f'{dagger.logger.get_dir()}/progress.csv')
df.plot(y='rollout/return_mean', x='dagger/total_timesteps', figsize=(8, 3), marker=".");
[20]:
mean['DAgger'], stddev['DAgger'] = evaluate_policy(dagger.policy, env, n_eval_episodes=100)
print(f"DAgger: {mean['DAgger'] / 600: .2f} ± {stddev['DAgger'] / 600: .2f}")
DAgger: -0.35 ± 0.05
We obtain a performance similar to BC.
Finally, we switch to a Reinforcement Learning algorithm.
SAC#
[21]:
from stable_baselines3 import SAC
from stable_baselines3.common.logger import configure
sac = SAC("MlpPolicy", env, verbose=0, policy_kwargs=dict(net_arch=[256, 256]))
sac.set_logger(configure(f'logs/SAC/{stamp}', ["csv", "tensorboard"]))
[22]:
start = time.time()
sac.learn(total_timesteps=100_000, progress_bar=True, tb_log_name="SAC");
print(f'Training took {time.time() - start: .0f} seconds')
sac.save("SAC/model.zip")
Training took 499 seconds
It takes a similar number of steps as DAgger, although it requires significantly more wall-clock time:
[23]:
df = pd.read_csv(f'{sac.logger.get_dir()}/progress.csv')
df.plot(y='rollout/ep_rew_mean', x='time/total_timesteps', figsize=(8, 3), marker=".");
[24]:
mean['SAC'], stddev['SAC'] = evaluate_policy(sac.policy, env, n_eval_episodes=100)
print(f"SAC: {mean['SAC'] / 600: .2f} ± {stddev['SAC'] / 600: .2f}")
SAC: -0.17 ± 0.08
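To evaluate the model later without retraining, it can be reloaded with the standard Stable-Baselines3 API; a minimal sketch, with the path matching the save call above:
# Sketch: reload the trained SAC model from disk and re-attach the environment
sac_loaded = SAC.load("SAC/model.zip", env=env)
m, s = evaluate_policy(sac_loaded.policy, env, n_eval_episodes=10)
print(f"SAC (reloaded): {m / 600: .2f} ± {s / 600: .2f}")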
Summary#
We successfully trained three policies with three different algorithms, achieving navigation performance similar to, or even better than, HL.
[63]:
import pandas as pd
pd.set_option("display.precision", 3)
rewards = pd.DataFrame({"mean": mean, "std dev": stddev})
rewards.index = rewards.index.set_names(['algorithm'])
rewards /= 600
rewards.to_csv("rewards_sa.csv")
rewards
[63]:
| algorithm | mean | std dev |
|---|---|---|
| HL | -0.224 | 0.044 |
| BC | -0.293 | 0.054 |
| DAgger | -0.347 | 0.054 |
| SAC | -0.169 | 0.077 |
In the plot below, we rescale the reward so that 1 is the theoretical upper bound (a per-step reward of 0) and 0 is the reward of a random policy (about -1); the rescaled value is therefore simply 1 + mean.
[65]:
from matplotlib import pyplot as plt
(1 + rewards['mean']).plot.bar(figsize=(8, 2));
plt.ylim(0, 1);
In the next notebook, we will take a deeper look at how the policies impact the rest of the multi-agent system, i.e., the other 19 agents.