Performance of policies trained in the multi-agent environment#

In this notebook, we replicate the analysis for policies trained in the multi-agent environment.

[43]:
from navground.learning import io

ma_env = io.load_env('parallel_env.yaml')

Like before, we start by exporting the trained policies to ONNX.

[4]:
from navground.learning import onnx
from navground.learning.il import BC, DAgger
from stable_baselines3 import SAC

for algo in (BC, DAgger, SAC):
    path = f"P-{algo.__name__}"
    onnx.export(algo.load(f"{path}/model").policy, path=f"{path}/policy.onnx")
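
As an optional sanity check, we could open one of the exported files with onnxruntime (assuming it is installed) and inspect its input and output signatures; this is only a sketch and is not needed for the rest of the notebook.

import onnxruntime as ort

# load the exported P-SAC policy and print its input/output tensors
session = ort.InferenceSession("P-SAC/policy.onnx")
for tensor in session.get_inputs():
    print("input: ", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)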

We want to measure the rewards of the whole group when a number of agents apply one of the trained policies and the rest of the agents continue to follow HL, using the same helpers as before. We will skip P-BC as it is the same as BC.

[8]:
import numpy as np
from navground.learning.evaluation import make_experiment_with_env
from navground.learning import GroupConfig
import moviepy as mpy
from navground.sim.ui.video import make_video_from_run


def get_groups_rewards(policy, number=1, number_of_episodes=1, policy_color='', processes=1):
    policy_group = GroupConfig(indices=slice(None, number),
                               policy=policy,
                               color=policy_color)
    hl_group = GroupConfig(indices=slice(number, None))
    exp = make_experiment_with_env(ma_env, groups=[policy_group, hl_group])
    exp.number_of_runs = number_of_episodes
    if processes > 1:
        exp.run_mp(number_of_processes=processes, keep=True)
    else:
        exp.run()
    rewards = np.asarray([run.get_record("reward") for run in exp.runs.values()])
    policy_rewards = rewards[..., :number]
    hl_rewards = rewards[..., number:]
    return policy_rewards, hl_rewards

def make_video(policy, number=1, policy_color='', duration=60, seed=0,
               factor=10, width=300):
    policy_group = GroupConfig(indices=slice(None, number),
                               policy=policy,
                               color=policy_color)
    exp = make_experiment_with_env(ma_env, groups=[policy_group],
                                   record_reward=False)
    exp.record_config.pose = True
    exp.number_of_runs = 1
    exp.steps = int(duration / 0.1)
    exp.run_index = seed
    exp.run()
    return make_video_from_run(exp.runs[seed], factor=factor, width=width)
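
For instance, a quick single-episode check of the helper could look like the sketch below (the full evaluation later in the notebook uses 100 episodes and 8 processes):

# 10 agents follow the P-SAC policy, the other 10 follow HL, for one episode
ps, hs = get_groups_rewards('P-SAC/policy.onnx', number=10, number_of_episodes=1)
print('mean reward of policy agents:', ps.mean())
print('mean reward of HL agents:    ', hs.mean())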

Video#

We display the video of one run for different ratios of agents following the policy to agents following HL.

[28]:
colors = {'P-DAgger': 'teal', 'P-SAC': 'darkviolet'}
models = ('P-DAgger', 'P-SAC')
[29]:
width = 300
duration = 60 * 2

videos = [
    [make_video(f'{name}/policy.onnx', number=number, factor=10,
                width=width, duration=duration, policy_color=colors[name])
     for name in models] for number in (5, 10, 15, 20)]

x = 0
width = 0
for i, vs in enumerate(videos):
    y = 0
    for j, video in enumerate(vs):
        video.pos = lambda _, x=x, y=y: (y, x)
        y += video.size[1]
    x += video.size[0]


videos = sum(videos, [])
duration = max(video.duration for video in videos)

img = mpy.ImageClip(np.full((x, y, 3), 255, dtype=np.uint8), duration=duration)
cc = mpy.CompositeVideoClip(clips=[img] + videos)
cc.display_in_notebook(fps=30, width=600, rd_kwargs=dict(logger=None))
[29]:

Looking at the second row (an even 10/10 split), the behavior of the policy agents is similar to that of the HL agents: a bit less efficient for P-DAgger and a bit less safe for P-SAC.

We could make the P-SAC policy safer by increasing the weight of safety violations in the reward. In our example, we have trained with a reward that assigns the same weight to efficacy and safety, which leads to an overly aggressive behavior.
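
As a sketch of what such a reweighting could look like, the hypothetical helper below combines an efficacy term and a safety-violation penalty with a configurable weight; the actual reward used for training is defined with the environment and its interface may differ.

def weighted_reward(efficacy: float, safety_violation: float,
                    safety_weight: float = 2.0) -> float:
    # efficacy in [0, 1]: how efficiently the agent progresses towards its target
    # safety_violation in [0, 1]: how much of the safety margin is violated
    # safety_weight > 1 penalizes violations more strongly than inefficiency
    return (efficacy - 1.0) - safety_weight * safety_violation

weighted_reward(efficacy=0.9, safety_violation=0.1)  # -0.3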

Reward#

Next, we compute the reward distribution.

[10]:
number_of_episodes = 100
ns = (1, 5, 10, 15, 19)
numbers = []
rewards = []
policies = []
targets = []
means = []
for number in ns:
    for name in models:
        ps, hs = get_groups_rewards(
            policy=f'{name}/policy.onnx',
            number_of_episodes=number_of_episodes,
            processes=8, number=number)
        rewards.append(ps)
        means.append(np.mean(ps))
        numbers.append(number)
        policies.append(name)
        targets.append('policy')
        rewards.append(hs)
        means.append(np.mean(hs))
        numbers.append(number)
        policies.append(name)
        targets.append('HL')
[19]:
import pandas as pd

pd.set_option("display.precision", 3)

data = pd.DataFrame(dict(number=numbers, reward=means, policy=policies, group=targets))
data.to_csv('rewards_ma_env.csv')
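
To inspect the same averages as a table, we could pivot the dataframe, for example:

# mean reward per group, indexed by the number of policy agents
data.pivot_table(index='number', columns=['policy', 'group'], values='reward')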

In the plot, solid lines are the average over all agents, dotted lines over agents following HL and dashed lines over agents following the policy.

[25]:
from matplotlib import pyplot as plt

_, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 4))

for name, ds in data.groupby('policy'):
    ps = ds[ds.group=='policy']
    hs = ds[ds.group=='HL']
    num = np.asarray(ps.number)
    ax1.plot(ps.number, ps.reward, '.--', color=colors[name], label=name)
    ax1.plot(hs.number, hs.reward, '.:', color=colors[name], label=f'HL among {name}')
    total = (num * np.asarray(ps.reward) + (20 - num) * np.asarray(hs.reward)) / 20
    ax2.plot(ps.number, total, '.-', color=colors[name], label=name)
ax1.legend(ncols=2)
ax2.legend(ncols=1)
ax1.set_xlabel('number of policy agents')
ax1.set_ylim(-1, 0)
ax2.set_xlabel('number of policy agents')
ax2.set_ylim(-1, 0)
ax1.set_ylabel('reward')
ax1.set_title('Groups average reward');
ax2.set_title('Whole group average reward');
../../_images/tutorials_crossing_Analysis-MA_15_0.png

We observe that the performance of the whole group decreases far less than with the policies trained in the single-agent environment. P-SAC agents have higher rewards than HL, while P-DAgger agents have lower rewards. Interestingly, HL agents seem almost unaffected by the number of P-DAgger agents, while more P-SAC agents decrease the performance of the HL agents, although not by much.

The histograms below tell a similar story:

- SAC agents move straight but with lots of safety violations (the -2 rewards)
- BC agents are stuck together for about 50% of the time
- DAgger agents have fewer outliers (fewer violations/collisions), and their number does not impact the reward distribution as much as for the others.

[26]:
from matplotlib import pyplot as plt

fig, axs = plt.subplots(ncols=2, nrows=5, figsize=(8, 10))

i = 0
for cols, number in zip(axs, ns):
    bins = np.linspace(-2, 0, 50)
    for ax, name in zip(cols, models):
        if name == 'P-DAgger':
            ax.set_ylabel(f"{number} policy + {20 - number} HL", rotation=90)
        ax.hist(rewards[i].flatten(), density=True,
                color=colors[name], alpha=0.5, bins=bins, label=name)
        i += 1
        ax.hist(rewards[i].flatten(), density=True,
                color='black', alpha=0.5, bins=bins, label="HL")
        ax.set_xlim(-2, 0)
        ax.set_ylim(0, 10)
        i += 1
        if number == 1:
            ax.set_title(name)
            ax.legend()
../../_images/tutorials_crossing_Analysis-MA_18_0.png

Final video#

One final video where, for fun, we let all the policies we have trained (except P-BC) play together.

[48]:
from navground.learning.evaluation import InitPolicyBehavior
from navground.sim.ui.video import display_video_from_run

colors = {
    'BC': 'red',
    'DAgger': 'green',
    'SAC': 'blue',
    'P-DAgger': 'teal',
    'P-SAC': 'darkviolet'
}

groups = [
    GroupConfig(indices=slice(4 * i, 4 * (i + 1)),
                policy=f"{name}/policy.onnx",
                color=color)
    for i, (name, color) in enumerate(colors.items())
]

exp = make_experiment_with_env(ma_env, groups=groups, record_reward=False)
exp.steps = 1800
exp.record_config.pose = True
run = exp.run_once(seed=123)
display_video_from_run(run, factor=10, width=400, display_width=400)
[48]: