PettingZoo#

This notebook showcases the integration between navground and PettingZoo, the “multi-agent” version of Gymnasium. We focus on the differences compared to Gymnasium: have a look at the Gymnasium notebook for the common parts (e.g., rendering).

While in Gymnasium we control a single navground agent (which may move among many other agents controlled by navground), with PettingZoo we can control multiple agents, even all the agents of a navground simulation.
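
In practice, the main difference lies in the parallel PettingZoo API: actions, observations, rewards, and infos are exchanged as dictionaries keyed by agent index instead of single values. A minimal sketch of one interaction step, assuming a parallel environment env like the ones constructed below:

observations, infos = env.reset()
# one (random) action per controlled agent, keyed by its index
actions = {agent: env.action_space(agent).sample() for agent in env.agents}
# the returned values are also dictionaries keyed by agent index
observations, rewards, terminated, truncated, infos = env.step(actions)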

To start, we load the same scenario with 20 agents and the same sensor

[1]:
from navground import sim
import numpy as np

with open('scenario.yaml') as f:
    scenario = sim.load_scenario(f.read())

with open('sensor.yaml') as f:
    sensor = sim.load_state_estimation(f.read())

A single group#

Now, instead of a single agent, we want to control a group of agents with a policy acting on the selected sensor. We define a PettingZoo environment that controls the first 10 agents, all sharing the same configuration.

[2]:
from navground.learning.parallel_env import shared_parallel_env
from navground.learning import DefaultObservationConfig, ControlActionConfig
from navground.learning.rewards import SocialReward

observation_config = DefaultObservationConfig(include_target_direction=True,
                                              include_target_distance=True)
action_config = ControlActionConfig()

env = shared_parallel_env(
    scenario=scenario,
    indices=slice(0, 10, 1),
    sensor=sensor,
    action=action_config,
    observation=observation_config,
    reward=SocialReward(),
    time_step=0.1,
    max_duration=60.0)

All agents share the same observation and action spaces, as configured

[3]:
print(f'We are controlling {len(env.possible_agents)} agents')

observation_space = env.observation_space(0)
action_space = env.action_space(0)
if all(env.action_space(i) == action_space and env.observation_space(i) == observation_space
       for i in env.possible_agents):
    print(f'They share the same observation {observation_space} and action {action_space} spaces')
We are controlling 10 agents
They share the same observation Dict('neighbors/position': Box(-5.0, 5.0, (5, 2), float32), 'neighbors/radius': Box(0.0, 0.1, (5,), float32), 'neighbors/valid': Box(0, 1, (5,), uint8), 'neighbors/velocity': Box(-0.12, 0.12, (5, 2), float32), 'ego_target_direction': Box(-1.0, 1.0, (2,), float32), 'ego_target_distance': Box(0.0, 5.0, (1,), float32)) and action Box(-1.0, 1.0, (2,), float32) spaces

The info map returned by reset(...) and step(...) contains the action computed by the original navground behavior (in this case, HL) for each of the 10 agents.

[4]:
observations, infos = env.reset()
print(f"Observation #0: {observations[0]}")
print(f"Info #0: {infos[0]}")
Observation #0: {'ego_target_distance': array([1.3484901], dtype=float32), 'ego_target_direction': array([ 1.0000000e+00, -1.5725663e-08], dtype=float32), 'neighbors/position': array([[-0.00738173, -0.30817246],
       [-0.38925827,  0.01894906],
       [-0.46368217, -0.4778133 ],
       [ 0.15306982, -0.6674728 ],
       [ 0.5088892 , -0.62434775]], dtype=float32), 'neighbors/radius': array([0.1, 0.1, 0.1, 0.1, 0.1], dtype=float32), 'neighbors/valid': array([1, 1, 1, 1, 1], dtype=uint8), 'neighbors/velocity': array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]], dtype=float32)}
Info #0: {'navground_action': array([0.32967997, 0.        ])}

Let’s collect the reward from the original controller

[5]:
all_rewards = []
for n in range(1000):
    actions = {i: info['navground_action'] for i, info in infos.items()}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()

print(f'mean reward {np.mean(all_rewards):.3f}')
reset after 600 steps
mean reward -0.252

and compare it with the reward from a random policy

[6]:
observations, infos = env.reset()
all_rewards = []
for n in range(1000):
    actions = {i: env.action_space(i).sample() for i in range(10)}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()

print(f'mean reward {np.mean(all_rewards):.3f}')
reset after 600 steps
mean reward -1.059

We want to use a machine learning policy to generate the actions. For instance, a random policy, like

[7]:
from navground.learning.policies.random_predictor import RandomPredictor

policies = {i: RandomPredictor(observation_space=env.observation_space(i),
                               action_space=env.action_space(i))
            for i in env.agents}

Policies output a tuple (action, state). Therefore the new loop is

[8]:
observations, infos = env.reset()
all_rewards = []
for n in range(1000):
    actions = {i: policies[i].predict(observations[i])[0] for i in env.agents}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()

print(f'mean reward {np.mean(all_rewards):.3f}')
reset after 600 steps
mean reward -1.034

Two groups#

Let us now consider the more complex case where we want to control agents using different sensors and/or configurations. For instance, we want to control the first 10 agents like before and the second 10 agents using a lidar scanner. Let's say we also want to control the second group in acceleration, while the first group is controlled in speed.

[9]:
lidar = sim.load_state_estimation("""
type: Lidar
resolution: 100
range: 5.0
name: lidar
""")
[10]:
from navground.learning.parallel_env import parallel_env
from navground.learning import GroupConfig

first_group = GroupConfig(indices=slice(0, 10, 1), sensor=sensor,
                          observation = DefaultObservationConfig(include_target_distance=False),
                          action = ControlActionConfig(),
                          reward=SocialReward(), tag='first')
second_group = GroupConfig(indices=slice(10, 20, 1), sensor=lidar,
                           observation = DefaultObservationConfig(),
                           action = ControlActionConfig(use_acceleration_action=True,
                                                        max_acceleration=1.0,
                                                        max_angular_acceleration=10.0),
                           reward=SocialReward(), tag='second')

env2 = parallel_env(scenario=scenario, groups=[first_group, second_group],
                    time_step=0.1, max_duration=60.0)

The two groups now use different observation spaces

[11]:
env2.observation_space(0)
[11]:
Dict('neighbors/position': Box(-5.0, 5.0, (5, 2), float32), 'neighbors/radius': Box(0.0, 0.1, (5,), float32), 'neighbors/valid': Box(0, 1, (5,), uint8), 'neighbors/velocity': Box(-0.12, 0.12, (5, 2), float32), 'ego_target_direction': Box(-1.0, 1.0, (2,), float32))
[12]:
env2.observation_space(10)
[12]:
Dict('lidar/fov': Box(0.0, 6.2831855, (1,), float32), 'lidar/max_range': Box(0.0, 10.0, (1,), float32), 'lidar/range': Box(0.0, 5.0, (100,), float32), 'lidar/start_angle': Box(-6.2831855, 6.2831855, (1,), float32), 'ego_target_direction': Box(-1.0, 1.0, (2,), float32))

and different maps between actions and commands

[13]:
env2.get_cmd_from_action(index=0, action=np.ones(2), time_step=0.1)
[13]:
Twist2((0.120000, 0.000000), 2.553191, frame=Frame.relative)
[14]:
env2.get_cmd_from_action(index=10, action=np.ones(2), time_step=0.1)
[14]:
Twist2((0.100000, 0.000000), 1.000000, frame=Frame.relative)
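
As a rough sanity check of these numbers (our reading of the two configurations, not an official formula): for the first group, an action of ones is rescaled to the agent's maximal speed and angular speed (0.12 m/s and about 2.55 rad/s, as shown above), while for the second group, the accelerations are integrated over one time step starting from a null velocity:

time_step = 0.1
# second group: acceleration commands integrated over one time step from rest
linear_speed = 1.0 * 1.0 * time_step     # action * max_acceleration -> 0.1 m/s
angular_speed = 1.0 * 10.0 * time_step   # action * max_angular_acceleration -> 1.0 rad/s
print(linear_speed, angular_speed)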

State#

PettingZoo environments also expose a global state, which we configure through a StateConfig. Next, we configure the global state of the same two-group environment to include the 2D positions of all 20 agents (a flat vector of 40 values):

[15]:
from navground.learning import DefaultStateConfig

state_config = DefaultStateConfig(include_position=True)

env_state = parallel_env(scenario=scenario, state=state_config, groups=[first_group, second_group],
                         time_step=0.1, max_duration=60.0)
env_state.state_space
[15]:
Box(-2.0, 2.0, (40,), float32)
[16]:
env_state.reset()
env_state.state()
[16]:
array([ 0.18549144,  0.35280955,  0.8253108 ,  1.3351982 ,  0.42879337,
        1.4013637 ,  0.04322448,  1.2935146 , -0.29011178,  0.46954215,
        0.61308354, -0.44533634, -0.23716867, -0.7693684 ,  1.4887372 ,
       -1.6844907 ,  1.7619184 , -0.863906  , -0.44292223, -0.08487248,
        1.1511286 ,  1.1025013 ,  0.097995  , -0.04596518,  0.21168919,
       -0.43155256,  1.712881  ,  1.2181036 , -1.5547919 , -0.6635826 ,
       -1.5689087 ,  0.56305325, -1.8984411 , -0.454994  ,  1.2372903 ,
        1.8097478 ,  1.0569957 , -1.366667  ,  1.375972  ,  1.4343923 ],
      dtype=float32)

Convert to a Gymnasium environment#

We can always transform a multi-agent environment into a single-agent environment. In the most general case, we concatenate observations and actions and sum the rewards.

When agents share the same configuration (and in particular action and observation spaces), we can stack observations and actions instead.
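
For instance, with 10 agents and per-agent observations of shape (3,), concatenating yields a single flat vector of shape (30,), while stacking adds a leading agent dimension, yielding shape (10, 3). A schematic numpy illustration (not the actual conversion code):

import numpy as np

# hypothetical per-agent observations, keyed by agent index
per_agent_obs = {i: np.zeros(3, dtype=np.float32) for i in range(10)}
concatenated = np.concatenate(list(per_agent_obs.values()))  # shape (30,)
stacked = np.stack(list(per_agent_obs.values()))             # shape (10, 3)
print(concatenated.shape, stacked.shape)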

Vectorized Environment#

We can view a multi-agent environment with N agents as N stacked single-agent environments composing a vectorized single-agent environment.

[17]:
env3 = shared_parallel_env(
    scenario=scenario,
    sensor=sensor,
    indices=slice(0, 10),
    action=action_config,
    observation=observation_config,
    reward=SocialReward(),
    state=state_config,
    time_step=0.1,
    max_duration=60.0)
[18]:
import supersuit

venv = supersuit.pettingzoo_env_to_vec_env_v1(env3)

with

[19]:
venv.num_envs
[19]:
10

environments that represent the individual agents.

This vectorized environment follows the Gymnasium API, stacking together the observations and actions of the individual agents.

[20]:
venv.observation_space, venv.action_space
[20]:
(Dict('neighbors/position': Box(-5.0, 5.0, (5, 2), float32), 'neighbors/radius': Box(0.0, 0.1, (5,), float32), 'neighbors/valid': Box(0, 1, (5,), uint8), 'neighbors/velocity': Box(-0.12, 0.12, (5, 2), float32), 'ego_target_direction': Box(-1.0, 1.0, (2,), float32), 'ego_target_distance': Box(0.0, 5.0, (1,), float32)),
 Box(-1.0, 1.0, (2,), float32))

If we instead want a vectorized environment that follows the SB3 API, we can use the following (even stacking multiple vectorized environments together)

[21]:
venv1 = supersuit.concat_vec_envs_v1(venv, 2, num_cpus=1, base_class="stable_baselines3")
[22]:
venv1.num_envs
[22]:
20

Joint Environment#

A joint environment is similar, but rewards (summed), terminations (all) and truncations (any) are aggregated, resulting in a single Gymnasium environment:

[23]:
from navground.learning.parallel_env import JointEnv

jenv = JointEnv(env)
jenv.observation_space, jenv.action_space
[23]:
(Dict('ego_target_direction': Box(-1.0, 1.0, (10, 2), float32), 'ego_target_distance': Box(0.0, 5.0, (10, 1), float32), 'neighbors/position': Box(-5.0, 5.0, (10, 5, 2), float32), 'neighbors/radius': Box(0.0, 0.1, (10, 5), float32), 'neighbors/valid': Box(0.0, 1.0, (10, 5), float32), 'neighbors/velocity': Box(-0.12, 0.12, (10, 5, 2), float32)),
 Box(-1.0, 1.0, (10, 2), float32))

It can be configured to use the global state as observations:

[24]:
jenv = JointEnv(env_state, state=True)
jenv.observation_space, jenv.action_space
[24]:
(Box(-2.0, 2.0, (40,), float32), Box(-1.0, 1.0, (20, 2), float32))
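
The joint environment can then be stepped like any other Gymnasium environment; for instance, with null actions for all 20 agents (a minimal sketch, assuming the standard Gymnasium reset/step API):

observations, info = jenv.reset()
# one row of actions per controlled agent
actions = np.zeros(jenv.action_space.shape, dtype=np.float32)
observations, reward, terminated, truncated, info = jenv.step(actions)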

Convert from a Gymnasium Env#

If we have a single-agent navground environment that uses a multi-agent scenario, we can convert it to a parallel environment where all controlled agents share the same configuration, like with shared_parallel_env.

Let us load the environment we saved in the previous notebook

[25]:
from navground.learning import io

sa_env = io.load_env('env.yaml')

and convert it to a parallel environment, controlling 10 of the 20 agents.

[26]:
from navground.learning.parallel_env import make_shared_parallel_env_with_env
[27]:
env4 = make_shared_parallel_env_with_env(env=sa_env, indices=slice(0, 10))
[28]:
env4.possible_agents
[28]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
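
Since the parallel environment is built from the single-agent one, we expect each controlled agent to expose the same spaces (a quick check, assuming sa_env exposes the usual Gymnasium space attributes):

# expectation: the per-agent spaces match the original single-agent spaces
assert env4.observation_space(0) == sa_env.observation_space
assert env4.action_space(0) == sa_env.action_space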

Saving and loading#

The multi-agent PettingZoo environment supports the same YAML representation as the single-agent Gymnasium environment, so we can save it to and load it from a YAML file.

[29]:
io.save_env(env, 'penv.yaml')

Let us check that the groups field is consistent with the configuration we have just provided: a single group of 10 agents (indices 0, 1, …, 9).

[30]:
import yaml

print(yaml.safe_dump(env.asdict['groups']))
- action:
    dof: null
    dtype: ''
    fix_orientation: false
    has_wheels: null
    max_acceleration: .inf
    max_angular_acceleration: .inf
    max_angular_speed: .inf
    max_speed: .inf
    type: Control
    use_acceleration_action: false
    use_wheels: false
  indices:
    start: 0
    step: 1
    stop: 10
    type: slice
  observation:
    dof: null
    dtype: ''
    flat: false
    flat_values: false
    history: 1
    ignore_keys: []
    include_angular_speed: false
    include_radius: false
    include_target_angular_speed: false
    include_target_direction: true
    include_target_direction_validity: false
    include_target_distance: true
    include_target_distance_validity: false
    include_target_orientation: false
    include_target_orientation_validity: false
    include_target_speed: false
    include_velocity: false
    keys: null
    max_angular_speed: .inf
    max_radius: .inf
    max_speed: .inf
    max_target_distance: .inf
    normalize: false
    sort_keys: false
    type: Default
  reward:
    alpha: 0.0
    beta: 1.0
    critical_safety_margin: 0.0
    default_social_margin: 0.0
    safety_margin: null
    social_margins: {}
    type: Social
  sensors:
  - include_valid: true
    include_x: true
    include_y: true
    max_id: 0
    max_radius: 0.100000001
    max_speed: 0.119999997
    name: neighbors
    number: 5
    range: 5
    type: Discs
    use_nearest_point: true
  terminate_on_failure: true
  terminate_on_success: true
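
To verify the round trip, we can load the environment back and compare it with the original; here we assume that the same io.load_env helper also handles the multi-agent YAML representation:

# assumption: io.load_env also reconstructs parallel (multi-agent) environments
penv = io.load_env('penv.yaml')
assert penv.possible_agents == env.possible_agents
assert penv.observation_space(0) == env.observation_space(0)
assert penv.action_space(0) == env.action_space(0)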

We also export the two-group environments that we are going to use in the next notebook.

[31]:
io.save_env(env2, 'penv2.yaml')
io.save_env(env_state, 'penv_state.yaml')
[ ]: