Navground-PettingZoo integration#
This notebook showcases the integration between navground and PettingZoo, the “multi-agent” version of Gymnasium. We focus on the differences compared to Gymnasium: have a look at the Gymnasium notebook for the common parts (e.g., rendering).
While in Gymnasium we control a single navground agent (which may move among many other agents controlled by navground), with PettingZoo we can control multiple agents, even all the agents of a navground simulation.
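As a reminder, a PettingZoo parallel environment exchanges dictionaries keyed by agent instead of single values. A minimal interaction loop, sketched here against the generic PettingZoo Parallel API (penv stands for any parallel environment, such as the ones we build below), looks like
[ ]:
observations, infos = penv.reset()
while penv.agents:
    # one action per currently active agent, here sampled at random
    actions = {agent: penv.action_space(agent).sample() for agent in penv.agents}
    observations, rewards, terminations, truncations, infos = penv.step(actions)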
To start, we load the same scenario with 20 agents and the same sensor
[1]:
import numpy as np

from navground import sim

with open('scenario.yaml') as f:
    scenario = sim.load_scenario(f.read())
with open('sensor.yaml') as f:
    sensor = sim.load_state_estimation(f.read())
A single group#
Now, instead of a single agent, we want to control a group of agents with a policy acting on the selected sensor. We define the PettingZoo environment so that it controls the first 10 agents, which all share the same configuration
[3]:
from navground.learning.parallel_env import shared_parallel_env
from navground.learning import DefaultObservationConfig, ControlActionConfig
from navground.learning.rewards import SocialReward
observation_config = DefaultObservationConfig(include_target_direction=True,
                                              include_target_distance=True)
action_config = ControlActionConfig()
env = shared_parallel_env(
    scenario=scenario,
    indices=slice(0, 10, 1),
    sensor=sensor,
    action=action_config,
    observation=observation_config,
    reward=SocialReward(),
    time_step=0.1,
    max_duration=60.0)
All agents have the same observation and action spaces, as configured
[4]:
print(f'We are controlling {len(env.possible_agents)} agents')
observation_space = env.observation_space(0)
action_space = env.action_space(0)
if all(env.action_space(i) == action_space and env.observation_space(i) == observation_space
       for i in env.possible_agents):
    print(f'They share the same observation {observation_space} and action {action_space} spaces')
We are controlling 10 agents
They share the same observation Dict('position': Box(-5.0, 5.0, (5, 2), float32), 'radius': Box(0.0, 0.1, (5,), float32), 'valid': Box(0, 1, (5,), uint8), 'velocity': Box(-0.12, 0.12, (5, 2), float32), 'ego_target_direction': Box(-1.0, 1.0, (2,), float32), 'ego_target_distance': Box(0.0, inf, (1,), float32)) and action Box(-1.0, 1.0, (2,), float32) spaces
The info map returned by reset(...) and step(...) contains the action computed by the original navground behavior, in this case HL, for each of the 10 agents.
[5]:
observations, infos = env.reset()
print(f"Observation #0: {observations[0]}")
print(f"Info #0: {infos[0]}")
Observation #0: {'ego_target_distance': array([1.3484901], dtype=float32), 'ego_target_direction': array([ 1.0000000e+00, -1.5725663e-08], dtype=float32), 'position': array([[-0.00738173, -0.30817246],
[-0.38925827, 0.01894906],
[-0.46368217, -0.4778133 ],
[ 0.15306982, -0.6674728 ],
[ 0.5088892 , -0.62434775]], dtype=float32), 'radius': array([0.1, 0.1, 0.1, 0.1, 0.1], dtype=float32), 'valid': array([1, 1, 1, 1, 1], dtype=uint8), 'velocity': array([[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.],
[0., 0.]], dtype=float32)}
Info #0: {'navground_action': array([0.32967997, 0. ], dtype=float32)}
Let’s collect the reward from the original controller
[6]:
all_rewards = []
for n in range(1000):
    actions = {i: info['navground_action'] for i, info in infos.items()}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()
print(f'mean reward {np.mean(all_rewards):.3f}')
reset after 600 steps
mean reward -0.243
and compare it with the reward from a random policy
[7]:
observations, infos = env.reset()
all_rewards = []
for n in range(1000):
    actions = {i: env.action_space(i).sample() for i in range(10)}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()
print(f'mean reward {np.mean(all_rewards):.3f}')
reset after 600 steps
mean reward -1.117
We want to use a machine learning policy to generate the actions. For instance, a random policy, like
[8]:
from navground.learning.policies.random_predictor import RandomPredictor
policies = {i: RandomPredictor(observation_space=env.observation_space(i),
                               action_space=env.action_space(i))
            for i in env.agents}
Policies output a tuple (action, state). Therefore the new loop is
[9]:
observations, infos = env.reset()
all_rewards = []
for n in range(1000):
    actions = {i: policies[i].predict(observations[i])[0] for i in env.agents}
    observations, rewards, terminated, truncated, infos = env.step(actions)
    all_rewards.append(np.mean(list(rewards.values())))
    done = np.bitwise_or(list(terminated.values()), list(truncated.values()))
    if np.all(done):
        print(f'reset after {n} steps')
        observations, infos = env.reset()
print(f'mean reward {np.mean(all_rewards):.3f}')
reset after 600 steps
mean reward -1.088
Two groups#
Let us now consider the more complex case where we want to control agents using different sensors and/or configurations. For instance, we want to control the first 10 agents as before and the second 10 agents using a lidar scanner. Let us also say that we want to control the second group in acceleration, while the first group is controlled in speed.
[19]:
lidar = sim.load_state_estimation("""
type: Lidar
resolution: 100
range: 5.0
""")
[20]:
from navground.learning.parallel_env import parallel_env
from navground.learning import GroupConfig
first_group = GroupConfig(indices=slice(0, 10, 1), sensor=sensor,
                          observation=DefaultObservationConfig(include_target_distance=False),
                          action=ControlActionConfig(),
                          reward=SocialReward())
second_group = GroupConfig(indices=slice(10, 20, 1), sensor=lidar,
                           observation=DefaultObservationConfig(),
                           action=ControlActionConfig(use_acceleration_action=True,
                                                      max_acceleration=1.0,
                                                      max_angular_acceleration=10.0),
                           reward=SocialReward())
env = parallel_env(scenario=scenario, groups=[first_group, second_group],
                   time_step=0.1, max_duration=60.0)
The two groups now use different observation spaces
[21]:
env.observation_space(0)
[21]:
Dict('position': Box(-5.0, 5.0, (5, 2), float32), 'radius': Box(0.0, 0.1, (5,), float32), 'valid': Box(0, 1, (5,), uint8), 'velocity': Box(-0.12, 0.12, (5, 2), float32))
[22]:
env.observation_space(10)
[22]:
Dict('fov': Box(0.0, 6.2831855, (1,), float32), 'range': Box(0.0, 5.0, (100,), float32), 'start_angle': Box(-6.2831855, 6.2831855, (1,), float32))
and different maps between actions and commands
[23]:
env._possible_agents[0].gym.get_cmd_from_action(np.ones(2), time_step=0.1)
[23]:
Twist2((0.120000, 0.000000), 2.553191, frame=Frame.relative)
[24]:
env._possible_agents[10].gym.get_cmd_from_action(np.ones(2), time_step=0.1)
[24]:
Twist2((0.100000, 0.000000), 1.000000, frame=Frame.relative)
Convert to a Gymnasium Env#
In case the agents share the same configuration (and in particular action and observation spaces), we can convert the PettingZoo env into a Gymnasium vector env.
[25]:
env = shared_parallel_env(
    scenario=scenario,
    agent_indices=slice(0, 10, 1),
    sensor=sensor,
    action=action_config,
    observation=observation_config,
    reward=SocialReward(),
    time_step=0.1,
    max_duration=60.0)
[26]:
import supersuit
venv = supersuit.pettingzoo_env_to_vec_env_v1(env)
with
[27]:
venv.num_envs
[27]:
10
environments that represent the individual agents.
This vector env follows the Gymnasium API, stacking together the observations and actions of the individual agents.
If instead we want a vector env that follows the SB3 API, we can use supersuit.concat_vec_envs_v1 (even stacking multiple vectorized envs together)
[28]:
venv1 = supersuit.concat_vec_envs_v1(venv, 2, num_cpus=1, base_class="stable_baselines3")
[29]:
venv1.num_envs
[29]:
20
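Such a vector env can be used directly to train with Stable-Baselines3. A minimal sketch (assuming stable-baselines3 is installed; we pick PPO with a MultiInputPolicy because the observations are dictionaries) that learns a single policy shared by all controlled agents:
[ ]:
from stable_baselines3 import PPO

# one policy is shared by all the agents stacked in the vector env
model = PPO("MultiInputPolicy", venv1, verbose=0)
model.learn(total_timesteps=10_000)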
Convert from a Gymnasium Env#
If we have a single-agent navground environment that uses a multi-agent scenario, we can convert it to a parallel environment, where all controlled agents share the same configuration, like for shared_parallel_env.
Let us load the environment we saved in the previous notebook
[11]:
from navground.learning import io
sa_env = io.load_env('env.yaml')
and convert it to a parallel environment, controlling 10 (out of the total 20) agents.
[15]:
from navground.learning.parallel_env import make_shared_parallel_env_with_env
[16]:
env1 = make_shared_parallel_env_with_env(env=sa_env, indices=slice(0, 10))
[17]:
env1.possible_agents
[17]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Saving and loading#
The multi-agent PettingZoo environment supports the same YAML representation as the single-agent Gymnasium environment, so we can save it to and load it from a YAML file.
[18]:
io.save_env(env1, 'penv.yaml')
Let us check that the groups field is coherent with the configuration we have just provided: a single group of 10 agents (indices 0, 1, …, 9).
[22]:
import yaml
print(yaml.safe_dump(env1.asdict['groups']))
- action:
dof: null
dtype: ''
fix_orientation: false
has_wheels: null
max_acceleration: .inf
max_angular_acceleration: .inf
max_angular_speed: .inf
max_speed: .inf
type: Control
use_acceleration_action: false
use_wheels: false
indices:
start: 0
step: null
stop: 10
type: slice
observation:
dof: null
dtype: ''
flat: false
history: 1
include_angular_speed: false
include_radius: false
include_target_angular_speed: false
include_target_direction: true
include_target_direction_validity: false
include_target_distance: true
include_target_distance_validity: false
include_target_speed: false
include_velocity: false
max_angular_speed: .inf
max_radius: .inf
max_speed: .inf
max_target_distance: .inf
type: Default
reward:
alpha: 0.0
beta: 1.0
critical_safety_margin: 0.0
default_social_margin: 0.0
safety_margin: null
social_margins: {}
type: Social
sensor: {}
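As a quick check (a sketch, assuming io.load_env also handles multi-agent environments, as the shared YAML representation suggests), we can load the environment back and verify the controlled agents
[ ]:
penv = io.load_env('penv.yaml')
print(penv.possible_agents)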
[ ]: