diff --git a/.gitignore b/.gitignore
index c6335cf4..1529576e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -44,3 +44,12 @@ dataset/
# configs
configs.json
+
+# pyenv
+.python-version
+
+# poetry
+poetry.lock
+
+# wandb
+wandb
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
new file mode 100644
index 00000000..d1117cb4
--- /dev/null
+++ b/.pre-commit-config.yaml
@@ -0,0 +1,66 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+repos:
+- repo: https://github.com/pre-commit/pre-commit-hooks
+  rev: v4.4.0
+  hooks:
+  - id: trailing-whitespace
+  - id: end-of-file-fixer
+  - id: sort-simple-yaml
+  - id: check-json
+  - id: check-merge-conflict
+  - id: check-symlinks
+  - id: debug-statements
+  - id: check-added-large-files
+- repo: https://github.com/python-poetry/poetry
+  rev: 1.6.0
+  hooks:
+  - id: poetry-check
+  - id: poetry-lock
+  # - id: poetry-publish
+- repo: https://github.com/psf/black
+  rev: 23.9.1
+  hooks:
+  - id: black
+    args: [--line-length, '120']
+- repo: https://github.com/PyCQA/isort
+  rev: 5.12.0
+  hooks:
+  - id: isort
+- repo: https://github.com/PyCQA/flake8
+  rev: 6.1.0
+  hooks:
+  - id: flake8
+    args: [--max-line-length=120, --extend-ignore=E203]
+- repo: https://github.com/PyCQA/pydocstyle
+  rev: 6.3.0
+  hooks:
+  - id: pydocstyle
+    args: [--convention=numpy]
+    additional_dependencies: [tomli]
+- repo: https://github.com/macisamuele/language-formatters-pre-commit-hooks
+  rev: v2.10.0
+  hooks:
+  - id: pretty-format-toml
+    args: [--autofix, --no-sort]
+  - id: pretty-format-yaml
+    args: [--autofix]
+- repo: local
+  hooks:
+  - id: pylint
+    name: pylint
+    entry: poetry run pylint
+    language: system
+    types: [python]
+  - id: poetry-export-requirements
+    name: poetry-export-requirements
+    entry: poetry export --without-hashes --with=main,research -f requirements.txt -o requirements.txt
+    language: system
+    types: [python]
+    pass_filenames: false
+  - id: poetry-export-requirements-dev
+    name: poetry-export-requirements-dev
+    entry: poetry export --without-hashes --only dev -f requirements.txt -o requirements.dev.txt
+    language: system
+    types: [python]
+    pass_filenames: false
diff --git a/README.md b/README.md
index cafb3bff..7f00d26b 100644
--- a/README.md
+++ b/README.md
@@ -1,235 +1,156 @@
-# Nocturne
+# `nocturne_lab`: fast driving simulator 🧪 + 🚗
-Nocturne is a 2D, partially observed, driving simulator, built in C++ for speed and exported as a Python library.
+`nocturne_lab` is a maintained fork of [Nocturne](https://github.com/facebookresearch/nocturne), a 2D, partially observed driving simulator built in C++. Currently, `nocturne_lab` is used internally at the Emerge lab. You can get started with the intro examples 🏎️💨 [here](https://github.com/Emerge-Lab/nocturne_lab/tree/feature/nocturne_fork_cleanup/examples).
-It is currently designed to handle traffic scenarios from the [Waymo Open Dataset](https://github.com/waymo-research/waymo-open-dataset), and with some work could be extended to support different driving datasets. Using the Python library `nocturne`, one is able to train controllers for AVs to solve various tasks from the Waymo dataset, which we provide as a benchmark, then use the tools we offer to evaluate the designed controllers.
+## Basic usage
-Using this rich data source, Nocturne contains a wide range of scenarios whose solution requires the formation of complex coordination, theory of mind, and handling of partial observability. Below we show replays of the expert data, centered on the light blue agent, with the corresponding view of the agent on the right.
-
-
+```python
+from nocturne.envs.base_env import BaseEnv
-Nocturne features a rich variety of scenes, ranging from parking lots, to merges, to roundabouts, to unsignalized intersections.
+# Initialize an environment
+env = BaseEnv(config=env_config)
-
+# Reset
+obs_dict = env.reset()
-More videos can be found [here](https://www.nathanlct.com/research/nocturne).
+# Get info
+agent_ids = list(obs_dict.keys())
+dead_agent_ids = []
-The corresponding paper is available at: [https://arxiv.org/abs/2206.09889](https://arxiv.org/abs/2206.09889). Please cite the paper and not the GitHub repository, using the following citation:
+for step in range(1000):
-```bibtex
-@article{nocturne2022,
- author = {Vinitsky, Eugene and Lichtlé, Nathan and Yang, Xiaomeng and Amos, Brandon and Foerster, Jakob},
- journal = {arXiv preprint arXiv:2206.09889},
- title = {{Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world}},
- url = {https://arxiv.org/abs/2206.09889},
- year = {2022}
-}
-```
+    # Sample actions
+    action_dict = {
+        agent_id: env.action_space.sample()
+        for agent_id in agent_ids
+        if agent_id not in dead_agent_ids
+    }
+
+    # Step in env
+    obs_dict, rew_dict, done_dict, info_dict = env.step(action_dict)
-# Installation
+    # Update dead agents
+    for agent_id, is_done in done_dict.items():
+        if is_done and agent_id not in dead_agent_ids:
+            dead_agent_ids.append(agent_id)
-**Feel free to [open an issue](https://github.com/facebookresearch/nocturne/issues/new/choose) at any time if you encounter a problem, need some help with installing or using Nocturne, want to ask us any related question, or even propose a new feature. We will be happy to help!**
+    # Reset if all agents are done
+    if done_dict["__all__"]:
+        obs_dict = env.reset()
+        dead_agent_ids = []
-## Dependencies
+# Close environment
+env.close()
+```
-[CMake](https://cmake.org/) is required to compile the C++ library.
+## Implemented algorithms
-Run `cmake --version` to see whether CMake is already installed in your environment. If not, refer to the CMake website instructions for installation, or you can use:
+| Algorithm | Reference | Code | Compatible with | Notes |
+| -------------------------------------- | ---------------------------------------------------------- | ----- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| PPO **single-agent** control | [Schulman et al., 2017](https://arxiv.org/pdf/1707.06347.pdf) | [ppo_with_sb3.ipynb](https://github.com/Emerge-Lab/nocturne_lab/blob/feature/nocturne_fork_cleanup/examples/04_ppo_with_sb3.ipynb) | Stable-Baselines3 | |
+| PPO **multi-agent** control | [Schulman et al., 2017](https://arxiv.org/pdf/1707.06347.pdf) | `#TODO` | Stable-Baselines3 | SB3 doesn't support multi-agent environments. We use the `VecEnv` class to treat observations from multiple agents as a set of vectorized single-agent environments. |
+| | | | | |
+| | | | | |
-- `sudo apt-get -y install cmake` (Linux)
-- `brew install cmake` (MacOS)
+## Installation
-### All machines besides OS with Mac M1 chip follow instructions below
-Nocturne uses [SFML](https://github.com/SFML/SFML) for drawing and visualization, as well as on [pybind11](https://pybind11.readthedocs.io/en/latest/) for compiling the C++ code as a Python library.
+### Requirements
-To install SFML:
+* Python (>=3.10)
-- `sudo apt-get install libsfml-dev` (Linux)
-- `brew install sfml` (MacOS)
+### Virtual environment
+The options below describe different ways to set up a virtual environment. Any of them works, although `pyenv` is recommended.
-pybind11 is included as a submodule and will be installed in the next step.
+> _Note:_ The virtual environment needs to be **activated each time** before you start working.
-### Machines with a Mac M1 chip
-Unfortunately if you have a Mac M1 chip you need to ensure that your SFML version is x86_64 instead of arm64; by default brew will install the arm64 variant. The following instructions will help you do this.
+#### Option 1: `pyenv`
+Create a virtual environment by running:
-1. Make sure you have rosetta2 installed. You can do this by running `softwareupdate --install-rosetta` from the command line.
-2. Build an x86_64 version of brew (which you alias to brow) using the instructions here: [stackoverflow](https://stackoverflow.com/questions/64951024/how-can-i-run-two-isolated-installations-of-homebrew).
-3. Now, run `brow install sfml`
-then everything will compile fine.
+```shell
+pyenv virtualenv 3.10.12 nocturne_lab
+```
-## Installing Nocturne
+The virtual environment should be activated every time you start a new shell session before running subsequent commands:
-Start by cloning the repo:
+```shell
+pyenv shell nocturne_lab
+```
-```bash
-git clone https://github.com/facebookresearch/nocturne.git
-cd nocturne
+Fortunately, `pyenv` provides a way to assign a virtual environment to a directory. To set it for this project, run:
+```shell
+pyenv local nocturne_lab
```
-Then run the following to install git submodules:
+#### Option 2: `conda`
+Create a conda environment by running:
-```bash
-git submodule sync
-git submodule update --init --recursive
+```shell
+conda env create -f ./environment.yml
```
-If you are using [Conda](https://docs.conda.io/en/latest/) (recommended), you can instantiate an environment and install Nocturne into it with the following:
-
-```bash
-# create the environment and install the dependencies
-conda env create -f environment.yml
+This creates a conda environment called `nocturne_lab` that uses Python 3.10.
-# activate the environment where the Python library should be installed
-conda activate nocturne
+To activate the virtual environment, run:
-# run the C++ build and install Nocturne into the simulation environment
-python setup.py develop
+```shell
+conda activate nocturne_lab
```
-If you are not using Conda, simply run the last command to build and install Nocturne at your default Python path.
+#### Option 3: `venv`
+Create a virtual environment by running:
-You should then be all set to use the library. To find an example of constructing a Gym environment, using a basic Simulation, or rendering scenes, go to
-```examples``` and run respectively, ```create_env.py```, ```nocturne_functions.py``` or ```rendering.py```.
+```shell
+python -m venv .venv
+```
-Python tests can be run with `pytest`.
+The virtual environment should be activated every time you start a new shell session before running the subsequent command:
-
-Click here for a list of common installation errors
+```shell
+source .venv/bin/activate
+```
-### pybind11 installation errors
+### Dependencies
-If you are getting errors with pybind11, install it directly in your conda environment (eg. `conda install -c conda-forge pybind11` or `pip install pybind11`, cf. https://pybind11.readthedocs.io/en/latest/installing.html for more info).
-
+`poetry` is used to manage the project and its dependencies. Start by installing `poetry` in your virtual environment:
-## Dataset
+```shell
+pip install poetry
+```
+
+Before installing the package, you first need to synchronise and update the git submodules by running:
-### Downloading the dataset
-Two versions of the dataset are available:
-- a mini-one that is about 1 GB and consists of 1000 training files and 100 validation / test files at: [Dropbox Link](https://www.dropbox.com/sh/8mxue9rdoizen3h/AADGRrHYBb86pZvDnHplDGvXa?dl=0).
-- the full dataset (150 GB) and consists of 134453 training files and 12205 validation / test files: [Dropbox Link](https://www.dropbox.com/sh/wv75pjd8phxizj3/AABfNPWfjQdoTWvdVxsAjUL_a?dl=0)
+```shell
+# Synchronise and update git submodules
+git submodule sync
+git submodule update --init --recursive
+```
-Place the dataset in a folder of your choosing, unzip the folders inside of it, and change the DATA_FOLDER in ```cfgs/config.py``` to point to where you have
-downloaded it.
+Now install the package by running:
-### (Optional) Rebuilding the Dataset
-**Warning** this step is not necessary, the dataset has already been downloaded in the prior step. This is only needed if you want to rebuild the dataset from scratch.
+```shell
+poetry install
+```
-First, go to [Waymo Open](https://github.com/waymo-research/waymo-open-dataset/blob/master/tutorial/tutorial.ipynb) and follow the instructions to install the required packages. This may require additional steps if you are not on a Linux machine.
+> _Note:_ Under the hood, the `nocturne` package uses the `nocturne_cpp` Python package, which wraps the Nocturne C++ code base and exposes it to Python through `pybind11` bindings.
-If you do want to rebuild the dataset, download the Waymo Motion version 1.1 files.
-- Open ```cfgs/config.py``` and change ```DATA_FOLDER``` to be the path to your Waymo motion files
-- Run ```python scripts/json_generation/run_waymo_constructor.py --parallel --no_tl --all_files --datatype train valid```. This will construct, in parallel, a dataset of all the train and validation files in the waymo motion data. It should take on the order of 5 minutes with 20 CPUs. If you want to include traffic lights scenes, remove the ```--no_tl``` flag.
-- To ensure that only files that have a guaranteed solution are included (for example, that there are no files where the agent goal is across an apparently uncrossable road edge), run ```python scripts/json_generation/make_solvable_files.py --datatype train valid```.
-## C++ build instructions
+### Development setup
+To configure the development setup, run:
+```shell
+# Install poetry dev dependencies
+poetry install --only=dev
-If you want to build the C++ library independently of the Python one, run the following:
+# Install pre-commit (for flake8, isort, black, etc.)
+pre-commit install
-```bash
-cd nocturne/cpp
-mkdir build
-cd build
-cmake ..
-make
-make install
+# Optional: Install poetry docs dependencies
+poetry install --only=docs
```
-Subsequently, the C++ tests can be ran with `./tests/nocturne_test` from within the `nocturne/cpp/build` directory.
-
-# Usage
-
-To get a sense of available functionality in Nocturne, we have provided a few examples in the `examples` folder of how to construct the env (`create_env.py`), how to construct particular observations (`nocturne_functions.py`), and how to render results (`rendering.py`).
-
-**Note**: by default, Nocturne will log to ```$NOCTURNE_LOG_DIR``` which is set in ```nocturne/__init__.py``` and defaults to ```/logs```. If you'd like to log somewhere else, go to ```nocturne/__init__.py``` and change ```$NOCTURNE_LOG_DIR``` to a different path.
-
-The following goes over how to use training algorithms using the Nocturne environment.
-
-## Running the RL algorithms
-Nocturne comes shipped with a default Gym environment in ```nocturne/envs/base_env.py```. Atop this, we build integration for a few popular RL libraries.
-
-Nocturne by default comes with support for three versions of Proximal Policy Optimization:
-1. Sample Factory, a high throughput asynchronous PPO implementation (https://github.com/alex-petrenko/sample-factory)
-2. RLlib's PPO (https://github.com/ray-project/ray/tree/master/rllib)
-3. Multi-Agent PPO from (https://github.com/marlbenchmark/on-policy)
-Each algorithm is in its corresponding folder in examples and has a corresponding config file in cfgs/
-
-**Warning:** only the Sample Factory code has been extensively swept and tested. The default hyperparameters in there
-should work for training the agents from the corresponding paper. The other versions are provided for convenience
-but are not guaranteed to train a performant agent with the current hyperparameter settings.
-
-### Important hyperparameters to be aware of
-There are a few key hyperparameters that we expect users to care quite a bit about. Each of these can be toggled by adding
-```++=``` to the run command.
-- ```num_files```: this controls how many training scenarios are used. Set to -1 to use all of them.
-- ```max_num_vehicles```: this controls the maximum number of controllable agents in a scenario. If there are more than ```max_num_vehicles``` controllable agents in the scene, we sample ```max_num_vehicles``` randomly from them and set the remainder to be experts. If you want to ensure that all agents are controllable, simply pick a large number like 100.
-
-### Running Sample Factory
-Files from Sample Factory can be run from examples/sample_factory_files and should work by default by running
-```python examples/sample_factory_files/run_sample_factory.py algorithm=APPO```
-Additional config options for hyperparameters can be found in the config file.
-
-Once you have a trained checkpoint, you can visualize the results and make a movie of them by running ```python examples/sample_factory_files/visualize_sample_factory.py ```.
-
-*Warning*: because of how the algorithm is configured, Sample Factory works best with a fixed number of agents
-operating on a fixed horizon. To enable this, we use the config parameter ```max_num_vehicles``` which initializes the environment with only scenes that have fewer controllable agents than ```max_num_vehicles```. Additionally, if there are fewer than ```max_num_vehicles``` in the scene we add dummy agents that receive a vector of -1 at all timesteps. When a vehicle exits the scene we continue providing it a vector of -1 as an observation and a reward of 0.
-
-### Running RLlib
-Files from RLlib examples can be run from examples/rllib_files and should work by default by running
-```python examples/rllib_files/run_rllib.py```
-
-### Running on-policy PPO
-Files from [MAPPO](https://github.com/marlbenchmark/on-policy) examples can be run from examples/rllib_files and should work by default by running
-```python examples/on_policy_files/nocturne_runner.py algorithm=ppo```
-
-## Running the IL Algorithms
-Nocturne comes with a baseline implementation of behavioral cloning and a corresponding
-DataLoader. This can be run via ```python examples/imitation_learning/train.py```.
-
-# Contributors
-
-
-
-# License
-
-The majority of Nocturne is licensed under the MIT license, however portions of the project are available under separate license terms. The Waymo Motion Dataset License can be found at https://waymo.com/open/terms/.
+## Ongoing work
+
+Here is a list of features that we are developing:
+
+- @Daphne: Support for SB3's PPO algorithm with multi-agent control
+- @Alex: Logging and unit testing
+- @Tiyas: Random resets
diff --git a/algos/ppo/__init__.py b/algos/ppo/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/algos/ppo/base_runner.py b/algos/ppo/base_runner.py
deleted file mode 100644
index e4656b04..00000000
--- a/algos/ppo/base_runner.py
+++ /dev/null
@@ -1,180 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import wandb
-import os
-import numpy as np
-import torch
-from tensorboardX import SummaryWriter
-from algos.ppo.utils.shared_buffer import SharedReplayBuffer
-
-
-def _t2n(x):
- """Convert torch tensor to a numpy array."""
- return x.detach().cpu().numpy()
-
-
-class Runner(object):
- """
- Base class for training recurrent policies.
- :param config: (dict) Config dictionary containing parameters for training.
- """
-
- def __init__(self, config):
-
- self.all_args = config['cfg.algo']
- self.envs = config['envs']
- self.eval_envs = config['eval_envs']
- self.device = config['device']
- self.num_agents = config['num_agents']
- if config.__contains__("render_envs"):
- self.render_envs = config['render_envs']
-
- # parameters
- # self.env_name = self.all_args.env_name
- self.algorithm_name = self.all_args.algorithm_name
- self.experiment_name = self.all_args.experiment
- self.use_centralized_V = self.all_args.use_centralized_V
- self.use_obs_instead_of_state = self.all_args.use_obs_instead_of_state
- self.num_env_steps = self.all_args.num_env_steps
- self.episode_length = self.all_args.episode_length
- # self.episodes_per_thread = self.all_args.episodes_per_thread
- self.n_rollout_threads = self.all_args.n_rollout_threads
- self.n_eval_rollout_threads = self.all_args.n_eval_rollout_threads
- self.n_render_rollout_threads = self.all_args.n_render_rollout_threads
- self.use_linear_lr_decay = self.all_args.use_linear_lr_decay
- self.hidden_size = self.all_args.hidden_size
- self.use_wandb = self.all_args.wandb
- self.use_render = self.all_args.use_render
- self.recurrent_N = self.all_args.recurrent_N
-
- # interval
- self.save_interval = self.all_args.save_interval
- self.use_eval = self.all_args.use_eval
- self.eval_interval = self.all_args.eval_interval
- self.log_interval = self.all_args.log_interval
-
- # dir
- self.model_dir = self.all_args.model_dir
-
- if self.use_wandb:
- self.save_dir = str(wandb.run.dir)
- self.run_dir = str(wandb.run.dir)
- else:
- self.run_dir = config["logdir"]
- self.log_dir = str(self.run_dir / 'logs')
- if not os.path.exists(self.log_dir):
- os.makedirs(self.log_dir)
- self.writter = SummaryWriter(self.log_dir)
- self.save_dir = str(self.run_dir / 'models')
- if not os.path.exists(self.save_dir):
- os.makedirs(self.save_dir)
-
- from algos.ppo.r_mappo.r_mappo import R_MAPPO as TrainAlgo
- from algos.ppo.r_mappo.algorithm.rMAPPOPolicy import R_MAPPOPolicy as Policy
- share_observation_space = self.envs.share_observation_space[
- 0] if self.use_centralized_V else self.envs.observation_space[0]
-
- # policy network
- self.policy = Policy(self.all_args,
- self.envs.observation_space[0],
- share_observation_space,
- self.envs.action_space[0],
- device=self.device)
-
- if self.model_dir is not None:
- self.restore()
-
- # algorithm
- self.trainer = TrainAlgo(self.all_args,
- self.policy,
- device=self.device)
-
- # buffer
- self.buffer = SharedReplayBuffer(self.all_args, self.num_agents,
- self.envs.observation_space[0],
- share_observation_space,
- self.envs.action_space[0])
-
- def run(self):
- """Collect training data, perform training updates, and evaluate policy."""
- raise NotImplementedError
-
- def warmup(self):
- """Collect warmup pre-training data."""
- raise NotImplementedError
-
- def collect(self, step):
- """Collect rollouts for training."""
- raise NotImplementedError
-
- def insert(self, data):
- """
- Insert data into buffer.
- :param data: (Tuple) data to insert into training buffer.
- """
- raise NotImplementedError
-
- @torch.no_grad()
- def compute(self):
- """Calculate returns for the collected data."""
- self.trainer.prep_rollout()
- next_values = self.trainer.policy.get_values(
- np.concatenate(self.buffer.share_obs[-1]),
- np.concatenate(self.buffer.rnn_states_critic[-1]),
- np.concatenate(self.buffer.masks[-1]))
- next_values = np.array(
- np.split(_t2n(next_values), self.n_rollout_threads))
- self.buffer.compute_returns(next_values, self.trainer.value_normalizer)
-
- def train(self):
- """Train policies with data in buffer. """
- self.trainer.prep_training()
- train_infos = self.trainer.train(self.buffer)
- self.buffer.after_update()
- return train_infos
-
- def save(self):
- """Save policy's actor and critic networks."""
- policy_actor = self.trainer.policy.actor
- torch.save(policy_actor.state_dict(), str(self.save_dir) + "/actor.pt")
- policy_critic = self.trainer.policy.critic
- torch.save(policy_critic.state_dict(),
- str(self.save_dir) + "/critic.pt")
-
- def restore(self):
- """Restore policy's networks from a saved model."""
- policy_actor_state_dict = torch.load(str(self.model_dir) + '/actor.pt')
- self.policy.actor.load_state_dict(policy_actor_state_dict)
- if not self.all_args.use_render:
- policy_critic_state_dict = torch.load(
- str(self.model_dir) + '/critic.pt')
- self.policy.critic.load_state_dict(policy_critic_state_dict)
-
- def log_train(self, train_infos, total_num_steps):
- """
- Log training info.
- :param train_infos: (dict) information about training update.
- :param total_num_steps: (int) total number of training env steps.
- """
- for k, v in train_infos.items():
- if self.use_wandb:
- wandb.log({k: v}, step=total_num_steps)
- else:
- self.writter.add_scalars(k, {k: v}, total_num_steps)
-
- def log_env(self, env_infos, total_num_steps):
- """
- Log env info.
- :param env_infos: (dict) information about env state.
- :param total_num_steps: (int) total number of training env steps.
- """
- for k, v in env_infos.items():
- if len(v) > 0:
- if self.use_wandb:
- wandb.log({k: np.mean(v)}, step=total_num_steps)
- else:
- self.writter.add_scalars(k, {k: np.mean(v)},
- total_num_steps)
diff --git a/algos/ppo/env_wrappers.py b/algos/ppo/env_wrappers.py
deleted file mode 100644
index eb0191d8..00000000
--- a/algos/ppo/env_wrappers.py
+++ /dev/null
@@ -1,867 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-"""
-Modified from OpenAI Baselines code to work with multi-agent envs
-"""
-import numpy as np
-import torch
-from multiprocessing import Process, Pipe
-from abc import ABC, abstractmethod
-from algos.ppo.utils.util import tile_images
-
-
-class CloudpickleWrapper(object):
- """
- Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
- """
-
- def __init__(self, x):
- self.x = x
-
- def __getstate__(self):
- import cloudpickle
- return cloudpickle.dumps(self.x)
-
- def __setstate__(self, ob):
- import pickle
- self.x = pickle.loads(ob)
-
-
-class ShareVecEnv(ABC):
- """
- An abstract asynchronous, vectorized environment.
- Used to batch data from multiple copies of an environment, so that
- each observation becomes an batch of observations, and expected action is a batch of actions to
- be applied per-environment.
- """
- closed = False
- viewer = None
-
- metadata = {'render.modes': ['human', 'rgb_array']}
-
- def __init__(self, num_envs, observation_space, share_observation_space,
- action_space):
- self.num_envs = num_envs
- self.observation_space = observation_space
- self.share_observation_space = share_observation_space
- self.action_space = action_space
-
- @abstractmethod
- def reset(self):
- """
- Reset all the environments and return an array of
- observations, or a dict of observation arrays.
- If step_async is still doing work, that work will
- be cancelled and step_wait() should not be called
- until step_async() is invoked again.
- """
- pass
-
- @abstractmethod
- def step_async(self, actions):
- """
- Tell all the environments to start taking a step
- with the given actions.
- Call step_wait() to get the results of the step.
- You should not call this if a step_async run is
- already pending.
- """
- pass
-
- @abstractmethod
- def step_wait(self):
- """
- Wait for the step taken with step_async().
- Returns (obs, rews, dones, infos):
- - obs: an array of observations, or a dict of
- arrays of observations.
- - rews: an array of rewards
- - dones: an array of "episode done" booleans
- - infos: a sequence of info objects
- """
- pass
-
- def close_extras(self):
- """
- Clean up the extra resources, beyond what's in this base class.
- Only runs when not self.closed.
- """
- pass
-
- def close(self):
- if self.closed:
- return
- if self.viewer is not None:
- self.viewer.close()
- self.close_extras()
- self.closed = True
-
- def step(self, actions):
- """
- Step the environments synchronously.
- This is available for backwards compatibility.
- """
- self.step_async(actions)
- return self.step_wait()
-
- def render(self, mode='human'):
- imgs = self.get_images()
- bigimg = tile_images(imgs)
- if mode == 'human':
- self.get_viewer().imshow(bigimg)
- return self.get_viewer().isopen
- elif mode == 'rgb_array':
- return bigimg
- else:
- raise NotImplementedError
-
- def get_images(self):
- """
- Return RGB images from each environment
- """
- raise NotImplementedError
-
- @property
- def unwrapped(self):
- if isinstance(self, VecEnvWrapper):
- return self.venv.unwrapped
- else:
- return self
-
- def get_viewer(self):
- if self.viewer is None:
- from gym.envs.classic_control import rendering
- self.viewer = rendering.SimpleImageViewer()
- return self.viewer
-
-
-def worker(remote, parent_remote, env_fn_wrapper):
- parent_remote.close()
- env = env_fn_wrapper.x()
- while True:
- cmd, data = remote.recv()
- if cmd == 'step':
- ob, reward, done, info = env.step(data)
- if 'bool' in done.__class__.__name__:
- if done:
- ob = env.reset()
- else:
- if np.all(done):
- ob = env.reset()
-
- remote.send((ob, reward, done, info))
- elif cmd == 'reset':
- ob = env.reset()
- remote.send((ob))
- elif cmd == 'render':
- if data == "rgb_array":
- fr = env.render(mode=data)
- remote.send(fr)
- elif data == "human":
- env.render(mode=data)
- elif cmd == 'reset_task':
- ob = env.reset_task()
- remote.send(ob)
- elif cmd == 'close':
- env.close()
- remote.close()
- break
- elif cmd == 'get_spaces':
- remote.send((env.observation_space, env.share_observation_space,
- env.action_space))
- else:
- raise NotImplementedError
-
-
-class GuardSubprocVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns, spaces=None):
- """
- envs: list of gym environments to run in subprocesses
- """
- self.waiting = False
- self.closed = False
- nenvs = len(env_fns)
- self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
- self.ps = [
- Process(target=worker,
- args=(work_remote, remote, CloudpickleWrapper(env_fn)))
- for (work_remote, remote,
- env_fn) in zip(self.work_remotes, self.remotes, env_fns)
- ]
- for p in self.ps:
- p.daemon = False # could cause zombie process
- p.start()
- for remote in self.work_remotes:
- remote.close()
-
- self.remotes[0].send(('get_spaces', None))
- observation_space, share_observation_space, action_space = self.remotes[
- 0].recv()
- ShareVecEnv.__init__(self, len(env_fns), observation_space,
- share_observation_space, action_space)
-
- def step_async(self, actions):
-
- for remote, action in zip(self.remotes, actions):
- remote.send(('step', action))
- self.waiting = True
-
- def step_wait(self):
- results = [remote.recv() for remote in self.remotes]
- self.waiting = False
- obs, rews, dones, infos = zip(*results)
- return np.stack(obs), np.stack(rews), np.stack(dones), infos
-
- def reset(self):
- for remote in self.remotes:
- remote.send(('reset', None))
- obs = [remote.recv() for remote in self.remotes]
- return np.stack(obs)
-
- def reset_task(self):
- for remote in self.remotes:
- remote.send(('reset_task', None))
- return np.stack([remote.recv() for remote in self.remotes])
-
- def close(self):
- if self.closed:
- return
- if self.waiting:
- for remote in self.remotes:
- remote.recv()
- for remote in self.remotes:
- remote.send(('close', None))
- for p in self.ps:
- p.join()
- self.closed = True
-
-
-class SubprocVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns, spaces=None):
- """
- envs: list of gym environments to run in subprocesses
- """
- self.waiting = False
- self.closed = False
- nenvs = len(env_fns)
- self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
- self.ps = [
- Process(target=worker,
- args=(work_remote, remote, CloudpickleWrapper(env_fn)))
- for (work_remote, remote,
- env_fn) in zip(self.work_remotes, self.remotes, env_fns)
- ]
- for p in self.ps:
- p.daemon = True # if the main process crashes, we should not cause things to hang
- p.start()
- for remote in self.work_remotes:
- remote.close()
-
- self.remotes[0].send(('get_spaces', None))
- observation_space, share_observation_space, action_space = self.remotes[
- 0].recv()
- ShareVecEnv.__init__(self, len(env_fns), observation_space,
- share_observation_space, action_space)
-
- def step_async(self, actions):
- for remote, action in zip(self.remotes, actions):
- remote.send(('step', action))
- self.waiting = True
-
- def step_wait(self):
- results = [remote.recv() for remote in self.remotes]
- self.waiting = False
- obs, rews, dones, infos = zip(*results)
- return np.stack(obs), np.stack(rews), np.stack(dones), infos
-
- def reset(self):
- for remote in self.remotes:
- remote.send(('reset', None))
- obs = [remote.recv() for remote in self.remotes]
- return np.stack(obs)
-
- def reset_task(self):
- for remote in self.remotes:
- remote.send(('reset_task', None))
- return np.stack([remote.recv() for remote in self.remotes])
-
- def close(self):
- if self.closed:
- return
- if self.waiting:
- for remote in self.remotes:
- remote.recv()
- for remote in self.remotes:
- remote.send(('close', None))
- for p in self.ps:
- p.join()
- self.closed = True
-
- def render(self, mode="rgb_array"):
- for remote in self.remotes:
- remote.send(('render', mode))
- if mode == "rgb_array":
- frame = [remote.recv() for remote in self.remotes]
- return np.stack(frame)
-
-
-def shareworker(remote, parent_remote, env_fn_wrapper):
- parent_remote.close()
- env = env_fn_wrapper.x()
- while True:
- cmd, data = remote.recv()
- if cmd == 'step':
- ob, s_ob, reward, done, info, available_actions = env.step(data)
- if 'bool' in done.__class__.__name__:
- if done:
- ob, s_ob, available_actions = env.reset()
- else:
- if np.all(done):
- ob, s_ob, available_actions = env.reset()
-
- remote.send((ob, s_ob, reward, done, info, available_actions))
- elif cmd == 'reset':
- ob, s_ob, available_actions = env.reset()
- remote.send((ob, s_ob, available_actions))
- elif cmd == 'reset_task':
- ob = env.reset_task()
- remote.send(ob)
- elif cmd == 'render':
- if data == "rgb_array":
- fr = env.render(mode=data)
- remote.send(fr)
- elif data == "human":
- env.render(mode=data)
- elif cmd == 'close':
- env.close()
- remote.close()
- break
- elif cmd == 'get_spaces':
- remote.send((env.observation_space, env.share_observation_space,
- env.action_space))
- elif cmd == 'render_vulnerability':
- fr = env.render_vulnerability(data)
- remote.send((fr))
- else:
- raise NotImplementedError
-
-
-class ShareSubprocVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns, spaces=None):
- """
- env_fns: list of callables, each creating a gym environment to run in a subprocess
- """
- self.waiting = False
- self.closed = False
- nenvs = len(env_fns)
- self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
- self.ps = [
- Process(target=shareworker,
- args=(work_remote, remote, CloudpickleWrapper(env_fn)))
- for (work_remote, remote,
- env_fn) in zip(self.work_remotes, self.remotes, env_fns)
- ]
- for p in self.ps:
- p.daemon = True # if the main process crashes, we should not cause things to hang
- p.start()
- for remote in self.work_remotes:
- remote.close()
- self.remotes[0].send(('get_spaces', None))
- observation_space, share_observation_space, action_space = self.remotes[
- 0].recv()
- ShareVecEnv.__init__(self, len(env_fns), observation_space,
- share_observation_space, action_space)
-
- def step_async(self, actions):
- for remote, action in zip(self.remotes, actions):
- remote.send(('step', action))
- self.waiting = True
-
- def step_wait(self):
- results = [remote.recv() for remote in self.remotes]
- self.waiting = False
- obs, share_obs, rews, dones, infos, available_actions = zip(*results)
- return np.stack(obs), np.stack(share_obs), np.stack(rews), np.stack(
- dones), infos, np.stack(available_actions)
-
- def reset(self):
- for remote in self.remotes:
- remote.send(('reset', None))
- results = [remote.recv() for remote in self.remotes]
- obs, share_obs, available_actions = zip(*results)
- return np.stack(obs), np.stack(share_obs), np.stack(available_actions)
-
- def reset_task(self):
- for remote in self.remotes:
- remote.send(('reset_task', None))
- return np.stack([remote.recv() for remote in self.remotes])
-
- def close(self):
- if self.closed:
- return
- if self.waiting:
- for remote in self.remotes:
- remote.recv()
- for remote in self.remotes:
- remote.send(('close', None))
- for p in self.ps:
- p.join()
- self.closed = True
-
-
-def choosesimpleworker(remote, parent_remote, env_fn_wrapper):
- parent_remote.close()
- env = env_fn_wrapper.x()
- while True:
- cmd, data = remote.recv()
- if cmd == 'step':
- ob, reward, done, info = env.step(data)
- remote.send((ob, reward, done, info))
- elif cmd == 'reset':
- ob = env.reset(data)
- remote.send((ob))
- elif cmd == 'reset_task':
- ob = env.reset_task()
- remote.send(ob)
- elif cmd == 'close':
- env.close()
- remote.close()
- break
- elif cmd == 'render':
- if data == "rgb_array":
- fr = env.render(mode=data)
- remote.send(fr)
- elif data == "human":
- env.render(mode=data)
- elif cmd == 'get_spaces':
- remote.send((env.observation_space, env.share_observation_space,
- env.action_space))
- else:
- raise NotImplementedError
-
-
-class ChooseSimpleSubprocVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns, spaces=None):
- """
- env_fns: list of callables, each creating a gym environment to run in a subprocess
- """
- self.waiting = False
- self.closed = False
- nenvs = len(env_fns)
- self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
- self.ps = [
- Process(target=choosesimpleworker,
- args=(work_remote, remote, CloudpickleWrapper(env_fn)))
- for (work_remote, remote,
- env_fn) in zip(self.work_remotes, self.remotes, env_fns)
- ]
- for p in self.ps:
- p.daemon = True # if the main process crashes, we should not cause things to hang
- p.start()
- for remote in self.work_remotes:
- remote.close()
- self.remotes[0].send(('get_spaces', None))
- observation_space, share_observation_space, action_space = self.remotes[
- 0].recv()
- ShareVecEnv.__init__(self, len(env_fns), observation_space,
- share_observation_space, action_space)
-
- def step_async(self, actions):
- for remote, action in zip(self.remotes, actions):
- remote.send(('step', action))
- self.waiting = True
-
- def step_wait(self):
- results = [remote.recv() for remote in self.remotes]
- self.waiting = False
- obs, rews, dones, infos = zip(*results)
- return np.stack(obs), np.stack(rews), np.stack(dones), infos
-
- def reset(self, reset_choose):
- for remote, choose in zip(self.remotes, reset_choose):
- remote.send(('reset', choose))
- obs = [remote.recv() for remote in self.remotes]
- return np.stack(obs)
-
- def render(self, mode="rgb_array"):
- for remote in self.remotes:
- remote.send(('render', mode))
- if mode == "rgb_array":
- frame = [remote.recv() for remote in self.remotes]
- return np.stack(frame)
-
- def reset_task(self):
- for remote in self.remotes:
- remote.send(('reset_task', None))
- return np.stack([remote.recv() for remote in self.remotes])
-
- def close(self):
- if self.closed:
- return
- if self.waiting:
- for remote in self.remotes:
- remote.recv()
- for remote in self.remotes:
- remote.send(('close', None))
- for p in self.ps:
- p.join()
- self.closed = True
-
-
-def chooseworker(remote, parent_remote, env_fn_wrapper):
- parent_remote.close()
- env = env_fn_wrapper.x()
- while True:
- cmd, data = remote.recv()
- if cmd == 'step':
- ob, s_ob, reward, done, info, available_actions = env.step(data)
- remote.send((ob, s_ob, reward, done, info, available_actions))
- elif cmd == 'reset':
- ob, s_ob, available_actions = env.reset(data)
- remote.send((ob, s_ob, available_actions))
- elif cmd == 'reset_task':
- ob = env.reset_task()
- remote.send(ob)
- elif cmd == 'close':
- env.close()
- remote.close()
- break
- elif cmd == 'render':
- remote.send(env.render(mode='rgb_array'))
- elif cmd == 'get_spaces':
- remote.send((env.observation_space, env.share_observation_space,
- env.action_space))
- else:
- raise NotImplementedError
-
-
-class ChooseSubprocVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns, spaces=None):
- """
- env_fns: list of callables, each creating a gym environment to run in a subprocess
- """
- self.waiting = False
- self.closed = False
- nenvs = len(env_fns)
- self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
- self.ps = [
- Process(target=chooseworker,
- args=(work_remote, remote, CloudpickleWrapper(env_fn)))
- for (work_remote, remote,
- env_fn) in zip(self.work_remotes, self.remotes, env_fns)
- ]
- for p in self.ps:
- p.daemon = True # if the main process crashes, we should not cause things to hang
- p.start()
- for remote in self.work_remotes:
- remote.close()
- self.remotes[0].send(('get_spaces', None))
- observation_space, share_observation_space, action_space = self.remotes[
- 0].recv()
- ShareVecEnv.__init__(self, len(env_fns), observation_space,
- share_observation_space, action_space)
-
- def step_async(self, actions):
- for remote, action in zip(self.remotes, actions):
- remote.send(('step', action))
- self.waiting = True
-
- def step_wait(self):
- results = [remote.recv() for remote in self.remotes]
- self.waiting = False
- obs, share_obs, rews, dones, infos, available_actions = zip(*results)
- return np.stack(obs), np.stack(share_obs), np.stack(rews), np.stack(
- dones), infos, np.stack(available_actions)
-
- def reset(self, reset_choose):
- for remote, choose in zip(self.remotes, reset_choose):
- remote.send(('reset', choose))
- results = [remote.recv() for remote in self.remotes]
- obs, share_obs, available_actions = zip(*results)
- return np.stack(obs), np.stack(share_obs), np.stack(available_actions)
-
- def reset_task(self):
- for remote in self.remotes:
- remote.send(('reset_task', None))
- return np.stack([remote.recv() for remote in self.remotes])
-
- def close(self):
- if self.closed:
- return
- if self.waiting:
- for remote in self.remotes:
- remote.recv()
- for remote in self.remotes:
- remote.send(('close', None))
- for p in self.ps:
- p.join()
- self.closed = True
-
-
-def chooseguardworker(remote, parent_remote, env_fn_wrapper):
- parent_remote.close()
- env = env_fn_wrapper.x()
- while True:
- cmd, data = remote.recv()
- if cmd == 'step':
- ob, reward, done, info = env.step(data)
- remote.send((ob, reward, done, info))
- elif cmd == 'reset':
- ob = env.reset(data)
- remote.send((ob))
- elif cmd == 'reset_task':
- ob = env.reset_task()
- remote.send(ob)
- elif cmd == 'close':
- env.close()
- remote.close()
- break
- elif cmd == 'get_spaces':
- remote.send((env.observation_space, env.share_observation_space,
- env.action_space))
- else:
- raise NotImplementedError
-
-
-class ChooseGuardSubprocVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns, spaces=None):
- """
- env_fns: list of callables, each creating a gym environment to run in a subprocess
- """
- self.waiting = False
- self.closed = False
- nenvs = len(env_fns)
- self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
- self.ps = [
- Process(target=chooseguardworker,
- args=(work_remote, remote, CloudpickleWrapper(env_fn)))
- for (work_remote, remote,
- env_fn) in zip(self.work_remotes, self.remotes, env_fns)
- ]
- for p in self.ps:
- p.daemon = False # non-daemonic workers: they are joined, not killed, when the main process exits
- p.start()
- for remote in self.work_remotes:
- remote.close()
- self.remotes[0].send(('get_spaces', None))
- observation_space, share_observation_space, action_space = self.remotes[
- 0].recv()
- ShareVecEnv.__init__(self, len(env_fns), observation_space,
- share_observation_space, action_space)
-
- def step_async(self, actions):
- for remote, action in zip(self.remotes, actions):
- remote.send(('step', action))
- self.waiting = True
-
- def step_wait(self):
- results = [remote.recv() for remote in self.remotes]
- self.waiting = False
- obs, rews, dones, infos = zip(*results)
- return np.stack(obs), np.stack(rews), np.stack(dones), infos
-
- def reset(self, reset_choose):
- for remote, choose in zip(self.remotes, reset_choose):
- remote.send(('reset', choose))
- obs = [remote.recv() for remote in self.remotes]
- return np.stack(obs)
-
- def reset_task(self):
- for remote in self.remotes:
- remote.send(('reset_task', None))
- return np.stack([remote.recv() for remote in self.remotes])
-
- def close(self):
- if self.closed:
- return
- if self.waiting:
- for remote in self.remotes:
- remote.recv()
- for remote in self.remotes:
- remote.send(('close', None))
- for p in self.ps:
- p.join()
- self.closed = True
-
-
-# single env
-class DummyVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns):
- self.envs = [fn() for fn in env_fns]
- env = self.envs[0]
- ShareVecEnv.__init__(self, len(env_fns), env.observation_space,
- env.share_observation_space, env.action_space)
- self.actions = None
-
- def step_async(self, actions):
- self.actions = actions
-
- def step_wait(self):
- results = [env.step(a) for (a, env) in zip(self.actions, self.envs)]
- # TODO(eugenevinitsky) remove this
- obs, rews, dones, infos = map(np.array, zip(*results))
-
- for (i, done) in enumerate(dones):
- if 'bool' in done.__class__.__name__:
- if done:
- obs[i] = self.envs[i].reset()
- else:
- if np.all(done):
- obs[i] = self.envs[i].reset()
-
- self.actions = None
- return obs, rews, dones, infos
-
- def reset(self):
- obs = [env.reset() for env in self.envs]
- return np.array(obs)
-
- def close(self):
- for env in self.envs:
- env.close()
-
- def render(self, mode="human"):
- if mode == "rgb_array":
- return np.array([env.render(mode=mode) for env in self.envs])
- elif mode == "human":
- for env in self.envs:
- env.render(mode=mode)
- else:
- raise NotImplementedError
-
-
-class ShareDummyVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns):
- self.envs = [fn() for fn in env_fns]
- env = self.envs[0]
- ShareVecEnv.__init__(self, len(env_fns), env.observation_space,
- env.share_observation_space, env.action_space)
- self.actions = None
-
- def step_async(self, actions):
- self.actions = actions
-
- def step_wait(self):
- results = [env.step(a) for (a, env) in zip(self.actions, self.envs)]
- obs, share_obs, rews, dones, infos, available_actions = map(
- np.array, zip(*results))
-
- for (i, done) in enumerate(dones):
- if 'bool' in done.__class__.__name__:
- if done:
- obs[i], share_obs[i], available_actions[i] = self.envs[
- i].reset()
- else:
- if np.all(done):
- obs[i], share_obs[i], available_actions[i] = self.envs[
- i].reset()
- self.actions = None
-
- return obs, share_obs, rews, dones, infos, available_actions
-
- def reset(self):
- results = [env.reset() for env in self.envs]
- obs, share_obs, available_actions = map(np.array, zip(*results))
- return obs, share_obs, available_actions
-
- def close(self):
- for env in self.envs:
- env.close()
-
- def render(self, mode="human"):
- if mode == "rgb_array":
- return np.array([env.render(mode=mode) for env in self.envs])
- elif mode == "human":
- for env in self.envs:
- env.render(mode=mode)
- else:
- raise NotImplementedError
-
-
-class ChooseDummyVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns):
- self.envs = [fn() for fn in env_fns]
- env = self.envs[0]
- ShareVecEnv.__init__(self, len(env_fns), env.observation_space,
- env.share_observation_space, env.action_space)
- self.actions = None
-
- def step_async(self, actions):
- self.actions = actions
-
- def step_wait(self):
- results = [env.step(a) for (a, env) in zip(self.actions, self.envs)]
- obs, share_obs, rews, dones, infos, available_actions = map(
- np.array, zip(*results))
- self.actions = None
- return obs, share_obs, rews, dones, infos, available_actions
-
- def reset(self, reset_choose):
- results = [
- env.reset(choose) for (env, choose) in zip(self.envs, reset_choose)
- ]
- obs, share_obs, available_actions = map(np.array, zip(*results))
- return obs, share_obs, available_actions
-
- def close(self):
- for env in self.envs:
- env.close()
-
- def render(self, mode="human"):
- if mode == "rgb_array":
- return np.array([env.render(mode=mode) for env in self.envs])
- elif mode == "human":
- for env in self.envs:
- env.render(mode=mode)
- else:
- raise NotImplementedError
-
-
-class ChooseSimpleDummyVecEnv(ShareVecEnv):
-
- def __init__(self, env_fns):
- self.envs = [fn() for fn in env_fns]
- env = self.envs[0]
- ShareVecEnv.__init__(self, len(env_fns), env.observation_space,
- env.share_observation_space, env.action_space)
- self.actions = None
-
- def step_async(self, actions):
- self.actions = actions
-
- def step_wait(self):
- results = [env.step(a) for (a, env) in zip(self.actions, self.envs)]
- obs, rews, dones, infos = map(np.array, zip(*results))
- self.actions = None
- return obs, rews, dones, infos
-
- def reset(self, reset_choose):
- obs = [
- env.reset(choose) for (env, choose) in zip(self.envs, reset_choose)
- ]
- return np.array(obs)
-
- def close(self):
- for env in self.envs:
- env.close()
-
- def render(self, mode="human"):
- if mode == "rgb_array":
- return np.array([env.render(mode=mode) for env in self.envs])
- elif mode == "human":
- for env in self.envs:
- env.render(mode=mode)
- else:
- raise NotImplementedError
diff --git a/algos/ppo/ppo_utils/act.py b/algos/ppo/ppo_utils/act.py
deleted file mode 100644
index 387c9b3e..00000000
--- a/algos/ppo/ppo_utils/act.py
+++ /dev/null
@@ -1,199 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-from .distributions import Bernoulli, Categorical, DiagGaussian
-import torch
-import torch.nn as nn
-
-
-class ACTLayer(nn.Module):
- """
- MLP Module to compute actions.
- :param action_space: (gym.Space) action space.
- :param inputs_dim: (int) dimension of network input.
- :param use_orthogonal: (bool) whether to use orthogonal initialization.
- :param gain: (float) gain of the output layer of the network.
- """
-
- def __init__(self, action_space, inputs_dim, use_orthogonal, gain, device):
- super(ACTLayer, self).__init__()
- self.mixed_action = False
- self.multi_discrete = False
-
- if action_space.__class__.__name__ == "Discrete":
- action_dim = action_space.n
- self.action_out = Categorical(inputs_dim, action_dim,
- use_orthogonal, gain)
- elif action_space.__class__.__name__ == "Box":
- action_dim = action_space.shape[0]
- self.action_out = DiagGaussian(inputs_dim, action_dim,
- use_orthogonal, gain, device)
- elif action_space.__class__.__name__ == "MultiBinary":
- action_dim = action_space.shape[0]
- self.action_out = Bernoulli(inputs_dim, action_dim, use_orthogonal,
- gain)
- elif action_space.__class__.__name__ == "MultiDiscrete":
- self.multi_discrete = True
- action_dims = action_space.high - action_space.low + 1
- self.action_outs = []
- for action_dim in action_dims:
- self.action_outs.append(
- Categorical(inputs_dim, action_dim, use_orthogonal, gain))
- self.action_outs = nn.ModuleList(self.action_outs)
- else: # discrete + continuous
- self.mixed_action = True
- continous_dim = action_space[0].shape[0]
- discrete_dim = action_space[1].n
- self.action_outs = nn.ModuleList([
- DiagGaussian(inputs_dim, continous_dim, use_orthogonal, gain),
- Categorical(inputs_dim, discrete_dim, use_orthogonal, gain)
- ])
-
- self.to(device)
-
- def forward(self, x, available_actions=None, deterministic=False):
- """
- Compute actions and action logprobs from given input.
- :param x: (torch.Tensor) input to network.
- :param available_actions: (torch.Tensor) denotes which actions are available to agent
- (if None, all actions available)
- :param deterministic: (bool) if True, return the mode of the action distribution; otherwise sample from it.
-
- :return actions: (torch.Tensor) actions to take.
- :return action_log_probs: (torch.Tensor) log probabilities of taken actions.
- """
- if self.mixed_action:
- actions = []
- action_log_probs = []
- for action_out in self.action_outs:
- action_logit = action_out(x)
- action = action_logit.mode(
- ) if deterministic else action_logit.sample()
- action_log_prob = action_logit.log_probs(action)
- actions.append(action.float())
- action_log_probs.append(action_log_prob)
-
- actions = torch.cat(actions, -1)
- action_log_probs = torch.sum(torch.cat(action_log_probs, -1),
- -1,
- keepdim=True)
-
- elif self.multi_discrete:
- actions = []
- action_log_probs = []
- for action_out in self.action_outs:
- action_logit = action_out(x)
- action = action_logit.mode(
- ) if deterministic else action_logit.sample()
- action_log_prob = action_logit.log_probs(action)
- actions.append(action)
- action_log_probs.append(action_log_prob)
-
- actions = torch.cat(actions, -1)
- action_log_probs = torch.cat(action_log_probs, -1)
-
- else:
- action_logits = self.action_out(x)
- actions = action_logits.mode(
- ) if deterministic else action_logits.sample()
- action_log_probs = action_logits.log_probs(actions)
-
- return actions, action_log_probs
-
- def get_probs(self, x, available_actions=None):
- """
- Compute action probabilities from inputs.
- :param x: (torch.Tensor) input to network.
- :param available_actions: (torch.Tensor) denotes which actions are available to agent
- (if None, all actions available)
-
- :return action_probs: (torch.Tensor)
- """
- if self.mixed_action or self.multi_discrete:
- action_probs = []
- for action_out in self.action_outs:
- action_logit = action_out(x)
- action_prob = action_logit.probs
- action_probs.append(action_prob)
- action_probs = torch.cat(action_probs, -1)
- else:
- action_logits = self.action_out(x, available_actions)
- action_probs = action_logits.probs
-
- return action_probs
-
- def evaluate_actions(self,
- x,
- action,
- available_actions=None,
- active_masks=None):
- """
- Compute log probability and entropy of given actions.
- :param x: (torch.Tensor) input to network.
- :param action: (torch.Tensor) actions whose entropy and log probability to evaluate.
- :param available_actions: (torch.Tensor) denotes which actions are available to agent
- (if None, all actions available)
- :param active_masks: (torch.Tensor) denotes whether an agent is active or dead.
-
- :return action_log_probs: (torch.Tensor) log probabilities of the input actions.
- :return dist_entropy: (torch.Tensor) action distribution entropy for the given inputs.
- """
- if self.mixed_action:
- a, b = action.split((2, 1), -1)
- b = b.long()
- action = [a, b]
- action_log_probs = []
- dist_entropy = []
- for action_out, act in zip(self.action_outs, action):
- action_logit = action_out(x)
- action_log_probs.append(action_logit.log_probs(act))
- if active_masks is not None:
- if len(action_logit.entropy().shape) == len(
- active_masks.shape):
- dist_entropy.append(
- (action_logit.entropy() * active_masks).sum() /
- active_masks.sum())
- else:
- dist_entropy.append((action_logit.entropy() *
- active_masks.squeeze(-1)).sum() /
- active_masks.sum())
- else:
- dist_entropy.append(action_logit.entropy().mean())
-
- action_log_probs = torch.sum(torch.cat(action_log_probs, -1),
- -1,
- keepdim=True)
- dist_entropy = dist_entropy[0] / 2.0 + dist_entropy[
- 1] / 0.98 #! doesn't make sense
-
- elif self.multi_discrete:
- action = torch.transpose(action, 0, 1)
- action_log_probs = []
- dist_entropy = []
- for action_out, act in zip(self.action_outs, action):
- action_logit = action_out(x)
- action_log_probs.append(action_logit.log_probs(act))
- if active_masks is not None:
- dist_entropy.append(
- (action_logit.entropy() *
- active_masks.squeeze(-1)).sum() / active_masks.sum())
- else:
- dist_entropy.append(action_logit.entropy().mean())
-
- action_log_probs = torch.cat(action_log_probs,
- -1) # ! could be wrong
- dist_entropy = torch.tensor(dist_entropy).mean()
-
- else:
- action_logits = self.action_out(x, available_actions)
- action_log_probs = action_logits.log_probs(action)
- if active_masks is not None:
- dist_entropy = (
- action_logits.entropy() *
- active_masks.squeeze(-1)).sum() / active_masks.sum()
- else:
- dist_entropy = action_logits.entropy().mean()
-
- return action_log_probs, dist_entropy
diff --git a/algos/ppo/ppo_utils/cnn.py b/algos/ppo/ppo_utils/cnn.py
deleted file mode 100644
index 95fb8218..00000000
--- a/algos/ppo/ppo_utils/cnn.py
+++ /dev/null
@@ -1,80 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-from torchvision import transforms
-import torch.nn as nn
-from .util import init
-"""CNN Modules and utils."""
-
-
-class Flatten(nn.Module):
-
- def forward(self, x):
- return x.view(x.size(0), -1)
-
-
-class CNNLayer(nn.Module):
-
- def __init__(self,
- obs_shape,
- hidden_size,
- use_orthogonal,
- use_ReLU,
- kernel_size=3,
- stride=1):
- super(CNNLayer, self).__init__()
-
- active_func = [nn.Tanh(), nn.ReLU()][use_ReLU]
- init_method = [nn.init.xavier_uniform_,
- nn.init.orthogonal_][use_orthogonal]
- gain = nn.init.calculate_gain(['tanh', 'relu'][use_ReLU])
-
- self.resize = transforms.Resize(84)
-
- def init_(m):
- return init(m,
- init_method,
- lambda x: nn.init.constant_(x, 0),
- gain=gain)
-
- input_channel = obs_shape[0]
- input_width = obs_shape[1]
- input_height = obs_shape[2]
-
- self.cnn = nn.Sequential(
- init_(
- nn.Conv2d(in_channels=input_channel,
- out_channels=hidden_size // 2,
- kernel_size=kernel_size,
- stride=stride)), active_func, Flatten(),
- init_(
- nn.Linear(
- hidden_size // 2 * (input_width - kernel_size + stride) *
- (input_height - kernel_size + stride),
- hidden_size)), active_func,
- init_(nn.Linear(hidden_size, hidden_size)), active_func)
-
- def forward(self, x):
- # TODO(eugenevinitsky) hardcoding is bad
- x = self.resize(x) / 255.0
- x = self.cnn(x)
- return x
-
-
-class CNNBase(nn.Module):
-
- def __init__(self, args, obs_shape):
- super(CNNBase, self).__init__()
-
- self._use_orthogonal = args.use_orthogonal
- self._use_ReLU = args.use_ReLU
- self.hidden_size = args.hidden_size
-
- self.cnn = CNNLayer(obs_shape, self.hidden_size, self._use_orthogonal,
- self._use_ReLU)
-
- def forward(self, x):
- x = self.cnn(x)
- return x
diff --git a/algos/ppo/ppo_utils/distributions.py b/algos/ppo/ppo_utils/distributions.py
deleted file mode 100644
index 9249d700..00000000
--- a/algos/ppo/ppo_utils/distributions.py
+++ /dev/null
@@ -1,151 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch
-import torch.nn as nn
-from .util import init
-"""
-Modify standard PyTorch distributions to make them compatible with this codebase.
-"""
-
-#
-# Standardize distribution interfaces
-#
-
-
-# Categorical
-class FixedCategorical(torch.distributions.Categorical):
-
- def sample(self):
- return super().sample().unsqueeze(-1)
-
- def log_probs(self, actions):
- return (super().log_prob(actions.squeeze(-1)).view(
- actions.size(0), -1).sum(-1).unsqueeze(-1))
-
- def mode(self):
- return self.probs.argmax(dim=-1, keepdim=True)
-
-
-# Normal
-class FixedNormal(torch.distributions.Normal):
-
- def log_probs(self, actions):
- return super().log_prob(actions).sum(-1, keepdim=True)
-
- def entropy(self):
- return super().entropy().sum(-1)
-
- def mode(self):
- return self.mean
-
-
-# Bernoulli
-class FixedBernoulli(torch.distributions.Bernoulli):
-
- def log_probs(self, actions):
- return super().log_prob(actions).view(actions.size(0),
- -1).sum(-1).unsqueeze(-1)
-
- def entropy(self):
- return super().entropy().sum(-1)
-
- def mode(self):
- return torch.gt(self.probs, 0.5).float()
-
-
-class Categorical(nn.Module):
-
- def __init__(self,
- num_inputs,
- num_outputs,
- use_orthogonal=True,
- gain=0.01):
- super(Categorical, self).__init__()
- init_method = [nn.init.xavier_uniform_,
- nn.init.orthogonal_][use_orthogonal]
-
- def init_(m):
- return init(m, init_method, lambda x: nn.init.constant_(x, 0),
- gain)
-
- self.linear = init_(nn.Linear(num_inputs, num_outputs))
-
- def forward(self, x, available_actions=None):
- x = self.linear(x)
- if available_actions is not None:
- x[available_actions == 0] = -1e10
- return FixedCategorical(logits=x)
-
-
-class DiagGaussian(nn.Module):
-
- def __init__(self,
- num_inputs,
- num_outputs,
- use_orthogonal=True,
- gain=0.01,
- device='cpu'):
- super(DiagGaussian, self).__init__()
-
- init_method = [nn.init.xavier_uniform_,
- nn.init.orthogonal_][use_orthogonal]
-
- def init_(m):
- return init(m, init_method, lambda x: nn.init.constant_(x, 0),
- gain)
-
- self.fc_mean = init_(nn.Linear(num_inputs, num_outputs))
- self.logstd = AddBias(torch.zeros(num_outputs))
- self.to(device)
- self.device = device
-
- def forward(self, x):
- action_mean = self.fc_mean(x)
-
- # An ugly hack for my KFAC implementation.
- zeros = torch.zeros(action_mean.size()).to(self.device)
- # if x.is_cuda:
- # zeros = zeros.cuda()
-
- action_logstd = self.logstd(zeros)
- return FixedNormal(action_mean, action_logstd.exp())
-
-
-class Bernoulli(nn.Module):
-
- def __init__(self,
- num_inputs,
- num_outputs,
- use_orthogonal=True,
- gain=0.01):
- super(Bernoulli, self).__init__()
- init_method = [nn.init.xavier_uniform_,
- nn.init.orthogonal_][use_orthogonal]
-
- def init_(m):
- return init(m, init_method, lambda x: nn.init.constant_(x, 0),
- gain)
-
- self.linear = init_(nn.Linear(num_inputs, num_outputs))
-
- def forward(self, x):
- x = self.linear(x)
- return FixedBernoulli(logits=x)
-
-
-class AddBias(nn.Module):
-
- def __init__(self, bias):
- super(AddBias, self).__init__()
- self._bias = nn.Parameter(bias.unsqueeze(1))
-
- def forward(self, x):
- if x.dim() == 2:
- bias = self._bias.t().view(1, -1)
- else:
- bias = self._bias.t().view(1, -1, 1, 1)
-
- return x + bias
diff --git a/algos/ppo/ppo_utils/encoder.py b/algos/ppo/ppo_utils/encoder.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/algos/ppo/ppo_utils/mlp.py b/algos/ppo/ppo_utils/mlp.py
deleted file mode 100644
index b066a3d2..00000000
--- a/algos/ppo/ppo_utils/mlp.py
+++ /dev/null
@@ -1,68 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch.nn as nn
-from .util import init, get_clones
-"""MLP modules."""
-
-
-class MLPLayer(nn.Module):
-
- def __init__(self, input_dim, hidden_size, layer_N, use_orthogonal,
- use_ReLU):
- super(MLPLayer, self).__init__()
- self._layer_N = layer_N
-
- active_func = [nn.Tanh(), nn.ReLU()][use_ReLU]
- init_method = [nn.init.xavier_uniform_,
- nn.init.orthogonal_][use_orthogonal]
- gain = nn.init.calculate_gain(['tanh', 'relu'][use_ReLU])
-
- def init_(m):
- return init(m,
- init_method,
- lambda x: nn.init.constant_(x, 0),
- gain=gain)
-
- self.fc1 = nn.Sequential(init_(nn.Linear(input_dim, hidden_size)),
- active_func, nn.LayerNorm(hidden_size))
- self.fc_h = nn.Sequential(init_(nn.Linear(hidden_size, hidden_size)),
- active_func, nn.LayerNorm(hidden_size))
- self.fc2 = get_clones(self.fc_h, self._layer_N)
-
- def forward(self, x):
- x = self.fc1(x)
- for i in range(self._layer_N):
- x = self.fc2[i](x)
- return x
-
-
-class MLPBase(nn.Module):
-
- def __init__(self, args, obs_shape, cat_self=True, attn_internal=False):
- super(MLPBase, self).__init__()
-
- self._use_feature_normalization = args.use_feature_normalization
- self._use_orthogonal = args.use_orthogonal
- self._use_ReLU = args.use_ReLU
- self._stacked_frames = args.stacked_frames
- self._layer_N = args.layer_N
- self.hidden_size = args.hidden_size
-
- obs_dim = obs_shape[0]
-
- if self._use_feature_normalization:
- self.feature_norm = nn.LayerNorm(obs_dim)
-
- self.mlp = MLPLayer(obs_dim, self.hidden_size, self._layer_N,
- self._use_orthogonal, self._use_ReLU)
-
- def forward(self, x):
- if self._use_feature_normalization:
- x = self.feature_norm(x)
-
- x = self.mlp(x)
-
- return x
diff --git a/algos/ppo/ppo_utils/popart.py b/algos/ppo/ppo_utils/popart.py
deleted file mode 100644
index 7dd4be1b..00000000
--- a/algos/ppo/ppo_utils/popart.py
+++ /dev/null
@@ -1,120 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import math
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class PopArt(torch.nn.Module):
-
- def __init__(self,
- input_shape,
- output_shape,
- norm_axes=1,
- beta=0.99999,
- epsilon=1e-5,
- device=torch.device("cpu")):
-
- super(PopArt, self).__init__()
-
- self.beta = beta
- self.epsilon = epsilon
- self.norm_axes = norm_axes
- self.tpdv = dict(dtype=torch.float32, device=device)
-
- self.input_shape = input_shape
- self.output_shape = output_shape
-
- self.weight = nn.Parameter(torch.Tensor(output_shape,
- input_shape)).to(**self.tpdv)
- self.bias = nn.Parameter(torch.Tensor(output_shape)).to(**self.tpdv)
-
- self.stddev = nn.Parameter(torch.ones(output_shape),
- requires_grad=False).to(**self.tpdv)
- self.mean = nn.Parameter(torch.zeros(output_shape),
- requires_grad=False).to(**self.tpdv)
- self.mean_sq = nn.Parameter(torch.zeros(output_shape),
- requires_grad=False).to(**self.tpdv)
- self.debiasing_term = nn.Parameter(torch.tensor(0.0),
- requires_grad=False).to(**self.tpdv)
-
- self.reset_parameters()
-
- def reset_parameters(self):
- torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
- if self.bias is not None:
- fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(
- self.weight)
- bound = 1 / math.sqrt(fan_in)
- torch.nn.init.uniform_(self.bias, -bound, bound)
- self.mean.zero_()
- self.mean_sq.zero_()
- self.debiasing_term.zero_()
-
- def forward(self, input_vector):
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- return F.linear(input_vector, self.weight, self.bias)
-
- @torch.no_grad()
- def update(self, input_vector):
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- old_mean, old_var = self.debiased_mean_var()
- old_stddev = torch.sqrt(old_var)
-
- batch_mean = input_vector.mean(dim=tuple(range(self.norm_axes)))
- batch_sq_mean = (input_vector**2).mean(
- dim=tuple(range(self.norm_axes)))
-
- self.mean.mul_(self.beta).add_(batch_mean * (1.0 - self.beta))
- self.mean_sq.mul_(self.beta).add_(batch_sq_mean * (1.0 - self.beta))
- self.debiasing_term.mul_(self.beta).add_(1.0 * (1.0 - self.beta))
-
- self.stddev = (self.mean_sq - self.mean**2).sqrt().clamp(min=1e-4)
-
- new_mean, new_var = self.debiased_mean_var()
- new_stddev = torch.sqrt(new_var)
-
- self.weight = self.weight * old_stddev / new_stddev
- self.bias = (old_stddev * self.bias + old_mean - new_mean) / new_stddev
-
- def debiased_mean_var(self):
- debiased_mean = self.mean / self.debiasing_term.clamp(min=self.epsilon)
- debiased_mean_sq = self.mean_sq / self.debiasing_term.clamp(
- min=self.epsilon)
- debiased_var = (debiased_mean_sq - debiased_mean**2).clamp(min=1e-2)
- return debiased_mean, debiased_var
-
- def normalize(self, input_vector):
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- mean, var = self.debiased_mean_var()
- out = (input_vector - mean[(None, ) * self.norm_axes]
- ) / torch.sqrt(var)[(None, ) * self.norm_axes]
-
- return out
-
- def denormalize(self, input_vector):
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- mean, var = self.debiased_mean_var()
- out = input_vector * torch.sqrt(var)[(None, ) * self.norm_axes] + mean[
- (None, ) * self.norm_axes]
-
- out = out.cpu().numpy()
-
- return out
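Review note on the removed `PopArt.update`: its last two assignments rescale the linear head so that the de-normalized prediction stays the same when the running statistics change. The NumPy sketch below checks that identity in isolation. The function name `rescale_head` and all numeric values are hypothetical and do not come from the deleted file.

```python
import numpy as np


def rescale_head(weight, bias, old_mean, old_std, new_mean, new_std):
    """Rescale a linear head so its de-normalized output is unchanged.

    Mirrors the weight/bias formulas in the removed PopArt.update; the
    function name and argument layout are assumptions for illustration.
    """
    new_weight = weight * old_std / new_std
    new_bias = (old_std * bias + old_mean - new_mean) / new_std
    return new_weight, new_bias


# Illustrative values only.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
weight = rng.normal(size=(1, 3))
bias = np.array([0.5])
old_mean, old_std = 1.0, 2.0
new_mean, new_std = 3.0, 5.0

# De-normalized prediction before the statistics change.
before = old_std * (features @ weight.T + bias) + old_mean

# De-normalized prediction after rescaling the head with the new statistics.
new_weight, new_bias = rescale_head(weight, bias, old_mean, old_std, new_mean, new_std)
after = new_std * (features @ new_weight.T + new_bias) + new_mean
```

Expanding `after` algebraically gives `old_std * (features @ weight.T + bias) + old_mean`, so the two arrays should agree up to floating-point error.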
diff --git a/algos/ppo/ppo_utils/rnn.py b/algos/ppo/ppo_utils/rnn.py
deleted file mode 100644
index 2720be9c..00000000
--- a/algos/ppo/ppo_utils/rnn.py
+++ /dev/null
@@ -1,90 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch
-import torch.nn as nn
-"""RNN modules."""
-
-
-class RNNLayer(nn.Module):
-
- def __init__(self, inputs_dim, outputs_dim, recurrent_N, use_orthogonal,
- device):
- super(RNNLayer, self).__init__()
- self._recurrent_N = recurrent_N
- self._use_orthogonal = use_orthogonal
-
- self.rnn = nn.GRU(inputs_dim,
- outputs_dim,
- num_layers=self._recurrent_N)
- for name, param in self.rnn.named_parameters():
- if 'bias' in name:
- nn.init.constant_(param, 0)
- elif 'weight' in name:
- if self._use_orthogonal:
- nn.init.orthogonal_(param)
- else:
- nn.init.xavier_uniform_(param)
- self.norm = nn.LayerNorm(outputs_dim)
- self.to(device)
-
- def forward(self, x, hxs, masks):
- if x.size(0) == hxs.size(0):
- x, hxs = self.rnn(
- x.unsqueeze(0),
- (hxs *
- masks.repeat(1, self._recurrent_N).unsqueeze(-1)).transpose(
- 0, 1).contiguous())
- x = x.squeeze(0)
- hxs = hxs.transpose(0, 1)
- else:
- # x is a (T, N, -1) tensor that has been flattened to (T * N, -1)
- N = hxs.size(0)
- T = int(x.size(0) / N)
-
- # unflatten
- x = x.view(T, N, x.size(1))
-
- # Same deal with masks
- masks = masks.view(T, N)
-
- # Let's figure out which steps in the sequence have a zero for any agent
- # We will always assume t=0 has a zero in it as that makes the logic cleaner
- has_zeros = ((masks[1:] == 0.0).any(
- dim=-1).nonzero().squeeze().cpu())
-
- # +1 to correct the masks[1:]
- if has_zeros.dim() == 0:
- # Deal with scalar
- has_zeros = [has_zeros.item() + 1]
- else:
- has_zeros = (has_zeros + 1).numpy().tolist()
-
- # add t=0 and t=T to the list
- has_zeros = [0] + has_zeros + [T]
-
- hxs = hxs.transpose(0, 1)
-
- outputs = []
- for i in range(len(has_zeros) - 1):
- # We can now process steps that don't have any zeros in masks together!
- # This is much faster
- start_idx = has_zeros[i]
- end_idx = has_zeros[i + 1]
- temp = (hxs * masks[start_idx].view(1, -1, 1).repeat(
- self._recurrent_N, 1, 1)).contiguous()
- rnn_scores, hxs = self.rnn(x[start_idx:end_idx], temp)
- outputs.append(rnn_scores)
-
- # assert len(outputs) == T
- # x is a (T, N, -1) tensor
- x = torch.cat(outputs, dim=0)
-
- # flatten
- x = x.reshape(T * N, -1)
- hxs = hxs.transpose(0, 1)
-
- x = self.norm(x)
- return x, hxs
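Review note on the removed `RNNLayer.forward`: the sequence branch splits the `T` timesteps at every step where any agent's mask is zero, then runs the GRU once per segment. A minimal NumPy sketch of just the boundary computation follows. The helper name `segment_boundaries` is hypothetical, and `np.flatnonzero` stands in for the `nonzero().squeeze()` handling in the original.

```python
import numpy as np


def segment_boundaries(masks):
    """Return the segment start/end indices for a (T, N) mask array.

    A step t >= 1 starts a new segment when any agent has mask 0 at t.
    Step 0 is always treated as a boundary, as in the removed code.
    """
    num_steps = masks.shape[0]
    # masks[1:] skips t=0, so add 1 to map back to absolute step indices.
    reset_steps = np.flatnonzero((masks[1:] == 0.0).any(axis=-1)) + 1
    return [0] + reset_steps.tolist() + [num_steps]


# Illustrative rollout: 5 steps, 2 agents, agent 1 resets at step 2.
masks = np.ones((5, 2))
masks[2, 1] = 0.0
bounds = segment_boundaries(masks)

# Consecutive pairs give the slices processed by one recurrent call each.
segments = list(zip(bounds[:-1], bounds[1:]))
```

Each `(start, end)` pair corresponds to one `x[start_idx:end_idx]` slice in the loop, with the hidden state multiplied by `masks[start_idx]` before the call.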
diff --git a/algos/ppo/ppo_utils/util.py b/algos/ppo/ppo_utils/util.py
deleted file mode 100644
index 6f2735cc..00000000
--- a/algos/ppo/ppo_utils/util.py
+++ /dev/null
@@ -1,25 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import copy
-import numpy as np
-
-import torch
-import torch.nn as nn
-
-
-def init(module, weight_init, bias_init, gain=1):
- weight_init(module.weight.data, gain=gain)
- bias_init(module.bias.data)
- return module
-
-
-def get_clones(module, N):
- return nn.ModuleList([copy.deepcopy(module) for i in range(N)])
-
-
-def check(input):
- output = torch.from_numpy(input) if type(input) == np.ndarray else input
- return output
diff --git a/algos/ppo/r_mappo/__init__.py b/algos/ppo/r_mappo/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/algos/ppo/r_mappo/algorithm/rMAPPOPolicy.py b/algos/ppo/r_mappo/algorithm/rMAPPOPolicy.py
deleted file mode 100644
index c211cdb6..00000000
--- a/algos/ppo/r_mappo/algorithm/rMAPPOPolicy.py
+++ /dev/null
@@ -1,156 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch
-from algos.ppo.r_mappo.algorithm.r_actor_critic import R_Actor, R_Critic
-from algos.ppo.utils.util import update_linear_schedule
-
-
-class R_MAPPOPolicy:
- """
- MAPPO Policy class. Wraps actor and critic networks to compute actions and value function predictions.
-
- :param args: (argparse.Namespace) arguments containing relevant model and policy information.
- :param obs_space: (gym.Space) observation space.
- :param cent_obs_space: (gym.Space) value function input space (centralized input for MAPPO, decentralized for IPPO).
- :param act_space: (gym.Space) action space.
- :param device: (torch.device) specifies the device to run on (cpu/gpu).
- """
-
- def __init__(self,
- args,
- obs_space,
- cent_obs_space,
- act_space,
- device=torch.device("cpu")):
- self.device = device
- self.lr = args.lr
- self.critic_lr = args.critic_lr
- self.opti_eps = args.opti_eps
- self.weight_decay = args.weight_decay
-
- self.obs_space = obs_space
- self.share_obs_space = cent_obs_space
- self.act_space = act_space
-
- self.actor = R_Actor(args, self.obs_space, self.act_space, self.device)
- self.critic = R_Critic(args, self.share_obs_space, self.device)
-
- self.actor_optimizer = torch.optim.Adam(self.actor.parameters(),
- lr=self.lr,
- eps=self.opti_eps,
- weight_decay=self.weight_decay)
- self.critic_optimizer = torch.optim.Adam(
- self.critic.parameters(),
- lr=self.critic_lr,
- eps=self.opti_eps,
- weight_decay=self.weight_decay)
-
- def lr_decay(self, episode, episodes):
- """
- Decay the actor and critic learning rates.
- :param episode: (int) current training episode.
- :param episodes: (int) total number of training episodes.
- """
- update_linear_schedule(self.actor_optimizer, episode, episodes,
- self.lr)
- update_linear_schedule(self.critic_optimizer, episode, episodes,
- self.critic_lr)
-
- def get_actions(self,
- cent_obs,
- obs,
- rnn_states_actor,
- rnn_states_critic,
- masks,
- available_actions=None,
- deterministic=False):
- """
- Compute actions and value function predictions for the given inputs.
- :param cent_obs (np.ndarray): centralized input to the critic.
- :param obs (np.ndarray): local agent inputs to the actor.
- :param rnn_states_actor: (np.ndarray) if actor is RNN, RNN states for actor.
- :param rnn_states_critic: (np.ndarray) if critic is RNN, RNN states for critic.
- :param masks: (np.ndarray) denotes points at which RNN states should be reset.
- :param available_actions: (np.ndarray) denotes which actions are available to agent
- (if None, all actions available)
- :param deterministic: (bool) whether the action should be mode of distribution or should be sampled.
-
- :return values: (torch.Tensor) value function predictions.
- :return actions: (torch.Tensor) actions to take.
- :return action_log_probs: (torch.Tensor) log probabilities of chosen actions.
- :return rnn_states_actor: (torch.Tensor) updated actor network RNN states.
- :return rnn_states_critic: (torch.Tensor) updated critic network RNN states.
- """
- actions, action_log_probs, rnn_states_actor = self.actor(
- obs, rnn_states_actor, masks, available_actions, deterministic)
-
- values, rnn_states_critic = self.critic(cent_obs, rnn_states_critic,
- masks)
- return values, actions, action_log_probs, rnn_states_actor, rnn_states_critic
-
- def get_values(self, cent_obs, rnn_states_critic, masks):
- """
- Get value function predictions.
- :param cent_obs (np.ndarray): centralized input to the critic.
- :param rnn_states_critic: (np.ndarray) if critic is RNN, RNN states for critic.
- :param masks: (np.ndarray) denotes points at which RNN states should be reset.
-
- :return values: (torch.Tensor) value function predictions.
- """
- values, _ = self.critic(cent_obs, rnn_states_critic, masks)
- return values
-
- def evaluate_actions(self,
- cent_obs,
- obs,
- rnn_states_actor,
- rnn_states_critic,
- action,
- masks,
- available_actions=None,
- active_masks=None):
- """
- Get action logprobs / entropy and value function predictions for actor update.
- :param cent_obs (np.ndarray): centralized input to the critic.
- :param obs (np.ndarray): local agent inputs to the actor.
- :param rnn_states_actor: (np.ndarray) if actor is RNN, RNN states for actor.
- :param rnn_states_critic: (np.ndarray) if critic is RNN, RNN states for critic.
- :param action: (np.ndarray) actions whose log probabilities and entropy to compute.
- :param masks: (np.ndarray) denotes points at which RNN states should be reset.
- :param available_actions: (np.ndarray) denotes which actions are available to agent
- (if None, all actions available)
- :param active_masks: (torch.Tensor) denotes whether an agent is active or dead.
-
- :return values: (torch.Tensor) value function predictions.
- :return action_log_probs: (torch.Tensor) log probabilities of the input actions.
- :return dist_entropy: (torch.Tensor) action distribution entropy for the given inputs.
- """
- action_log_probs, dist_entropy = self.actor.evaluate_actions(
- obs, rnn_states_actor, action, masks, available_actions,
- active_masks)
-
- values, _ = self.critic(cent_obs, rnn_states_critic, masks)
- return values, action_log_probs, dist_entropy
-
- def act(self,
- obs,
- rnn_states_actor,
- masks,
- available_actions=None,
- deterministic=False):
- """
- Compute actions using the given inputs.
- :param obs (np.ndarray): local agent inputs to the actor.
- :param rnn_states_actor: (np.ndarray) if actor is RNN, RNN states for actor.
- :param masks: (np.ndarray) denotes points at which RNN states should be reset.
- :param available_actions: (np.ndarray) denotes which actions are available to agent
- (if None, all actions available)
- :param deterministic: (bool) whether the action should be mode of distribution or should be sampled.
- """
- actions, _, rnn_states_actor = self.actor(obs, rnn_states_actor, masks,
- available_actions,
- deterministic)
- return actions, rnn_states_actor
diff --git a/algos/ppo/r_mappo/algorithm/r_actor_critic.py b/algos/ppo/r_mappo/algorithm/r_actor_critic.py
deleted file mode 100644
index ee9dfdf0..00000000
--- a/algos/ppo/r_mappo/algorithm/r_actor_critic.py
+++ /dev/null
@@ -1,197 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch
-import torch.nn as nn
-from algos.ppo.ppo_utils.util import init, check
-from algos.ppo.ppo_utils.mlp import MLPBase
-from algos.ppo.ppo_utils.rnn import RNNLayer
-from algos.ppo.ppo_utils.act import ACTLayer
-from algos.ppo.ppo_utils.popart import PopArt
-from algos.ppo.utils.util import get_shape_from_obs_space
-
-
-class R_Actor(nn.Module):
- """
- Actor network class for MAPPO. Outputs actions given observations.
- :param args: (argparse.Namespace) arguments containing relevant model information.
- :param obs_space: (gym.Space) observation space.
- :param action_space: (gym.Space) action space.
- :param device: (torch.device) specifies the device to run on (cpu/gpu).
- """
-
- def __init__(self,
- args,
- obs_space,
- action_space,
- device=torch.device("cpu")):
- super(R_Actor, self).__init__()
- self.hidden_size = args.hidden_size
-
- self._gain = args.gain
- self._use_orthogonal = args.use_orthogonal
- self._use_policy_active_masks = args.use_policy_active_masks
- self._use_naive_recurrent_policy = args.use_naive_recurrent_policy
- self._use_recurrent_policy = args.use_recurrent_policy
- self._recurrent_N = args.recurrent_N
- self.tpdv = dict(dtype=torch.float32, device=device)
-
- obs_shape = get_shape_from_obs_space(obs_space)
- base = MLPBase
- self.base = base(args, obs_shape)
-
- if self._use_naive_recurrent_policy or self._use_recurrent_policy:
- self.rnn = RNNLayer(self.hidden_size, self.hidden_size,
- self._recurrent_N, self._use_orthogonal,
- device)
-
- self.act = ACTLayer(action_space, self.hidden_size,
- self._use_orthogonal, self._gain, device)
-
- self.to(device)
-
- def forward(self,
- obs,
- rnn_states,
- masks,
- available_actions=None,
- deterministic=False):
- """
- Compute actions from the given inputs.
- :param obs: (np.ndarray / torch.Tensor) observation inputs into network.
- :param rnn_states: (np.ndarray / torch.Tensor) if RNN network, hidden states for RNN.
- :param masks: (np.ndarray / torch.Tensor) mask tensor denoting if hidden states should be reinitialized to zeros.
- :param available_actions: (np.ndarray / torch.Tensor) denotes which actions are available to agent
- (if None, all actions available)
- :param deterministic: (bool) whether to sample from action distribution or return the mode.
-
- :return actions: (torch.Tensor) actions to take.
- :return action_log_probs: (torch.Tensor) log probabilities of taken actions.
- :return rnn_states: (torch.Tensor) updated RNN hidden states.
- """
- obs = check(obs).to(**self.tpdv)
- rnn_states = check(rnn_states).to(**self.tpdv)
- masks = check(masks).to(**self.tpdv)
- if available_actions is not None:
- available_actions = check(available_actions).to(**self.tpdv)
-
- actor_features = self.base(obs)
-
- if self._use_naive_recurrent_policy or self._use_recurrent_policy:
- actor_features, rnn_states = self.rnn(actor_features, rnn_states,
- masks)
-
- actions, action_log_probs = self.act(actor_features, available_actions,
- deterministic)
-
- return actions, action_log_probs, rnn_states
-
- def evaluate_actions(self,
- obs,
- rnn_states,
- action,
- masks,
- available_actions=None,
- active_masks=None):
- """
- Compute log probability and entropy of given actions.
- :param obs: (torch.Tensor) observation inputs into network.
- :param action: (torch.Tensor) actions whose entropy and log probability to evaluate.
- :param rnn_states: (torch.Tensor) if RNN network, hidden states for RNN.
- :param masks: (torch.Tensor) mask tensor denoting if hidden states should be reinitialized to zeros.
- :param available_actions: (torch.Tensor) denotes which actions are available to agent
- (if None, all actions available)
- :param active_masks: (torch.Tensor) denotes whether an agent is active or dead.
-
- :return action_log_probs: (torch.Tensor) log probabilities of the input actions.
- :return dist_entropy: (torch.Tensor) action distribution entropy for the given inputs.
- """
- obs = check(obs).to(**self.tpdv)
- rnn_states = check(rnn_states).to(**self.tpdv)
- action = check(action).to(**self.tpdv)
- masks = check(masks).to(**self.tpdv)
- if available_actions is not None:
- available_actions = check(available_actions).to(**self.tpdv)
-
- if active_masks is not None:
- active_masks = check(active_masks).to(**self.tpdv)
-
- actor_features = self.base(obs)
-
- if self._use_naive_recurrent_policy or self._use_recurrent_policy:
- actor_features, rnn_states = self.rnn(actor_features, rnn_states,
- masks)
-
- action_log_probs, dist_entropy = self.act.evaluate_actions(
- actor_features,
- action,
- available_actions,
- active_masks=active_masks
- if self._use_policy_active_masks else None)
-
- return action_log_probs, dist_entropy
-
-
-class R_Critic(nn.Module):
- """
- Critic network class for MAPPO. Outputs value function predictions given centralized input (MAPPO) or
- local observations (IPPO).
- :param args: (argparse.Namespace) arguments containing relevant model information.
- :param cent_obs_space: (gym.Space) (centralized) observation space.
- :param device: (torch.device) specifies the device to run on (cpu/gpu).
- """
-
- def __init__(self, args, cent_obs_space, device=torch.device("cpu")):
- super(R_Critic, self).__init__()
- self.hidden_size = args.hidden_size
- self._use_orthogonal = args.use_orthogonal
- self._use_naive_recurrent_policy = args.use_naive_recurrent_policy
- self._use_recurrent_policy = args.use_recurrent_policy
- self._recurrent_N = args.recurrent_N
- self._use_popart = args.use_popart
- self.tpdv = dict(dtype=torch.float32, device=device)
- init_method = [nn.init.xavier_uniform_,
- nn.init.orthogonal_][self._use_orthogonal]
-
- cent_obs_shape = get_shape_from_obs_space(cent_obs_space)
- base = MLPBase
- self.base = base(args, cent_obs_shape)
-
- if self._use_naive_recurrent_policy or self._use_recurrent_policy:
- self.rnn = RNNLayer(self.hidden_size, self.hidden_size,
- self._recurrent_N, self._use_orthogonal,
- device)
-
- def init_(m):
- return init(m, init_method, lambda x: nn.init.constant_(x, 0))
-
- if self._use_popart:
- self.v_out = init_(PopArt(self.hidden_size, 1, device=device))
- else:
- self.v_out = init_(nn.Linear(self.hidden_size, 1))
-
- self.to(device)
-
- def forward(self, cent_obs, rnn_states, masks):
- """
- Compute value function predictions from the given inputs.
- :param cent_obs: (np.ndarray / torch.Tensor) observation inputs into network.
- :param rnn_states: (np.ndarray / torch.Tensor) if RNN network, hidden states for RNN.
- :param masks: (np.ndarray / torch.Tensor) mask tensor denoting if RNN states should be reinitialized to zeros.
-
- :return values: (torch.Tensor) value function predictions.
- :return rnn_states: (torch.Tensor) updated RNN hidden states.
- """
- cent_obs = check(cent_obs).to(**self.tpdv)
- rnn_states = check(rnn_states).to(**self.tpdv)
- masks = check(masks).to(**self.tpdv)
-
- critic_features = self.base(cent_obs)
- if self._use_naive_recurrent_policy or self._use_recurrent_policy:
- critic_features, rnn_states = self.rnn(critic_features, rnn_states,
- masks)
- values = self.v_out(critic_features)
-
- return values, rnn_states
diff --git a/algos/ppo/r_mappo/r_mappo.py b/algos/ppo/r_mappo/r_mappo.py
deleted file mode 100644
index 0bae8b24..00000000
--- a/algos/ppo/r_mappo/r_mappo.py
+++ /dev/null
@@ -1,244 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import numpy as np
-import torch
-import torch.nn as nn
-from algos.ppo.utils.util import get_gard_norm, huber_loss, mse_loss
-from algos.ppo.utils.valuenorm import ValueNorm
-from algos.ppo.ppo_utils.util import check
-
-
-class R_MAPPO():
- """
- Trainer class for MAPPO to update policies.
- :param args: (argparse.Namespace) arguments containing relevant model, policy, and env information.
- :param policy: (R_MAPPOPolicy) policy to update.
- :param device: (torch.device) specifies the device to run on (cpu/gpu).
- """
-
- def __init__(self, args, policy, device=torch.device("cpu")):
-
- self.device = device
- self.tpdv = dict(dtype=torch.float32, device=device)
- self.policy = policy
-
- self.clip_param = args.clip_param
- self.ppo_epoch = args.ppo_epoch
- self.num_mini_batch = args.num_mini_batch
- self.data_chunk_length = args.data_chunk_length
- self.value_loss_coef = args.value_loss_coef
- self.entropy_coef = args.entropy_coef
- self.max_grad_norm = args.max_grad_norm
- self.huber_delta = args.huber_delta
-
- self._use_recurrent_policy = args.use_recurrent_policy
- self._use_naive_recurrent = args.use_naive_recurrent_policy
- self._use_max_grad_norm = args.use_max_grad_norm
- self._use_clipped_value_loss = args.use_clipped_value_loss
- self._use_huber_loss = args.use_huber_loss
- self._use_popart = args.use_popart
- self._use_valuenorm = args.use_valuenorm
- self._use_value_active_masks = args.use_value_active_masks
- self._use_policy_active_masks = args.use_policy_active_masks
-
- assert not (self._use_popart and self._use_valuenorm), (
- "self._use_popart and self._use_valuenorm can not be set True simultaneously"
- )
-
- if self._use_popart:
- self.value_normalizer = self.policy.critic.v_out
- elif self._use_valuenorm:
- self.value_normalizer = ValueNorm(1, device=self.device)
- else:
- self.value_normalizer = None
-
- def cal_value_loss(self, values, value_preds_batch, return_batch,
- active_masks_batch):
- """
- Calculate value function loss.
- :param values: (torch.Tensor) value function predictions.
- :param value_preds_batch: (torch.Tensor) "old" value predictions from data batch (used for value clip loss)
- :param return_batch: (torch.Tensor) reward to go returns.
- :param active_masks_batch: (torch.Tensor) denotes if agent is active or dead at a given timestep.
-
- :return value_loss: (torch.Tensor) value function loss.
- """
- value_pred_clipped = value_preds_batch + (
- values - value_preds_batch).clamp(-self.clip_param,
- self.clip_param)
- if self._use_popart or self._use_valuenorm:
- self.value_normalizer.update(return_batch)
- error_clipped = self.value_normalizer.normalize(
- return_batch) - value_pred_clipped
- error_original = self.value_normalizer.normalize(
- return_batch) - values
- else:
- error_clipped = return_batch - value_pred_clipped
- error_original = return_batch - values
-
- if self._use_huber_loss:
- value_loss_clipped = huber_loss(error_clipped, self.huber_delta)
- value_loss_original = huber_loss(error_original, self.huber_delta)
- else:
- value_loss_clipped = mse_loss(error_clipped)
- value_loss_original = mse_loss(error_original)
-
- if self._use_clipped_value_loss:
- value_loss = torch.max(value_loss_original, value_loss_clipped)
- else:
- value_loss = value_loss_original
-
- if self._use_value_active_masks:
- value_loss = (value_loss *
- active_masks_batch).sum() / active_masks_batch.sum()
- else:
- value_loss = value_loss.mean()
-
- return value_loss
-
- def ppo_update(self, sample, update_actor=True):
- """
- Update actor and critic networks.
- :param sample: (Tuple) contains data batch with which to update networks.
- :param update_actor: (bool) whether to update actor network.
-
- :return value_loss: (torch.Tensor) value function loss.
- :return critic_grad_norm: (torch.Tensor) gradient norm from critic update.
- :return policy_loss: (torch.Tensor) actor (policy) loss value.
- :return dist_entropy: (torch.Tensor) action entropies.
- :return actor_grad_norm: (torch.Tensor) gradient norm from actor update.
- :return imp_weights: (torch.Tensor) importance sampling weights.
- """
- share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch, \
- value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch, \
- adv_targ, available_actions_batch = sample
-
- old_action_log_probs_batch = check(old_action_log_probs_batch).to(
- **self.tpdv)
- adv_targ = check(adv_targ).to(**self.tpdv)
- value_preds_batch = check(value_preds_batch).to(**self.tpdv)
- return_batch = check(return_batch).to(**self.tpdv)
- active_masks_batch = check(active_masks_batch).to(**self.tpdv)
-
- # Reshape to do in a single forward pass for all steps
- values, action_log_probs, dist_entropy = self.policy.evaluate_actions(
- share_obs_batch, obs_batch, rnn_states_batch,
- rnn_states_critic_batch, actions_batch, masks_batch,
- available_actions_batch, active_masks_batch)
- # actor update
- imp_weights = torch.exp(action_log_probs - old_action_log_probs_batch)
-
- surr1 = imp_weights * adv_targ
- surr2 = torch.clamp(imp_weights, 1.0 - self.clip_param,
- 1.0 + self.clip_param) * adv_targ
-
- if self._use_policy_active_masks:
- policy_action_loss = (
- -torch.sum(torch.min(surr1, surr2), dim=-1, keepdim=True) *
- active_masks_batch).sum() / active_masks_batch.sum()
- else:
- policy_action_loss = -torch.sum(
- torch.min(surr1, surr2), dim=-1, keepdim=True).mean()
-
- policy_loss = policy_action_loss
-
- self.policy.actor_optimizer.zero_grad()
-
- if update_actor:
- (policy_loss - dist_entropy * self.entropy_coef).backward()
-
- if self._use_max_grad_norm:
- actor_grad_norm = nn.utils.clip_grad_norm_(
- self.policy.actor.parameters(), self.max_grad_norm)
- else:
- actor_grad_norm = get_gard_norm(self.policy.actor.parameters())
-
- self.policy.actor_optimizer.step()
-
- # critic update
- value_loss = self.cal_value_loss(values, value_preds_batch,
- return_batch, active_masks_batch)
-
- self.policy.critic_optimizer.zero_grad()
-
- (value_loss * self.value_loss_coef).backward()
-
- if self._use_max_grad_norm:
- critic_grad_norm = nn.utils.clip_grad_norm_(
- self.policy.critic.parameters(), self.max_grad_norm)
- else:
- critic_grad_norm = get_gard_norm(self.policy.critic.parameters())
-
- self.policy.critic_optimizer.step()
-
- return value_loss, critic_grad_norm, policy_loss, dist_entropy, actor_grad_norm, imp_weights
-
- def train(self, buffer, update_actor=True):
- """
- Perform a training update using minibatch GD.
- :param buffer: (SharedReplayBuffer) buffer containing training data.
- :param update_actor: (bool) whether to update actor network.
-
- :return train_info: (dict) contains information regarding training update (e.g. loss, grad norms, etc).
- """
- if self._use_popart or self._use_valuenorm:
- advantages = buffer.returns[:
- -1] - self.value_normalizer.denormalize(
- buffer.value_preds[:-1])
- else:
- advantages = buffer.returns[:-1] - buffer.value_preds[:-1]
- advantages_copy = advantages.copy()
- advantages_copy[buffer.active_masks[:-1] == 0.0] = np.nan
- mean_advantages = np.nanmean(advantages_copy)
- std_advantages = np.nanstd(advantages_copy)
- advantages = (advantages - mean_advantages) / (std_advantages + 1e-5)
-
- train_info = {}
-
- train_info['value_loss'] = 0
- train_info['policy_loss'] = 0
- train_info['dist_entropy'] = 0
- train_info['actor_grad_norm'] = 0
- train_info['critic_grad_norm'] = 0
- train_info['ratio'] = 0
-
- for _ in range(self.ppo_epoch):
- if self._use_recurrent_policy:
- data_generator = buffer.recurrent_generator(
- advantages, self.num_mini_batch, self.data_chunk_length)
- elif self._use_naive_recurrent:
- data_generator = buffer.naive_recurrent_generator(
- advantages, self.num_mini_batch)
- else:
- data_generator = buffer.feed_forward_generator(
- advantages, self.num_mini_batch)
-
- for sample in data_generator:
-
- value_loss, critic_grad_norm, policy_loss, dist_entropy, actor_grad_norm, imp_weights \
- = self.ppo_update(sample, update_actor)
-
- train_info['value_loss'] += value_loss.item()
- train_info['policy_loss'] += policy_loss.item()
- train_info['dist_entropy'] += dist_entropy.item()
- train_info['actor_grad_norm'] += actor_grad_norm
- train_info['critic_grad_norm'] += critic_grad_norm
- train_info['ratio'] += imp_weights.mean()
-
- num_updates = self.ppo_epoch * self.num_mini_batch
-
- for k in train_info.keys():
- train_info[k] /= num_updates
-
- return train_info
-
- def prep_training(self):
- self.policy.actor.train()
- self.policy.critic.train()
-
- def prep_rollout(self):
- self.policy.actor.eval()
- self.policy.critic.eval()
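Review note on the removed `R_MAPPO.ppo_update`: the actor loss is the standard PPO clipped surrogate, computed from the importance ratio and the advantages. The NumPy sketch below reproduces only the unmasked branch so the arithmetic can be checked by hand. The function name `clipped_surrogate_loss` and the sample numbers are assumptions for illustration, not part of the deleted module.

```python
import numpy as np


def clipped_surrogate_loss(log_probs, old_log_probs, advantages, clip_param=0.2):
    """PPO clipped policy loss, following the unmasked branch of ppo_update.

    All inputs are arrays of shape (batch, action_dim).
    """
    ratio = np.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # Take the pessimistic bound per element, sum over the action
    # dimension, average over the batch, and negate to obtain a loss.
    return -np.minimum(surr1, surr2).sum(axis=-1, keepdims=True).mean()


# Two samples: ratio 1.5 with positive advantage, ratio 0.5 with negative advantage.
log_probs = np.log(np.array([[1.5], [0.5]]))
old_log_probs = np.zeros((2, 1))
advantages = np.array([[1.0], [-1.0]])

loss = clipped_surrogate_loss(log_probs, old_log_probs, advantages)
```

For the first sample the clipped term (1.2) is the smaller one; for the second it is also the clipped term (-0.8), so the mean objective is 0.2 and the loss is its negation.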
diff --git a/algos/ppo/utils/__init__.py b/algos/ppo/utils/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/algos/ppo/utils/multi_discrete.py b/algos/ppo/utils/multi_discrete.py
deleted file mode 100644
index 64f106fa..00000000
--- a/algos/ppo/utils/multi_discrete.py
+++ /dev/null
@@ -1,58 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import gym
-import numpy as np
-
-
-# An old version of OpenAI Gym's multi_discrete.py. (Was getting affected by Gym updates)
-# (https://github.com/openai/gym/blob/1fb81d4e3fb780ccf77fec731287ba07da35eb84/gym/spaces/multi_discrete.py)
-class MultiDiscrete(gym.Space):
- """
- - The multi-discrete action space consists of a series of discrete action spaces with different parameters
- - It can be adapted to either a Discrete action space or a continuous (Box) action space
- - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space
- - It is parametrized by passing an array of arrays containing [min, max] for each discrete action space where the discrete action space can take any integers from `min` to `max` (both inclusive)
- Note: A value of 0 always needs to represent the NOOP action.
- e.g. Nintendo Game Controller
- - Can be conceptualized as 3 discrete action spaces:
- 1) Arrow Keys: Discrete 5 - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4] - params: min: 0, max: 4
- 2) Button A: Discrete 2 - NOOP[0], Pressed[1] - params: min: 0, max: 1
- 3) Button B: Discrete 2 - NOOP[0], Pressed[1] - params: min: 0, max: 1
- - Can be initialized as
- MultiDiscrete([ [0,4], [0,1], [0,1] ])
- """
-
- def __init__(self, array_of_param_array):
- self.low = np.array([x[0] for x in array_of_param_array])
- self.high = np.array([x[1] for x in array_of_param_array])
- self.num_discrete_space = self.low.shape[0]
- self.n = np.sum(self.high) + 2
-
- def sample(self):
- """ Returns a list with one sample from each discrete action space """
- # For each row: floor(random * (max - min + 1) + min)
- random_array = np.random.rand(self.num_discrete_space)
- return [
- int(x) for x in np.floor(
- np.multiply((self.high - self.low + 1.), random_array) +
- self.low)
- ]
-
- def contains(self, x):
- return len(x) == self.num_discrete_space and (
- np.array(x) >= self.low).all() and (np.array(x) <=
- self.high).all()
-
- @property
- def shape(self):
- return self.num_discrete_space
-
- def __repr__(self):
- return "MultiDiscrete" + str(self.num_discrete_space)
-
- def __eq__(self, other):
- return np.array_equal(self.low, other.low) and np.array_equal(
- self.high, other.high)
diff --git a/algos/ppo/utils/separated_buffer.py b/algos/ppo/utils/separated_buffer.py
deleted file mode 100644
index 342b51ff..00000000
--- a/algos/ppo/utils/separated_buffer.py
+++ /dev/null
@@ -1,505 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch
-import numpy as np
-from collections import defaultdict
-
-from algos.ppo.utils.util import check, get_shape_from_obs_space, get_shape_from_act_space
-
-
-def _flatten(T, N, x):
- return x.reshape(T * N, *x.shape[2:])
-
-
-def _cast(x):
- return x.transpose(1, 0, 2).reshape(-1, *x.shape[2:])
-
-
-class SeparatedReplayBuffer(object):
-
- def __init__(self, args, obs_space, share_obs_space, act_space):
- self.episode_length = args.episode_length
- self.n_rollout_threads = args.n_rollout_threads
- self.rnn_hidden_size = args.hidden_size
- self.recurrent_N = args.recurrent_N
- self.gamma = args.gamma
- self.gae_lambda = args.gae_lambda
- self._use_gae = args.use_gae
- self._use_popart = args.use_popart
- self._use_valuenorm = args.use_valuenorm
- self._use_proper_time_limits = args.use_proper_time_limits
-
- obs_shape = get_shape_from_obs_space(obs_space)
- share_obs_shape = get_shape_from_obs_space(share_obs_space)
-
- if type(obs_shape[-1]) == list:
- obs_shape = obs_shape[:1]
-
- if type(share_obs_shape[-1]) == list:
- share_obs_shape = share_obs_shape[:1]
-
- self.share_obs = np.zeros((self.episode_length + 1,
- self.n_rollout_threads, *share_obs_shape),
- dtype=np.float32)
- self.obs = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, *obs_shape),
- dtype=np.float32)
-
- self.rnn_states = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, self.recurrent_N,
- self.rnn_hidden_size),
- dtype=np.float32)
- self.rnn_states_critic = np.zeros_like(self.rnn_states)
-
- self.value_preds = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, 1),
- dtype=np.float32)
- self.returns = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, 1),
- dtype=np.float32)
-
- if act_space.__class__.__name__ == 'Discrete':
- self.available_actions = np.ones(
- (self.episode_length + 1, self.n_rollout_threads, act_space.n),
- dtype=np.float32)
- else:
- self.available_actions = None
-
- act_shape = get_shape_from_act_space(act_space)
-
- self.actions = np.zeros(
- (self.episode_length, self.n_rollout_threads, act_shape),
- dtype=np.float32)
- self.action_log_probs = np.zeros(
- (self.episode_length, self.n_rollout_threads, act_shape),
- dtype=np.float32)
- self.rewards = np.zeros(
- (self.episode_length, self.n_rollout_threads, 1), dtype=np.float32)
-
- self.masks = np.ones(
- (self.episode_length + 1, self.n_rollout_threads, 1),
- dtype=np.float32)
- self.bad_masks = np.ones_like(self.masks)
- self.active_masks = np.ones_like(self.masks)
-
- self.step = 0
-
- def insert(self,
- share_obs,
- obs,
- rnn_states,
- rnn_states_critic,
- actions,
- action_log_probs,
- value_preds,
- rewards,
- masks,
- bad_masks=None,
- active_masks=None,
- available_actions=None):
- self.share_obs[self.step + 1] = share_obs.copy()
- self.obs[self.step + 1] = obs.copy()
- self.rnn_states[self.step + 1] = rnn_states.copy()
- self.rnn_states_critic[self.step + 1] = rnn_states_critic.copy()
- self.actions[self.step] = actions.copy()
- self.action_log_probs[self.step] = action_log_probs.copy()
- self.value_preds[self.step] = value_preds.copy()
- self.rewards[self.step] = rewards.copy()
- self.masks[self.step + 1] = masks.copy()
- if bad_masks is not None:
- self.bad_masks[self.step + 1] = bad_masks.copy()
- if active_masks is not None:
- self.active_masks[self.step + 1] = active_masks.copy()
- if available_actions is not None:
- self.available_actions[self.step + 1] = available_actions.copy()
-
- self.step = (self.step + 1) % self.episode_length
-
- def chooseinsert(self,
- share_obs,
- obs,
- rnn_states,
- rnn_states_critic,
- actions,
- action_log_probs,
- value_preds,
- rewards,
- masks,
- bad_masks=None,
- active_masks=None,
- available_actions=None):
- self.share_obs[self.step] = share_obs.copy()
- self.obs[self.step] = obs.copy()
- self.rnn_states[self.step + 1] = rnn_states.copy()
- self.rnn_states_critic[self.step + 1] = rnn_states_critic.copy()
- self.actions[self.step] = actions.copy()
- self.action_log_probs[self.step] = action_log_probs.copy()
- self.value_preds[self.step] = value_preds.copy()
- self.rewards[self.step] = rewards.copy()
- self.masks[self.step + 1] = masks.copy()
- if bad_masks is not None:
- self.bad_masks[self.step + 1] = bad_masks.copy()
- if active_masks is not None:
- self.active_masks[self.step] = active_masks.copy()
- if available_actions is not None:
- self.available_actions[self.step] = available_actions.copy()
-
- self.step = (self.step + 1) % self.episode_length
-
- def after_update(self):
- self.share_obs[0] = self.share_obs[-1].copy()
- self.obs[0] = self.obs[-1].copy()
- self.rnn_states[0] = self.rnn_states[-1].copy()
- self.rnn_states_critic[0] = self.rnn_states_critic[-1].copy()
- self.masks[0] = self.masks[-1].copy()
- self.bad_masks[0] = self.bad_masks[-1].copy()
- self.active_masks[0] = self.active_masks[-1].copy()
- if self.available_actions is not None:
- self.available_actions[0] = self.available_actions[-1].copy()
-
- def chooseafter_update(self):
- self.rnn_states[0] = self.rnn_states[-1].copy()
- self.rnn_states_critic[0] = self.rnn_states_critic[-1].copy()
- self.masks[0] = self.masks[-1].copy()
- self.bad_masks[0] = self.bad_masks[-1].copy()
-
- def compute_returns(self, next_value, value_normalizer=None):
- if self._use_proper_time_limits:
- if self._use_gae:
- self.value_preds[-1] = next_value
- gae = 0
- for step in reversed(range(self.rewards.shape[0])):
- if self._use_popart or self._use_valuenorm:
- delta = self.rewards[
- step] + self.gamma * value_normalizer.denormalize(
- self.value_preds[step + 1]) * self.masks[
- step + 1] - value_normalizer.denormalize(
- self.value_preds[step])
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- gae = gae * self.bad_masks[step + 1]
- self.returns[
- step] = gae + value_normalizer.denormalize(
- self.value_preds[step])
- else:
- delta = self.rewards[
- step] + self.gamma * self.value_preds[
- step + 1] * self.masks[
- step + 1] - self.value_preds[step]
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- gae = gae * self.bad_masks[step + 1]
- self.returns[step] = gae + self.value_preds[step]
- else:
- self.returns[-1] = next_value
- for step in reversed(range(self.rewards.shape[0])):
- if self._use_popart or self._use_valuenorm:
- self.returns[step] = (self.returns[step + 1] * self.gamma * self.masks[step + 1] + self.rewards[step]) * self.bad_masks[step + 1] \
- + (1 - self.bad_masks[step + 1]) * value_normalizer.denormalize(self.value_preds[step])
- else:
- self.returns[step] = (self.returns[step + 1] * self.gamma * self.masks[step + 1] + self.rewards[step]) * self.bad_masks[step + 1] \
- + (1 - self.bad_masks[step + 1]) * self.value_preds[step]
- else:
- if self._use_gae:
- self.value_preds[-1] = next_value
- gae = 0
- for step in reversed(range(self.rewards.shape[0])):
- if self._use_popart or self._use_valuenorm:
- delta = self.rewards[
- step] + self.gamma * value_normalizer.denormalize(
- self.value_preds[step + 1]) * self.masks[
- step + 1] - value_normalizer.denormalize(
- self.value_preds[step])
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- self.returns[
- step] = gae + value_normalizer.denormalize(
- self.value_preds[step])
- else:
- delta = self.rewards[
- step] + self.gamma * self.value_preds[
- step + 1] * self.masks[
- step + 1] - self.value_preds[step]
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- self.returns[step] = gae + self.value_preds[step]
- else:
- self.returns[-1] = next_value
- for step in reversed(range(self.rewards.shape[0])):
- self.returns[step] = self.returns[
- step + 1] * self.gamma * self.masks[
- step + 1] + self.rewards[step]
-
- def feed_forward_generator(self,
- advantages,
- num_mini_batch=None,
- mini_batch_size=None):
- episode_length, n_rollout_threads = self.rewards.shape[0:2]
- batch_size = n_rollout_threads * episode_length
-
- if mini_batch_size is None:
- assert batch_size >= num_mini_batch, (
- "PPO requires the number of processes ({}) "
- "* number of steps ({}) = {} "
- "to be greater than or equal to the number of PPO mini batches ({})."
- "".format(n_rollout_threads, episode_length,
- n_rollout_threads * episode_length, num_mini_batch))
- mini_batch_size = batch_size // num_mini_batch
-
- rand = torch.randperm(batch_size).numpy()
- sampler = [
- rand[i * mini_batch_size:(i + 1) * mini_batch_size]
- for i in range(num_mini_batch)
- ]
-
- share_obs = self.share_obs[:-1].reshape(-1, *self.share_obs.shape[2:])
- obs = self.obs[:-1].reshape(-1, *self.obs.shape[2:])
- rnn_states = self.rnn_states[:-1].reshape(-1,
- *self.rnn_states.shape[2:])
- rnn_states_critic = self.rnn_states_critic[:-1].reshape(
- -1, *self.rnn_states_critic.shape[2:])
- actions = self.actions.reshape(-1, self.actions.shape[-1])
- if self.available_actions is not None:
- available_actions = self.available_actions[:-1].reshape(
- -1, self.available_actions.shape[-1])
- value_preds = self.value_preds[:-1].reshape(-1, 1)
- returns = self.returns[:-1].reshape(-1, 1)
- masks = self.masks[:-1].reshape(-1, 1)
- active_masks = self.active_masks[:-1].reshape(-1, 1)
- action_log_probs = self.action_log_probs.reshape(
- -1, self.action_log_probs.shape[-1])
- advantages = advantages.reshape(-1, 1)
-
- for indices in sampler:
- # obs size [T+1 N Dim]-->[T N Dim]-->[T*N,Dim]-->[index,Dim]
- share_obs_batch = share_obs[indices]
- obs_batch = obs[indices]
- rnn_states_batch = rnn_states[indices]
- rnn_states_critic_batch = rnn_states_critic[indices]
- actions_batch = actions[indices]
- if self.available_actions is not None:
- available_actions_batch = available_actions[indices]
- else:
- available_actions_batch = None
- value_preds_batch = value_preds[indices]
- return_batch = returns[indices]
- masks_batch = masks[indices]
- active_masks_batch = active_masks[indices]
- old_action_log_probs_batch = action_log_probs[indices]
- if advantages is None:
- adv_targ = None
- else:
- adv_targ = advantages[indices]
-
- yield share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch, value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch, adv_targ, available_actions_batch
-
- def naive_recurrent_generator(self, advantages, num_mini_batch):
- n_rollout_threads = self.rewards.shape[1]
- assert n_rollout_threads >= num_mini_batch, (
- "PPO requires the number of processes ({}) "
- "to be greater than or equal to the number of "
- "PPO mini batches ({}).".format(n_rollout_threads, num_mini_batch))
- num_envs_per_batch = n_rollout_threads // num_mini_batch
- perm = torch.randperm(n_rollout_threads).numpy()
- for start_ind in range(0, n_rollout_threads, num_envs_per_batch):
- share_obs_batch = []
- obs_batch = []
- rnn_states_batch = []
- rnn_states_critic_batch = []
- actions_batch = []
- available_actions_batch = []
- value_preds_batch = []
- return_batch = []
- masks_batch = []
- active_masks_batch = []
- old_action_log_probs_batch = []
- adv_targ = []
-
- for offset in range(num_envs_per_batch):
- ind = perm[start_ind + offset]
- share_obs_batch.append(self.share_obs[:-1, ind])
- obs_batch.append(self.obs[:-1, ind])
- rnn_states_batch.append(self.rnn_states[0:1, ind])
- rnn_states_critic_batch.append(self.rnn_states_critic[0:1,
- ind])
- actions_batch.append(self.actions[:, ind])
- if self.available_actions is not None:
- available_actions_batch.append(self.available_actions[:-1,
- ind])
- value_preds_batch.append(self.value_preds[:-1, ind])
- return_batch.append(self.returns[:-1, ind])
- masks_batch.append(self.masks[:-1, ind])
- active_masks_batch.append(self.active_masks[:-1, ind])
- old_action_log_probs_batch.append(self.action_log_probs[:,
- ind])
- adv_targ.append(advantages[:, ind])
-
- # [N[T, dim]]
- T, N = self.episode_length, num_envs_per_batch
- # These are all arrays of size (T, N, -1)
- share_obs_batch = np.stack(share_obs_batch, 1)
- obs_batch = np.stack(obs_batch, 1)
- actions_batch = np.stack(actions_batch, 1)
- if self.available_actions is not None:
- available_actions_batch = np.stack(available_actions_batch, 1)
- value_preds_batch = np.stack(value_preds_batch, 1)
- return_batch = np.stack(return_batch, 1)
- masks_batch = np.stack(masks_batch, 1)
- active_masks_batch = np.stack(active_masks_batch, 1)
- old_action_log_probs_batch = np.stack(old_action_log_probs_batch,
- 1)
- adv_targ = np.stack(adv_targ, 1)
-
- # States is just an (N, -1) array [N[1,dim]]
- rnn_states_batch = np.stack(rnn_states_batch,
- 1).reshape(N,
- *self.rnn_states.shape[2:])
- rnn_states_critic_batch = np.stack(
- rnn_states_critic_batch,
- 1).reshape(N, *self.rnn_states_critic.shape[2:])
-
- # Flatten the (T, N, ...) arrays to (T * N, ...)
- share_obs_batch = _flatten(T, N, share_obs_batch)
- obs_batch = _flatten(T, N, obs_batch)
- actions_batch = _flatten(T, N, actions_batch)
- if self.available_actions is not None:
- available_actions_batch = _flatten(T, N,
- available_actions_batch)
- else:
- available_actions_batch = None
- value_preds_batch = _flatten(T, N, value_preds_batch)
- return_batch = _flatten(T, N, return_batch)
- masks_batch = _flatten(T, N, masks_batch)
- active_masks_batch = _flatten(T, N, active_masks_batch)
- old_action_log_probs_batch = _flatten(T, N,
- old_action_log_probs_batch)
- adv_targ = _flatten(T, N, adv_targ)
-
- yield share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch, value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch, adv_targ, available_actions_batch
-
- def recurrent_generator(self, advantages, num_mini_batch,
- data_chunk_length):
- episode_length, n_rollout_threads = self.rewards.shape[0:2]
- batch_size = n_rollout_threads * episode_length
- data_chunks = batch_size // data_chunk_length # [C=r*T/L]
- mini_batch_size = data_chunks // num_mini_batch
-
- assert episode_length * n_rollout_threads >= data_chunk_length, (
- "PPO requires the number of processes ({}) * episode length ({}) "
- "to be greater than or equal to the "
- "data chunk length ({}).".format(n_rollout_threads, episode_length,
- data_chunk_length))
- assert data_chunks >= 2, ("need larger batch size")
-
- rand = torch.randperm(data_chunks).numpy()
- sampler = [
- rand[i * mini_batch_size:(i + 1) * mini_batch_size]
- for i in range(num_mini_batch)
- ]
-
- if len(self.share_obs.shape) > 3:
- share_obs = self.share_obs[:-1].transpose(1, 0, 2, 3, 4).reshape(
- -1, *self.share_obs.shape[2:])
- obs = self.obs[:-1].transpose(1, 0, 2, 3,
- 4).reshape(-1, *self.obs.shape[2:])
- else:
- share_obs = _cast(self.share_obs[:-1])
- obs = _cast(self.obs[:-1])
-
- actions = _cast(self.actions)
- action_log_probs = _cast(self.action_log_probs)
- advantages = _cast(advantages)
- value_preds = _cast(self.value_preds[:-1])
- returns = _cast(self.returns[:-1])
- masks = _cast(self.masks[:-1])
- active_masks = _cast(self.active_masks[:-1])
- # rnn_states = _cast(self.rnn_states[:-1])
- # rnn_states_critic = _cast(self.rnn_states_critic[:-1])
- rnn_states = self.rnn_states[:-1].transpose(1, 0, 2, 3).reshape(
- -1, *self.rnn_states.shape[2:])
- rnn_states_critic = self.rnn_states_critic[:-1].transpose(
- 1, 0, 2, 3).reshape(-1, *self.rnn_states_critic.shape[2:])
-
- if self.available_actions is not None:
- available_actions = _cast(self.available_actions[:-1])
-
- for indices in sampler:
- share_obs_batch = []
- obs_batch = []
- rnn_states_batch = []
- rnn_states_critic_batch = []
- actions_batch = []
- available_actions_batch = []
- value_preds_batch = []
- return_batch = []
- masks_batch = []
- active_masks_batch = []
- old_action_log_probs_batch = []
- adv_targ = []
-
- for index in indices:
- ind = index * data_chunk_length
- # size [T+1 N Dim]-->[T N Dim]-->[N T Dim]-->[N*T,Dim]-->[L,Dim]
- share_obs_batch.append(share_obs[ind:ind + data_chunk_length])
- obs_batch.append(obs[ind:ind + data_chunk_length])
- actions_batch.append(actions[ind:ind + data_chunk_length])
- if self.available_actions is not None:
- available_actions_batch.append(
- available_actions[ind:ind + data_chunk_length])
- value_preds_batch.append(value_preds[ind:ind +
- data_chunk_length])
- return_batch.append(returns[ind:ind + data_chunk_length])
- masks_batch.append(masks[ind:ind + data_chunk_length])
- active_masks_batch.append(active_masks[ind:ind +
- data_chunk_length])
- old_action_log_probs_batch.append(
- action_log_probs[ind:ind + data_chunk_length])
- adv_targ.append(advantages[ind:ind + data_chunk_length])
- # size [T+1 N Dim]-->[T N Dim]-->[T*N,Dim]-->[1,Dim]
- rnn_states_batch.append(rnn_states[ind])
- rnn_states_critic_batch.append(rnn_states_critic[ind])
-
- L, N = data_chunk_length, mini_batch_size
-
- # These are all arrays of size (N, L, Dim)
- share_obs_batch = np.stack(share_obs_batch)
- obs_batch = np.stack(obs_batch)
-
- actions_batch = np.stack(actions_batch)
- if self.available_actions is not None:
- available_actions_batch = np.stack(available_actions_batch)
- value_preds_batch = np.stack(value_preds_batch)
- return_batch = np.stack(return_batch)
- masks_batch = np.stack(masks_batch)
- active_masks_batch = np.stack(active_masks_batch)
- old_action_log_probs_batch = np.stack(old_action_log_probs_batch)
- adv_targ = np.stack(adv_targ)
-
- # States is just an (N, -1) array
- rnn_states_batch = np.stack(rnn_states_batch).reshape(
- N, *self.rnn_states.shape[2:])
- rnn_states_critic_batch = np.stack(
- rnn_states_critic_batch).reshape(
- N, *self.rnn_states_critic.shape[2:])
-
- # Flatten the (N, L, ...) arrays to (N * L, ...)
- share_obs_batch = _flatten(L, N, share_obs_batch)
- obs_batch = _flatten(L, N, obs_batch)
- actions_batch = _flatten(L, N, actions_batch)
- if self.available_actions is not None:
- available_actions_batch = _flatten(L, N,
- available_actions_batch)
- else:
- available_actions_batch = None
- value_preds_batch = _flatten(L, N, value_preds_batch)
- return_batch = _flatten(L, N, return_batch)
- masks_batch = _flatten(L, N, masks_batch)
- active_masks_batch = _flatten(L, N, active_masks_batch)
- old_action_log_probs_batch = _flatten(L, N,
- old_action_log_probs_batch)
- adv_targ = _flatten(L, N, adv_targ)
-
- yield share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch, value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch, adv_targ, available_actions_batch
diff --git a/algos/ppo/utils/shared_buffer.py b/algos/ppo/utils/shared_buffer.py
deleted file mode 100644
index 5bd6c20a..00000000
--- a/algos/ppo/utils/shared_buffer.py
+++ /dev/null
@@ -1,584 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import torch
-import numpy as np
-from algos.ppo.utils.util import get_shape_from_obs_space, get_shape_from_act_space
-
-
-def _flatten(T, N, x):
- return x.reshape(T * N, *x.shape[2:])
-
-
-def _cast(x):
- return x.transpose(1, 2, 0, 3).reshape(-1, *x.shape[3:])
-
-
-class SharedReplayBuffer(object):
- """
- Buffer to store training data.
- :param args: (argparse.Namespace) arguments containing relevant model, policy, and env information.
- :param num_agents: (int) number of agents in the env.
- :param obs_space: (gym.Space) observation space of agents.
- :param cent_obs_space: (gym.Space) centralized observation space of agents.
- :param act_space: (gym.Space) action space for agents.
- """
-
- def __init__(self, args, num_agents, obs_space, cent_obs_space, act_space):
- self.episode_length = args.episode_length
- self.n_rollout_threads = args.n_rollout_threads
- self.hidden_size = args.hidden_size
- self.recurrent_N = args.recurrent_N
- self.gamma = args.gamma
- self.gae_lambda = args.gae_lambda
- self._use_gae = args.use_gae
- self._use_popart = args.use_popart
- self._use_valuenorm = args.use_valuenorm
- self._use_proper_time_limits = args.use_proper_time_limits
-
- obs_shape = get_shape_from_obs_space(obs_space)
- share_obs_shape = get_shape_from_obs_space(cent_obs_space)
-
- if type(obs_shape[-1]) == list:
- obs_shape = obs_shape[:1]
-
- if type(share_obs_shape[-1]) == list:
- share_obs_shape = share_obs_shape[:1]
-
- self.share_obs = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, num_agents,
- *share_obs_shape),
- dtype=np.float32)
- self.obs = np.zeros((self.episode_length + 1, self.n_rollout_threads,
- num_agents, *obs_shape),
- dtype=np.float32)
-
- self.rnn_states = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, num_agents,
- self.recurrent_N, self.hidden_size),
- dtype=np.float32)
- self.rnn_states_critic = np.zeros_like(self.rnn_states)
-
- self.value_preds = np.zeros(
- (self.episode_length + 1, self.n_rollout_threads, num_agents, 1),
- dtype=np.float32)
- self.returns = np.zeros_like(self.value_preds)
-
- if act_space.__class__.__name__ == 'Discrete':
- self.available_actions = np.ones(
- (self.episode_length + 1, self.n_rollout_threads, num_agents,
- act_space.n),
- dtype=np.float32)
- else:
- self.available_actions = None
-
- act_shape = get_shape_from_act_space(act_space)
-
- self.actions = np.zeros((self.episode_length, self.n_rollout_threads,
- num_agents, act_shape),
- dtype=np.float32)
- self.action_log_probs = np.zeros(
- (self.episode_length, self.n_rollout_threads, num_agents,
- act_shape),
- dtype=np.float32)
- self.rewards = np.zeros(
- (self.episode_length, self.n_rollout_threads, num_agents, 1),
- dtype=np.float32)
-
- self.masks = np.ones(
- (self.episode_length + 1, self.n_rollout_threads, num_agents, 1),
- dtype=np.float32)
- self.bad_masks = np.ones_like(self.masks)
- self.active_masks = np.ones_like(self.masks)
-
- self.step = 0
-
- def insert(self,
- share_obs,
- obs,
- rnn_states_actor,
- rnn_states_critic,
- actions,
- action_log_probs,
- value_preds,
- rewards,
- masks,
- bad_masks=None,
- active_masks=None,
- available_actions=None):
- """
- Insert data into the buffer.
- :param share_obs: (np.ndarray) centralized (shared) observations.
- :param obs: (np.ndarray) local agent observations.
- :param rnn_states_actor: (np.ndarray) RNN states for actor network.
- :param rnn_states_critic: (np.ndarray) RNN states for critic network.
- :param actions: (np.ndarray) actions taken by agents.
- :param action_log_probs: (np.ndarray) log probs of actions taken by agents.
- :param value_preds: (np.ndarray) value function prediction at each step.
- :param rewards: (np.ndarray) reward collected at each step.
- :param masks: (np.ndarray) denotes whether the environment has terminated or not.
- :param bad_masks: (np.ndarray) denotes whether an episode ended in a true terminal state or was cut off by the episode limit.
- :param active_masks: (np.ndarray) denotes whether an agent is active or dead in the env.
- :param available_actions: (np.ndarray) actions available to each agent. If None, all actions are available.
- """
- self.share_obs[self.step + 1] = share_obs.copy()
- self.obs[self.step + 1] = obs.copy()
- self.rnn_states[self.step + 1] = rnn_states_actor.copy()
- self.rnn_states_critic[self.step + 1] = rnn_states_critic.copy()
- self.actions[self.step] = actions.copy()
- self.action_log_probs[self.step] = action_log_probs.copy()
- self.value_preds[self.step] = value_preds.copy()
- self.rewards[self.step] = rewards.copy()
- self.masks[self.step + 1] = masks.copy()
- if bad_masks is not None:
- self.bad_masks[self.step + 1] = bad_masks.copy()
- if active_masks is not None:
- self.active_masks[self.step + 1] = active_masks.copy()
- if available_actions is not None:
- self.available_actions[self.step + 1] = available_actions.copy()
-
- self.step = (self.step + 1) % self.episode_length
-
- def chooseinsert(self,
- share_obs,
- obs,
- rnn_states,
- rnn_states_critic,
- actions,
- action_log_probs,
- value_preds,
- rewards,
- masks,
- bad_masks=None,
- active_masks=None,
- available_actions=None):
- """
- Insert data into the buffer. This insert function is used specifically for Hanabi, which is turn based.
- :param share_obs: (np.ndarray) centralized (shared) observations.
- :param obs: (np.ndarray) local agent observations.
- :param rnn_states: (np.ndarray) RNN states for actor network.
- :param rnn_states_critic: (np.ndarray) RNN states for critic network.
- :param actions: (np.ndarray) actions taken by agents.
- :param action_log_probs: (np.ndarray) log probs of actions taken by agents.
- :param value_preds: (np.ndarray) value function prediction at each step.
- :param rewards: (np.ndarray) reward collected at each step.
- :param masks: (np.ndarray) denotes whether the environment has terminated or not.
- :param bad_masks: (np.ndarray) denotes whether an episode ended in a true terminal state or was cut off by the episode limit.
- :param active_masks: (np.ndarray) denotes whether an agent is active or dead in the env.
- :param available_actions: (np.ndarray) actions available to each agent. If None, all actions are available.
- """
- self.share_obs[self.step] = share_obs.copy()
- self.obs[self.step] = obs.copy()
- self.rnn_states[self.step + 1] = rnn_states.copy()
- self.rnn_states_critic[self.step + 1] = rnn_states_critic.copy()
- self.actions[self.step] = actions.copy()
- self.action_log_probs[self.step] = action_log_probs.copy()
- self.value_preds[self.step] = value_preds.copy()
- self.rewards[self.step] = rewards.copy()
- self.masks[self.step + 1] = masks.copy()
- if bad_masks is not None:
- self.bad_masks[self.step + 1] = bad_masks.copy()
- if active_masks is not None:
- self.active_masks[self.step] = active_masks.copy()
- if available_actions is not None:
- self.available_actions[self.step] = available_actions.copy()
-
- self.step = (self.step + 1) % self.episode_length
-
- def after_update(self):
- """Copy last timestep data to first index. Called after update to model."""
- self.share_obs[0] = self.share_obs[-1].copy()
- self.obs[0] = self.obs[-1].copy()
- self.rnn_states[0] = self.rnn_states[-1].copy()
- self.rnn_states_critic[0] = self.rnn_states_critic[-1].copy()
- self.masks[0] = self.masks[-1].copy()
- self.bad_masks[0] = self.bad_masks[-1].copy()
- self.active_masks[0] = self.active_masks[-1].copy()
- if self.available_actions is not None:
- self.available_actions[0] = self.available_actions[-1].copy()
-
- def chooseafter_update(self):
- """Copy last timestep data to first index. This method is used for Hanabi."""
- self.rnn_states[0] = self.rnn_states[-1].copy()
- self.rnn_states_critic[0] = self.rnn_states_critic[-1].copy()
- self.masks[0] = self.masks[-1].copy()
- self.bad_masks[0] = self.bad_masks[-1].copy()
-
- def compute_returns(self, next_value, value_normalizer=None):
- """
- Compute returns either as discounted sum of rewards, or using GAE.
- :param next_value: (np.ndarray) value predictions for the step after the last episode step.
- :param value_normalizer: (PopArt) If not None, PopArt value normalizer instance.
- """
- if self._use_proper_time_limits:
- if self._use_gae:
- self.value_preds[-1] = next_value
- gae = 0
- for step in reversed(range(self.rewards.shape[0])):
- if self._use_popart or self._use_valuenorm:
- # step + 1
- delta = self.rewards[step] + self.gamma * value_normalizer.denormalize(
- self.value_preds[step + 1]) * self.masks[step + 1] \
- - value_normalizer.denormalize(self.value_preds[step])
- gae = delta + self.gamma * self.gae_lambda * gae * self.masks[
- step + 1]
- gae = gae * self.bad_masks[step + 1]
- self.returns[
- step] = gae + value_normalizer.denormalize(
- self.value_preds[step])
- else:
- delta = self.rewards[step] + self.gamma * self.value_preds[step + 1] * self.masks[step + 1] - \
- self.value_preds[step]
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- gae = gae * self.bad_masks[step + 1]
- self.returns[step] = gae + self.value_preds[step]
- else:
- self.returns[-1] = next_value
- for step in reversed(range(self.rewards.shape[0])):
- if self._use_popart or self._use_valuenorm:
- self.returns[step] = (self.returns[step + 1] * self.gamma * self.masks[step + 1] + self.rewards[
- step]) * self.bad_masks[step + 1] \
- + (1 - self.bad_masks[step + 1]) * value_normalizer.denormalize(
- self.value_preds[step])
- else:
- self.returns[step] = (self.returns[step + 1] * self.gamma * self.masks[step + 1] + self.rewards[
- step]) * self.bad_masks[step + 1] \
- + (1 - self.bad_masks[step + 1]) * self.value_preds[step]
- else:
- if self._use_gae:
- self.value_preds[-1] = next_value
- gae = 0
- for step in reversed(range(self.rewards.shape[0])):
- if self._use_popart or self._use_valuenorm:
- delta = self.rewards[step] + self.gamma * value_normalizer.denormalize(
- self.value_preds[step + 1]) * self.masks[step + 1] \
- - value_normalizer.denormalize(self.value_preds[step])
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- self.returns[
- step] = gae + value_normalizer.denormalize(
- self.value_preds[step])
- else:
- delta = self.rewards[step] + self.gamma * self.value_preds[step + 1] * self.masks[step + 1] - \
- self.value_preds[step]
- gae = delta + self.gamma * self.gae_lambda * self.masks[
- step + 1] * gae
- self.returns[step] = gae + self.value_preds[step]
- else:
- self.returns[-1] = next_value
- for step in reversed(range(self.rewards.shape[0])):
- self.returns[step] = self.returns[
- step + 1] * self.gamma * self.masks[
- step + 1] + self.rewards[step]
-
- def feed_forward_generator(self,
- advantages,
- num_mini_batch=None,
- mini_batch_size=None):
- """
- Yield training data for MLP policies.
- :param advantages: (np.ndarray) advantage estimates.
- :param num_mini_batch: (int) number of minibatches to split the batch into.
- :param mini_batch_size: (int) number of samples in each minibatch.
- """
- episode_length, n_rollout_threads, num_agents = self.rewards.shape[0:3]
- batch_size = n_rollout_threads * episode_length * num_agents
-
- if mini_batch_size is None:
- assert batch_size >= num_mini_batch, (
- "PPO requires the number of processes ({}) "
- "* number of steps ({}) * number of agents ({}) = {} "
- "to be greater than or equal to the number of PPO mini batches ({})."
- "".format(n_rollout_threads, episode_length, num_agents,
- n_rollout_threads * episode_length * num_agents,
- num_mini_batch))
- mini_batch_size = batch_size // num_mini_batch
-
- rand = torch.randperm(batch_size).numpy()
- sampler = [
- rand[i * mini_batch_size:(i + 1) * mini_batch_size]
- for i in range(num_mini_batch)
- ]
-
- share_obs = self.share_obs[:-1].reshape(-1, *self.share_obs.shape[3:])
- obs = self.obs[:-1].reshape(-1, *self.obs.shape[3:])
- rnn_states = self.rnn_states[:-1].reshape(-1,
- *self.rnn_states.shape[3:])
- rnn_states_critic = self.rnn_states_critic[:-1].reshape(
- -1, *self.rnn_states_critic.shape[3:])
- actions = self.actions.reshape(-1, self.actions.shape[-1])
- if self.available_actions is not None:
- available_actions = self.available_actions[:-1].reshape(
- -1, self.available_actions.shape[-1])
- value_preds = self.value_preds[:-1].reshape(-1, 1)
- returns = self.returns[:-1].reshape(-1, 1)
- masks = self.masks[:-1].reshape(-1, 1)
- active_masks = self.active_masks[:-1].reshape(-1, 1)
- action_log_probs = self.action_log_probs.reshape(
- -1, self.action_log_probs.shape[-1])
- advantages = advantages.reshape(-1, 1)
-
- for indices in sampler:
- # obs size [T+1 N M Dim]-->[T N M Dim]-->[T*N*M,Dim]-->[index,Dim]
- share_obs_batch = share_obs[indices]
- obs_batch = obs[indices]
- rnn_states_batch = rnn_states[indices]
- rnn_states_critic_batch = rnn_states_critic[indices]
- actions_batch = actions[indices]
- if self.available_actions is not None:
- available_actions_batch = available_actions[indices]
- else:
- available_actions_batch = None
- value_preds_batch = value_preds[indices]
- return_batch = returns[indices]
- masks_batch = masks[indices]
- active_masks_batch = active_masks[indices]
- old_action_log_probs_batch = action_log_probs[indices]
- if advantages is None:
- adv_targ = None
- else:
- adv_targ = advantages[indices]
-
- yield share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch,\
- value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch,\
- adv_targ, available_actions_batch
-
- def naive_recurrent_generator(self, advantages, num_mini_batch):
- """
- Yield training data for non-chunked RNN training.
- :param advantages: (np.ndarray) advantage estimates.
- :param num_mini_batch: (int) number of minibatches to split the batch into.
- """
- episode_length, n_rollout_threads, num_agents = self.rewards.shape[0:3]
- batch_size = n_rollout_threads * num_agents
- assert n_rollout_threads * num_agents >= num_mini_batch, (
- "PPO requires the number of processes ({})* number of agents ({}) "
- "to be greater than or equal to the number of "
- "PPO mini batches ({}).".format(n_rollout_threads, num_agents,
- num_mini_batch))
- num_envs_per_batch = batch_size // num_mini_batch
- perm = torch.randperm(batch_size).numpy()
-
- share_obs = self.share_obs.reshape(-1, batch_size,
- *self.share_obs.shape[3:])
- obs = self.obs.reshape(-1, batch_size, *self.obs.shape[3:])
- rnn_states = self.rnn_states.reshape(-1, batch_size,
- *self.rnn_states.shape[3:])
- rnn_states_critic = self.rnn_states_critic.reshape(
- -1, batch_size, *self.rnn_states_critic.shape[3:])
- actions = self.actions.reshape(-1, batch_size, self.actions.shape[-1])
- if self.available_actions is not None:
- available_actions = self.available_actions.reshape(
- -1, batch_size, self.available_actions.shape[-1])
- value_preds = self.value_preds.reshape(-1, batch_size, 1)
- returns = self.returns.reshape(-1, batch_size, 1)
- masks = self.masks.reshape(-1, batch_size, 1)
- active_masks = self.active_masks.reshape(-1, batch_size, 1)
- action_log_probs = self.action_log_probs.reshape(
- -1, batch_size, self.action_log_probs.shape[-1])
- advantages = advantages.reshape(-1, batch_size, 1)
-
- for start_ind in range(0, batch_size, num_envs_per_batch):
- share_obs_batch = []
- obs_batch = []
- rnn_states_batch = []
- rnn_states_critic_batch = []
- actions_batch = []
- available_actions_batch = []
- value_preds_batch = []
- return_batch = []
- masks_batch = []
- active_masks_batch = []
- old_action_log_probs_batch = []
- adv_targ = []
-
- for offset in range(num_envs_per_batch):
- ind = perm[start_ind + offset]
- share_obs_batch.append(share_obs[:-1, ind])
- obs_batch.append(obs[:-1, ind])
- rnn_states_batch.append(rnn_states[0:1, ind])
- rnn_states_critic_batch.append(rnn_states_critic[0:1, ind])
- actions_batch.append(actions[:, ind])
- if self.available_actions is not None:
- available_actions_batch.append(available_actions[:-1, ind])
- value_preds_batch.append(value_preds[:-1, ind])
- return_batch.append(returns[:-1, ind])
- masks_batch.append(masks[:-1, ind])
- active_masks_batch.append(active_masks[:-1, ind])
- old_action_log_probs_batch.append(action_log_probs[:, ind])
- adv_targ.append(advantages[:, ind])
-
- # [N[T, dim]]
- T, N = self.episode_length, num_envs_per_batch
- # These are all from_numpys of size (T, N, -1)
- share_obs_batch = np.stack(share_obs_batch, 1)
- obs_batch = np.stack(obs_batch, 1)
- actions_batch = np.stack(actions_batch, 1)
- if self.available_actions is not None:
- available_actions_batch = np.stack(available_actions_batch, 1)
- value_preds_batch = np.stack(value_preds_batch, 1)
- return_batch = np.stack(return_batch, 1)
- masks_batch = np.stack(masks_batch, 1)
- active_masks_batch = np.stack(active_masks_batch, 1)
- old_action_log_probs_batch = np.stack(old_action_log_probs_batch,
- 1)
- adv_targ = np.stack(adv_targ, 1)
-
- # States is just a (N, dim) from_numpy [N[1,dim]]
- rnn_states_batch = np.stack(rnn_states_batch).reshape(
- N, *self.rnn_states.shape[3:])
- rnn_states_critic_batch = np.stack(
- rnn_states_critic_batch).reshape(
- N, *self.rnn_states_critic.shape[3:])
-
- # Flatten the (T, N, ...) from_numpys to (T * N, ...)
- share_obs_batch = _flatten(T, N, share_obs_batch)
- obs_batch = _flatten(T, N, obs_batch)
- actions_batch = _flatten(T, N, actions_batch)
- if self.available_actions is not None:
- available_actions_batch = _flatten(T, N,
- available_actions_batch)
- else:
- available_actions_batch = None
- value_preds_batch = _flatten(T, N, value_preds_batch)
- return_batch = _flatten(T, N, return_batch)
- masks_batch = _flatten(T, N, masks_batch)
- active_masks_batch = _flatten(T, N, active_masks_batch)
- old_action_log_probs_batch = _flatten(T, N,
- old_action_log_probs_batch)
- adv_targ = _flatten(T, N, adv_targ)
-
- yield share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch,\
- value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch,\
- adv_targ, available_actions_batch
-
- def recurrent_generator(self, advantages, num_mini_batch,
- data_chunk_length):
- """
- Yield training data for chunked RNN training.
- :param advantages: (np.ndarray) advantage estimates.
- :param num_mini_batch: (int) number of minibatches to split the batch into.
- :param data_chunk_length: (int) length of sequence chunks with which to train RNN.
- """
- episode_length, n_rollout_threads, num_agents = self.rewards.shape[0:3]
- batch_size = n_rollout_threads * episode_length * num_agents
- data_chunks = batch_size // data_chunk_length # [C=r*T*M/L]
- mini_batch_size = data_chunks // num_mini_batch
-
- rand = torch.randperm(data_chunks).numpy()
- sampler = [
- rand[i * mini_batch_size:(i + 1) * mini_batch_size]
- for i in range(num_mini_batch)
- ]
-
- if len(self.share_obs.shape) > 4:
- share_obs = self.share_obs[:-1].transpose(
- 1, 2, 0, 3, 4, 5).reshape(-1, *self.share_obs.shape[3:])
- obs = self.obs[:-1].transpose(1, 2, 0, 3, 4,
- 5).reshape(-1, *self.obs.shape[3:])
- else:
- share_obs = _cast(self.share_obs[:-1])
- obs = _cast(self.obs[:-1])
-
- actions = _cast(self.actions)
- action_log_probs = _cast(self.action_log_probs)
- advantages = _cast(advantages)
- value_preds = _cast(self.value_preds[:-1])
- returns = _cast(self.returns[:-1])
- masks = _cast(self.masks[:-1])
- active_masks = _cast(self.active_masks[:-1])
- # rnn_states = _cast(self.rnn_states[:-1])
- # rnn_states_critic = _cast(self.rnn_states_critic[:-1])
- rnn_states = self.rnn_states[:-1].transpose(1, 2, 0, 3, 4).reshape(
- -1, *self.rnn_states.shape[3:])
- rnn_states_critic = self.rnn_states_critic[:-1].transpose(
- 1, 2, 0, 3, 4).reshape(-1, *self.rnn_states_critic.shape[3:])
-
- if self.available_actions is not None:
- available_actions = _cast(self.available_actions[:-1])
-
- for indices in sampler:
- share_obs_batch = []
- obs_batch = []
- rnn_states_batch = []
- rnn_states_critic_batch = []
- actions_batch = []
- available_actions_batch = []
- value_preds_batch = []
- return_batch = []
- masks_batch = []
- active_masks_batch = []
- old_action_log_probs_batch = []
- adv_targ = []
-
- for index in indices:
-
- ind = index * data_chunk_length
- # size [T+1 N M Dim]-->[T N M Dim]-->[N,M,T,Dim]-->[N*M*T,Dim]-->[L,Dim]
- share_obs_batch.append(share_obs[ind:ind + data_chunk_length])
- obs_batch.append(obs[ind:ind + data_chunk_length])
- actions_batch.append(actions[ind:ind + data_chunk_length])
- if self.available_actions is not None:
- available_actions_batch.append(
- available_actions[ind:ind + data_chunk_length])
- value_preds_batch.append(value_preds[ind:ind +
- data_chunk_length])
- return_batch.append(returns[ind:ind + data_chunk_length])
- masks_batch.append(masks[ind:ind + data_chunk_length])
- active_masks_batch.append(active_masks[ind:ind +
- data_chunk_length])
- old_action_log_probs_batch.append(
- action_log_probs[ind:ind + data_chunk_length])
- adv_targ.append(advantages[ind:ind + data_chunk_length])
- # size [T+1 N M Dim]-->[T N M Dim]-->[N M T Dim]-->[N*M*T,Dim]-->[1,Dim]
- rnn_states_batch.append(rnn_states[ind])
- rnn_states_critic_batch.append(rnn_states_critic[ind])
-
- L, N = data_chunk_length, mini_batch_size
-
- # These are all from_numpys of size (L, N, Dim)
- share_obs_batch = np.stack(share_obs_batch, axis=1)
- obs_batch = np.stack(obs_batch, axis=1)
-
- actions_batch = np.stack(actions_batch, axis=1)
- if self.available_actions is not None:
- available_actions_batch = np.stack(available_actions_batch,
- axis=1)
- value_preds_batch = np.stack(value_preds_batch, axis=1)
- return_batch = np.stack(return_batch, axis=1)
- masks_batch = np.stack(masks_batch, axis=1)
- active_masks_batch = np.stack(active_masks_batch, axis=1)
- old_action_log_probs_batch = np.stack(old_action_log_probs_batch,
- axis=1)
- adv_targ = np.stack(adv_targ, axis=1)
-
- # States is just a (N, -1) from_numpy
- rnn_states_batch = np.stack(rnn_states_batch).reshape(
- N, *self.rnn_states.shape[3:])
- rnn_states_critic_batch = np.stack(
- rnn_states_critic_batch).reshape(
- N, *self.rnn_states_critic.shape[3:])
-
- # Flatten the (L, N, ...) from_numpys to (L * N, ...)
- share_obs_batch = _flatten(L, N, share_obs_batch)
- obs_batch = _flatten(L, N, obs_batch)
- actions_batch = _flatten(L, N, actions_batch)
- if self.available_actions is not None:
- available_actions_batch = _flatten(L, N,
- available_actions_batch)
- else:
- available_actions_batch = None
- value_preds_batch = _flatten(L, N, value_preds_batch)
- return_batch = _flatten(L, N, return_batch)
- masks_batch = _flatten(L, N, masks_batch)
- active_masks_batch = _flatten(L, N, active_masks_batch)
- old_action_log_probs_batch = _flatten(L, N,
- old_action_log_probs_batch)
- adv_targ = _flatten(L, N, adv_targ)
-
- yield share_obs_batch, obs_batch, rnn_states_batch, rnn_states_critic_batch, actions_batch,\
- value_preds_batch, return_batch, masks_batch, active_masks_batch, old_action_log_probs_batch,\
- adv_targ, available_actions_batch
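Review note on the removed buffer code: the GAE branch of the return computation deleted above (the variant without PopArt or value normalization) can be summarized by the short sketch below. It is a plain-Python illustration for a single trajectory, not the removed implementation; the function name `gae_returns` and the list-based layout are assumptions made for readability. As in the removed code, `masks[t + 1]` is taken to be 0.0 when the episode ended at step `t`.

```python
def gae_returns(rewards, value_preds, masks, gamma=0.99, gae_lambda=0.95):
    """Return GAE-based value targets for one trajectory.

    rewards has T entries; value_preds and masks have T + 1 entries.
    value_preds[T] is the bootstrap value for the state after the last step.
    (Hypothetical helper: names and list layout are illustrative only.)
    """
    num_steps = len(rewards)
    returns = [0.0] * num_steps
    gae = 0.0
    # Walk backwards so each step can reuse the advantage of its successor.
    for step in reversed(range(num_steps)):
        # One-step TD error; the mask removes the bootstrap term at episode ends.
        delta = (rewards[step]
                 + gamma * value_preds[step + 1] * masks[step + 1]
                 - value_preds[step])
        # Exponentially weighted sum of TD errors, cut off at episode ends.
        gae = delta + gamma * gae_lambda * masks[step + 1] * gae
        # The return target is the advantage plus the value baseline.
        returns[step] = gae + value_preds[step]
    return returns
```

With `gae_lambda=1.0` and no terminations this reduces to the discounted Monte Carlo return plus the bootstrapped tail, and with `gae_lambda=0.0` it reduces to the one-step TD target.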
diff --git a/algos/ppo/utils/util.py b/algos/ppo/utils/util.py
deleted file mode 100644
index 7e23b9ea..00000000
--- a/algos/ppo/utils/util.py
+++ /dev/null
@@ -1,85 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import numpy as np
-import math
-import torch
-
-
-def check(input):
- if type(input) == np.ndarray:
- return torch.from_numpy(input)
-
-
-def get_gard_norm(it):
- sum_grad = 0
- for x in it:
- if x.grad is None:
- continue
- sum_grad += x.grad.norm()**2
- return math.sqrt(sum_grad)
-
-
-def update_linear_schedule(optimizer, epoch, total_num_epochs, initial_lr):
- """Decreases the learning rate linearly"""
- lr = initial_lr - (initial_lr * (epoch / float(total_num_epochs)))
- for param_group in optimizer.param_groups:
- param_group['lr'] = lr
-
-
-def huber_loss(e, d):
- a = (abs(e) <= d).float()
- b = (e > d).float()
- return a * e**2 / 2 + b * d * (abs(e) - d / 2)
-
-
-def mse_loss(e):
- return e**2 / 2
-
-
-def get_shape_from_obs_space(obs_space):
- if obs_space.__class__.__name__ == 'Box':
- obs_shape = obs_space.shape
- elif obs_space.__class__.__name__ == 'list':
- obs_shape = obs_space
- else:
- raise NotImplementedError
- return obs_shape
-
-
-def get_shape_from_act_space(act_space):
- if act_space.__class__.__name__ == 'Discrete':
- act_shape = 1
- elif act_space.__class__.__name__ == "MultiDiscrete":
- act_shape = act_space.shape
- elif act_space.__class__.__name__ == "Box":
- act_shape = act_space.shape[0]
- elif act_space.__class__.__name__ == "MultiBinary":
- act_shape = act_space.shape[0]
- else: # agar
- act_shape = act_space[0].shape[0] + 1
- return act_shape
-
-
-def tile_images(img_nhwc):
- """
- Tile N images into one big PxQ image
- (P,Q) are chosen to be as close as possible, and if N
- is square, then P=Q.
- input: img_nhwc, list or array of images, ndim=4 once turned into array
- n = batch index, h = height, w = width, c = channel
- returns:
- bigim_HWc, ndarray with ndim=3
- """
- img_nhwc = np.asarray(img_nhwc)
- N, h, w, c = img_nhwc.shape
- H = int(np.ceil(np.sqrt(N)))
- W = int(np.ceil(float(N) / H))
- img_nhwc = np.array(
- list(img_nhwc) + [img_nhwc[0] * 0 for _ in range(N, H * W)])
- img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
- img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
- img_Hh_Ww_c = img_HhWwc.reshape(H * h, W * w, c)
- return img_Hh_Ww_c
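Review note on the removed `huber_loss`: its linear branch is gated by `e > d` rather than `abs(e) > d`, so an error below `-d` matches neither branch and contributes zero loss. The sketch below compares a scalar transcription of the removed expression with the textbook Huber loss. Both functions are illustrative plain-Python stand-ins (the removed code operated on torch tensors), and the names `removed_style_huber` and `standard_huber` are made up for this note.

```python
def removed_style_huber(e, d):
    """Scalar transcription of the removed expression (illustration only)."""
    a = 1.0 if abs(e) <= d else 0.0  # quadratic region
    b = 1.0 if e > d else 0.0        # linear region, positive tail only
    return a * e ** 2 / 2 + b * d * (abs(e) - d / 2)


def standard_huber(e, d):
    """Textbook Huber loss: quadratic near zero, linear in both tails."""
    abs_e = abs(e)
    if abs_e <= d:
        return e * e / 2
    return d * (abs_e - d / 2)


if __name__ == "__main__":
    for err in (0.5, 3.0, -3.0):
        print(err, removed_style_huber(err, 1.0), standard_huber(err, 1.0))
```

The two agree inside the quadratic region and on the positive tail; they differ only for errors below `-d`, where the transcription returns zero.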
diff --git a/algos/ppo/utils/valuenorm.py b/algos/ppo/utils/valuenorm.py
deleted file mode 100644
index 76df255d..00000000
--- a/algos/ppo/utils/valuenorm.py
+++ /dev/null
@@ -1,97 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-# Code modified from https://github.com/marlbenchmark/on-policy
-import numpy as np
-
-import torch
-import torch.nn as nn
-
-
-class ValueNorm(nn.Module):
- """ Normalize a vector of observations - across the first norm_axes dimensions"""
-
- def __init__(self,
- input_shape,
- norm_axes=1,
- beta=0.99999,
- per_element_update=False,
- epsilon=1e-5,
- device=torch.device("cpu")):
- super(ValueNorm, self).__init__()
-
- self.input_shape = input_shape
- self.norm_axes = norm_axes
- self.epsilon = epsilon
- self.beta = beta
- self.per_element_update = per_element_update
- self.tpdv = dict(dtype=torch.float32, device=device)
-
- self.running_mean = nn.Parameter(torch.zeros(input_shape),
- requires_grad=False).to(**self.tpdv)
- self.running_mean_sq = nn.Parameter(
- torch.zeros(input_shape), requires_grad=False).to(**self.tpdv)
- self.debiasing_term = nn.Parameter(torch.tensor(0.0),
- requires_grad=False).to(**self.tpdv)
-
- self.reset_parameters()
-
- def reset_parameters(self):
- self.running_mean.zero_()
- self.running_mean_sq.zero_()
- self.debiasing_term.zero_()
-
- def running_mean_var(self):
- debiased_mean = self.running_mean / self.debiasing_term.clamp(
- min=self.epsilon)
- debiased_mean_sq = self.running_mean_sq / self.debiasing_term.clamp(
- min=self.epsilon)
- debiased_var = (debiased_mean_sq - debiased_mean**2).clamp(min=1e-2)
- return debiased_mean, debiased_var
-
- @torch.no_grad()
- def update(self, input_vector):
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- batch_mean = input_vector.mean(dim=tuple(range(self.norm_axes)))
- batch_sq_mean = (input_vector**2).mean(
- dim=tuple(range(self.norm_axes)))
-
- if self.per_element_update:
- batch_size = np.prod(input_vector.size()[:self.norm_axes])
- weight = self.beta**batch_size
- else:
- weight = self.beta
-
- self.running_mean.mul_(weight).add_(batch_mean * (1.0 - weight))
- self.running_mean_sq.mul_(weight).add_(batch_sq_mean * (1.0 - weight))
- self.debiasing_term.mul_(weight).add_(1.0 * (1.0 - weight))
-
- def normalize(self, input_vector):
- # Make sure input is float32
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- mean, var = self.running_mean_var()
- out = (input_vector - mean[(None, ) * self.norm_axes]
- ) / torch.sqrt(var)[(None, ) * self.norm_axes]
-
- return out
-
- def denormalize(self, input_vector):
- """ Transform normalized data back into original distribution """
- if type(input_vector) == np.ndarray:
- input_vector = torch.from_numpy(input_vector)
- input_vector = input_vector.to(**self.tpdv)
-
- mean, var = self.running_mean_var()
- out = input_vector * torch.sqrt(var)[(None, ) * self.norm_axes] + mean[
- (None, ) * self.norm_axes]
-
- out = out.cpu().numpy()
-
- return out
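Review note on the removed `ValueNorm`: it keeps exponential moving averages of the mean and the squared mean, and divides both by a debiasing term that starts at zero, so early estimates are not pulled toward the zero initialization. The sketch below shows the same bookkeeping for scalars in plain Python. The class name `ScalarValueNorm` is hypothetical, and the tensor shapes, `norm_axes` handling, and `per_element_update` option of the removed module are deliberately left out.

```python
import math


class ScalarValueNorm:
    """Scalar sketch of a debiased running mean/variance normalizer."""

    def __init__(self, beta=0.99999, epsilon=1e-5):
        self.beta = beta
        self.epsilon = epsilon
        self.running_mean = 0.0
        self.running_mean_sq = 0.0
        self.debiasing_term = 0.0

    def update(self, batch):
        """Fold one batch of scalars into the running statistics."""
        batch_mean = sum(batch) / len(batch)
        batch_sq_mean = sum(x * x for x in batch) / len(batch)
        w = self.beta
        self.running_mean = w * self.running_mean + (1.0 - w) * batch_mean
        self.running_mean_sq = w * self.running_mean_sq + (1.0 - w) * batch_sq_mean
        self.debiasing_term = w * self.debiasing_term + (1.0 - w)

    def mean_var(self):
        """Return the debiased mean and a variance floored at 1e-2."""
        denom = max(self.debiasing_term, self.epsilon)
        mean = self.running_mean / denom
        mean_sq = self.running_mean_sq / denom
        var = max(mean_sq - mean ** 2, 1e-2)
        return mean, var

    def normalize(self, x):
        mean, var = self.mean_var()
        return (x - mean) / math.sqrt(var)

    def denormalize(self, x):
        mean, var = self.mean_var()
        return x * math.sqrt(var) + mean
```

After a single `update`, the debiased mean equals the batch mean regardless of `beta`, which is the point of dividing by the debiasing term.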
diff --git a/setup.py b/build.py
similarity index 74%
rename from setup.py
rename to build.py
index 4863ae61..43af0ae5 100644
--- a/setup.py
+++ b/build.py
@@ -1,9 +1,4 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-
-"""Run via ```python setup.py develop``` to install Nocturne in your environment."""
+from pybind11.setup_helpers import build_ext, Pybind11Extension
import logging
import multiprocessing
import os
@@ -12,18 +7,14 @@
import sys
from distutils.version import LooseVersion
-from setuptools import Extension, setup
-from setuptools.command.build_ext import build_ext
-
-# Reference:
-# https://www.benjack.io/2017/06/12/python-cpp-tests.html
+logging.basicConfig(level=logging.INFO)
-class CMakeExtension(Extension):
+class CMakeExtension(Pybind11Extension):
"""Use CMake to construct the Nocturne extension."""
def __init__(self, name, src_dir=""):
- Extension.__init__(self, name, sources=[])
+ Pybind11Extension.__init__(self, name, sources=[])
self.src_dir = os.path.abspath(src_dir)
@@ -87,15 +78,9 @@ def build_extension(self, ext):
print() # Add an empty line for cleaner output
-def main():
- """Build the C++ code."""
- # with open("./requirements.txt", "r") as f:
- # requires = f.read().splitlines()
- setup(
- ext_modules=[CMakeExtension("nocturne", "./nocturne")],
- cmdclass=dict(build_ext=CMakeBuild),
- )
-
-
-if __name__ == "__main__":
- main()
+def build(setup_kwargs):
+ setup_kwargs.update({
+ "ext_modules": [CMakeExtension("nocturne", "./nocturne")],
+ "cmdclass": {"build_ext": CMakeBuild},
+ "zip_safe": False,
+ })
diff --git a/cfgs/algorithm/APPO.yaml b/cfgs/algorithm/APPO.yaml
deleted file mode 100644
index 5c83b6e5..00000000
--- a/cfgs/algorithm/APPO.yaml
+++ /dev/null
@@ -1,208 +0,0 @@
-algo: APPO
-experiments_root: null
- # If not None, store experiment data in the specified subfolder of train_dir. Useful for groups of experiments (e.g. gridsearch) (default: None)
-train_dir: null
- # Root for all experiments (default: /private/home/eugenevinitsky/Code/nocturne/examples/train_dir)
- # if null use the hydra default position
-device: gpu # CPU training is only recommended for smaller e.g. MLP policies (default: gpu)
-save_every_sec: 120 # Checkpointing rate (default: 120)
-keep_checkpoints: 3 #Number of model checkpoints to keep (default: 3)
-save_milestones_sec: -1 #Save intermediate checkpoints in a separate folder for later evaluation (default=never) (default: -1)
-stats_avg: 100 #How many episodes to average to measure performance (avg. reward etc) (default: 100)
-learning_rate: 0.0001 # LR (default: 0.0001)
-train_for_env_steps: 3000000000 # Stop after all policies are trained for this many env steps (default: 10000000000)
-train_for_seconds: 10000000000 #Stop training after this many seconds (default: 10000000000)
-lr_schedule: constant #Learning rate schedule to use. Constant keeps constant learning rate throughout training.
- # kl_adaptive* schedulers look at --lr_schedule_kl_threshold and if KL-divergence with behavior policy'
- # after the last minibatch/epoch significantly deviates from this threshold, lr is apropriately'
- # increased or decreased
- # options are 'constant', 'kl_adaptive_minibatch', 'kl_adaptive_epoch'
-lr_schedule_kl_threshold: 0.008 #Used with kl_adaptive_* schedulers
-obs_subtract_mean: 0.0 # Observation preprocessing, mean value to subtract from observation (e.g. 128.0 for 8-bit RGB) (default: 0.0)
-obs_scale: 10.0 # Observation preprocessing, divide observation tensors by this scalar (e.g. 128.0 for 8-bit RGB) (default: 1.0)
-gamma: 0.99 # Discount factor (default: 0.99)
-reward_scale: 1.0
- # Multiply all rewards by this factor before feeding into RL algorithm.Sometimes the overall scale of rewards is too high which makes value estimation a
- # harder regression task.Loss values become too high which requires a smaller learning rate, etc. (default: 1.0)
-reward_clip: 10.0 # Clip rewards between [-c, c]. Default [-10, 10] virtually means no clipping for most envs (default: 10.0)
-encoder_type: mlp # Type of the encoder. Supported: conv, mlp, resnet (feel free to define more) (default: conv)
-encoder_subtype: mlp_mujoco # Specific encoder design (see model.py) (default: convnet_simple)
-encoder_custom: custom_env_encoder # Use custom encoder class from the registry (see model_utils.py) (default: null, options {null, custom_env_encoder})
-encoder_extra_fc_layers: 1 # Number of fully-connected layers of size "hidden size" to add after the basic encoder (e.g. convolutional) (default: 1)
-encoder_hidden_size: 256
-hidden_size: 256 # Size of hidden layer in the model, or the size of RNN hidden state in recurrent model (e.g. GRU) (default: 128)
-nonlinearity: tanh # {elu,relu,tanh}
- # Type of nonlinearity to use (default: elu)
-policy_initialization: orthogonal # {orthogonal,xavier_uniform}
- # NN weight initialization (default: orthogonal)
-policy_init_gain: 1.0 # Gain parameter of PyTorch initialization schemas (i.e. Xavier) (default: 1.0)
-actor_critic_share_weights: True # Whether to share the weights between policy and value function (default: True)
-use_spectral_norm: False # Use spectral normalization to smoothen the gradients and stabilize training. Only supports fully connected layers (default: False)
-adaptive_stddev: True # Only for continuous action distributions, whether stddev is state-dependent or just a single learned parameter (default: True)
-initial_stddev: 1.0 # Initial value for non-adaptive stddev. Only makes sense for continuous action spaces (default: 1.0)
-experiment_summaries_interval: 20 # How often in seconds we write avg. statistics about the experiment (reward, episode length, extra stats...) (default: 20)
-adam_eps: 1e-06 # Adam epsilon parameter (1e-8 to 1e-5 seem to reliably work okay, 1e-3 and up does not work) (default: 1e-06)
-adam_beta1: 0.9 # Adam momentum decay coefficient (default: 0.9)
-adam_beta2: 0.999 # Adam second momentum decay coefficient (default: 0.999)
-gae_lambda: 0.95 # Generalized Advantage Estimation discounting (only used when V-trace is False (default: 0.95)
-rollout: 20
-# Length of the rollout from each environment in timesteps.Once we collect this many timesteps on actor worker, we send this trajectory to the learner.The
-# length of the rollout will determine how many timesteps are used to calculate bootstrappedMonte-Carlo estimates of discounted rewards, advantages, GAE,
-# or V-trace targets. Shorter rolloutsreduce variance, but the estimates are less precise (bias vs variance tradeoff).For RNN policies, this should be a
-# multiple of --recurrence, so every rollout will be splitinto (n = rollout / recurrence) segments for backpropagation. V-trace algorithm currently
-# requires thatrollout == recurrence, which what you want most of the time anyway.Rollout length is independent from the episode length. Episode length
-# can be both shorter or longer thanrollout, although for PBT training it is currently recommended that rollout << episode_len(see function
-# finalize_trajectory in actor_worker.py) (default: 32)
-num_workers: 80 # Number of parallel environment workers. Should be less than num_envs and should divide num_envs (default: 80)
-recurrence: 20 # Trajectory length for backpropagation through time. If recurrence=1 there is no backpropagation through time, and experience is shuffled completely
- # randomlyFor V-trace recurrence should be equal to rollout length. (default: 32)
-use_rnn: True # Whether to use RNN core in a policy or not (default: True)
-rnn_type: gru # {gru,lstm}
- # Type of RNN cell to use if use_rnn is True (default: gru)
-rnn_num_layers: 1 # Number of RNN layers to use if use_rnn is True (default: 1)
-ppo_clip_ratio: 0.1 # We use unbiased clip(x, 1+e, 1/(1+e)) instead of clip(x, 1+e, 1-e) in the paper (default: 0.1)
-ppo_clip_value: 1.0 # Maximum absolute change in value estimate until it is clipped. Sensitive to value magnitude (default: 1.0)
-batch_size: 7180 # Minibatch size for SGD (default: 1024)
-num_batches_per_iteration: 1
-# How many minibatches we collect before training on the collected experience. It is generally recommended to set this to 1 for most experiments, because
-# any higher value will increase the policy lag.But in some specific circumstances it can be beneficial to have a larger macro-batch in order to shuffle
-# and decorrelate the minibatches.Here and throughout the codebase: macro batch is the portion of experience that learner processes per iteration
-# (consisting of 1 or several minibatches) (default: 1)
-ppo_epochs: 1 # Number of training epochs before a new batch of experience is collected (default: 1)
-num_minibatches_to_accumulate: -1
-# This parameter governs the maximum number of minibatches the learner can accumulate before further experience collection is stopped.The default value
-# (-1) will set this to 2 * num_batches_per_iteration, so if the experience collection is faster than the training,the learner will accumulate enough
-# minibatches for 2 iterations of training (but no more). This is a good balance between policy-lag and throughput.When the limit is reached, the learner
-# will notify the actor workers that they ought to stop the experience collection until accumulated minibatchesare processed. Set this parameter to 1 *
-# num_batches_per_iteration to further reduce policy-lag.If the experience collection is very non-uniform, increasing this parameter can increase overall
-# throughput, at the cost of increased policy-lag.A value of 0 is treated specially. This means the experience accumulation is turned off, and all
-# experience collection will be halted during training.This is the regime with potentially lowest policy-lag.When this parameter is 0 and num_workers *
-# num_envs_per_worker * rollout == num_batches_per_iteration * batch_size, the algorithm is similar toregular synchronous PPO. (default: -1)
-max_grad_norm: 4.0 # Max L2 norm of the gradient vector (default: 4.0)
-exploration_loss_coeff: 0.001 # Coefficient for the exploration component of the loss function. (default: 0.001)
-value_loss_coeff: 0.5 # Coefficient for the critic loss (default: 0.5)
-kl_loss_coeff: 0.0 #Coefficient for fixed KL loss (as used by Schulman et al. in https://arxiv.org/pdf/1707.06347.pdf). Highly recommended for environments with continuous
- # action spaces. (default: 0.0)
-exploration_loss: entropy
- # {entropy,symmetric_kl}
- # Usually the exploration loss is based on maximizing the entropy of the probability distribution. Note that mathematically maximizing entropy of the
- # categorical probability distribution is exactly the same as minimizing the (regular) KL-divergence between this distribution and a uniform prior. The
- # downside of using the entropy term (or regular asymmetric KL-divergence) is the fact that penalty does not increase as probabilities of some actions
- # approach zero. I.e. numerically, there is almost no difference between an action distribution with a probability epsilon > 0 for some action and an
- # action distribution with a probability = zero for this action. For many tasks the first (epsilon) distribution is preferrable because we keep some
- # (albeit small) amount of exploration, while the second distribution will never explore this action ever again.Unlike the entropy term, symmetric KL
- # divergence between the action distribution and a uniform prior approaches infinity when entropy of the distribution approaches zero, so it can prevent
- # the pathological situations where the agent stops exploring. Empirically, symmetric KL-divergence yielded slightly better results on some problems.
- # (default: entropy)
-max_entropy_coeff: 0.0, # Coefficient for max entropy term added directly to rewards. 0 means no max entropy term to env rewards. '
- # Note that this is different from exploration loss (see https://arxiv.org/abs/1805.00909)'
-num_envs_per_worker: 2
- # Number of envs on a single CPU actor, in high-throughput configurations this should be in 10-30 range for Atari/VizDoomMust be even for double-buffered
- # sampling! (default: 2)
-worker_num_splits: 2
- # Typically we split a vector of envs into two parts for "double buffered" experience collectionSet this to 1 to disable double buffering. Set this to 3
- # for triple buffering! (default: 2)
-num_policies: 1
- # Number of policies to train jointly (default: 1)
-policy_workers_per_policy: 1
- # Number of policy workers that compute forward pass (per policy) (default: 1)
-max_policy_lag: 10000
- # Max policy lag in policy versions. Discard all experience that is older than this. This should be increased for configurations with multiple epochs of
- # SGD because naturallypolicy-lag may exceed this value. (default: 10000)
-traj_buffers_excess_ratio: 1.3
- # Increase this value to make sure the system always has enough free trajectory buffers (can be useful when i.e. a lot of inactive agents in multi-agent
- # envs)Decrease this to 1.0 to save as much RAM as possible. (default: 1.3)
-decorrelate_experience_max_seconds: 10
- # Decorrelating experience serves two benefits. First: this is better for learning because samples from workers come from random moments in the episode,
- # becoming more "i.i.d".Second, and more important one: this is good for environments with highly non-uniform one-step times, including long and expensive
- # episode resets. If experience is not decorrelatedthen training batches will come in bursts e.g. after a bunch of environments finished resets and many
- # iterations on the learner might be required,which will increase the policy-lag of the new experience collected. The performance of the Sample Factory is
- # best when experience is generated as more-or-lessuniform stream. Try increasing this to 100-200 seconds to smoothen the experience distribution in time
- # right from the beginning (it will eventually spread out and settle anyway) (default: 10)
-decorrelate_envs_on_one_worker: True
- # In addition to temporal decorrelation of worker processes, also decorrelate envs within one worker processFor environments with a fixed episode length
- # it can prevent the reset from happening in the same rollout for all envs simultaneously, which makes experience collection more uniform. (default: True)
-with_vtrace: True
- # Enables V-trace off-policy correction. If this is True, then GAE is not used (default: True)
-vtrace_rho: 1.0
- # rho_hat clipping parameter of the V-trace algorithm (importance sampling truncation) (default: 1.0)
-vtrace_c: 1.0
- # c_hat clipping parameter of the V-trace algorithm. Low values for c_hat can reduce variance of the advantage estimates (similar to GAE lambda < 1)
- # (default: 1.0)
-set_workers_cpu_affinity: True
- # Whether to assign workers to specific CPU cores or not. The logic is beneficial for most workloads because prevents a lot of context switching.However
- # for some environments it can be better to disable it, to allow one worker to use all cores some of the time. This can be the case for some DMLab
- # environments with very expensive episode resetthat can use parallel CPU cores for level generation. (default: True)
-force_envs_single_thread: True
- # Some environments may themselves use parallel libraries such as OpenMP or MKL. Since we parallelize environments on the level of workers, there is no
- # need to keep this parallel semantic.This flag uses threadpoolctl to force libraries such as OpenMP and MKL to use only a single thread within the
- # environment.Default value (True) is recommended unless you are running fewer workers than CPU cores. (default: True)
-reset_timeout_seconds: 120
- # Fail worker on initialization if not a single environment was reset in this time (worker probably got stuck) (default: 120)
-default_niceness: 0
- # Niceness of the highest priority process (the learner). Values below zero require elevated privileges. (default: 0)
-train_in_background_thread: True
- # Using background thread for training is faster and allows preparing the next batch while training is in progress.Unfortunately debugging can become very
- # tricky in this case. So there is an option to use only a single thread on the learner to simplify the debugging. (default: True)
-learner_main_loop_num_cores: 1
- # When batching on the learner is the bottleneck, increasing the number of cores PyTorch uses can improve the performance (default: 1)
-actor_worker_gpus: []
- # [ACTOR_WORKER_GPUS [ACTOR_WORKER_GPUS ...]]
- # By default, actor workers only use CPUs. Changes this if e.g. you need GPU-based rendering on the actors (default: [])
-with_pbt: False # Enables population-based training basic features (default: False)
-pbt_mix_policies_in_one_env: True
- # For multi-agent envs, whether we mix different policies in one env. (default: True)
-pbt_period_env_steps: 5000000
- # Periodically replace the worst policies with the best ones and perturb the hyperparameters (default: 5000000)
-pbt_start_mutation: 20000000
- # Allow initial diversification, start PBT after this many env steps (default: 20000000)
-pbt_replace_fraction: 0.3
- # A portion of policies performing worst to be replace by better policies (rounded up) (default: 0.3)
-pbt_mutation_rate: 0.15
- # Probability that a parameter mutates (default: 0.15)
-pbt_replace_reward_gap: 0.1
- # Relative gap in true reward when replacing weights of the policy with a better performing one (default: 0.1)
-pbt_replace_reward_gap_absolute: 1e-06
- # Absolute gap in true reward when replacing weights of the policy with a better performing one (default: 1e-06)
-pbt_optimize_batch_size: False
- # Whether to optimize batch size or not (experimental) (default: False)
-pbt_optimize_gamma: False
- # Whether to optimize gamma, discount factor, or not (experimental) (default: False)
-pbt_target_objective: true_reward
- # Policy stat to optimize with PBT. true_reward (default) is equal to raw env reward if not specified, but can also be any other per-policy stat.For
- # DMlab-30 use value "dmlab_target_objective" (which is capped human normalized score) (default: true_reward)
-pbt_perturb_min: 1.05
- # When PBT mutates a float hyperparam, it samples the change magnitude randomly from the uniform distribution [pbt_perturb_min, pbt_perturb_max] (default:
- # 1.05)
-pbt_perturb_max: 1.5
- # When PBT mutates a float hyperparam, it samples the change magnitude randomly from the uniform distribution [pbt_perturb_min, pbt_perturb_max] (default:
- # 1.5)
-use_cpc: False # Use CPC|A as an auxiliary loss durning learning (default: False)
-cpc_forward_steps: 8
- # Number of forward prediction steps for CPC (default: 8)
-cpc_time_subsample: 6
- # Number of timesteps to sample from each batch. This should be less than recurrence to decorrelate experience. (default: 6)
-cpc_forward_subsample: 2
- # Number of forward steps to sample for loss computation. This should be less than cpc_forward_steps to decorrelate gradients. (default: 2)
-with_wandb: ${wandb}
- # Enables Weights and Biases integration (default: False)
-wandb_user: null
- # WandB username (entity). Must be specified from command line! Also see https://docs.wandb.ai/quickstart#1.-set-up-wandb (default: None)
-wandb_project: ${wandb_project}
- # WandB "Project" (default: sample_factory)
-wandb_group: ${wandb_group}
- # WandB "Group" (to group your experiments). By default this is the name of the env. (default: None)
-wandb_job_type: SF
- # WandB job type (default: SF)
-wandb_tags: [] # [WANDB_TAGS [WANDB_TAGS ...]]
- # Tags can help with finding experiments in WandB web console (default: [])
-benchmark: False
- # Benchmark mode (default: False)
-sampler_only: False
- # Do not send experience to the learner, measuring sampling throughput (default: False)
-env_frameskip: null
- # Number of frames for action repeat (frame skipping). Default (None) means use default environment value (default: None)
-env_framestack: 4
- # Frame stacking (only used in Atari?) (default: 4)
-pixel_format: CHW
- # PyTorch expects CHW by default, Ray & TensorFlow expect HWC (default: CHW)
\ No newline at end of file
diff --git a/cfgs/algorithm/ppo.yaml b/cfgs/algorithm/ppo.yaml
deleted file mode 100644
index 485f53d9..00000000
--- a/cfgs/algorithm/ppo.yaml
+++ /dev/null
@@ -1,81 +0,0 @@
-algorithm_name: 'rmappo' # choices=["rmappo", "mappo"]
-experiment: ${experiment}
-seed: ${seed}
-device: ${device}
-cuda_deterministic: True
-n_training_threads: 1 # "Number of torch threads for training"
-n_rollout_threads: 1 # Number of parallel envs for training rollouts
-n_eval_rollout_threads: 1 # Number of parallel envs for evaluating rollouts
-n_render_rollout_threads: 1 # Number of parallel envs for rendering rollouts
-num_env_steps: 1e8 # Number of environment steps to train
-wandb: ${wandb}
-use_obs_instead_of_state: True # Whether to use global state or concatenated obs
-episode_length: ${episode_length} # Max length for any episode
-share_policy: True # Whether all agents share the same policy
-use_centralized_V: False # Whether to use a centralized value function
-stacked_frames: 1 # number of stacked observations
-use_stacked_frames: True # whether to use stacked frames
-hidden_size: 64 # Dimension of hidden layers for actor/critic networks
-layer_N: 2 # "Number of layers for actor/critic networks"
-use_ReLU: True # Whether to use ReLU activation or Tanh
-use_popart: False # Use PopART to normalize rewards
-use_valuenorm: True # use running mean and std to normalize rewards
-use_feature_normalization: True # Whether to apply layernorm to the inputs
-use_orthogonal: True # Whether to use Orthogonal initialization for weights and 0 initialization for biases
-gain: 0.01 # The gain # of last action layer
-# recurrent parameters
-use_naive_recurrent_policy: False # Whether to use a naive recurrent policy by stacking states I believe?
-use_recurrent_policy: True # Whether to use a recurrent policy
-recurrent_N: 1 # The number of recurrent layers
-data_chunk_length: 10 # Time length of chunks used to train a recurrent_policy
-
-# optimizer parameters
-lr: 5e-4 # learning rate
-critic_lr: 5e-4 # critic LR
-opti_eps: 1e-5 # RMSprop optimizer epsilon
-weight_decay: 0
-
-# ppo parameters
-ppo_epoch: 10 # number of PPO epochs
-use_clipped_value_loss: True # clip loss value
-clip_param: 0.2 # PPO clipping parameter
-num_mini_batch: 4 # Number of minibatches of the collected data to use
-entropy_coef: 0.00
-value_loss_coef: 0.5 # scaling on the value loss
-use_max_grad_norm: True # use max norm of gradients
-max_grad_norm: 10.0 # max norm of gradients
-use_gae: True # use generalized advantage estimation
-gamma: 0.99 # discount factor
-gae_lambda: 0.95
-use_proper_time_limits: False # compute returns taking into account time limits
-use_huber_loss: True
-use_value_active_masks: True # whether to mask useless data in value loss
-use_policy_active_masks: True # whether to mask useless data in policy loss
-huber_delta: 10.0 # coefficient of huber loss
-use_linear_lr_decay: False
-
-# saving and logging
-save_interval: 1 # time duration between contiunous twice models saving
-log_interval: 5 # time duration between contiunous twice log printing
-use_eval: True
-eval_interval: 25
-eval_episodes: 10
-save_gifs: True
-render_interval: 25 # how often to render
-use_render: False
-render_episodes: 1
-ifi: 0.1 # the play interval of each rendered image in saved video
-model_dir: null
-
-# goal env wrapper stuff
-density_buffer_size: 100000
-density_optim_samples: 1000
-num_goal_samples: 200
-bandwidth: 0.1
-log_figure: True
-kernel: 'gaussian'
-quartile_cutoff: 0.0
-normalize_value: 400.0
-log_every_n_episodes: 50
-# if True, all the agents share the same goal buffer for sampling new goals
-share_goal_buffer: False
\ No newline at end of file
diff --git a/cfgs/config.py b/cfgs/config.py
deleted file mode 100644
index f759c9af..00000000
--- a/cfgs/config.py
+++ /dev/null
@@ -1,52 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
-#
-# This source code is licensed under the MIT license found in the
-# LICENSE file in the root directory of this source tree.
-"""Set path to all the Waymo data and the parsed Waymo files."""
-import os
-from pathlib import Path
-
-from hydra import compose, initialize
-from hydra.core.global_hydra import GlobalHydra
-from omegaconf import OmegaConf
-from pyvirtualdisplay import Display
-
-VERSION_NUMBER = 2
-
-PROJECT_PATH = Path.resolve(Path(__file__).parent.parent)
-DATA_FOLDER = '/checkpoint/eugenevinitsky/waymo_open/motion_v1p1/uncompressed/scenario/'
-TRAIN_DATA_PATH = os.path.join(DATA_FOLDER, 'training')
-VALID_DATA_PATH = os.path.join(DATA_FOLDER, 'validation')
-TEST_DATA_PATH = os.path.join(DATA_FOLDER, 'testing')
-PROCESSED_TRAIN_NO_TL = os.path.join(
- DATA_FOLDER, f'formatted_json_v{VERSION_NUMBER}_no_tl_train')
-PROCESSED_VALID_NO_TL = os.path.join(
- DATA_FOLDER, f'formatted_json_v{VERSION_NUMBER}_no_tl_valid')
-PROCESSED_TRAIN = os.path.join(DATA_FOLDER,
- f'formatted_json_v{VERSION_NUMBER}_train')
-PROCESSED_VALID = os.path.join(DATA_FOLDER,
- f'formatted_json_v{VERSION_NUMBER}_valid')
-ERR_VAL = -1e4
-
-
-def get_scenario_dict(hydra_cfg):
- """Convert the `scenario` key in the hydra config to a true dict."""
- if isinstance(hydra_cfg['scenario'], dict):
- return hydra_cfg['scenario']
- else:
- return OmegaConf.to_container(hydra_cfg['scenario'], resolve=True)
-
-
-def get_default_scenario_dict():
- """Construct the `scenario` dict without w/o hydra decorator."""
- GlobalHydra.instance().clear()
- initialize(config_path="./")
- cfg = compose(config_name="config")
- return get_scenario_dict(cfg)
-
-
-def set_display_window():
- """Set a virtual display for headless machines."""
- if "DISPLAY" not in os.environ:
- disp = Display()
- disp.start()
diff --git a/cfgs/config.yaml b/cfgs/config.yaml
deleted file mode 100644
index cf123dbf..00000000
--- a/cfgs/config.yaml
+++ /dev/null
@@ -1,122 +0,0 @@
-defaults:
- - algorithm: ppo
- - override hydra/launcher: submitit_local
-
-seed: 0
-device: 'cuda:0'
-debug: False
-experiment: intersection
-env: my_custom_multi_env_v1 # name of the env, hardcoded for now
-
-# WANDB things
-wandb: False
-wandb_project: nocturne4
-wandb_id: null
-wandb_group: ${experiment}
-
-# one of the agents will be randomly tagged as the
-# agent that we control, the rest of the agents will
-# replay trajectories
-single_agent_mode: False
-# all goals are achievable within 90 steps
-episode_length: 80
-# how many files of the total dataset to use. -1 indicates to use all of them
-num_files: -1
-scenario_path: ${oc.env:PROCESSED_TRAIN_NO_TL}
-dt: 0.1
-sims_per_step: 10
-img_as_state: False
-discretize_actions: True
-accel_discretization: 6
-accel_lower_bound: -3
-accel_upper_bound: 2
-steering_lower_bound: -0.7 # corresponds to about 40 degrees of max steering angle
-steering_upper_bound: 0.7 # corresponds to about 40 degrees of max steering angle
-steering_discretization: 21
-head_angle_lower_bound: -1.6
-head_angle_upper_bound: 1.6
-head_angle_discretization: 5
-max_num_vehicles: 20 # we want to upper bound how many agents there can be in the scene
- # this is mostly useful because RL libraries expect it
-# TODO(eugenevinitsky) actually implement this
-randomize_goals: False
-scenario:
- # initial timestep of the scenario (which ranges from timesteps 0 to 90)
- start_time: 0
- # if set to True, non-vehicle objects (eg. cyclists, pedestrians...) will be spawned
- allow_non_vehicles: False
- # for an object to be included into moving_objects
- moving_threshold: 0.2 # its goal must be at least this distance from its initial position
- speed_threshold: 0.05 # its speed must be superior to this value at some point
- # maximum number of each objects visible in the object state
- # if there are more objects, the closest ones are prioritized
- # if there are less objects, the features vector is padded with zeros
- max_visible_objects: 16
- max_visible_road_points: 1000
- max_visible_traffic_lights: 20
- max_visible_stop_signs: 4
- # from the set of road points that comprise each polyline, we take
- # every n-th one of these
- sample_every_n: 1
- # if true we add all the road-edges (the edges you can collide with)
- # to the visible road points first and only add the other points
- # (road lines, lane lines) etc. if we have remaining states after
- road_edge_first: False
-
-# these configs are mostly used for aligning displacement error computations
-# with the standard way of doing it in other libraries i.e. we keep
-# the agent for the whole rollout and compute its distance from the expert
-# at all the points that the expert is valid
-remove_at_goal: True # if true, remove the agent when it reaches its goal
-remove_at_collide: True # if true, remove the agent when it collides
-
-rew_cfg:
- shared_reward: False # agents get the collective reward instead of individual rewards
- goal_tolerance: 0.5
- reward_scaling: 10.0 # rescale all the rewards by this value. This can help w/ some learning algorithms
- collision_penalty: 0
- shaped_goal_distance_scaling: 0.2
- shaped_goal_distance: True
- goal_distance_penalty: False # if shaped_goal_distance is true, then when this is True the goal distance
- # is a penalty for being far from
- # goal instead of a reward for being close
- goal_achieved_bonus: ${episode_length}
- # goal is only achieved if you're within this tolerance on distance from goal
- position_target: True
- position_target_tolerance: 1.0
- # goal is only achieved if you're within this tolerance on final agent speed at goal position
- speed_target: True
- speed_target_tolerance: 1.0
- # goal is only achieved if you're within this tolerance on final agent heading at goal position
- heading_target: True
- heading_target_tolerance: 0.3
-subscriber:
- view_angle: 2.1
- # the distance which the cone extends before agents are not visible
- # TODO(eugenevinitsky) pick the right number
- view_dist: 80
- use_ego_state: True
- use_observations: True
- # if true, we return an observation for agents that have exited the system
- # as well as returning an observation for the extra agents if the number of
- # agents in the system is less than max_num_vehicles
- keep_inactive_agents: False
- # for values greater than 1, we will stack inputs together
- n_frames_stacked: 1
-
-results_dir: ${oc.env:NOCTURNE_LOG_DIR}
-
-hydra:
- run:
- dir: ${results_dir}/test/${now:%Y.%m.%d}/${experiment}/${now:%H.%M.%S}/${hydra.job.override_dirname}
- sweep:
- dir: ${results_dir}/${oc.env:USER}/nocturne/sweep/${now:%Y.%m.%d}/${experiment}/${now:%H.%M.%S}
- subdir: ${hydra.job.num}
- launcher:
- timeout_min: 2880
- cpus_per_task: 10
- gpus_per_node: 1
- tasks_per_node: 1
- mem_gb: 160
- nodes: 1
- submitit_folder: ${results_dir}/sweep/${now:%Y.%m.%d}/${now:%H%M}_${experiment}/.slurm
diff --git a/cfgs/cpp_ b/cfgs/cpp_
deleted file mode 100644
index e69de29b..00000000
diff --git a/cfgs/imitation/config.yaml b/cfgs/imitation/config.yaml
deleted file mode 100644
index 6cf72fc1..00000000
--- a/cfgs/imitation/config.yaml
+++ /dev/null
@@ -1,42 +0,0 @@
-defaults:
- - override hydra/launcher: submitit_local
-
-experiment: test
-path: ${oc.env:PROCESSED_TRAIN_NO_TL}
-num_files: 1000
-n_cpus: 9
-lr: 3e-4
-samples_per_epoch: 50000
-max_visible_road_points: 500
-batch_size: 512
-epochs: 700
-device: cuda
-n_stacked_states: 5
-view_dist: 80
-view_angle: 3.14
-actions_are_positions: False
-discrete: True
-seed: 0
-
-# WANDB things
-wandb: True
-wandb_project: nocturne
-wandb_group: ${experiment}
-
-# tensorboard logs
-write_to_tensorboard: True
-
-hydra:
- run:
- dir: /checkpoint/${oc.env:USER}/nocturne/test/${now:%Y.%m.%d}/${experiment}/${now:%H.%M.%S}/${hydra.job.override_dirname}
- sweep:
- dir: /checkpoint/${oc.env:USER}/nocturne/sweep/imitation/${now:%Y.%m.%d}/${experiment}/${now:%H.%M.%S}
- subdir: ${hydra.job.num}
- launcher:
- timeout_min: 2880
- cpus_per_task: 80
- gpus_per_node: 1
- tasks_per_node: 1
- mem_gb: 160
- nodes: 1
- submitit_folder: /checkpoint/${oc.env:USER}/nocturne/sweep/imitation/${now:%Y.%m.%d}/${experiment}/${now:%H.%M.%S}/.slurm
diff --git a/configs/env_config.yaml b/configs/env_config.yaml
new file mode 100644
index 00000000..adf7a237
--- /dev/null
+++ b/configs/env_config.yaml
@@ -0,0 +1,93 @@
+seed: 0
+device: cuda:0
+debug: false
+experiment: intersection
+env: my_custom_multi_env_v1 # name of the env, hardcoded for now
+
+# all goals are achievable within 90 steps
+episode_length: 80
+# how many files of the total dataset to use. -1 indicates to use all of them
+num_files: 5
+dt: 0.1
+sims_per_step: 10
+img_as_state: false
+discretize_actions: true
+include_head_angle: false # Whether to include the head tilt/angle as part of a vehicle's action
+accel_discretization: 3
+accel_lower_bound: -2
+accel_upper_bound: 2
+steering_lower_bound: -0.25 # corresponds to about 14 degrees of max steering angle
+steering_upper_bound: 0.25 # corresponds to about 14 degrees of max steering angle
+steering_discretization: 3
+max_num_vehicles: 20
+randomize_goals: false
+scenario:
+ # initial timestep of the scenario (which ranges from timesteps 0 to 90)
+ start_time: 0
+ # if set to True, non-vehicle objects (e.g. cyclists, pedestrians, ...) will be spawned
+ allow_non_vehicles: false
+ # for an object to be included into moving_objects
+ moving_threshold: 0.2 # its goal must be at least this distance from its initial position
+ speed_threshold: 0.05 # its speed must be superior to this value at some point
+ # maximum number of objects of each type visible in the object state
+ # if there are more objects, the closest ones are prioritized
+ # if there are fewer objects, the feature vector is padded with zeros
+ #max_visible_objects: 16
+ max_visible_road_points: 500
+ #max_visible_traffic_lights: 20
+ #max_visible_stop_signs: 4
+ # from the set of road points that comprise each polyline, we take
+ # every n-th one of these
+ sample_every_n: 1
+ # if true, we add all the road edges (the edges you can collide with)
+ # to the visible road points first and only add the other points
+ # (road lines, lane lines, etc.) if there is remaining capacity afterwards
+ road_edge_first: false
+ invalid_position: -10000.0
+ context_length: 10
+
+# these configs are mostly used for aligning displacement error computations
+# with the standard way of doing it in other libraries i.e. we keep
+# the agent for the whole rollout and compute its distance from the expert
+# at all the points that the expert is valid
+remove_at_goal: true # if true, remove the agent when it reaches its goal
+remove_at_collide: true # if true, remove the agent when it collides
+
+# Reward settings
+rew_cfg:
+ shared_reward: false # agents get the collective reward instead of individual rewards
+ goal_tolerance: 0.5
+ reward_scaling: 10.0 # rescale all the rewards by this value. This can help w/ some learning algorithms
+ collision_penalty: 0
+ shaped_goal_distance_scaling: 0.2
+ shaped_goal_distance: true
+ goal_distance_penalty: false # if shaped_goal_distance is true, then when this is True the goal distance
+ # is a penalty for being far from
+ # goal instead of a reward for being close
+ goal_achieved_bonus: 80
+ position_target: true # If True, goal is only achieved if you're within this tolerance on distance from goal
+ position_target_tolerance: 1.0
+ speed_target: true # If True, goal is only achieved if you're within this tolerance on final agent speed at goal position
+ speed_target_tolerance: 1.0
+ heading_target: false # If True, goal is only achieved if you're within this tolerance on final agent heading at goal position
+ heading_target_tolerance: 0.3
+ # we assume that vehicles are never more than 400 meters from their goal which makes
+ # sense as the episodes are 9 seconds long, i.e. we'd have to go more than 40 m/s to get there
+ goal_speed_scaling: 40.0
+
+# Agent settings
+subscriber:
+ view_angle: 3.14 # the angle of the view cone, in radians; set to pi rad to correct for missing head angle
+ view_dist: 80 # the distance which the cone extends before agents are not visible
+ use_ego_state: true # if True, add information about the ego state
+ use_observations: false # if True, add visible field
+ use_start_position: false # if True, add start (x, y)-position of the agent
+ use_current_position: false # if True, add current (x, y)-position of the agent
+ use_target_position: false # if True, add target (x, y)-position of the agent
+ use_distance_to_target: false # if True, add distance to target (dx, dy) of the agent
+
+ # for values greater than 1, we will stack inputs together (i.e. memory and equivalent of n_stacked_states)
+ n_frames_stacked: 1 # Agent memory
+
+# Path to folder with traffic scene(s) from which to create an environment
+data_path: ../data
diff --git a/examples/example_scenario.json b/data/example_scenario.json
similarity index 100%
rename from examples/example_scenario.json
rename to data/example_scenario.json
diff --git a/data/valid_files.json b/data/valid_files.json
new file mode 100644
index 00000000..7698869b
--- /dev/null
+++ b/data/valid_files.json
@@ -0,0 +1,3 @@
+{
+ "example_scenario.json": []
+}
diff --git a/environment.yml b/environment.yml
index b9e7ae19..33dc0588 100644
--- a/environment.yml
+++ b/environment.yml
@@ -2,29 +2,4 @@ name: nocturne
channels:
- defaults
dependencies:
- - python=3.8
- - pip=21.1.3
- - numpy=1.19.2
- - jupyterlab=3.0.14
- - pip:
- - hydra-core==1.1.0
- - hydra-submitit-launcher==1.1.5
- - ipdb==0.13.9
- - seaborn
- - imageio==2.10.1
- - moviepy==1.0.3
- - opencv-python==4.5.5.64
- - gym==0.20.0
- - wandb==0.12.15
- - imageio==2.10.1
- - setproctitle==1.2.3
- - tensorboardX==2.5
- - pytest==7.1.1
- - flake8==4.0.1
- - pydocstyle==6.1.1
- - pyvirtualdisplay
- - ray==1.11.0
- - dm-tree
- - tabulate
- - torch
- - sample-factory==1.123.0
\ No newline at end of file
+ - python=3.10
diff --git a/examples/01_data_structure.ipynb b/examples/01_data_structure.ipynb
new file mode 100644
index 00000000..d79bf973
--- /dev/null
+++ b/examples/01_data_structure.ipynb
@@ -0,0 +1,4235 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data format of a traffic scene\n",
+ "\n",
+ "This notebook dives into the data format used to create simulations in Nocturne.\n",
+ "\n",
+ "_Last update: 10/2023_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "import pandas as pd\n",
+ "\n",
+ "import os\n",
+ "os.chdir('..')\n",
+ "\n",
+ "cmap = ['r', 'g', 'b', 'y', 'c'] \n",
+ "%config InlineBackend.figure_format = 'svg'\n",
+ "sns.set('notebook', font_scale=1.1, rc={'figure.figsize': (8, 3)})\n",
+ "sns.set_style('ticks', rc={'figure.facecolor': 'none', 'axes.facecolor': 'none'})"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Traffic scenes are constructed from the [Waymo Open Motion dataset](https://waymo.com/open/). Every scene is unique, but all scenes share the same basic data structure.\n",
+ "\n",
+ "To load a traffic scene:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "dict_keys(['name', 'objects', 'roads', 'tl_states'])"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Take an example scene\n",
+ "data_path = './data/example_scenario.json'\n",
+ "\n",
+ "with open(data_path) as file:\n",
+ " traffic_scene = json.load(file)\n",
+ "\n",
+ "traffic_scene.keys()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Global Overview \n",
+ "A traffic scene consists of:\n",
+ "- `name`: the name of the traffic scenario.\n",
+ "- `objects`: the road objects or moving vehicles in the scene.\n",
+ "- `roads`: the road points in the scene; these are all the stationary objects.\n",
+ "- `tl_states`: the states of the traffic lights, which are filtered out for now. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{}"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "traffic_scene['tl_states']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'tfrecord-00358-of-01000_65.json'"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "traffic_scene['name']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "pd.Series(\n",
+ " [\n",
+ " traffic_scene['objects'][idx]['type']\n",
+ " for idx in range(len(traffic_scene['objects']))\n",
+ " ]\n",
+ ").value_counts().plot(kind='bar', rot=45, color=cmap);\n",
+ "plt.title(f'Distribution of road objects in traffic scene. Total # objects: {len(traffic_scene[\"objects\"])}')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This traffic scenario contains only vehicles and pedestrians; some scenes have cyclists as well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "pd.Series(\n",
+ " [\n",
+ " traffic_scene['roads'][idx]['type']\n",
+ " for idx in range(len(traffic_scene['roads']))\n",
+ " ]\n",
+ ").value_counts().plot(kind='bar', rot=45, color=cmap);\n",
+ "plt.title(f'Distribution of road points in traffic scene. Total # points: {len(traffic_scene[\"roads\"])}')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### In-Depth: Road Objects\n",
+ "\n",
+ "`objects` is a list of the road objects in the traffic scene. For each road object, we have its position, velocity, size (width and length), heading, validity at each time step, type, and goal position."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "dict_keys(['position', 'width', 'length', 'heading', 'velocity', 'valid', 'goalPosition', 'type'])"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Take the first object\n",
+ "idx = 0\n",
+ "\n",
+ "# For each object, we have this information:\n",
+ "traffic_scene['objects'][idx].keys()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[\n",
+ " {\n",
+ " \"x\": 9037.7138671875,\n",
+ " \"y\": -2720.373779296875\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9037.7607421875,\n",
+ " \"y\": -2720.306640625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9037.822265625,\n",
+ " \"y\": -2720.217529296875\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9037.8916015625,\n",
+ " \"y\": -2720.146240234375\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9037.9482421875,\n",
+ " \"y\": -2720.070068359375\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9038.01953125,\n",
+ " \"y\": -2719.994384765625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9038.1005859375,\n",
+ " \"y\": -2719.903076171875\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9038.1953125,\n",
+ " \"y\": -2719.830810546875\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9038.279296875,\n",
+ " \"y\": -2719.74462890625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 9038.3564453125,\n",
+ " \"y\": -2719.674560546875\n",
+ " }\n",
+ "]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Position contains the (x, y) coordinates of the object at every time step\n",
+ "print(json.dumps(traffic_scene['objects'][idx]['position'][:10], indent=4))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(0.6877052187919617, 0.6777269244194031)"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Width and length together define the size of the object, which is used to check for collisions\n",
+ "traffic_scene['objects'][idx]['width'], traffic_scene['objects'][idx]['length'] "
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "An object's heading refers to the direction it is pointing or moving in. The default coordinate system in Nocturne is right-handed, where the positive x and y axes point to the right and downwards, respectively. In a right-handed coordinate system, 0 degrees is located on the x-axis and the angle increases counter-clockwise.\n",
+ "\n",
+ "Because the scene is created from the viewpoint of an ego driver, there may be instances where the heading of certain vehicles is not available. These cases are represented by the value `-10_000`, to indicate that these steps should be filtered out or are invalid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Heading is the direction in which the object is pointing\n",
+ "plt.plot(traffic_scene['objects'][idx]['heading']);\n",
+ "plt.xlabel('Time step')\n",
+ "plt.ylabel('Heading')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[\n",
+ " {\n",
+ " \"x\": 0.634765625,\n",
+ " \"y\": 0.72265625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.46875,\n",
+ " \"y\": 0.67138671875\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.615234375,\n",
+ " \"y\": 0.89111328125\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.693359375,\n",
+ " \"y\": 0.712890625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.56640625,\n",
+ " \"y\": 0.76171875\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.712890625,\n",
+ " \"y\": 0.7568359375\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.810546875,\n",
+ " \"y\": 0.9130859375\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.947265625,\n",
+ " \"y\": 0.72265625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.83984375,\n",
+ " \"y\": 0.86181640625\n",
+ " },\n",
+ " {\n",
+ " \"x\": 0.771484375,\n",
+ " \"y\": 0.70068359375\n",
+ " }\n",
+ "]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Velocity shows the velocity in the x- and y- directions\n",
+ "print(json.dumps(traffic_scene['objects'][idx]['velocity'][:10], indent=4))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/svg+xml": [
+ "\n",
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Valid indicates whether the state of the object was observed at each time step\n",
+ "plt.xlabel('Time step')\n",
+ "plt.ylabel('IS VALID')\n",
+ "plt.plot(traffic_scene['objects'][idx]['valid'], '_', lw=5)\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'x': 9041.1259765625, 'y': -2716.647216796875}"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Each object has a goalPosition, an (x, y) position within the scene\n",
+ "traffic_scene['objects'][idx]['goalPosition']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'pedestrian'"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Finally, we have the type of the object (here, a pedestrian)\n",
+ "traffic_scene['objects'][idx]['type']"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### In-Depth: Road Points\n",
+ "\n",
+ "Road points are static objects in the scene."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "dict_keys(['geometry', 'type'])"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "traffic_scene['roads'][idx].keys()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'road_edge'"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# This point represents the edge of a road\n",
+ "traffic_scene['roads'][idx]['type']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[\n",
+ " {\n",
+ " \"x\": 8922.911733810946,\n",
+ " \"y\": -2849.426741530589\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8923.216436260553,\n",
+ " \"y\": -2849.038518766975\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8923.50673911804,\n",
+ " \"y\": -2848.63941352788\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8923.782254084921,\n",
+ " \"y\": -2848.2299596442986\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8924.042612639492,\n",
+ " \"y\": -2847.8107047886665\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8924.287466537296,\n",
+ " \"y\": -2847.382209743547\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8924.516488266596,\n",
+ " \"y\": -2846.945047650609\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8924.729371495881,\n",
+ " \"y\": -2846.49980324385\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8924.91688626026,\n",
+ " \"y\": -2846.067714357487\n",
+ " },\n",
+ " {\n",
+ " \"x\": 8925.087545312272,\n",
+ " \"y\": -2845.6286986979553\n",
+ " }\n",
+ "]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Geometry contains the (x, y) position(s) for a road point\n",
+ "# Note that this is a list of positions for road lanes and edges, but a single (x, y) position for stop signs and the like\n",
+ "print(json.dumps(traffic_scene['roads'][idx]['geometry'][:10], indent=4))"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "nocturne-research",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/examples/02_nocturne_concepts.ipynb b/examples/02_nocturne_concepts.ipynb
new file mode 100644
index 00000000..0863e19b
--- /dev/null
+++ b/examples/02_nocturne_concepts.ipynb
@@ -0,0 +1,785 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Nocturne concepts\n",
+ "\n",
+ "This page introduces the most basic elements of Nocturne. You can find further information about these [in Section 3 of the Nocturne paper](https://arxiv.org/abs/2206.09889).\n",
+ "\n",
+ "_Last update: 10/2023_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "import os\n",
+ "os.chdir('..')\n",
+ "\n",
+ "data_path = './data/example_scenario.json'"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Summary\n",
+ "\n",
+ "- Nocturne simulations are **discretized traffic scenarios**. A scenario is a constructed snapshot of a traffic situation at a particular timepoint.\n",
+ "- The state of the vehicle of focus is referred to as the **ego state**. Each vehicle has its **own partial view of the traffic scene**, and a visible state is constructed by parameterizing the view distance, head angle and cone radius of the driver. The action for each vehicle is a `(1, 3)` tuple with the acceleration, steering and head angle of the vehicle.\n",
+ "- The **step method advances the simulation** by a desired step size. By default, the dynamics of vehicles are driven by a kinematic bicycle model. If a vehicle is set to expert-controlled mode, its position, heading, and speed are updated according to a trajectory recorded from a human driver."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Simulation\n",
+ "\n",
+ "In Nocturne, a simulation discretizes an existing traffic scenario. At the moment, Nocturne supports traffic scenarios from the Waymo Open Dataset, but it can be extended to work with other driving datasets.\n",
+ "\n",
+ "\n",
+ "*Figure: An example of a set of traffic scenarios in Nocturne. Upon initialization, a start time is chosen. After each iteration we take a step in the simulation, which gets us to the next scenario. This is done until we reach the end of the simulation.*\n",
+ "\n",
+ "We show an example of this using `example_scenario.json`, where our traffic data is extracted from the Waymo Open Motion Dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from nocturne import Simulation\n",
+ "\n",
+ "scenario_config = {\n",
+ " 'start_time': 0, # When to start the simulation\n",
+ " 'allow_non_vehicles': True, # Whether to include cyclists and pedestrians \n",
+ " 'max_visible_road_points': 10, # Maximum number of road points for a vehicle\n",
+ " 'max_visible_objects': 10, # Maximum number of road objects for a vehicle\n",
+ " 'max_visible_traffic_lights': 10, # Maximum number of traffic lights in constructed view\n",
+ " 'max_visible_stop_signs': 10, # Maximum number of stop signs in constructed view\n",
+ "}\n",
+ "\n",
+ "# Create simulation\n",
+ "sim = Simulation(data_path, scenario_config)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Scenario\n",
+ "\n",
+ "A simulation consists of a set of scenarios. A scenario is a snapshot of the traffic scene at a particular timepoint. \n",
+ "\n",
+ "Here is how to create a scenario object:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get traffic scenario at timepoint\n",
+ "scenario = sim.getScenario()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `scenario` object holds the information we are interested in. Here are a couple of examples:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "33"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# The number of road objects in the scene\n",
+ "len(scenario.getObjects())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Total # moving objects: 15\n",
+ "\n",
+ "Object IDs of moving objects: \n",
+ " [0, 1, 2, 3, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32] \n"
+ ]
+ }
+ ],
+ "source": [
+ "# The road objects that moved at a particular timepoint\n",
+ "objects_that_moved = scenario.getObjectsThatMoved()\n",
+ "\n",
+ "print(f'Total # moving objects: {len(objects_that_moved)}\\n')\n",
+ "print(f'Object IDs of moving objects: \\n {[obj.getID() for obj in objects_that_moved]} ')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "128"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Number of road lines\n",
+ "len(scenario.road_lines())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[,\n",
+ " ,\n",
+ " ,\n",
+ " ,\n",
+ " ]"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "scenario.getVehicles()[:5]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# No cyclists in this scene\n",
+ "scenario.getCyclists()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Found 2 moving vehicles in scene: [3, 32]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Select all vehicles that moved\n",
+ "moving_vehicles = [obj for obj in scenario.getVehicles() if obj in objects_that_moved]\n",
+ "\n",
+ "print(f'Found {len(moving_vehicles)} moving vehicles in scene: {[vehicle.getID() for vehicle in moving_vehicles]}')"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Ego state\n",
+ "\n",
+ "The **ego state** is an array with features that describe the current vehicle. This array holds the following information: \n",
+ "- 0: length of ego vehicle\n",
+ "- 1: width of ego vehicle\n",
+ "- 2: speed of ego vehicle\n",
+ "- 3: distance to the goal position of ego vehicle\n",
+ "- 4: angle to the goal (target azimuth) \n",
+ "- 5: desired heading at goal position\n",
+ "- 6: desired speed at goal position\n",
+ "- 7: current acceleration\n",
+ "- 8: current steering position\n",
+ "- 9: current head angle"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Selected vehicle # 3\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([ 4.4936213 , 1.9770377 , 0.07662283, 4.24219 , -0.05617166,\n",
+ " -0.05909407, 1.6792779 , 0. , 0. , 0. ],\n",
+ " dtype=float32)"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Select an arbitrary vehicle\n",
+ "ego_vehicle = moving_vehicles[0]\n",
+ "\n",
+ "print(f'Selected vehicle # {ego_vehicle.getID()}')\n",
+ "\n",
+ "# Get the state for ego vehicle\n",
+ "scenario.ego_state(ego_vehicle)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Visible state\n",
+ "\n",
+ "We use the ego vehicle state, together with a view distance (how far the vehicle can see) and a view angle, to construct the **visible state**. The figure below shows this procedure for a simplified traffic scene.\n",
+ "\n",
+ "Calling `scenario.visible_state()` returns a dictionary with four matrices:\n",
+ "- `stop_signs`: The visible stop signs.\n",
+ "- `traffic_lights`: The states of the traffic lights (red, yellow, green) from the perspective of the ego driver.\n",
+ "- `road_points`: The observable road points (static elements in the scene).\n",
+ "- `objects`: The observable road objects (vehicles, pedestrians and cyclists).\n",
+ "\n",
+ "\n",
+ "\n",
+ "To investigate coordination under partial observability, agents in Nocturne can only see an obstructed view of their environment. In this simplified traffic scene, we construct the state for the red ego driver. Note that Nocturne assumes that stop signs can be viewed, even if they are behind another driver.