Sequential Skill Preservation with Curiosity-driven Reinforcement Learning

This project implements an online PPO training system that uses pre-trained VAE and HMM models to learn sequential skills in NetHack environments with curiosity-driven intrinsic rewards.

Features

🧠 Pre-trained VAE + HMM Models: Load skill representations from HuggingFace
🎮 MiniHack Integration: 161+ NetHack-based RL environments
🔍 Curiosity-Driven Learning: Multiple intrinsic motivation mechanisms
📊 Experiment Tracking: Weights & Biases integration
🤗 Model Sharing: Automatic HuggingFace model uploads
🚀 Production Ready: Clean, tested, and maintainable codebase

The full report is available as thesis.pdf in the repository root.

Available Pre-trained Models

All the latest trained PPO models are available on Hugging Face under the namespace CatkinChen/nethack-*.

🔗 Model Repository: https://huggingface.co/CatkinChen

Available Models

All models follow the naming convention CatkinChen/nethack-* and include:

PPO Policy Networks: Trained on various MiniHack environments
VAE + HMM Models: For skill representation and sequential learning
Complete Training Artifacts: Including training curves, configurations, and logs

To use any of these models, simply reference them by their full repository ID (e.g., CatkinChen/nethack-vae-hmm, CatkinChen/nethack-hmm) in your training scripts.

Requirements

System Requirements

Operating System: Ubuntu 20.04+ (tested), macOS 10.15+, or other Unix-like systems
Python: 3.10 or higher (required)
Memory: 40GB+ recommended for training
Storage: 30GB+ free disk space for dependencies and model storage
GPU: CUDA-compatible GPU recommended for training (optional for inference)

Hardware Recommendations

Training: NVIDIA GPU with 40GB+ VRAM for optimal performance
CPU: Multi-core processor (4+ cores recommended)
Development: Any modern machine with 8GB+ RAM for code development and testing

Software Dependencies

Core: Python 3.10+, Poetry package manager
System Libraries: CMake, Boost, SDL2, X11 development headers
Python Packages: PyTorch 2.7+, Transformers, Gymnasium, W&B, HuggingFace Hub
Optional: CUDA toolkit for GPU acceleration
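
The requirements above can be checked before starting the setup. The following is a minimal sketch using only the Python standard library; the function name preflight and the choice of which executables to look for (cmake, poetry) are illustrative assumptions, not part of this repository.

```python
import shutil
import sys

# Thresholds taken from the requirements listed above.
MIN_PYTHON = (3, 10)
MIN_FREE_DISK_GB = 30
# Assumption: these executable names correspond to the listed tools.
REQUIRED_TOOLS = ("cmake", "poetry")


def preflight(path="."):
    """Return a dict mapping each check name to True (pass) or False (fail)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    checks = {
        "python>=3.10": sys.version_info[:2] >= MIN_PYTHON,
        "disk>=30GB": free_gb >= MIN_FREE_DISK_GB,
    }
    for tool in REQUIRED_TOOLS:
        checks[tool] = shutil.which(tool) is not None
    return checks


for name, ok in preflight().items():
    print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

A failed check only indicates that a listed requirement was not detected on this machine; it does not cover GPU memory or the system libraries installed in step 2.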

Supported Environments

Primary: 161+ MiniHack NetHack-based environments
Tested On: MiniHack-Room-5x5-v0, MiniHack-Quest-Hard-v0, MiniHack-KeyRoom-S15-v0
Platform: Works on Linux and macOS (macOS needs some additional setup for system dependencies)
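
With more than a hundred registered environments, it can help to group the IDs by task family. The sketch below assumes that IDs follow the MiniHack-<Family>-...-v<N> shape seen in the names above; the helper group_minihack_envs is hypothetical and works on any iterable of strings.

```python
from collections import defaultdict


def group_minihack_envs(env_ids):
    """Group IDs such as 'MiniHack-Room-5x5-v0' by task family ('Room')."""
    families = defaultdict(list)
    for env_id in env_ids:
        parts = env_id.split("-")
        # Skip anything that is not a MiniHack ID (for example 'CartPole-v1').
        if len(parts) >= 3 and parts[0] == "MiniHack":
            families[parts[1]].append(env_id)
    return {family: sorted(ids) for family, ids in sorted(families.items())}


sample = [
    "MiniHack-Room-5x5-v0",
    "MiniHack-Room-Random-15x15-v0",
    "MiniHack-KeyRoom-S15-v0",
    "CartPole-v1",
]
for family, ids in group_minihack_envs(sample).items():
    print(f"{family}: {len(ids)} environment(s)")
```

Once the installation is complete, the same function can be given gym.envs.registry.keys() (after import minihack) instead of the sample list.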

Setup Instructions

1. Clone Repository and Submodules

git clone https://github.com/XuChenCatkin/SequentialSkillRL.git
cd SequentialSkillRL
git submodule update --init --recursive

2. Install System Dependencies

sudo apt-get update
sudo apt-get install -y build-essential libboost-context-dev python3-dev libsdl2-dev libx11-dev cmake bison flex pkg-config

3. Install Poetry

sudo apt install -y pipx
pipx install poetry
pipx ensurepath

⚠️ Important: After installing Poetry, you MUST restart your terminal or source your shell profile:

source ~/.bashrc   # For bash users
# OR
source ~/.zshrc    # For zsh users
# OR simply close and reopen your terminal

Verify Poetry installation:

poetry --version

4. Build MiniHack Wheel (Required for Proper Environment Registration)

cd minihack
python setup.py bdist_wheel
cd ..

5. Install Dependencies

# Option 1: Use the provided installation script (recommended - installs everything)
./install_minihack.sh

# Option 2: Install all dependencies with Poetry (simple one-command approach)
poetry install

# Option 3: Manual step-by-step installation
# First build and install MiniHack wheel (bypasses Poetry hash issues)
cd minihack && python setup.py bdist_wheel && cd ..
pip install minihack/dist/minihack-1.0.2+95b11cc-py3-none-any.whl --force-reinstall
# Then install all other dependencies
poetry install

# Option 4: If Poetry fails due to lock file issues, update and install
poetry lock && poetry install

6. Verify Installation

# Test that MiniHack environments are properly registered
poetry run python -c "
import gymnasium as gym
import minihack
envs = [env for env in gym.envs.registry.keys() if 'MiniHack' in env]
print(f'✅ Found {len(envs)} MiniHack environments')
assert len(envs) > 0, 'MiniHack environments not found!'
print('✅ Installation successful!')
"

7. Environment Activation (Optional)

# Change to the project directory first (adjust the path to your checkout)
cd /workspace/SequentialSkillRL
# Get the environment path and activate it
source $(poetry env info --path)/bin/activate

8. Login to External Services

# Login to Weights & Biases for experiment tracking
wandb login

# Login to Hugging Face CLI for model uploads
hf auth login

Troubleshooting

Poetry Command Not Found

If you see "poetry: command not found" when running ./install_minihack.sh:

Ensure Poetry is installed: pipx install poetry
Update your PATH: pipx ensurepath
Restart your terminal or run: source ~/.bashrc
Verify: poetry --version

MiniHack Environments Not Found

If the verification step reports 0 MiniHack environments (the assertion fails with "MiniHack environments not found!"), ensure you:

Built the MiniHack wheel: cd minihack && python setup.py bdist_wheel
Installed with pip: pip install minihack/dist/minihack-1.0.2+95b11cc-py3-none-any.whl --force-reinstall
Updated submodules: git submodule update --init --recursive

Poetry Installation Issues

If Poetry fails to install dependencies:

# Option 1: Simple Poetry install (works in most cases)
poetry install

# Option 2: Use the fixed installation script (recommended)
./install_minihack.sh

# Option 3: Manual step-by-step (for MiniHack issues)
cd minihack && python setup.py bdist_wheel && cd ..
pip install minihack/dist/minihack-1.0.2+95b11cc-py3-none-any.whl --force-reinstall
poetry install

# Option 4: Update Poetry lock file if there are hash mismatches
poetry lock && poetry install

# Option 5: Clear Poetry cache and reinstall
poetry cache clear --all .
poetry install

CMake Issues

If you encounter cmake-related errors during NLE compilation:

pip install --upgrade cmake

Quick Start

Training an Agent

from training.online_rl import train_online_ppo_with_pretrained_models

# Train PPO with pre-trained VAE and HMM models from HuggingFace
# All models are available at https://huggingface.co/CatkinChen with naming convention 'nethack-*'
results = train_online_ppo_with_pretrained_models(
    vae_repo_id="CatkinChen/nethack-vae-hmm",
    hmm_repo_id="CatkinChen/nethack-hmm", 
    env_name="MiniHack-Room-5x5-v0",
    total_timesteps=50000,
    use_wandb=True,
    wandb_project="SequentialSkillRL",
    push_to_hub=True,  # Upload all components to unified repo
    hub_repo_id_vae_hmm="your-username/nethack-complete-model",
    device="cuda"
)
print(f"Training completed! Run: {results['run_name']}")

Loading Pre-trained Models

# Load a trained PPO checkpoint from CatkinChen's HuggingFace repository
# Browse all available models at: https://huggingface.co/CatkinChen
# All NetHack models follow the pattern: CatkinChen/nethack-*
from training.online_rl import train_online_ppo_with_pretrained_models

# Example: resume from a PPO checkpoint (vae_repo_id and hmm_repo_id are set to None here)
results = train_online_ppo_with_pretrained_models(
    vae_repo_id=None,
    hmm_repo_id=None,
    ppo_repo_id="CatkinChen/nethack-ppo-ablation-baseline_full_curiosity",
    reset_global_steps=True,
    # ... other parameters
)

Quick Test Mode

# Quick test with minimal steps
results = train_online_ppo_with_pretrained_models(
    vae_repo_id=None,
    hmm_repo_id=None,
    ppo_repo_id="CatkinChen/nethack-ppo-ablation-baseline_full_curiosity",
    test_mode=True,
    test_episodes=10,
    use_wandb=False,
    push_to_hub=False
)

Custom Configurations

from rl.ppo import PPOConfig, CuriosityConfig
from training.online_rl import train_online_ppo_with_pretrained_models

# Custom PPO configuration
ppo_config = PPOConfig(
    num_envs=16,
    rollout_len=256,
    learning_rate=1e-4,
    clip_coef=0.1
)

# Custom curiosity configuration
curiosity_config = CuriosityConfig(
    use_dyn_kl=True,
    use_skill_entropy=True,
    use_rnd=False,
    eta0_dyn=0.5,
    tau_dyn=1e6
)

results = train_online_ppo_with_pretrained_models(
    vae_repo_id="your-username/nethack-vae",
    hmm_repo_id="your-username/nethack-hmm",
    ppo_config=ppo_config,
    curiosity_config=curiosity_config,
    total_timesteps=1000000
)
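
The parameter names eta0_dyn and tau_dyn suggest an initial weight and a decay time scale for the dynamics bonus, but this README does not state the form of the schedule. The sketch below assumes plain exponential decay purely for illustration; annealed_coefficient is a hypothetical helper, not a function from rl.ppo.

```python
import math


def annealed_coefficient(eta0, tau, step):
    """Assumed schedule: decay a bonus weight from eta0 with time constant tau (in steps)."""
    return eta0 * math.exp(-step / tau)


# eta0 and tau are the values from the CuriosityConfig example above.
for step in (0, 500_000, 1_000_000):
    print(step, round(annealed_coefficient(0.5, 1e6, step), 4))
```

Under this assumption the weight starts at eta0 and falls to eta0 / e after tau environment steps, so a larger tau_dyn keeps the dynamics bonus active for longer.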

Command Line Usage

# Basic training with latest pre-trained models from CatkinChen's HuggingFace repository
# All models available at: https://huggingface.co/CatkinChen (pattern: nethack-*)
python main.py rl baseline full_curiosity \
  --env MiniHack-Room-5x5-v0 \
  --steps 50000 \
  --resume CatkinChen/nethack-ppo-ablation-baseline_full_curiosity \
  --reset_step

# Training with different model and reward configurations
python main.py rl baseline curiosity_dyn_only \
  --env MiniHack-Quest-Hard-v0 \
  --steps 100000 \
  --resume CatkinChen/nethack-ppo-ablation-baseline_curiosity_dyn_only \
  --reset_step

# Training without HMM (VAE only)
python main.py rl no_hmm curiosity_dyn_only \
  --env MiniHack-Room-Random-15x15-v0 \
  --steps 50000 \
  --resume CatkinChen/nethack-ppo-ablation-no_hmm_curiosity_dyn_only \
  --reset_step

# Training with Random Network Distillation
python main.py rl baseline rnd \
  --env MiniHack-KeyRoom-S15-v0 \
  --steps 100000 \
  --resume CatkinChen/nethack-ppo-ablation-baseline_rnd \
  --reset_step

# Training with no intrinsic rewards (extrinsic only)
python main.py rl baseline no_intrinsic \
  --env MiniHack-River-Narrow-v0 \
  --steps 50000 \
  --resume CatkinChen/nethack-ppo-ablation-no_hmm_no_intrinsic \
  --reset_step

# Custom seed for reproducibility
python main.py rl baseline full_curiosity \
  --env MiniHack-Room-5x5-v0 \
  --steps 50000 \
  --seed 123 \
  --resume CatkinChen/nethack-ppo-ablation-baseline_full_curiosity \
  --reset_step
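
When launching many of these runs from a script, the command line can be assembled programmatically. The sketch below uses only the flags shown above; build_train_command is a hypothetical helper, and the checkpoint naming pattern is inferred from the examples (one of them resumes from a differently named checkpoint, so treat the pattern as a convention, not a guarantee).

```python
import shlex


def build_train_command(model, reward, env, steps, seed=None, resume=True):
    """Build the argv list for one `python main.py rl ...` training run."""
    argv = ["python", "main.py", "rl", model, reward, "--env", env, "--steps", str(steps)]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if resume:
        # Assumption: checkpoint names follow nethack-ppo-ablation-<model>_<reward>.
        argv += ["--resume", f"CatkinChen/nethack-ppo-ablation-{model}_{reward}", "--reset_step"]
    return argv


cmd = build_train_command("baseline", "rnd", "MiniHack-KeyRoom-S15-v0", 100000)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.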

PPO Ablation Highlights (from the Thesis)

Section 4.3 of the master's thesis details a comprehensive PPO ablation comparing VAE+PPO baselines against the proposed VAE+HMM+PPO agent across MiniHack environments. The tables and discussion below condense those findings and are accompanied by the plots in ppo_analysis/.

MiniHack Room (Random 15x15)

HMM prior improves stability: Even without intrinsic bonuses, adding the sticky HDP-HMM prior raises success from 37.08% to 42.99% by enforcing persistent latent skills and reducing dithering in partially observed rooms.
Dynamics surprise drives exploration: The dynamics-only bonus is the dominant curiosity signal, lifting success to 45.10% with HMMs (41.78% without) before decaying as the world model becomes confident.
Full curiosity = best completion rate: Combining all curiosity terms yields the highest success (45.30%) and shortest episodes (217.8 ± 106.9 steps), although the extra exploration penalties mean the extrinsic return is slightly lower than dynamics-only runs.
Skill entropy & transition novelty are gated off: In single-skill rooms these signals rarely activate, so performance gains over dynamics-only curiosity are marginal.

| Configuration | Success Rate (%) | Extrinsic Return | Episode Length (steps) |
|---|---|---|---|
| No HMM, no intrinsic | 37.08 | 0.243 ± 0.548 | 230.2 ± 103.8 |
| No HMM, dynamics only | 41.78 | 0.258 ± 0.583 | 223.2 ± 106.0 |
| No HMM, RND | 41.94 | 0.259 ± 0.576 | 220.5 ± 108.4 |
| HMM, no intrinsic | 42.99 | 0.168 ± 0.618 | 226.1 ± 102.7 |
| HMM, dynamics only | 45.10 | 0.190 ± 0.630 | 221.3 ± 105.4 |
| HMM, skill entropy only | 40.05 | 0.133 ± 0.622 | 230.1 ± 101.9 |
| HMM, transition novelty only | 42.52 | 0.174 ± 0.626 | 223.2 ± 106.2 |
| HMM, full curiosity | 45.30 | 0.195 ± 0.630 | 217.8 ± 106.9 |
| HMM, RND | 43.20 | 0.165 ± 0.645 | 223.3 ± 105.5 |
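
To make the size of each effect explicit, the success-rate differences quoted in the bullets can be recomputed from the table. This is a small worked example; the dictionary holds a subset of the rows above, and gain is an illustrative helper.

```python
# Success rates (%) copied from the Room (Random 15x15) table above.
room_success = {
    "No HMM, no intrinsic": 37.08,
    "No HMM, dynamics only": 41.78,
    "HMM, no intrinsic": 42.99,
    "HMM, dynamics only": 45.10,
    "HMM, full curiosity": 45.30,
}


def gain(after, before, table=room_success):
    """Difference in success rate between two configurations, in percentage points."""
    return round(table[after] - table[before], 2)


print("HMM prior alone:", gain("HMM, no intrinsic", "No HMM, no intrinsic"))
print("Dynamics bonus on top of HMM:", gain("HMM, dynamics only", "HMM, no intrinsic"))
print("Full curiosity over dynamics only:", gain("HMM, full curiosity", "HMM, dynamics only"))
```

The HMM prior accounts for the largest single step (5.91 points), while full curiosity adds only 0.2 points over the dynamics-only bonus, which matches the observation that the remaining signals rarely activate in single-skill rooms.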

Figures (plots in ppo_analysis/):
Success rates for HMM + curiosity variants.
Success rates without intrinsic bonuses (HMM vs. no HMM).
Dynamics surprise shaping early exploration.
Random Network Distillation inducing longer wandering.
Sticky HDP-HMM prior encouraging consistent skills.
Transition novelty rarely activates in single-skill rooms.
Intrinsic reward decomposition showing dynamics dominance.

MiniHack River (Narrow)

Dynamics + HMM wins: Transferring the pretrained models into MiniHack River shows the HMM with dynamics-only curiosity achieving the best success (46.93%), beating both no-HMM baselines (38.58% / 37.22%) and the full curiosity variant (42.69%).
Skill-aware representation matters: The sticky HMM stabilises PPO inputs, enabling faster transfer from Room training and more reliable execution of the navigation→push skill sequence.
Targeted novelty beats generic exploration: The dynamics KL bonus focuses on contact uncertainty (e.g., boulder pushes), providing low-interference guidance, whereas RND encourages wandering, yielding the longest episodes (263.6 steps) and the lowest mean returns (0.145).
Success rate is the most faithful metric: Sparse rewards and penalty accumulation mean extrinsic returns lag behind completion rates; monitoring success is more indicative of real progress on this contact-heavy task.

| Configuration | Success Rate (%) | Extrinsic Return | Episode Length (steps) |
|---|---|---|---|
| No HMM, no intrinsic | 38.58 | 0.323 ± 0.486 | 242.2 ± 140.6 |
| No HMM, dynamics only | 37.22 | 0.326 ± 0.485 | 251.3 ± 133.6 |
| HMM, dynamics only | 46.93 | 0.240 ± 0.544 | 246.9 ± 122.6 |
| HMM, full curiosity | 42.69 | 0.230 ± 0.522 | 252.1 ± 124.5 |
| HMM, RND | 37.90 | 0.145 ± 0.527 | 263.6 ± 121.6 |
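
The point that success rate and extrinsic return can disagree is visible directly in the table. The short example below ranks the River configurations by each metric; the numbers are copied from the rows above and nothing else is assumed.

```python
# (success rate %, mean extrinsic return) copied from the River (Narrow) table above.
river = {
    "No HMM, no intrinsic": (38.58, 0.323),
    "No HMM, dynamics only": (37.22, 0.326),
    "HMM, dynamics only": (46.93, 0.240),
    "HMM, full curiosity": (42.69, 0.230),
    "HMM, RND": (37.90, 0.145),
}

# Rank configurations from best to worst under each metric.
by_success = sorted(river, key=lambda name: river[name][0], reverse=True)
by_return = sorted(river, key=lambda name: river[name][1], reverse=True)

print("Best by success rate:", by_success[0])
print("Best by extrinsic return:", by_return[0])
```

The two rankings pick different winners: the HMM with dynamics-only curiosity completes the task most often, while the no-HMM dynamics-only run has the highest mean return.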

Figures (plots in ppo_analysis/):
Success rates for HMM curiosity combinations in River.
Success rates contrasting dynamics vs. RND bonuses.
No-HMM baselines highlighting transfer gap.
Dynamics-only transfer across checkpoints.
Dynamics surprise focusing on contact interactions.
RND exploration yielding longer episodes.
HDP-HMM prior stabilising latent skill transitions.
Transition novelty emphasising skill sequencing.
Intrinsic reward decomposition showing dynamics dominance.

Project Structure

SequentialSkillRL/
├── src/                     # Core source code
│   ├── model.py            # VAE and HMM model definitions
│   ├── skill_space.py      # Skill space management
│   └── data_collection.py  # Data collection utilities
├── training/                # Training pipeline
│   ├── train.py            # Main training script
│   ├── online_rl.py        # Online PPO training system
│   ├── training_utils.py   # Training utilities
│   └── README_online_rl.md # Training documentation
├── rl/                      # Reinforcement learning components
│   └── ppo.py              # PPO implementation
├── utils/                   # Utility functions
│   ├── env_utils.py        # Environment utilities
│   ├── action_utils.py     # Action space utilities
│   ├── analysis.py         # Analysis and visualization
│   └── math_utils.py       # Mathematical utilities
├── nle/                     # NetHack Learning Environment (submodule)
├── minihack/               # MiniHack environments (submodule)
├── runs/                    # Training run outputs and logs
├── logs/                    # Training and experiment logs
├── wandb/                   # Weights & Biases experiment tracking
├── checkpoints_hmm/         # HMM model checkpoints
├── bin_count_analysis/      # Action frequency analysis
├── hmm_analysis/            # HMM analysis results
├── vae_analysis/           # VAE analysis results
├── vae_hmm_analysis/       # Combined VAE+HMM analysis
├── ppo_analysis/           # PPO ablation plots
├── main.py                 # Main entry point
├── analyze_ablations.py    # Ablation study analysis
├── run_ablations.py        # Ablation study runner
├── run_experiments.sh      # Experiment automation script
├── install_minihack.sh     # Installation helper script
├── evaluation.ipynb        # Evaluation notebook
├── nld_tutorial.ipynb      # NLD tutorial notebook
├── trial.ipynb             # Trial experiments notebook
├── pyproject.toml          # Poetry project configuration
└── poetry.lock             # Poetry dependency lock file

Key Components

1. VAE + HMM Models

VAE: Encodes NetHack observations into latent skill representations
HMM: Models sequential skill transitions and dynamics
Integration: Combined for curiosity-driven exploration
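
The ablation section refers to a sticky HDP-HMM prior that favours persistent skills. As a rough illustration of what "sticky" means, the sketch below computes the expected transition matrix of a finite, truncated sticky prior, where row j has mean (alpha * beta + kappa * one_hot(j)) / (alpha + kappa). The values of beta, alpha and kappa are made up for the example, and this is not the code in src/model.py.

```python
def sticky_prior_mean(beta, alpha, kappa):
    """Expected transition matrix under a (truncated) sticky prior.

    Row j is (alpha * beta + kappa * one_hot(j)) / (alpha + kappa), so a larger
    kappa puts more mass on staying in the current skill.
    """
    n = len(beta)
    return [
        [(alpha * beta[k] + (kappa if k == j else 0.0)) / (alpha + kappa) for k in range(n)]
        for j in range(n)
    ]


beta = [0.5, 0.3, 0.2]  # illustrative global skill weights, must sum to 1
matrix = sticky_prior_mean(beta, alpha=1.0, kappa=4.0)
for row in matrix:
    print([round(p, 3) for p in row])
```

With kappa = 0 every row equals beta; raising kappa moves probability onto the diagonal, which is the mechanism behind the reduced dithering described in the Room results.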

2. Online PPO Training

Environment: MiniHack-based NetHack environments
Policy: Uses skill-aware policy networks
Intrinsic Rewards: Dynamics KL divergence, skill entropy, transition novelty, RND
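
This README names the intrinsic reward terms without giving their formulas. The sketch below shows the two standard quantities that bonuses with these names are usually built from: the Shannon entropy of a categorical skill posterior and the KL divergence between diagonal Gaussians. How rl/ppo.py actually defines, scales and combines its bonuses is not stated here, so both functions are assumptions for illustration.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution over skills."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between diagonal Gaussians given per-dimension means and variances."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


uniform = [0.25, 0.25, 0.25, 0.25]
print("entropy of a uniform posterior over 4 skills:", categorical_entropy(uniform))
print("KL between unit Gaussians one unit apart:", diag_gaussian_kl([0.0], [1.0], [1.0], [1.0]))
```

Entropy is largest when the posterior over skills is uniform and zero when one skill is certain; the KL term is zero when the two latent distributions coincide and grows as they diverge.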

3. HuggingFace Integration

Model Upload: Automatically uploads trained models (PPO policy, VAE, HMM) to unified repositories
Training Artifacts: Uploads training curves, logs, and configuration files
Model Cards: Generates comprehensive model documentation
Separate Repositories: Supports loading VAE and HMM from different repositories

Training Artifacts Include:

Training Curves: Reward progression and performance metrics over time
Configuration Files: Complete training hyperparameters and settings
Model Cards: Detailed documentation with usage examples
Training Logs: Step-by-step training metrics and evaluation results

Usage Example:

# Train and upload complete model with training artifacts
# Note: HuggingFace integration is built into the training pipeline
python main.py rl baseline full_curiosity \
  --env MiniHack-Room-5x5-v0 \
  --steps 100000

# Models are automatically loaded from CatkinChen's repositories
# Training results and checkpoints are saved locally in runs/ directory

Tracking: Real-time metrics and model uploading

4. Experiment Management

W&B Integration: Automatic experiment tracking
HuggingFace Hub: Model versioning and sharing
Checkpointing: Resume training from any point

Contributing

Fork the repository
Create your feature branch: git checkout -b feature/amazing-feature
Commit your changes: git commit -m 'Add amazing feature'
Push to the branch: git push origin feature/amazing-feature
Open a Pull Request

Acknowledgments

NetHack Learning Environment (NLE) team
MiniHack team for the extensive environment suite
HuggingFace for model hosting and sharing infrastructure
