MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

MineExplorer is a benchmark for evaluating the open-world exploration capabilities of multimodal large language model (MLLM) agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Experiments show that open-world exploration remains challenging: strong models handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories, and larger models or thinking modes do not consistently translate into better performance.

1. Minecraft Sandbox Environment

MineExplorer uses a Minecraft sandbox service built on top of MineStudio, an open-source framework that provides a Minecraft simulator engine controllable via an HTTP API. The sandbox lets you programmatically create and control Minecraft game environments: spawning scenes, issuing commands, resetting episodes, and capturing first-person screenshots.

We release a ready-to-use Docker image: davidzhth/mineexplorer:0.0.1

Image Specifications

Component	Version / Details
Base OS	Ubuntu 22.04
Python	3.10
Java	OpenJDK 8 (Minecraft runtime dependency)
Framework	MineStudio (bundled Minecraft simulator engine)
Rendering	Xvfb virtual framebuffer (headless rendering, no display required)

Starting the Sandbox

docker run -d --name mineexplorer -p 8000:8000 davidzhth/mineexplorer:0.0.1

The service listens on port 8000. On first launch, Minecraft needs to load the world — this typically takes 60–120 seconds. You can check readiness with:

curl http://localhost:8000/monitor/alive
# Returns: {"status":"alive", ...}
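
Because startup can take a minute or two, a scripted setup may prefer to poll the endpoint rather than call it once. The following is a minimal sketch that uses only the Python standard library. The helper names `fetch_status` and `wait_until_alive`, the 5-second interval, and the 180-second budget are illustrative assumptions and are not part of the repository; only the `/monitor/alive` path and the `"status": "alive"` field come from the command above.

```python
import http.client
import json
import time
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the decoded JSON body of `url`, or None if the call fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON body:
        # all of these are treated as "not ready yet".
        return None


def wait_until_alive(base_url, max_wait=180.0, interval=5.0,
                     fetch=fetch_status, sleep=time.sleep):
    """Poll <base_url>/monitor/alive until it reports status "alive".

    Returns True once the service is ready and False after roughly
    `max_wait` seconds of unsuccessful attempts. `fetch` and `sleep`
    are injectable so the loop can be exercised without a sandbox.
    """
    url = base_url.rstrip("/") + "/monitor/alive"
    attempts = int(max_wait // interval) + 1
    for attempt in range(attempts):
        body = fetch(url)
        if isinstance(body, dict) and body.get("status") == "alive":
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

With the container started as shown above, `wait_until_alive("http://localhost:8000")` would block until the sandbox reports that it is alive or the budget runs out.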

Local Environment Configuration

Once the sandbox is running, set the following environment variable so that both generate_benchmark.py and eval_benchmark.py connect to it automatically:

export MC_SANDBOX_URL=http://localhost:8000

You are then ready to generate and evaluate the benchmark as described in the sections below.

2. Environment Setup

Install the required Python packages:

pip install gymnasium numpy requests pillow loguru python-dotenv typer fastapi uvicorn pydantic imageio imageio-ffmpeg

Set the required environment variables:

export AGENT_API_KEY="your_api_key"
export AGENT_API_BASE="https://your-api-endpoint/v1/openai/native"
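
Before launching a long generation or evaluation run, it can help to confirm that the variables named in this README are actually set. The sketch below is one way to do that; the function name `missing_env_vars` is hypothetical. Whether all three variables are needed in every mode (for example with `--use-vllm`) is not stated here, so treat the list as an assumption to adjust.

```python
import os

# Variable names taken from this README; the grouping is an assumption.
REQUIRED_VARS = ("MC_SANDBOX_URL", "AGENT_API_KEY", "AGENT_API_BASE")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or blank in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment and returns an empty list when everything is set.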

3. Generating the Benchmark

Use generate_benchmark.py to generate Minecraft evaluation tasks. The benchmark directory contains the benchmark used in the paper, covering single-hop to 4-hop tasks.

python generate_benchmark.py multi \
    --model aws.claude-opus-4.6 \
    --num-samples 10 \
    --k-min 1 \
    --k-max 1 \
    --candidate-num 1 \
    --output benchmark_new

Key Arguments

Argument	Description
`multi` / `single`	Multi-agent or single-agent benchmark generation. The paper uses multi-agent mode, which produces more reliable instances but is slower due to sandbox interaction.
`--model`	Model name to use for generation
`--num-samples`	Number of samples to generate
`--k-min` / `--k-max`	Range of subtask hops per sample (e.g., set both to `1` for single-hop tasks only)
`--candidate-num`	Number of candidate atomic tasks
`--output`	Output directory
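
To produce a separate set for each hop count, the arguments above can be assembled programmatically. The following sketch only builds and prints the command lines and does not run them. The helper names, the `_{k}hop` output directory suffix, and the `--candidate-num` value used in the example are assumptions; this README does not say which `--candidate-num` is appropriate for higher hop counts.

```python
import shlex


def generation_command(k, model, num_samples, candidate_num, output_root, mode="multi"):
    """Build the argv list for one fixed-hop run of generate_benchmark.py."""
    return [
        "python", "generate_benchmark.py", mode,
        "--model", model,
        "--num-samples", str(num_samples),
        "--k-min", str(k),
        "--k-max", str(k),
        "--candidate-num", str(candidate_num),
        # Hypothetical naming scheme: one output directory per hop count.
        "--output", f"{output_root}_{k}hop",
    ]


def hop_sweep(hops=range(1, 5), **kwargs):
    """Return one command per hop count (1 to 4 by default)."""
    return [generation_command(k, **kwargs) for k in hops]


commands = hop_sweep(
    model="aws.claude-opus-4.6",
    num_samples=10,
    candidate_num=1,  # placeholder value; adjust per hop count as needed
    output_root="benchmark_new",
)
for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` once the sandbox is up.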

Output Structure

benchmark_new/
├── 0000/
│   └── multi-agent/
│       ├── metadata.json        # Scene configuration
│       ├── milestones.json      # Milestone definitions
│       ├── reasoning_graph.json # Dependency graph
│       └── debate_log.json      # Agent dialogue log
├── 0001/
│   └── multi-agent/
│       └── ...
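
Given the layout above, a quick consistency check can flag instances where generation stopped partway. This is a minimal sketch under the assumption that a complete multi-agent instance contains exactly the four files listed; the function name `incomplete_instances` is hypothetical, and the directory name used in single-agent mode is not documented here, so it is left as a parameter.

```python
from pathlib import Path

# File names taken from the output structure above.
EXPECTED_FILES = (
    "metadata.json",
    "milestones.json",
    "reasoning_graph.json",
    "debate_log.json",
)


def incomplete_instances(benchmark_dir, mode_dir="multi-agent", expected=EXPECTED_FILES):
    """Map each instance ID to the expected files it is missing.

    Instances that contain every expected file are omitted from the result.
    """
    report = {}
    for instance in sorted(Path(benchmark_dir).iterdir()):
        if not instance.is_dir():
            continue  # skip stray files at the top level
        missing = [
            name for name in expected
            if not (instance / mode_dir / name).is_file()
        ]
        if missing:
            report[instance.name] = missing
    return report
```

For example, `incomplete_instances("benchmark_new")` returns an empty dictionary when every instance directory is complete.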

Want to Generate Harder Tasks?

You can generate extremely challenging tasks by increasing --k-min, --k-max, and --candidate-num. For example, the following command generates tasks with 8–12 prerequisite hops and 15 candidate atomic tasks:

python generate_benchmark.py multi \
    --model aws.claude-opus-4.6 \
    --num-samples 10 \
    --k-min 8 \
    --k-max 12 \
    --candidate-num 15 \
    --output benchmark_hard

This produces tasks with deeply nested, multi-branch dependency graphs that are far more complex than standard benchmark instances.
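
As a rough illustration of what a multi-branch dependency graph implies for an agent, the toy example below orders prerequisites and measures the longest prerequisite chain. Everything in it is hypothetical: the task names are invented, the dictionary layout is not the `reasoning_graph.json` schema, and `chain_length` is only an informal stand-in for the hop count defined in the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical toy graph (not the reasoning_graph.json schema):
# each task maps to the set of tasks that must be finished first.
toy_graph = {
    "take_key": {"pull_lever"},
    "open_door": {"take_key"},
    "light_torch": set(),
    "reach_goal": {"open_door", "light_torch"},
}


def execution_order(graph):
    """Return one ordering in which every prerequisite precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())


def chain_length(graph, task, cache=None):
    """Number of tasks on the longest prerequisite chain that ends at `task`."""
    cache = {} if cache is None else cache
    if task not in cache:
        prereqs = graph.get(task, ())
        cache[task] = 1 + max(
            (chain_length(graph, p, cache) for p in prereqs), default=0
        )
    return cache[task]


order = execution_order(toy_graph)
print(order)
print(chain_length(toy_graph, "reach_goal"))
```

In this toy graph the goal sits at the end of one long branch and one short branch, so an agent has to finish both before the final step, even though neither prerequisite is stated in the goal itself.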

4. Evaluating the Benchmark

Use eval_benchmark.py to run an agent on the generated benchmark and evaluate its performance.

Using an OpenAI-compatible API

python eval_benchmark.py \
    --model aws.claude-opus-4.6 \
    --benchmark-dir benchmark_new \
    --output-dir results \
    --num-workers 10 \
    --resume

Using a Local vLLM Service

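Start the vLLM server first. Note that the sandbox container above is also published on host port 8000, so if both run on the same machine, one of the two has to use a different port: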

python -m vllm.entrypoints.openai.api_server \
    --model Qwen2.5-7B \
    --port 8000

Then run evaluation:

python eval_benchmark.py \
    --model Qwen2.5-7B \
    --benchmark-dir benchmark_new \
    --output-dir results \
    --num-workers 10 \
    --use-vllm

Common Arguments

Argument	Description
`--model`	Model to use for evaluation
`--benchmark-dir`	Path to the benchmark directory
`--output-dir`	Directory to save results
`--num-workers`	Number of parallel sandbox workers
`--resume`	Resume from checkpoint (skip completed tasks)
`--limit`	Limit number of evaluation samples (for testing)
`--use-vllm`	Use a local vLLM service for inference (see "Using a Local vLLM Service")

Output Structure

results/
└── aws.claude-opus-4.6/
    ├── 0000/
    │   ├── result.json       # Evaluation result
    │   ├── episode.mp4       # Episode replay video
    │   └── messages/         # Conversation logs
    ├── 0001/
    │   └── ...
    └── eval_summary.json     # Aggregated statistics
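
The layout above also makes it easy to see which tasks still lack a result, for instance before rerunning with `--resume`. The sketch below compares a benchmark directory with one model's results directory. It assumes that the presence of `result.json` marks a finished task, which is a guess about the convention rather than a description of how `eval_benchmark.py` decides; the function names are hypothetical.

```python
from pathlib import Path


def task_ids(root):
    """Return the names of the sub-directories directly under `root`."""
    root = Path(root)
    if not root.is_dir():
        return set()
    return {p.name for p in root.iterdir() if p.is_dir()}


def pending_tasks(benchmark_dir, model_results_dir):
    """List benchmark task IDs with no result.json in the results tree."""
    results = Path(model_results_dir)
    return sorted(
        tid for tid in task_ids(benchmark_dir)
        if not (results / tid / "result.json").is_file()
    )
```

For example, `pending_tasks("benchmark_new", "results/aws.claude-opus-4.6")` lists the task IDs that have not produced a `result.json` yet.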

Results

Citation

If you find this work useful, please cite:

@misc{ju2026mineexplorerevaluatingopenworldexploration,
      title={MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft}, 
      author={Tianjie Ju and Yueqing Sun and Zheng Wu and Wei Zhang and Yaqi Huo and Xi Su and Qi Gu and Xunliang Cai and Gongshen Liu and Zhuosheng Zhang},
      year={2026},
      eprint={2605.30931},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.30931}, 
}

License

This project is released under the MIT License.
