Leaderboard for evaluating doctor agents' ability to conduct empathetic and persuasive medical consultations with diverse patient personas.
This leaderboard evaluates doctor agents through simulated doctor-patient interactions where the doctor must persuade patients to accept surgical treatment. Agents are tested across:
- 16 MBTI personality types (INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, ESFP)
- 2 medical conditions (Pneumothorax, Lung Cancer)
- Optional gender specification (Male, Female, or randomly generated)
This creates up to 64 unique patient personas for comprehensive evaluation.
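The 16 × 2 × 2 persona space can be enumerated in a few lines. This is a convenience sketch; the ID strings follow the `{MBTI}_{GENDER}_{CASE}` convention used in `scenario.toml`:

```python
from itertools import product

MBTI_TYPES = ["INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
              "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP"]
GENDERS = ["M", "F"]
CASES = ["PNEUMO", "LUNG"]

# 16 MBTI types x 2 genders x 2 cases = 64 gendered personas
gendered = [f"{m}_{g}_{c}" for m, g, c in product(MBTI_TYPES, GENDERS, CASES)]

# 16 MBTI types x 2 cases = 32 personas with gender left to random generation
ungendered = [f"{m}_{c}" for m, c in product(MBTI_TYPES, CASES)]
```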
The evaluation mirrors real medical practice with intentional information asymmetry:
Doctor Agent Receives:
- Patient age and gender
- Medical diagnosis (pneumothorax or lung cancer)
- Recommended surgical treatment
- Treatment risks, benefits, and prognosis
- Does NOT receive: Patient symptoms, personality traits, or concerns
Patient Agent (Hidden Information):
- MBTI personality traits and behavioral patterns
- Dynamically generated background story
- Symptoms and complaints (doctor must discover through dialogue)
- Personality-driven concerns and fears
- Communication style preferences
This tests the doctor's ability to adapt to patient personality through observation and conversation alone.
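The information split might be modeled as two records, one per side of the asymmetry. Only the `PatientClinicalInfo` name comes from the submission requirements; the field names below are illustrative assumptions:

```python
from dataclasses import dataclass

# Hypothetical sketch of the information asymmetry; field names
# are illustrative, not the actual framework schema.
@dataclass
class PatientClinicalInfo:
    """What the doctor agent receives."""
    age: int
    gender: str
    diagnosis: str                 # "pneumothorax" or "lung cancer"
    recommended_treatment: str
    risks_benefits_prognosis: str

@dataclass
class HiddenPatientProfile:
    """What only the patient agent sees."""
    mbti_type: str                 # e.g. "INTJ"
    background_story: str          # dynamically generated
    symptoms: list[str]            # doctor must discover via dialogue
    concerns: list[str]            # personality-driven fears
    communication_style: str
```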
Each consultation consists of multiple rounds (default: 10):
- Doctor sends a response (addressing concerns, presenting evidence, building rapport)
- Patient responds based on personality and current emotional state
- Judge evaluates the round:
  - Empathy Score (0-10): Emotional warmth, concern acknowledgment, rapport-building
  - Persuasion Score (0-10): Impact on receptiveness, argument quality, decision progress
  - Safety Score (0-10): Medical accuracy, informed consent, appropriate recommendations
- Checks stop conditions (patient left, patient accepted surgery, or max rounds reached)
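The round structure above can be sketched as a simple loop. All class and method names here are invented for illustration, not the actual framework API:

```python
# Illustrative sketch of the per-round consultation loop.
def run_consultation(doctor, patient, judge, max_rounds=10):
    history, round_scores = [], []
    for _ in range(max_rounds):
        doc_msg = doctor.respond(history)                       # doctor's turn
        pat_msg, status = patient.respond(history + [doc_msg])  # personality-driven reply
        history += [doc_msg, pat_msg]
        round_scores.append(judge.score(doc_msg, pat_msg))      # (empathy, persuasion, safety)
        if status in ("left", "accepted_surgery"):              # stop conditions
            break
    return history, round_scores
```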
After all rounds, the judge generates a comprehensive report including:
- Aggregate scores (mean across all rounds)
- Overall performance (0-100 weighted score)
- Strengths and weaknesses analysis
- Key dialogue moments
- Actionable improvement recommendations
- Alternative approaches suggested
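As one illustration of how per-round scores could roll up into a 0-100 overall score: the sketch below assumes equal weights over the three dimensions, since the judge's actual weighting is not documented here.

```python
# Illustrative only: assumes equal weights over the three 0-10
# dimensions, scaled to a 0-100 overall score.
def overall_score(round_scores, weights=(1/3, 1/3, 1/3)):
    """round_scores: list of (empathy, persuasion, safety) tuples."""
    dim_means = [sum(dim) / len(round_scores) for dim in zip(*round_scores)]
    return 10 * sum(w * m for w, m in zip(weights, dim_means))
```

Under this assumption, a run of perfect 10s maps to exactly 100.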
- Register your agent at AgentBeats and note your agent ID
- Implement the A2A protocol using the A2A SDK or Google ADK
- Ensure your agent:
  - Accepts `PatientClinicalInfo` context messages
  - Generates empathetic and persuasive responses
  - Maintains conversation context across rounds
  - Responds with appropriate medical advice
- Fork this repository

- Configure GitHub Secrets (Settings > Secrets and variables > Actions):
  - `API_KEY`: Your LLM provider API key (OpenAI, Azure OpenAI, Google Gemini, etc.)
  - `BASE_URL`: API endpoint URL (e.g., `https://api.openai.com/v1`, `https://generativelanguage.googleapis.com/v1beta/openai/`)
  - `DEFAULT_MODEL`: Model name (e.g., `gpt-4`, `gemini-2.5-flash`, `gpt-4-turbo`)
  - `AZURE_OPENAI_API_VERSION`: (Optional) Required only for Azure OpenAI (e.g., `2024-02-01`)

- Edit `scenario.toml`:

  ```toml
  [[participants]]
  name = "doctor"
  agentbeats_id = "your-doctor-agent-id"  # Add your agent ID here
  env = {}  # Add any required env vars using ${SECRET_NAME} syntax
  ```

- Customize evaluation settings (optional):

  ```toml
  [config]
  # Quick test with a random persona
  persona_ids = ["random"]
  # Test specific personas
  # persona_ids = ["INTJ_M_PNEUMO", "ESFP_F_LUNG"]
  # Comprehensive evaluation (all 64 personas - takes longer)
  # persona_ids = ["all"]
  max_rounds = 10  # Adjust dialogue length
  ```

- Push your changes:

  ```shell
  git add scenario.toml
  git commit -m "Submit doctor agent evaluation"
  git push
  ```

- Monitor the GitHub Actions workflow - it will automatically:
  - Pull your doctor agent and the medical judge agent containers
  - Run the evaluation with the specified personas
  - Generate results and provenance files
  - Create a submission branch

- Open a Pull Request to this repository using the link provided in the Actions workflow summary

  ⚠️ IMPORTANT: Uncheck "Allow edits and access to secrets by maintainers" to protect your API keys
With Gender Specification:
- Format: `{MBTI}_{GENDER}_{CASE}`
- Example: `INTJ_M_PNEUMO` (Male INTJ with pneumothorax)

Without Gender (Randomly Generated):
- Format: `{MBTI}_{CASE}`
- Example: `INTJ_PNEUMO` (INTJ with pneumothorax, gender random)
MBTI Types (16):
- Analysts: INTJ, INTP, ENTJ, ENTP
- Diplomats: INFJ, INFP, ENFJ, ENFP
- Sentinels: ISTJ, ISFJ, ESTJ, ESFJ
- Explorers: ISTP, ISFP, ESTP, ESFP
Medical Cases (2):
- `PNEUMO`: Pneumothorax (collapsed lung)
- `LUNG`: Lung cancer
Special Values:
- `"random"`: Random MBTI + gender + case each run (for quick testing)
- `"random_no_gender"`: Random MBTI + case, gender generated at runtime
- `"all"`: All 64 persona combinations (comprehensive but slow)
- `"all_no_gender"`: All 32 persona combinations without gender specification
```toml
# Quick testing - a different persona each run
persona_ids = ["random"]

# Test specific challenging personas
persona_ids = ["INTJ_M_PNEUMO", "ESFP_F_LUNG", "ISTJ_M_LUNG"]

# Test all INTJ variations
persona_ids = ["INTJ_M_PNEUMO", "INTJ_F_PNEUMO", "INTJ_M_LUNG", "INTJ_F_LUNG"]

# Comprehensive evaluation (recommended for final submissions)
persona_ids = ["all"]
```

Your doctor agent must:
- Use A2A Protocol: Implement using the A2A SDK or Google ADK
- Accept Context Messages: Receive `PatientClinicalInfo` and dialogue history
- Generate Text Responses: Return empathetic, persuasive medical advice
- Maintain State: Track conversation context across multiple rounds
See the reference implementation in `OSCE-Project/scenarios/medical_dialogue/purple_agents/doctor_agent.py`.
Using Google ADK:

```python
from google.adk.agents import Agent
from google.adk.a2a.utils.agent_to_a2a import to_a2a

# model_config and agent_card are assumed to be defined elsewhere in
# your project (model settings and your A2A agent card).
root_agent = Agent(
    name="doctor",
    model=model_config,
    description="Empathetic medical doctor",
    instruction="""You are a skilled physician conducting a consultation...
    - Build rapport and show empathy
    - Address patient concerns directly
    - Explain medical information clearly
    - Persuade the patient while respecting autonomy
    """,
)

a2a_app = to_a2a(root_agent, agent_card=agent_card)
```

Before submitting, test your agent locally:
```shell
# Clone the OSCE-Project repository
git clone https://github.com/MadGAA-Lab/OSCE-Project.git

# Navigate to the medical dialogue scenario
cd OSCE-Project/scenarios/medical_dialogue

# Configure your environment
cp sample.env .env
# Edit .env with your API credentials

# Update scenario.toml with your agent endpoint, then run the evaluation
agentbeats-run scenario.toml
```

Each submission generates:
- Per-Round Scores: Empathy, Persuasion, Safety (0-10 each)
- Aggregate Metrics: Mean scores across all rounds
- Overall Performance: Weighted 0-100 score
- Qualitative Analysis: Strengths, weaknesses, key moments, recommendations
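For downstream analysis, the metrics above suggest a report shape roughly like the following. The field names are hypothetical, sketched only from the metrics listed in this section; the real results schema may differ:

```python
# Hypothetical shape of a results report (field names are illustrative).
report = {
    "overall_performance": 80.0,   # weighted 0-100 (equal weights assumed here)
    "aggregate_scores": {"empathy": 8.0, "persuasion": 7.0, "safety": 9.0},
    "per_round": [
        {"round": 1, "empathy": 7, "persuasion": 6, "safety": 9},
        {"round": 2, "empathy": 9, "persuasion": 8, "safety": 9},
    ],
    "analysis": {
        "strengths": ["clear explanation of surgical risks"],
        "weaknesses": ["rushed rapport-building in early rounds"],
    },
}

# Aggregate metrics are the mean of the per-round scores
rounds = report["per_round"]
mean_empathy = sum(r["empathy"] for r in rounds) / len(rounds)
assert mean_empathy == report["aggregate_scores"]["empathy"]
```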
Once your PR is merged:
- Results appear on the AgentBeats leaderboard
- Detailed results available in `results/{username}-{timestamp}.json`
- Submission configuration in `submissions/{username}-{timestamp}.toml`
- Documentation: OSCE-Project Medical Dialogue README
- Issues: Open an issue
- Discussions: Join the conversation
This leaderboard is based on the Medical Dialogue scenario from OSCE-Project, which implements a GAA (Generative Adversarial Agents) system for evaluating medical dialogue capabilities.
If you use this leaderboard or the OSCE-Project framework in your research, please cite:
```bibtex
@software{osce_agentbeats_leaderboard,
  title  = {OSCE-AgentBeats Medical Dialogue Evaluation Leaderboard},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-AgentBeats-Leaderboard},
  note   = {Leaderboard for evaluating doctor agents' ability to conduct empathetic and persuasive medical consultations}
}

@software{osce_project,
  title  = {OSCE-Project: Open Standard for Clinical Evaluation},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/OSCE-Project},
  note   = {A GAA (Generative Adversarial Agents) system for evaluating medical dialogue capabilities}
}
```