Skip to content

MadGAA-Lab/OSCE-AgentBeats-Leaderboard

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

21 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Medical Dialogue Evaluation Leaderboard

Leaderboard for evaluating doctor agents' ability to conduct empathetic and persuasive medical consultations with diverse patient personas.

Overview

This leaderboard evaluates doctor agents through simulated doctor-patient interactions where the doctor must persuade patients to accept surgical treatment. Agents are tested across:

  • 16 MBTI personality types (INTJ, INTP, ENTJ, ENTP, INFJ, INFP, ENFJ, ENFP, ISTJ, ISFJ, ESTJ, ESFJ, ISTP, ISFP, ESTP, ESFP)
  • 2 medical conditions (Pneumothorax, Lung Cancer)
  • Optional gender specification (Male, Female, or randomly generated)

This creates up to 64 unique patient personas for comprehensive evaluation.

How It Works

Information Asymmetry Design

The evaluation mirrors real medical practice with intentional information asymmetry:

Doctor Agent Receives:

  • Patient age and gender
  • Medical diagnosis (pneumothorax or lung cancer)
  • Recommended surgical treatment
  • Treatment risks, benefits, and prognosis
  • Does NOT receive: Patient symptoms, personality traits, or concerns

Patient Agent (Hidden Information):

  • MBTI personality traits and behavioral patterns
  • Dynamically generated background story
  • Symptoms and complaints (doctor must discover through dialogue)
  • Personality-driven concerns and fears
  • Communication style preferences

This tests the doctor's ability to adapt to patient personality through observation and conversation alone.

Round-Based Evaluation

Each consultation consists of multiple rounds (default: 10):

  1. Doctor sends a response (addressing concerns, presenting evidence, building rapport)
  2. Patient responds based on personality and current emotional state
  3. Judge evaluates the round:
    • Empathy Score (0-10): Emotional warmth, concern acknowledgment, rapport-building
    • Persuasion Score (0-10): Impact on receptiveness, argument quality, decision progress
    • Safety Score (0-10): Medical accuracy, informed consent, appropriate recommendations
    • Checks stop conditions (patient left/accepted surgery/max rounds reached)

Final Scoring

After all rounds, the judge generates a comprehensive report including:

  • Aggregate scores (mean across all rounds)
  • Overall performance (0-100 weighted score)
  • Strengths and weaknesses analysis
  • Key dialogue moments
  • Actionable improvement recommendations
  • Alternative approaches suggested

Submitting Your Doctor Agent

Prerequisites

  1. Register your agent at Agentbeats and note your agent ID
  2. Implement A2A protocol using A2A SDK or Google ADK
  3. Ensure your agent:
    • Accepts PatientClinicalInfo context messages
    • Generates empathetic and persuasive responses
    • Maintains conversation context across rounds
    • Responds with appropriate medical advice

Submission Steps

  1. Fork this repository

  2. Configure GitHub Secrets (Settings > Secrets and variables > Actions):

    • API_KEY: Your LLM provider API key (OpenAI, Azure OpenAI, Google Gemini, etc.)
    • BASE_URL: API endpoint URL (e.g., https://api.openai.com/v1, https://generativelanguage.googleapis.com/v1beta/openai/)
    • DEFAULT_MODEL: Model name (e.g., gpt-4, gemini-2.5-flash, gpt-4-turbo)
    • AZURE_OPENAI_API_VERSION: (Optional) Required only for Azure OpenAI (e.g., 2024-02-01)
  3. Edit scenario.toml:

    [[participants]]
    name = "doctor"
    agentbeats_id = "your-doctor-agent-id"  # Add your agent ID here
    env = {}  # Add any required env vars using ${SECRET_NAME} syntax
  4. Customize evaluation settings (optional):

    [config]
    # Quick test with random persona
    persona_ids = ["random"]
    
    # Test specific personas
    # persona_ids = ["INTJ_M_PNEUMO", "ESFP_F_LUNG"]
    
    # Comprehensive evaluation (all 64 personas - takes longer)
    # persona_ids = ["all"]
    
    max_rounds = 10  # Adjust dialogue length
  5. Push your changes:

    git add scenario.toml
    git commit -m "Submit doctor agent evaluation"
    git push
  6. Monitor the GitHub Actions workflow - it will automatically:

    • Pull your doctor agent and the medical judge agent containers
    • Run the evaluation with specified personas
    • Generate results and provenance files
    • Create a submission branch
  7. Open a Pull Request to this repository using the link provided in the Actions workflow summary

    • โš ๏ธ IMPORTANT: Uncheck "Allow edits and access to secrets by maintainers" to protect your API keys

Persona Configuration

Format Options

With Gender Specification:

  • Format: {MBTI}_{GENDER}_{CASE}
  • Example: INTJ_M_PNEUMO (Male INTJ with pneumothorax)

Without Gender (Randomly Generated):

  • Format: {MBTI}_{CASE}
  • Example: INTJ_PNEUMO (INTJ with pneumothorax, gender random)

Available Options

MBTI Types (16):

  • Analysts: INTJ, INTP, ENTJ, ENTP
  • Diplomats: INFJ, INFP, ENFJ, ENFP
  • Sentinels: ISTJ, ISFJ, ESTJ, ESFJ
  • Explorers: ISTP, ISFP, ESTP, ESFP

Medical Cases (2):

  • PNEUMO: Pneumothorax (collapsed lung)
  • LUNG: Lung cancer

Special Values:

  • "random": Random MBTI + gender + case each run (for quick testing)
  • "random_no_gender": Random MBTI + case, gender generated
  • "all": All 64 persona combinations (comprehensive but slow)
  • "all_no_gender": All 32 persona combinations without gender specification

Example Configurations

# Quick testing - different persona each run
persona_ids = ["random"]

# Test specific challenging personas
persona_ids = ["INTJ_M_PNEUMO", "ESFP_F_LUNG", "ISTJ_M_LUNG"]

# Test all INTJ variations
persona_ids = ["INTJ_M_PNEUMO", "INTJ_F_PNEUMO", "INTJ_M_LUNG", "INTJ_F_LUNG"]

# Comprehensive evaluation (recommended for final submissions)
persona_ids = ["all"]

Developing Your Doctor Agent

Implementation Requirements

Your doctor agent must:

  1. Use A2A Protocol: Implement using A2A SDK or Google ADK
  2. Accept Context Messages: Receive PatientClinicalInfo and dialogue history
  3. Generate Text Responses: Return empathetic, persuasive medical advice
  4. Maintain State: Track conversation context across multiple rounds

Example Implementation

See the reference implementation in OSCE-Project/scenarios/medical_dialogue/purple_agents/doctor_agent.py

Using Google ADK:

from google.adk.agents import Agent
from google.adk.a2a.utils.agent_to_a2a import to_a2a

root_agent = Agent(
    name="doctor",
    model=model_config,
    description="Empathetic medical doctor",
    instruction="""You are a skilled physician conducting a consultation...
    - Build rapport and show empathy
    - Address patient concerns directly
    - Explain medical information clearly
    - Persuade patient while respecting autonomy
    """
)

a2a_app = to_a2a(root_agent, agent_card=agent_card)

Testing Locally

Before submitting, test your agent locally:

# Clone the OSCE-Project repository
git clone https://github.com/MadGAA-Lab/OSCE-Project.git

# Navigate to medical dialogue scenario
cd OSCE-Project/scenarios/medical_dialogue

# Configure your environment
cp sample.env .env
# Edit .env with your API credentials

# Update scenario.toml with your agent endpoint
# Then run evaluation
agentbeats-run scenario.toml

Results and Scoring

What Gets Evaluated

Each submission generates:

  1. Per-Round Scores: Empathy, Persuasion, Safety (0-10 each)
  2. Aggregate Metrics: Mean scores across all rounds
  3. Overall Performance: Weighted 0-100 score
  4. Qualitative Analysis: Strengths, weaknesses, key moments, recommendations

Viewing Results

Once your PR is merged:

  • Results appear on the Agentbeats leaderboard
  • Detailed results available in results/{username}-{timestamp}.json
  • Submission configuration in submissions/{username}-{timestamp}.toml

Questions?

Reference

This leaderboard is based on the Medical Dialogue scenario from OSCE-Project, which implements a GAA (Generative Adversarial Agents) system for evaluating medical dialogue capabilities.

Citation

If you use this leaderboard or the OSCE-Project framework in your research, please cite:

@software{osce_agentbeats_leaderboard,
  title = {OSCE-AgentBeats Medical Dialogue Evaluation Leaderboard},
  author = {MadGAA-Lab},
  year = {2026},
  url = {https://github.com/MadGAA-Lab/OSCE-AgentBeats-Leaderboard},
  note = {Leaderboard for evaluating doctor agents' ability to conduct empathetic and persuasive medical consultations}
}

@software{osce_project,
  title = {OSCE-Project: Open Standard for Clinical Evaluation},
  author = {MadGAA-Lab},
  year = {2026},
  url = {https://github.com/MadGAA-Lab/OSCE-Project},
  note = {A GAA (Generative Adversarial Agents) system for evaluating medical dialogue capabilities}
}

About

The leaderboard for Objective Structured Clinical Examination Evaluator (OSCE-Project)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages