[FEAT] Add gpqa diamond by shruthan · Pull Request #17 · ServiceNow/AU-Harness

shruthan · 2025-09-24T06:41:40Z

📌 Description

Adds GPQA Diamond Audio.

155 of the 198 GPQA Diamond samples are speakable and have been converted to speech for evaluation. The audio set is hosted at ServiceNow-AI/gpqa_audio.

On a parallel text-only run, GPT 4o mini scores 39 (OpenAI reports 40.8 at https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/).

On audio:
GPT 4o mini: 28.9 ± 0.86 (5 runs)
Voxtral Small: 27.1
Phi 4 Multimodal Instruct: 22.58
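
For reference, a "mean ± spread over N runs" figure like the GPT 4o mini line above could be produced along these lines. This is a minimal sketch, not the harness's code: the PR does not say whether ± denotes a standard deviation or a standard error, so the sample standard deviation is assumed here, and `summarize_runs` and the per-run values are hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-run accuracy scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical per-run accuracies in percent; NOT the values behind this PR.
example_runs = [28.0, 29.0, 30.0, 28.0, 30.0]

mean, spread = summarize_runs(example_runs)
print(f"{mean:.1f} ± {spread:.2f} ({len(example_runs)} runs)")
```

If the reported figure is instead a standard error, the spread would be divided by the square root of the number of runs.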

🛠️ Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality including new tasks)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactor / Code cleanup
Maintenance / Chore / Task
Other (please describe):

✅ How Has This Been Tested?

Unit tests
Integration tests
Manual testing

Test Results / Screenshots (if applicable):

📸 Screenshots / Demos

📋 Checklist

Code follows project style guidelines
Tests have been added/updated (if applicable)
Documentation has been updated (if applicable)
Linked relevant issue(s)
Self-reviewed my code

🙌 Additional Notes

akshaykalkunte

LGTM

shruthan · 2025-09-24T21:22:14Z

After further filtering of samples for audio quality, the dataset now has 147 samples.
Scores are largely similar, except for Voxtral Small, which now scores 29.93.
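
As a sanity check on the reported numbers, a percentage can be converted back to an implied count of correct answers for a given dataset size. The sketch below assumes each single-run score is plain accuracy (correct / total) and that the earlier Voxtral Small and Phi 4 scores were measured on the 155-sample set; the PR confirms neither, and the helper names are hypothetical.

```python
def implied_correct(score_percent, n_samples):
    """Nearest integer number of correct answers for a reported accuracy."""
    return round(score_percent * n_samples / 100)


def accuracy_percent(n_correct, n_samples):
    """Accuracy in percent for a given count of correct answers."""
    return 100 * n_correct / n_samples


# (model, reported score, assumed dataset size) taken from the PR text.
reported = [
    ("Voxtral Small", 27.1, 155),
    ("Phi 4 Multimodal Instruct", 22.58, 155),
    ("Voxtral Small", 29.93, 147),
]

for name, score, n in reported:
    k = implied_correct(score, n)
    print(f"{name}: {score} on {n} samples ~ {k}/{n} correct "
          f"(recomputed {accuracy_percent(k, n):.2f})")
```

Under these assumptions the implied counts are 42/155, 35/155, and 44/147, each of which rounds back to the reported score.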

#22)
* Added phonetics, speech_disorder, and speech_enhancement tasks; full model scoring is still pending. Fixed a small config inconsistency by renaming judge_properties to judge_settings.
* Updated the HF path for the noise_detection task to the correct one.
* Updated scores.
Co-authored-by: hoang <huuhoang.nguyen@servicenow.com>

nhhoang96

LGTM

Resolving documentation conflicts before merging to main

shruthan added 2 commits September 23, 2025 23:29

add gpqa diamond

fd34c10

Merge branch 'main' into scratch/gpqa

1ac3d58

akshaykalkunte approved these changes Sep 24, 2025

oluwanifemibamgbose and others added 11 commits September 24, 2025 20:29

Update constants.py (#18)

cdf4ca4

updating turn handling for multi-turn evals

54e393e

Merge pull request #23 from ServiceNow/feat/update_multi_turn

6b962df

feat: Add Gemini support (#15)

8099646

add spokenwoz speech and text (#24)

a572abe

add vllm configs and readme (#21)

e4d2203

voxtral and phi4 guidance (#25)

99ac7bc

Keeping normalizer up-to-date with Whisper-normalizer for ASR (#27)

daa0616

add gpqa diamond

39a314c

Merge branch 'scratch/gpqa' of https://github.com/ServiceNow/AU-Harness…

e896556

… into scratch/gpqa

nhhoang96 self-requested a review April 18, 2026 15:48

nhhoang96 approved these changes Apr 18, 2026

nhhoang96 merged commit 5011dd8 into main Apr 18, 2026

nhhoang96 deleted the scratch/gpqa branch April 18, 2026 15:53

nhhoang96 merged 13 commits into main from scratch/gpqa

7 participants
