Integration test suite implementation by Henrrypg · Pull Request #228 · openedx/openedx-ai-extensions

Henrrypg · 2026-06-10T14:11:37Z

No description provided.

openedx-webhooks · 2026-06-10T14:11:44Z

Thanks for the pull request, @Henrrypg!

This repository is currently maintained by @felipemontoya.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

- Dependencies: This PR must be merged before / after / at the same time as ...
- Blockers: This PR is waiting for OEP-1234 to be accepted.
- Timeline information: This PR must be merged by XX date because ...
- Partner information: This is for a course on edx.org.
- Supporting documentation
- Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

felipemontoya · 2026-06-10T21:44:19Z

+
+    if hasattr(settings, "AI_EXTENSIONS"):
+        configs = getattr(settings, "AI_EXTENSIONS", {})
+        if "openai" in configs and "model" not in configs["openai"]:


Can you point to the place where this "latest cheapest" model is defined, or where it can be looked up and updated?
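
One way to make such a default easy to find and update, sketched under assumptions: keep the fallback model names in a single module-level mapping and apply them only when a provider config omits "model". The names `DEFAULT_MODELS` and `apply_default_models`, and the model identifiers below, are hypothetical placeholders, not values taken from this PR.

```python
import copy

# Hypothetical single source of truth for fallback models.
# The model names are placeholders, not the values used in this PR.
DEFAULT_MODELS = {
    "openai": "placeholder-openai-model",
    "anthropic": "placeholder-anthropic-model",
}


def apply_default_models(configs, defaults=DEFAULT_MODELS):
    """Return a copy of configs where providers lacking "model" get a default."""
    result = copy.deepcopy(configs)
    for provider, default_model in defaults.items():
        # Only fill the gap; an explicitly configured model is left untouched.
        if provider in result and "model" not in result[provider]:
            result[provider]["model"] = default_model
    return result


configs = {"openai": {"api_key": "dummy"}, "anthropic": {"model": "explicit-model"}}
patched = apply_default_models(configs)
print(patched["openai"]["model"])     # placeholder-openai-model
print(patched["anthropic"]["model"])  # explicit-model
```

With this layout, updating the default means editing one mapping, and the input settings dictionary is never mutated.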

codecov · 2026-06-10T22:29:47Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.32%. Comparing base (5e53541) to head (497ed86).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
- Coverage   95.33%   95.32%   -0.01%     
==========================================
  Files          69       69              
  Lines        8075     8083       +8     
  Branches      429      432       +3     
==========================================
+ Hits         7698     7705       +7     
  Misses        283      283              
- Partials       94       95       +1

| Flag | Coverage Δ |
|---|---|
| unittests | `95.32% <ø> (-0.01%)` ⬇️ |


felipemontoya · 2026-06-11T19:25:06Z

Ran this locally with:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-....
make test-integration

Results:

make test-integration
DJANGO_SETTINGS_MODULE=integration_test_settings pytest tests/integration/ -m live_llm -v
=================================================================================== test session starts ===================================================================================
platform linux -- Python 3.12.5, pytest-8.3.5, pluggy-1.5.0 -- /data/eduNEXT/ws-community/2025/aiext-azimut/aiext/src/openedx-ai-extensions/backend/venv/bin/python
cachedir: .pytest_cache
django: version: 5.2.15, settings: integration_test_settings (from env)
rootdir: /data/eduNEXT/ws-community/2025/aiext-azimut/aiext/src/openedx-ai-extensions/backend
configfile: tox.ini
plugins: cov-6.1.1, Faker-25.8.0, anyio-4.11.0, django-4.11.1
collected 49 items                                                                                                                                                                        

tests/integration/test_fault_tolerance.py::test_invalid_api_key_does_not_return_completed[openai] PASSED                                                                            [  2%]
tests/integration/test_fault_tolerance.py::test_invalid_api_key_does_not_return_completed[anthropic] PASSED                                                                         [  4%]
tests/integration/test_fault_tolerance.py::test_wrong_model_name_does_not_return_completed[openai] PASSED                                                                           [  6%]
tests/integration/test_fault_tolerance.py::test_wrong_model_name_does_not_return_completed[anthropic] PASSED                                                                        [  8%]
tests/integration/test_live_llm_providers.py::test_provider_returns_non_empty_response[openai] PASSED                                                                               [ 10%]
tests/integration/test_live_llm_providers.py::test_provider_returns_non_empty_response[anthropic] PASSED                                                                            [ 12%]
tests/integration/test_live_llm_providers.py::test_response_format_json_schema[openai] PASSED                                                                                       [ 14%]
tests/integration/test_live_llm_providers.py::test_response_format_json_schema[anthropic] PASSED                                                                                    [ 16%]
tests/integration/test_live_llm_providers.py::test_threaded_context_maintained_openai PASSED                                                                                        [ 18%]
tests/integration/test_profile_coverage.py::test_custom_prompt_profile_returns_rephrased_content[openai] PASSED                                                                     [ 20%]
tests/integration/test_profile_coverage.py::test_custom_prompt_profile_returns_rephrased_content[anthropic] PASSED                                                                  [ 22%]
tests/integration/test_profile_coverage.py::test_library_creator_profile_returns_quiz_problems[openai] PASSED                                                                       [ 24%]
tests/integration/test_profile_coverage.py::test_library_creator_profile_returns_quiz_problems[anthropic] PASSED                                                                    [ 26%]
tests/integration/test_profile_coverage.py::test_box_hello_profile_returns_greeting PASSED                                                                                          [ 28%]
tests/integration/test_profile_coverage.py::test_chat_profile_non_streaming_returns_response PASSED                                                                                 [ 30%]
tests/integration/test_response_format.py::test_response_format_no_extra_keys[openai] PASSED                                                                                        [ 32%]
tests/integration/test_response_format.py::test_response_format_no_extra_keys[anthropic] PASSED                                                                                     [ 34%]
tests/integration/test_response_format.py::test_response_format_required_array_non_empty[openai] PASSED                                                                             [ 36%]
tests/integration/test_response_format.py::test_response_format_required_array_non_empty[anthropic] PASSED                                                                          [ 38%]
tests/integration/test_response_format.py::test_anthropic_streaming_with_strict_schema_no_crash PASSED                                                                              [ 40%]
tests/integration/test_semantic_quality.py::test_response_language_matches_content[openai] XPASS (LLM-as-judge verdict depends on the target model's reasoning quality; weaker-...) [ 42%]
tests/integration/test_semantic_quality.py::test_response_language_matches_content[anthropic] XPASS (LLM-as-judge verdict depends on the target model's reasoning quality; weak...) [ 44%]
tests/integration/test_semantic_quality.py::test_response_does_not_hallucinate_beyond_content[openai] XPASS (LLM-as-judge verdict depends on the target model's reasoning quali...) [ 46%]
tests/integration/test_semantic_quality.py::test_response_does_not_hallucinate_beyond_content[anthropic] XPASS (LLM-as-judge verdict depends on the target model's reasoning qu...) [ 48%]
tests/integration/test_semantic_quality.py::test_response_not_truncated_mid_list[openai] XPASS (LLM-as-judge verdict depends on the target model's reasoning quality; weaker-re...) [ 51%]
tests/integration/test_semantic_quality.py::test_response_not_truncated_mid_list[anthropic] XPASS (LLM-as-judge verdict depends on the target model's reasoning quality; weaker...) [ 53%]
tests/integration/test_streaming_edge_cases.py::test_streaming_handles_empty_delta_chunks[openai] PASSED                                                                            [ 55%]
tests/integration/test_streaming_edge_cases.py::test_streaming_handles_empty_delta_chunks[anthropic] PASSED                                                                         [ 57%]
tests/integration/test_streaming_edge_cases.py::test_streaming_long_response_arrives_completely[openai] PASSED                                                                      [ 59%]
tests/integration/test_streaming_edge_cases.py::test_streaming_long_response_arrives_completely[anthropic] PASSED                                                                   [ 61%]
tests/integration/test_streaming_edge_cases.py::test_streaming_with_response_format_openai PASSED                                                                                   [ 63%]
tests/integration/test_streaming_edge_cases.py::test_streaming_with_response_format_anthropic_clean_outcome PASSED                                                                  [ 65%]
tests/integration/test_streaming_edge_cases.py::test_healthy_stream_has_no_error_marker[openai] PASSED                                                                              [ 67%]
tests/integration/test_streaming_edge_cases.py::test_healthy_stream_has_no_error_marker[anthropic] PASSED                                                                           [ 69%]
tests/integration/test_threading.py::test_stale_thread_id_triggers_recovery FAILED                                                                                                  [ 71%]
tests/integration/test_threading.py::test_conversation_clean_after_stale_thread_recovery FAILED                                                                                     [ 73%]
tests/integration/test_threading.py::test_three_turn_context_chain PASSED                                                                                                           [ 75%]
tests/integration/test_educator_assistant.py::test_quiz_generation_returns_non_empty_problems[openai] PASSED                                                                        [ 77%]
tests/integration/test_educator_assistant.py::test_quiz_generation_returns_non_empty_problems[anthropic] PASSED                                                                     [ 79%]
tests/integration/test_educator_assistant.py::test_quiz_generation_response_is_valid_json[openai] PASSED                                                                            [ 81%]
tests/integration/test_educator_assistant.py::test_quiz_generation_response_is_valid_json[anthropic] PASSED                                                                         [ 83%]
tests/integration/test_live_llm_providers.py::test_threaded_stores_remote_response_id PASSED                                                                                        [ 85%]
tests/integration/test_threading.py::test_anthropic_cache_hit_on_second_call PASSED                                                                                                 [ 87%]
tests/integration/test_threading.py::test_anthropic_cache_short_prompt_no_crash PASSED                                                                                              [ 89%]
tests/integration/test_tool_calls.py::test_tool_call_pipeline_completes[openai] PASSED                                                                                              [ 91%]
tests/integration/test_tool_calls.py::test_tool_call_pipeline_completes[anthropic] PASSED                                                                                           [ 93%]
tests/integration/test_tool_calls.py::test_unknown_tool_name_returns_error_string PASSED                                                                                            [ 95%]
tests/integration/test_tool_calls.py::test_empty_available_tools_does_not_crash[openai] PASSED                                                                                      [ 97%]
tests/integration/test_tool_calls.py::test_empty_available_tools_does_not_crash[anthropic] FAILED                                                                                   [100%]

felipemontoya · 2026-06-11T19:32:34Z

I suppose the failing tests are what you found and fixed/reverted in the last commit in favor of a separate PR.

felipemontoya

This is a very interesting PR. Good work overall. I do have a lot of inline comments and I think that in order to solve them we can split the PR into two or more.

One thing I'm still thinking over is where and how we can publish the test results in a way that captures what the providers are doing and lets us fail the suite on a less strict threshold than all-must-pass.

Next steps for me would be to split this PR, address the feedback that we can for some of the files, and get it merged. Then we can work on CI, and after that go back to the other files, which test some of the more difficult capabilities.
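
The less-strict threshold idea could be sketched as a small aggregation step over per-test outcomes. This is only an illustration: the `summarize` helper, the outcome strings, and the threshold values are assumptions, not something defined in this PR.

```python
from collections import Counter, defaultdict


def summarize(results, threshold=0.9):
    """Group outcomes by provider and decide whether the run passes.

    `results` is a list of (test_id, outcome) pairs. The provider is read
    from the pytest-style parametrization suffix, e.g. "test_x[openai]".
    """
    per_provider = defaultdict(Counter)
    for test_id, outcome in results:
        if test_id.endswith("]") and "[" in test_id:
            provider = test_id[test_id.rindex("[") + 1 : -1]
        else:
            provider = "unparametrized"
        per_provider[provider][outcome] += 1

    total = len(results)
    passed = sum(1 for _, outcome in results if outcome == "passed")
    pass_rate = passed / total if total else 0.0
    return {
        "per_provider": {name: dict(counts) for name, counts in per_provider.items()},
        "pass_rate": pass_rate,
        "ok": pass_rate >= threshold,  # the run passes at or above the threshold
    }


results = [
    ("test_tool_call_pipeline_completes[openai]", "passed"),
    ("test_tool_call_pipeline_completes[anthropic]", "passed"),
    ("test_empty_available_tools_does_not_crash[openai]", "passed"),
    ("test_empty_available_tools_does_not_crash[anthropic]", "failed"),
]
report = summarize(results, threshold=0.75)
print(report["pass_rate"])  # 0.75
print(report["ok"])         # True
```

The per-provider counts give a publishable record of how each provider behaved, while the single `ok` flag is what a CI job would act on.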

felipemontoya · 2026-06-11T17:04:19Z

+    (stream overridden to False) returns a non-empty completed response.
+    """
+    create_profile_and_scope(
+        "test_openai", course_key, "examples/openai/chat.json", slug_suffix="chat"


This profile has "stream": true.

The create_profile_and_scope function sets "stream": false on all profiles by default.
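
To illustrate the behavior described in this thread, here is a minimal sketch of a loader that forces "stream" to false unless the caller asks otherwise. `load_profile_config` and the JSON layout are hypothetical; the real `create_profile_and_scope` helper may be structured differently.

```python
import json


def load_profile_config(raw_json, stream=False):
    """Parse a profile definition and override its "stream" flag.

    Hypothetical stand-in for what a test helper might do; the top-level
    position of the "stream" key is an assumption made for this example.
    """
    config = json.loads(raw_json)
    config["stream"] = stream  # the test default wins over the file's value
    return config


# An example profile that enables streaming in its own definition.
raw = '{"provider": "openai", "stream": true}'
print(load_profile_config(raw)["stream"])               # False
print(load_profile_config(raw, stream=True)["stream"])  # True
```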

felipemontoya · 2026-06-11T19:51:29Z

+_JUDGE_BASE_SYSTEM = "You are a strict evaluator. Answer with valid JSON only, no extra text."
+
+
+def judge(system_question, user_content):


How about we move this whole judge subsystem into its own file?

I removed the whole judge system and the semantic tests so they can go in a separate PR.

felipemontoya · 2026-06-11T19:54:21Z

+_settings.SERVICE_VARIANT = "lms"
+
+JUDGE_MODEL = "gpt-4.1-mini"
+_JUDGE_BASE_SYSTEM = "You are a strict evaluator. Answer with valid JSON only, no extra text."


Can we refactor this into a schema_definition instead of asking the model for JSON?

I removed the whole judge system and the semantic tests so they can go in a separate PR.
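
For reference, the schema-based approach suggested in this thread could look roughly like the following. This is a sketch under assumptions: the `VERDICT_SCHEMA` fields and the `build_response_format` and `parse_verdict` helpers are hypothetical, and the `response_format` layout follows the OpenAI-style `json_schema` structured-output shape rather than anything confirmed in this PR.

```python
import json

# Hypothetical verdict schema; the field names are assumptions for this example.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": ["pass", "fail"]},
        "reason": {"type": "string"},
    },
    "required": ["verdict", "reason"],
    "additionalProperties": False,
}


def build_response_format(name="judge_verdict"):
    """Wrap the schema in an OpenAI-style structured-output request field."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": VERDICT_SCHEMA, "strict": True},
    }


def parse_verdict(raw_text):
    """Parse a judge reply and check it against the schema's basic constraints."""
    data = json.loads(raw_text)
    if not isinstance(data, dict) or set(data) != set(VERDICT_SCHEMA["required"]):
        raise ValueError("reply must be an object with exactly the required keys")
    if data["verdict"] not in VERDICT_SCHEMA["properties"]["verdict"]["enum"]:
        raise ValueError(f"invalid verdict: {data['verdict']!r}")
    if not isinstance(data["reason"], str):
        raise ValueError("reason must be a string")
    return data


print(parse_verdict('{"verdict": "pass", "reason": "matches the content"}')["verdict"])  # pass
```

Declaring the shape once means the system prompt no longer has to ask for "valid JSON only", and the same schema can be reused to check replies on the test side.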

openedx-webhooks added the open-source-contribution (PR author is not from Axim or 2U) and core contributor (PR author is a Core Contributor, who may or may not have write access to this repo) labels Jun 10, 2026

openedx-webhooks added this to Contributions Jun 10, 2026

github-project-automation Bot moved this to Needs Triage in Contributions Jun 10, 2026

Henrrypg added 4 commits June 10, 2026 14:33

chore: set config defaults in python

c07cf49

feat: implement test suite

6d03d55

fix: add semantic quality tests optional

6eae28c

fix: qa and fix tests

c0d6768

Henrrypg force-pushed the hpg/testing-implementation branch from 906b6d4 to c0d6768 Compare June 10, 2026 20:57

felipemontoya reviewed Jun 10, 2026


felipemontoya changed the title ~~chore: integration test suite implementation~~ Integration test suite implementation Jun 10, 2026

feat: undo changes with integration test failing

62abea0

mphilbrick211 moved this from Needs Triage to Waiting on Author in Contributions Jun 11, 2026

felipemontoya requested changes Jun 11, 2026


Henrrypg added 3 commits June 12, 2026 08:40

chore: address comments

1d3edc5

chore: remove all semantic tests and judge functionality

598a4e9

chore: address comments

497ed86


4 participants
