[FEATURE] LLM A/B Testing Framework — Systematic prompt and model comparison

## Description

Build a robust A/B testing framework for comparing prompts, models, and parameters
to systematically improve LLM performance.

## Scope

Create a statistical experimentation platform for LLM optimization.

## Files to Touch/Create

- `astroml/llm/experimentation/__init__.py`
- `astroml/llm/experimentation/ab_test.py` — A/B test runner
- `astroml/llm/experimentation/analyzer.py` — Statistical analysis
- `astroml/llm/experimentation/assigner.py` — Traffic assignment
- `astroml/llm/experimentation/reporter.py` — Results reporting
- `astroml/llm/experimentation/guardrails.py` — Safety checks
- `api/routers/experiments.py` — Experiment API
- `web/src/components/ExperimentDashboard.tsx` — Dashboard UI

## Experiment Types

1. **Prompt Variants**: Test different prompt phrasings
2. **Model Comparison**: GPT-4 vs Claude vs Llama
3. **Parameter Tuning**: Temperature, top_p, max_tokens
4. **System Prompt**: Different system instructions
5. **Context Strategies**: Different RAG approaches
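
One way to cover all five experiment types with a single structure is to describe each arm as a variant record and then check which dimensions actually differ between arms. The sketch below is only an illustration: the `Variant` class, its field names, and `changed_fields` are hypothetical and not part of any existing `astroml` API, and the model and strategy strings are placeholders.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    """One arm of an experiment. All field names are hypothetical."""
    name: str
    model: str                              # model comparison
    prompt_template: str                    # prompt variants
    system_prompt: Optional[str] = None     # system prompt experiments
    temperature: float = 1.0                # parameter tuning
    top_p: float = 1.0
    max_tokens: int = 512
    context_strategy: Optional[str] = None  # e.g. a label for a RAG approach
    weight: float = 0.5                     # share of traffic for this arm


def changed_fields(control: Variant, treatment: Variant) -> List[str]:
    """Return the names of the fields that differ between two arms.

    `name` and `weight` are bookkeeping, not part of the treatment itself,
    so they are ignored.
    """
    ignored = {"name", "weight"}
    return sorted(
        f.name
        for f in fields(control)
        if f.name not in ignored
        and getattr(control, f.name) != getattr(treatment, f.name)
    )


control = Variant(name="control", model="model-a", prompt_template="Summarize: {text}")
treatment = Variant(
    name="treatment",
    model="model-a",
    prompt_template="Summarize: {text}",
    temperature=0.2,
)
diff = changed_fields(control, treatment)
```

A check like this lets the runner reject or flag experiments that change several dimensions at once, since an effect in such an experiment cannot be attributed to a single change.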

## Implementation Details

- Traffic splitting: 50/50, 90/10, etc.
- Minimum sample size calculation
- Statistical significance testing (t-test, chi-square)
- Confidence intervals
- Winner declaration automation
- Automatic rollback on regression
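
The sample size, significance, and confidence interval items could be sketched as follows for a binary metric such as task success. This is a minimal standard-library sketch under stated assumptions, not the intended `analyzer.py` implementation. It uses a pooled two-proportion z-test, which for a 2x2 table is equivalent to the chi-square test without continuity correction. All function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple

_STD_NORMAL = NormalDist()


def min_sample_size_per_arm(p_control: float, p_treatment: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size to detect p_control -> p_treatment (two-sided test)."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control)
                             + p_treatment * (1 - p_treatment))
    ) ** 2
    return math.ceil(numerator / (p_control - p_treatment) ** 2)


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms share one success rate."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence either way
    z = (p_b - p_a) / se
    return z, 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def diff_confidence_interval(successes_a: int, n_a: int,
                             successes_b: int, n_b: int,
                             confidence: float = 0.95) -> Tuple[float, float]:
    """Wald confidence interval for (rate_b - rate_a), using the unpooled SE."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = _STD_NORMAL.inv_cdf(1 - (1 - confidence) / 2) * se
    return (p_b - p_a - margin, p_b - p_a + margin)


n_required = min_sample_size_per_arm(0.10, 0.12)
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
ci_low, ci_high = diff_confidence_interval(100, 1000, 150, 1000)
```

Continuous metrics such as latency or cost would need a different test (for example Welch's t-test). Automated winner declaration also needs a policy for repeated looks at the data, because checking significance after every batch inflates the false positive rate.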

## Acceptance Criteria

- Experiments run with proper randomization
- Statistical significance calculated correctly
- Results reproducible
- Winner deployed automatically when significant
- Safety metrics monitored during experiments
- Historical experiment data retained
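
The randomization and reproducibility criteria can be satisfied together by assigning traffic deterministically: hash the experiment and unit identifiers and map the hash onto the configured split. The sketch below assumes a string unit ID (user, session, or request) and a hypothetical `assign_variant` helper. It is one possible approach, not the required design for `assigner.py`.

```python
import hashlib
from typing import Dict


def assign_variant(experiment_id: str, unit_id: str,
                   weights: Dict[str, float]) -> str:
    """Deterministically map a unit to a variant according to `weights`.

    The same (experiment_id, unit_id) pair always yields the same variant,
    and including experiment_id in the hash keeps assignments independent
    across experiments.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64

    names = list(weights)
    cumulative = 0.0
    for name in names:
        cumulative += weights[name] / total
        if point < cumulative:
            return name
    return names[-1]  # guard against floating-point rounding at the boundary


split = {"control": 0.9, "treatment": 0.1}
first = assign_variant("exp-prompt-v2", "user-123", split)
second = assign_variant("exp-prompt-v2", "user-123", split)
assignments = [assign_variant("exp-prompt-v2", f"user-{i}", split) for i in range(10_000)]
control_share = assignments.count("control") / len(assignments)
```

Because the assignment is a pure function of its inputs, a past experiment can be replayed from logged unit IDs without storing a separate assignment table.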

## Metrics Tracked

- Primary: task success rate, user satisfaction
- Secondary: latency, cost, token usage
- Safety: hallucination rate, toxicity
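
As an illustration of how safety metrics could gate an experiment, the sketch below compares the treatment's metrics against the control's and reports any that exceed an allowed relative increase. The `Guardrail` class, the `violated_guardrails` function, and the metric keys are hypothetical, and the numbers are made up for the example. It assumes lower values are better for every guarded metric.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Guardrail:
    """Upper bound on how much worse than control a metric may get."""
    metric: str
    max_relative_increase: float  # 0.10 allows up to 10% above control


def violated_guardrails(control: Dict[str, float],
                        treatment: Dict[str, float],
                        guardrails: List[Guardrail]) -> List[str]:
    """Return the names of metrics where treatment exceeds its allowed limit."""
    violations = []
    for guardrail in guardrails:
        baseline = control[guardrail.metric]
        observed = treatment[guardrail.metric]
        limit = baseline * (1 + guardrail.max_relative_increase)
        if observed > limit:
            violations.append(guardrail.metric)
    return violations


guardrails = [
    Guardrail("hallucination_rate", 0.10),
    Guardrail("toxicity", 0.10),
]
control_metrics = {"hallucination_rate": 0.04, "toxicity": 0.01}
treatment_metrics = {"hallucination_rate": 0.05, "toxicity": 0.01}

violations = violated_guardrails(control_metrics, treatment_metrics, guardrails)
should_roll_back = bool(violations)
```

A fixed threshold like this ignores sampling noise, so with small samples it can trigger on random variation. A production check would likely combine it with a minimum sample count or a one-sided significance test.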

## Labels

`enhancement`, `llm`, `experimentation`, `optimization`

