asjchat/AI-Native-Exploratory-Data-Analysis

Exploratory Data Analysis — Claude Skill

A Claude Code skill that turns data exploration into a strategic conversation. Upload any tabular dataset (CSV, Excel, TSV) and get domain-aware analysis that challenges your assumptions, flags data limitations, and delivers prioritized next steps — not just statistics.

What Makes This Different

| | Generic EDA | This Skill |
|---|---|---|
| Context | Blind statistics | Asks about your domain and goals first |
| Hypotheses | Confirms what you expect | Actively seeks contradicting evidence |
| Limitations | Rarely mentioned | Always explicit |
| Output | Descriptive stats | Prioritized, actionable recommendations |

The 4-Phase Workflow

Phase 1: Context Gathering   (interactive)
         → 4 questions about your domain, objective, and hypotheses

Phase 2: Data Profiling      (automated)
         → Missing data, duplicates, outliers, type mismatches

Phase 3: Domain Exploration  (analytical)
         → Universal + domain-specific analysis
         → Devil's advocate: alternative explanations, Simpson's Paradox

Phase 4: Synthesis           (strategic)
         → Executive summary, confirming vs. contradicting evidence,
           explicit limitations, 3–5 prioritized next steps
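The Simpson's Paradox check named in Phase 3 can be illustrated with a small worked example. The sketch below is not taken from the skill itself: the numbers are made up, and the names `counts`, `pooled_rates`, and `is_simpsons_reversal` are hypothetical. It shows a policy that wins inside every segment yet loses once the segments are pooled, because the traffic mix differs between the two groups.

```python
from collections import defaultdict

# Hypothetical data: (segment, policy) -> (orders, visitors)
counts = {
    ("mobile", "old_policy"): (10, 100),    # 10% conversion
    ("mobile", "new_policy"): (36, 300),    # 12% conversion
    ("desktop", "old_policy"): (120, 300),  # 40% conversion
    ("desktop", "new_policy"): (45, 100),   # 45% conversion
}


def pooled_rates(counts):
    """Conversion rate per policy after ignoring the segment column."""
    totals = defaultdict(lambda: [0, 0])
    for (_, policy), (orders, visitors) in counts.items():
        totals[policy][0] += orders
        totals[policy][1] += visitors
    return {policy: o / v for policy, (o, v) in totals.items()}


def is_simpsons_reversal(counts, a, b):
    """True if one policy wins in every segment but loses in the pooled data."""
    segment_rate = {key: o / v for key, (o, v) in counts.items()}
    segments = {segment for segment, _ in counts}
    pooled = pooled_rates(counts)
    a_wins_everywhere = all(segment_rate[(s, a)] > segment_rate[(s, b)] for s in segments)
    b_wins_everywhere = all(segment_rate[(s, b)] > segment_rate[(s, a)] for s in segments)
    return (a_wins_everywhere and pooled[a] < pooled[b]) or (
        b_wins_everywhere and pooled[b] < pooled[a]
    )


print(pooled_rates(counts))
print(is_simpsons_reversal(counts, "new_policy", "old_policy"))
```

Here the new policy converts better on both mobile and desktop, but most of its visitors are on the low-converting mobile segment, so its pooled rate is lower. A pooled comparison alone would suggest the opposite conclusion.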

Example Output

The skill generates an interactive HTML report with embedded visualizations. The example below is from an e-commerce sales analysis where the user suspected a shipping policy change caused a revenue drop — the skill found the real cause was a product mix shift.

[Figures: Overview (dataset overview), Revenue Trend (price/revenue analysis), Customer Segments (segment breakdown), Correlation Matrix (feature correlations)]

Supported Domains

The skill adapts its analysis lens based on your answers in Phase 1:

  • Financial / Banking — transaction patterns, fraud indicators, portfolio concentration
  • Retail / E-commerce — customer behavior, conversion funnels, seasonality, product mix
  • Manufacturing / Supply Chain — defect rates, process capability, bottleneck identification
  • Healthcare / Medical — patient cohorts, treatment outcomes, comorbidity patterns
  • Marketing / Advertising — campaign ROI, channel attribution, audience segmentation
  • General — works on any tabular dataset

Installation

Requirements: Claude Code CLI (claude) and Python 3.8+.

# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/eda-skill.git
cd eda-skill

# 2. Copy to your Claude skills directory (create it first if it does not exist)
mkdir -p ~/.claude/skills/exploratory-data-analysis
cp -r . ~/.claude/skills/exploratory-data-analysis/

# 3. Install Python dependencies
./setup.sh

Or install dependencies manually:

pip install pandas numpy matplotlib seaborn scipy

Quick Start

  1. Start a Claude Code conversation
  2. Upload a CSV or Excel file
  3. Say any of the following:
    • "Analyze this dataset"
    • "Help me explore this data"
    • "What patterns do you see?"
    • "Run EDA on this file"
  4. Answer 4 context questions about your domain and objectives
  5. Receive a full analysis + interactive HTML report

Try the included sample datasets

E-commerce dataset (generated, ~5,000 rows — includes intentional quality issues):

python test-cases/generate_sample_data.py
# Then upload test-cases/sample_ecommerce_data.csv and say:
# "Our revenue dropped last quarter. I think it was our new shipping policy. Can you analyze this."

Expected: the skill confirms the revenue drop but shows that a product mix shift, not the shipping policy, is the real driver.
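To see how a product mix shift alone can produce a revenue drop, consider a minimal sketch that splits a revenue change into volume, mix, and price effects. This is a common textbook decomposition, not necessarily what the skill computes; the function name `decompose_revenue_change` and the two-product figures are hypothetical.

```python
def decompose_revenue_change(before, after):
    """Split a revenue change into volume, mix, and price effects.

    before/after: {product: (units, unit_price)} with the same product keys.
    The three effects sum to (revenue after - revenue before).
    """
    units_before = sum(units for units, _ in before.values())
    units_after = sum(units for units, _ in after.values())
    avg_price_before = (
        sum(units * price for units, price in before.values()) / units_before
    )

    # Volume effect: more or fewer units sold at the old average price.
    volume = (units_after - units_before) * avg_price_before
    mix = 0.0
    price = 0.0
    for product, (u0, p0) in before.items():
        u1, p1 = after[product]
        share_before = u0 / units_before
        share_after = u1 / units_after
        # Mix effect: change in each product's share, valued at old prices.
        mix += units_after * (share_after - share_before) * p0
        # Price effect: change in unit price, weighted by the new share.
        price += units_after * share_after * (p1 - p0)
    return {"volume": volume, "mix": mix, "price": price}


# Hypothetical quarters: same total units, same prices, different mix.
q_before = {"premium": (400, 100.0), "budget": (600, 20.0)}  # revenue 52,000
q_after = {"premium": (200, 100.0), "budget": (800, 20.0)}   # revenue 36,000

print(decompose_revenue_change(q_before, q_after))
```

In this made-up case total units and prices are unchanged, so the whole 16,000 drop lands in the mix term: customers moved from the premium product to the budget one.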

Car sales dataset (real data, 50,000 rows):

Upload test-cases/car_sales_data.csv and say:
"Analyze this car sales data. I want to understand what factors most influence price."

Repository Structure

.
├── SKILL.md                        # Claude skill definition (what Claude reads)
├── WORKFLOW.md                     # Visual workflow diagram
├── QUICK_REFERENCE.md              # One-page cheat sheet
├── INSTALLATION.md                 # Detailed installation and customization guide
├── setup.sh                        # Dependency installer
├── scripts/
│   ├── data_profiler.py            # Standalone data quality checker
│   └── generate_report.py          # Interactive HTML report generator
├── test-cases/
│   ├── test-prompts.md             # 10 test scenarios for evaluating the skill
│   ├── generate_sample_data.py     # Generates realistic e-commerce test data
│   ├── sample_ecommerce_data.csv   # Pre-generated e-commerce dataset (5,000+ transactions)
│   └── car_sales_data.csv          # Real car sales dataset (50,000 rows)
└── examples/
    ├── fig1_overview.png
    ├── fig2_price.png
    ├── fig3_segments.png
    └── fig4_correlation.png

Customization

Add a domain template by editing Phase 3 of SKILL.md (around line 170):

#### Your Domain (e.g., "Education / Learning Analytics")
- Student engagement rates
- Learning path completion rates
- Drop-off points in courses

Change report styling by editing the CSS in scripts/generate_report.py (around line 39):

--primary-color: #9b59b6;   /* default: #3498db */

Adjust context questions — edit Phase 1 of SKILL.md.

See INSTALLATION.md for the full customization guide and troubleshooting.

Running the Scripts Standalone

# Profile any CSV
python scripts/data_profiler.py your_data.csv --output report.json

# Generate a report (called programmatically from the skill)
python scripts/generate_report.py --profile report.json --output report.html
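For a sense of what the Phase 2 checks (missing data, duplicates, outliers, type mismatches) involve, here is a minimal pandas sketch. It is an illustration under stated assumptions, not the contents of scripts/data_profiler.py: the function name `profile`, the report keys, the 1.5×IQR outlier rule, and the 80% threshold for "numeric stored as text" are all choices made for this example.

```python
import pandas as pd


def profile(df: pd.DataFrame, text_numeric_threshold: float = 0.8) -> dict:
    """Return simple data-quality counts for a DataFrame (illustrative only)."""
    report = {
        "rows": len(df),
        "missing_per_column": {col: int(n) for col, n in df.isna().sum().items()},
        "duplicate_rows": int(df.duplicated().sum()),
        "iqr_outliers": {},
        "numeric_stored_as_text": [],
    }

    # Outliers: values more than 1.5 * IQR outside the quartiles.
    for col in df.select_dtypes(include="number").columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        report["iqr_outliers"][col] = int(mask.sum())

    # Type mismatches: non-numeric columns whose values mostly parse as numbers.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        non_null = df[col].dropna()
        if non_null.empty:
            continue
        parsed = pd.to_numeric(non_null.astype(str), errors="coerce")
        if parsed.notna().mean() >= text_numeric_threshold:
            report["numeric_stored_as_text"].append(col)

    return report


# Hypothetical sample with one missing value, one duplicate row,
# one extreme amount, and a quantity column stored as text.
sample = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6, 7, 7],
        "amount": [10.0, 12.0, 11.0, None, 13.0, 500.0, 12.0, 12.0],
        "quantity": ["1", "2", "3", "4", "x", "6", "7", "7"],
    }
)
print(profile(sample))
```

The IQR rule is only one of several reasonable outlier definitions; on heavily skewed columns such as prices it tends to flag many legitimate values, which is the kind of limitation the skill's synthesis phase is meant to call out.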

License

MIT — use it, fork it, adapt it.

