DataWizz is a data platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase. It combines file ingestion, SQL exploration, Delta Lake publishing, multi-engine notebooks, low-code orchestration, and business dashboards in one local-first workspace.
The current version is intentionally built as a serious MVP rather than a toy demo:
- Upload, preview, and profile raw files
- Query raw and curated data with DuckDB
- Run notebooks with DuckDB, PySpark, and DataFusion
- Publish transformed outputs as Delta tables
- Build and validate visual pipelines
- Track runs, retries, and logs
- Create semantic datasets, governed metrics, charts, dashboards, and scheduled reports
- Embed Superset and auto-provision a shared DataWizz analytics connection
- Switch between dark and light workspace themes
- Run locally with one script or as a Docker demo stack
DataWizz is designed for analytics engineering and platform demos where you want a believable lakehouse product surface without needing a full distributed stack on day one.
It is especially useful when you want to demonstrate:
- Raw-to-curated data workflows
- SQL-first transformation on local or object-backed files
- Notebook-driven prototyping across multiple execution engines
- Delta Lake publishing with metadata tracking
- Airflow-like orchestration without leaving the app
- In-app BI dashboards on top of curated outputs
- File upload, preview, schema inference, and deletion
- SQL querying over CSV, JSON, Parquet, and curated Delta tables
- Write query outputs to Delta Lake with append or overwrite modes
- Catalog browsing with metadata, freshness, ownership, tags, and lineage hints
- Role-aware row filters and column masking for governed curated-table access
- Reusable data-quality suites with selectable native DuckDB or Great Expectations execution
- Theme-aware workspace shell with dark and light presentation modes
- Multi-cell saved notebooks in the Engine Lab
- Real local execution for DuckDB, PySpark, and DataFusion
- Run-all, run-single-cell, and run-from-here execution flows
- Notebook duplicate, delete, rename, and run history support
- Source-aware asset browser with one-click SQL or Python snippet insertion
- Persisted per-cell outputs so reopened notebooks restore the latest visible state
- Visual pipeline builder powered by React Flow
- File source, Delta source, filter, select, join, aggregate, SQL, validate, write, and schedule nodes
- DAG validation, node guardrails, run history, retries, and detailed logs
- Airflow DAG code generation and export
- Backend recurring scheduler for saved cron pipelines
- Semantic dataset explorer
- Governed semantic metrics layer with certified aggregate definitions and DuckDB previews
- Natural-language chart generation that maps plain-English prompts to datasets, SQL, and chart configs
- Dataset-driven chart builder
- Saved chart library with traceability into dashboards and report schedules
- Dashboard builder and dashboard viewer
- Report scheduler with stored artifacts and snapshot history
- Optional Superset integration surface for demo storytelling
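The DAG validation listed for the visual pipeline builder can be pictured with a small sketch. This is not DataWizz's actual implementation; the function name `validate_dag` and the node identifiers are hypothetical, and the approach shown (Kahn's topological sort) is just one common way to reject cycles and dangling edges before a run.

```python
from collections import deque
from typing import Dict, List, Tuple


def validate_dag(nodes: List[str], edges: List[Tuple[str, str]]) -> List[str]:
    """Return a valid execution order for the pipeline, or raise ValueError.

    Hypothetical helper: checks for duplicate node ids, edges that point at
    unknown nodes, and cycles (via Kahn's algorithm).
    """
    if len(set(nodes)) != len(nodes):
        raise ValueError("duplicate node ids")

    indegree: Dict[str, int] = {node: 0 for node in nodes}
    downstream: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        if src not in indegree or dst not in indegree:
            raise ValueError(f"edge references unknown node: {src} -> {dst}")
        downstream[src].append(dst)
        indegree[dst] += 1

    # Start from nodes with no upstream dependencies (sources).
    ready = deque(node for node in nodes if indegree[node] == 0)
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Any node left unscheduled sits on a cycle.
    if len(order) != len(nodes):
        raise ValueError("pipeline contains a cycle")
    return order


if __name__ == "__main__":
    nodes = ["file_source", "delta_source", "join", "write"]
    edges = [("file_source", "join"), ("delta_source", "join"), ("join", "write")]
    print(validate_dag(nodes, edges))
```

A builder could run a check like this on save and surface the error message next to the offending node, which is roughly what "node guardrails" suggests.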
See the deeper system walkthrough in docs/architecture.md.
At a high level:
Users
-> React + TypeScript frontend
-> FastAPI application layer
-> DuckDB execution services
-> Delta Lake curated storage
-> PostgreSQL metadata store
-> Optional MinIO object storage
Project layout:
frontend/            React app for the workspace UI
backend/             FastAPI APIs, services, models, and migrations
docs/                Architecture, API, demo workflow, and screenshots
sample_data/         CSV fixtures and sample pipeline JSON
storage/             raw/, curated/, and temp/ runtime zones
docker-compose.yml   Demo stack for frontend, backend, PostgreSQL, MinIO, Superset
run.sh               One-command local launcher
From the project root:
./run.sh
This launcher:
- Reuses healthy local frontend and backend processes when they are already running
- Starts the app in local demo mode when Docker is unavailable
- Starts the managed Superset runtime automatically by default
- Can bootstrap Superset natively without Docker when Docker is unavailable
- Supports a Docker-based stack when Docker is installed
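The Docker fallback described above can be sketched as a small decision function. This is an illustration only: `choose_runtime` is a hypothetical name, and the assumption that `auto` prefers Docker when it is installed is inferred from the bullets rather than confirmed by `run.sh`.

```python
import shutil
from typing import Optional


def choose_runtime(requested: str, docker_available: Optional[bool] = None) -> str:
    """Pick "docker" or "local" for a requested mode of auto, local, or docker.

    Hypothetical sketch; the real launcher logic lives in run.sh.
    """
    if docker_available is None:
        # Treat Docker as available when the CLI is on PATH (an assumption).
        docker_available = shutil.which("docker") is not None

    if requested == "local":
        return "local"
    if requested == "docker":
        if not docker_available:
            raise RuntimeError("Docker mode requested but Docker is not installed")
        return "docker"
    if requested == "auto":
        # Assumed behavior: prefer Docker, fall back to local demo mode.
        return "docker" if docker_available else "local"
    raise ValueError(f"unknown mode: {requested}")


if __name__ == "__main__":
    print(choose_runtime("auto"))
```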
Local endpoints:
- App: http://localhost:5173
- API: http://localhost:8000
- API docs: http://localhost:8000/docs
- Embedded Superset page: http://localhost:5173/bi/superset
Demo credentials:
- Email: admin@datawizz.local
- Password: datawizz123
./run.sh local
./run.sh local nosuperset
./run.sh local superset native
./run.sh local --restart
./run.sh auto nosuperset
./run.sh docker
./run.sh docker nosuperset

Manual backend setup:
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
cp .env.example .env
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
Notes:
- The backend targets PostgreSQL by default.
- For quick local demos, the launcher can use SQLite-backed metadata automatically.
- Local SQLite metadata is stored outside the repository at ${DATAWIZZ_LOCAL_DATABASE_PATH:-${DATAWIZZ_CACHE_DIR:-$HOME/.cache/datawizz}/local/metadata.db}. This prevents Git operations and cloud-sync clients from replacing a database while the backend is running.
- Superset is pinned to version 6.1.0. On the first no-Docker launch, run.sh installs the native runtime in the machine-local DataWizz cache and initializes its metadata before reporting it healthy.
- Later launches skip migrations and admin bootstrap, start the prepared runtime directly, verify the process and health endpoint, and provision the shared DuckDB catalog only when it is missing.
- The native Superset package cache lives outside the repository (${XDG_CACHE_HOME:-$HOME/.cache}/datawizz by default), so cloud-synced workspaces and moved clones do not repeatedly download or slowly import hundreds of megabytes of Python packages. Set DATAWIZZ_CACHE_DIR to override it.
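The nested defaults in the paths above can be easier to follow as code. The sketch below mirrors the shell `${VAR:-default}` expansions quoted in the notes; the function names are hypothetical, and the precedence shown for the package cache (DATAWIZZ_CACHE_DIR first, then the XDG default) is an assumption based on the "override" wording.

```python
from typing import Mapping


def _or_default(env: Mapping[str, str], name: str, default: str) -> str:
    """Mimic shell ${NAME:-default}: use the default when NAME is unset or empty."""
    value = env.get(name, "")
    return value if value else default


def resolve_metadata_db(env: Mapping[str, str]) -> str:
    """${DATAWIZZ_LOCAL_DATABASE_PATH:-${DATAWIZZ_CACHE_DIR:-$HOME/.cache/datawizz}/local/metadata.db}"""
    home = env.get("HOME", "")
    cache_dir = _or_default(env, "DATAWIZZ_CACHE_DIR", f"{home}/.cache/datawizz")
    return _or_default(
        env, "DATAWIZZ_LOCAL_DATABASE_PATH", f"{cache_dir}/local/metadata.db"
    )


def resolve_package_cache(env: Mapping[str, str]) -> str:
    """DATAWIZZ_CACHE_DIR if set, else ${XDG_CACHE_HOME:-$HOME/.cache}/datawizz."""
    home = env.get("HOME", "")
    xdg_cache = _or_default(env, "XDG_CACHE_HOME", f"{home}/.cache")
    return _or_default(env, "DATAWIZZ_CACHE_DIR", f"{xdg_cache}/datawizz")


if __name__ == "__main__":
    import os

    print(resolve_metadata_db(os.environ))
    print(resolve_package_cache(os.environ))
```

Passing the environment in as a mapping keeps the functions easy to check against different variable combinations without touching the real process environment.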
Manual frontend setup:
cd frontend
npm install
cp .env.example .env
npm run dev

Docker stack:
docker compose up --build
Included services:
- Frontend
- FastAPI backend
- PostgreSQL
- MinIO
- Optional Superset profile
Optional Superset:
./run.sh
# force the no-Docker runtime
./run.sh local superset native
# skip Superset when you want only the core workspace
./run.sh local nosuperset
For a complete scripted walkthrough, see docs/demo-workflow.md.
Suggested first demo:
- Upload sample_data/sales.csv and sample_data/customers.csv
- Query raw_sales in the SQL workspace
- Write sales_curated as a Delta table
- Open the catalog and inspect the curated asset
- Open Engine Lab and run a DuckDB, Spark, or DataFusion notebook cell
- Run the sample visual pipeline
- Build charts and review the published BI dashboard
Regional revenue:
SELECT
region,
SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC;
Monthly revenue:
SELECT
strftime(order_date, '%Y-%m') AS month,
SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY 1
ORDER BY 1;
Top customers:
SELECT
customer_id,
SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;
Every pull request and push to main runs GitHub Actions checks for repository integrity, frontend lint/build, backend compilation and smoke tests, and Docker Compose configuration.
Run the equivalent checks locally with:
./scripts/ci/check-repository.sh
cd backend
python -m compileall app alembic
pytest -q
cd ../frontend
npm run lint
npm run build
cd ..
docker compose config --quiet
docker compose --profile superset config --quiet
The repository-integrity check specifically verifies that critical frontend library files are tracked and that relative frontend imports resolve from a clean checkout.
- DuckDB remains the primary SQL workspace engine
- Spark and DataFusion are available through the notebook runtime surface
- Delta publishing is implemented through the backend write services
- Scheduling is now active in-app for saved cron pipelines
- Notebook outputs persist per cell and restore when a notebook is reopened
- The BI layer is intentionally lightweight and app-native; Superset is now available as an embedded managed runtime with an auto-provisioned shared DuckDB connection
Implemented:
- Real login, sessions, seeded users, and role-aware API and UI RBAC for admin, analyst, and viewer
- Dark and light workspace themes with a polished shared shell, search, and page-level UX cleanup
- File Explorer drag-and-drop uploads, schema and row preview, deep column profiling, and profile-driven recommendations
- SQL Workspace querying, export, and Delta publishing backed by DuckDB
- Catalog governance editing, row filters, column masking, quality and freshness signals, data contract guardrails, lineage relationships, and mini lineage graph drill-down
- Curated-table quality suites with native DuckDB or Great Expectations execution, persisted run history, cron schedules, and pipeline quality gates
- Visual pipeline builder validation, join and aggregation guardrails, retries, logs filtering, and recurring scheduler execution
- OpenLineage-compatible pipeline and notebook lifecycle events with local retention, dataset inputs/outputs, and optional HTTP delivery
- Engine Lab notebooks with DuckDB, PySpark, and DataFusion runtimes, saved snippets, collaboration basics, and persisted cell outputs
- BI dataset explorer, semantic metrics layer, natural-language chart generation, chart builder, saved charts, dashboard builder and viewer, filters, and report scheduler with stored artifacts
- Embedded Superset runtime with a shared serving catalog and auto-provisioned DataWizz Serving Catalog connection
- Clean-clone CI gates for repository integrity, frontend build, backend smoke tests, and Compose validation
Planned next:
- Flink streaming support
- Transactional quarantine/remediation actions for failed quality gates
- OpenLineage coverage for SQL and reports plus additional external transport authentication modes
- Hive Metastore or Nessie-backed catalog options
- Notebook export artifacts and richer collaboration flows
- Dashboard sharing and permissions
- Alerts, subscriptions, and richer export delivery
- Deployment automation, monitoring, and Kubernetes packaging
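The role-aware row filters and column masking mentioned among the governance features can be illustrated with a toy policy table. Everything below is hypothetical: the `POLICIES` mapping, the sample columns, and the `***` mask are assumptions made for the sketch, and only the role names (admin, analyst, viewer) come from this document.

```python
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]

# Hypothetical per-role policy: which rows are visible and which columns are masked.
POLICIES: Dict[str, Dict[str, Any]] = {
    "admin": {"row_filter": lambda row: True, "masked_columns": set()},
    "analyst": {"row_filter": lambda row: True, "masked_columns": {"email"}},
    "viewer": {
        "row_filter": lambda row: row["region"] == "EMEA",
        "masked_columns": {"email", "revenue"},
    },
}


def apply_policy(rows: List[Row], role: str) -> List[Row]:
    """Drop rows the role may not see, then mask restricted columns."""
    policy = POLICIES[role]
    row_filter: Callable[[Row], bool] = policy["row_filter"]
    masked: Set[str] = policy["masked_columns"]
    visible: List[Row] = []
    for row in rows:
        if not row_filter(row):
            continue
        visible.append(
            {col: ("***" if col in masked else value) for col, value in row.items()}
        )
    return visible


if __name__ == "__main__":
    sample = [
        {"region": "EMEA", "email": "a@example.com", "revenue": 100},
        {"region": "APAC", "email": "b@example.com", "revenue": 250},
    ]
    for role in ("admin", "analyst", "viewer"):
        print(role, apply_policy(sample, role))
```

In a SQL-backed engine the same idea would more likely be expressed as a rewritten query (a WHERE clause plus masked select expressions) than as post-processing in Python; the sketch only shows the policy shape.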
DataWizz is built to show what a modern internal analytics platform can look like when lakehouse workflows, orchestration, and BI are treated as one cohesive product surface.








