DataWizz (GitHub: rohankumardubey/DataWizz)

A lakehouse, orchestration, and BI workspace for modern analytics teams.

DataWizz is a data platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase. It combines file ingestion, SQL exploration, Delta Lake publishing, multi-engine notebooks, low-code orchestration, and business dashboards in one local-first workspace.

The current version is intentionally built as a serious MVP rather than a toy demo:

Upload, preview, and profile raw files
Query raw and curated data with DuckDB
Run notebooks with DuckDB, PySpark, and DataFusion
Publish transformed outputs as Delta tables
Build and validate visual pipelines
Track runs, retries, and logs
Create semantic datasets, governed metrics, charts, dashboards, and scheduled reports
Embed Superset and auto-provision a shared DataWizz analytics connection
Switch between dark and light workspace themes
Run locally with one script or as a Docker demo stack

Product Tour


Workspace access: A polished dark-mode entry experience with demo credentials, platform positioning, and a more presentation-ready first impression.
Lakehouse home: Monitor files, Delta assets, pipeline health, and workspace activity from a single landing page.
SQL workspace: Query raw files and curated outputs, inspect history, and write results back to Delta Lake.
Curated catalog: Browse governed Delta assets with ownership, freshness, schema, and preview data.
Notebook runtime lab: Build saved multi-cell notebooks, switch between DuckDB, Spark, and DataFusion, insert source-aware snippets, and persist per-cell outputs.
Data quality operations: Choose native DuckDB or Great Expectations execution, retain validation evidence, schedule checks, and apply pipeline quality gates.
Operational lineage: Inspect OpenLineage lifecycle events with resolved inputs, Delta outputs, run facets, and external delivery status.
Pipeline builder: Design low-code flows, validate graph rules, schedule recurring runs, and export Airflow-style DAGs.
BI dashboard layer: Publish chart-driven dashboards, apply shared filters, and generate JSON or mock snapshot exports for stakeholder-ready reporting surfaces.

Why DataWizz

DataWizz is designed for analytics engineering and platform demos where you want a believable lakehouse product surface without needing a full distributed stack on day one.

It is especially useful when you want to demonstrate:

Raw-to-curated data workflows
SQL-first transformation on local or object-backed files
Notebook-driven prototyping across multiple execution engines
Delta Lake publishing with metadata tracking
Airflow-like orchestration without leaving the app
In-app BI dashboards on top of curated outputs

Core Capabilities

Lakehouse

File upload, preview, schema inference, and deletion
SQL querying over CSV, JSON, Parquet, and curated Delta tables
Write query outputs to Delta Lake with append or overwrite modes
Catalog browsing with metadata, freshness, ownership, tags, and lineage hints
Role-aware row filters and column masking for governed curated-table access
Reusable data-quality suites with selectable native DuckDB or Great Expectations execution
Theme-aware workspace shell with dark and light presentation modes
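
The role-aware row filters and column masking listed above can be pictured with a small sketch. This is an illustration only, not the DataWizz implementation: the `POLICIES` table, the `apply_policy` helper, the `"***"` mask value, and the sample rows are all assumptions made for the example. Only the role names (admin, analyst, viewer) come from this README.

```python
Row = dict[str, object]

# Hypothetical policies keyed by role; DataWizz's real policy model may differ.
POLICIES = {
    "admin": {"row_filter": lambda row: True, "masked": set()},
    "analyst": {"row_filter": lambda row: True, "masked": {"email"}},
    "viewer": {
        "row_filter": lambda row: row["region"] == "EMEA",
        "masked": {"email", "revenue"},
    },
}


def apply_policy(rows: list[Row], role: str) -> list[Row]:
    """Drop rows the role may not see, then mask restricted columns."""
    policy = POLICIES[role]
    visible = [row for row in rows if policy["row_filter"](row)]
    return [
        {col: ("***" if col in policy["masked"] else value) for col, value in row.items()}
        for row in visible
    ]


# Made-up rows for illustration.
rows = [
    {"customer_id": 1, "region": "EMEA", "email": "a@example.com", "revenue": 120.0},
    {"customer_id": 2, "region": "APAC", "email": "b@example.com", "revenue": 80.0},
]
```

Filtering before masking keeps the row predicate working on real values, so a policy can filter on a column that it also hides.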

Notebook Runtime

Multi-cell saved notebooks in the Engine Lab
Real local execution for DuckDB, PySpark, and DataFusion
Run-all, run-single-cell, and run-from-here execution flows
Notebook duplicate, delete, rename, and run history support
Source-aware asset browser with one-click SQL or Python snippet insertion
Persisted per-cell outputs so reopened notebooks restore the latest visible state

Orchestration

Visual pipeline builder powered by React Flow
File source, Delta source, filter, select, join, aggregate, SQL, validate, write, and schedule nodes
DAG validation, node guardrails, run history, retries, and detailed logs
Airflow DAG code generation and export
Backend recurring scheduler for saved cron pipelines
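
As a rough idea of what DAG validation involves, the sketch below rejects cyclic graphs and returns a runnable node order using Python's standard `graphlib` module. The `validate_pipeline` function and the edge format are assumptions for illustration; the actual validation rules live in the DataWizz backend and may check considerably more.

```python
from graphlib import CycleError, TopologicalSorter


def validate_pipeline(edges: list[tuple[str, str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the graph has a cycle.

    Each edge is (upstream_node, downstream_node). Hypothetical helper.
    """
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # The downstream node depends on the upstream node.
        sorter.add(downstream, upstream)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"pipeline contains a cycle: {exc.args[1]}") from exc
```

For example, `validate_pipeline([("file_source", "filter"), ("filter", "write")])` yields an order in which every node appears after its inputs, while adding the edge `("write", "file_source")` raises `ValueError`.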

BI Layer

Semantic dataset explorer
Governed semantic metrics layer with certified aggregate definitions and DuckDB previews
Natural-language chart generation that maps plain-English prompts to datasets, SQL, and chart configs
Dataset-driven chart builder
Saved chart library with traceability into dashboards and report schedules
Dashboard builder and dashboard viewer
Report scheduler with stored artifacts and snapshot history
Optional Superset integration surface for demo storytelling
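
One way to think about a governed metric is as a named, certified aggregate expression that is rendered into SQL on demand. The sketch below is a minimal illustration under that assumption. The `Metric` class and `metric_sql` function are hypothetical and do not describe the DataWizz metrics API; the dataset and column names are borrowed from the Sample SQL section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical governed metric: a named aggregate over one dataset."""

    name: str
    dataset: str
    expression: str  # aggregate SQL expression, e.g. "SUM(revenue)"
    certified: bool = False


def metric_sql(metric: Metric, group_by: list[str]) -> str:
    """Render a certified metric as a SELECT grouped by the given dimensions."""
    if not metric.certified:
        raise ValueError(f"metric {metric.name!r} is not certified")
    select_cols = ", ".join([*group_by, f"{metric.expression} AS {metric.name}"])
    sql = f"SELECT {select_cols} FROM {metric.dataset}"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


total_revenue = Metric("total_revenue", "raw_sales", "SUM(revenue)", certified=True)
```

Identifiers are interpolated directly into the SQL string here, which is only acceptable because the definitions are assumed to come from a trusted, reviewed catalog rather than from user input.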

Architecture

See the deeper system walkthrough in docs/architecture.md.

At a high level:

Users
  -> React + TypeScript frontend
  -> FastAPI application layer
  -> DuckDB execution services
  -> Delta Lake curated storage
  -> PostgreSQL metadata store
  -> Optional MinIO object storage

Project layout:

frontend/           React app for the workspace UI
backend/            FastAPI APIs, services, models, and migrations
docs/               Architecture, API, demo workflow, and screenshots
sample_data/        CSV fixtures and sample pipeline JSON
storage/            raw/, curated/, and temp/ runtime zones
docker-compose.yml  Demo stack for frontend, backend, PostgreSQL, MinIO, Superset
run.sh              One-command local launcher

Quick Start

One command

From the project root:

./run.sh

This launcher:

Reuses healthy local frontend and backend processes when they are already running
Starts the app in local demo mode when Docker is unavailable
Starts the managed Superset runtime automatically by default
Can bootstrap Superset natively without Docker when Docker is unavailable
Supports a Docker-based stack when Docker is installed

Local endpoints:

App: http://localhost:5173
API: http://localhost:8000
API docs: http://localhost:8000/docs
Embedded Superset page: http://localhost:5173/bi/superset

Demo credentials:

Email: admin@datawizz.local
Password: datawizz123

Other launcher modes

./run.sh local
./run.sh local nosuperset
./run.sh local superset native
./run.sh local --restart
./run.sh auto nosuperset
./run.sh docker
./run.sh docker nosuperset

Local Development

Backend

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
cp .env.example .env
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Notes:

The backend targets PostgreSQL by default.
For quick local demos, the launcher can use SQLite-backed metadata automatically.
Local SQLite metadata is stored outside the repository at ${DATAWIZZ_LOCAL_DATABASE_PATH:-${DATAWIZZ_CACHE_DIR:-$HOME/.cache/datawizz}/local/metadata.db}. This prevents Git operations and cloud-sync clients from replacing a database while the backend is running.
Superset is pinned to version 6.1.0. On the first no-Docker launch, run.sh installs the native runtime in the machine-local DataWizz cache and initializes its metadata before reporting it healthy.
Later launches skip migrations and admin bootstrap, start the prepared runtime directly, verify the process and health endpoint, and provision the shared DuckDB catalog only when it is missing.
The native Superset package cache lives outside the repository (${XDG_CACHE_HOME:-$HOME/.cache}/datawizz by default), so cloud-synced workspaces and moved clones do not repeatedly download or slowly import hundreds of megabytes of Python packages. Set DATAWIZZ_CACHE_DIR to override it.
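
The nested default in the SQLite path above follows shell `${NAME:-default}` semantics, where the default applies when the variable is unset or empty. The sketch below mirrors that expansion in Python to make the precedence explicit. The function names are assumptions for illustration, and the code follows only the expression quoted in the notes; run.sh itself may resolve the cache directory differently (for example through XDG_CACHE_HOME).

```python
import os


def env_or(env: dict[str, str], name: str, default: str) -> str:
    """Mimic shell ${NAME:-default}: use default when NAME is unset or empty."""
    value = env.get(name, "")
    return value if value else default


def local_metadata_path(env: dict[str, str]) -> str:
    """Resolve the local SQLite metadata path from an environment mapping."""
    home = env.get("HOME", "")
    # Inner expansion: ${DATAWIZZ_CACHE_DIR:-$HOME/.cache/datawizz}
    cache_dir = env_or(env, "DATAWIZZ_CACHE_DIR", f"{home}/.cache/datawizz")
    # Outer expansion: ${DATAWIZZ_LOCAL_DATABASE_PATH:-<cache_dir>/local/metadata.db}
    return env_or(env, "DATAWIZZ_LOCAL_DATABASE_PATH", f"{cache_dir}/local/metadata.db")


if __name__ == "__main__":
    print(local_metadata_path(dict(os.environ)))
```

An explicit DATAWIZZ_LOCAL_DATABASE_PATH wins outright; otherwise DATAWIZZ_CACHE_DIR relocates the whole cache tree, and the home-directory default applies only when neither is set.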

Frontend

cd frontend
npm install
cp .env.example .env
npm run dev

Docker Demo Stack

docker compose up --build

Included services:

Frontend
FastAPI backend
PostgreSQL
MinIO
Optional Superset profile

Optional Superset:

./run.sh
# force the no-Docker runtime
./run.sh local superset native
# skip Superset when you want only the core workspace
./run.sh local nosuperset

Demo Flow

For a complete scripted walkthrough, see docs/demo-workflow.md.

Suggested first demo:

Upload sample_data/sales.csv and sample_data/customers.csv
Query raw_sales in the SQL workspace
Write sales_curated as a Delta table
Open the catalog and inspect the curated asset
Open Engine Lab and run a DuckDB, Spark, or DataFusion notebook cell
Run the sample visual pipeline
Build charts and review the published BI dashboard

Sample SQL

Regional revenue:

SELECT
  region,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC;

Monthly revenue:

SELECT
  strftime(order_date, '%Y-%m') AS month,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY 1
ORDER BY 1;

Top customers:

SELECT
  customer_id,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;
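
To try the shape of these queries without starting the app, the sketch below runs the regional revenue and top customers queries from Python. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it has no extra dependencies, and the rows are made up for illustration (the real sample_data/sales.csv may contain other columns). The monthly revenue query is left out on purpose: DuckDB's `strftime(order_date, '%Y-%m')` takes the value first, while SQLite's `strftime` takes the format first, so that query does not port unchanged.

```python
import sqlite3

# Made-up rows; column names are inferred from the sample queries above.
rows = [
    ("North", "C1", 100.0),
    ("South", "C2", 250.0),
    ("North", "C2", 50.0),
]

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not DuckDB
conn.execute("CREATE TABLE raw_sales (region TEXT, customer_id TEXT, revenue REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)

# Regional revenue, highest first.
regional = conn.execute(
    """
    SELECT region, SUM(revenue) AS total_revenue
    FROM raw_sales
    GROUP BY region
    ORDER BY total_revenue DESC
    """
).fetchall()

# Top customers by total revenue.
top_customers = conn.execute(
    """
    SELECT customer_id, SUM(revenue) AS total_revenue
    FROM raw_sales
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
    """
).fetchall()
```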

Documentation

Verification

Every pull request and push to main runs GitHub Actions checks for repository integrity, frontend lint/build, backend compilation and smoke tests, and Docker Compose configuration.

Run the equivalent checks locally with:

./scripts/ci/check-repository.sh

cd backend
python -m compileall app alembic
pytest -q

cd ../frontend
npm run lint
npm run build

cd ..
docker compose config --quiet
docker compose --profile superset config --quiet

The repository-integrity check specifically verifies that critical frontend library files are tracked and that relative frontend imports resolve from a clean checkout.
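
The import-resolution part of that check can be approximated as follows. This is a simplified sketch, not the contents of scripts/ci/check-repository.sh: the `unresolved_imports` function, the regular expression, and the list of suffixes are assumptions, and the input is a plain mapping from tracked file path to file text (for instance, built from the output of `git ls-files`).

```python
import posixpath
import re

# Matches relative specifiers such as: import x from './lib/api'
IMPORT_RE = re.compile(r"""from\s+['"](\.{1,2}/[^'"]+)['"]""")
# Candidate endings tried for each specifier (an assumed, incomplete list).
SUFFIXES = ("", ".ts", ".tsx", "/index.ts", "/index.tsx")


def unresolved_imports(tracked: dict[str, str]) -> list[tuple[str, str]]:
    """Return (file, specifier) pairs whose relative import matches no tracked file."""
    missing = []
    for path, text in sorted(tracked.items()):
        for spec in IMPORT_RE.findall(text):
            base = posixpath.normpath(posixpath.join(posixpath.dirname(path), spec))
            if not any(base + suffix in tracked for suffix in SUFFIXES):
                missing.append((path, spec))
    return missing
```

Working from the tracked-file listing rather than the working directory is what makes this a clean-checkout check: a file that exists locally but was never committed still counts as missing.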

Current MVP Notes

DuckDB remains the primary SQL workspace engine
Spark and DataFusion are available through the notebook runtime surface
Delta publishing is implemented through the backend write services
Scheduling is now active in-app for saved cron pipelines
Notebook outputs persist per cell and restore when a notebook is reopened
The BI layer is intentionally lightweight and app-native; Superset is now available as an embedded managed runtime with an auto-provisioned shared DuckDB connection

Roadmap Status

Completed

Real login, sessions, seeded users, and role-aware API and UI RBAC for admin, analyst, and viewer
Dark and light workspace themes with a polished shared shell, search, and page-level UX cleanup
File Explorer drag-and-drop uploads, schema and row preview, deep column profiling, and profile-driven recommendations
SQL Workspace querying, export, and Delta publishing backed by DuckDB
Catalog governance editing, row filters, column masking, quality and freshness signals, data contract guardrails, lineage relationships, and mini lineage graph drill-down
Curated-table quality suites with native DuckDB or Great Expectations execution, persisted run history, cron schedules, and pipeline quality gates
Visual pipeline builder validation, join and aggregation guardrails, retries, logs filtering, and recurring scheduler execution
OpenLineage-compatible pipeline and notebook lifecycle events with local retention, dataset inputs/outputs, and optional HTTP delivery
Engine Lab notebooks with DuckDB, PySpark, and DataFusion runtimes, saved snippets, collaboration basics, and persisted cell outputs
BI dataset explorer, semantic metrics layer, natural-language chart generation, chart builder, saved charts, dashboard builder and viewer, filters, and report scheduler with stored artifacts
Embedded Superset runtime with a shared serving catalog and auto-provisioned DataWizz Serving Catalog connection
Clean-clone CI gates for repository integrity, frontend build, backend smoke tests, and Compose validation

Next

Flink streaming support
Transactional quarantine/remediation actions for failed quality gates
OpenLineage coverage for SQL and reports plus additional external transport authentication modes
Hive Metastore or Nessie-backed catalog options
Notebook export artifacts and richer collaboration flows
Dashboard sharing and permissions
Alerts, subscriptions, and richer export delivery
Deployment automation, monitoring, and Kubernetes packaging

DataWizz is built to show what a modern internal analytics platform can look like when lakehouse workflows, orchestration, and BI are treated as one cohesive product surface.
