rohankumardubey/DataWizz

A lakehouse, orchestration, and BI workspace for modern analytics teams.

DataWizz is a data platform inspired by Databricks, Snowflake, ClickHouse Cloud, Airflow, Superset, and Metabase. It combines file ingestion, SQL exploration, Delta Lake publishing, multi-engine notebooks, low-code orchestration, and business dashboards in one local-first workspace.

The current version is a working MVP rather than a toy demo. It lets you:

  • Upload, preview, and profile raw files
  • Query raw and curated data with DuckDB
  • Run notebooks with DuckDB, PySpark, and DataFusion
  • Publish transformed outputs as Delta tables
  • Build and validate visual pipelines
  • Track runs, retries, and logs
  • Create semantic datasets, governed metrics, charts, dashboards, and scheduled reports
  • Embed Superset and auto-provision a shared DataWizz analytics connection
  • Switch between dark and light workspace themes
  • Run locally with one script or as a Docker demo stack

Product Tour

  • Workspace access: A polished dark-mode entry experience with demo credentials, platform positioning, and a more presentation-ready first impression.
  • Lakehouse home: Monitor files, Delta assets, pipeline health, and workspace activity from a single landing page.
  • SQL workspace: Query raw files and curated outputs, inspect history, and write results back to Delta Lake.
  • Curated catalog: Browse governed Delta assets with ownership, freshness, schema, and preview data.
  • Notebook runtime lab: Build saved multi-cell notebooks, switch between DuckDB, Spark, and DataFusion, insert source-aware snippets, and persist per-cell outputs.
  • Data quality operations: Choose native DuckDB or Great Expectations execution, retain validation evidence, schedule checks, and apply pipeline quality gates.
  • Operational lineage: Inspect OpenLineage lifecycle events with resolved inputs, Delta outputs, run facets, and external delivery status.
  • Pipeline builder: Design low-code flows, validate graph rules, schedule recurring runs, and export Airflow-style DAGs.
  • BI dashboard layer: Publish chart-driven dashboards, apply shared filters, and generate JSON or mock snapshot exports for stakeholder-ready reporting surfaces.

Why DataWizz

DataWizz is designed for analytics engineering and platform demos where you want a believable lakehouse product surface without needing a full distributed stack on day one.

It is especially useful when you want to demonstrate:

  • Raw-to-curated data workflows
  • SQL-first transformation on local or object-backed files
  • Notebook-driven prototyping across multiple execution engines
  • Delta Lake publishing with metadata tracking
  • Airflow-like orchestration without leaving the app
  • In-app BI dashboards on top of curated outputs

Core Capabilities

Lakehouse

  • File upload, preview, schema inference, and deletion
  • SQL querying over CSV, JSON, Parquet, and curated Delta tables
  • Write query outputs to Delta Lake with append or overwrite modes
  • Catalog browsing with metadata, freshness, ownership, tags, and lineage hints
  • Role-aware row filters and column masking for governed curated-table access
  • Reusable data-quality suites with selectable native DuckDB or Great Expectations execution
  • Theme-aware workspace shell with dark and light presentation modes
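
To make the role-aware row filters and column masking above more concrete, here is a minimal sketch in plain Python. The `Policy` class, the role names, the column names, and the `***` mask value are all hypothetical and are not taken from the DataWizz codebase; the actual backend may well enforce these rules inside SQL instead.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Policy:
    """Hypothetical per-role access policy (not DataWizz's actual model)."""
    row_filter: Callable[[Row], bool] = lambda row: True
    masked_columns: set[str] = field(default_factory=set)


def apply_policy(rows: list[Row], policy: Policy, mask: str = "***") -> list[Row]:
    """Drop rows the role may not see, then mask restricted columns."""
    visible = [row for row in rows if policy.row_filter(row)]
    return [
        {col: (mask if col in policy.masked_columns else val) for col, val in row.items()}
        for row in visible
    ]


# Illustrative data only; column names are assumptions.
rows = [
    {"region": "EMEA", "customer_email": "a@example.com", "revenue": 120.0},
    {"region": "APAC", "customer_email": "b@example.com", "revenue": 80.0},
]

# Assumed roles: an analyst limited to EMEA with emails masked, and an unrestricted admin.
policies = {
    "analyst": Policy(
        row_filter=lambda r: r["region"] == "EMEA",
        masked_columns={"customer_email"},
    ),
    "admin": Policy(),
}

analyst_view = apply_policy(rows, policies["analyst"])
admin_view = apply_policy(rows, policies["admin"])
```

Filtering before masking keeps the row predicate working on real values, so a policy can filter on a column that it also masks.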

Notebook Runtime

  • Multi-cell saved notebooks in the Engine Lab
  • Real local execution for DuckDB, PySpark, and DataFusion
  • Run-all, run-single-cell, and run-from-here execution flows
  • Notebook duplicate, delete, rename, and run history support
  • Source-aware asset browser with one-click SQL or Python snippet insertion
  • Persisted per-cell outputs so reopened notebooks restore the latest visible state

Orchestration

  • Visual pipeline builder powered by React Flow
  • File source, Delta source, filter, select, join, aggregate, SQL, validate, write, and schedule nodes
  • DAG validation, node guardrails, run history, retries, and detailed logs
  • Airflow DAG code generation and export
  • Backend recurring scheduler for saved cron pipelines
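
The DAG validation mentioned above can be sketched with Python's standard-library `graphlib`. This is an illustration of the general technique (topological sorting with cycle detection), not the validator DataWizz ships; the `validate_pipeline` function and the node names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def validate_pipeline(edges: list[tuple[str, str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the graph has a cycle.

    Each edge is (upstream, downstream). Node names are illustrative only.
    """
    sorter: TopologicalSorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): the downstream node depends on the upstream one.
        sorter.add(downstream, upstream)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # The second element of exc.args lists the nodes forming the cycle.
        raise ValueError(f"pipeline contains a cycle: {exc.args[1]}") from exc


order = validate_pipeline([
    ("file_source", "filter"),
    ("filter", "aggregate"),
    ("aggregate", "write"),
])
```

A real validator would likely add node-specific guardrails on top of this, for example checking that a join node has exactly two inputs.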

BI Layer

  • Semantic dataset explorer
  • Governed semantic metrics layer with certified aggregate definitions and DuckDB previews
  • Natural-language chart generation that maps plain-English prompts to datasets, SQL, and chart configs
  • Dataset-driven chart builder
  • Saved chart library with traceability into dashboards and report schedules
  • Dashboard builder and dashboard viewer
  • Report scheduler with stored artifacts and snapshot history
  • Optional Superset integration surface for demo storytelling

Architecture

See the deeper system walkthrough in docs/architecture.md.

At a high level:

Users
  -> React + TypeScript frontend
  -> FastAPI application layer
  -> DuckDB execution services
  -> Delta Lake curated storage
  -> PostgreSQL metadata store
  -> Optional MinIO object storage

Project layout:

frontend/           React app for the workspace UI
backend/            FastAPI APIs, services, models, and migrations
docs/               Architecture, API, demo workflow, and screenshots
sample_data/        CSV fixtures and sample pipeline JSON
storage/            raw/, curated/, and temp/ runtime zones
docker-compose.yml  Demo stack for frontend, backend, PostgreSQL, MinIO, Superset
run.sh              One-command local launcher

Quick Start

One command

From the project root:

./run.sh

This launcher:

  • Reuses healthy local frontend and backend processes when they are already running
  • Starts the app in local demo mode when Docker is unavailable
  • Starts the managed Superset runtime by default
  • Can bootstrap Superset natively when Docker is unavailable
  • Supports a Docker-based stack when Docker is installed

Local endpoints:

  • App: http://localhost:5173
  • API: http://localhost:8000
  • API docs: http://localhost:8000/docs
  • Embedded Superset page: http://localhost:5173/bi/superset

Demo credentials:

  • Email: admin@datawizz.local
  • Password: datawizz123

Other launcher modes

./run.sh local
./run.sh local nosuperset
./run.sh local superset native
./run.sh local --restart
./run.sh auto nosuperset
./run.sh docker
./run.sh docker nosuperset

Local Development

Backend

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
cp .env.example .env
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Notes:

  • The backend targets PostgreSQL by default.
  • For quick local demos, the launcher can use SQLite-backed metadata automatically.
  • Local SQLite metadata is stored outside the repository at ${DATAWIZZ_LOCAL_DATABASE_PATH:-${DATAWIZZ_CACHE_DIR:-$HOME/.cache/datawizz}/local/metadata.db}. This prevents Git operations and cloud-sync clients from replacing a database while the backend is running.
  • Superset is pinned to version 6.1.0. On the first no-Docker launch, run.sh installs the native runtime in the machine-local DataWizz cache and initializes its metadata before reporting it healthy.
  • Later launches skip migrations and admin bootstrap, start the prepared runtime directly, verify the process and health endpoint, and provision the shared DuckDB catalog only when it is missing.
  • The native Superset package cache lives outside the repository (${XDG_CACHE_HOME:-$HOME/.cache}/datawizz by default), so cloud-synced workspaces and moved clones do not repeatedly download or slowly import hundreds of megabytes of Python packages. Set DATAWIZZ_CACHE_DIR to override it.
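
The SQLite metadata path in the notes above is written as a nested shell expansion. The sketch below mirrors that resolution order in Python so the precedence is easier to read. The function name `local_metadata_path` is hypothetical, and the fallback to `Path.home()` when `HOME` is missing is a convenience added here, not something the launcher is documented to do.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def local_metadata_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the local metadata DB path in the documented precedence order.

    Shell ${VAR:-default} falls back when VAR is unset OR empty, which is why
    truthiness checks are used instead of a plain lookup default.
    """
    env = os.environ if env is None else env

    # 1. An explicit database path wins outright.
    explicit = env.get("DATAWIZZ_LOCAL_DATABASE_PATH")
    if explicit:
        return Path(explicit)

    # 2. Otherwise use the cache directory override, or $HOME/.cache/datawizz.
    home = env.get("HOME") or str(Path.home())  # Path.home() fallback is an assumption
    cache_dir = env.get("DATAWIZZ_CACHE_DIR") or f"{home}/.cache/datawizz"
    return Path(cache_dir) / "local" / "metadata.db"
```

For example, calling `local_metadata_path({"HOME": "/home/u", "DATAWIZZ_CACHE_DIR": "/tmp/dw"})` resolves to `/tmp/dw/local/metadata.db`.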

Frontend

cd frontend
npm install
cp .env.example .env
npm run dev

Docker Demo Stack

docker compose up --build

Included services:

  • Frontend
  • FastAPI backend
  • PostgreSQL
  • MinIO
  • Optional Superset profile

Optional Superset:

# default launch: starts the managed Superset runtime
./run.sh
# force the no-Docker runtime
./run.sh local superset native
# skip Superset when you want only the core workspace
./run.sh local nosuperset

Demo Flow

For a complete scripted walkthrough, see docs/demo-workflow.md.

Suggested first demo:

  1. Upload sample_data/sales.csv and sample_data/customers.csv
  2. Query raw_sales in the SQL workspace
  3. Write sales_curated as a Delta table
  4. Open the catalog and inspect the curated asset
  5. Open Engine Lab and run a DuckDB, Spark, or DataFusion notebook cell
  6. Run the sample visual pipeline
  7. Build charts and review the published BI dashboard

Sample SQL

Regional revenue:

SELECT
  region,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY region
ORDER BY total_revenue DESC;

Monthly revenue:

SELECT
  strftime(order_date, '%Y-%m') AS month,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY 1
ORDER BY 1;

Top customers:

SELECT
  customer_id,
  SUM(revenue) AS total_revenue
FROM raw_sales
GROUP BY customer_id
ORDER BY total_revenue DESC
LIMIT 10;
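
To try the regional revenue and top customers queries without starting the workspace, the sketch below runs them against a small in-memory table. It uses Python's built-in `sqlite3` purely as a stand-in so that it needs no extra packages; DataWizz itself runs these queries on DuckDB. The rows and the three-column schema are invented for illustration, and the real sample_data/sales.csv may differ. The monthly revenue query is left out because `strftime` takes its arguments in a different order in SQLite than in DuckDB.

```python
import sqlite3

# Stand-in rows (customer_id, region, revenue); values are made up for this sketch.
rows = [
    ("c1", "EMEA", 100.0),
    ("c2", "APAC", 250.0),
    ("c1", "EMEA", 50.0),
    ("c3", "AMER", 75.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (customer_id TEXT, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)

# Regional revenue, highest first.
regional = conn.execute(
    """
    SELECT region, SUM(revenue) AS total_revenue
    FROM raw_sales
    GROUP BY region
    ORDER BY total_revenue DESC
    """
).fetchall()

# Top customers by total revenue.
top_customers = conn.execute(
    """
    SELECT customer_id, SUM(revenue) AS total_revenue
    FROM raw_sales
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
    """
).fetchall()

conn.close()
```

With the stand-in rows, `regional` lists APAC first and `top_customers` lists `c2` first, since each holds the largest summed revenue.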

Documentation

  • Architecture walkthrough: docs/architecture.md
  • Demo workflow: docs/demo-workflow.md

Verification

Every pull request and push to main runs GitHub Actions checks for repository integrity, frontend lint/build, backend compilation and smoke tests, and Docker Compose configuration.

Run the equivalent checks locally with:

./scripts/ci/check-repository.sh

cd backend
python -m compileall app alembic
pytest -q

cd ../frontend
npm run lint
npm run build

cd ..
docker compose config --quiet
docker compose --profile superset config --quiet

The repository-integrity check specifically verifies that critical frontend library files are tracked and that relative frontend imports resolve from a clean checkout.

Current MVP Notes

  • DuckDB remains the primary SQL workspace engine
  • Spark and DataFusion are available through the notebook runtime surface
  • Delta publishing is implemented through the backend write services
  • Scheduling is now active in-app for saved cron pipelines
  • Notebook outputs persist per cell and restore when a notebook is reopened
  • The BI layer is intentionally lightweight and app-native; Superset is now available as an embedded managed runtime with an auto-provisioned shared DuckDB connection

Roadmap Status

Completed

  • Real login, sessions, seeded users, and role-aware API and UI RBAC for admin, analyst, and viewer
  • Dark and light workspace themes with a polished shared shell, search, and page-level UX cleanup
  • File Explorer drag-and-drop uploads, schema and row preview, deep column profiling, and profile-driven recommendations
  • SQL Workspace querying, export, and Delta publishing backed by DuckDB
  • Catalog governance editing, row filters, column masking, quality and freshness signals, data contract guardrails, lineage relationships, and mini lineage graph drill-down
  • Curated-table quality suites with native DuckDB or Great Expectations execution, persisted run history, cron schedules, and pipeline quality gates
  • Visual pipeline builder validation, join and aggregation guardrails, retries, logs filtering, and recurring scheduler execution
  • OpenLineage-compatible pipeline and notebook lifecycle events with local retention, dataset inputs/outputs, and optional HTTP delivery
  • Engine Lab notebooks with DuckDB, PySpark, and DataFusion runtimes, saved snippets, collaboration basics, and persisted cell outputs
  • BI dataset explorer, semantic metrics layer, natural-language chart generation, chart builder, saved charts, dashboard builder and viewer, filters, and report scheduler with stored artifacts
  • Embedded Superset runtime with a shared serving catalog and auto-provisioned DataWizz Serving Catalog connection
  • Clean-clone CI gates for repository integrity, frontend build, backend smoke tests, and Compose validation

Next

  • Flink streaming support
  • Transactional quarantine/remediation actions for failed quality gates
  • OpenLineage coverage for SQL and reports plus additional external transport authentication modes
  • Hive Metastore or Nessie-backed catalog options
  • Notebook export artifacts and richer collaboration flows
  • Dashboard sharing and permissions
  • Alerts, subscriptions, and richer export delivery
  • Deployment automation, monitoring, and Kubernetes packaging

DataWizz is built to show what a modern internal analytics platform can look like when lakehouse workflows, orchestration, and BI are treated as one cohesive product surface.
