Conversation
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
Greptile Overview

Greptile Summary

Adds a new TPC-DS benchmark notebook for GCP Dataproc, along with comprehensive setup documentation in the README.

The PR has syntax issues that need correction before merging (a string interpolation bug and placeholder paths).

Confidence Score: 4/5

Important Files Changed / File Analysis
Sequence Diagram

sequenceDiagram
participant User
participant Jupyter as Jupyter Notebook
participant Spark as Spark Session
participant GCS as Google Cloud Storage
participant GPU as GPU Accelerator
User->>Jupyter: Install packages (tpcds_pyspark, sparkmeasure)
User->>Jupyter: Import modules
User->>Jupyter: Detect Scala version from JAR
Jupyter->>Spark: Create SparkSession with RAPIDS config
Spark-->>Jupyter: Session created
User->>Spark: Verify GPU acceleration with test query
Spark->>GPU: Execute test query
GPU-->>Spark: Results
Spark-->>User: Show query plan
User->>Jupyter: Initialize TPCDS with GCS data path
User->>Spark: Register TPC-DS tables
Spark->>GCS: Load table metadata
GCS-->>Spark: Table schemas
User->>Spark: Run TPC-DS queries (GPU mode)
Spark->>GPU: Execute queries
GPU-->>Spark: GPU results
Spark-->>Jupyter: Store GPU benchmark results
User->>Spark: Run TPC-DS queries (CPU mode)
Spark->>Spark: Execute queries on CPU
Spark-->>Jupyter: Store CPU benchmark results
User->>Jupyter: Merge and analyze results
Jupyter->>Jupyter: Calculate speedup factors
Jupyter-->>User: Display comparison charts
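The last steps of the diagram (merge results, calculate speedup factors) can be sketched in plain Python. All query names and timings below are illustrative placeholders, not measured results:

```python
# Hypothetical sketch of the final comparison step: merge per-query elapsed
# times from the CPU and GPU runs and compute speedup as CPU time / GPU time.
cpu_times = {"q4": 120.0, "q72": 310.0}  # seconds per query (CPU run, made up)
gpu_times = {"q4": 20.0, "q72": 62.0}    # seconds per query (GPU run, made up)

speedup = {
    q: cpu_times[q] / gpu_times[q]
    for q in cpu_times
    if q in gpu_times and gpu_times[q] > 0
}
# speedup == {"q4": 6.0, "q72": 5.0}
```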
2 files reviewed, 3 comments
    "]\n",
    "\n",
    "demo_start = time.time()\n",
    "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
syntax: gs://gcs_bucket is a placeholder - should be updated to match the $GCS_BUCKET variable pattern used in the README
Suggested change:
-    "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
+    "tpcds = TPCDS(data_path='gs://$GCS_BUCKET/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
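A minimal sketch of what this suggestion implies on the Python side, assuming the bucket name is supplied via a `GCS_BUCKET` environment variable as in the README (the fallback bucket name below is hypothetical):

```python
import os

# Read the bucket from the environment rather than hard-coding a placeholder;
# "my-tpcds-bucket" is a hypothetical fallback for illustration only.
gcs_bucket = os.environ.get("GCS_BUCKET", "my-tpcds-bucket")
data_path = f"gs://{gcs_bucket}/parquet_sf3k_decimal/"

# The TPCDS constructor from tpcds_pyspark would then receive this path:
# tpcds = TPCDS(data_path=data_path, num_runs=1, queries_repeat_times=1, queries=queries)
```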
    "source": [
    "%pip install --quiet \\\n",
    " tpcds_pyspark \\\n",
    " pandas \\\n",
syntax: String formatting error - sparkmeasure_version variable won't be interpolated inside single quotes. Use f-string
Suggested change:
-    "source": [
-    "%pip install --quiet \\\n",
-    " tpcds_pyspark \\\n",
-    " pandas \\\n",
+    %pip install --quiet \
+      tpcds_pyspark \
+      pandas \
+      sparkmeasure==f"{sparkmeasure_version}.0" \
+      matplotlib
@gerashegalov I think my GCP Dataproc notebook matches your original notebook, so I think I should ignore this suggestion. What do you think?
Yes, the IPython magic already does variable substitution.
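For context on that reply: IPython expands `{expression}` inside line and cell magics before the command runs, so `%pip install sparkmeasure=={sparkmeasure_version}.0` already interpolates the variable without an f-string. A plain-Python sketch of the equivalent expansion (the version value is hypothetical):

```python
# Simulate what IPython's {var} substitution does to the magic's argument line.
sparkmeasure_version = "0.24"  # hypothetical value detected earlier in the notebook
magic_args = "install --quiet sparkmeasure=={sparkmeasure_version}.0"
expanded = magic_args.format(sparkmeasure_version=sparkmeasure_version)
# expanded is now "install --quiet sparkmeasure==0.24.0"
```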
    "output_type": "execute_result"
    }
    ],
    "source": [
syntax: Regex needs escaping for the dot - \d+. should be \d+\. to match a literal period
Suggested change:
-    "source": [
+    scala_version = re.search(r'^spark-sql_(\d+\.\d+)-.*\.jar$', spark_sql_jar).group(1)
@gerashegalov Same here. This Dataproc version matches your original version for this regexp, so I want to ignore this suggestion.
This is a legitimate improvement, although it does not change anything in practice. We can fix it in the original notebook.
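To illustrate the point being discussed: an unescaped `.` matches any character, so the stricter pattern rejects a malformed JAR name that the loose one accepts. The filenames below are made up for the demonstration:

```python
import re

# Escaped pattern: the dot between major and minor Scala version is literal.
strict = r'^spark-sql_(\d+\.\d+)-.*\.jar$'
# Loose pattern: in '\d+.\d+' the '.' matches any character, not just '.'.
loose = r'^spark-sql_(\d+.\d+)-.*\.jar$'

good = "spark-sql_2.12-3.5.3.jar"   # hypothetical well-formed JAR name
bad = "spark-sql_2x12-3.5.3.jar"    # malformed: 'x' instead of '.'

assert re.search(strict, good).group(1) == "2.12"
assert re.search(strict, bad) is None        # strict pattern rejects it
assert re.search(loose, bad) is not None     # loose pattern accepts it
```

In this notebook both patterns happen to extract the same version from a real JAR name, which is why the fix "does not change anything in practice".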
    "text/html": [
    "\n",
    " <div>\n",
    " <p><b>SparkSession - hive</b></p>\n",
    " \n",
    " <div>\n",
    " <p><b>SparkContext</b></p>\n",
    "\n",
    " <p><a href=\"http://testbyhao2-ubuntu22-m.c.rapids-spark.internal:46705\">Spark UI</a></p>\n",
    "\n",
    " <dl>\n",
    " <dt>Version</dt>\n",
    " <dd><code>v3.5.3</code></dd>\n",
    " <dt>Master</dt>\n",
    " <dd><code>yarn</code></dd>\n",
    " <dt>AppName</dt>\n",
    " <dd><code>PySparkShell</code></dd>\n",
    " </dl>\n",
    " </div>\n",
    " \n",
    " </div>\n",
    " "
    ],
Please clear the notebook output for the PR
Cleared all the output.
Please add a PR description.
Per offline conversation, let us try to add knobs for hosted Spark and hosted data so we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with a few modifications. We will gradually expand the README in follow-up PRs to explain how to run this notebook on different cloud providers.
Please add a performance benchmark comparing runs on the CPU vs. the GPU.
The request here is to provide a notebook specific to each environment, so users do not need to make any changes; make it as simple as possible for the user. I understand that this will create maintenance overhead.
The PR already assumes CSP-specific instructions for launching the notebook, if you look at the proposed README changes. I bet the default environment already exposes enough specifics that only minor CSP-specific logic would be needed in a single notebook. If not, it can be part of the command documented for the user anyway.
NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release. |
Add an example TPC-DS notebook for GCP Dataproc.