
Add dataproc tpcds example notebook#607

Open
viadea wants to merge 4 commits into NVIDIA:main from viadea:tpcds_example_dataproc

Conversation

@viadea
Collaborator

@viadea viadea commented Nov 19, 2025

Add an example tpcds notebook for GCP dataproc.

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
@viadea viadea requested a review from gerashegalov November 19, 2025 22:24
@greptile-apps
Contributor

greptile-apps bot commented Nov 19, 2025

Greptile Overview

Greptile Summary

Adds a new TPC-DS benchmark notebook for GCP Dataproc along with comprehensive setup documentation in the README.

  • Added detailed Dataproc cluster creation instructions with proper environment variables and Spark configurations
  • Created new notebook TPCDS-SF3K-Dataproc.ipynb that benchmarks TPC-DS queries comparing CPU vs GPU performance
  • Notebook follows similar structure to existing TPCDS-SF10.ipynb but adapted for Dataproc environment
  • Includes proper cell output clearing and follows repository standards

The PR has syntax issues that need correction before merging (string interpolation bug and placeholder paths).

Confidence Score: 4/5

  • This PR is safe to merge after addressing syntax issues
  • The PR adds valuable documentation and example code with no logical errors, but contains syntax bugs (string interpolation and regex) that were already flagged and need fixing before merge
  • Pay attention to examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF3K-Dataproc.ipynb which contains syntax errors that need correction

Important Files Changed

File Analysis

Filename Score Overview
examples/SQL+DF-Examples/tpcds/README.md 5/5 Added Dataproc cluster creation instructions and notebook reference
examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF3K-Dataproc.ipynb 4/5 New TPC-DS benchmark notebook for Dataproc with placeholder issues and string interpolation bugs

Sequence Diagram

sequenceDiagram
    participant User
    participant Jupyter as Jupyter Notebook
    participant Spark as Spark Session
    participant GCS as Google Cloud Storage
    participant GPU as GPU Accelerator
    
    User->>Jupyter: Install packages (tpcds_pyspark, sparkmeasure)
    User->>Jupyter: Import modules
    User->>Jupyter: Detect Scala version from JAR
    Jupyter->>Spark: Create SparkSession with RAPIDS config
    Spark-->>Jupyter: Session created
    User->>Spark: Verify GPU acceleration with test query
    Spark->>GPU: Execute test query
    GPU-->>Spark: Results
    Spark-->>User: Show query plan
    User->>Jupyter: Initialize TPCDS with GCS data path
    User->>Spark: Register TPC-DS tables
    Spark->>GCS: Load table metadata
    GCS-->>Spark: Table schemas
    User->>Spark: Run TPC-DS queries (GPU mode)
    Spark->>GPU: Execute queries
    GPU-->>Spark: GPU results
    Spark-->>Jupyter: Store GPU benchmark results
    User->>Spark: Run TPC-DS queries (CPU mode)
    Spark->>Spark: Execute queries on CPU
    Spark-->>Jupyter: Store CPU benchmark results
    User->>Jupyter: Merge and analyze results
    Jupyter->>Jupyter: Calculate speedup factors
    Jupyter-->>User: Display comparison charts
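The final steps in the diagram (merge results, calculate speedup factors) can be sketched in plain Python. This is a hypothetical illustration with made-up timings, not data from the PR:

```python
# Hypothetical per-query elapsed times in seconds (not taken from the PR).
cpu_times = {"q4": 120.0, "q72": 300.0}
gpu_times = {"q4": 30.0, "q72": 60.0}

# Speedup factor = CPU elapsed time / GPU elapsed time for each query.
speedups = {q: cpu_times[q] / gpu_times[q] for q in cpu_times}
print(speedups)  # {'q4': 4.0, 'q72': 5.0}
```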

Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 3 comments

demo_start = time.time()
tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)
Contributor

syntax: gs://gcs_bucket is a placeholder - should be updated to match the $GCS_BUCKET variable pattern used in the README

Suggested change
tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)
tpcds = TPCDS(data_path='gs://$GCS_BUCKET/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)

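One way to avoid a hard-coded placeholder is to read the bucket name from the environment, mirroring the $GCS_BUCKET convention in the README. This is a hypothetical sketch; the `GCS_BUCKET` lookup and the fallback value are assumptions, not part of the PR:

```python
import os

# Hypothetical: pick up the bucket from $GCS_BUCKET, falling back to a
# clearly-named placeholder so a missing variable is easy to spot.
gcs_bucket = os.environ.get("GCS_BUCKET", "your-gcs-bucket")
data_path = f"gs://{gcs_bucket}/parquet_sf3k_decimal/"
print(data_path)
```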
Comment on lines +42 to +45

%pip install --quiet \
    tpcds_pyspark \
    pandas \
Contributor

syntax: String formatting error - sparkmeasure_version variable won't be interpolated inside single quotes. Use f-string

Suggested change
%pip install --quiet \
    tpcds_pyspark \
    pandas \
    sparkmeasure==f"{sparkmeasure_version}.0" \
    matplotlib
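For context on this thread: in plain Python a single-quoted string does not interpolate names, while an f-string does, and IPython line magics such as %pip additionally expand `{name}` expressions themselves before running the command (which is why the maintainer rejects the suggestion below). A minimal sketch with a hypothetical version value:

```python
# Hypothetical version value; the real one comes from the notebook.
sparkmeasure_version = "0.24"

plain = 'sparkmeasure=={sparkmeasure_version}.0'          # braces stay literal in plain Python
interpolated = f'sparkmeasure=={sparkmeasure_version}.0'  # f-string substitutes the value

print(plain)         # sparkmeasure=={sparkmeasure_version}.0
print(interpolated)  # sparkmeasure==0.24.0
```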

Collaborator Author

@gerashegalov I think my GCP Dataproc notebook matches your original notebook, so I should ignore this suggestion. What do you think?

Collaborator

Yes, the IPython magic already does variable substitution.
Contributor

syntax: Regex needs escaping for the dot - \d+. should be \d+\. to match a literal period

Suggested change
scala_version = re.search(r'^spark-sql_(\d+\.\d+)-.*\.jar$', spark_sql_jar).group(1)
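The difference the bot flags can be checked directly: an unescaped dot matches any character, so a malformed JAR name would still pass the loose pattern. A small sketch with made-up file names:

```python
import re

LOOSE  = r'^spark-sql_(\d+.\d+)-.*\.jar$'   # '.' matches any character
STRICT = r'^spark-sql_(\d+\.\d+)-.*\.jar$'  # '\.' matches only a literal dot

good = 'spark-sql_2.12-3.5.3.jar'   # well-formed name
odd  = 'spark-sql_2x12-3.5.3.jar'   # made-up, malformed name

# Both patterns accept the well-formed name...
print(re.search(LOOSE, good).group(1))   # 2.12
print(re.search(STRICT, good).group(1))  # 2.12

# ...but only the loose pattern also accepts the malformed one.
print(re.search(LOOSE, odd).group(1))    # 2x12
print(re.search(STRICT, odd))            # None
```

In practice every Spark SQL JAR is well formed, which is why the maintainers note below that the fix "does not change anything in practice".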

Collaborator Author

@gerashegalov Same here. This Dataproc version matches the regexp in your original version, so I want to ignore this suggestion.

Collaborator

@gerashegalov gerashegalov Dec 3, 2025

This is a legitimate improvement, although it does not change anything in practice. We can fix it in the original notebook.

Comment on lines +144 to +166
"text/html": [
"\n",
" <div>\n",
" <p><b>SparkSession - hive</b></p>\n",
" \n",
" <div>\n",
" <p><b>SparkContext</b></p>\n",
"\n",
" <p><a href=\"http://testbyhao2-ubuntu22-m.c.rapids-spark.internal:46705\">Spark UI</a></p>\n",
"\n",
" <dl>\n",
" <dt>Version</dt>\n",
" <dd><code>v3.5.3</code></dd>\n",
" <dt>Master</dt>\n",
" <dd><code>yarn</code></dd>\n",
" <dt>AppName</dt>\n",
" <dd><code>PySparkShell</code></dd>\n",
" </dl>\n",
" </div>\n",
" \n",
" </div>\n",
" "
],
Collaborator

Please clear the notebook output for the PR

Collaborator Author

Sure will do.

Collaborator Author

Cleared all the output.

@gerashegalov
Collaborator

Please add a PR description

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
@viadea viadea requested a review from gerashegalov November 20, 2025 23:17
Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


Hao Zhu added 2 commits November 20, 2025 15:28
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, no comments


@gerashegalov
Collaborator

gerashegalov commented Nov 21, 2025

Per offline conversation let us try to add knobs for hosted Spark and hosted Data so we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with few modifications.

We will gradually expand the README in the follow up PRs to explain how to run this notebook in different Cloud providers

@sameerz
Collaborator

sameerz commented Dec 2, 2025

Please add a performance benchmark running on the CPU vs. GPU.

@sameerz
Collaborator

sameerz commented Dec 8, 2025

Per offline conversation let us try to add knobs for hosted Spark and hosted Data so we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with few modifications.

We will gradually expand the README in the follow up PRs to explain how to run this notebook in different Cloud providers

Request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user.

Understand that will create maintenance overhead.

@gerashegalov
Collaborator

Request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user.

Understand that will create maintenance overhead.

The PR already assumes CSP-specific launch instructions if you look at the proposed README changes. I bet there are already enough CSP specifics in the default environment that only minor adjustments are needed to add CSP-specific logic to the notebook. If not, it can be part of the command documented for the user anyway.

@nvauto
Collaborator

nvauto commented Jan 26, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.
