Conversation
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
Greptile Overview

Greptile Summary

Adds a new TPC-DS benchmark notebook for GCP Dataproc, along with comprehensive setup documentation in the README.

The PR has syntax issues that need correction before merging (a string interpolation bug and placeholder paths).

Confidence Score: 4/5

Important Files Changed / File Analysis
Sequence Diagram

sequenceDiagram
participant User
participant Jupyter as Jupyter Notebook
participant Spark as Spark Session
participant GCS as Google Cloud Storage
participant GPU as GPU Accelerator
User->>Jupyter: Install packages (tpcds_pyspark, sparkmeasure)
User->>Jupyter: Import modules
User->>Jupyter: Detect Scala version from JAR
Jupyter->>Spark: Create SparkSession with RAPIDS config
Spark-->>Jupyter: Session created
User->>Spark: Verify GPU acceleration with test query
Spark->>GPU: Execute test query
GPU-->>Spark: Results
Spark-->>User: Show query plan
User->>Jupyter: Initialize TPCDS with GCS data path
User->>Spark: Register TPC-DS tables
Spark->>GCS: Load table metadata
GCS-->>Spark: Table schemas
User->>Spark: Run TPC-DS queries (GPU mode)
Spark->>GPU: Execute queries
GPU-->>Spark: GPU results
Spark-->>Jupyter: Store GPU benchmark results
User->>Spark: Run TPC-DS queries (CPU mode)
Spark->>Spark: Execute queries on CPU
Spark-->>Jupyter: Store CPU benchmark results
User->>Jupyter: Merge and analyze results
Jupyter->>Jupyter: Calculate speedup factors
Jupyter-->>User: Display comparison charts
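The last steps of the diagram (merge results, calculate speedup factors) can be sketched in plain Python. All query names and timings below are illustrative placeholders, not measured results:

```python
# Hypothetical sketch of the final comparison step: merge per-query elapsed
# times from the CPU and GPU runs and compute speedup as CPU time / GPU time.
cpu_times = {"q4": 120.0, "q72": 310.0}  # seconds per query (CPU run, made up)
gpu_times = {"q4": 20.0, "q72": 62.0}    # seconds per query (GPU run, made up)

speedup = {
    q: cpu_times[q] / gpu_times[q]
    for q in cpu_times
    if q in gpu_times and gpu_times[q] > 0
}
# speedup == {"q4": 6.0, "q72": 5.0}
```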
2 files reviewed, 3 comments
    "]\n",
    "\n",
    "demo_start = time.time()\n",
    "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
syntax: gs://gcs_bucket is a placeholder - should be updated to match the $GCS_BUCKET variable pattern used in the README
Suggested change:
-    "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
+    "tpcds = TPCDS(data_path='gs://$GCS_BUCKET/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
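A minimal sketch of what this suggestion implies on the Python side, assuming the bucket name is supplied via a `GCS_BUCKET` environment variable as in the README (the fallback bucket name below is hypothetical):

```python
import os

# Read the bucket from the environment rather than hard-coding a placeholder;
# "my-tpcds-bucket" is a hypothetical fallback for illustration only.
gcs_bucket = os.environ.get("GCS_BUCKET", "my-tpcds-bucket")
data_path = f"gs://{gcs_bucket}/parquet_sf3k_decimal/"

# The TPCDS constructor from tpcds_pyspark would then receive this path:
# tpcds = TPCDS(data_path=data_path, num_runs=1, queries_repeat_times=1, queries=queries)
```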
    "source": [
    "%pip install --quiet \\\n",
    " tpcds_pyspark \\\n",
    " pandas \\\n",
syntax: String formatting error - sparkmeasure_version variable won't be interpolated inside single quotes. Use f-string
Suggested change:
-    "source": [
-    "%pip install --quiet \\\n",
-    " tpcds_pyspark \\\n",
-    " pandas \\\n",
+    %pip install --quiet \
+      tpcds_pyspark \
+      pandas \
+      sparkmeasure==f"{sparkmeasure_version}.0" \
+      matplotlib
@gerashegalov I think my GCP Dataproc notebook matches your original notebook, so I think I should ignore this suggestion. What do you think?
Yes, the IPython magic already does variable substitution.
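For context on that reply: IPython expands `{expression}` inside line and cell magics before the command runs, so `%pip install sparkmeasure=={sparkmeasure_version}.0` already interpolates the variable without an f-string. A plain-Python sketch of the equivalent expansion (the version value is hypothetical):

```python
# Simulate what IPython's {var} substitution does to the magic's argument line.
sparkmeasure_version = "0.24"  # hypothetical value detected earlier in the notebook
magic_args = "install --quiet sparkmeasure=={sparkmeasure_version}.0"
expanded = magic_args.format(sparkmeasure_version=sparkmeasure_version)
# expanded is now "install --quiet sparkmeasure==0.24.0"
```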
    "output_type": "execute_result"
    }
    ],
    "source": [
syntax: Regex needs escaping for the dot - \d+. should be \d+\. to match a literal period
Suggested change:
-    "source": [
+    scala_version = re.search(r'^spark-sql_(\d+\.\d+)-.*\.jar$', spark_sql_jar).group(1)
@gerashegalov Same here. This Dataproc version matches your original version for this regexp, so I want to ignore this suggestion.
This is a legitimate improvement, although it does not change anything in practice. We can fix it in the original notebook.
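To illustrate the point being discussed: an unescaped `.` matches any character, so the stricter pattern rejects a malformed JAR name that the loose one accepts. The filenames below are made up for the demonstration:

```python
import re

# Escaped pattern: the dot between major and minor Scala version is literal.
strict = r'^spark-sql_(\d+\.\d+)-.*\.jar$'
# Loose pattern: in '\d+.\d+' the '.' matches any character, not just '.'.
loose = r'^spark-sql_(\d+.\d+)-.*\.jar$'

good = "spark-sql_2.12-3.5.3.jar"   # hypothetical well-formed JAR name
bad = "spark-sql_2x12-3.5.3.jar"    # malformed: 'x' instead of '.'

assert re.search(strict, good).group(1) == "2.12"
assert re.search(strict, bad) is None        # strict pattern rejects it
assert re.search(loose, bad) is not None     # loose pattern accepts it
```

In this notebook both patterns happen to extract the same version from a real JAR name, which is why the fix "does not change anything in practice".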
    "text/html": [
    "\n",
    " <div>\n",
    " <p><b>SparkSession - hive</b></p>\n",
    " \n",
    " <div>\n",
    " <p><b>SparkContext</b></p>\n",
    "\n",
    " <p><a href=\"http://testbyhao2-ubuntu22-m.c.rapids-spark.internal:46705\">Spark UI</a></p>\n",
    "\n",
    " <dl>\n",
    " <dt>Version</dt>\n",
    " <dd><code>v3.5.3</code></dd>\n",
    " <dt>Master</dt>\n",
    " <dd><code>yarn</code></dd>\n",
    " <dt>AppName</dt>\n",
    " <dd><code>PySparkShell</code></dd>\n",
    " </dl>\n",
    " </div>\n",
    " \n",
    " </div>\n",
    " "
    ],
Please clear the notebook output for the PR
Cleared all the output.
Please add a PR description.
Per offline conversation, let us try to add knobs for hosted Spark and hosted data so we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with a few modifications. We will gradually expand the README in follow-up PRs to explain how to run this notebook on different cloud providers.
Please add a performance benchmark comparing runs on the CPU vs. the GPU.
The request here is to provide a notebook specific to each environment, so users do not need to make any changes; make it as simple as possible for the user. I understand that this will create maintenance overhead.
The PR already assumes CSP-specific instructions for launching the notebook, if you look at the proposed README changes. I bet the default environment already exposes enough specifics that only minor CSP-specific logic would be needed in a single notebook. If not, it can be part of the command documented for the user anyway.
NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release. |
Add an example TPC-DS notebook for GCP Dataproc.