This repository contains the Connected Component Finder (CCF) project in both
Python and Scala, using Spark RDD and DataFrame APIs.
The Python entry points are now cluster-safe:

- they only force `local[*]` when no Spark master is already configured
- they accept a `--checkpoint-dir` flag
- `experiments.py` accepts Dataproc-friendly CLI flags
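The master-fallback rule above can be sketched in plain Python. The `resolve_master` helper below is hypothetical (the real code would consult `SparkConf`); it only illustrates the precedence: an explicit value wins, an already-configured master is respected, and `local[*]` is a last resort.

```python
def resolve_master(cli_master=None, env=None):
    """Pick a Spark master without clobbering cluster settings.

    Precedence (an assumption mirroring the rule above): an explicit
    CLI value wins, then any master already configured in the
    environment, and only as a last resort do we force local[*].
    """
    env = {} if env is None else env
    configured = env.get("MASTER") or env.get("SPARK_MASTER")
    return cli_master or configured or "local[*]"

# Bare local run: nothing configured, so fall back to local[*].
print(resolve_master(env={}))                  # local[*]
# Cluster run: a master is already set, so it is respected.
print(resolve_master(env={"MASTER": "yarn"}))  # yarn
```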
Set the shared values first:

```bash
GCLOUD=./google-cloud-sdk/bin/gcloud
PROJECT_ID=project-7fd33cb0-5f93-49dc-bf4
REGION=europe-west1
CLUSTER=ccf-test
BUCKET=gs://YOUR_BUCKET/ccf
```

Upload the scripts and datasets to the bucket:

```bash
gsutil -m cp python/*.py "$BUCKET/python/"
gsutil -m cp data/graph_*.txt "$BUCKET/data/"
```

Submit the RDD implementation:

```bash
$GCLOUD dataproc jobs submit pyspark python/ccf_rdd.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  -- \
  "$BUCKET/data/graph_1m.txt" \
  --checkpoint-dir "$BUCKET/checkpoints/rdd"
```

Submit the DataFrame implementation:

```bash
$GCLOUD dataproc jobs submit pyspark python/ccf_dataframe.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  -- \
  "$BUCKET/data/graph_1m.txt" \
  --checkpoint-dir "$BUCKET/checkpoints/df"
```

`experiments.py` imports `ccf_rdd.py` and `ccf_dataframe.py`, so provide them
with `--py-files` when submitting to Dataproc.
```bash
$GCLOUD dataproc jobs submit pyspark python/experiments.py \
  --project="$PROJECT_ID" \
  --region="$REGION" \
  --cluster="$CLUSTER" \
  --py-files=python/ccf_rdd.py,python/ccf_dataframe.py \
  -- \
  --environment dataproc \
  --graph-set small \
  --data-dir "$BUCKET/data" \
  --checkpoint-dir "$BUCKET/checkpoints/experiments"
```

Use `--graph-set large` once the `graph_10m`, `graph_50m`, and `graph_200m`
datasets exist in GCS.
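For reference, the Dataproc-friendly flags exercised by the submit command above could be declared with `argparse` roughly as follows. This is a sketch, not the actual `experiments.py` source: the flag names come from the command above, but the choices and defaults are assumptions.

```python
import argparse

def parse_args(argv=None):
    # Sketch of the CLI surface used by the Dataproc submit command.
    # Flag names match that command; choices/defaults are guesses.
    p = argparse.ArgumentParser(description="CCF experiment runner (sketch)")
    p.add_argument("--environment", choices=["local", "dataproc"], default="local")
    p.add_argument("--graph-set", choices=["small", "large"], default="small")
    p.add_argument("--data-dir", required=True,
                   help="directory (local path or gs:// URI) holding graph_*.txt")
    p.add_argument("--checkpoint-dir", required=True,
                   help="directory for Spark checkpoints")
    return p.parse_args(argv)

args = parse_args([
    "--environment", "dataproc",
    "--graph-set", "small",
    "--data-dir", "gs://YOUR_BUCKET/ccf/data",
    "--checkpoint-dir", "gs://YOUR_BUCKET/ccf/checkpoints/experiments",
])
print(args.environment, args.graph_set)  # dataproc small
```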